The semantic technology sweet spot is information integration; that means (primarily, but not only) evaluating SPARQL queries over a set of distributed, heterogeneous data sources. The results of such queries should be sound; they can optionally be exact or approximate. SDQ is our product that provides this functionality (and more).
The Big Picture
The architecture behind our approach is similar to the centralized approach to information integration, where there is one global, coherent view or model of the domain of interest, defined over a set of relevant sources. Applications (including query answering or, for that matter, arbitrary analytics and algorithms) can be developed on top of the global model without any knowledge of the underlying data sources, their location, or their data model. Relevant sources can be structured, unstructured, or semi-structured; in the case of unstructured data sources, we assume that some kind of ETL process has been carried out.
The global model is related to the sources by a set of mappings. There are a handful of standard approaches to mappings: Global-as-View (GaV), where each concept/property in the global model is associated with a query over the sources; Local-as-View (LaV), where each concept/property/table in the sources is associated with a query over the global model; and their generalization Global-Local-as-View (GLaV), where a query over the sources is associated with a query over the global model.
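To make the three mapping styles concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: the `Mapping` class, the `kind` function, and the representation of queries as tuples of atom strings are hypothetical and are not part of SDQ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Relates a query over the sources to a query over the global model."""
    source_query: tuple  # atoms over the source vocabulary
    global_query: tuple  # atoms over the global vocabulary


def kind(m: Mapping) -> str:
    """Classify a mapping by which side consists of a single atom."""
    single_global = len(m.global_query) == 1
    single_source = len(m.source_query) == 1
    if single_global and single_source:
        return "GaV or LaV"  # one atom on each side fits both styles
    if single_global:
        return "GaV"         # one global term defined by a query over the sources
    if single_source:
        return "LaV"         # one source term defined by a query over the global model
    return "GLaV"            # query on both sides


# Hypothetical atoms, used only to exercise the classifier.
simple = Mapping(("S1:isManagerOf(?x, ?y)",), (":manages(?x, ?y)",))
gav = Mapping(("S1:a(?x)", "S1:b(?x)"), (":c(?x)",))
lav = Mapping(("S1:a(?x)",), (":c(?x)", ":d(?x)"))
glav = Mapping(("S1:a(?x)", "S1:b(?x)"), (":c(?x)", ":d(?x)"))
```

Treating GLaV as the general case means GaV and LaV mappings need no special handling: they are simply GLaV mappings with a single atom on one side.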
SDQ Overview
SDQ is our system for providing information integration; it performs SPARQL evaluation over distributed, heterogeneous data sources:
- Supports SPARQL 1.0 (SPARQL 1.1 coming along later this year)
- Supports reasoning over mappings and global model in query evaluation
- Describes the global model as an OWL 2 ontology
- Handles queries over different types of data sources, e.g., RDBMS, RDF, and XML
- Supports expressive GLaV mappings
SDQ’s reasoning support is sound and complete for the OWL 2 QL profile and significant portions of EL and RL. If the global model goes beyond the OWL 2 profiles, SDQ performs approximate reasoning via Pellet.
Where’s the Magic?
Queries against the system are posed in terms of the global model; that is, they use the classes and properties defined in the global ontology. In order to be evaluated, a query must be rewritten in terms of the vocabulary used by the sources. The rewriting uses the mappings; and, since we want to take full advantage of the semantics captured by the global model, evaluating queries also involves reasoning.
How about a very simple example?
Suppose our global ontology captures concepts and properties of a corporate
domain: employees, departments, projects, etc. And suppose that we want to
evaluate the query “Give me the list of employees with their SSNs and the
department they work in” over the global ontology. In SPARQL, this query, which
we’re calling Q below, would look something like this:
#Q:
SELECT ?employee ?SSN ?department
WHERE {
  ?employee :worksIn ?department ;
            :ssn ?SSN .
}
In order to evaluate Q, the first step is to rewrite Q with
respect to the global ontology. This can be accomplished by reasoning with
the set of relevant statements of the ontology.
Suppose that the global ontology contains the axiom :manages
rdfs:subPropertyOf :worksIn. Then SDQ would rewrite Q into
these two queries:
#Q1:
SELECT ?employee ?SSN ?department
WHERE {
  ?employee :worksIn ?department ;
            :ssn ?SSN .
}
#Q2:
SELECT ?employee ?SSN ?department
WHERE {
  ?employee :manages ?department ;
            :ssn ?SSN .
}
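The ontology-driven rewriting step can be sketched in a few lines of Python. This is a minimal illustration, not SDQ's actual implementation: the tuple representation of triple patterns and the names `subproperties` and `rewrite_with_ontology` are our own assumptions, and only sub-property axioms are handled.

```python
from itertools import product

# A query is a list of (subject, predicate, object) triple patterns.
Q = [("?employee", ":worksIn", "?department"),
     ("?employee", ":ssn", "?SSN")]

# Sub-property axioms of the global ontology as (sub, super) pairs.
SUB_PROPERTY_OF = {(":manages", ":worksIn")}


def subproperties(prop, axioms):
    """Return prop followed by every property that is (transitively) below it."""
    found = [prop]
    for p in found:  # the list grows while we iterate over it
        for sub, sup in sorted(axioms):
            if sup == p and sub not in found:
                found.append(sub)
    return found


def rewrite_with_ontology(query, axioms):
    """Expand a query into a union of queries, one per choice of predicates."""
    choices = [[(s, p2, o) for p2 in subproperties(p, axioms)]
               for s, p, o in query]
    return [list(combo) for combo in product(*choices)]


rewritten = rewrite_with_ontology(Q, SUB_PROPERTY_OF)
for number, query in enumerate(rewritten, start=1):
    print(f"Q{number}: {query}")
```

With the single axiom above, the sketch produces two queries: the original pattern over :worksIn and a variant over :manages.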
Once Q has been rewritten with respect to the global ontology,
the second step is to rewrite the resulting queries (Q1
and Q2) with respect to the mappings, which is also done by
reasoning about the mappings.
Suppose that we have two RDF sources—S1 and
S2—the former containing information about departments and
their managers and the latter containing information about employees (i.e.,
name, address, telephone number, SSN, etc.). Let’s also assume that the
mappings between the global ontology and the sources include two SWRL
rules:
S1:isManagerOf(?x, ?y) -> :manages(?x, ?y)
and
S2:hasSSN(?x, ?y) -> :ssn(?x, ?y)
In other words, if someone isManagerOf a department in
S1, then the :manages relation also holds between
them in the global ontology; likewise, if a person hasSSN some
value in S2, then the :ssn relation holds between them in the
global ontology. Of course, these SWRL rules are very simple; but any SWRL
rules, of whatever complexity, can act as mappings.
Given our example, Q2 can be rewritten as Q3
(Q1 is not rewritten because no mapping covers :worksIn):
#Q3:
SELECT ?employee ?SSN ?department
WHERE {
  ?employee S1:isManagerOf ?department ;
            S2:hasSSN ?SSN .
}
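Rewriting with respect to the mappings can be sketched in the same style. The sketch below is a deliberate simplification and rests on our own assumptions: each mapping is reduced to a (source predicate, global predicate) pair with the same argument order, there is at most one mapping per global predicate, and the name `rewrite_with_mappings` is hypothetical. Real SWRL mappings can be far more expressive.

```python
# The two SWRL rules from the text, reduced to predicate pairs.
MAPPINGS = [("S1:isManagerOf", ":manages"),
            ("S2:hasSSN", ":ssn")]

Q1 = [("?employee", ":worksIn", "?department"),
      ("?employee", ":ssn", "?SSN")]
Q2 = [("?employee", ":manages", "?department"),
      ("?employee", ":ssn", "?SSN")]


def rewrite_with_mappings(query, mappings):
    """Replace global predicates by source predicates.

    Returns None when some triple pattern has no applicable mapping,
    because such a query cannot be answered from the sources.
    """
    source_for = {global_pred: source_pred
                  for source_pred, global_pred in mappings}
    rewritten = []
    for s, p, o in query:
        if p not in source_for:
            return None
        rewritten.append((s, source_for[p], o))
    return rewritten


Q3 = rewrite_with_mappings(Q2, MAPPINGS)
print(Q3)
print(rewrite_with_mappings(Q1, MAPPINGS))  # no mapping covers :worksIn
```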
Q3 is now expressed in terms of the sources. Note, however,
that it uses terms from more than one source; therefore, evaluating it
independently over S1 or S2 would yield an empty
answer. We solve this problem by producing a set of independent queries. For
example, instead of producing the query Q3, we can produce the
queries
#Q4:
SELECT ?employee ?department
WHERE {
  ?employee S1:isManagerOf ?department .
}
and
#Q5:
SELECT ?employee ?SSN
WHERE {
  ?employee S2:hasSSN ?SSN .
}
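One simple way to obtain such independent queries is to group the triple patterns by the source they refer to. The following Python sketch assumes, purely for illustration, that the source can be read off the predicate prefix (the text before the colon); the function name `split_by_source` is hypothetical.

```python
Q3 = [("?employee", "S1:isManagerOf", "?department"),
      ("?employee", "S2:hasSSN", "?SSN")]


def split_by_source(query):
    """Group triple patterns by the source prefix of their predicate."""
    per_source = {}
    for s, p, o in query:
        source = p.split(":", 1)[0]  # "S1:isManagerOf" -> "S1"
        per_source.setdefault(source, []).append((s, p, o))
    return per_source


independent = split_by_source(Q3)
for source, patterns in independent.items():
    print(source, patterns)
```

Each group mentions a single source, so it can be sent to that source on its own; the shared variable ?employee is what later ties the partial results together.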
Once we have obtained the set of independent queries, the final step is to evaluate them over the sources.
So now we have to efficiently evaluate two queries: Q4 and
Q5. Note that an intermediate join has to be performed between
the results to obtain correct answers; that means that evaluating the
resulting queries is not just a matter of evaluating each query in isolation
and returning the union of the results.
Finally, assume that S1 contains
Mary S1:isManagerOf Finance.
and that S2 contains
Mary S2:hasSSN "XXX-XX-XXXX".
It should be easy to see that the correct evaluation of Q4 and
Q5 (i.e., performing the intermediate join) will yield the
expected answer to the original query Q:
===============================
employee|SSN        |department
-------------------------------
Mary    |XXX-XX-XXXX|Finance
===============================
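The intermediate join itself can be sketched as a naive nested-loop join over variable bindings. This is only an illustration of the idea, not of how SDQ evaluates joins efficiently; the function name `join` and the dictionary representation of bindings are our own assumptions.

```python
def join(left, right):
    """Natural join of two lists of bindings (dicts from variable to value)."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            # Keep the pair only if it agrees on every shared variable.
            if all(l[v] == r[v] for v in shared):
                joined.append({**l, **r})
    return joined


# Results of Q4 over S1 and Q5 over S2, using the example data from the text.
q4_results = [{"employee": "Mary", "department": "Finance"}]
q5_results = [{"employee": "Mary", "SSN": "XXX-XX-XXXX"}]

answers = join(q4_results, q5_results)
print(answers)
```

Because the two result sets share only the ?employee variable, the join keeps exactly those pairs that agree on the employee, which is why a plain union of the two result sets would not answer Q.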
In sum, to correctly evaluate a query posed in terms of the global ontology in our distributed scenario we have to
- rewrite the original query with respect to the global ontology,
- rewrite the resulting queries with respect to the mappings, and
- evaluate the resulting queries over the sources (by performing intermediate joins if needed).
Note: in this post we have used RDF sources and SPARQL as examples, but that isn’t the usual situation in enterprise use cases, where the data sources are typically bog-standard RDBMSes. SDQ works in these scenarios, too: it rewrites SPARQL queries composed against a global ontology into SQL queries that the RDBMSes can evaluate.
Conclusion
Efficient, distributed SPARQL query evaluation over heterogeneous data sources that are abstracted by a global model (i.e., ontology) is something like the holy grail of Enterprise Semantics. If you’re interested in applying this approach, please get in touch.

