RDF-3X – Fast Searches for Semantic Data

Thomas Neumann

RDF-3X – Fast Searches for Semantic Data

Semantic data

RDF-3X is a database system to manage and search for semantic data, that is, data which contains information about relationships between objects. Such relationships are ubiquitous. For example, a book has one or more authors, a protein takes part in specific reactions and a user of web 2.0 web pages has connections to his friends. These relationships between things, or, more abstractly: Entities, together with the entities themselves, form a network or graph structure. Using formal semantics associated with it, the data graphs form semantic networks, i.e., networks which allow for automatic reasoning.

A data format, which was specially designed for graph-structured data, is the Resource Description Framework (RDF). An RDF data collection consists of triplets, each of which corresponds to an edge and an associated node pair in the data graph. A triplet consists in RDF writing of a subject, a predicate and an object. In the graph, this corresponds with the “predicate” labeled edge from subject to object. There are edges in the example <id1, bornOn, 1859>, which express that the subject “id1” is associated with the name Arthur Conan Doyle, born in 1859. Further edges, such as <id1, authorOf, id2>, <id1, bornIn, Edinburgh> and <Edinburgh, is in, Scotland> demonstrate that the object “id2” is written by id1, that id1 is born in Edinburgh, and that Edinburgh is in Scotland.

Search in RDF graphs

Also, complex relationships can be formulated relatively simply with this kind of notation. The simplicity and flexibility, which is achieved through the graph structure of the data, leads to the fact that RDF searches are relatively expensive. Search queries are normally formulated in a query language such as SPARQL, for example, and describe a pattern that must be searched for in the data. If one searches, for example, for the title of books by Scottish authors, one can describe this with the following triple pattern:

?author <bornIn> ?city
?city <locatedIn> Scotland
?author <authorOf> ?book
?book <titel> ?titel.

The parts beginning with question marks are individual variables whose values must be determined by database systems. The portion in brackets is the value given as search conditions by the user. In order to answer this query, the database system must find all triples that fulfill the patterns of the query. Because the data graphs are very large, there can be many millions of candidate triples, making efficient searches very difficult.

The database system we have developed, RDF-3X (RDF Triples Express), approaches these problems on several levels. First, the data itself is stored appropriately, so that individual triple patterns can be efficiently evaluated. We have compressed the data and indexed it using search trees, so that any triple pattern can be evaluated very quickly. But this is not enough, however; the users are mostly interested in larger inter- relationships and thus pose queries with linked triple patterns. Therefore, we map such interrelated queries into execution strategies using algebraic operators and choose the ones that can be expected to have the lowest execution time from the different execution alternatives. The different alternatives are assessed, and then the most efficient alternative is chosen with the assistance of statistical models. For the example query, one must estimate whether there are more authors or more cities in the data set, and how this affects the execution time. A careful choice of the execution strategy can often accelerate the query processing by a factor of 10 or more.

Example RDF graph

Efficiency and scalability Searching in semantic data is already difficult for even relatively small graphs because the graphs frequently show no known structure (i.e., they follow no schema). Despite this, it was important for us to use RDF-3X not only for small data graphs, but also to make sure that it scales efficiently to very large data sets with billions of edges. One big challenge is updating such large data sets, i.e., inserting new data and, if necessary, removing old data, all without having to stop the whole database. Over time, we have therefore incorporated many techniques, such as Sideways Information Passing and Triple Versioning in RDF-3X in order to scale to such data sizes. In experimental comparison with other systems, RDF-3X usually performs extremely well. Even for complex queries with many triple patterns, RDF-3X is frequently much faster than other systems.

Thomas Neumann

DEPT. 5 Databases and Information Systems
Phone
+49 681 9325-5007
Email
neumann@mpi-inf.mpg.de