##### Martin Theobald

## URDF – Efficient Reasoning in Uncertain RDF Knowledge Bases

#### Reasoning in uncertain knowledge bases with soft and hard rules

Despite the vast improvements in the field of information extraction in re- cent years, the very nature of the applied extraction techniques entails that the resulting knowledge bases may exhibit a high degree of uncertainty or even in- consistency. Often, these inconsistencies are not obvious even for a human and can only be uncovered through intricate and potentially expensive inference steps. In large RDF knowledge bases with millions of facts, an automated search for inconsistencies can be accomplished only through complicated, logic-based inference techniques, and an overly eager removal of inconsistencies can even lead to the loss of correct data. Automatic reasoning in large knowledge bases there- fore requires a high robustness with respect to incomplete, imprecise, or even inconsistent data, which we will here summarize under the term uncertain data. While traditional query processing techniques for RDF (based on the SPARQL query language) are limited to deterministic data, the research in our latest project, URDF, particularly focuses on efficient, rule-based, and statistical inference techniques for uncertain RDF knowledge bases.

In URDF, we can express, for example, that many people live in the same place as their spouses. Being a rather “soft” form of an inference rule, this rule will also be violated by a number of instances (in this case, people) in the real world, which means that any knowledge derived from such a soft rule will also be uncertain. On the other hand, we can surely exclude that a person could have been born in different geographic locations. This represents a strict constraint, i.e., a “hard” rule, which may not be violated by any instance in our knowledge base. In a logic-based representation, these rules are often formulated as implications, so-called Horn clauses, where a conjunctive condition of facts implies a new fact:

marriedTo(person 1 , person 2 ) and livesIn(person 2 , location 1 ) -> livesIn(person 1 , location 1 )

The above rule is formulated in first-order predicate logic. More generally, our URDF framework allows for identifying common patterns between the instances in the knowledge base, and we can generalize (i.e., learn) first-order rules from these patterns in an inductive way. Conversely, we can then deductively reason about individual instances in our knowledge base by applying these fi rst- order rules. In this case, we can, for ex- ample, apply the above fi rst-order rule to the persons “Angela Merkel”, “Joachim Sauer”, and the place “Berlin-Mitte”. This form of uncertain reasoning allows us to infer that “Joachim Sauer” might live in “Berlin-Mitte”, given that we know that his wife, “Angela Merkel”, also lives in “Berlin-Mitte”:

marriedTo(Joachim_Sauer, Angela_Merkel) and livesIn(Angela_Merkel, Berlin-Mitte) -> livesIn(Joachim_Sauer, Berlin-Mitte)

Moreover, a special type of Horn clause can be formulated by grouping purely negated facts into a disjunction, as in the following example:

¬ bornIn(Angela_Merkel, Hamburg) or ¬ bornIn(Angela_Merkel, Bremen)

With this Horn clause, we can express that “Angela Merkel” cannot have been born both in “Hamburg” and also in “Bremen”. By using negations as in the above example, we now have a formal means to identify inconsistencies. If, in the extraction step, both facts bornIn(Angela_Merkel, Hamburg) and bornIn(Angela_Merkel, Bremen) have been inserted erroneously into the knowledge base, we can formulate that only one of the two facts may be true by using negations of the above form.

#### Efficient evaluation strategies

URDF supports both logic-based (propositional) and probabilistic evaluation strategies to answer user queries. In propositional reasoning, the user is guaranteed to obtain a consistent over- view of a potentially inconsistent knowl- edge base in response to a query. This problem is reduced to the maximum satisfiability problem (Max-Sat), a classic problem in propositional logics. For URDF, we have developed a highly efficient method for solving a generalization of the Max-Sat problem, which is specifically tailored to the combination of soft and hard rules described above. In probabilistic reasoning, on the other hand, URDF does not only assign binary true/ false values to facts, but also confidence weights, which correspond to a probabilistic interpretation of the derivation of the facts via the rules.

This enormous expressivity of rule- based and probabilistic inference techniques certainly poses major challenges to the development of new and efficient evaluation strategies for user queries. Both logic-based and probabilistic evaluation strategies underlie a much higher combinatorial complexity than traditional query evaluation strategies in relational databases. Therefore, exact evaluation algorithms cannot be applied to large volumes of data, which we obtain through information extraction from the World Wide Web. Our research concentrates on efficient approximation algorithms with good approximation guarantees, which are specifically tailored to these evaluation strategies.

## Martin Theobald

DEPT. 5 Databases and Information Systems

Phone +49 681 9325-5015

Email martin.theobald@mpi-inf.mpg.de