Knowledge Base Recall

Knowledge bases about entities like people, places and products have become key assets of search and recommender systems. The largest of them contain many millions of entities and billions of facts about them. Nevertheless, they have major gaps and limitations in what they cover, which poses the challenge of detecting and resolving these "unknown unknowns". In this research, we pursue several approaches to assessing knowledge base recall, including 1) counting quantifier extraction from text, 2) relative recall measures, 3) rule mining and 4) logical foundations for relative recall and informativeness.


Survey papers

  • Simon Razniewski and Gerhard Weikum, Knowledge Base Recall: Detecting and Resolving the Unknown Unknowns, SIGWEB, 2018 [pdf]
  • Simon Razniewski, Fabian Suchanek and Werner Nutt, But What Do We Actually Know?, AKBC, 2016 [pdf]

Survey slides

  • Joint lecture series at Saarland University, Simon Razniewski, 2017 [slideshare]

1. Counting quantifier extraction

Information extraction traditionally focuses on extracting relations between identifiable entities, such as <Monterey, isLocatedIn, California>. Yet texts often also contain counting information, stating that a subject is in a specific relation with a number of objects without mentioning the objects themselves, for example, “The U.S. state of California is divided into 58 counties.” Such counting quantifiers can help in a variety of tasks such as query answering or knowledge base curation, but have been neglected by prior work. We develop the first full-fledged system for extracting counting information from text, called CINEX. Given a <subject, relation> pair and a text about the subject, CINEX predicts a counting quantifier such as <California, hasCounties, 58>.

We employ distant supervision using fact counts from a knowledge base as training seeds, and leverage CRF-based sequence tagging models to identify counting information in the text. Experiments with five human-evaluated relations show that CINEX can achieve 60% average precision for extracting counting information. In a large-scale experiment, we demonstrate the potential for knowledge base enrichment by applying CINEX to 2,474 frequent relations in Wikidata. CINEX can assert the existence of 2.5M facts for 110 distinct relations, which is 28% more than the existing Wikidata facts for these relations.
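The distant supervision step can be illustrated with a minimal sketch. It assumes a simple labeling heuristic (a numeric token is a positive training example when its value equals the number of facts the knowledge base holds for the subject and relation); the function names, the tiny number-word lexicon and the example sentence are hypothetical and not taken from CINEX. The resulting token labels are the kind of training data a CRF sequence tagger could then be trained on.

```python
import re

# Hypothetical, deliberately tiny lexicon; a real system would need a fuller one.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def token_value(token):
    """Return the integer a token denotes, or None if it is not a number."""
    if re.fullmatch(r"\d+", token):
        return int(token)
    return NUMBER_WORDS.get(token.lower())

def label_tokens(tokens, kb_fact_count):
    """Tag tokens whose numeric value equals the KB fact count as COUNT, all others as O."""
    labels = []
    for tok in tokens:
        value = token_value(tok)
        labels.append("COUNT" if value is not None and value == kb_fact_count else "O")
    return labels

# Illustrative sentence; 58 is the assumed number of county facts in the KB.
tokens = "California is divided into 58 counties and has 2 senators".split()
labels = label_tokens(tokens, kb_fact_count=58)
print(list(zip(tokens, labels)))
```

Only the token "58" is tagged as COUNT here, because "2" does not match the seed count. Exact matching is the simplest possible seed criterion and is used only to keep the sketch short.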

The counting quantifiers predicted for 37 selected Wikidata relations can be queried at https://cinex.cs.ui.ac.id/. They were obtained by running the learned models on all entities of a class, given a Wikidata property-class pair (e.g., the property child for the class human). Example queries include "List all humans without spouses in Wikidata" and "Does Wikidata contain all children of George HW Bush?".
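A question such as whether the knowledge base contains all objects for a subject and relation reduces to comparing a predicted count with the facts already stored. The following is a minimal sketch of that comparison; the function name `recall_gap` and the placeholder object identifiers are assumptions made for illustration, not part of the CINEX interface.

```python
def recall_gap(predicted_count, kb_objects):
    """Compare a predicted counting quantifier with the objects stored in the KB."""
    known = len(set(kb_objects))  # ignore duplicate facts
    missing = max(predicted_count - known, 0)
    return {
        "predicted": predicted_count,
        "known": known,
        "missing": missing,  # facts whose existence the count asserts
        "complete": known >= predicted_count,
    }

# Placeholder data: a predicted count of 6 against 4 stored objects.
status = recall_gap(6, ["child_1", "child_2", "child_3", "child_4"])
print(status)
```

Summing the `missing` values over all subjects of a relation gives the number of facts whose existence the counts assert beyond what is stored.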

Publications

  • Paramita Mirza, Simon Razniewski, Fariz Darari and Gerhard Weikum, Enriching Knowledge Bases with Counting Quantifiers, ISWC, 2018 [pdf]  [code]  [experiments]  [results]

  • Paramita Mirza, Simon Razniewski, Fariz Darari, Gerhard Weikum, Cardinal Virtues: Extracting Relation Cardinalities from Text, ACL short paper, 2017 [pdf]
  • Paramita Mirza, Simon Razniewski and Werner Nutt, Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinalities, ISWC Poster, 2016 [pdf]

2. Relative completeness

Several automated techniques have been adopted by wikis to track and manage completeness, yet these techniques are generally subjective and do not provide a clear quality estimate at the level of entities. In this research we aim to measure Relative Completeness in Wikidata by comparison with the data present for similar entities. This relative completeness approach scales easily as new classes are introduced in the knowledge base, and has been implemented for all available entities in Wikidata. The results give an intuition of how complete an entity is by comparing it with similar entities, and are available in Wikidata as a plugin.
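The idea of comparing an entity with similar entities can be sketched as follows. This is a minimal illustration, assuming that "similar" means a given set of peer entities (for example, entities of the same class) and that a property counts as expected when at least half of the peers have it; the function names, the threshold and the sample data are hypothetical and do not reproduce Recoin's actual computation.

```python
from collections import Counter

def property_frequencies(peers):
    """Fraction of peer entities that carry each property."""
    if not peers:
        return {}
    counts = Counter(p for props in peers.values() for p in set(props))
    return {p: c / len(peers) for p, c in counts.items()}

def missing_properties(entity_props, peers, min_freq=0.5):
    """Properties that are frequent among peers but absent from the entity."""
    freqs = property_frequencies(peers)
    missing = [(p, f) for p, f in freqs.items()
               if f >= min_freq and p not in entity_props]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(missing, key=lambda pf: (-pf[1], pf[0]))

# Hypothetical peer group and entity.
peers = {
    "A": {"date of birth", "doctoral advisor", "employer"},
    "B": {"date of birth", "doctoral advisor"},
    "C": {"date of birth", "employer"},
    "D": {"date of birth", "doctoral advisor", "member of"},
}
gaps = missing_properties({"date of birth"}, peers)
print(gaps)
```

The ranked list of absent properties serves as a relative indicator: the more frequent the missing properties are among peers, the less complete the entity appears.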

Publications

  • Vevake Balaraman, Simon Razniewski and Werner Nutt, Recoin: Relative Completeness in Wikidata, Wiki Workshop, 2018 [pdf]

  • Simon Razniewski, Vevake Balaraman, Werner Nutt, Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties, ADMA, 2017 [pdf]

Tools

  • Relative completeness indicator for Wikidata (Recoin)

3. Completeness rule mining

In this work, we investigate different signals to identify the areas where the knowledge base is complete. We show that we can combine these signals in a rule mining approach, which allows us to predict where facts may be missing. We also show that completeness predictions can help other applications such as fact inference.

In a second line of work, we propose to use (in-)completeness meta-information to better assess the quality of rules learned from incomplete knowledge graphs (KGs). We introduce completeness-aware scoring functions for relational association rules. Experimental evaluation on both real and synthetic datasets shows that the proposed rule ranking approaches have remarkably higher accuracy than the state-of-the-art methods in uncovering missing facts.
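To illustrate why completeness meta-information matters for rule scoring, the sketch below contrasts standard confidence with one possible completeness-aware variant. It is an illustrative assumption rather than the scoring function defined in the publications: predictions that are absent from the KG are not counted against the rule when their subject is known to have missing facts for the head relation. All names and data are hypothetical.

```python
def standard_confidence(predictions, kg_facts):
    """Share of a rule's predictions that are already present in the KG."""
    if not predictions:
        return 0.0
    support = sum(1 for fact in predictions if fact in kg_facts)
    return support / len(predictions)

def completeness_aware_confidence(predictions, kg_facts, incomplete_subjects):
    """Like standard confidence, but unconfirmed predictions for subjects known
    to be incomplete are excused instead of being treated as errors."""
    support = sum(1 for fact in predictions if fact in kg_facts)
    excused = sum(1 for (subj, obj) in predictions
                  if (subj, obj) not in kg_facts and subj in incomplete_subjects)
    denominator = len(predictions) - excused
    return support / denominator if denominator else 0.0

# (subject, object) pairs predicted by a rule for one fixed head relation.
predictions = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
kg_facts = {("a", "x"), ("b", "y")}
incomplete_subjects = {"c"}  # e.g. a count says that facts about "c" are missing

standard = standard_confidence(predictions, kg_facts)
aware = completeness_aware_confidence(predictions, kg_facts, incomplete_subjects)
print(standard, aware)
```

In this toy example the standard score is 2/4, while the completeness-aware variant yields 2/3, because the prediction for "c" falls in an area where facts are known to be missing and may well be correct.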

Publications

  • Luis Galárraga, Simon Razniewski, Antoine Amarilli, Fabian M. Suchanek, Predicting Completeness in Knowledge Bases, WSDM, 2017 [pdf]
  • Thomas Pellissier Tanon, Daria Stepanova, Simon Razniewski, Paramita Mirza and Gerhard Weikum, Completeness-aware Rule Learning from Knowledge Graphs, ISWC, 2017 [pdf]

4. Logical foundations of recall information

The Semantic Web is commonly interpreted under the open-world assumption, meaning that the information available (e.g., in a data source) captures only a subset of reality. Therefore, there is no certainty about whether the available information provides a complete representation of reality. Our goal is to contribute a formal study of how to describe the completeness of parts of the Semantic Web stored in RDF data sources. We introduce a theoretical framework that allows RDF data sources to be augmented with statements, also expressed in RDF, about their completeness. One immediate benefit of this framework is that query answers can now be complemented with information about their completeness. We study the impact of completeness statements on the complexity of query answering by considering different fragments of the SPARQL language, including the RDFS entailment regime, and the federated scenario. We implement an efficient method for reasoning about query completeness and provide an experimental evaluation in the presence of large sets of completeness statements.
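The role of completeness statements in query answering can be sketched in a strongly simplified form. The sketch assumes that a completeness statement is a single triple pattern meaning "the source contains all triples matching this pattern", and that a basic graph pattern query is guaranteed complete when every one of its triple patterns is covered by some statement. The framework described above is richer (for instance, statements are expressed in RDF), so this is a conservative approximation; the tuple encoding, function names and example patterns are assumptions.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")

def subsumes(statement, pattern):
    """True if every triple matching `pattern` also matches `statement`."""
    binding = {}
    for s_term, q_term in zip(statement, pattern):
        if is_var(s_term):
            # A statement variable must map consistently to one query term.
            if binding.setdefault(s_term, q_term) != q_term:
                return False
        elif s_term != q_term:
            return False
    return True

def query_is_complete(query, statements):
    """Sufficient check: each triple pattern is covered by some statement."""
    return all(any(subsumes(st, tp) for st in statements) for tp in query)

# Hypothetical statement: the source has all counties of California.
statements = [("California", "hasCounty", "?c")]

q1 = [("California", "hasCounty", "?x")]
q2 = [("California", "hasCounty", "?x"), ("?x", "hasMayor", "?m")]
print(query_is_complete(q1, statements), query_is_complete(q2, statements))
```

The first query can be answered completely, while the second cannot be guaranteed complete because nothing is stated about the hasMayor triples. The check is only sufficient: a False result means that completeness could not be established, not that the answer is necessarily incomplete.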

Publications

  • Fariz Darari, Werner Nutt, Giuseppe Pirro, Simon Razniewski, Completeness Management for RDF Data Sources, ACM Transactions on the Web (TWEB), 2018 [pdf]
  • Fariz Darari, Radityo Eko Prasojo, Simon Razniewski and Werner Nutt, COOL-WD: A Completeness Tool for Wikidata, ISWC demo, 2017 [pdf]
  • Simon Razniewski, Flip Korn, Werner Nutt, Divesh Srivastava, Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, SIGMOD, 2015 [pdf]
  • Simon Razniewski and Werner Nutt, Completeness of Queries over Incomplete Databases, VLDB, 2011 [pdf]

Demo

  • COOL-WD: Completeness reasoning over Wikidata [link]

People involved at D5

External collaborators

  • Fariz Darari, Universitas Indonesia
  • Werner Nutt, Free University of Bozen-Bolzano
  • Fabian Suchanek, Télécom ParisTech University
  • Luis Galárraga, INRIA Rennes
  • Vevake Balaraman, University of Trento