Knowledge Base Recall

Knowledge bases about entities such as people, places and products have become key assets of search and recommender systems. The largest of them contain many millions of entities and billions of facts about them. Nevertheless, they have major gaps and limitations in what they cover, posing the challenge of detecting and resolving these "unknown unknowns". In this research we pursue several approaches to mapping knowledge base recall: (1) counting quantifier extraction from text, (2) relative completeness measures, (3) completeness rule mining, (4) logical foundations of recall information, (5) interesting negations in KBs, and (6) linguistic theories for text coverage estimation.

Survey papers

  • Simon Razniewski and Gerhard Weikum, Knowledge Base Recall: Detecting and Resolving the Unknown Unknowns, SIGWEB, 2018 [pdf]
  • Simon Razniewski, Fabian Suchanek and Werner Nutt, But What Do We Actually Know?, AKBC, 2016 [pdf]

Survey slides

  • Joint lecture series at Saarland University, Simon Razniewski, 2017 [slideshare]

1. Counting quantifier extraction

Information extraction traditionally focuses on extracting relations between identifiable entities, such as <Monterey, isLocatedIn, California>. Yet texts often also contain counting information, stating that a subject stands in a specific relation with a number of objects, without mentioning the objects themselves, for example, "The U.S. state of California is divided into 58 counties." Such counting quantifiers can help in a variety of tasks such as query answering and knowledge base curation, but have been neglected by prior work. We develop the first full-fledged system for extracting counting information from text, called CINEX, which predicts a counting quantifier given a <subject, relation> pair and a text about the subject, e.g., <California, hasCounties, 58>.

We employ distant supervision using fact counts from a knowledge base as training seeds, and leverage CRF-based sequence tagging models to identify counting information in the text. Experiments with five human-evaluated relations show that CINEX can achieve 60% average precision for extracting counting information. In a large-scale experiment, we demonstrate the potential for knowledge base enrichment by applying CINEX to 2,474 frequent relations in Wikidata. CINEX can assert the existence of 2.5M facts for 110 distinct relations, which is 28% more than the existing Wikidata facts for these relations.
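
As a rough illustration of the distant supervision step, the sketch below (Python; hypothetical helper names and toy data, heavily simplified) marks number mentions in a subject's text as positive training seeds when they agree with the fact count already stored in the KB. CINEX then trains CRF-based sequence taggers on such seeds; only the seed labeling is shown here.

    import re

    WORD_NUMS = {"one": 1, "two": 2, "three": 3, "twice": 2}

    def label_counts(text, kb_count):
        """Mark each number mention as a positive seed iff it equals the
        fact count the KB already stores for this (subject, relation)."""
        labels = []
        for tok in text.split():
            n = WORD_NUMS.get(tok.lower())
            if n is None and re.fullmatch(r"\d+", tok):
                n = int(tok)
            if n is not None:
                labels.append((tok, n == kb_count))
        return labels

    # Wikidata already knows 58 counties of California, so "58" becomes
    # a positive training seed for the hasCounties relation.
    print(label_counts("California is divided into 58 counties", 58))
    # -> [('58', True)]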

The predicted counting quantifiers for 37 selected Wikidata relations, obtained by running the learned models on all entities of a class for a given Wikidata property-class pair (e.g., the child property over all humans), can be queried at https://cinex.cs.ui.ac.id/ (e.g., "List all humans without spouses in Wikidata", or "Does Wikidata contain all children of George H. W. Bush?").
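
The first example query can also be approximated directly against the public Wikidata SPARQL endpoint; the demo's own interface and counting-quantifier predictions are separate from this. A sketch using the SPARQLWrapper Python library:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # WDQS expects a descriptive user agent; the agent string here is a
    # placeholder.
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="kb-recall-sketch")
    sparql.setQuery("""
    SELECT ?person WHERE {
      ?person wdt:P31 wd:Q5 .                          # instance of human
      FILTER NOT EXISTS { ?person wdt:P26 ?spouse . }  # no spouse statement
    }
    LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"])

Note that under the open-world assumption the absence of a spouse (P26) statement does not by itself mean a person is unmarried; it is precisely counting quantifiers (e.g., a spouse count of 0) that can turn such absence into knowledge.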

Publications

  • Paramita Mirza, Simon Razniewski, Fariz Darari and Gerhard Weikum, Enriching Knowledge Bases with Counting Quantifiers, ISWC, 2018 [pdf]  [code]  [experiments]  [results]
  • Paramita Mirza, Simon Razniewski, Fariz Darari and Gerhard Weikum, Cardinal Virtues: Extracting Relation Cardinalities from Text, ACL short paper, 2017 [pdf]
  • Paramita Mirza, Simon Razniewski and Werner Nutt, Expanding Wikidata’s Parenthood Information by 178%, or How To Mine Relation Cardinalities, ISWC poster, 2016 [pdf]

 

More details on extracting and aggregating count knowledge on the web.


2. Relative completeness

Wikis have adopted several automated techniques to track and manage completeness, yet these techniques are generally subjective and do not provide clear quality estimates at the level of individual entities. In this research we measure relative completeness in Wikidata by comparing an entity with the data present for similar entities. This approach scales easily as new classes are introduced into the knowledge base, and has been applied to all entities in Wikidata. The resulting scores give an intuition of how complete an entity is compared with its peers, and are surfaced in Wikidata through a plugin (Recoin); a minimal sketch of the idea follows.
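
A minimal sketch of the underlying idea (Python; toy data, not Recoin's exact scoring): compare the properties an entity has against the properties most frequently used by entities of the same class.

    from collections import Counter

    def relative_completeness(entity_props, peer_prop_sets, top_k=5):
        """Fraction of the class's top-k most frequent properties that the
        entity itself has."""
        freq = Counter(p for props in peer_prop_sets for p in props)
        top = [p for p, _ in freq.most_common(top_k)]
        return sum(1 for p in top if p in entity_props) / len(top)

    # Toy class of three peers with Wikidata-style property IDs.
    peers = [{"P569", "P19", "P26"}, {"P569", "P19"}, {"P569"}]
    print(relative_completeness({"P569", "P19"}, peers, top_k=3))  # ~0.67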

Publications

  • Vevake Balaraman, Simon Razniewski and Werner Nutt, Recoin: Relative Completeness in Wikidata, Wiki Workshop, 2018 [pdf]
  • Simon Razniewski, Vevake Balaraman and Werner Nutt, Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties, ADMA, 2017 [pdf]

Tools

  • Relative completeness indicator for Wikidata (Recoin)

3. Completeness rule mining

In this work, we investigate different signals for identifying the areas in which a knowledge base is complete. We show that these signals can be combined in a rule mining approach, which allows us to predict where facts may be missing. We also show that completeness predictions can help other applications such as fact inference.

In follow-up work, we use (in-)completeness meta-information to better assess the quality of rules learned from incomplete KGs. We introduce completeness-aware scoring functions for relational association rules; a minimal sketch of the idea follows. Experimental evaluation on both real and synthetic datasets shows that the proposed rule ranking approaches are considerably more accurate than state-of-the-art methods in uncovering missing facts.
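
A minimal sketch of the idea (Python; simplified, not the papers' exact scoring functions): a fact that a rule predicts but that is missing from the KB counts as a counterexample only where the KB is known to be complete.

    def completeness_conf(predictions, kb_facts, is_complete):
        """Rule confidence where a predicted-but-missing fact counts as a
        counterexample only if the KB is known to be complete there."""
        support = sum(1 for f in predictions if f in kb_facts)
        counter = sum(1 for (s, p, o) in predictions
                      if (s, p, o) not in kb_facts and is_complete(s, p))
        return support / (support + counter) if support + counter else 0.0

    # Toy example: one confirmed prediction, one missing prediction that
    # falls into a region of unknown completeness -> no penalty.
    kb = {("alice", "child", "bob")}
    preds = {("alice", "child", "bob"), ("carol", "child", "dan")}
    print(completeness_conf(preds, kb, lambda s, p: s == "alice"))  # 1.0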

Publications

  • Luis Galárraga, Simon Razniewski, Antoine Amarilli and Fabian M. Suchanek, Predicting Completeness in Knowledge Bases, WSDM, 2017 [pdf]  [data]
  • Thomas Pellissier Tanon, Daria Stepanova, Simon Razniewski, Paramita Mirza and Gerhard Weikum, Completeness-aware Rule Learning from Knowledge Graphs, ISWC, 2017 [pdf]

4. Logical foundations of recall information

The Semantic Web is commonly interpreted under the open-world assumption, meaning that the available information (e.g., in a data source) captures only a subset of reality. Therefore, there is no certainty about whether the available information provides a complete representation of reality. Our goal is to contribute a formal study of how to describe the completeness of parts of the Semantic Web stored in RDF data sources. We introduce a theoretical framework for augmenting RDF data sources with statements, themselves expressed in RDF, about their completeness. One immediate benefit of this framework is that query answers can then be complemented with information about their completeness. We study the impact of completeness statements on the complexity of query answering by considering different fragments of the SPARQL language, including the RDFS entailment regime and the federated scenario. We implement an efficient method for reasoning about query completeness and provide an experimental evaluation in the presence of large sets of completeness statements. A minimal sketch of the core idea follows.
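
A minimal sketch of the core idea (Python; toy data, far simpler than the formal framework): a completeness statement asserts that the source contains all triples matching a pattern, and a query pattern can be answered completely if some statement subsumes it.

    # Completeness statements: "the source holds ALL triples matching this
    # pattern", with None acting as a wildcard.
    completeness_statements = [
        ("BarackObama", "child", None),  # all children of Obama are present
        (None, "capital", None),         # all capital facts are present
    ]

    def covered(query_pattern, statements):
        """A query triple pattern (None = variable) is answered completely
        if some completeness statement subsumes it: every position of the
        statement is either a wildcard or equals the pattern's value."""
        return any(all(s is None or s == q
                       for s, q in zip(stmt, query_pattern))
                   for stmt in statements)

    print(covered(("BarackObama", "child", None), completeness_statements))
    # -> True (children of Obama are complete)
    print(covered(("BarackObama", "spouse", None), completeness_statements))
    # -> False (no statement covers spouse facts)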

Publications

  • Fariz Darari, Werner Nutt, Giuseppe Pirrò and Simon Razniewski, Completeness Management for RDF Data Sources, ACM Transactions on the Web (TWEB), 2018 [pdf]
  • Fariz Darari, Radityo Eko Prasojo, Simon Razniewski and Werner Nutt, COOL-WD: A Completeness Tool for Wikidata, ISWC demo, 2017 [pdf]
  • Simon Razniewski, Flip Korn, Werner Nutt and Divesh Srivastava, Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, SIGMOD, 2015 [pdf]
  • Simon Razniewski and Werner Nutt, Completeness of Queries over Incomplete Databases, VLDB, 2011 [pdf]

Demo

  • COOL-WD: Completeness reasoning over Wikidata [link]

5. Interesting negations in KBs

Knowledge bases (KBs) about notable entities and their properties are an important asset in applications such as search, question answering and dialogue. Yet popular KBs capture virtually only positive statements, and abstain from taking any stance on statements not stored in the KB. In this work, we make the case for explicitly stating salient statements that do not hold. Negative statements are useful for overcoming limitations of question answering, and can often contribute to informative summaries of entities. Because invalid statements are abundant, any effort to compile them needs to address ranking by saliency; a minimal sketch of the peer-based idea follows.
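
A minimal sketch of the peer-based intuition (Python; toy data, not the papers' full saliency model): statements that are frequent among an entity's peers but absent for the entity itself become candidate negations, ranked by peer frequency.

    from collections import Counter

    def candidate_negations(entity_stmts, peer_stmt_sets):
        """Rank (property, value) pairs that are absent for the entity but
        frequent among its peers, by the fraction of peers having them."""
        counts = Counter(pv for stmts in peer_stmt_sets for pv in stmts)
        candidates = ((pv, n / len(peer_stmt_sets))
                      for pv, n in counts.items() if pv not in entity_stmts)
        return sorted(candidates, key=lambda x: -x[1])

    # Toy example: most peer physicists won a Nobel Prize; this one did not.
    peers = [{("award", "NobelPrize")}, {("award", "NobelPrize")}, set()]
    print(candidate_negations({("field", "physics")}, peers))
    # -> [(('award', 'NobelPrize'), 0.666...)]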

Publications

  • Hiba Arnaout, Simon Razniewski and Gerhard Weikum, Enriching Knowledge Bases with Interesting Negative Statements, AKBC, 2020 (audience choice best paper) [pdf]  [link]
  • Hiba Arnaout, Simon Razniewski and Gerhard Weikum, Negative Statements Considered Useful, arXiv, 2020 [pdf]  [link]

(more details)


6. Linguistic theories for text coverage estimation

Scalar implicatures are language features that imply the negation of stronger statements; e.g., "She was married twice" typically implicates that she was not married thrice. In this work we discuss the importance of scalar implicatures in the context of textual information extraction. We investigate how textual features can be used to predict whether a given text segment mentions all objects standing in a certain relation with a certain subject. Preliminary results on Wikipedia indicate that this prediction is feasible and yields informative signals for assessing KB coverage; a minimal sketch of the prediction setup follows.
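
A minimal sketch of the prediction setup (Python; hypothetical features and toy data): coverage estimation cast as binary classification over simple textual features of a segment.

    from sklearn.linear_model import LogisticRegression

    # Toy feature rows: [segment length in tokens, number of objects
    # mentioned, presence of a scalar cue such as "twice"/"both" (0/1)].
    X = [[120, 2, 1], [45, 1, 0], [300, 4, 1], [60, 1, 0]]
    y = [1, 0, 1, 0]  # 1 = the segment mentions ALL objects of the relation

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Estimated probability that a new segment is fully covering.
    print(clf.predict_proba([[100, 2, 1]])[0][1])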

Publications

  • Simon Razniewski, Nitisha Jain, Paramita Mirza, Gerhard Weikum. Coverage of Information Extraction from Sentences and Paragraphs, EMNLP, 2019 [pdf]
  • Sneha Singhania, Simon Razniewski, Gerhard Weikum. Predicting Document Coverage for Relation Extraction, TACL, 2021 [pdf]



External collaborators

  • Fariz Darari, Universitas Indonesia
  • Werner Nutt, Free University of Bozen-Bolzano
  • Fabian Suchanek, Télécom ParisTech
  • Luis Galárraga, INRIA Rennes
  • Vevake Balaraman, University of Trento