YAGO-NAGA

Harvesting, Searching, and Ranking Knowledge from the Web

The YAGO-NAGA project started in 2006 with the goal of building a conveniently searchable, large-scale, highly accurate knowledge base of common facts in a machine-processable representation.

We have already harvested knowledge about millions of entities, and facts about their relationships, from Wikipedia and WordNet, carefully integrating the two sources. The resulting knowledge base, named YAGO, has very high precision and is freely available. The facts are represented as RDF triples, and we have developed methods and prototype systems for querying, ranking, and exploring knowledge. Our search engine NAGA provides ranked answers to queries based on statistical models.
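
To make the triple representation concrete, here is a minimal sketch in Python of facts stored as (subject, predicate, object) triples and queried by pattern. The entity and relation names are illustrative placeholders, not actual YAGO identifiers, and the `match` helper is a hypothetical stand-in: it does plain pattern matching and does not reflect NAGA's query language or its statistical ranking.

```python
from typing import Iterable, Iterator, Optional, Tuple

# A fact is a (subject, predicate, object) triple, as in RDF.
Triple = Tuple[str, str, str]

# Illustrative facts only; names are assumptions, not YAGO identifiers.
FACTS = [
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "type", "physicist"),
    ("Ulm", "locatedIn", "Germany"),
]


def match(
    facts: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern.

    A component left as None acts as a wildcard.
    """
    for s, p, o in facts:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)
```

Calling `list(match(FACTS, subject="Albert_Einstein"))` returns both triples about that entity, while `match(FACTS, predicate="locatedIn")` yields only the triple about Ulm. A real triple store would index the triples rather than scan them linearly.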

Several interlinked sub-projects build on the YAGO-NAGA foundation. Our vision is a confluence of Semantic Web (Ontologies), Social Web (Web 2.0), and Statistical Web (Information Extraction) assets towards a comprehensive repository of human knowledge. Our methodologies combine concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and logical reasoning.

AIDA
AIDA is a method for disambiguating mentions of named entities in text.
AMIE
AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases. This project is developed jointly with the DBWeb team of Télécom ParisTech.
ANGIE
ANGIE is an active knowledge system for interactive exploration.
BriQ
ClausIE
DEANNA
DEANNA is a framework for natural language question answering over structured knowledge bases.
diaNED
Time-Aware Named Entity Disambiguation for Diachronic Corpora
Equity
Equity is an end-to-end system for canonicalizing mentions of entities, classes, concepts and quantities in ad-hoc tables and their surrounding contexts.
Espresso
Computation of semantically meaningful substructures from knowledge graphs.
Fiction and Fantasy
The long-term goal of this project is to extract interesting information from fictional stories, mainly about their characters, including personal information (e.g., name, birth/death, title), interpersonal relationships (e.g., family relations, business relations, ally/enemy), and narratives (e.g., battles, who kills whom).
EVIN
EVIN (EVents In News) is a system that extracts named events from a news corpus, organizes them into ontological classes, and supports interactive exploration. EVIN exploits different kinds of similarities between news items, based on textual content, entity occurrences, and temporal ordering, and captures these similarities in a multi-view attributed graph.
HIGGINS
The HIGGINS project aims to combine crowdsourcing with automated information extraction techniques to enable high-quality fact extraction from complex textual inputs.
HYENA
HYENA is a multi-label classifier for entity types based on hierarchical taxonomies derived from YAGO2.
IBEX
In IBEX, we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (books), GTINs (commercial products), DOIs (documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73-96% and a very high coverage.
Javatools
The Javatools are a suite of Java classes for a variety of small tasks, such as parsing, database interaction, and file handling. They are used in the YAGO-NAGA project and are also available for download.
K2
Gathering and ranking photos of named entities with high precision, high recall, and diversity.
Know2Look
Know2Look is an image retrieval framework that uses commonsense knowledge to bridge the semantic gap between query keywords, textual descriptions, and the visual content of images.
Le Monde
Mining History with Le Monde. This project is developed jointly with the DBWeb team of Télécom ParisTech.
LEILA
LEILA is a system that extracts facts from Web sources by linguistic analysis.
NAGA
NAGA is a new semantic search engine supporting keyword search for the casual user as well as graph queries with regular expressions for the expert user.
PATTY
PATTY is a large collection of relations, grouped by synonymy and organized into a subsumption hierarchy.
PRAVDA
PRAVDA is a system based on label propagation for harvesting knowledge, especially temporal knowledge.
PROSPERA
Large-scale information extraction, a continuation of the SOFIE approach.
Quantity Search
Searching for entities with quantity constraints over web content.
RDF-3X
RDF-3X is an RDF storage and retrieval system that achieves excellent performance by following a RISC-style design philosophy.
RuLES
Rule Learning with Embedding Support
SOFIE
SOFIE extracts information from Web sources.
STICS
Still searching with keywords? "STICS: Searching with Strings, Things, and Cats" is a news search engine based on AIDA technology, which enables search for entities and categories!
TimeSEA
UWN
UWN is a multilingual version of WordNet, describing meanings of words in different languages and their relationships.
Watermarking
Watermarking and Provenance for Ontologies. This project is developed jointly with the DBWeb team of Télécom ParisTech.
YAGO
YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. YAGO contains almost 10 million entities (e.g., persons, organizations, cities) and 120 million facts about these entities. Unlike other automatically assembled knowledge bases, YAGO has a manually confirmed accuracy of 95%. YAGO is freely available at yago-knowledge.org.

Selected Publications

Johannes Hoffart, Fabian Suchanek, Klaus Berberich, Gerhard Weikum
YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia
Special issue of the Artificial Intelligence Journal, 2012
Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Edwin Lewis-Kelham, Gerard de Melo, Gerhard Weikum
YAGO2: Exploring and Querying World Knowledge in Time, Space, Context, and Many Languages
Demo paper in the proceedings of the 20th International World Wide Web Conference (WWW 2011)
Hyderabad, India, 2011
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum
Scalable Knowledge Harvesting with High Precision and High Recall
4th ACM International Conference on Web Search and Data Mining (WSDM 2011)
Martin Theobald, Gerhard Weikum
From Information to Knowledge: Harvesting Entities and Relationships from Web Sources
Tutorial at PODS 2012
Gerhard Weikum, Gjergji Kasneci, Maya Ramanath, Fabian Suchanek
Database and information-retrieval methods for knowledge discovery
Commun. ACM 52(4): 56-64 (2009)
Fabian Suchanek, Gjergji Kasneci, Gerhard Weikum
Yago - A Large Ontology from Wikipedia and WordNet
Elsevier Journal of Web Semantics
Gjergji Kasneci, Fabian Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum
NAGA: Searching and Ranking Knowledge
24th IEEE International Conference on Data Engineering (ICDE 2008)
Fabian Suchanek, Mauro Sozio, Gerhard Weikum
SOFIE: A Self-Organizing Framework for Information Extraction
18th International World Wide Web Conference (WWW 2009)
Thomas Neumann, Gerhard Weikum
RDF-3X: a RISC-style engine for RDF
Proc. VLDB Endowment 1(1): 647-659, August 2008