YAGO: a Collection of Digital Knowledge

In recent years, the Internet has developed into a significant source of information. Train schedules, news, even entire encyclopedias are available online. Using search engines, we can query this information, but current search engines have limits. Assume we would like to know which scientists are also active in politics. This question can hardly be formulated in a way that it can be answered by Google. Queries like “politician scientist” only return results about opinions on political events. The problem here is that the computers we use today can store a tremendous amount of data, but are not able to relate this data to a given context or even to understand it. If it were possible to make computers understand data as knowledge, this knowledge could be helpful not only for Internet search, but also for many other tasks, such as understanding spoken language or the automatic translation of text into multiple languages. This is the goal of the “YAGO-NAGA” project at the Max Planck Institute for Informatics.

Before a computer can process knowledge, the knowledge must be stored in a structured fashion. Such a structured knowledge collection is called an ontology. The building blocks of an ontology are entities. An entity is any concrete or abstract object: the physicist Albert Einstein, the year 1879, or the Nobel Prize. Entities are connected by relations; for example, Albert Einstein is connected to the year 1879 by the relation “born” (see graph). We have developed an approach to automatically create such an ontology using the online encyclopedia Wikipedia. Wikipedia contains articles about thousands of personalities, products and organizations. Each of these articles becomes an entity in our ontology.
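To make the idea of entities connected by relations concrete, here is a minimal sketch in Python that stores facts as (subject, relation, object) triples. The class name `Ontology`, its methods, and the relation name "won" are hypothetical choices made for illustration; this is not YAGO's actual data model or storage format.

```python
from collections import defaultdict


class Ontology:
    """A toy fact store: each fact is a (subject, relation, object) triple."""

    def __init__(self):
        # Index facts by subject so lookups for one entity stay cheap.
        self._by_subject = defaultdict(set)

    def add_fact(self, subject, relation, obj):
        """Record that `subject` is connected to `obj` by `relation`."""
        self._by_subject[subject].add((relation, obj))

    def objects(self, subject, relation):
        """Return all objects linked to `subject` by `relation`, sorted."""
        facts = self._by_subject.get(subject, set())
        return sorted(obj for (rel, obj) in facts if rel == relation)


ontology = Ontology()
# The example from the text: Einstein is connected to 1879 by "born".
ontology.add_fact("Albert Einstein", "born", "1879")
# A second, illustrative relation name (an assumption, not a YAGO relation).
ontology.add_fact("Albert Einstein", "won", "Nobel Prize")

print(ontology.objects("Albert Einstein", "born"))
```

In this sketch, entities are plain strings and the "graph" is implicit: every triple is one labeled edge between two entities.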

There is, for example, an article about Albert Einstein, so the physicist can be recognized as an entity for the ontology. Each article in Wikipedia is classified into specific categories; the article about Einstein, for example, is in the category “born in 1879”. The keyword “born” allows the computer to store the fact that Einstein was born in 1879. Using this approach, we get a very large ontology, in which all of the entities known to Wikipedia have their place. This ontology is called YAGO (Yet Another Great Ontology, www.mpi-inf.mpg.de/yago-naga/yago/). At the moment, YAGO contains nearly 10 million entities and about 80 million facts.
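The category-based extraction described above can be sketched as a simple pattern match on category names. The regular expression, the function name `facts_from_categories`, and the second category in the usage example are assumptions made for illustration; the actual YAGO extraction rules are not given here and are likely more elaborate.

```python
import re

# Assumed pattern: category names of the form "born in <year>".
BORN_PATTERN = re.compile(r"^born in (\d{1,4})$", re.IGNORECASE)


def facts_from_categories(article_title, categories):
    """Turn matching category names into (entity, relation, value) facts."""
    facts = []
    for category in categories:
        match = BORN_PATTERN.match(category.strip())
        if match:
            # The keyword "born" becomes the relation; the year is the object.
            facts.append((article_title, "born", int(match.group(1))))
    return facts


# "born in 1879" is the category from the text; the second category is a
# made-up placeholder that the pattern is expected to ignore.
einstein_facts = facts_from_categories(
    "Albert Einstein", ["born in 1879", "some unrelated category"]
)
print(einstein_facts)
```

Each additional kind of fact would need its own pattern and relation name; the sketch covers only the “born” case mentioned in the text.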

YAGO2, a recently created extension of the original knowledge base, pays particular attention to the organization of entities and facts in space and time, two dimensions that are highly useful when searching in a knowledge base. As an example, the great majority of the approximately 900,000 person-entities in YAGO2 are anchored in time by their birth and death dates, allowing us to position them in their historical context. For example, one can ask about important historical events during the lifetime of a specific president, emperor or pope, or ask when that person actually held office.

Most of the approximately 7 million locations in YAGO2 have geographic coordinates which place them on Earth’s surface. Thus, spatial proximity between two locations can be used as a search criterion. An example of a search using the space and time criteria could be: Which 20th century scientists were awarded a Nobel Prize and were born in the vicinity of Stuttgart? In YAGO2 one finds, among others, Albert Einstein, as both his lifetime (1879-1955) and his birthplace Ulm (70 kilometers from Stuttgart) are stored in YAGO2.
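A query combining space and time, like the Stuttgart example, can be sketched as a filter over person records. Everything in this sketch is an assumption for illustration: the `Person` record, the function names, the 100 km radius, the reading of "20th century" as a lifetime overlapping 1901-2000, and the rounded coordinates. It uses the haversine great-circle formula, which is not necessarily how YAGO2 computes proximity.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Person:
    name: str
    born: int
    died: int
    birthplace: tuple  # (latitude, longitude) in degrees
    nobel_laureate: bool


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def laureates_near(people, center, radius_km, first_year, last_year):
    """Names of laureates alive during [first_year, last_year], born near center."""
    return [
        p.name for p in people
        if p.nobel_laureate
        and p.born <= last_year and p.died >= first_year
        and haversine_km(p.birthplace, center) <= radius_km
    ]


# Approximate, rounded coordinates used only for illustration.
STUTTGART = (48.78, 9.18)
people = [
    Person("Albert Einstein", 1879, 1955, (48.40, 9.99), True),  # born in Ulm
    Person("Max Planck", 1858, 1947, (54.32, 10.14), True),      # born in Kiel
]

print(laureates_near(people, STUTTGART, 100.0, 1901, 2000))
```

With these rounded coordinates the computed Ulm-Stuttgart distance is close to the roughly 70 kilometers quoted above, so Einstein passes the spatial filter while Planck, born in Kiel, does not.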

Johannes Hoffart

DEPT. 5 Databases and Information Systems
+49 681 9325-5028
Email jhoffart@mpi-inf.mpg.de

Gerhard Weikum

DEPT. 5 Databases and Information Systems
+49 681 9325-5000