Max Planck Institut Informatik

Research

Vision and Objectives

The group's long-term objective is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from structured, semistructured, and textual information sources. Our approach towards this ambitious goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.

Today, scientific results are available on the Internet in the form of publications, a main information source for scholars, and in encyclopedias like Wikipedia, a major source for students. Digital libraries and thematic portals combine multiple literature or data collections, but there is neither deep integration nor comprehensive coverage. Information search is limited to keywords and simple metadata, and media like video, images, or speech are most effectively searched by means of manually created annotations.

We envision a comprehensive, multimodal knowledge base of encyclopedic scope but in a formal (machine-processable) representation. This should encompass all human knowledge in terms of explicit facts referring to the concepts and entities of the underlying domains of discourse (e.g., concepts such as enzymes, quasars, or poets and specific entities such as Steapsin, 3C 273, or Bertolt Brecht), to definitions, theorems, and hypotheses, to measurements of natural phenomena and artefacts, and a wealth of video and speech material as well as sensor readings, all of which should be first-class citizens for effective search. In particular, we aim for efficient methods for large-scale information enrichment and knowledge extraction, powerful querying with semantic search capabilities, deployment in a distributed and social-network environment, and maintenance of information history and knowledge evolution over long time horizons.

A comprehensive knowledge base should know all individual entities of this world (e.g., Nicolas Sarkozy), their semantic classes (e.g., Sarkozy is a Politician), relationships between entities (e.g., Sarkozy presidentOf France), as well as validity times and confidence values for the correctness of such facts. Moreover, it should come with logical reasoning capabilities and rich support for querying and ranking. The benefits from solving this grand challenge would be enormous.

With a knowledge base that distills the valuable content from the Web, we could address difficult questions that are beyond the capabilities of today's keyword-based search engines. For example, one could ask for drugs that inhibit proteases and would obtain a precise and fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Another example is searching for politicians who are also scientists; we would hope for ranked answers with individual names such as Angela Merkel or Benjamin Franklin, whereas today's search engines return web pages that contain the words "politicians" and "scientists" (e.g., on environmental topics). Such advanced information requests are posed by knowledge workers like scientists, students, journalists, historians, or market researchers. Although it is possible today to find relevant answers, the process is extremely laborious and time-consuming, as it often requires rephrasing queries and browsing through many seemingly promising but ultimately useless result pages.
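To make the idea of explicit facts with confidence values and validity times concrete, here is a minimal sketch of such a fact store and of the "politicians who are also scientists" query. It is an illustration only, not the data model of any of our systems: the `Fact` class, the function names, and all confidence values are assumptions invented for this example, and multiplying confidences assumes the two class memberships are independent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One subject-predicate-object fact (hypothetical toy schema)."""
    subject: str
    predicate: str
    obj: str
    confidence: float = 1.0            # made-up illustrative values below
    valid_from: Optional[int] = None   # validity time as a year, if known
    valid_until: Optional[int] = None


FACTS = [
    Fact("Angela_Merkel", "type", "Politician", 0.98),
    Fact("Angela_Merkel", "type", "Scientist", 0.90),
    Fact("Benjamin_Franklin", "type", "Politician", 0.95),
    Fact("Benjamin_Franklin", "type", "Scientist", 0.97),
    Fact("Nicolas_Sarkozy", "type", "Politician", 0.99),
    Fact("Nicolas_Sarkozy", "presidentOf", "France", 0.99, 2007, 2012),
]


def instances_of(facts, cls):
    """Map each entity to the confidence of its membership in class `cls`."""
    return {f.subject: f.confidence
            for f in facts
            if f.predicate == "type" and f.obj == cls}


def in_both_classes(facts, cls_a, cls_b):
    """Entities that belong to both classes, ranked by joint confidence."""
    a = instances_of(facts, cls_a)
    b = instances_of(facts, cls_b)
    # Independence assumption: joint confidence is the product.
    scored = [(entity, a[entity] * b[entity]) for entity in a.keys() & b.keys()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = in_both_classes(FACTS, "Politician", "Scientist")
for entity, score in ranked:
    print(f"{entity}\t{score:.3f}")
```

The query returns named entities rather than pages that merely mention both words, which is the difference from keyword search described above.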


Research Areas

Knowledge Harvesting (Coordinator: Gerhard Weikum)

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and our YAGO-NAGA project, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Highlights of our ongoing work include the YAGO2 knowledge base, the AIDA tool for named entity disambiguation, and the RDF search engine RDF-3X. This research is supported by a Google Focused Research Award.

Ranking and Uncertain Data Management (Coordinator: Martin Theobald)

Ranking query results is a fundamental building block connecting database (DB) technology with information retrieval (IR) methods; IR considers not only the efficiency of query processing but also its effectiveness from the user's perspective. Our work on combining DB and IR includes semistructured data management (most notably our TopX project), with efficient top-k keyword search over both semistructured sources (i.e., text-centric XML data) and structured sources (such as ontological knowledge bases stored as RDF data). Moreover, as uncertain data management becomes increasingly important in the context of databases, we recently started looking into efficient support for first-order reasoning in these RDF knowledge bases. Here, our focus is on a unified probabilistic framework for uncertain reasoning over RDF facts that may be inconsistent, along with soft and hard rules in the form of Datalog-like Horn clauses. Thus, our approach to managing uncertainty aims to combine aspects of logic, probability theory, and machine learning, and it also calls for an efficient and scalable database infrastructure (see our URDF project page).
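The distinction between soft and hard rules over uncertain facts can be illustrated with a small sketch. This is not the URDF reasoning algorithm: the rule, its weight, the predicate names `worksIn`, `locatedIn`, and `bornIn`, the entities `PersonA`, `CityX`, and `CityY`, and all confidence values are hypothetical. A soft rule derives new facts with reduced confidence (here simply the rule weight times the product of the body confidences, assuming independence), while a hard rule acts as a constraint whose violations mark inconsistent facts.

```python
# Uncertain facts: (subject, predicate, object) -> confidence (toy values).
facts = {
    ("Sarkozy", "presidentOf", "France"): 0.9,
    ("France", "locatedIn", "Europe"): 1.0,
    ("PersonA", "bornIn", "CityX"): 0.7,
    ("PersonA", "bornIn", "CityY"): 0.4,
}


def apply_chain_rule(facts, head, body1, body2, weight):
    """Soft Horn rule: head(x, z) <- body1(x, y), body2(y, z).

    Derived confidence = weight * conf(body1) * conf(body2), which assumes
    the body facts are independent. If a fact is derived several times,
    the highest confidence is kept.
    """
    derived = {}
    for (s1, p1, o1), c1 in facts.items():
        if p1 != body1:
            continue
        for (s2, p2, o2), c2 in facts.items():
            if p2 == body2 and s2 == o1:
                key = (s1, head, o2)
                derived[key] = max(derived.get(key, 0.0), weight * c1 * c2)
    return derived


def functional_violations(facts, predicate):
    """Hard rule: `predicate` is functional, so each subject has one object.

    Returns (subject, first_object, conflicting_object) triples.
    """
    first_seen = {}
    conflicts = []
    for (s, p, o) in facts:
        if p != predicate:
            continue
        if s in first_seen and first_seen[s] != o:
            conflicts.append((s, first_seen[s], o))
        first_seen.setdefault(s, o)
    return conflicts


# Soft rule with hypothetical weight 0.8:
#   worksIn(x, z) <- presidentOf(x, y), locatedIn(y, z)
derived = apply_chain_rule(facts, "worksIn", "presidentOf", "locatedIn", 0.8)

# Hard rule: a person is born in exactly one place.
conflicts = functional_violations(facts, "bornIn")

print(derived)
print(conflicts)
```

A full probabilistic framework would resolve each conflict by weighing the competing facts against each other rather than only reporting it; the sketch stops at detection.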

Scalable Management of Uncertain Data (Coordinator: Rainer Gemulla)

Scalable uncertainty management is centered around the interplay of information systems and probability theory. Traditional database systems excel at managing certain data; they combine first-order logic with efficient data manipulation and query processing techniques. Probability theory studies randomness and uncertainty; it forms the backbone of uncertainty management in disciplines such as machine learning and artificial intelligence. Modern information systems are often faced with tasks that lie in the intersection of these two fields: they need to model, manipulate, query, and reason with large amounts of uncertain information. Our group studies, both from a theoretical viewpoint and by building systems, how to model this uncertainty, how to learn and train probabilistic models from data, and how to perform probabilistic inference online. We focus specifically on applications that involve very large datasets, such as social networks or information extraction from Web sources, and develop distributed algorithms and systems that scale to these large datasets.

Search and Mining over Web-Scale Graphs (Max Planck Partner Group in India Headed By: Srikanta Bedathur)

Graphs are a rich and flexible way to model a large variety of information, ranging from social structures and knowledge relationships to syntactic-level co-occurrence of words or entities in a database and linkages between documents. With the growing interest in Linked Open Data efforts, many of these datasets are being linked together, resulting in a huge graph. Given this growth of graph-structured data, devising effective and efficient computational techniques for searching and mining over these graphs is of utmost importance. Our focus is mainly on developing scalable algorithms that can be integrated cleanly into modern graph data management systems, and on novel applications in the areas of knowledge graphs and social networks. We also address problems that arise when graphs are constantly undergoing changes.

Efficient Search in Semistructured Data Spaces (Independent Group Headed By: Ralf Schenkel)

We intensively collaborate with this independent research group at Saarland University.

Ontologies (Independent Group Headed By: Fabian M. Suchanek)

We intensively collaborate with this independent research group at the Max Planck Institute for Informatics.

Querying, Indexing, and Discovery in Dynamic Data (Independent Group Headed By: Sebastian Michel)

We intensively collaborate with this independent research group at Saarland University.


Participation in Externally Funded Projects