Max-Planck-Institut für Informatik

Research

Vision and Objectives

The group's long-term objective is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from structured, semistructured, and textual information sources. Our approach towards this ambitious goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.

Today, scientific results are available on the Internet in the form of publications, a main information source for scholars, and in encyclopedias like Wikipedia, a major source for students. Digital libraries and thematic portals combine multiple literature or data collections, but there is neither deep integration nor comprehensive coverage. Information search is limited to keywords and simple metadata, and media like video, images, or speech are most effectively searched by means of manually created annotations.

We envision a comprehensive, multimodal knowledge base of encyclopedic scope but in a formal (machine-processable) representation. This should encompass all human knowledge in terms of explicit facts referring to the concepts and entities of the underlying domains of discourse (e.g., concepts such as enzymes, quasars, or poets and specific entities such as Steapsin, 3C 273, or Bertolt Brecht), to definitions, theorems, and hypotheses, to measurements of natural phenomena and artefacts, and a wealth of video and speech material as well as sensor readings, all of which should be first-class citizens for effective search. In particular, we aim for efficient methods for large-scale information enrichment and knowledge extraction, powerful querying with semantic search capabilities, deployment in a decentralized (peer-to-peer) and social-network environment, and maintenance of information history and knowledge evolution over long time horizons.

A comprehensive knowledge base should know all individual entities of this world (e.g., Nicolas Sarkozy), their semantic classes (e.g., Sarkozy is a Politician), relationships between entities (e.g., Sarkozy presidentOf France), as well as validity times and confidence values for the correctness of such facts. Moreover, it should come with logical reasoning capabilities and rich support for querying and ranking. The benefits from solving this grand challenge would be enormous.
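
As a rough illustration of the kind of record described above, a fact with a validity time and a confidence value could be sketched as follows. This is only a minimal sketch: the class `Fact`, its field names, the helper `facts_valid_in`, the year-granularity validity interval, and the confidence numbers are all assumptions made for this example, not the representation used by the group's systems.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One subject-predicate-object fact (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str
    confidence: float = 1.0            # illustrative value in [0, 1]
    valid_from: Optional[int] = None   # first valid year, None = unbounded
    valid_until: Optional[int] = None  # last valid year, None = unbounded

    def holds_in(self, year: int) -> bool:
        """True if the validity interval of this fact covers `year`."""
        after_start = self.valid_from is None or year >= self.valid_from
        before_end = self.valid_until is None or year <= self.valid_until
        return after_start and before_end


def facts_valid_in(facts: List[Fact], year: int,
                   min_confidence: float = 0.0) -> List[Fact]:
    """Keep facts that hold in `year` with at least `min_confidence`."""
    return [f for f in facts
            if f.holds_in(year) and f.confidence >= min_confidence]


# Example facts; the confidence numbers are invented for illustration.
facts = [
    Fact("Nicolas_Sarkozy", "type", "Politician", confidence=0.99),
    Fact("Nicolas_Sarkozy", "presidentOf", "France",
         confidence=0.95, valid_from=2007, valid_until=2012),
]

for fact in facts_valid_in(facts, 2010):
    print(fact.subject, fact.predicate, fact.obj)
```

Keeping time and confidence on the fact itself lets a query filter on both without consulting a separate provenance store; a real system would likely need finer-grained time points than whole years.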

With a knowledge base that distills the valuable content from the Web, we could address difficult questions that are beyond the capabilities of today's keyword-based search engines. For example, one could ask for drugs that inhibit proteases and would obtain a precise and fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Another example is searching for politicians who are also scientists; we would hope for ranked answers with individual names such as Angela Merkel or Benjamin Franklin, whereas today's search engines return web pages that contain the words "politicians" and "scientists" (e.g., on environmental topics). Such advanced information requests are posed by knowledge workers like scientists, students, journalists, historians, or market researchers. Although it is possible today to find relevant answers, this process is extremely laborious and time-consuming as it often requires rephrasing queries and browsing through many potentially promising but eventually useless result pages.


Research Areas

Knowledge Harvesting (Coordinator: Gerhard Weikum)

There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the proliferation of knowledge-sharing communities and the advances in large-scale information extraction from semistructured as well as natural-language Web sources. The grand vision is to turn the Web into a comprehensive knowledge base that can be efficiently searched with high precision. Our research towards this objective is centered around the YAGO knowledge base and the NAGA search engine. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. For further growing YAGO from Web sources while retaining its high quality, pattern-based extraction is combined with logic-based consistency checking in a unified framework. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques.
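
To make the idea of graph-template-based search more concrete, the following minimal sketch matches a conjunction of triple patterns against a small set of subject-predicate-object facts. It is not NAGA's query processor and does no ranking; the function names `match_pattern` and `query`, the `?x` variable syntax, and the toy facts are assumptions made for this example.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.

    Terms starting with '?' are variables; all other terms are constants.
    Returns the extended binding, or None if the match fails.
    """
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding


def query(triples, patterns):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match_pattern(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


# Toy fact graph (illustrative only).
triples = [
    ("Angela_Merkel", "type", "Politician"),
    ("Angela_Merkel", "type", "Scientist"),
    ("Benjamin_Franklin", "type", "Politician"),
    ("Benjamin_Franklin", "type", "Scientist"),
    ("Nicolas_Sarkozy", "type", "Politician"),
    ("Bertolt_Brecht", "type", "Poet"),
]

# "Politicians who are also scientists" as a two-pattern template.
template = [("?x", "type", "Politician"), ("?x", "type", "Scientist")]
answers = sorted(b["?x"] for b in query(triples, template))
print(answers)
```

This nested-loop evaluation scans every fact for every pattern, so it only conveys the semantics of a template query; the efficiency and scalability challenges mentioned above are exactly what such a naive approach leaves open.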

Web and Text Mining (Coordinator: Srikanta Bedathur)

Increasing amounts of information pertaining to almost all walks of life are published, archived, and made accessible over the Web. In addition, the advent of Web 2.0 technologies has led to increased interactions, dynamics, and diversity of content. Our goal is to detect and analyze patterns, trends, and salient properties in this huge wealth of information. This leads us to tackle a variety of research challenges, including high-quality content classification, summarization of semi-structured content, learning from user preferences, predictions in complex social networks, understanding the evolution of information, and quantifying the coherence of information archives.

Ranking and Uncertain Data Management (Coordinator: Martin Theobald)

Ranking query results is a fundamental building block connecting database (DB) technology with information retrieval (IR) methods, the latter taking into account not only the efficiency but also the effectiveness of the query processing from a true user perspective. Our work on combining DB and IR includes topics on semistructured data management (most notably our TopX project) with efficient top-k keyword search in both semistructured (i.e., text-centric XML data) and structured data sources (such as ontological knowledge bases stored as RDF data). Moreover, with uncertain data management becoming an increasingly important issue in the context of databases as well, we recently started looking into efficient support for first-order reasoning in these RDF knowledge bases. Here, our focus is on a unified probabilistic framework for uncertain reasoning with possible inconsistencies over such RDF facts, along with soft and hard rules in the form of Datalog-like Horn clauses. Thus, our approach for managing uncertainty involves (and aims to combine) various aspects from logic, probability theory, and machine learning, but also calls for an efficient and scalable database infrastructure (see our URDF project page).
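
The notion of efficient top-k keyword search can be illustrated with a small sketch in the spirit of the classic threshold algorithm for top-k aggregation: per-term lists are read in descending score order, and processing stops as soon as no unseen document can still enter the top k. This is not the TopX implementation; the function `threshold_topk`, the toy index, and the choice of summing per-term scores are assumptions made for this example.

```python
import heapq


def threshold_topk(index, terms, k):
    """Return the k documents with the highest summed score over `terms`.

    `index` maps each term to a dict {doc_id: non-negative score}.
    """
    # Sorted access: one list per term, best score first.
    lists = [sorted(index[t].items(), key=lambda kv: -kv[1]) for t in terms]
    seen = {}  # doc_id -> full aggregated score
    max_depth = max(len(entries) for entries in lists)

    for depth in range(max_depth):
        threshold = 0.0  # best score any still-unseen document could reach
        for entries in lists:
            if depth < len(entries):
                doc, score = entries[depth]
                threshold += score
                if doc not in seen:
                    # Random access: look up the document in every list.
                    seen[doc] = sum(index[t].get(doc, 0.0) for t in terms)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop early once the k-th best seen score reaches the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top

    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Toy inverted index with made-up scores.
index = {
    "xml": {"d1": 0.9, "d2": 0.5, "d3": 0.1},
    "retrieval": {"d1": 0.2, "d2": 0.8, "d4": 0.3},
}

top2 = threshold_topk(index, ["xml", "retrieval"], k=2)
print([doc for doc, _ in top2])
```

In this toy run the search stops after reading two entries per list, without ever touching `d3`. The early stop is only sound because scores are non-negative and the aggregation (a sum) is monotone.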

Scalable Management of Uncertain Data (Coordinator: Rainer Gemulla)

Scalable uncertainty management is centered around the interplay of information systems and probability theory. Traditional database systems excel at managing certain data; they combine first-order logic with efficient data manipulation and query processing techniques. Probability theory studies randomness and uncertainty; it forms the backbone of uncertainty management in disciplines such as machine learning and artificial intelligence. Modern information systems are often faced with tasks that lie at the intersection of these two fields: they need to model, manipulate, query, and reason with large amounts of uncertain information. Our group studies, both from a theoretical viewpoint and by building systems, how to model this uncertainty, how to learn and train probabilistic models from data, and how to perform probabilistic inference online. We focus specifically on applications that involve very large datasets, such as social networks or information extraction from Web sources, and develop distributed algorithms and systems that scale to these large datasets.

Distributed Data and Communities (Coordinator: Mauro Sozio)

Popular Web 2.0 applications and rapidly increasing social online communities face the need for distributed data management for scalability reasons. They may serve millions of users and manage user-provided data that is distributed across many computers in a data center or even over a wide-area network. To ensure good search performance and high availability of services, such distributed systems must have mechanisms and intelligent strategies to cope with high failure rates of components and with high dynamics of data, workload, and the network itself. Peer-to-peer networks for data sharing pose similar challenges and additionally face the problem of misbehaving peers that aim to manipulate distributed computations to their advantage. This entails the need for modeling and computing authority, trust, and reputation measures to retrieve relevant content and identify trustworthy peers.

Efficient Search in Semistructured Data Spaces (Independent Group Headed By: Ralf Schenkel)

We collaborate intensively with this independent research group.

Ontologies (Independent Group Headed By: Fabian M. Suchanek)

We collaborate intensively with this independent research group.


Participation in Externally Funded Projects
