Vision and Objectives

The group's long-term objective is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from structured, semistructured, and textual information sources. Our approach towards this ambitious goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.

Today, scientific results are available on the Internet in the form of publications, a main information source for scholars, and in encyclopedias like Wikipedia, a major source for students. Digital libraries and thematic portals combine multiple literature or data collections, but there is neither deep integration nor comprehensive coverage. Information search is limited to keywords and simple metadata, and media like video, images, or speech are most effectively searched by means of manually created annotations.

We envision a comprehensive, multimodal knowledge base of encyclopedic scope but in a formal (machine-processable) representation. It should encompass all human knowledge: explicit facts about the concepts and entities of the underlying domains of discourse (e.g., concepts such as enzymes, quasars, or poets and specific entities such as Steapsin, 3C 273, or Bertolt Brecht); definitions, theorems, and hypotheses; measurements of natural phenomena and artefacts; and a wealth of video and speech material as well as sensor readings, all of which should be first-class citizens for effective search. In particular, we aim for efficient methods for large-scale information enrichment and knowledge extraction, powerful querying with semantic search capabilities, deployment in distributed and social-network environments, and maintenance of information history and knowledge evolution over long time horizons.

A comprehensive knowledge base should know all individual entities of this world (e.g., Nicolas Sarkozy), their semantic classes (e.g., Sarkozy is a Politician), relationships between entities (e.g., Sarkozy presidentOf France), as well as validity times and confidence values for the correctness of such facts. Moreover, it should come with logical reasoning capabilities and rich support for querying and ranking. The benefits from solving this grand challenge would be enormous.
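To make the envisioned representation concrete, the following minimal Python sketch shows one possible encoding of such annotated facts. The field names and the confidence values are illustrative assumptions, not the schema of any particular system; the validity interval reflects Sarkozy's actual term of office.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Fact:
    """One knowledge-base statement: subject, predicate, object, optionally
    annotated with a validity interval and a confidence score."""
    subject: str
    predicate: str
    obj: str
    valid: Optional[Tuple[str, str]] = None  # (begin, end) as ISO dates
    confidence: float = 1.0  # illustrative; real systems estimate this

facts = [
    Fact("Nicolas_Sarkozy", "type", "Politician", confidence=0.98),
    Fact("Nicolas_Sarkozy", "presidentOf", "France",
         valid=("2007-05-16", "2012-05-15"), confidence=0.95),
]
```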

With a knowledge base that distills the valuable content of the Web, we could address difficult questions that are beyond the capabilities of today's keyword-based search engines. For example, one could ask for drugs that inhibit proteases and would obtain a precise and fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Another example is searching for politicians who are also scientists; we would hope for ranked answers with individual names such as Angela Merkel or Benjamin Franklin, whereas today's search engines return web pages that contain the words "politicians" and "scientists" (e.g., on environmental topics). Such advanced information requests are posed by knowledge workers like scientists, students, journalists, historians, and market researchers. Although it is possible today to find relevant answers, the process is extremely laborious and time-consuming, as it often requires rephrasing queries and browsing through many potentially promising but ultimately useless result pages.
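A toy version of such a query can be phrased over a handful of hand-written subject-predicate-object triples. The sketch below only illustrates the idea of querying semantic classes and relations; answering such requests at Web scale additionally requires extraction, disambiguation, ranking, and efficient indexing.

```python
# A toy triple store; real knowledge bases hold billions of such facts.
triples = {
    ("Angela_Merkel", "type", "Politician"),
    ("Angela_Merkel", "type", "Scientist"),
    ("Benjamin_Franklin", "type", "Politician"),
    ("Benjamin_Franklin", "type", "Scientist"),
    ("Saquinavir", "type", "Drug"),
    ("Saquinavir", "inhibits", "Protease"),
    ("Emmanuel_Macron", "type", "Politician"),
}

def entities_in_all_classes(triples, *classes):
    """Entities that belong to every one of the given semantic classes."""
    member_sets = [{s for (s, p, o) in triples if p == "type" and o == c}
                   for c in classes]
    return set.intersection(*member_sets)

print(entities_in_all_classes(triples, "Politician", "Scientist"))
# {'Angela_Merkel', 'Benjamin_Franklin'}

# The protease example is a join between a type fact and a relation fact.
print({s for (s, p, o) in triples
       if p == "inhibits" and o == "Protease"
       and (s, "type", "Drug") in triples})
# {'Saquinavir'}
```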

Knowledge Harvesting

Coordinator: Gerhard Weikum

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and our YAGO-NAGA project, as well as industrial ones such as Freebase and True Knowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Highlights of our ongoing work include the YAGO2 knowledge base, the AIDA tool for named entity disambiguation, and the RDF search engine RDF-3X. This research is supported by a Google Focused Research Award.
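As a deliberately simplified illustration of the disambiguation task that AIDA addresses, the sketch below picks the candidate entity whose description best overlaps with the words around a mention. AIDA itself combines much richer ingredients, such as popularity priors, keyphrase similarity, and graph coherence, none of which appear in this toy; the candidate descriptions here are invented for the example.

```python
def disambiguate(mention_context, candidates):
    """Pick the candidate entity whose textual description shares the most
    words with the mention's surrounding text, a crude context-similarity
    score standing in for the richer features real systems use."""
    ctx = set(mention_context.lower().split())
    def overlap(description):
        return len(ctx & set(description.lower().split()))
    return max(candidates, key=lambda e: overlap(candidates[e]))

candidates = {
    "Paris_(France)": "capital city of France on the Seine",
    "Paris_Hilton": "American media personality and businesswoman",
}
print(disambiguate("the mayor of Paris announced a plan for the Seine",
                   candidates))
# Paris_(France)
```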

Data Mining

Coordinator: Pauli Miettinen

Data mining, the extraction of novel and interesting information from data, is a fundamental part of modern data science. Our group's goal is to develop methods and algorithms for data mining that are based on well-founded algorithmic principles. We work on every step of this process: analysing the computational complexity and other properties of the problem, developing algorithms that work in practice based on this theoretical understanding, and applying these algorithms to real-world problems. Many interesting data sets contain binary (or higher-arity) relations, collections of sets of elements, or bipartite graphs. All of these data sets can be expressed as binary matrices (or tensors). Currently our focus is mostly on decomposition methods for binary matrices and tensors at all three of the aforementioned levels: theory, algorithms, and applications. We also work on variations of these problems, such as dynamically updating the decompositions and automatically selecting the decomposition rank using the minimum description length principle.
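As an illustration of the core objects of study (a minimal sketch, not one of our algorithms), the following code defines the Boolean matrix product and the reconstruction error that binary decomposition methods seek to minimize; the example matrices are made up.

```python
import numpy as np

def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is 1 iff some k has
    B[i, k] = 1 and C[k, j] = 1 (OR of ANDs instead of sum of products)."""
    return (B.astype(int) @ C.astype(int) > 0).astype(int)

def reconstruction_error(A, B, C):
    """Number of entries where the Boolean product of B and C differs from A."""
    return int(np.sum(boolean_product(B, C) != A))

# A 4x4 binary matrix that happens to be an exact rank-2 Boolean product.
B = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
C = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
A = boolean_product(B, C)
print(reconstruction_error(A, B, C))  # 0
```

Finding a rank-k pair (B, C) that minimizes this error is NP-hard in general, which is why both theoretical analysis and practical heuristics matter.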

Text+Time Search and Analytics

Coordinator: Klaus Berberich

We focus on developing efficient and effective methods to search and analyze natural language texts that come with associated temporal information. This includes temporal expressions, which convey the time periods a text refers to, as well as publication timestamps, which indicate when a text was published. Data of interest to us includes web archives, newspaper corpora, and other collections of born-digital or digitized documents. Search, as one direction, targets situations in which the user's information need is precise and can be satisfied by some piece of our data, typically a document. Analytics, as another direction, targets situations in which the user's information need is vague or can be satisfied with some derived data, typically aggregated statistics. Implementing our methods in systems and experimentally evaluating them on real-world data is integral to our approach. Our recent and ongoing efforts include time-travel text search, algorithms to compute n-gram statistics at large scale, and redundancy-aware retrieval models.
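As a minimal sketch of the search direction, the following toy filters documents whose associated time periods intersect a query interval. The corpus and intervals are invented; practical retrieval models additionally score the degree of overlap and combine it with textual relevance.

```python
from datetime import date

def overlaps(interval_a, interval_b):
    """True iff two closed date intervals intersect."""
    return interval_a[0] <= interval_b[1] and interval_b[0] <= interval_a[1]

# Toy corpus: each document carries the time period its text refers to.
docs = {
    "d1": (date(2008, 9, 1), date(2009, 6, 30)),
    "d2": (date(1998, 7, 1), date(1998, 7, 31)),
}

query_interval = (date(2008, 1, 1), date(2008, 12, 31))
hits = [d for d, span in docs.items() if overlaps(span, query_interval)]
print(hits)  # ['d1']
```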

Semantic Data: Reasoning and Learning

Coordinator: Daria Stepanova

Means for enriching data with semantics and methods for reasoning about it are at the heart of any intelligent system. Especially now, with the maturation of knowledge graphs and their popularity in applications such as semantic Web search, the need for deductive and inductive reasoning services is growing rapidly. Our group deals with a variety of issues in this context. We work on new methods for knowledge representation and their applicability to Web data. We are concerned with handling incomplete and inconsistent information, data repair, diagnostic reasoning, and knowledge revision. Furthermore, we aim at combining learning and reasoning techniques for semantically enhanced data processing in a variety of applications. In particular, we develop inductive learning approaches for deriving hidden insights from knowledge graphs in the form of logical rules.
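The following toy sketch illustrates what inductive rule learning measures over a knowledge graph: the support and confidence of a candidate rule. The rule, the facts, and the resulting statistics are invented for illustration and do not reflect our actual systems.

```python
triples = {
    ("Alice", "marriedTo", "Bob"), ("Bob", "livesIn", "Berlin"),
    ("Alice", "livesIn", "Berlin"),
    ("Carol", "marriedTo", "Dave"), ("Dave", "livesIn", "Paris"),
    # Carol's residence is missing, so one grounding is unconfirmed.
}

def rule_stats(triples):
    """Support and confidence of the candidate rule
    marriedTo(x, y) AND livesIn(y, z) => livesIn(x, z)."""
    body = {(x, z)
            for (x, p1, y) in triples if p1 == "marriedTo"
            for (y2, p2, z) in triples if p2 == "livesIn" and y2 == y}
    confirmed = {(x, z) for (x, z) in body if (x, "livesIn", z) in triples}
    return len(body), len(confirmed) / len(body)

print(rule_stats(triples))  # (2, 0.5)
```

A rule miner would enumerate many such candidate rules and keep those whose support and confidence exceed chosen thresholds.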

Text Analysis

Coordinator: Jannik Strötgen

Despite the development of knowledge bases and their improvements in recent years, most human knowledge is still available only in unstructured form, in particular as natural language text. Thus, applications such as search engines and question answering systems benefit from knowledge extracted from texts. Our group aims at developing natural language processing tools for extracting valuable information from large document collections. In particular, we tackle the extraction and interpretation of temporal information, as time is an important dimension in any information space. For instance, we continue to extend and improve the temporal tagger HeidelTime. We also work on semantically refined tasks: we develop a time-aware search engine, which allows users to formulate queries with temporal constraints on the documents' content, and we perform exploratory corpus analysis. The field of digital humanities offers further opportunities: instead of studying interesting phenomena based on manually analyzed examples, researchers can apply natural language processing techniques to perform large-scale analyses of big corpora, which can lead to new insights in the respective research areas, such as literary studies.
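To illustrate what temporal tagging involves at its simplest, the toy sketch below finds one narrow class of explicit date expressions and normalizes them to ISO format. This is a stand-in for the task, not HeidelTime, which covers far more expression types (relative, underspecified, multilingual) and resolves them in context.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def tag_dates(text):
    """Find expressions like 'May 3, 2012' and normalize them to ISO dates."""
    pattern = (r"(January|February|March|April|May|June|July|August"
               r"|September|October|November|December) (\d{1,2}), (\d{4})")
    return [(m.group(0),
             f"{int(m.group(3)):04d}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}")
            for m in re.finditer(pattern, text)]

print(tag_dates("The play premiered on August 31, 1928 in Berlin."))
# [('August 31, 1928', '1928-08-31')]
```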

Collaborations with Independent Research Groups

Exploratory Data Analysis (Independent Research Group headed by Jilles Vreeken)

We collaborate intensively with this independent research group at Saarland University.

Participation in Externally Funded Projects