Vision and Objectives

The group's long-term objective is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from structured, semistructured, and textual information sources. Our approach towards this ambitious goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.

Today, scientific results are available on the Internet in the form of publications, a main information source for scholars, and in encyclopedias like Wikipedia, a major source for students. Digital libraries and thematic portals combine multiple literature or data collections, but there is neither deep integration nor comprehensive coverage. Information search is limited to keywords and simple metadata, and media like video, images, or speech are most effectively searched by means of manually created annotations.

We envision a comprehensive, multimodal knowledge base of encyclopedic scope but in a formal (machine-processable) representation. This should encompass all human knowledge in terms of explicit facts referring to the concepts and entities of the underlying domains of discourse (e.g., concepts such as enzymes, quasars, or poets and specific entities such as Steapsin, 3C 273, or Bertolt Brecht), to definitions, theorems, and hypotheses, to measurements of natural phenomena and artefacts, and a wealth of video and speech material as well as sensor readings, all of which should be first-class citizens for effective search. In particular, we aim for efficient methods for large-scale information enrichment and knowledge extraction, powerful querying with semantic search capabilities, deployment in a distributed and social-network environment, and maintenance of information history and knowledge evolution over long time horizons.

A comprehensive knowledge base should know all individual entities of this world (e.g., Nicolas Sarkozy), their semantic classes (e.g., Sarkozy is a Politician), relationships between entities (e.g., Sarkozy presidentOf France), as well as validity times and confidence values for the correctness of such facts. Moreover, it should come with logical reasoning capabilities and rich support for querying and ranking. The benefits from solving this grand challenge would be enormous.

With a knowledge base that sublimates the valuable content from the Web, we could address difficult questions that are beyond the capabilities of today's keyword-based search engines. For example, one could ask for drugs that inhibit proteases and would obtain a precise and fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Another example is searching for politicians who are also scientists; we would hope for ranked answers with individual names such as Angela Merkel or Benjamin Franklin, whereas today's search engines return web pages that contain the words "politicians" and "scientists" (e.g., on environmental topics). Such advanced information requests are posed by knowledge workers like scientists, students, journalists, historians, or market researchers. Although it is possible today to find relevant answers, this process is extremely laborious and time-consuming as it often requires rephrasing queries and browsing through many potentially promising but eventually useless result pages.
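To make the envisioned fact representation concrete, the following is a minimal sketch, not the group's actual data model. It assumes a hypothetical `Fact` record holding a subject, a predicate, an object, an optional validity interval, and a confidence value. The confidence numbers are invented for illustration, and the ranking by product of confidences is just one simple choice. The query mirrors the "politicians who are also scientists" example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact record: a triple plus validity time and confidence."""
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None   # first year of validity (inclusive)
    valid_until: Optional[int] = None  # last year of validity (inclusive)
    confidence: float = 1.0            # illustrative value in [0, 1]


# Toy facts; all confidence values are made up for this sketch.
FACTS = [
    Fact("Nicolas_Sarkozy", "type", "Politician", confidence=0.99),
    Fact("Nicolas_Sarkozy", "presidentOf", "France", 2007, 2012, 0.97),
    Fact("Angela_Merkel", "type", "Politician", confidence=0.99),
    Fact("Angela_Merkel", "type", "Scientist", confidence=0.90),
    Fact("Benjamin_Franklin", "type", "Politician", confidence=0.95),
    Fact("Benjamin_Franklin", "type", "Scientist", confidence=0.98),
]


def holds_at(fact, year):
    """Return True if the fact's validity interval covers the given year."""
    starts_ok = fact.valid_from is None or fact.valid_from <= year
    ends_ok = fact.valid_until is None or year <= fact.valid_until
    return starts_ok and ends_ok


def entities_in_all_classes(facts, classes):
    """Rank entities that belong to every class in `classes`.

    The score is the product of the per-class confidences, a simple
    assumption made here only to obtain some ranking.
    """
    best = {}  # (entity, class) -> highest confidence seen
    for f in facts:
        if f.predicate == "type" and f.obj in classes:
            key = (f.subject, f.obj)
            best[key] = max(best.get(key, 0.0), f.confidence)

    ranked = []
    for entity in {e for e, _ in best}:
        if all((entity, c) in best for c in classes):
            score = 1.0
            for c in classes:
                score *= best[(entity, c)]
            ranked.append((entity, score))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


ranked = entities_in_all_classes(FACTS, ["Politician", "Scientist"])
for entity, score in ranked:
    print(entity, round(score, 3))
```

With these toy values, the query returns individual entities rather than pages that merely contain the two words, and `holds_at` shows how a validity interval restricts a fact such as the presidency to specific years.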

Research areas

Knowledge Harvesting

Coordinator: Gerhard Weikum

The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and our YAGO-NAGA project, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Highlights of our ongoing work include the YAGO2 knowledge base, the AIDA tool for named entity disambiguation, and the RDF search engine RDF-3X. This research is supported by a Google Focused Research Award.

Text+Time Search and Analytics

Coordinator: Klaus Berberich

We focus on developing efficient and effective methods to search and analyze natural language texts that come with associated temporal information. This includes temporal expressions, which convey time periods a text refers to, as well as publication timestamps, which indicate when a text was published. Data of interest to us includes web archives, newspaper corpora, and other collections of born-digital or now-digital documents. Search, as one direction, targets situations when the user's information need is precise and can be satisfied by some piece of our data, typically a document. Analytics, as another direction, targets situations when the user's information need is vague or can be satisfied with some derived data, typically aggregated statistics. Implementing our methods in systems and experimentally evaluating them on real-world data is integral to our approach. Our recent and ongoing efforts include time-travel text search, algorithms to compute n-gram statistics at large scale, as well as redundancy-aware retrieval models.
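As a rough illustration of time-travel text search, the sketch below keeps one posting per document version, each with a validity interval, and evaluates a keyword query "as of" a given point in time. It is a toy under stated assumptions, not the group's system: the `TimeTravelIndex` class, its method names, the half-open year intervals, and the sample versions are all hypothetical, and ranking is omitted.

```python
from collections import defaultdict


class TimeTravelIndex:
    """Toy inverted index over document versions with validity intervals."""

    def __init__(self):
        # term -> list of (doc_id, begin, end) postings, one per version
        self._postings = defaultdict(list)

    def add_version(self, doc_id, text, begin, end):
        """Index one version of a document, valid during [begin, end)."""
        for term in set(text.lower().split()):
            self._postings[term].append((doc_id, begin, end))

    def search(self, query, at):
        """Return ids of documents whose version valid at time `at`
        contains every query term (conjunctive Boolean retrieval)."""
        result = None
        for term in query.lower().split():
            docs = {
                doc_id
                for doc_id, begin, end in self._postings.get(term, [])
                if begin <= at < end
            }
            result = docs if result is None else result & docs
        return sorted(result or [])


# Hypothetical archive: two versions of page1, one version of page2.
index = TimeTravelIndex()
index.add_version("page1", "knowledge base construction", 2005, 2008)
index.add_version("page1", "knowledge harvesting from the web", 2008, 2012)
index.add_version("page2", "web archives and knowledge", 2006, 2010)

print(index.search("knowledge web", at=2007))
print(index.search("knowledge web", at=2009))
```

The same query gives different answers at different times because only the version valid at the query time is consulted. The sketch assumes that versions of one document do not overlap in time.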

Knowledge Base Construction and Quality

Coordinator: Simon Razniewski

Structured representations of encyclopedic and common-sense world knowledge are essential for applications such as search, question answering, or dialogue, and especially important as interpretable and reusable components across tasks. We aim to construct knowledge bases that go beyond traditional entity-relationship assertions by adding information about expected numbers of objects and about knowledge base completeness. Where traditional approaches to KB construction usually focus on correctness, we investigate methods for measuring the extent to which a KB contains all facts on a selected topic, using techniques ranging from crowdsourcing to rule mining and text extraction. Such completeness information can then help both to target KB construction efforts and to gain confidence about the truth of statements currently not contained in knowledge bases.

Searching, Mining, and Learning with Informal Text

Coordinator: Andrew Yates

In contrast to authoritative information sources, like encyclopedias, news articles, and academic papers, much of the information available on the Web is contained in informal text that requires different strategies to interpret. Our group aims to develop methods for searching, mining, and learning with such text so that it may be integrated with other knowledge. This goal spans both information retrieval and natural language processing tasks, such as mining health-related claims from social media, extracting information from dialogue, and learning to identify relevant spans of text. On the information retrieval side, we are particularly interested in leveraging recent advances in deep learning to develop more powerful retrieval models and to learn fine-grained types of relevance, including task-specific and passage-level relevance.

Question Answering

Coordinator: Rishiraj Saha Roy

Research on question answering (QA) aims to provide direct answers to natural language utterances over curated knowledge graphs, structured databases, unstructured Web text, or a combination of these. In our group, we have tried to push the state of the art in QA along multiple dimensions. The key driving criteria have been handling diversity in question formulations and complexity in information needs, and providing unsupervised, interpretable, and robust solutions that are not constrained to specific settings and benchmarks. Specific contributions include a method for automated template generation for QA, a continuous learning framework that extends these learnt templates with a layer of semantic similarity and user feedback, the use of predicted answer types to improve efficiency for complex questions, a reusable module that enables any QA system to answer temporal questions, and a graph-based method that answers complex questions by joining evidence from multiple documents on the fly. Current focus areas include bridging the structured and unstructured paradigms using powerful graph algorithms, and handling multi-turn question answering for information-seeking conversations using smart methods for context resolution. Detailed information on this research group can be found here.

Collaborations with Independent Research Groups

Exploratory Data Analysis (Independent Group Headed By: Jilles Vreeken)

We intensively collaborate with this independent research group at Saarland University.

Participation in Externally Funded Projects