The group's long-term objective is to develop methodology for knowledge discovery: collecting, organizing, searching, exploring, and ranking facts from structured, semistructured, and textual information sources. Our approach towards this ambitious goal combines concepts, models, and algorithms from several fields, including database systems, information retrieval, statistical learning, and data mining.
Today, scientific results are available on the Internet in the form of publications, a main information source for scholars, and in encyclopedias like Wikipedia, a major source for students. Digital libraries and thematic portals combine multiple literature or data collections, but there is neither deep integration nor comprehensive coverage. Information search is limited to keywords and simple metadata, and media like video, images, or speech are most effectively searched by means of manually created annotations.
We envision a comprehensive, multimodal knowledge base of encyclopedic scope but in a formal (machine-processable) representation. This should encompass all human knowledge in terms of explicit facts referring to the concepts and entities of the underlying domains of discourse (e.g., concepts such as enzymes, quasars, or poets and specific entities such as Steapsin, 3C 273, or Bertolt Brecht), to definitions, theorems, and hypotheses, to measurements of natural phenomena and artefacts, and a wealth of video and speech material as well as sensor readings, all of which should be first-class citizens for effective search. In particular, we aim for efficient methods for large-scale information enrichment and knowledge extraction, powerful querying with semantic search capabilities, deployment in a distributed and social-network environment, and maintenance of information history and knowledge evolution over long time horizons.
A comprehensive knowledge base should know all individual entities of this world (e.g., Nicolas Sarkozy), their semantic classes (e.g., Sarkozy is a Politician), relationships between entities (e.g., Sarkozy presidentOf France), as well as validity times and confidence values for the correctness of such facts. Moreover, it should come with logical reasoning capabilities and rich support for querying and ranking. The benefits from solving this grand challenge would be enormous.
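As a rough illustration of the kind of record described above, the following minimal sketch stores facts as subject-predicate-object triples with an optional validity interval and a confidence value. The class name `Fact`, the helper `query`, the year-based granularity, and the confidence numbers are assumptions made for this example; they do not describe the data model of any particular knowledge base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One fact: a triple plus validity time and confidence (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None  # first year the fact holds; None = unbounded
    valid_to: Optional[int] = None    # last year the fact holds; None = unbounded
    confidence: float = 1.0           # estimated probability that the fact is correct

    def holds_in(self, year: int) -> bool:
        """True if the given year lies inside the validity interval."""
        after_start = self.valid_from is None or self.valid_from <= year
        before_end = self.valid_to is None or year <= self.valid_to
        return after_start and before_end


# Illustrative facts; the confidence values are made up for the example.
facts = [
    Fact("Nicolas_Sarkozy", "type", "Politician", confidence=0.99),
    Fact("Nicolas_Sarkozy", "presidentOf", "France", 2007, 2012, 0.95),
]


def query(facts: List[Fact], predicate: str, obj: str, year: int,
          min_confidence: float = 0.5) -> List[Fact]:
    """Return matching facts valid in `year`, ranked by descending confidence."""
    hits = [
        f for f in facts
        if f.predicate == predicate
        and f.obj == obj
        and f.holds_in(year)
        and f.confidence >= min_confidence
    ]
    return sorted(hits, key=lambda f: f.confidence, reverse=True)
```

Calling `query(facts, "presidentOf", "France", 2010)` returns the Sarkozy fact, while the same query for 2015 returns an empty list because the validity interval has ended.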
With a knowledge base that sublimates the valuable content from the Web, we could address difficult questions that are beyond the capabilities of today's keyword-based search engines. For example, one could ask for drugs that inhibit proteases and would obtain a precise and fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Another example is searching for politicians who are also scientists; we would hope for ranked answers with individual names such as Angela Merkel or Benjamin Franklin, whereas today's search engines return web pages that contain the words "politicians" and "scientists" (e.g., on environmental topics). Such advanced information requests are posed by knowledge workers like scientists, students, journalists, historians, or market researchers. Although it is possible today to find relevant answers, this process is extremely laborious and time-consuming as it often requires rephrasing queries and browsing through many potentially promising but eventually useless result pages.
Coordinator: Gerhard Weikum
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, ReadTheWeb, and our YAGO-NAGA project, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Highlights of our ongoing work include the YAGO2 knowledge base, the AIDA tool for named entity disambiguation, and the RDF search engine RDF-3X. This research is supported by a Google Focused Research Award.
Coordinator: Klaus Berberich
We focus on developing efficient and effective methods to search and analyze natural language texts that come with associated temporal information. This includes temporal expressions, which convey time periods a text refers to, as well as publication timestamps, which indicate when a text was published. Data of interest to us includes web archives, newspaper corpora, and other collections of born-digital or now-digital documents. Search, as one direction, targets situations when the user's information need is precise and can be satisfied by some piece of our data, typically a document. Analytics, as another direction, targets situations when the user's information need is vague or can be satisfied with some derived data, typically aggregated statistics. Implementing our methods in systems and experimentally evaluating them on real-world data is integral to our approach. Our recent and ongoing efforts include time-travel text search, algorithms to compute n-gram statistics at large scale, as well as redundancy-aware retrieval models.
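To make the idea of time-travel text search more concrete, here is a minimal sketch that answers a keyword query "as of" a given point in time over archived document versions. The `Version` record, the half-open validity interval encoded as YYYYMMDD integers, and the linear scan are all assumptions chosen for brevity; a real system would rely on index structures rather than scanning every version.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    """One archived version of a document (hypothetical schema)."""
    doc_id: str
    text: str
    begin: int  # first day (YYYYMMDD) on which this version was live
    end: int    # first day on which it was no longer live (exclusive)


def time_travel_search(versions: List[Version], query: str, t: int) -> List[Version]:
    """Return versions that were live at time t and contain all query terms."""
    terms = query.lower().split()
    results = []
    for v in versions:
        if v.begin <= t < v.end:  # the version existed at time t
            words = set(v.text.lower().split())
            if all(term in words for term in terms):
                results.append(v)
    return results


# Illustrative archive with two versions of the same page.
archive = [
    Version("page1", "knowledge base construction from Wikipedia", 20080101, 20100101),
    Version("page1", "knowledge base construction from Web text", 20100101, 20120101),
]
```

With this archive, searching for `"knowledge wikipedia"` at `20090615` finds the first version, whereas the same query at `20110615` finds nothing because the page no longer mentioned Wikipedia at that time.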
Coordinator: Simon Razniewski
Structured representations of encyclopedic and common-sense world knowledge are essential for applications such as search, question answering, or dialogue, and especially important as interpretable and reusable components across tasks. We aim to construct knowledge bases that go beyond traditional entity-relationship assertions by adding information about the expected numbers of objects and about knowledge base completeness. Where traditional approaches to KB construction usually focus on correctness, we investigate methods for measuring the extent to which a KB contains all facts on a selected topic, using techniques ranging from crowdsourcing to rule mining and text extraction. Such completeness information can then help both to target KB construction efforts and to gain confidence about the truth of statements currently not contained in knowledge bases.
Coordinator: Andrew Yates
In contrast to authoritative information sources, like encyclopedias, news articles, and academic papers, much of the information available on the Web is contained in informal text that requires different strategies to interpret. Our group aims to develop methods for searching, mining, and learning with such text so that it may be integrated with other knowledge. This goal spans both information retrieval and natural language processing tasks, such as mining health-related claims from social media, extracting information from dialogue, and learning to identify relevant spans of text. On the information retrieval side, we are particularly interested in leveraging recent advances in deep learning to develop more powerful retrieval models and to learn fine-grained types of relevance, including task-specific and passage-level relevance.
Coordinator: Rishiraj Saha Roy
Research on question answering (QA) aims to provide direct answers to natural language utterances over curated knowledge graphs, structured databases, unstructured Web text, or a combination of the above. In our group, we have tried to push the state of the art in QA along multiple dimensions. The key driving criteria have been handling diversity in question formulations, handling complexity in information needs, and providing unsupervised, interpretable, and robust solutions that are not constrained to specific settings and benchmarks. Specific contributions include a method for automated template generation for QA, a continuous learning framework that extends these learnt templates with a layer of semantic similarity and user feedback, the use of predicted answer types to improve efficiency on complex questions, a reusable module that enables any QA system to answer temporal questions, and a graph-based method to answer complex questions by joining evidence from multiple documents on the fly. Current focus areas include bridging the structured and unstructured paradigms using powerful graph algorithms, and handling multi-turn question answering for information-seeking conversations using smart methods for context resolution. Detailed information on this research group can be found here.
Coordinator: Pauli Miettinen
Data mining, that is, extracting novel and interesting information about the data, is a fundamental part of modern data science. Our group's goal is to develop methods and algorithms for data mining that are based on well-founded algorithmic principles. We work on every step of this process: analysing the computational complexity and other properties of the problem, developing algorithms that work in practice based on the theoretical understanding, and applying these algorithms to real-world problems. Many interesting data sets contain binary (or higher-arity) relations, collections of sets of elements, or bipartite graphs. All these data sets can be expressed using binary matrices (or tensors). Currently our focus is mostly on studying decomposition methods for binary matrices and tensors at all three of the aforementioned levels: theory, algorithms, and applications. We also work on variations of these problems, such as dynamically updating the decompositions and automatically selecting the decomposition rank using the minimum description length principle.
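One common way to decompose a binary matrix is to approximate it by the Boolean product of two smaller binary matrices, where addition is replaced by logical OR. The sketch below only shows how such a product and its reconstruction error can be computed; it is not an algorithm for finding the factors, and the function names and the small example matrix are chosen purely for illustration.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(B: Matrix, C: Matrix) -> Matrix:
    """Boolean matrix product: entry (i, j) is 1 if B[i][l] and C[l][j] are both 1 for some l."""
    k = len(C)      # number of factors (the decomposition rank)
    m = len(C[0])   # number of columns in the result
    return [
        [int(any(B[i][l] and C[l][j] for l in range(k))) for j in range(m)]
        for i in range(len(B))
    ]


def reconstruction_error(A: Matrix, B: Matrix, C: Matrix) -> int:
    """Number of entries in which A differs from the Boolean product of B and C."""
    P = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, P) for a, p in zip(row_a, row_p))


# A 3x3 binary matrix that two Boolean factors reproduce exactly.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]
```

Here `reconstruction_error(A, B, C)` is 0: the middle row of `A` is the OR of the two rows of `C`, which an ordinary sum would instead turn into `[1, 2, 1]`.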
Coordinator: Daria Stepanova
Means for enriching data with semantics and methods for reasoning about it are at the heart of any intelligent system. Especially now, with the maturation of knowledge graphs and their popularity in applications, such as semantic Web search, the need for deductive and inductive reasoning services is strongly increasing. Our group deals with a variety of issues in this context. We work on new methods for knowledge representation and their applicability to Web data. We are concerned with handling incomplete and inconsistent information, data repair, diagnostic reasoning, and knowledge revision. Furthermore, we aim at combining learning and reasoning techniques for semantically-enhanced data processing in a variety of applications. In particular, we develop inductive learning approaches for deriving hidden insights from knowledge graphs in the form of logical rules.
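As a simple illustration of what evaluating a logical rule over a knowledge graph can look like, the sketch below scores one hand-written rule, livesIn(X, Z) <- marriedTo(X, Y), livesIn(Y, Z), on a toy set of triples. The rule, the entities, and the closed-world notion of confidence used here are assumptions for the example, not a description of the group's methods.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def rule_stats(triples: List[Triple]) -> Tuple[int, float, Set[Tuple[str, str]]]:
    """Score the rule livesIn(X, Z) <- marriedTo(X, Y), livesIn(Y, Z).

    Returns (support, confidence, new_predictions), where support counts
    predictions already present in the graph and confidence is support
    divided by the number of all predictions made by the rule body.
    """
    married = {(s, o) for s, p, o in triples if p == "marriedTo"}
    lives = {(s, o) for s, p, o in triples if p == "livesIn"}
    # Every grounding of the rule body predicts one livesIn(X, Z) fact.
    predictions = {(x, z) for (x, y) in married for (y2, z) in lives if y == y2}
    support = len(predictions & lives)
    confidence = support / len(predictions) if predictions else 0.0
    return support, confidence, predictions - lives


# Toy knowledge graph with hypothetical entities.
kg = [
    ("anna", "marriedTo", "ben"), ("ben", "livesIn", "Berlin"), ("anna", "livesIn", "Berlin"),
    ("carl", "marriedTo", "dana"), ("dana", "livesIn", "Paris"),
    ("eve", "marriedTo", "finn"), ("finn", "livesIn", "Rome"), ("eve", "livesIn", "Oslo"),
]
```

On this graph the rule makes three predictions, one of which is confirmed. The prediction for `carl` is a genuinely new candidate fact, while the one for `eve` conflicts with what the graph already states, which is exactly the kind of case where incompleteness and inconsistency have to be told apart.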
Coordinator: Jannik Strötgen
Despite the development of knowledge bases and their improvements in recent years, most human knowledge is still available only in unstructured format, in particular as natural language text. Thus, applications such as search engines and question answering systems benefit from knowledge extracted from texts. Our group aims at developing natural language processing tools for extracting valuable information from large document collections. In particular, we tackle the extraction and interpretation of temporal information, as time is an important dimension in any information space. For instance, we keep extending and improving the temporal tagger HeidelTime. We also work on semantically refined tasks: we develop a time-aware search engine, which allows users to formulate queries with temporal constraints on the documents' content, and we perform exploratory corpus analysis. The field of digital humanities offers further opportunities. Instead of studying interesting phenomena based on manually analyzed examples, applying natural language processing techniques helps to perform large-scale analyses on big corpora and can lead to new insights in the respective research areas, such as literary science.