Databases and Information Systems

Software Projects


AIDA is a method, implemented in an online tool, for disambiguating mentions of named entities that occur in natural-language text or Web tables.


AmbiverseNLU combines state-of-the-art components for language understanding tasks in a single, easy-to-use, and scalable suite.


ClausIE is an open information extractor; it identifies and extracts relations and their arguments in natural language text.


Focused (thematic) crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It involves the automatic classification of visited documents into a user- or community-specific topic hierarchy (ontology). The quality of the training data for the classifier is the most critical issue and potential bottleneck for the effectiveness and scale of a focused crawler.


BINGO! implements an approach to focused crawling that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic "archetypes" and uses them for periodically re-training the classifier; in this way, the crawler is dynamically adapted based on the most significant documents seen so far.
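The re-training loop can be sketched roughly as follows. This is a hypothetical illustration, not BINGO!'s actual components: the toy centroid classifier and the helper names stand in for the real classifier and archetype-selection logic.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidClassifier:
    """Toy stand-in for the crawler's classifier: one term-frequency
    centroid per topic, cosine similarity as classification confidence."""
    def train(self, docs_by_topic):
        self.centroids = {}
        for topic, docs in docs_by_topic.items():
            c = Counter()
            for d in docs:
                c.update(vectorize(d))
            self.centroids[topic] = c

    def classify(self, doc):
        v = vectorize(doc)
        return max(((t, cosine(v, c)) for t, c in self.centroids.items()),
                   key=lambda x: x[1])

def retrain_with_archetypes(clf, crawled_docs, per_topic=2):
    """Classify crawled documents, keep the most confidently classified
    ones per topic as "archetypes", and re-train the classifier on them."""
    by_topic = {}
    for doc in crawled_docs:
        topic, conf = clf.classify(doc)
        by_topic.setdefault(topic, []).append((conf, doc))
    archetypes = {t: [d for _, d in sorted(pairs, reverse=True)[:per_topic]]
                  for t, pairs in by_topic.items()}
    clf.train(archetypes)
    return archetypes
```

Running this loop periodically lets the training set grow beyond the initial seed documents, which is the core idea behind the dynamic adaptation described above.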


The INEX initiative for the evaluation of XML retrieval uses a collection of XMLified Wikipedia articles that has been contributed by the MPI-INF. Besides usual, article-style markup, the collection additionally provides semantic markup of articles and outgoing links, based on the semantic knowledge base YAGO, explicitly labeling more than 5,800 classes of entities like persons, movies, cities, and many more.


High availability of distributed data is an important prerequisite for the efficiency of enterprise-wide business processes, so-called workflows. These workflows consist of several work steps that access different, autonomously managed databases and other information services. The group is developing infrastructure software, so-called middleware, with the goal of coordinated and reliable execution of workflows in highly heterogeneous, distributed information systems. Workflows are specified, executed and monitored by means of state and activity charts. Further, administrative tools are developed on top of the workflow kernel, resulting in a flexible WFMS architecture. The distributed runtime environment being developed aims to provide fault tolerance based on transaction-oriented services as well as efficient access to the history and the current context of workflows.
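The state-and-activity-chart execution model can be illustrated with a minimal sketch. The class and names below are hypothetical, not part of the group's middleware: a workflow is a set of states with attached activities, driven through transitions while its history is recorded for monitoring.

```python
class Workflow:
    """Minimal state-chart sketch (hypothetical illustration): states with
    attached activities, transitions fired by events, and a recorded
    history of visited states for monitoring."""
    def __init__(self, start, transitions, activities):
        self.state = start
        self.transitions = transitions  # {(state, event): next_state}
        self.activities = activities    # {state: callable to run on entry}
        self.history = [start]          # visited states, for monitoring

    def fire(self, event):
        """Advance the workflow by one transition and run the entry
        activity of the new state, if any."""
        self.state = self.transitions[(self.state, event)]
        self.history.append(self.state)
        activity = self.activities.get(self.state)
        if activity:
            activity()
        return self.state
```

A real workflow kernel would additionally make each step transactional and persist the history, so that execution can resume after failures; this sketch only shows the control-flow skeleton.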


MG-FSM is a scalable, distributed (i.e., shared nothing) algorithm for frequent sequence mining (FSM) on MapReduce. The algorithm can handle so-called "gap constraints", which can be used to limit the output to a controlled set of frequent sequences.
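The effect of a gap constraint can be shown with a small single-machine sketch (a hypothetical brute-force enumeration; MG-FSM itself partitions and rewrites the input across MapReduce and scales far beyond this toy):

```python
from collections import Counter
from itertools import combinations

def subsequences_with_gaps(seq, max_len, max_gap):
    """Enumerate distinct subsequences of `seq` with at most `max_len`
    items, where consecutive picked items are separated by at most
    `max_gap` positions in the original sequence."""
    result = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), k):
            if all(b - a - 1 <= max_gap for a, b in zip(idx, idx[1:])):
                result.add(tuple(seq[i] for i in idx))
    return result

def frequent_sequences(db, min_support, max_len=2, max_gap=1):
    """Count the document support of gap-constrained subsequences and
    keep those occurring in at least `min_support` input sequences."""
    support = Counter()
    for seq in db:
        support.update(subsequences_with_gaps(seq, max_len, max_gap))
    return {s: c for s, c in support.items() if c >= min_support}
```

With a maximum gap of 1, for example, the sequence `a b c` supports the subsequence `(a, c)`, but `a x x c` does not; this is how the gap constraint limits the output to a controlled set of frequent sequences.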


The peer-to-peer (P2P) approach, which has become popular in the context of file-sharing systems such as Gnutella or KaZaA, allows handling huge amounts of data in a distributed and self-organizing way. In such a system, all peers are equal and all of the functionality is shared among all peers so that there is no single point of failure and the load is evenly balanced across a large number of peers. These characteristics offer enormous potential for search capabilities that are powerful in terms of scalability, efficiency, and resilience to failures and dynamics. Additionally, such a search engine can potentially benefit from the intellectual input (e.g., bookmarks, query logs, etc.) of a large user community.


RDF-3X is an RDF storage and retrieval system that achieves excellent performance by following a RISC-style design philosophy.


TopX is a search engine for ranked retrieval of XML (and plain-text) data, developed at the Max Planck Institute for Informatics. TopX supports a probabilistic-IR scoring model for full-text content conditions and tag-term combinations, path conditions for all XPath axes as exact or relaxable constraints, and ontology-based relaxation of terms and tag names as similarity conditions for ranked retrieval. To speed up top-k queries, various techniques are employed: probabilistic models as efficient score predictors for a variant of the threshold algorithm, judicious scheduling of sequential accesses for scanning index lists and random accesses to compute full scores, incremental merging of index lists for on-demand, self-tuning query expansion, and a suite of specifically designed, precomputed indexes to evaluate structural path conditions.
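The threshold-algorithm family that TopX builds on can be illustrated with Fagin's classic TA over score-sorted index lists. This is a minimal single-machine sketch of the basic algorithm, not TopX's probabilistic variant:

```python
import heapq

def threshold_algorithm(lists, k):
    """Fagin's Threshold Algorithm: each entry of `lists` is a list of
    (doc_id, score) pairs sorted by descending score; a document's
    overall score is the sum of its per-list scores."""
    lookup = [dict(lst) for lst in lists]  # random-access tables
    seen = {}
    top = []  # min-heap of (score, doc_id) holding the current top-k
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        threshold = 0.0  # sum of the scores last seen under sorted access
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                threshold += score
                if doc not in seen:
                    # random accesses into the other lists for the full score
                    seen[doc] = sum(l.get(doc, 0.0) for l in lookup)
                    heapq.heappush(top, (seen[doc], doc))
                    if len(top) > k:
                        heapq.heappop(top)
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen document can beat the current k-th best
    return sorted(top, reverse=True)
```

TopX's variant replaces the exact stopping test with probabilistic score predictors and schedules sequential and random accesses judiciously, but the early-termination principle is the same: stop scanning once the k-th best aggregated score provably (or probably) exceeds what any unseen document could achieve.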


TPDBlearn is a Temporal Probabilistic DataBase system which supports learning of tuple probabilities.


TriAD is a distributed engine for processing various query models such as set reachability, basic graph patterns (BGP), and generalized graph patterns (GGP).
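As a point of reference for the query models, set reachability asks which vertices can be reached from a given set of source vertices. A minimal single-machine sketch (a hypothetical illustration; TriAD's actual engine is distributed and far more sophisticated):

```python
from collections import deque

def reachable(edges, sources):
    """Return all vertices reachable from any vertex in `sources` by
    breadth-first search over the directed edge list `edges`."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```

Basic graph patterns generalize this from a single traversal to matching a whole pattern of triples against the graph, which is where a distributed join engine like TriAD pays off.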


YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. YAGO knows almost 10 million entities (e.g. persons, organizations, cities), and 120 million facts about these entities. Unlike other automatically assembled knowledge bases, YAGO has a manually confirmed accuracy of 95%. YAGO is freely available at yago-knowledge.org.