Google Focused Research Award

Robust and Scalable Fact Discovery from Web Sources

Google selected Rainer Gemulla, Martin Theobald, and Gerhard Weikum as winners of a 2011 Google Focused Research Award and is generously supporting our research on knowledge harvesting.

Goals and Approach

Knowledge bases with entity-relationship-oriented facts are valuable assets for making sense of Internet content and for supporting applications like semantic search or text disambiguation. Projects on automatically building such knowledge bases from high-quality Web sources have successfully applied two different paradigms: targeted information extraction with domain-model seeds for high-precision output, and explorative information extraction in an unsupervised manner with high recall but lower precision. Neither paradigm has addressed the emerging need to maintain a knowledge base with evolving content and to support the entire life cycle of knowledge management.

This project aims to reconcile the two information-extraction paradigms, combining their strengths and overcoming their limitations. Targeted extraction should become capable of discovering new relation types, and explorative extraction should be strengthened by expressive consistency reasoning. The combined form of "universal" extraction should be scalable and robust. The project will place particular emphasis on tapping into the long tail of entities and their relationships, and on coping with the dynamic evolution of factual knowledge.

People

Gemulla, Rainer
Weikum, Gerhard
Miliaraki, Iris
Teflioudi, Christina

Publications

Johannes Hoffart, Yasemin Altun, Gerhard Weikum:
Discovering emerging entities with ambiguous names.
WWW 2014: 385-396
Erdal Kuzey, Gerhard Weikum:
EVIN: building a knowledge base of events.
WWW (Companion Volume) 2014: 103-106
Niket Tandon, Gerard de Melo, Fabian M. Suchanek, Gerhard Weikum:
WebChild: harvesting and organizing commonsense knowledge from the web.
WSDM 2014: 523-532
Maximilian Dylla, Iris Miliaraki, Martin Theobald:
A Temporal-Probabilistic Database Model for Information Extraction.
Proceedings of the VLDB Endowment, Volume 6, Issue 14, 2013, presented at VLDB 2014
Maximilian Dylla, Iris Miliaraki, Martin Theobald:
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views.
ICDE 2013: 122-133
Ndapandula Nakashole, Tomasz Tylenda, Gerhard Weikum:
Fine-grained Semantic Typing of Emerging Entities.
ACL (1) 2013: 1488-1497
Luciano Del Corro, Rainer Gemulla:
ClausIE: Clause-Based Open Information Extraction.
WWW 2013: 355-366. Rio de Janeiro, Brazil.
Yafang Wang, Maximilian Dylla, Marc Spaniol, Gerhard Weikum:
Coupling Label Propagation and Constraints for Temporal Fact Extraction.
ACL (2) 2012: 233-237
Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, Gerhard Weikum:
KORE: Keyphrase Overlap Relatedness for Entity Disambiguation.
21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Maui, USA.
Ndapandula Nakashole, Gerhard Weikum, Fabian Suchanek:
PATTY: A Taxonomy of Relational Patterns with Semantic Types.
International Conference on Empirical Methods in Natural Language Processing (EMNLP 2012), Jeju, Korea.
(3rd place best-paper award) See PATTY webpage
Ndapandula Nakashole, Gerhard Weikum, Fabian Suchanek:
Discovering and Exploring Relations on the Web.
Demo at the 38th International Conference on Very Large Data Bases (VLDB 2012), Istanbul, Turkey.
Rainer Gemulla, Erik Nijkamp, Peter J. Haas, Yannis Sismanis, Christina Teflioudi, Faraz Makari:
Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.
NIPS workshop: Big Learning - Algorithms, Systems, and Tools for Learning at Scale, Biglearn 2011, Granada, Spain.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum:
Scalable Knowledge Harvesting with High Precision and High Recall.
4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Hong Kong.
Rainer Gemulla, Erik Nijkamp, Peter J. Haas, Yannis Sismanis:
Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.
17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2011, San Diego, CA.
Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, Gerhard Weikum:
Robust Disambiguation of Named Entities in Text.
Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, 2011.
Ground-truth data with mappings to Yago2 and Freebase, enhancing the CoNLL 2003 corpus for entity recognition, is publicly available (joint work with Massimiliano Ciaramita from Google Zurich); see the AIDA framework.