# Research Reports

## 2017
[1]
D. Gupta, K. Berberich, J. Strötgen, and D. Zeinalipour-Yazti, “Generating Semantic Aspects for Queries,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2017-5-001, 2017.
Abstract
Ambiguous information needs expressed in a limited number of keywords often result in long-winded query sessions and many query reformulations. In this work, we tackle ambiguous queries by providing automatically generated semantic aspects that can guide users to satisfying results regarding their information needs. To generate semantic aspects, we use semantic annotations available in the documents and leverage models representing the semantic relationships between annotations of the same type. The aspects in turn provide us a foundation for representing text in a completely structured manner, thereby allowing for a semantically-motivated organization of search results. We evaluate our approach on a testbed of over 5,000 aspects on Web-scale document collections amounting to more than 450 million documents, with temporal, geographic, and named entity annotations as example dimensions. Our experimental results show that our general approach is Web-scale ready and finds relevant aspects for highly ambiguous queries.
BibTeX
@techreport{Guptareport2017, TITLE = {Generating Semantic Aspects for Queries}, AUTHOR = {Gupta, Dhruv and Berberich, Klaus and Str{\"o}tgen, Jannik and Zeinalipour-Yazti, Demetrios}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2017-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2017}, ABSTRACT = {Ambiguous information needs expressed in a limited number of keywords often result in long-winded query sessions and many query reformulations. In this work, we tackle ambiguous queries by providing automatically generated semantic aspects that can guide users to satisfying results regarding their information needs. To generate semantic aspects, we use semantic annotations available in the documents and leverage models representing the semantic relationships between annotations of the same type. The aspects in turn provide us a foundation for representing text in a completely structured manner, thereby allowing for a semantically-motivated organization of search results. We evaluate our approach on a testbed of over 5,000 aspects on Web-scale document collections amounting to more than 450 million documents, with temporal, geographic, and named entity annotations as example dimensions. Our experimental results show that our general approach is Web-scale ready and finds relevant aspects for highly ambiguous queries.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Gupta, Dhruv %A Berberich, Klaus %A Str&#246;tgen, Jannik %A Zeinalipour-Yazti, Demetrios %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Generating Semantic Aspects for Queries : %G eng %U http://hdl.handle.net/11858/00-001M-0000-002E-07DD-0 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2017 %P 39 p. %X Ambiguous information needs expressed in a limited number of keywords often result in long-winded query sessions and many query reformulations. In this work, we tackle ambiguous queries by providing automatically generated semantic aspects that can guide users to satisfying results regarding their information needs. To generate semantic aspects, we use semantic annotations available in the documents and leverage models representing the semantic relationships between annotations of the same type. The aspects in turn provide us a foundation for representing text in a completely structured manner, thereby allowing for a semantically-motivated organization of search results. We evaluate our approach on a testbed of over 5,000 aspects on Web-scale document collections amounting to more than 450 million documents, with temporal, geographic, and named entity annotations as example dimensions. Our experimental results show that our general approach is Web-scale ready and finds relevant aspects for highly ambiguous queries. %B Research Report %@ false
## 2016
[2]
D. Gupta and K. Berberich, “Diversifying Search Results Using Time,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2016-5-001, 2016.
Abstract
Getting an overview of a historic entity or event can be difficult in search results, especially if important dates concerning the entity or event are not known beforehand. For such information needs, users would benefit if returned results covered diverse dates, thus giving an overview of what has happened throughout history. Diversifying search results based on important dates can be a building block for applications, for instance, in digital humanities. Historians would thus be able to quickly explore longitudinal document collections by querying for entities or events without knowing associated important dates apriori. In this work, we describe an approach to diversify search results using temporal expressions (e.g., in the 1990s) from their contents. Our approach first identifies time intervals of interest to the given keyword query based on pseudo-relevant documents. It then re-ranks query results so as to maximize the coverage of identified time intervals. We present a novel and objective evaluation for our proposed approach. We test the effectiveness of our methods on the New York Times Annotated corpus and the Living Knowledge corpus, collectively consisting of around 6 million documents. Using history-oriented queries and encyclopedic resources we show that our method indeed is able to present search results diversified along time.
BibTeX
@techreport{GuptaReport2016-5-001, TITLE = {Diversifying Search Results Using Time}, AUTHOR = {Gupta, Dhruv and Berberich, Klaus}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2016-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2016}, ABSTRACT = {Getting an overview of a historic entity or event can be difficult in search results, especially if important dates concerning the entity or event are not known beforehand. For such information needs, users would benefit if returned results covered diverse dates, thus giving an overview of what has happened throughout history. Diversifying search results based on important dates can be a building block for applications, for instance, in digital humanities. Historians would thus be able to quickly explore longitudinal document collections by querying for entities or events without knowing associated important dates apriori. In this work, we describe an approach to diversify search results using temporal expressions (e.g., in the 1990s) from their contents. Our approach first identifies time intervals of interest to the given keyword query based on pseudo-relevant documents. It then re-ranks query results so as to maximize the coverage of identified time intervals. We present a novel and objective evaluation for our proposed approach. We test the effectiveness of our methods on the New York Times Annotated corpus and the Living Knowledge corpus, collectively consisting of around 6 million documents. Using history-oriented queries and encyclopedic resources we show that our method indeed is able to present search results diversified along time.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Gupta, Dhruv %A Berberich, Klaus %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Diversifying Search Results Using Time : %G eng %U http://hdl.handle.net/11858/00-001M-0000-002A-0AA4-C %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2016 %P 51 p. %X Getting an overview of a historic entity or event can be difficult in search results, especially if important dates concerning the entity or event are not known beforehand. For such information needs, users would benefit if returned results covered diverse dates, thus giving an overview of what has happened throughout history. Diversifying search results based on important dates can be a building block for applications, for instance, in digital humanities. Historians would thus be able to quickly explore longitudinal document collections by querying for entities or events without knowing associated important dates apriori. In this work, we describe an approach to diversify search results using temporal expressions (e.g., in the 1990s) from their contents. Our approach first identifies time intervals of interest to the given keyword query based on pseudo-relevant documents. It then re-ranks query results so as to maximize the coverage of identified time intervals. We present a novel and objective evaluation for our proposed approach. We test the effectiveness of our methods on the New York Times Annotated corpus and the Living Knowledge corpus, collectively consisting of around 6 million documents. Using history-oriented queries and encyclopedic resources we show that our method indeed is able to present search results diversified along time. %B Research Report %@ false
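The re-ranking step described in [2], maximizing coverage of time intervals identified from pseudo-relevant documents, can be sketched as a toy greedy max-coverage heuristic. Everything below (the document structure, the scores, the `alpha` relevance/coverage trade-off) is an illustrative assumption, not the report's actual model:

```python
def diversify_by_time(docs, intervals, k, alpha=0.5):
    """Greedily pick k docs, trading off relevance against coverage
    of not-yet-covered time intervals (a max-coverage heuristic)."""
    covered = set()
    ranking = []
    remaining = list(docs)
    while remaining and len(ranking) < k:
        def gain(d):
            # intervals this doc would newly cover
            new = {i for i in d["intervals"] if i in intervals and i not in covered}
            return alpha * d["score"] + (1 - alpha) * len(new)
        best = max(remaining, key=gain)
        remaining.remove(best)
        covered |= set(best["intervals"]) & set(intervals)
        ranking.append(best["id"])
    return ranking

docs = [
    {"id": "d1", "score": 0.9, "intervals": ["1990s"]},
    {"id": "d2", "score": 0.8, "intervals": ["1990s"]},
    {"id": "d3", "score": 0.5, "intervals": ["1960s", "1970s"]},
]
print(diversify_by_time(docs, {"1960s", "1970s", "1990s"}, k=2))
```

With these toy numbers, the lower-scored d3 is promoted because it alone covers two of the intervals of interest.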
[3]
A. Mishra and K. Berberich, “Leveraging Semantic Annotations to Link Wikipedia and News Archives,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2016-5-002, 2016.
Abstract
The incomprehensible amount of information available online has made it difficult to retrospect on past events. We propose a novel linking problem to connect excerpts from Wikipedia summarizing events to online news articles elaborating on them. To address the linking problem, we cast it into an information retrieval task by treating a given excerpt as a user query with the goal to retrieve a ranked list of relevant news articles. We find that Wikipedia excerpts often come with additional semantics, in their textual descriptions, representing the time, geolocations, and named entities involved in the event. Our retrieval model leverages text and semantic annotations as different dimensions of an event by estimating independent query models to rank documents. In our experiments on two datasets, we compare methods that consider different combinations of dimensions and find that the approach that leverages all dimensions suits our problem best.
BibTeX
@techreport{MishraBerberich16, TITLE = {Leveraging Semantic Annotations to Link Wikipedia and News Archives}, AUTHOR = {Mishra, Arunav and Berberich, Klaus}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2016-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2016}, ABSTRACT = {The incomprehensible amount of information available online has made it difficult to retrospect on past events. We propose a novel linking problem to connect excerpts from Wikipedia summarizing events to online news articles elaborating on them. To address the linking problem, we cast it into an information retrieval task by treating a given excerpt as a user query with the goal to retrieve a ranked list of relevant news articles. We find that Wikipedia excerpts often come with additional semantics, in their textual descriptions, representing the time, geolocations, and named entities involved in the event. Our retrieval model leverages text and semantic annotations as different dimensions of an event by estimating independent query models to rank documents. In our experiments on two datasets, we compare methods that consider different combinations of dimensions and find that the approach that leverages all dimensions suits our problem best.}, TYPE = {Research Reports}, }
Endnote
%0 Report %A Mishra, Arunav %A Berberich, Klaus %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Leveraging Semantic Annotations to Link Wikipedia and News Archives : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0029-5FF0-A %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2016 %P 21 p. %X The incomprehensible amount of information available online has made it difficult to retrospect on past events. We propose a novel linking problem to connect excerpts from Wikipedia summarizing events to online news articles elaborating on them. To address the linking problem, we cast it into an information retrieval task by treating a given excerpt as a user query with the goal to retrieve a ranked list of relevant news articles. We find that Wikipedia excerpts often come with additional semantics, in their textual descriptions, representing the time, geolocations, and named entities involved in the event. Our retrieval model leverages text and semantic annotations as different dimensions of an event by estimating independent query models to rank documents. In our experiments on two datasets, we compare methods that consider different combinations of dimensions and find that the approach that leverages all dimensions suits our problem best. %B Research Reports %@ false
## 2014
[4]
A. Anand, I. Mele, S. Bedathur, and K. Berberich, “Phrase Query Optimization on Inverted Indexes,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2014-5-002, 2014.
Abstract
Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm to its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.
BibTeX
@techreport{AnandMeleBedathurBerberich2014, TITLE = {Phrase Query Optimization on Inverted Indexes}, AUTHOR = {Anand, Avishek and Mele, Ida and Bedathur, Srikanta and Berberich, Klaus}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2014-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2014}, ABSTRACT = {Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm to its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Anand, Avishek %A Mele, Ida %A Bedathur, Srikanta %A Berberich, Klaus %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Phrase Query Optimization on Inverted Indexes : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0024-022A-3 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2014 %P 20 p. %X Phrase queries are a key functionality of modern search engines. Beyond that, they increasingly serve as an important building block for applications such as entity-oriented search, text analytics, and plagiarism detection. Processing phrase queries is costly, though, since positional information has to be kept in the index and all words, including stopwords, need to be considered. We consider an augmented inverted index that indexes selected variable-length multi-word sequences in addition to single words. We study how arbitrary phrase queries can be processed efficiently on such an augmented inverted index. We show that the underlying optimization problem is NP-hard in the general case and describe an exact exponential algorithm and an approximation algorithm to its solution. Experiments on ClueWeb09 and The New York Times with different real-world query workloads examine the practical performance of our methods. %B Research Report %@ false
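For the simple special case of contiguous, non-overlapping segmentations with additive cost, the optimization problem of [4] can be illustrated with a short dynamic program over a query's word positions. The report shows the general problem is NP-hard, so this is only an assumed toy setting; the cost values standing in for posting-list lengths are invented:

```python
def best_segmentation(query, cost):
    """Min-cost split of a phrase query into indexed word sequences.
    cost maps an indexed sequence (tuple of words) to an assumed
    posting-list length; dp[i] = cheapest way to cover words[:i]."""
    words = query.split()
    n = len(words)
    INF = float("inf")
    dp = [0.0] + [INF] * n
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            seq = tuple(words[j:i])
            if seq in cost and dp[j] + cost[seq] < dp[i]:
                dp[i] = dp[j] + cost[seq]
                back[i] = j
    # reconstruct the chosen segments right-to-left
    segs, i = [], n
    while i > 0:
        j = back[i]
        segs.append(" ".join(words[j:i]))
        i = j
    return list(reversed(segs)), dp[n]

cost = {
    ("to",): 900, ("be",): 800, ("or",): 700, ("not",): 600,
    ("to", "be"): 50, ("or", "not"): 40, ("not", "to", "be"): 30,
}
print(best_segmentation("to be or not to be", cost))
```

Here the indexed multi-word sequences are so much cheaper than the stopword posting lists that the optimal cover uses them exclusively, which is exactly the effect an augmented index is meant to exploit.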
[5]
M. Dylla and M. Theobald, “Learning Tuple Probabilities in Probabilistic Databases,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2014-5-001, 2014.
Abstract
Learning the parameters of complex probabilistic-relational models from labeled training data is a standard technique in machine learning, which has been intensively studied in the subfield of Statistical Relational Learning (SRL), but, so far, this is still an under-investigated topic in the context of Probabilistic Databases (PDBs). In this paper, we focus on learning the probability values of base tuples in a PDB from query answers, the latter of which are represented as labeled lineage formulas. Specifically, we consider labels in the form of pairs, each consisting of a Boolean lineage formula and a marginal probability that comes attached to the corresponding query answer. The resulting learning problem can be viewed as the inverse problem to confidence computations in PDBs: given a set of labeled query answers, learn the probability values of the base tuples, such that the marginal probabilities of the query answers again yield the assigned probability labels. We analyze the learning problem from a theoretical perspective, devise two optimization-based objectives, and provide an efficient algorithm (based on Stochastic Gradient Descent) for solving these objectives. Finally, we conclude this work by an experimental evaluation on three real-world and one synthetic dataset, while competing with various techniques from SRL, reasoning in information extraction, and optimization.
BibTeX
@techreport{Dylla-Learning2014, TITLE = {Learning Tuple Probabilities in Probabilistic Databases}, AUTHOR = {Dylla, Maximilian and Theobald, Martin}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2014-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2014}, ABSTRACT = {Learning the parameters of complex probabilistic-relational models from labeled training data is a standard technique in machine learning, which has been intensively studied in the subfield of Statistical Relational Learning (SRL), but---so far---this is still an under-investigated topic in the context of Probabilistic Databases (PDBs). In this paper, we focus on learning the probability values of base tuples in a PDB from query answers, the latter of which are represented as labeled lineage formulas. Specifically, we consider labels in the form of pairs, each consisting of a Boolean lineage formula and a marginal probability that comes attached to the corresponding query answer. The resulting learning problem can be viewed as the inverse problem to confidence computations in PDBs: given a set of labeled query answers, learn the probability values of the base tuples, such that the marginal probabilities of the query answers again yield the assigned probability labels. We analyze the learning problem from a theoretical perspective, devise two optimization-based objectives, and provide an efficient algorithm (based on Stochastic Gradient Descent) for solving these objectives. Finally, we conclude this work by an experimental evaluation on three real-world and one synthetic dataset, while competing with various techniques from SRL, reasoning in information extraction, and optimization.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Dylla, Maximilian %A Theobald, Martin %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Learning Tuple Probabilities in Probabilistic Databases : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0019-8492-6 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2014 %P 51 p. %X Learning the parameters of complex probabilistic-relational models from labeled training data is a standard technique in machine learning, which has been intensively studied in the subfield of Statistical Relational Learning (SRL), but---so far---this is still an under-investigated topic in the context of Probabilistic Databases (PDBs). In this paper, we focus on learning the probability values of base tuples in a PDB from query answers, the latter of which are represented as labeled lineage formulas. Specifically, we consider labels in the form of pairs, each consisting of a Boolean lineage formula and a marginal probability that comes attached to the corresponding query answer. The resulting learning problem can be viewed as the inverse problem to confidence computations in PDBs: given a set of labeled query answers, learn the probability values of the base tuples, such that the marginal probabilities of the query answers again yield the assigned probability labels. We analyze the learning problem from a theoretical perspective, devise two optimization-based objectives, and provide an efficient algorithm (based on Stochastic Gradient Descent) for solving these objectives. Finally, we conclude this work by an experimental evaluation on three real-world and one synthetic dataset, while competing with various techniques from SRL, reasoning in information extraction, and optimization. %B Research Report %@ false
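A minimal sketch of the inverse problem in [5], assuming just two independent base tuples t1, t2 and two labeled lineage formulas (their conjunction and their disjunction). Plain gradient descent on a squared loss stands in for the report's stochastic variant, and all probability labels are invented:

```python
import random

def learn_tuple_probs(labels, steps=20000, lr=0.05, seed=0):
    """Fit base-tuple probabilities p1, p2 so that the marginals of the
    labeled lineage formulas match their probability labels."""
    rng = random.Random(seed)
    p1, p2 = rng.random(), rng.random()
    for _ in range(steps):
        and_m = p1 * p2               # P(t1 AND t2) for independent tuples
        or_m = p1 + p2 - p1 * p2      # P(t1 OR t2)
        e_and = and_m - labels["and"]
        e_or = or_m - labels["or"]
        # gradients of the squared loss w.r.t. p1 and p2
        g1 = 2 * e_and * p2 + 2 * e_or * (1 - p2)
        g2 = 2 * e_and * p1 + 2 * e_or * (1 - p1)
        p1 = min(1.0, max(0.0, p1 - lr * g1))   # keep probabilities in [0, 1]
        p2 = min(1.0, max(0.0, p2 - lr * g2))
    return p1, p2

p1, p2 = learn_tuple_probs({"and": 0.3, "or": 0.8})
print(round(p1 * p2, 3), round(p1 + p2 - p1 * p2, 3))
```

After training, the recomputed marginals of both lineage formulas should be close to the assigned labels, which is exactly the fitting criterion the report describes.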
## 2013
[6]
F. Makari, B. Awerbuch, R. Gemulla, R. Khandekar, J. Mestre, and M. Sozio, “A Distributed Algorithm for Large-scale Generalized Matching,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2013-5-002, 2013.
Abstract
Generalized matching problems arise in a number of applications, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. State-of-the-art matching algorithms fail at coping with large real-world instances, which may involve millions of users and items. We propose the first distributed algorithm for computing near-optimal solutions to large-scale generalized matching problems like the one above. Our algorithm is designed to run on a small cluster of commodity nodes (or in a MapReduce environment), has strong approximation guarantees, and requires only a poly-logarithmic number of passes over the input. In particular, we propose a novel distributed algorithm to approximately solve mixed packing-covering linear programs, which include but are not limited to generalized matching problems. Experiments on real-world and synthetic data suggest that our algorithm scales to very large problem sizes and can be orders of magnitude faster than alternative approaches.
BibTeX
@techreport{MakariAwerbuchGemullaKhandekarMestreSozio2013, TITLE = {A Distributed Algorithm for Large-scale Generalized Matching}, AUTHOR = {Makari, Faraz and Awerbuch, Baruch and Gemulla, Rainer and Khandekar, Rohit and Mestre, Julian and Sozio, Mauro}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2013-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2013}, ABSTRACT = {Generalized matching problems arise in a number of applications, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. State-of-the-art matching algorithms fail at coping with large real-world instances, which may involve millions of users and items. We propose the first distributed algorithm for computing near-optimal solutions to large-scale generalized matching problems like the one above. Our algorithm is designed to run on a small cluster of commodity nodes (or in a MapReduce environment), has strong approximation guarantees, and requires only a poly-logarithmic number of passes over the input. In particular, we propose a novel distributed algorithm to approximately solve mixed packing-covering linear programs, which include but are not limited to generalized matching problems. Experiments on real-world and synthetic data suggest that our algorithm scales to very large problem sizes and can be orders of magnitude faster than alternative approaches.}, TYPE = {Research Reports}, }
Endnote
%0 Report %A Makari, Faraz %A Awerbuch, Baruch %A Gemulla, Rainer %A Khandekar, Rohit %A Mestre, Julian %A Sozio, Mauro %+ Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations Databases and Information Systems, MPI for Informatics, Max Planck Society Algorithms and Complexity, MPI for Informatics, Max Planck Society Algorithms and Complexity, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T A Distributed Algorithm for Large-scale Generalized Matching : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0024-03B4-3 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2013 %P 39 p. %X Generalized matching problems arise in a number of applications, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. State-of-the-art matching algorithms fail at coping with large real-world instances, which may involve millions of users and items. We propose the first distributed algorithm for computing near-optimal solutions to large-scale generalized matching problems like the one above. Our algorithm is designed to run on a small cluster of commodity nodes (or in a MapReduce environment), has strong approximation guarantees, and requires only a poly-logarithmic number of passes over the input. In particular, we propose a novel distributed algorithm to approximately solve mixed packing-covering linear programs, which include but are not limited to generalized matching problems. Experiments on real-world and synthetic data suggest that our algorithm scales to very large problem sizes and can be orders of magnitude faster than alternative approaches. %B Research Reports %@ false
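To make the constraints of the recommendation example in [6] concrete, here is a toy greedy assignment that respects per-user recommendation caps and per-item stock. This greedy stand-in has none of the report's approximation guarantees or distributed execution; the weights and capacities are invented:

```python
def greedy_generalized_matching(weights, user_cap, item_stock):
    """Pick user-item pairs by descending weight while respecting each
    user's recommendation cap and each item's remaining stock."""
    pairs = sorted(weights, key=lambda p: -weights[p])
    used = {u: 0 for u, _ in pairs}
    stock = dict(item_stock)
    chosen = []
    for (u, i) in pairs:
        if used[u] < user_cap[u] and stock[i] > 0:
            chosen.append((u, i))
            used[u] += 1
            stock[i] -= 1
    return chosen

weights = {("u1", "dvd1"): 0.9, ("u1", "dvd2"): 0.7,
           ("u2", "dvd1"): 0.8, ("u2", "dvd2"): 0.4}
chosen = greedy_generalized_matching(
    weights, user_cap={"u1": 1, "u2": 1}, item_stock={"dvd1": 1, "dvd2": 2})
print(chosen)
```

Note how u2 is diverted to dvd2 once dvd1's stock is exhausted, even though (u2, dvd1) has the higher weight; this interaction between caps and stock is what makes the problem a generalized matching rather than a per-user top-k.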
## 2012
[7]
F. Alvanaki, S. Michel, and A. Stupar, “Building and Maintaining Halls of Fame Over a Database,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2012-5-004, 2012.
Abstract
Halls of Fame are fascinating constructs. They represent the elite of an often very large amount of entities: persons, companies, products, countries etc. Beyond their practical use as static rankings, changes to them are particularly interesting: for decision making processes, as input to common media or novel narrative science applications, or simply consumed by users. In this work, we aim at detecting events that can be characterized by changes to a Hall of Fame ranking in an automated way. We describe how the schema and data of a database can be used to generate Halls of Fame. In this database scenario, by Hall of Fame we refer to distinguished tuples; entities, whose characteristics set them apart from the majority. We define every Hall of Fame as one specific instance of an SQL query, such that a change in its result is considered a noteworthy event. Identified changes (i.e., events) are ranked using lexicographic tradeoffs over event and query properties and presented to users or fed into higher-level applications. We have implemented a full-fledged prototype system that uses either database triggers or a Java based middleware for event identification. We report on an experimental evaluation using a real-world dataset of basketball statistics.
BibTeX
@techreport{AlvanakiMichelStupar2012, TITLE = {Building and Maintaining Halls of Fame Over a Database}, AUTHOR = {Alvanaki, Foteini and Michel, Sebastian and Stupar, Aleksandar}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2012-5-004}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2012}, ABSTRACT = {Halls of Fame are fascinating constructs. They represent the elite of an often very large amount of entities: persons, companies, products, countries etc. Beyond their practical use as static rankings, changes to them are particularly interesting: for decision making processes, as input to common media or novel narrative science applications, or simply consumed by users. In this work, we aim at detecting events that can be characterized by changes to a Hall of Fame ranking in an automated way. We describe how the schema and data of a database can be used to generate Halls of Fame. In this database scenario, by Hall of Fame we refer to distinguished tuples; entities, whose characteristics set them apart from the majority. We define every Hall of Fame as one specific instance of an SQL query, such that a change in its result is considered a noteworthy event. Identified changes (i.e., events) are ranked using lexicographic tradeoffs over event and query properties and presented to users or fed into higher-level applications. We have implemented a full-fledged prototype system that uses either database triggers or a Java based middleware for event identification. We report on an experimental evaluation using a real-world dataset of basketball statistics.}, TYPE = {Research Reports}, }
Endnote
%0 Report %A Alvanaki, Foteini %A Michel, Sebastian %A Stupar, Aleksandar %+ Cluster of Excellence Multimodal Computing and Interaction Databases and Information Systems, MPI for Informatics, Max Planck Society Cluster of Excellence Multimodal Computing and Interaction %T Building and Maintaining Halls of Fame Over a Database : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0024-03E9-D %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2012 %X Halls of Fame are fascinating constructs. They represent the elite of an often very large amount of entities: persons, companies, products, countries etc. Beyond their practical use as static rankings, changes to them are particularly interesting: for decision making processes, as input to common media or novel narrative science applications, or simply consumed by users. In this work, we aim at detecting events that can be characterized by changes to a Hall of Fame ranking in an automated way. We describe how the schema and data of a database can be used to generate Halls of Fame. In this database scenario, by Hall of Fame we refer to distinguished tuples; entities, whose characteristics set them apart from the majority. We define every Hall of Fame as one specific instance of an SQL query, such that a change in its result is considered a noteworthy event. Identified changes (i.e., events) are ranked using lexicographic tradeoffs over event and query properties and presented to users or fed into higher-level applications. We have implemented a full-fledged prototype system that uses either database triggers or a Java based middleware for event identification. We report on an experimental evaluation using a real-world dataset of basketball statistics. %B Research Reports %@ false
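The event-detection loop of [7] can be sketched in a few lines: materialize one Hall of Fame instance (here a plain top-k over in-memory rows instead of the report's SQL query), re-evaluate it when the data changes, and report entries and exits as events. The data and field names are invented:

```python
def hall_of_fame(rows, key, k=3):
    """Top-k entities by a statistic -- one 'Hall of Fame' instance,
    standing in for the SQL query of the report."""
    return [r["name"] for r in sorted(rows, key=lambda r: -r[key])[:k]]

def detect_events(old, new):
    """Any change to the ranking's membership is reported as an event."""
    events = []
    for name in new:
        if name not in old:
            events.append(("entered", name))
    for name in old:
        if name not in new:
            events.append(("left", name))
    return events

season1 = [
    {"name": "A", "points": 30}, {"name": "B", "points": 28},
    {"name": "C", "points": 25}, {"name": "D", "points": 20},
]
season2 = [
    {"name": "A", "points": 31}, {"name": "B", "points": 27},
    {"name": "C", "points": 24}, {"name": "D", "points": 29},
]
old = hall_of_fame(season1, "points")   # ['A', 'B', 'C']
new = hall_of_fame(season2, "points")   # ['A', 'D', 'B']
print(detect_events(old, new))          # [('entered', 'D'), ('left', 'C')]
```

In the report this diffing is triggered by database triggers or middleware on updates; ranking the detected events (here left out) uses lexicographic tradeoffs over event and query properties.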
[8]
K. Berberich and S. Bedathur, “Computing n-Gram Statistics in MapReduce,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2012-5-003, 2012.
Export
BibTeX
@techreport{BerberichBedathur2012, TITLE = {Computing n--Gram Statistics in {MapReduce}}, AUTHOR = {Berberich, Klaus and Bedathur, Srikanta}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2012-5-003}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2012}, TYPE = {Research Report}, }
Endnote
%0 Report %A Berberich, Klaus %A Bedathur, Srikanta %+ Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T Computing n-Gram Statistics in MapReduce : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0024-0416-A %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2012 %P 39 p. %B Research Report %@ false
[9]
M. Dylla, I. Miliaraki, and M. Theobald, “Top-k Query Processing in Probabilistic Databases with Non-materialized Views,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2012-5-002, 2012.
Export
BibTeX
@techreport{DyllaTopk2012, TITLE = {Top-k Query Processing in Probabilistic Databases with Non-materialized Views}, AUTHOR = {Dylla, Maximilian and Miliaraki, Iris and Theobald, Martin}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2012-5-002}, LOCALID = {Local-ID: 62EC1C9C96B8EFF4C1257B560029F18C-DyllaTopk2012}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2012}, DATE = {2012}, TYPE = {Research Report}, }
Endnote
%0 Report %A Dylla, Maximilian %A Miliaraki, Iris %A Theobald, Martin %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Top-k Query Processing in Probabilistic Databases with Non-materialized Views : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-B02F-2 %F OTHER: Local-ID: 62EC1C9C96B8EFF4C1257B560029F18C-DyllaTopk2012 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2012 %B Research Report %@ false
[10]
P. Miettinen and J. Vreeken, “MDL4BMF: Minimum Description Length for Boolean Matrix Factorization,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2012-5-001, 2012.
Abstract
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
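The model-order selection idea in the abstract can be sketched as a two-part MDL score. The sketch below uses the naive per-cell encoding the abstract starts from, not the refined data-to-model encoding the report develops, and `factorize` stands in for an arbitrary BMF algorithm:

```python
import numpy as np

def naive_description_length(A, B, C):
    """Toy two-part MDL score for a Boolean factorization A ~ B o C.

    L(model): one bit per factor-matrix cell. L(data | model): the error
    matrix, encoded by listing the coordinates of each wrongly covered
    cell. The report's data-to-model encoding is considerably more refined.
    """
    n, m = A.shape
    k = B.shape[1]
    model_bits = n * k + k * m                       # L(model)
    recon = (B.astype(int) @ C.astype(int)) > 0      # Boolean matrix product
    errors = int(np.logical_xor(A.astype(bool), recon).sum())
    data_bits = errors * (np.log2(n) + np.log2(m))   # L(data | model)
    return model_bits + data_bits

def best_order(A, factorize, k_max):
    """Model-order selection: try each rank k and keep the one whose
    factorization gives the smallest total description length."""
    scores = {k: naive_description_length(A, *factorize(A, k))
              for k in range(1, k_max + 1)}
    return min(scores, key=scores.get)
```

With this score, growing the factors only pays off while each extra Boolean component saves more error bits than it costs in model bits, which is exactly the structure-versus-noise boundary the report is after.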
Export
BibTeX
@techreport{MiettinenVreeken, TITLE = {{MDL4BMF}: Minimum Description Length for Boolean Matrix Factorization}, AUTHOR = {Miettinen, Pauli and Vreeken, Jilles}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2012-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2012}, ABSTRACT = {Matrix factorizations---where a given data matrix is approximated by a product of two or more factor matrices---are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the {\textquoteleft}model order selection problem{\textquoteright} of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)---where data, factors, and matrix product are Boolean---has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general---making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding, starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Miettinen, Pauli %A Vreeken, Jilles %+ Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T MDL4BMF: Minimum Description Length for Boolean Matrix Factorization : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0024-0422-E %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2012 %P 48 p. %X Matrix factorizations&#8212;where a given data matrix is approximated by a product of two or more factor matrices&#8212;are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the &#8216;model order selection problem&#8217; of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)&#8212;where data, factors, and matrix product are Boolean&#8212;has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general&#8212;making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding, starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
%B Research Report %@ false
2011
[11]
A. Anand, S. Bedathur, K. Berberich, and R. Schenkel, “Temporal Index Sharding for Space-time Efficiency in Archive Search,” Universität des Saarlandes, Saarbrücken, MPI-I-2011-5-001, 2011.
Abstract
Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the Web, document collections, wikis, and so on. Typical approaches for efficient evaluation of these queries involve slicing along the time-axis either the entire collection [253349] or individual index lists [kberberi:sigir2007]. Neither method is satisfactory, since both sacrifice index compactness for processing efficiency, making them either too big or too slow. We present a novel index organization scheme that shards the index with zero increase in index size, still minimizing the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm carefully merges shards from the optimal solution to allow for a few extra sequential accesses while gaining significantly by reducing random accesses. Finally, we empirically establish the effectiveness of our novel sharding scheme via detailed experiments over the edit history of the English version of Wikipedia between 2001-2005 (≈ 700 GB) and an archive of the UK governmental web sites (≈ 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.
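For context on what the report improves, here is a minimal sketch of the baseline it starts from: slicing a temporal posting list along the time axis so a time-travel query touches a single slice. The replication in `build_shards` is exactly the index-size blow-up the report's zero-replication sharding avoids (all names here are hypothetical):

```python
import bisect
from collections import namedtuple

Posting = namedtuple("Posting", "doc begin end")   # valid during [begin, end)

def build_shards(postings, boundaries):
    """Slice a temporal posting list along the time axis. Each posting is
    replicated into every shard whose time slice it overlaps, so a
    time-travel query reads exactly one shard, at the cost of a larger
    index; the report's sharding achieves this with zero replication."""
    shards = [[] for _ in range(len(boundaries) - 1)]
    for p in postings:
        for i in range(len(boundaries) - 1):
            lo, hi = boundaries[i], boundaries[i + 1]
            if p.begin < hi and p.end > lo:        # overlaps slice [lo, hi)
                shards[i].append(p)
    return shards

def time_travel_query(shards, boundaries, t):
    """Return the documents valid at time t, touching a single shard."""
    i = bisect.bisect_right(boundaries, t) - 1
    return {p.doc for p in shards[i] if p.begin <= t < p.end}
```

Choosing the shard boundaries is where the random-versus-sequential access trade-off in the abstract comes in: fewer, larger shards mean fewer random seeks but more sequential reading per query.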
Export
BibTeX
@techreport{Bedathur2011, TITLE = {Temporal Index Sharding for Space-time Efficiency in Archive Search}, AUTHOR = {Anand, Avishek and Bedathur, Srikanta and Berberich, Klaus and Schenkel, Ralf}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2011-5-001}, INSTITUTION = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2011}, DATE = {2011}, ABSTRACT = {Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the Web, document collections, wikis, and so on. Typical approaches for efficient evaluation of these queries involve \emph{slicing} along the time-axis either the entire collection~\cite{253349}, or individual index lists~\cite{kberberi:sigir2007}. Both these methods are not satisfactory since they sacrifice compactness of index for processing efficiency making them either too big or, otherwise, too slow. We present a novel index organization scheme that \emph{shards} the index with \emph{zero increase in index size}, still minimizing the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution carefully to allow for few extra sequential accesses while gaining significantly by reducing the random accesses. Finally, we empirically establish the effectiveness of our novel sharding scheme via detailed experiments over the edit history of the English version of Wikipedia between 2001-2005 ($\approx$ 700 GB) and an archive of the UK governmental web sites ($\approx$ 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Anand, Avishek %A Bedathur, Srikanta %A Berberich, Klaus %A Schenkel, Ralf %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Temporal Index Sharding for Space-time Efficiency in Archive Search : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0025-7311-D %Y Universit&#228;t des Saarlandes %C Saarbr&#252;cken %D 2011 %X Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the Web, document collections, wikis, and so on. Typical approaches for efficient evaluation of these queries involve \emph{slicing} along the time-axis either the entire collection~\cite{253349}, or individual index lists~\cite{kberberi:sigir2007}. Both these methods are not satisfactory since they sacrifice compactness of index for processing efficiency making them either too big or, otherwise, too slow. We present a novel index organization scheme that \emph{shards} the index with \emph{zero increase in index size}, still minimizing the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution carefully to allow for few extra sequential accesses while gaining significantly by reducing the random accesses. Finally, we empirically establish the effectiveness of our novel sharding scheme via detailed experiments over the edit history of the English version of Wikipedia between 2001-2005 ($\approx$ 700 GB) and an archive of the UK governmental web sites ($\approx$ 400 GB). 
Our results demonstrate the feasibility of faster time-travel query processing with no space overhead. %B Research Report %@ false
[12]
R. Gemulla, P. J. Haas, E. Nijkamp, and Y. Sismanis, “Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent,” IBM Research Division, San Jose, CA, RJ10481, 2011.
Abstract
As Web 2.0 and enterprise-cloud applications have proliferated, data mining algorithms increasingly need to be (re)designed to handle web-scale datasets. For this reason, low-rank matrix factorization has received a lot of attention in recent years, since it is fundamental to a variety of mining tasks, such as topic detection and collaborative filtering, that are increasingly being applied to massive datasets. We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm; the idea is to exploit the special structure of the matrix factorization problem to develop a new “stratified” SGD variant that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. The resulting distributed SGD factorization algorithm, called DSGD, provides good speed-up and handles a wide variety of matrix factorizations. We establish convergence properties of DSGD using results from stochastic approximation theory and regenerative process theory, and also describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.
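The stratification idea in the abstract can be illustrated in a few lines: split rows and columns into d blocks each, and in every sub-epoch process d blocks that touch disjoint rows of W and disjoint columns of H. The sketch below is a single-process illustration under assumed parameter names, not the paper's distributed DSGD implementation:

```python
import numpy as np

def dsgd_factorize(V, rank=2, d=2, epochs=500, lr=0.05):
    """Illustrative, single-process sketch of stratified SGD for V ~ W @ H.

    In sub-epoch s, the d blocks (i, (i + s) % d) touch disjoint rows of
    W and disjoint columns of H, so d workers could process them in
    parallel (e.g., as MapReduce tasks); here we simply loop over them.
    """
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.standard_normal((n, rank)) * 0.1
    H = rng.standard_normal((rank, m)) * 0.1
    rbnd = np.linspace(0, n, d + 1, dtype=int)   # row-block boundaries
    cbnd = np.linspace(0, m, d + 1, dtype=int)   # column-block boundaries
    for _ in range(epochs):
        for s in range(d):                       # one stratum per sub-epoch
            for i in range(d):                   # mutually independent blocks
                j = (i + s) % d
                for r in range(rbnd[i], rbnd[i + 1]):
                    for c in range(cbnd[j], cbnd[j + 1]):
                        w = W[r].copy()          # gradients from old values
                        err = V[r, c] - w @ H[:, c]
                        W[r] += lr * err * H[:, c]
                        H[:, c] += lr * err * w
    return W, H
```

Because the blocks in one sub-epoch share no parameters, running them in parallel gives the same updates as running them sequentially, which is what makes the variant fully distributable.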
Export
BibTeX
@techreport{gemulla11, TITLE = {Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent}, AUTHOR = {Gemulla, Rainer and Haas, Peter J. and Nijkamp, Erik and Sismanis, Yannis}, LANGUAGE = {eng}, URL = {http://www.almaden.ibm.com/cs/people/peterh/dsgdTechRep.pdf}, LOCALID = {Local-ID: C1256DBF005F876D-5B618B1FF070E981C125784D0044B0D1-gemulla11}, INSTITUTION = {IBM Research Division}, ADDRESS = {San Jose, CA}, YEAR = {2011}, ABSTRACT = {As Web 2.0 and enterprise-cloud applications have proliferated, data mining algorithms increasingly need to be (re)designed to handle web-scale datasets. For this reason, low-rank matrix factorization has received a lot of attention in recent years, since it is fundamental to a variety of mining tasks, such as topic detection and collaborative filtering, that are increasingly being applied to massive datasets. We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm; the idea is to exploit the special structure of the matrix factorization problem to develop a new ``stratified'' SGD variant that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. The resulting distributed SGD factorization algorithm, called DSGD, provides good speed-up and handles a wide variety of matrix factorizations. We establish convergence properties of DSGD using results from stochastic approximation theory and regenerative process theory, and also describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.}, TYPE = {IBM Research Report}, VOLUME = {RJ10481}, }
Endnote
%0 Report %A Gemulla, Rainer %A Haas, Peter J. %A Nijkamp, Erik %A Sismanis, Yannis %+ Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations External Organizations External Organizations %T Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0010-147F-E %F EDOC: 618949 %U http://www.almaden.ibm.com/cs/people/peterh/dsgdTechRep.pdf %F OTHER: Local-ID: C1256DBF005F876D-5B618B1FF070E981C125784D0044B0D1-gemulla11 %Y IBM Research Division %C San Jose, CA %D 2011 %X As Web 2.0 and enterprise-cloud applications have proliferated, data mining algorithms increasingly need to be (re)designed to handle web-scale datasets. For this reason, low-rank matrix factorization has received a lot of attention in recent years, since it is fundamental to a variety of mining tasks, such as topic detection and collaborative filtering, that are increasingly being applied to massive datasets. We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm; the idea is to exploit the special structure of the matrix factorization problem to develop a new &#8220;stratified&#8221; SGD variant that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. The resulting distributed SGD factorization algorithm, called DSGD, provides good speed-up and handles a wide variety of matrix factorizations. We establish convergence properties of DSGD using results from stochastic approximation theory and regenerative process theory, and also describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms. %B IBM Research Report %N RJ10481
[13]
B. Taneva, M. Kacimi El Hassani, and G. Weikum, “Finding Images of Rare and Ambiguous Entities,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2011-5-002, 2011.
Export
BibTeX
@techreport{TanevaKacimiWeikum2011, TITLE = {Finding Images of Rare and Ambiguous Entities}, AUTHOR = {Taneva, Bilyana and Kacimi El Hassani, M. and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2011-5-002}, NUMBER = {MPI-I-2011-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2011}, DATE = {2011}, TYPE = {Research Report}, }
Endnote
%0 Report %A Taneva, Bilyana %A Kacimi El Hassani, M. %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Finding Images of Rare and Ambiguous Entities : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-6581-8 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2011-5-002 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2011 %P 30 p. %B Research Report
2010
[14]
A. Anand, S. Bedathur, K. Berberich, and R. Schenkel, “Efficient Temporal Keyword Queries over Versioned Text,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-003, 2010.
Export
BibTeX
@techreport{AnandBedathurBerberichSchenkel2010, TITLE = {Efficient Temporal Keyword Queries over Versioned Text}, AUTHOR = {Anand, Avishek and Bedathur, Srikanta and Berberich, Klaus and Schenkel, Ralf}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-003}, NUMBER = {MPI-I-2010-5-003}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, TYPE = {Research Report}, }
Endnote
%0 Report %A Anand, Avishek %A Bedathur, Srikanta %A Berberich, Klaus %A Schenkel, Ralf %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Efficient Temporal Keyword Queries over Versioned Text : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-65A0-2 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-003 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2010 %P 39 p. %B Research Report
[15]
K. Berberich, S. Bedathur, O. Alonso, and G. Weikum, “A Language Modeling Approach for Temporal Information Needs,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-001, 2010.
Abstract
This work addresses information needs that have a temporal dimension conveyed by a temporal expression in the user's query. Temporal expressions such as “in the 1990s” are frequent, easily extractable, but not leveraged by existing retrieval models. One challenge when dealing with them is their inherent uncertainty. It is often unclear which exact time interval a temporal expression refers to. We integrate temporal expressions into a language modeling approach, thus making them first-class citizens of the retrieval model and considering their inherent uncertainty. Experiments on the New York Times Annotated Corpus using Amazon Mechanical Turk to collect queries and obtain relevance assessments demonstrate that our approach yields substantial improvements in retrieval effectiveness.
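To make the language-modeling angle concrete, here is a deliberately simplified sketch: a temporal expression is reduced to a time interval, the generative probability is taken as interval overlap, and query scoring mirrors smoothed term scoring. The smoothing parameters `lam` and `bg` are hypothetical; the report's actual model handles the uncertainty of expressions like “in the 1990s” more carefully:

```python
def interval_prob(query, doc):
    """P(query interval | doc interval), taken here as the fraction of
    the query interval covered by the document's interval; a simplified
    stand-in for the report's generative model."""
    qb, qe = query
    db, de = doc
    overlap = max(0, min(qe, de) - max(qb, db))
    return overlap / (qe - qb) if qe > qb else 0.0

def temporal_score(query_intervals, doc_intervals, lam=0.5, bg=1e-4):
    """Score a document's temporal expressions against the query's, with
    Jelinek-Mercer-style smoothing, mirroring how a language model
    multiplies smoothed per-term probabilities."""
    score = 1.0
    for q in query_intervals:
        p = sum(interval_prob(q, d) for d in doc_intervals) / max(len(doc_intervals), 1)
        score *= lam * p + (1 - lam) * bg
    return score
```

A document whose temporal expressions overlap the queried decade thus scores higher than one anchored in a different period, while the background mass `bg` keeps documents without any overlap from scoring exactly zero.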
Export
BibTeX
@techreport{BerberichBedathurAlonsoWeikum2010, TITLE = {A Language Modeling Approach for Temporal Information Needs}, AUTHOR = {Berberich, Klaus and Bedathur, Srikanta and Alonso, Omar and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-001}, NUMBER = {MPI-I-2010-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, ABSTRACT = {This work addresses information needs that have a temporal dimension conveyed by a temporal expression in the user's query. Temporal expressions such as \textsf{``in the 1990s''} are frequent, easily extractable, but not leveraged by existing retrieval models. One challenge when dealing with them is their inherent uncertainty. It is often unclear which exact time interval a temporal expression refers to. We integrate temporal expressions into a language modeling approach, thus making them first-class citizens of the retrieval model and considering their inherent uncertainty. Experiments on the New York Times Annotated Corpus using Amazon Mechanical Turk to collect queries and obtain relevance assessments demonstrate that our approach yields substantial improvements in retrieval effectiveness.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Berberich, Klaus %A Bedathur, Srikanta %A Alonso, Omar %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T A Language Modeling Approach for Temporal Information Needs : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-65AB-C %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-001 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2010 %P 29 p. %X This work addresses information needs that have a temporal dimension conveyed by a temporal expression in the user's query. Temporal expressions such as &#8220;in the 1990s&#8221; are frequent, easily extractable, but not leveraged by existing retrieval models. One challenge when dealing with them is their inherent uncertainty. It is often unclear which exact time interval a temporal expression refers to. We integrate temporal expressions into a language modeling approach, thus making them first-class citizens of the retrieval model and considering their inherent uncertainty. Experiments on the New York Times Annotated Corpus using Amazon Mechanical Turk to collect queries and obtain relevance assessments demonstrate that our approach yields substantial improvements in retrieval effectiveness. %B Research Report
[16]
A. Broschart and R. Schenkel, “Real-time Text Queries with Tunable Term Pair Indexes,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-006, 2010.
Abstract
Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound for the index size. The framework allows selectively materializing lists for pairs based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes.
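The precomputed pair lists described above can be sketched as follows. This is a toy index under assumed names: pairs within a window get a distance-decayed score per document, and `max_entries` plays the role of the size budget the report tunes (the report selects which lists to materialize using a query log, not the raw score mass used here):

```python
from collections import defaultdict

def build_pair_index(docs, window=5, max_entries=None):
    """Toy term-pair index: for every unordered pair of distinct terms
    co-occurring within a text window, accumulate a distance-decayed
    proximity score per document. max_entries caps how many pair lists
    are kept, trading index size against result quality."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for i, t in enumerate(tokens):
            for j in range(i + 1, min(i + window, len(tokens))):
                pair = tuple(sorted((t, tokens[j])))
                if pair[0] == pair[1]:
                    continue                     # skip repeated terms
                index[pair][doc_id] = index[pair].get(doc_id, 0.0) + 1.0 / (j - i)
    if max_entries is not None:                  # keep highest-impact lists
        keep = sorted(index, key=lambda p: -sum(index[p].values()))[:max_entries]
        index = {p: index[p] for p in keep}
    return index
```

At query time, a pair list delivers precomputed proximity contributions directly, which is what enables the real-time processing the title refers to; the price is the index growth that the size bound controls.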
Export
BibTeX
@techreport{BroschartSchenkel2010, TITLE = {Real-time Text Queries with Tunable Term Pair Indexes}, AUTHOR = {Broschart, Andreas and Schenkel, Ralf}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-006}, NUMBER = {MPI-I-2010-5-006}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, ABSTRACT = {Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound for the index size. The framework allows to selectively materialize lists for pairs based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Broschart, Andreas %A Schenkel, Ralf %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Real-time Text Queries with Tunable Term Pair Indexes : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-658C-1 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-006 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2010 %P 41 p. %X Term proximity scoring is an established means in information retrieval for improving result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents where tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound for the index size. The framework allows to selectively materialize lists for pairs based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes. %B Research Report
[17]
A. Das Sarma, M. Theobald, and J. Widom, “LIVE: A Lineage-Supported Versioned DBMS,” Stanford University, Stanford, ILPUBS-926, 2010.
Export
BibTeX
@techreport{ilpubs-926, TITLE = {{LIVE}: A Lineage-Supported Versioned {DBMS}}, AUTHOR = {Das Sarma, Anish and Theobald, Martin and Widom, Jennifer}, LANGUAGE = {eng}, URL = {http://ilpubs.stanford.edu:8090/926/}, NUMBER = {ILPUBS-926}, LOCALID = {Local-ID: C1256DBF005F876D-C48EC96138450196C12576B1003F58D3-ilpubs-926}, INSTITUTION = {Stanford University}, ADDRESS = {Stanford}, YEAR = {2010}, DATE = {2010}, TYPE = {Technical Report}, }
Endnote
%0 Report %A Das Sarma, Anish %A Theobald, Martin %A Widom, Jennifer %+ External Organizations Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T LIVE: A Lineage-Supported Versioned DBMS : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1512-A %F EDOC: 536357 %U http://ilpubs.stanford.edu:8090/926/ %F OTHER: Local-ID: C1256DBF005F876D-C48EC96138450196C12576B1003F58D3-ilpubs-926 %Y Stanford University %C Stanford %D 2010 %P 13 p. %B Technical Report
[18]
S. Elbassuoni, M. Ramanath, and G. Weikum, “Query Relaxation for Entity-relationship Search,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-008, 2010.
Export
BibTeX
@techreport{Elbassuoni-relax2010, TITLE = {Query Relaxation for Entity-relationship Search}, AUTHOR = {Elbassuoni, Shady and Ramanath, Maya and Weikum, Gerhard}, LANGUAGE = {eng}, NUMBER = {MPI-I-2010-5-008}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, TYPE = {Report}, }
Endnote
%0 Report %A Elbassuoni, Shady %A Ramanath, Maya %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Query Relaxation for Entity-relationship Search : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0019-B30B-6 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2010 %B Report
[19]
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-007, 2010.
Abstract
We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space.
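The SPOTL representation mentioned in the abstract extends the SPO triple with time and location, which is what makes queries like “who held which position where, during which years” expressible. A minimal sketch (the field names and the example facts in the test are illustrative, not taken from YAGO2):

```python
from collections import namedtuple

# SPOTL: the classic (S)ubject-(P)redicate-(O)bject triple, extended
# with a (T)ime interval and an (L)ocation.
Fact = namedtuple("Fact", "s p o t_begin t_end location")

def valid_at(facts, year=None, location=None):
    """Filter facts by temporal validity and/or place: a toy query over
    the spatio-temporal dimensions that SPOTL adds to plain SPO triples."""
    out = []
    for f in facts:
        if year is not None and not (f.t_begin <= year <= f.t_end):
            continue
        if location is not None and f.location != location:
            continue
        out.append(f)
    return out
```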
Export
BibTeX
@techreport{Hoffart2010, TITLE = {{YAGO}2: A Spatially and Temporally Enhanced Knowledge Base from {Wikipedia}}, AUTHOR = {Hoffart, Johannes and Suchanek, Fabian M. and Berberich, Klaus and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-007}, NUMBER = {MPI-I-2010-5-007}, LOCALID = {Local-ID: C1256DBF005F876D-37A86CDFCE56B71DC125784800386E6A-Hoffart2010}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, ABSTRACT = {We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95\% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Hoffart, Johannes %A Suchanek, Fabian M. %A Berberich, Klaus %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-155B-A %F EDOC: 536412 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-007 %F OTHER: Local-ID: C1256DBF005F876D-37A86CDFCE56B71DC125784800386E6A-Hoffart2010 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2010 %P 55 p. %X We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space. %B Research Report
[20]
N. Preda, F. Suchanek, W. Yuan, and G. Weikum, “Query Evaluation with Asymmetric Web Services,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-004, 2010.
Export
BibTeX
@techreport{PredaSuchanekYuanWeikum2011, TITLE = {Query Evaluation with Asymmetric Web Services}, AUTHOR = {Preda, Nicoleta and Suchanek, Fabian and Yuan, Wenjun and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-004}, NUMBER = {MPI-I-2010-5-004}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, TYPE = {Research Report}, }
Endnote
%0 Report %A Preda, Nicoleta %A Suchanek, Fabian %A Yuan, Wenjun %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Query Evaluation with Asymmetric Web Services : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-659D-C %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-004 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2010 %P 31 p. %B Research Report
[21]
S. Seufert, S. Bedathur, J. Mestre, and G. Weikum, “Bonsai: Growing Interesting Small Trees,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-005, 2010.
Abstract
Graphs are increasingly used to model a variety of loosely structured data such as biological or social networks and entity-relationships. Given this profusion of large-scale graph data, efficiently discovering interesting substructures buried within is essential. These substructures are typically used in determining subsequent actions, such as conducting visual analytics by humans or designing expensive biomedical experiments. In such settings, it is often desirable to constrain the size of the discovered results in order to directly control the associated costs. In this report, we address the problem of finding cardinality-constrained connected subtrees from large node-weighted graphs that maximize the sum of weights of selected nodes. We provide an efficient constant-factor approximation algorithm for this strongly NP-hard problem. Our techniques can be applied in a wide variety of application settings, for example in differential analysis of graphs, a problem that frequently arises in bioinformatics but also has applications on the web.
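To illustrate the problem setting of the abstract, not the report's algorithm (which achieves a constant-factor approximation guarantee), a naive greedy heuristic grows a connected subtree of at most k nodes by repeatedly attaching the heaviest node on the tree's frontier:

```python
def greedy_subtree(adj, weight, k, root):
    """Naive heuristic for the cardinality-constrained subtree problem:
    grow a connected subtree of <= k nodes from `root`, greedily adding
    the highest-weight neighbor of the current tree at each step.
    Illustrative only; offers no approximation guarantee."""
    tree = {root}
    while len(tree) < k:
        # nodes adjacent to the tree but not yet in it
        frontier = {v for u in tree for v in adj[u]} - tree
        if not frontier:
            break
        tree.add(max(frontier, key=lambda v: weight[v]))
    return tree

# tiny node-weighted graph: the greedy walk picks up b (5) and then d (10)
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
weight = {"a": 1, "b": 5, "c": 2, "d": 10}
```

Greedy growth can get stuck behind low-weight cut vertices, which is exactly why the constant-factor algorithm of the report is non-trivial.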
Export
BibTeX
@techreport{Seufert2010a, TITLE = {Bonsai: Growing Interesting Small Trees}, AUTHOR = {Seufert, Stephan and Bedathur, Srikanta and Mestre, Julian and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-005}, NUMBER = {MPI-I-2010-5-005}, LOCALID = {Local-ID: C1256DBF005F876D-BC73995718B48415C12577E600538833-Seufert2010a}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, ABSTRACT = {Graphs are increasingly used to model a variety of loosely structured data such as biological or social networks and entity-relationships. Given this profusion of large-scale graph data, efficiently discovering interesting substructures buried within is essential. These substructures are typically used in determining subsequent actions, such as conducting visual analytics by humans or designing expensive biomedical experiments. In such settings, it is often desirable to constrain the size of the discovered results in order to directly control the associated costs. In this report, we address the problem of finding cardinality-constrained connected subtrees from large node-weighted graphs that maximize the sum of weights of selected nodes. We provide an efficient constant-factor approximation algorithm for this strongly NP-hard problem. Our techniques can be applied in a wide variety of application settings, for example in differential analysis of graphs, a problem that frequently arises in bioinformatics but also has applications on the web.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Seufert, Stephan %A Bedathur, Srikanta %A Mestre, Julian %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Algorithms and Complexity, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Bonsai: Growing Interesting Small Trees : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-14D8-7 %F EDOC: 536383 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-005 %F OTHER: Local-ID: C1256DBF005F876D-BC73995718B48415C12577E600538833-Seufert2010a %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2010 %P 32 p. %X Graphs are increasingly used to model a variety of loosely structured data such as biological or social networks and entity-relationships. Given this profusion of large-scale graph data, efficiently discovering interesting substructures buried within is essential. These substructures are typically used in determining subsequent actions, such as conducting visual analytics by humans or designing expensive biomedical experiments. In such settings, it is often desirable to constrain the size of the discovered results in order to directly control the associated costs. In this report, we address the problem of finding cardinality-constrained connected subtrees from large node-weighted graphs that maximize the sum of weights of selected nodes. We provide an efficient constant-factor approximation algorithm for this strongly NP-hard problem. Our techniques can be applied in a wide variety of application settings, for example in differential analysis of graphs, a problem that frequently arises in bioinformatics but also has applications on the web. %B Research Report
[22]
M. Theobald, M. Sozio, F. Suchanek, and N. Nakashole, “URDF: Efficient Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2010-5-002, 2010.
Abstract
We present URDF, an efficient reasoning framework for graph-based, nonschematic RDF knowledge bases and SPARQL-like queries. URDF augments first-order reasoning by a combination of soft rules, with Datalog-style recursive implications, and hard rules, in the shape of mutually exclusive sets of facts. It incorporates the common possible worlds semantics with independent base facts as it is prevalent in most probabilistic database approaches, but also supports semantically more expressive, probabilistic first-order representations such as Markov Logic Networks. As knowledge extraction on the Web often is an iterative (and inherently noisy) process, URDF explicitly targets the resolution of inconsistencies between the underlying RDF base facts and the inference rules. Core of our approach is a novel and efficient approximation algorithm for a generalized version of the Weighted MAX-SAT problem, allowing us to dynamically resolve such inconsistencies directly at query processing time. Our MAX-SAT algorithm has a worst-case running time of O(|C| · |S|), where |C| and |S| denote the number of facts in grounded soft and hard rules, respectively, and it comes with tight approximation guarantees with respect to the shape of the rules and the distribution of confidences of facts they contain. Experiments over various benchmark settings confirm a high robustness and significantly improved runtime of our reasoning framework in comparison to state-of-the-art techniques for MCMC sampling such as MAP inference and MC-SAT.
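The Weighted MAX-SAT objective at the core of the abstract can be stated with a brute-force solver. The report's contribution is an efficient approximation algorithm; this exponential sketch merely defines the problem being approximated:

```python
from itertools import product

def weighted_max_sat(variables, clauses):
    """Exhaustive Weighted MAX-SAT. Each clause is (weight, literals);
    a literal (var, polarity) is satisfied when the assignment maps var
    to polarity. Returns the assignment maximizing the total weight of
    satisfied clauses. Exponential in |variables|; illustration only."""
    best_assign, best_weight = None, -1.0
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        total = sum(w for w, lits in clauses
                    if any(assign[v] == pol for v, pol in lits))
        if total > best_weight:
            best_assign, best_weight = assign, total
    return best_assign, best_weight
```

In the URDF setting, clause weights would come from fact confidences, with hard rules encoded as mutually exclusive fact sets; that encoding is not shown here.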
Export
BibTeX
@techreport{urdf-tr-2010, TITLE = {{URDF}: Efficient Reasoning in Uncertain {RDF} Knowledge Bases with Soft and Hard Rules}, AUTHOR = {Theobald, Martin and Sozio, Mauro and Suchanek, Fabian and Nakashole, Ndapandula}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-002}, NUMBER = {MPI-I-2010-5-002}, LOCALID = {Local-ID: C1256DBF005F876D-4F6C2407136ECAA6C125770E003634BE-urdf-tr-2010}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2010}, DATE = {2010}, ABSTRACT = {We present URDF, an efficient reasoning framework for graph-based, nonschematic RDF knowledge bases and SPARQL-like queries. URDF augments first-order reasoning by a combination of soft rules, with Datalog-style recursive implications, and hard rules, in the shape of mutually exclusive sets of facts. It incorporates the common possible worlds semantics with independent base facts as it is prevalent in most probabilistic database approaches, but also supports semantically more expressive, probabilistic first-order representations such as Markov Logic Networks. As knowledge extraction on the Web often is an iterative (and inherently noisy) process, URDF explicitly targets the resolution of inconsistencies between the underlying RDF base facts and the inference rules. Core of our approach is a novel and efficient approximation algorithm for a generalized version of the Weighted MAX-SAT problem, allowing us to dynamically resolve such inconsistencies directly at query processing time. Our MAX-SAT algorithm has a worst-case running time of O(|C| |S|), where |C| and |S| denote the number of facts in grounded soft and hard rules, respectively, and it comes with tight approximation guarantees with respect to the shape of the rules and the distribution of confidences of facts they contain. Experiments over various benchmark settings confirm a high robustness and significantly improved runtime of our reasoning framework in comparison to state-of-the-art techniques for MCMC sampling such as MAP inference and MC-SAT.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Theobald, Martin %A Sozio, Mauro %A Suchanek, Fabian %A Nakashole, Ndapandula %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T URDF: Efficient Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1556-3 %F EDOC: 536366 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2010-5-002 %F OTHER: Local-ID: C1256DBF005F876D-4F6C2407136ECAA6C125770E003634BE-urdf-tr-2010 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2010 %P 48 p. %X We present URDF, an efficient reasoning framework for graph-based, nonschematic RDF knowledge bases and SPARQL-like queries. URDF augments first-order reasoning by a combination of soft rules, with Datalog-style recursive implications, and hard rules, in the shape of mutually exclusive sets of facts. It incorporates the common possible worlds semantics with independent base facts as it is prevalent in most probabilistic database approaches, but also supports semantically more expressive, probabilistic first-order representations such as Markov Logic Networks. As knowledge extraction on the Web often is an iterative (and inherently noisy) process, URDF explicitly targets the resolution of inconsistencies between the underlying RDF base facts and the inference rules. Core of our approach is a novel and efficient approximation algorithm for a generalized version of the Weighted MAX-SAT problem, allowing us to dynamically resolve such inconsistencies directly at query processing time. Our MAX-SAT algorithm has a worst-case running time of O(|C| · |S|), where |C| and |S| denote the number of facts in grounded soft and hard rules, respectively, and it comes with tight approximation guarantees with respect to the shape of the rules and the distribution of confidences of facts they contain. Experiments over various benchmark settings confirm a high robustness and significantly improved runtime of our reasoning framework in comparison to state-of-the-art techniques for MCMC sampling such as MAP inference and MC-SAT. %B Research Report
2009
[23]
S. Bedathur, K. Berberich, J. Dittrich, N. Mamoulis, and G. Weikum, “Scalable Phrase Mining for Ad-hoc Text Analytics,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-006, 2009.
Export
BibTeX
@techreport{BedathurBerberichDittrichMamoulisWeikum2009, TITLE = {Scalable Phrase Mining for Ad-hoc Text Analytics}, AUTHOR = {Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2009-5-006}, LOCALID = {Local-ID: C1256DBF005F876D-4E35301DBC58B9F7C12575A00044A942-TechReport-BBDMW2009}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2009}, DATE = {2009}, TYPE = {Research Report}, }
Endnote
%0 Report %A Bedathur, Srikanta %A Berberich, Klaus %A Dittrich, Jens %A Mamoulis, Nikos %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Scalable Phrase Mining for Ad-hoc Text Analytics : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-194A-0 %F EDOC: 520425 %@ 0946-011X %F OTHER: Local-ID: C1256DBF005F876D-4E35301DBC58B9F7C12575A00044A942-TechReport-BBDMW2009 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2009 %P 41 p. %B Research Report
[24]
G. de Melo and G. Weikum, “Towards a Universal Wordnet by Learning from Combined Evidence,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-005, 2009.
Abstract
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
Export
BibTeX
@techreport{deMeloWeikum2009, TITLE = {Towards a Universal Wordnet by Learning from Combined Evidence}, AUTHOR = {de Melo, Gerard and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2009-5-005}, NUMBER = {MPI-I-2009-5-005}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2009}, DATE = {2009}, ABSTRACT = {Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.}, TYPE = {Research Report}, }
Endnote
%0 Report %A de Melo, Gerard %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Towards a Universal Wordnet by Learning from Combined Evidence : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-665C-5 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2009-5-005 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2009 %P 32 p. %X Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification. %B Research Report
[25]
G. Kasneci, S. Elbassuoni, and G. Weikum, “MING: Mining Informative Entity-relationship Subgraphs,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-007, 2009.
Export
BibTeX
@techreport{KasneciWeikumElbassuoni2009, TITLE = {{MING}: Mining Informative Entity-relationship Subgraphs}, AUTHOR = {Kasneci, Gjergji and Elbassuoni, Shady and Weikum, Gerhard}, LANGUAGE = {eng}, NUMBER = {MPI-I-2009-5-007}, LOCALID = {Local-ID: C1256DBF005F876D-E977DDB8EDAABEE6C12576320036DBD9-KasneciMING2009}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2009}, DATE = {2009}, TYPE = {Research Report}, }
Endnote
%0 Report %A Kasneci, Gjergji %A Elbassuoni, Shady %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T MING: Mining Informative Entity-relationship Subgraphs : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1932-4 %F EDOC: 520416 %F OTHER: Local-ID: C1256DBF005F876D-E977DDB8EDAABEE6C12576320036DBD9-KasneciMING2009 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2009 %P 32 p. %B Research Report
[26]
T. Neumann and G. Weikum, “The RDF-3X Engine for Scalable Management of RDF Data,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-003, 2009.
Abstract
RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude.
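The exhaustive-indexing scheme described in the abstract keeps a clustered index for every ordering of the triple components. The set of orderings is easy to enumerate; this sketch ignores the compressed storage and the binary/unary projection indexes that RDF-3X additionally maintains:

```python
from itertools import permutations

def spo_index_orders():
    """The six collation orders over (S)ubject, (P)redicate, (O)bject
    that an exhaustive triple-indexing scheme like RDF-3X maintains,
    so every triple pattern can be answered by a range scan."""
    return ["".join(p) for p in permutations("SPO")]

orders = spo_index_orders()
```

Because any combination of bound and unbound positions in a triple pattern is a prefix of one of these six orders, the query processor never needs a workload-specific index.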
Export
BibTeX
@techreport{Neumann2009report1, TITLE = {The {RDF}-3X Engine for Scalable Management of {RDF} Data}, AUTHOR = {Neumann, Thomas and Weikum, Gerhard}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2009-5-003}, LOCALID = {Local-ID: C1256DBF005F876D-AD3DBAFA6FB90DD2C1257593002FF3DF-Neumann2009report1}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2009}, DATE = {2009}, ABSTRACT = {RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The ``pay-as-you-go'' nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Neumann, Thomas %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T The RDF-3X Engine for Scalable Management of RDF Data : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-195A-A %F EDOC: 520381 %@ 0946-011X %F OTHER: Local-ID: C1256DBF005F876D-AD3DBAFA6FB90DD2C1257593002FF3DF-Neumann2009report1 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2009 %X RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude. %B Research Report
[27]
N. Preda, F. Suchanek, G. Kasneci, T. Neumann, and G. Weikum, “Coupling Knowledge Bases and Web Services for Active Knowledge,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-004, 2009.
Abstract
We present ANGIE, a system that can answer user queries by combining knowledge from a local database with knowledge retrieved from Web services. If a user poses a query that cannot be answered by the local database alone, ANGIE calls the appropriate Web services to retrieve the missing information. In ANGIE, Web services act as dynamic components of the knowledge base that deliver knowledge on demand. To the user, this is fully transparent; the dynamically acquired knowledge is presented as if it were stored in the local knowledge base. We have developed an RDF-based model for the declarative definition of functions embedded in the local knowledge base. The results of available Web services are cast into RDF subgraphs. Parameter bindings are automatically constructed by ANGIE, services are invoked, and the semi-structured information returned by the services is dynamically integrated into the knowledge base. We have developed a query rewriting algorithm that determines one or more function compositions that need to be executed in order to evaluate a SPARQL-style user query. The key idea is that the local knowledge base can be used to guide the selection of values used as input parameters of function calls. This is in contrast to the conventional approaches in the literature, which would exhaustively materialize all values that can be used as binding values for the input parameters.
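The on-demand service invocation described above can be caricatured in a few lines: when a triple pattern has no match in the local knowledge base, a binding taken from the pattern is fed to a registered service and the returned facts are integrated. All names here are hypothetical, not ANGIE's actual API:

```python
def evaluate(pattern, local_kb, services):
    """Answer a triple pattern (s, p, o) where None marks an unbound
    position. Falls back to a web-service call (keyed by predicate)
    when the local KB has no match; returned facts are added to the KB.
    Hypothetical sketch, not ANGIE's interface or rewriting algorithm."""
    def matches(triple):
        return all(q is None or q == v for q, v in zip(pattern, triple))
    hits = [t for t in local_kb if matches(t)]
    if hits:
        return hits
    s, p, o = pattern
    service = services.get(p)          # service able to produce facts for p
    if service is not None and s is not None:
        new_facts = [(s, p, value) for value in service(s)]
        local_kb.extend(new_facts)     # dynamic integration into the KB
        return new_facts
    return []
```

The real system composes several such calls via query rewriting; this sketch only shows the single-pattern fallback and the KB-side integration of service results.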
Export
BibTeX
@techreport{PredaSuchanekKasneciNeumannWeikum2009, TITLE = {Coupling Knowledge Bases and Web Services for Active Knowledge}, AUTHOR = {Preda, Nicoleta and Suchanek, Fabian and Kasneci, Gjergji and Neumann, Thomas and Weikum, Gerhard}, LANGUAGE = {eng}, NUMBER = {MPI-I-2009-5-004}, LOCALID = {Local-ID: C1256DBF005F876D-BF2AB4A39F925BC8C125759800444744-PredaSuchanekKasneciNeumannWeikum2009}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2009}, DATE = {2009}, ABSTRACT = {We present ANGIE, a system that can answer user queries by combining knowledge from a local database with knowledge retrieved from Web services. If a user poses a query that cannot be answered by the local database alone, ANGIE calls the appropriate Web services to retrieve the missing information. In ANGIE, Web services act as dynamic components of the knowledge base that deliver knowledge on demand. To the user, this is fully transparent; the dynamically acquired knowledge is presented as if it were stored in the local knowledge base. We have developed an RDF-based model for the declarative definition of functions embedded in the local knowledge base. The results of available Web services are cast into RDF subgraphs. Parameter bindings are automatically constructed by ANGIE, services are invoked, and the semi-structured information returned by the services is dynamically integrated into the knowledge base. We have developed a query rewriting algorithm that determines one or more function compositions that need to be executed in order to evaluate a SPARQL-style user query. The key idea is that the local knowledge base can be used to guide the selection of values used as input parameters of function calls. This is in contrast to the conventional approaches in the literature, which would exhaustively materialize all values that can be used as binding values for the input parameters.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Preda, Nicoleta %A Suchanek, Fabian %A Kasneci, Gjergji %A Neumann, Thomas %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Coupling Knowledge Bases and Web Services for Active Knowledge : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1901-1 %F EDOC: 520423 %F OTHER: Local-ID: C1256DBF005F876D-BF2AB4A39F925BC8C125759800444744-PredaSuchanekKasneciNeumannWeikum2009 %D 2009 %X We present ANGIE, a system that can answer user queries by combining knowledge from a local database with knowledge retrieved from Web services. If a user poses a query that cannot be answered by the local database alone, ANGIE calls the appropriate Web services to retrieve the missing information. In ANGIE,Web services act as dynamic components of the knowledge base that deliver knowledge on demand. To the user, this is fully transparent; the dynamically acquired knowledge is presented as if it were stored in the local knowledge base. We have developed a RDF based model for declarative definition of functions embedded in the local knowledge base. The results of available Web services are cast into RDF subgraphs. Parameter bindings are automatically constructed by ANGIE, services are invoked, and the semi-structured information returned by the services are dynamically integrated into the knowledge base We have developed a query rewriting algorithm that determines one or more function composition that need to be executed in order to evaluate a SPARQL style user query. The key idea is that the local knowledge base can be used to guide the selection of values used as input parameters of function calls. 
This is in contrast to the conventional approaches in the literature, which would exhaustively materialize all values that can be used as binding values for the input parameters. %B Research Reports
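The key idea in the abstract above, using the local knowledge base to choose input bindings for web-service calls rather than exhaustively materializing all candidate values, can be sketched roughly as follows. All data, entity names, and the toy service below are hypothetical illustrations, not ANGIE's actual RDF machinery:

```python
# Minimal sketch: the local KB guides which input bindings a web-service
# "function" is invoked with; returned triples are integrated on demand.
# All names and the stand-in service data are invented for illustration.

local_kb = {
    ("Ennio_Morricone", "type"): "composer",
    ("Hans_Zimmer", "type"): "composer",
    ("Inception", "type"): "film",
}

def composed_soundtracks(person):
    """Stand-in for a Web service returning (subject, predicate, object) triples."""
    fake_service_data = {
        "Hans_Zimmer": [("Hans_Zimmer", "composedSoundtrackFor", "Inception")],
        "Ennio_Morricone": [("Ennio_Morricone", "composedSoundtrackFor", "The_Mission")],
    }
    return fake_service_data.get(person, [])

def answer(query_predicate):
    # Use the local KB to select plausible bindings: only entities typed
    # as "composer" are sent to the service, not every known entity.
    bindings = [s for (s, p), o in local_kb.items()
                if p == "type" and o == "composer"]
    results = []
    for person in bindings:
        for triple in composed_soundtracks(person):
            if triple[1] == query_predicate:
                results.append(triple)
                # Integrate the acquired knowledge into the local KB.
                local_kb[(triple[0], triple[1])] = triple[2]
    return results
```

Here `answer("composedSoundtrackFor")` calls the service only for the two composer entities; the film entity is never used as a binding value.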
[28]
M. Ramanath, K. Sarath Kumar, and G. Ifrim, “Generating Concise and Readable Summaries of XML documents,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2009-5-002, 2009.
Export
BibTeX
@techreport{Ramanath2008a, TITLE = {Generating Concise and Readable Summaries of {XML} documents}, AUTHOR = {Ramanath, Maya and Sarath Kumar, Kondreddi and Ifrim, Georgiana}, LANGUAGE = {eng}, NUMBER = {MPI-I-2009-5-002}, LOCALID = {Local-ID: C1256DBF005F876D-EA355A84178BB514C12575BA002A90E0-Ramanath2008}, YEAR = {2009}, DATE = {2009}, TYPE = {Research Reports}, }
Endnote
%0 Report %A Ramanath, Maya %A Sarath Kumar, Kondreddi %A Ifrim, Georgiana %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Generating Concise and Readable Summaries of XML documents : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1915-6 %F EDOC: 520419 %F OTHER: Local-ID: C1256DBF005F876D-EA355A84178BB514C12575BA002A90E0-Ramanath2008 %D 2009 %B Research Reports
2008
[29]
A. Das Sarma, M. Theobald, and J. Widom, “Data Modifications and Versioning in Trio,” Stanford University InfoLab, Stanford, CA, ILPUBS-849, 2008.
Export
BibTeX
@techreport{ilpubs-849, TITLE = {Data Modifications and Versioning in Trio}, AUTHOR = {Das Sarma, Anish and Theobald, Martin and Widom, Jennifer}, LANGUAGE = {eng}, URL = {http://ilpubs.stanford.edu:8090/849/}, NUMBER = {ILPUBS-849}, INSTITUTION = {Stanford University InfoLab}, ADDRESS = {Stanford, CA}, YEAR = {2008}, TYPE = {Technical Report}, }
Endnote
%0 Report %A Das Sarma, Anish %A Theobald, Martin %A Widom, Jennifer %+ External Organizations Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T Data Modifications and Versioning in Trio : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0027-AED6-D %U http://ilpubs.stanford.edu:8090/849/ %Y Stanford University InfoLab %C Stanford, CA %D 2008 %B Technical Report
[30]
G. de Melo, F. Suchanek, and A. Pease, “Integrating Yago into the suggested upper merged ontology,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2008-5-003, 2008.
Abstract
Ontologies are becoming more and more popular as background knowledge for intelligent applications. Up to now, there has been a schism between manually assembled, highly axiomatic ontologies and large, automatically constructed knowledge bases. This report discusses how the two worlds can be brought together by combining the high-level axiomatizations from the Suggested Upper Merged Ontology (SUMO) with the extensive world knowledge of the YAGO ontology. On the theoretical side, it analyses the differences between the knowledge representation in YAGO and SUMO. On the practical side, this report explains how the two resources can be merged. This yields a new large-scale formal ontology, which provides information about millions of entities such as people, cities, organizations, and companies. This report is the detailed version of our paper at ICTAI 2008.
Export
BibTeX
@techreport{deMeloSuchanekPease2008, TITLE = {Integrating Yago into the suggested upper merged ontology}, AUTHOR = {de Melo, Gerard and Suchanek, Fabian and Pease, Adam}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-003}, NUMBER = {MPI-I-2008-5-003}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2008}, DATE = {2008}, ABSTRACT = {Ontologies are becoming more and more popular as background knowledge for intelligent applications. Up to now, there has been a schism between manually assembled, highly axiomatic ontologies and large, automatically constructed knowledge bases. This report discusses how the two worlds can be brought together by combining the high-level axiomatizations from the Suggested Upper Merged Ontology (SUMO) with the extensive world knowledge of the YAGO ontology. On the theoretical side, it analyses the differences between the knowledge representation in YAGO and SUMO. On the practical side, this report explains how the two resources can be merged. This yields a new large-scale formal ontology, which provides information about millions of entities such as people, cities, organizations, and companies. This report is the detailed version of our paper at ICTAI 2008.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A de Melo, Gerard %A Suchanek, Fabian %A Pease, Adam %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T Integrating Yago into the suggested upper merged ontology : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-66AB-6 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-003 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2008 %P 31 p. %X Ontologies are becoming more and more popular as background knowledge for intelligent applications. Up to now, there has been a schism between manually assembled, highly axiomatic ontologies and large, automatically constructed knowledge bases. This report discusses how the two worlds can be brought together by combining the high-level axiomatizations from the Suggested Upper Merged Ontology (SUMO) with the extensive world knowledge of the YAGO ontology. On the theoretical side, it analyses the differences between the knowledge representation in YAGO and SUMO. On the practical side, this report explains how the two resources can be merged. This yields a new large-scale formal ontology, which provides information about millions of entities such as people, cities, organizations, and companies. This report is the detailed version of our paper at ICTAI 2008. %B Research Report / Max-Planck-Institut f&#252;r Informatik
[31]
G. Kasneci, M. Ramanath, M. Sozio, F. Suchanek, and G. Weikum, “STAR: Steiner tree approximation in relationship-graphs,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2008-5-001, 2008.
Abstract
Large-scale graphs and networks are abundant in modern information systems: entity-relationship graphs over relational data or Web-extracted entities, biological networks, social online communities, knowledge bases, and many more. Often such data comes with expressive node and edge labels that allow an interpretation as a semantic graph, and edge weights that reflect the strengths of semantic relations between entities. Finding close relationships between a given set of two, three, or more entities is an important building block for many search, ranking, and analysis tasks. From an algorithmic point of view, this translates into computing the best Steiner trees between the given nodes, a classical NP-hard problem. In this paper, we present a new approximation algorithm, coined STAR, for relationship queries over large graphs that do not fit into memory. We prove that for n query entities, STAR yields an O(log(n))-approximation of the optimal Steiner tree, and show that in practical cases the results returned by STAR are qualitatively better than the results returned by a classical 2-approximation algorithm. We then describe an extension to our algorithm to return the top-k Steiner trees. Finally, we evaluate our algorithm over both main-memory and completely disk-resident graphs containing millions of nodes. Our experiments show that STAR outperforms the best state-of-the-art approaches and returns qualitatively better results.
Export
BibTeX
@techreport{KasneciRamanathSozioSuchanekWeikum2008, TITLE = {{STAR}: Steiner tree approximation in relationship-graphs}, AUTHOR = {Kasneci, Gjergji and Ramanath, Maya and Sozio, Mauro and Suchanek, Fabian and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-001}, NUMBER = {MPI-I-2008-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2008}, DATE = {2008}, ABSTRACT = {Large-scale graphs and networks are abundant in modern information systems: entity-relationship graphs over relational data or Web-extracted entities, biological networks, social online communities, knowledge bases, and many more. Often such data comes with expressive node and edge labels that allow an interpretation as a semantic graph, and edge weights that reflect the strengths of semantic relations between entities. Finding close relationships between a given set of two, three, or more entities is an important building block for many search, ranking, and analysis tasks. From an algorithmic point of view, this translates into computing the best Steiner trees between the given nodes, a classical NP-hard problem. In this paper, we present a new approximation algorithm, coined STAR, for relationship queries over large graphs that do not fit into memory. We prove that for n query entities, STAR yields an O(log(n))-approximation of the optimal Steiner tree, and show that in practical cases the results returned by STAR are qualitatively better than the results returned by a classical 2-approximation algorithm. We then describe an extension to our algorithm to return the top-k Steiner trees. Finally, we evaluate our algorithm over both main-memory and completely disk-resident graphs containing millions of nodes. Our experiments show that STAR outperforms the best state-of-the-art approaches and returns qualitatively better results.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A Kasneci, Gjergji %A Ramanath, Maya %A Sozio, Mauro %A Suchanek, Fabian %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T STAR: Steiner tree approximation in relationship-graphs : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-66B3-1 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-001 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2008 %P 37 p. %X Large-scale graphs and networks are abundant in modern information systems: entity-relationship graphs over relational data or Web-extracted entities, biological networks, social online communities, knowledge bases, and many more. Often such data comes with expressive node and edge labels that allow an interpretation as a semantic graph, and edge weights that reflect the strengths of semantic relations between entities. Finding close relationships between a given set of two, three, or more entities is an important building block for many search, ranking, and analysis tasks. From an algorithmic point of view, this translates into computing the best Steiner trees between the given nodes, a classical NP-hard problem. In this paper, we present a new approximation algorithm, coined STAR, for relationship queries over large graphs that do not fit into memory. We prove that for n query entities, STAR yields an O(log(n))-approximation of the optimal Steiner tree, and show that in practical cases the results returned by STAR are qualitatively better than the results returned by a classical 2-approximation algorithm. We then describe an extension to our algorithm to return the top-k Steiner trees. 
Finally, we evaluate our algorithm over both main-memory and completely disk-resident graphs containing millions of nodes. Our experiments show that STAR outperforms the best state-of-the-art approaches and returns qualitatively better results. %B Research Report / Max-Planck-Institut f&#252;r Informatik
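The classical 2-approximation that the abstract uses as a baseline (a minimum spanning tree over the terminals' shortest-path metric closure, with each MST edge expanded back into a real path) can be sketched as follows. The graph and weights are invented for illustration; STAR itself is a different, more scalable algorithm:

```python
import heapq
from itertools import combinations

def dijkstra(graph, src):
    """Shortest-path distances and predecessors from src in a weighted graph."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def steiner_2approx(graph, terminals):
    """Classical 2-approximation: Kruskal's MST over the terminals'
    metric closure, each closure edge expanded into an actual path."""
    closure = {t: dijkstra(graph, t) for t in terminals}
    edges = sorted((closure[a][0][b], a, b) for a, b in combinations(terminals, 2))
    parent = {t: t for t in terminals}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree_edges = set()
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            node, prev = b, closure[a][1]   # expand (a, b) into its real path
            while node != a:
                tree_edges.add(tuple(sorted((node, prev[node]))))
                node = prev[node]
    return tree_edges

# Toy graph: three terminals connected through a hub node "X".
G = {
    "A": [("X", 1.0)], "B": [("X", 1.0)], "C": [("X", 1.0)],
    "X": [("A", 1.0), ("B", 1.0), ("C", 1.0)],
}
```

On this toy input, `steiner_2approx(G, ["A", "B", "C"])` recovers the optimal star through the hub; in general the heuristic is only guaranteed to be within a factor of 2 of optimal.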
[32]
T. Neumann and G. Moerkotte, “Single phase construction of optimal DAG-structured QEPs,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2008-5-002, 2008.
Abstract
Traditionally, database management systems use tree-structured query evaluation plans. They are easy to implement but not expressive enough for some optimizations like eliminating common algebraic subexpressions or magic sets. These require directed acyclic graphs (DAGs), i.e. shared subplans. Existing approaches consider DAGs merely for special cases and not in full generality. We introduce a novel framework to reason about sharing of subplans and, thus, DAG-structured query evaluation plans. Then, we present the first plan generator capable of generating optimal DAG-structured query evaluation plans. The experimental results show that with no or only a modest increase of plan generation time, a major reduction of query execution time can be achieved for common queries.
Export
BibTeX
@techreport{NeumannMoerkotte2008, TITLE = {Single phase construction of optimal {DAG}-structured {QEPs}}, AUTHOR = {Neumann, Thomas and Moerkotte, Guido}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-002}, NUMBER = {MPI-I-2008-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2008}, DATE = {2008}, ABSTRACT = {Traditionally, database management systems use tree-structured query evaluation plans. They are easy to implement but not expressive enough for some optimizations like eliminating common algebraic subexpressions or magic sets. These require directed acyclic graphs (DAGs), i.e. shared subplans. Existing approaches consider DAGs merely for special cases and not in full generality. We introduce a novel framework to reason about sharing of subplans and, thus, DAG-structured query evaluation plans. Then, we present the first plan generator capable of generating optimal DAG-structured query evaluation plans. The experimental results show that with no or only a modest increase of plan generation time, a major reduction of query execution time can be achieved for common queries.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A Neumann, Thomas %A Moerkotte, Guido %+ Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T Single phase construction of optimal DAG-structured QEPs : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-66B0-7 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-002 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2008 %P 73 p. %X Traditionally, database management systems use tree-structured query evaluation plans. They are easy to implement but not expressive enough for some optimizations like eliminating common algebraic subexpressions or magic sets. These require directed acyclic graphs (DAGs), i.e. shared subplans. Existing approaches consider DAGs merely for special cases and not in full generality. We introduce a novel framework to reason about sharing of subplans and, thus, DAG-structured query evaluation plans. Then, we present the first plan generator capable of generating optimal DAG-structured query evaluation plans. The experimental results show that with no or only a modest increase of plan generation time, a major reduction of query execution time can be achieved for common queries. %B Research Report / Max-Planck-Institut f&#252;r Informatik
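The core notion in the abstract above, representing a subplan once and sharing it among several consumers so that the plan becomes a DAG, can be illustrated with a simple hash-consing scheme. This is a toy model with invented operator names; the report's plan generator reasons about sharing far more generally:

```python
# Toy illustration of subplan sharing: structurally identical operator
# trees are interned to a single canonical object, so a query plan that
# builds the same subexpression twice becomes a DAG rather than a tree.
# Operator names and the example "query" are invented for illustration.

_interned = {}

def plan(op, *children):
    """Return the canonical (shared) node for this operator tree."""
    node = (op, children)               # children are already canonical
    return _interned.setdefault(node, node)

# Two branches of a query both filter the same base-table scan:
scan = plan("Scan", plan("Table:orders"))
filt_a = plan("Filter:year=2008", scan)
filt_b = plan("Filter:year=2008", scan)   # built independently...
top = plan("Union", plan("Aggregate", filt_a), plan("Sort", filt_b))
print(filt_a is filt_b)   # ...but represented once: prints True
```

A plan generator working over such shared nodes can then cost the common subexpression once instead of evaluating it in both branches, which is the kind of saving the report targets.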
[33]
F. Suchanek, M. Sozio, and G. Weikum, “SOFIE: A Self-Organizing Framework for Information Extraction,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2008-5-004, 2008.
Abstract
This paper presents SOFIE, a system for automated ontology extension. SOFIE can parse natural language documents, extract ontological facts from them and link the facts into an ontology. SOFIE uses logical reasoning on the existing knowledge and on the new knowledge in order to disambiguate words to their most probable meaning, to reason on the meaning of text patterns and to take into account world knowledge axioms. This allows SOFIE to check the plausibility of hypotheses and to avoid inconsistencies with the ontology. The framework of SOFIE unites the paradigms of pattern matching, word sense disambiguation and ontological reasoning in one unified model. Our experiments show that SOFIE delivers near-perfect output, even from unstructured Internet documents.
Export
BibTeX
@techreport{SuchanekMauroWeikum2008, TITLE = {{SOFIE}: A Self-Organizing Framework for Information Extraction}, AUTHOR = {Suchanek, Fabian and Sozio, Mauro and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-004}, NUMBER = {MPI-I-2008-5-004}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2008}, DATE = {2008}, ABSTRACT = {This paper presents SOFIE, a system for automated ontology extension. SOFIE can parse natural language documents, extract ontological facts from them and link the facts into an ontology. SOFIE uses logical reasoning on the existing knowledge and on the new knowledge in order to disambiguate words to their most probable meaning, to reason on the meaning of text patterns and to take into account world knowledge axioms. This allows SOFIE to check the plausibility of hypotheses and to avoid inconsistencies with the ontology. The framework of SOFIE unites the paradigms of pattern matching, word sense disambiguation and ontological reasoning in one unified model. Our experiments show that SOFIE delivers near-perfect output, even from unstructured Internet documents.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A Suchanek, Fabian %A Sozio, Mauro %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T SOFIE: A Self-Organizing Framework for Information Extraction : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-668E-6 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2008-5-004 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2008 %P 49 p. %X This paper presents SOFIE, a system for automated ontology extension. SOFIE can parse natural language documents, extract ontological facts from them and link the facts into an ontology. SOFIE uses logical reasoning on the existing knowledge and on the new knowledge in order to disambiguate words to their most probable meaning, to reason on the meaning of text patterns and to take into account world knowledge axioms. This allows SOFIE to check the plausibility of hypotheses and to avoid inconsistencies with the ontology. The framework of SOFIE unites the paradigms of pattern matching, word sense disambiguation and ontological reasoning in one unified model. Our experiments show that SOFIE delivers near-perfect output, even from unstructured Internet documents. %B Research Report / Max-Planck-Institut f&#252;r Informatik
2007
[34]
K. Berberich, S. Bedathur, T. Neumann, and G. Weikum, “A Time Machine for Text Search,” Max-Planck-Institut für Informatik, Saarbrücken, MPII-I-2007-5-02, 2007.
Export
BibTeX
@techreport{TechReportBBNW-2007, TITLE = {A Time Machine for Text Search}, AUTHOR = {Berberich, Klaus and Bedathur, Srikanta and Neumann, Thomas and Weikum, Gerhard}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPII-I-2007-5-02}, LOCALID = {Local-ID: C12573CC004A8E26-D444201EBAA5F95BC125731E00458A41-TechReportBBNW-2007}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2007}, DATE = {2007}, TYPE = {Research Report}, }
Endnote
%0 Report %A Berberich, Klaus %A Bedathur, Srikanta %A Neumann, Thomas %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T A Time Machine for Text Search : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1E49-E %F EDOC: 356443 %@ 0946-011X %F OTHER: Local-ID: C12573CC004A8E26-D444201EBAA5F95BC125731E00458A41-TechReportBBNW-2007 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2007 %P 39 p. %B Research Report
[35]
G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum, “NAGA: Searching and Ranking Knowledge,” Max-Planck-Institut für Informatik, Saarbrücken, Germany, MPI-I-2007-5-001, 2007.
Abstract
The Web has the potential to become the world's largest knowledge base. In order to unleash this potential, the wealth of information available on the web needs to be extracted and organized. There is a need for new querying techniques that are simple yet more expressive than those provided by standard keyword-based search engines. Search for knowledge rather than Web pages needs to consider inherent semantic structures like entities (person, organization, etc.) and relationships (isA, locatedIn, etc.). In this paper, we propose NAGA, a new semantic search engine. NAGA's knowledge base, which is organized as a graph with typed edges, consists of millions of entities and relationships automatically extracted from Web-based corpora. A query language capable of expressing keyword search for the casual user as well as graph queries with regular expressions for the expert enables the formulation of queries with additional semantic information. We introduce a novel scoring model, based on the principles of generative language models, which formalizes several notions like confidence, informativeness and compactness and uses them to rank query results. We demonstrate NAGA's superior result quality over current search engines by conducting a comprehensive evaluation, including user assessments, for advanced queries.
Export
BibTeX
@techreport{TechReportKSIRW-2007, TITLE = {{NAGA}: Searching and Ranking Knowledge}, AUTHOR = {Kasneci, Gjergji and Suchanek, Fabian M. and Ifrim, Georgiana and Ramanath, Maya and Weikum, Gerhard}, LANGUAGE = {eng}, ISSN = {0946-011X}, NUMBER = {MPI-I-2007-5-001}, LOCALID = {Local-ID: C12573CC004A8E26-0C33A6E805909705C12572AE003DA15B-TechReportKSIRW-2007}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken, Germany}, YEAR = {2007}, DATE = {2007}, ABSTRACT = {The Web has the potential to become the world's largest knowledge base. In order to unleash this potential, the wealth of information available on the web needs to be extracted and organized. There is a need for new querying techniques that are simple yet more expressive than those provided by standard keyword-based search engines. Search for knowledge rather than Web pages needs to consider inherent semantic structures like entities (person, organization, etc.) and relationships (isA, locatedIn, etc.). In this paper, we propose {NAGA}, a new semantic search engine. {NAGA}'s knowledge base, which is organized as a graph with typed edges, consists of millions of entities and relationships automatically extracted from Web-based corpora. A query language capable of expressing keyword search for the casual user as well as graph queries with regular expressions for the expert enables the formulation of queries with additional semantic information. We introduce a novel scoring model, based on the principles of generative language models, which formalizes several notions like confidence, informativeness and compactness and uses them to rank query results. We demonstrate {NAGA}'s superior result quality over current search engines by conducting a comprehensive evaluation, including user assessments, for advanced queries.}, TYPE = {Research Report}, }
Endnote
%0 Report %A Kasneci, Gjergji %A Suchanek, Fabian M. %A Ifrim, Georgiana %A Ramanath, Maya %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T NAGA: Searching and Ranking Knowledge : %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-1FFC-1 %F EDOC: 356470 %@ 0946-011X %F OTHER: Local-ID: C12573CC004A8E26-0C33A6E805909705C12572AE003DA15B-TechReportKSIRW-2007 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken, Germany %D 2007 %P 42 p. %X The Web has the potential to become the world's largest knowledge base. In order to unleash this potential, the wealth of information available on the web needs to be extracted and organized. There is a need for new querying techniques that are simple yet more expressive than those provided by standard keyword-based search engines. Search for knowledge rather than Web pages needs to consider inherent semantic structures like entities (person, organization, etc.) and relationships (isA, locatedIn, etc.). In this paper, we propose NAGA, a new semantic search engine. NAGA's knowledge base, which is organized as a graph with typed edges, consists of millions of entities and relationships automatically extracted from Web-based corpora. A query language capable of expressing keyword search for the casual user as well as graph queries with regular expressions for the expert enables the formulation of queries with additional semantic information. We introduce a novel scoring model, based on the principles of generative language models, which formalizes several notions like confidence, informativeness and compactness and uses them to rank query results.
We demonstrate NAGA's superior result quality over current search engines by conducting a comprehensive evaluation, including user assessments, for advanced queries. %B Research Report
[36]
F. Suchanek, G. Kasneci, and G. Weikum, “Yago: a large ontology from Wikipedia and WordNet,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2007-5-003, 2007.
Abstract
This article presents YAGO, a large ontology with high coverage and precision. YAGO has been automatically derived from Wikipedia and WordNet. It comprises entities and relations, and currently contains more than 1.7 million entities and 15 million facts. These include the taxonomic Is-A hierarchy as well as semantic relations between entities. The facts for YAGO have been extracted from the category system and the infoboxes of Wikipedia and have been combined with taxonomic relations from WordNet. Type checking techniques help us keep YAGO's precision at 95% -- as proven by an extensive evaluation study. YAGO is based on a clean logical model with a decidable consistency. Furthermore, it allows representing n-ary relations in a natural way while maintaining compatibility with RDFS. A powerful query model facilitates access to YAGO's data.
Export
BibTeX
@techreport{SuchanekKasneciWeikum2007, TITLE = {Yago: a large ontology from Wikipedia and {WordNet}}, AUTHOR = {Suchanek, Fabian and Kasneci, Gjergji and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2007-5-003}, NUMBER = {MPI-I-2007-5-003}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2007}, DATE = {2007}, ABSTRACT = {This article presents YAGO, a large ontology with high coverage and precision. YAGO has been automatically derived from Wikipedia and WordNet. It comprises entities and relations, and currently contains more than 1.7 million entities and 15 million facts. These include the taxonomic Is-A hierarchy as well as semantic relations between entities. The facts for YAGO have been extracted from the category system and the infoboxes of Wikipedia and have been combined with taxonomic relations from WordNet. Type checking techniques help us keep YAGO's precision at 95% -- as proven by an extensive evaluation study. YAGO is based on a clean logical model with a decidable consistency. Furthermore, it allows representing n-ary relations in a natural way while maintaining compatibility with RDFS. A powerful query model facilitates access to YAGO's data.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A Suchanek, Fabian %A Kasneci, Gjergji %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Yago: a large ontology from Wikipedia and WordNet : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-66CA-F %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2007-5-003 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2007 %P 67 p. %X This article presents YAGO, a large ontology with high coverage and precision. YAGO has been automatically derived from Wikipedia and WordNet. It comprises entities and relations, and currently contains more than 1.7 million entities and 15 million facts. These include the taxonomic Is-A hierarchy as well as semantic relations between entities. The facts for YAGO have been extracted from the category system and the infoboxes of Wikipedia and have been combined with taxonomic relations from WordNet. Type checking techniques help us keep YAGO's precision at 95% -- as proven by an extensive evaluation study. YAGO is based on a clean logical model with a decidable consistency. Furthermore, it allows representing n-ary relations in a natural way while maintaining compatibility with RDFS. A powerful query model facilitates access to YAGO's data. %B Research Report / Max-Planck-Institut f&#252;r Informatik
2006
[37]
R. Angelova and S. Siersdorfer, “A neighborhood-based approach for clustering of linked document collections,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2006-5-005, 2006.
Abstract
This technical report addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm (e.g., k-means or DBSCAN). These techniques result in higher cluster purity, better overall accuracy, and make self-organization more robust. Our comprehensive experiments on three different real-world corpora demonstrate the benefits of our approach.
Export
BibTeX
@techreport{AngelovaSiersdorfer2006, TITLE = {A neighborhood-based approach for clustering of linked document collections}, AUTHOR = {Angelova, Ralitsa and Siersdorfer, Stefan}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-005}, NUMBER = {MPI-I-2006-5-005}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2006}, DATE = {2006}, ABSTRACT = {This technical report addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm (e.g., k-means or DBSCAN). These techniques result in higher cluster purity, better overall accuracy, and make self-organization more robust. Our comprehensive experiments on three different real-world corpora demonstrate the benefits of our approach.}, TYPE = {Research Report / Max-Planck-Institut f&#252;r Informatik}, }
Endnote
%0 Report %A Angelova, Ralitsa %A Siersdorfer, Stefan %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T A neighborhood-based approach for clustering of linked document collections %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-670D-4 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-005 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2006 %P 32 p. %X This technical report addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm (e.g., k-means or DBSCAN). These techniques result in higher cluster purity, better overall accuracy, and make self-organization more robust. Our comprehensive experiments on three different real-world corpora demonstrate the benefits of our approach. %B Research Report / Max-Planck-Institut für Informatik
[38]
H. Bast, D. Majumdar, R. Schenkel, C. Theobalt, and G. Weikum, “IO-Top-k: index-access optimized top-k query processing,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2006-5-002, 2006.
Abstract
Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled scheduling methods based on a Knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms.
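The early-termination behavior of the threshold-algorithm family described above can be sketched in a few lines (a minimal round-robin variant of Fagin's threshold algorithm, not the report's optimized scheduling; the function name and list layout are invented for illustration):

```python
import heapq

def top_k(index_lists, k):
    """Minimal threshold algorithm: scan score-sorted index lists
    round-robin. After each depth, the sum of the last-seen scores is an
    upper bound (threshold) on any still-unseen document's aggregate
    score, so scanning stops once the k-th best known score reaches it.
    Random accesses are modeled as dict lookups into the other lists."""
    lookup = [dict(lst) for lst in index_lists]  # random-access side
    best = {}  # doc -> fully resolved aggregate score
    for pos in range(max(len(lst) for lst in index_lists)):
        threshold = 0.0
        for lst in index_lists:
            if pos < len(lst):
                doc, score = lst[pos]
                threshold += score
                if doc not in best:
                    # resolve the full score via random accesses
                    best[doc] = sum(d.get(doc, 0.0) for d in lookup)
        topk = heapq.nlargest(k, best.values())
        if len(topk) == k and topk[-1] >= threshold:
            break  # no unseen document can still enter the top-k
    return sorted(best.items(), key=lambda item: -item[1])[:k]
```

The report's contribution is precisely the scheduling that this sketch leaves naive: which list to advance next, and when a random access is worth its cost.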
Export
BibTeX
@techreport{BastMajumdarSchenkelTheobaldWeikum2006, TITLE = {{IO}-Top-k: index-access optimized top-k query processing}, AUTHOR = {Bast, Holger and Majumdar, Debapriyo and Schenkel, Ralf and Theobalt, Christian and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-002}, NUMBER = {MPI-I-2006-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2006}, DATE = {2006}, ABSTRACT = {Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled scheduling methods based on a Knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms.}, TYPE = {Research Report / Max-Planck-Institut f{\"u}r Informatik}, }
Endnote
%0 Report %A Bast, Holger %A Majumdar, Debapriyo %A Schenkel, Ralf %A Theobalt, Christian %A Weikum, Gerhard %+ Algorithms and Complexity, MPI for Informatics, Max Planck Society Algorithms and Complexity, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Computer Graphics, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T IO-Top-k: index-access optimized top-k query processing %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-6716-E %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-002 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2006 %P 49 p. %X Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled scheduling methods based on a Knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms. %B Research Report / Max-Planck-Institut für Informatik
[39]
M. Bender, S. Michel, G. Weikum, and P. Triantafilou, “Overlap-aware global df estimation in distributed information retrieval systems,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2006-5-001, 2006.
Abstract
Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document frequencies of the individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty that is addressed in this paper is that the local collections of different peers may arbitrarily overlap, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and also the benefit for better search result ranking.
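The key property the abstract relies on, that hash sketches count duplicates only once, can be demonstrated with a small Flajolet-Martin-style sketch (a simplified sketch; the function names and the SHA-1 hash choice are illustrative, not taken from the report):

```python
import hashlib

def fm_sketch(items, bits=32):
    """Flajolet-Martin hash sketch: hash each item and set the bitmap bit
    at the position of the hash's least-significant 1-bit. Identical items
    always set the same bit, so sketches built over overlapping peer
    collections can be OR-merged without double counting -- the property
    exploited for overlap-aware global df estimation."""
    bitmap = 0
    for item in items:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        r = (h & -h).bit_length() - 1  # index of lowest set bit
        bitmap |= 1 << min(r, bits - 1)
    return bitmap

def estimate(bitmap):
    """Estimate the distinct count as 2^R / 0.77351, where R is the
    position of the lowest *unset* bit (standard FM correction factor)."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351
```

Each peer would build such a sketch per term over its local document identifiers; OR-merging the per-peer sketches then estimates the global document frequency regardless of collection overlap.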
Export
BibTeX
@techreport{BenderMichelWeikumTriantafilou2006, TITLE = {Overlap-aware global df estimation in distributed information retrieval systems}, AUTHOR = {Bender, Matthias and Michel, Sebastian and Weikum, Gerhard and Triantafilou, Peter}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-001}, NUMBER = {MPI-I-2006-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2006}, DATE = {2006}, ABSTRACT = {Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document frequencies of the individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty that is addressed in this paper is that the local collections of different peers may arbitrarily overlap, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and also the benefit for better search result ranking.}, TYPE = {Research Report / Max-Planck-Institut f{\"u}r Informatik}, }
Endnote
%0 Report %A Bender, Matthias %A Michel, Sebastian %A Weikum, Gerhard %A Triantafilou, Peter %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations %T Overlap-aware global df estimation in distributed information retrieval systems %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-6719-8 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-001 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2006 %P 25 p. %X Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably, global document frequencies of the individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty that is addressed in this paper is that the local collections of different peers may arbitrarily overlap, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and also the benefit for better search result ranking. %B Research Report / Max-Planck-Institut für Informatik
[40]
G. Kasneci, F. Suchanek, and G. Weikum, “Yago - a core of semantic knowledge,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2006-5-006, 2006.
Abstract
We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as the hasWonPrize relation). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.
Export
BibTeX
@techreport{KasneciSuchanekWeikum2006, TITLE = {Yago -- a core of semantic knowledge}, AUTHOR = {Kasneci, Gjergji and Suchanek, Fabian and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-006}, NUMBER = {MPI-I-2006-5-006}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2006}, DATE = {2006}, ABSTRACT = {We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as the hasWonPrize relation). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95\%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques.}, TYPE = {Research Report / Max-Planck-Institut f{\"u}r Informatik}, }
Endnote
%0 Report %A Kasneci, Gjergji %A Suchanek, Fabian %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Yago - a core of semantic knowledge %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-670A-A %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-006 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2006 %P 39 p. %X We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains roughly 900,000 entities and 5,000,000 facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as the hasWonPrize relation). The facts have been automatically extracted from the unification of Wikipedia and WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality by adding knowledge about individuals like persons, organizations, products, etc. with their semantic relationships -- and in quantity by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques. %B Research Report / Max-Planck-Institut für Informatik
[41]
J. Luxenburger and G. Weikum, “Exploiting Community Behavior for Enhanced Link Analysis and Web Search,” University of Paderborn, Heinz Nixdorf Institute, Paderborn, Germany, DELIS-TR-0447, 2006.
[42]
F. Suchanek, G. Ifrim, and G. Weikum, “Combining linguistic and statistical analysis to extract relations from web documents,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2006-5-004, 2006.
Abstract
Search engines, question answering systems and classification systems alike can greatly profit from formalized world knowledge. Unfortunately, manually compiled collections of world knowledge (such as WordNet or the Suggested Upper Merged Ontology SUMO) often suffer from low coverage, high assembling costs and fast aging. In contrast, the World Wide Web provides an endless source of knowledge, assembled by millions of people, updated constantly and available for free. In this paper, we propose a novel method for learning arbitrary binary relations from natural language Web documents, without human interaction. Our system, LEILA, combines linguistic analysis and machine learning techniques to find robust patterns in the text and to generalize them. For initialization, we only require a set of examples of the target relation and a set of counterexamples (e.g. from WordNet). The architecture consists of 3 stages: Finding patterns in the corpus based on the given examples, assessing the patterns based on probabilistic confidence, and applying the generalized patterns to propose pairs for the target relation. We prove the benefits and practical viability of our approach by extensive experiments, showing that LEILA achieves consistent improvements over existing comparable techniques (e.g. Snowball, TextToOnto).
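The three stages named at the end of the abstract can be sketched with surface patterns in place of LEILA's linguistic (dependency-based) patterns and a simple hit count in place of its probabilistic confidence (all names and the regex scheme below are invented for illustration):

```python
import re

def learn_patterns(sentences, seeds, counterexamples):
    """Toy three-stage relation learner: (1) find the text between a known
    pair, (2) keep patterns whose seed matches outnumber counterexample
    matches, (3) apply kept patterns to propose new pairs."""
    patterns = {}  # pattern text -> [seed hits, counterexample hits]
    for s in sentences:
        for (x, y) in seeds + counterexamples:
            m = re.search(re.escape(x) + r'\s+(\w[\w\s]*?)\s+' + re.escape(y), s)
            if m:
                hits = patterns.setdefault(m.group(1), [0, 0])
                hits[0 if (x, y) in seeds else 1] += 1
    kept = [p for p, (good, bad) in patterns.items() if good > bad]
    proposed = set()
    for s in sentences:
        for p in kept:
            for m in re.finditer(r'(\w+)\s+' + re.escape(p) + r'\s+(\w+)', s):
                proposed.add((m.group(1), m.group(2)))
    return kept, proposed
```

From one seed pair such a learner already generalizes to unseen pairs in the same construction; LEILA's contribution is doing this robustly on parsed text with a statistically grounded pattern assessment.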
Export
BibTeX
@techreport{Suchanek2006, TITLE = {Combining linguistic and statistical analysis to extract relations from web documents}, AUTHOR = {Suchanek, Fabian and Ifrim, Georgiana and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-004}, NUMBER = {MPI-I-2006-5-004}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2006}, DATE = {2006}, ABSTRACT = {Search engines, question answering systems and classification systems alike can greatly profit from formalized world knowledge. Unfortunately, manually compiled collections of world knowledge (such as WordNet or the Suggested Upper Merged Ontology SUMO) often suffer from low coverage, high assembling costs and fast aging. In contrast, the World Wide Web provides an endless source of knowledge, assembled by millions of people, updated constantly and available for free. In this paper, we propose a novel method for learning arbitrary binary relations from natural language Web documents, without human interaction. Our system, LEILA, combines linguistic analysis and machine learning techniques to find robust patterns in the text and to generalize them. For initialization, we only require a set of examples of the target relation and a set of counterexamples (e.g. from WordNet). The architecture consists of 3 stages: Finding patterns in the corpus based on the given examples, assessing the patterns based on probabilistic confidence, and applying the generalized patterns to propose pairs for the target relation. We prove the benefits and practical viability of our approach by extensive experiments, showing that LEILA achieves consistent improvements over existing comparable techniques (e.g. Snowball, TextToOnto).}, TYPE = {Research Report / Max-Planck-Institut f{\"u}r Informatik}, }
Endnote
%0 Report %A Suchanek, Fabian %A Ifrim, Georgiana %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Combining linguistic and statistical analysis to extract relations from web documents : %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-6710-9 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2006-5-004 %Y Max-Planck-Institut f&#252;r Informatik %C Saarbr&#252;cken %D 2006 %P 37 p. %X Search engines, question answering systems and classification systems alike can greatly profit from formalized world knowledge. Unfortunately, manually compiled collections of world knowledge (such as WordNet or the Suggested Upper Merged Ontology SUMO) often suffer from low coverage, high assembling costs and fast aging. In contrast, the World Wide Web provides an endless source of knowledge, assembled by millions of people, updated constantly and available for free. In this paper, we propose a novel method for learning arbitrary binary relations from natural language Web documents, without human interaction. Our system, LEILA, combines linguistic analysis and machine learning techniques to find robust patterns in the text and to generalize them. For initialization, we only require a set of examples of the target relation and a set of counterexamples (e.g. from WordNet). The architecture consists of 3 stages: Finding patterns in the corpus based on the given examples, assessing the patterns based on probabilistic confidence, and applying the generalized patterns to propose pairs for the target relation. We prove the benefits and practical viability of our approach by extensive experiments, showing that LEILA achieves consistent improvements over existing comparable techniques (e.g. Snowball, TextToOnto). %B Research Report / Max-Planck-Institut f&#252;r Informatik
2005
[43]
S. Siersdorfer and G. Weikum, “Automated retraining methods for document classification and their parameter tuning,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2005-5-002, 2005.
Abstract
This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is by itself not robust and faces tuning problems regarding parameters like the number of selected documents, the number of retraining iterations, and the ratio of positive and negative classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a re-trained classifier and avoiding that the classifier is diluted by selecting too many or weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
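The retraining loop whose parameters the report tunes (number of selected documents, number of iterations, confidence cutoff) can be sketched as follows; the tiny nearest-centroid classifier on 1-D points and all names are stand-ins invented for illustration, and the parameters are fixed by hand here rather than tuned via predicted leave-one-out error as in the report:

```python
class CentroidClassifier:
    """Stand-in two-class classifier on scalar features: predicts the class
    with the nearest centroid; confidence is the relative margin between
    the two centroid distances."""
    def fit(self, X, y):
        self.centroids = {c: sum(x for x, yc in zip(X, y) if yc == c) /
                             y.count(c) for c in set(y)}
        return self
    def predict_conf(self, x):
        ranked = sorted((abs(x - m), c) for c, m in self.centroids.items())
        best, runner = ranked[0], ranked[-1]
        conf = runner[0] / (best[0] + runner[0] + 1e-9)
        return best[1], conf

def self_train(X_lab, y_lab, X_unlab, batch=2, iterations=3, min_conf=0.7):
    """Self-training sketch: repeatedly fit, classify the unlabeled pool,
    and move only the most confidently labeled documents into the
    training set. batch, iterations and min_conf are exactly the kind of
    parameters whose automatic tuning the report studies."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(iterations):
        clf = CentroidClassifier().fit(X, y)
        scored = sorted(((clf.predict_conf(x), x) for x in pool),
                        key=lambda t: -t[0][1])  # most confident first
        added = 0
        for (label, conf), x in scored:
            if conf < min_conf or added == batch:
                break
            X.append(x); y.append(label); pool.remove(x); added += 1
        if added == 0:  # nothing confident enough: stop retraining
            break
    return CentroidClassifier().fit(X, y)
```

Choosing `batch` or `min_conf` badly is what dilutes the classifier; the report's contribution is predicting that dilution (via estimated leave-one-out error) instead of hand-picking the values.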
Export
BibTeX
@techreport{SiersdorferWeikum2005, TITLE = {Automated retraining methods for document classification and their parameter tuning}, AUTHOR = {Siersdorfer, Stefan and Weikum, Gerhard}, LANGUAGE = {eng}, URL = {http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2005-5-002}, NUMBER = {MPI-I-2005-5-002}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2005}, DATE = {2005}, ABSTRACT = {This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is by itself not robust and faces tuning problems regarding parameters like the number of selected documents, the number of retraining iterations, and the ratio of positive and negative classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a re-trained classifier and avoiding that the classifier is diluted by selecting too many or weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.}, TYPE = {Research Report / Max-Planck-Institut f{\"u}r Informatik}, }
Endnote
%0 Report %A Siersdorfer, Stefan %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Automated retraining methods for document classification and their parameter tuning %G eng %U http://hdl.handle.net/11858/00-001M-0000-0014-6823-8 %U http://domino.mpi-inf.mpg.de/internet/reports.nsf/NumberView/2005-5-002 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2005 %P 23 p. %X This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling which may start with very few, manually selected, training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is by itself not robust and faces tuning problems regarding parameters like the number of selected documents, the number of retraining iterations, and the ratio of positive and negative classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a re-trained classifier and avoiding that the classifier is diluted by selecting too many or weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach. %B Research Report / Max-Planck-Institut für Informatik
2004
[44]
S. Siersdorfer, S. Sizov, and G. Weikum, “Goal-oriented methods and meta methods for document classification and their parameter tuning,” Max-Planck-Institut für Informatik, Saarbrücken, MPI-I-2004-5-001, May 2004.
Export
BibTeX
@techreport{SiersdorferSizovWeikum2004, TITLE = {Goal-oriented methods and meta methods for document classification and their parameter tuning}, AUTHOR = {Siersdorfer, Stefan and Sizov, Sergej and Weikum, Gerhard}, LANGUAGE = {eng}, NUMBER = {MPI-I-2004-5-001}, INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2004}, DATE = {2004-05}, TYPE = {Research Report}, }
Endnote
%0 Report %A Siersdorfer, Stefan %A Sizov, Sergej %A Weikum, Gerhard %+ Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T Goal-oriented methods and meta methods for document classification and their parameter tuning %G eng %U http://hdl.handle.net/11858/00-001M-0000-000F-28F3-A %F EDOC: 237842 %Y Max-Planck-Institut für Informatik %C Saarbrücken %D 2004 %P 36 p. %B Research Report