LEILA: Learning to Extract Information by Linguistic Analysis
What LEILA is
LEILA is a system that can extract pairs of a relation from a set of HTML documents. For example, it can extract pairs of persons with their birthdate, pairs of companies with their headquarters or pairs of entities with their concept. LEILA is part of the YAGO-NAGA project at the Max Planck Institute for Informatics in Saarbrücken, Germany. LEILA is no longer actively maintained, so there is no more software support for it.
What LEILA needs
As input, LEILA needs a set of pairs that are in the relation (the examples) and a set of pairs that are not in the relation (the counterexamples). For instance, for the birthdate-relation, the examples could be
|Wolfgang Amadeus Mozart||1756|
The counterexamples could be
|Wolfgang Amadeus Mozart||2010|
The examples and counterexamples are given by a Java-method. This means that the counterexamples need not be enumerated. For instance, the Java-method can simply say that any pair of a person that is listed in the examples and a "wrong" birthdate is a counterexample.
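The idea of deciding counterexamples on the fly can be sketched as follows. This is a minimal illustration, not LEILA's actual interface: the class name `BirthdateSketch`, the enum `Verdict` and the method `classify` are hypothetical, and the sketch assumes a recent Java version (`Map.of` needs Java 9+). A pair whose person is unknown is treated as a candidate for extraction.

```java
import java.util.Map;

/** Hypothetical sketch of a birthdate relation; not LEILA's actual Relation API. */
final class BirthdateSketch {

    /** The three possible verdicts for a pair of words. */
    enum Verdict { EXAMPLE, COUNTEREXAMPLE, CANDIDATE }

    // Known examples: each person maps to exactly one birth year.
    private static final Map<String, String> EXAMPLES =
            Map.of("Wolfgang Amadeus Mozart", "1756");

    /** Classifies a pair without ever enumerating the counterexamples. */
    static Verdict classify(String person, String year) {
        String knownYear = EXAMPLES.get(person);
        if (knownYear == null) {
            return Verdict.CANDIDATE;          // unknown person: possibly a new pair
        }
        return knownYear.equals(year)
                ? Verdict.EXAMPLE              // the pair is listed
                : Verdict.COUNTEREXAMPLE;      // known person with a "wrong" birthdate
    }

    public static void main(String[] args) {
        System.out.println(classify("Wolfgang Amadeus Mozart", "1756")); // EXAMPLE
        System.out.println(classify("Wolfgang Amadeus Mozart", "2010")); // COUNTEREXAMPLE
        System.out.println(classify("Vivaldi", "1678"));                 // CANDIDATE
    }
}
```

The deduction of counterexamples only works here because a person has exactly one birthdate; for a relation that is not functional, a different rule would be needed.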
How LEILA works
LEILA works in 3 phases:
- It finds all sentences in which an example pair appears. It collects the pattern in which the example pair appears (the positive patterns). For instance, for the following sentence "Chopin was born in 1810", it would extract the pattern "X was born in Y". Then, LEILA runs again through the documents and finds all sentences in which a positive pattern matches (possibly approximately), but a counterexample stands in the place of X and Y. The corresponding pattern is collected as a negative pattern.
- LEILA generalizes the positive patterns by machine learning techniques.
- LEILA runs through the documents again and finds all sentences in which a generalized positive pattern matches. It proposes the words in place of X and Y as a new pair. For example, if it finds the sentence "Vivaldi was born in 1678", it will propose the pair Vivaldi/1678.
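The first and the last phase can be illustrated with plain surface strings. This is a deliberately simplified sketch: LEILA itself works on Link Grammar parses rather than word sequences, and the sketch omits negative patterns and the generalization step. The class `SurfacePatternSketch` and its methods are hypothetical names, and the placeholders only match single words.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical surface-string illustration of pattern collection and application. */
final class SurfacePatternSketch {

    /** Phase 1: replaces an example pair in a sentence by the placeholders X and Y. */
    static Optional<String> collectPattern(String sentence, String x, String y) {
        if (!sentence.contains(x) || !sentence.contains(y)) {
            return Optional.empty();           // the example pair does not appear
        }
        return Optional.of(sentence.replace(x, "X").replace(y, "Y"));
    }

    /** Turns a pattern into a regular expression with one named group per placeholder. */
    private static Pattern toRegex(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (String token : pattern.split(" ")) {
            if (regex.length() > 0) {
                regex.append(' ');
            }
            if (token.equals("X")) {
                regex.append("(?<x>\\w+)");
            } else if (token.equals("Y")) {
                regex.append("(?<y>\\w+)");
            } else {
                regex.append(Pattern.quote(token)); // literal word of the pattern
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Phase 3: returns the proposed pair, or an empty list if the pattern does not match. */
    static List<String> applyPattern(String pattern, String sentence) {
        Matcher m = toRegex(pattern).matcher(sentence);
        return m.matches() ? List.of(m.group("x"), m.group("y")) : List.of();
    }

    public static void main(String[] args) {
        String pattern = collectPattern("Chopin was born in 1810", "Chopin", "1810").orElseThrow();
        System.out.println(pattern);                                     // X was born in Y
        List<String> pair = applyPattern(pattern, "Vivaldi was born in 1678");
        System.out.println(pair.get(0) + "/" + pair.get(1));             // Vivaldi/1678
    }
}
```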
What is special about LEILA
The pattern matching approach is very simple and widely used. Unlike previous systems, LEILA performs a deep linguistic analysis of the documents with the Link Grammar Parser. Thus, its patterns are deeper and more robust than simple surface patterns. Furthermore, LEILA bounds its patterns by counterexamples and generalizes them by machine learning.
Please find more information about LEILA in the tabs below.
- LEILA source code (Java) and documentation
This code is licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use the code, please cite our paper (publications.html).
- Browse the documentation
- See our corpora.
How to use LEILA
- Download the Java tools
- Download the Link Grammar Parser
- Download some recent version of Java (1.5+) if you don't have it
- Download the Java-source, the class-files and the documentation of LEILA here
- Run Leila.class. LEILA will tell you how to set it up.
How data flows in LEILA
The flow of data with LEILA is as follows:
The corpus can be any set of text or HTML documents. These documents can be spread across different folders or subfolders. The class HTML2LGI.java extracts the proper sentences from the corpus documents. Each document generates one LGI-file containing the sentences. These LGI-files are given to the Link Grammar Parser (called by LGParse.java), which produces parse trees for the sentences. Each LGI-file generates one LGO-file containing the parse trees. The class Train.java tries to find patterns for the target relation in the LGO-files. It generalizes these patterns and stores them as a model in an MDL-file. The class Test.java applies the model to extract output pairs for the target relation from the LGO-files. It stores them in one large plain text file. All of these steps are done automatically in the right order.
Train.java must know the target relation. The target relation is given by a function that decides whether a pair of words is an example, a counterexample or a candidate for the relation. This function should be implemented in a class that extends Relation.java. To LEILA, it does not matter how the function actually works internally. The most common way is to load a list of example pairs from a text file. To decide whether a pair of words is an example pair, the function can just check whether the pair is in the list. Often, the counterexamples need not be present in a list, but they can be deduced algorithmically on the fly. See the experimental section of "LEILA: Learning to Extract Information by Linguistic Analysis" for examples.
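Loading example pairs from a text file and checking membership could look like the sketch below. It does not extend Relation.java, whose methods are not described here. The class `ExampleListSketch` is a hypothetical name, and the file format (one pair per line, the two words separated by a tab) is an assumption made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a list of example pairs loaded from text. */
final class ExampleListSketch {

    private final Set<String> pairs = new HashSet<>();

    /** Reads one pair per line; the tab-separated format is an assumption. */
    ExampleListSketch(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length == 2) {      // skip blank or malformed lines
                    pairs.add(key(fields[0], fields[1]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds the lookup key for a pair of words. */
    private static String key(String first, String second) {
        return first.trim() + "\t" + second.trim();
    }

    /** A pair is an example exactly if it appears in the loaded list. */
    boolean isExample(String first, String second) {
        return pairs.contains(key(first, second));
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader on the real example file.
        ExampleListSketch list =
                new ExampleListSketch(new StringReader("Chopin\t1810\nVivaldi\t1678\n"));
        System.out.println(list.isExample("Vivaldi", "1678")); // true
        System.out.println(list.isExample("Vivaldi", "1810")); // false
    }
}
```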
Existing relations in LEILA
The following relations ship with LEILA:
InstanceOf.java (extends Relation.java) is the relation between an entity and its concept. The example pairs come from WordNet and are included in the distribution.
Synonymy.java (extends Relation.java) is the relation between synonymous words. The example pairs come from WordNet and are included in the distribution.
Headquarters.java (extends Relation.java) is the relation between a company and the city of its headquarters. The example pairs are not included in the distribution, because they depend on the corpus, which is copyright restricted.
Birthdates.java (extends Relation.java) is the relation between a person and her birth date. The example pairs are not included in the distribution due to copyright restrictions.
SimpleFunction.java (extends Relation.java) is a many-to-one relation for demonstration purposes. The example pairs are included in the distribution.
StupidRelation.java (extends Relation.java) is a relation of just one pair for debugging purposes. The example pair is hard-coded in the source.
What this is
This is a set of corpora for relation extraction. Given a semantic target relation and a natural language corpus, relation extraction is the task of extracting all pairs of entities in the corpus that stand in the target relation. For example, if the target relation is instanceOf and the corpus contains the following passage
"President Mickey M. Mouse was happy to visit the city of Washington D.C., which is the capital of the United States."
then the goal is to extract the following pairs:
|Mickey M. Mouse||president|
This web site provides corpora for evaluating Relation Extraction systems. For each document in the corpus, we provide a list of manually extracted ideal pairs that should be extracted by the system. Note that these pairs are not linked to the original sentence, but only to the document. The corpora were used with LEILA.
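One common way to use such ideal pairs is to compare them with a system's output for the same document and compute precision and recall. The text here does not say which measures to use, so the sketch below is only an illustration under that assumption. The class `PairEvaluationSketch` is a hypothetical name, pairs are represented as plain strings, and the second extracted pair in the demo is an invented wrong output.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: compares extracted pairs with the ideal pairs of one document. */
final class PairEvaluationSketch {

    /** Number of pairs that occur in both sets. */
    private static int overlap(Set<String> extracted, Set<String> ideal) {
        Set<String> common = new HashSet<>(extracted);
        common.retainAll(ideal);
        return common.size();
    }

    /** Fraction of the extracted pairs that are ideal pairs (0 if nothing was extracted). */
    static double precision(Set<String> extracted, Set<String> ideal) {
        return extracted.isEmpty() ? 0.0 : (double) overlap(extracted, ideal) / extracted.size();
    }

    /** Fraction of the ideal pairs that were extracted (0 if there are no ideal pairs). */
    static double recall(Set<String> extracted, Set<String> ideal) {
        return ideal.isEmpty() ? 0.0 : (double) overlap(extracted, ideal) / ideal.size();
    }

    public static void main(String[] args) {
        Set<String> ideal = Set.of("Mickey M. Mouse/president");
        // Invented system output: one correct pair and one wrong pair.
        Set<String> extracted = Set.of("Mickey M. Mouse/president", "Washington D.C./president");
        System.out.println(precision(extracted, ideal)); // 0.5
        System.out.println(recall(extracted, ideal));    // 1.0
    }
}
```

Because the ideal pairs are linked to documents rather than sentences, a comparison of this kind is naturally done per document.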
What types of files we have
|html||the original document|
|lgi||the proper sentences of the original document (Link Grammar Input), as extracted by HTML2LGI.java|
|ll||the non-grammatical parts of the sentences of the original document, as extracted by HTML2LGI.java|
|lgo||the parsed version of the proper sentences (Link Grammar Output), as produced by LGParse.java by calling the Link Grammar Parser|
|inst/birt/syn||the manually extracted ideal pairs, as produced by human annotators with HandTag.java.|
inst-files contain instanceOf-pairs, birt-files contain person/birthdate-pairs and syn-files contain synonymy pairs.
The manually extracted pairs are licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use them, please cite our paper. The other files are licensed under the GNU Free Documentation License, unless their authors have placed them under different terms.
|Corpus||# Docs||Relation||# annotated||Remarks|
|Googlecomposers||492||instanceOf||100||We used Google to search for the baroque, classical and romantic composers of Wikipedia. We downloaded the first page in the result list (using the "I'm feeling lucky" button) excluding Wikipedia pages. This corpus is highly incoherent, containing advertisements as well as pages with no proper sentences at all|
|Wikicomposers||872||instanceOf, person/birthdate||87||All Wikipedia articles about composers|
|Wikigeography||313||synonymy||130||All Wikipedia articles about the geography of countries|
|Wikigeneral (-)||223||instanceOf||223||Some random Wikipedia articles.|