LEILA: Learning to Extract Information by Linguistic Analysis

What LEILA is

LEILA is a system that extracts pairs of a relation from a set of HTML documents. For example, it can extract persons with their birth dates, companies with their headquarters, or entities with their concepts. LEILA is part of the YAGO-NAGA project at the Max Planck Institute for Informatics in Saarbrücken, Germany. LEILA is no longer actively maintained, so there is no software support for it.

What LEILA needs

As input, LEILA needs a set of pairs that are in the relation (the examples) and a set of pairs that are not in the relation (the counterexamples). For instance, for the birthdate-relation, the examples could be

Frederic Chopin	1810
Wolfgang Amadeus Mozart	1756
...	...

The counterexamples could be

Frederic Chopin	1980
Wolfgang Amadeus Mozart	2010
...	...

The examples and counterexamples are given by a Java-method. This means that the counterexamples need not be enumerated. For instance, the Java-method can simply say that any pair of a person that is listed in the examples and a "wrong" birthdate is a counterexample.
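
The following is a minimal sketch of such a method, assuming the examples are held in a map from person to birth year. The names `BirthdateSketch`, `BIRTH_YEARS`, `isExample` and `isCounterexample` are hypothetical and are not taken from the LEILA source code.

```java
import java.util.Map;

// Hypothetical illustration; not LEILA's actual API.
class BirthdateSketch {
    // Seed examples taken from the listing above; a real list would be longer.
    static final Map<String, String> BIRTH_YEARS = Map.of(
        "Frederic Chopin", "1810",
        "Wolfgang Amadeus Mozart", "1756");

    /** A pair is an example if the person is listed with exactly this year. */
    static boolean isExample(String person, String year) {
        return year.equals(BIRTH_YEARS.get(person));
    }

    /** A pair is a counterexample if the person is listed, but with another year. */
    static boolean isCounterexample(String person, String year) {
        String known = BIRTH_YEARS.get(person);
        return known != null && !known.equals(year);
    }

    public static void main(String[] args) {
        System.out.println(isExample("Frederic Chopin", "1810"));        // true
        System.out.println(isCounterexample("Frederic Chopin", "1980")); // true
        System.out.println(isCounterexample("Antonio Vivaldi", "1678")); // false (person not listed)
    }
}
```

Because the counterexamples are computed from the examples, no second list has to be maintained.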

How LEILA works

LEILA works in 3 phases:

It finds all sentences in which an example pair appears. It collects the pattern in which the example pair appears (the positive patterns). For instance, for the following sentence "Chopin was born in 1810", it would extract the pattern "X was born in Y". Then, LEILA runs again through the documents and finds all sentences in which a positive pattern matches (possibly approximately), but a counterexample stands in the place of X and Y. The corresponding pattern is collected as a negative pattern.
LEILA generalizes the positive patterns by machine learning techniques.
LEILA runs through the documents again and finds all sentences in which a generalized positive pattern matches. It proposes the words in place of X and Y as a new pair. For example, if it finds the sentence "Vivaldi was born in 1678", it will propose the pair Vivaldi/1678.
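
The first and the third phase can be illustrated with a strongly simplified sketch. It works on plain surface strings and single-word arguments, whereas LEILA itself matches patterns on the linkages produced by the Link Grammar Parser; the generalization step and the negative patterns are left out. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified surface-string illustration; LEILA uses parse structures instead.
class SurfacePatternSketch {
    /** Phase 1: turns a sentence containing the pair (x, y) into "X ... Y". */
    static Optional<String> extractPattern(String sentence, String x, String y) {
        int xi = sentence.indexOf(x);
        if (xi < 0) return Optional.empty();
        int yi = sentence.indexOf(y, xi + x.length());
        if (yi < 0) return Optional.empty();
        // Keep only the text between the two arguments.
        return Optional.of("X" + sentence.substring(xi + x.length(), yi) + "Y");
    }

    /** Phase 3: applies a pattern of the form "X ... Y" and proposes a pair. */
    static Optional<String[]> apply(String pattern, String sentence) {
        // Strip the leading X and the trailing Y to get the connecting text.
        String middle = pattern.substring(1, pattern.length() - 1);
        Pattern regex = Pattern.compile("(\\w+)" + Pattern.quote(middle) + "(\\w+)");
        Matcher m = regex.matcher(sentence);
        if (!m.find()) return Optional.empty();
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        String pattern = extractPattern("Chopin was born in 1810", "Chopin", "1810")
            .orElseThrow();
        System.out.println(pattern); // X was born in Y

        String[] pair = apply(pattern, "Vivaldi was born in 1678").orElseThrow();
        System.out.println(pair[0] + "/" + pair[1]); // Vivaldi/1678
    }
}
```

A surface pattern like this one fails as soon as the wording changes slightly, which is the weakness that deeper linguistic patterns are meant to address.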

What is special about LEILA

The pattern matching approach is very simple and widely used. Unlike previous systems, LEILA uses a deep linguistic analysis of the documents by the Link Grammar Parser. Thus, its patterns are deeper and more robust than simple surface patterns. Furthermore, LEILA bounds its patterns by counterexamples and generalizes them by machine learning.

People

Suchanek, Fabian
Ifrim, Georgiana
Weikum, Gerhard

Publications

Main paper: "Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents" (pdf, bib, Technical Report)
"LEILA: Learning to Extract Information by Linguistic Analysis" (pdf, ppt, bib)

Downloads

LEILA source code (Java) and documentation
This code is licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use the code, please cite our paper.
Browse the documentation
See our corpora.

How to use LEILA

Download the Java tools
Download the Link Grammar Parser
Download some recent version of Java (1.5+) if you don't have it
Download the Java-source, the class-files and the documentation of LEILA here
Run Leila.class. LEILA will tell you how to set it up.

How data flows in LEILA

The flow of data with LEILA is as follows:

The corpus can be any set of text or HTML documents. These documents can be spread across different folders or subfolders. The class HTML2LGI.java extracts the proper sentences from the corpus documents. Each document generates one LGI-file containing the sentences. These LGI-files are given to the Link Grammar Parser (called by LGParse.java), which produces parse trees for the sentences. Each LGI-file generates one LGO-file containing the parse trees. The class Train.java tries to find patterns for the target relation in the LGO-files. It generalizes these patterns and stores them as a model in an MDL-file. The class Test.java applies the model to extract output pairs for the target relation from the LGO-files. It stores them in one large plain text file. All of these steps are done automatically in the right order by Leila.java.

Train.java must know the target relation. The target relation is given by a function that decides whether a pair of words is an example, a counterexample or a candidate for the relation. This function should be implemented in a class that extends Relation.java. To LEILA, it does not matter how the function works internally. The most common way is to load a list of example pairs from a text file; to decide whether a pair of words is an example pair, the function then just checks whether the pair is in the list. Often, the counterexamples need not be listed explicitly, because they can be deduced algorithmically on the fly. See the experimental section of "LEILA: Learning to Extract Information by Linguistic Analysis" for examples.
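
The list-based approach could look roughly like the sketch below. It assumes a file format with one pair per line and a tab between the two arguments; this format, the class name `ExampleListSketch` and its methods are assumptions made for illustration. The sketch does not extend Relation.java, because the methods that class requires are described in the LEILA documentation and not here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the tab-separated format is an assumption.
class ExampleListSketch {
    private final Set<String> pairs = new HashSet<>();

    /** Parses the content of an example file: one "first<TAB>second" pair per line. */
    static ExampleListSketch parse(String content) {
        ExampleListSketch list = new ExampleListSketch();
        for (String line : content.split("\\R")) {
            String[] parts = line.split("\t");
            // Skip empty or malformed lines instead of failing.
            if (parts.length == 2) list.pairs.add(key(parts[0], parts[1]));
        }
        return list;
    }

    /** Joins both arguments into one lookup key, ignoring surrounding spaces. */
    private static String key(String first, String second) {
        return first.trim() + "\t" + second.trim();
    }

    /** A pair is an example exactly if it occurs in the loaded list. */
    boolean isExample(String first, String second) {
        return pairs.contains(key(first, second));
    }

    public static void main(String[] args) {
        // In practice the content would be read from a file,
        // e.g. with java.nio.file.Files.readString (Java 11+).
        String content = "Frederic Chopin\t1810\nWolfgang Amadeus Mozart\t1756\n";
        ExampleListSketch list = parse(content);
        System.out.println(list.isExample("Frederic Chopin", "1810")); // true
        System.out.println(list.isExample("Frederic Chopin", "1980")); // false
    }
}
```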

Existing relations in LEILA

The following relations ship with LEILA:

InstanceOf.java (extends Relation.java) is the relation between an entity and its concept. The example pairs come from WordNet and are included in the distribution.
Synonymy.java (extends Relation.java) is the relation between synonymous words. The example pairs come from WordNet and are included in the distribution.
Headquarters.java (extends Relation.java) is the relation between a company and the city of its headquarters. The example pairs are not included in the distribution, because they depend on the corpus, which is copyright restricted.
Birthdates.java (extends Relation.java) is the relation between a person and her birth date. The example pairs are not included in the distribution due to copyright restrictions.
SimpleFunction.java (extends Relation.java) is a many-to-one relation for demonstration purposes. The example pairs are included in the distribution.
StupidRelation.java (extends Relation.java) is a relation of just one pair for debugging purposes. The example pair is hard-coded in the source.

Corpora

What this is

This is a set of corpora for relation extraction. Given a semantic target relation and a natural language corpus, relation extraction is the task of extracting all pairs of entities in the corpus that stand in the target relation. For example, if the target relation is instanceOf and the corpus contains the following passage

"President Mickey M. Mouse was happy to visit the city of Washington D.C., which is the capital of the United States."

then the goal is to extract the following pairs:

`instanceOf`

	Mickey M. Mouse	president
	Washington D.C.	city
	Washington D.C.	capital

This web site provides corpora for evaluating Relation Extraction systems. For each document in the corpus, we provide a list of manually extracted ideal pairs that should be extracted by the system. Note that these pairs are not linked to the original sentence, but only to the document. The corpora were used with LEILA.
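
One plausible way to use the ideal pairs is to compute precision and recall for the pairs a system extracts from a document. The sketch below assumes that each pair is encoded as a single tab-separated string; the choice of metrics, the class name and the sample system output are illustrative assumptions, not part of the corpora.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical evaluation helper; pairs are encoded as "entity<TAB>concept".
class PairEvaluationSketch {
    /** Number of pairs that occur in both sets. */
    private static int overlap(Set<String> extracted, Set<String> ideal) {
        Set<String> common = new HashSet<>(extracted);
        common.retainAll(ideal);
        return common.size();
    }

    /** Fraction of extracted pairs that are ideal pairs. */
    static double precision(Set<String> extracted, Set<String> ideal) {
        if (extracted.isEmpty()) return 0.0;
        return (double) overlap(extracted, ideal) / extracted.size();
    }

    /** Fraction of ideal pairs that were extracted. */
    static double recall(Set<String> extracted, Set<String> ideal) {
        if (ideal.isEmpty()) return 0.0;
        return (double) overlap(extracted, ideal) / ideal.size();
    }

    public static void main(String[] args) {
        Set<String> ideal = Set.of(
            "Mickey M. Mouse\tpresident",
            "Washington D.C.\tcity",
            "Washington D.C.\tcapital");
        // Made-up system output: one correct pair and one wrong pair.
        Set<String> extracted = Set.of(
            "Washington D.C.\tcity",
            "Mickey M. Mouse\tcity");
        System.out.println(precision(extracted, ideal)); // 0.5
        System.out.println(recall(extracted, ideal));    // 1 of 3 ideal pairs found
    }
}
```

Since the ideal pairs are linked to documents and not to sentences, the comparison is done per document.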

What types of files we have

	`html`	the original document
	`lgi`	the proper sentences of the original document (Link Grammar Input), as extracted by HTML2LGI.java
	`ll`	the non-grammatical parts of the sentences of the original document, as extracted by HTML2LGI.java
	`lgo`	the parsed version of the proper sentences (Link Grammar Output), as produced by LGParse.java by calling the Link Grammar Parser
	`inst/birt/syn`	the manually extracted ideal pairs, as produced by human annotators with HandTag.java. `inst`-files contain instanceOf-pairs, `birt`-files contain person/birthdate-pairs and `syn`-files contain synonymy pairs.

The manually extracted pairs are licensed under the Creative Commons Attribution License by the author Fabian M. Suchanek. If you use them, please cite our paper. The other files are licensed under the GNU Free Documentation License, unless their authors have placed them under different terms.

Corpora

Corpus	# Docs	Relation	# annotated	Remarks
Googlecomposers	492	instanceOf	100	We used Google to search for the baroque, classical and romantic composers of Wikipedia. We downloaded the first page in the result list (using the "I'm feeling lucky" button) excluding Wikipedia pages. This corpus is highly incoherent, containing advertisements as well as pages with no proper sentences at all
Wikicomposers	872	instanceOf, person/birthdate	87	All Wikipedia articles about composers
Wikigeography	313	synonymy	130	All Wikipedia articles about the geography of countries
Wikigeneral (-)	223	instanceOf	223	Some random Wikipedia articles.