IBEX: Id-Based Entity Extraction

The goal of the IBEX project is to extract entities from the Web. We focus on entities with unique identifiers, such as ISBNs (for books), GTINs (for commercial products), DOIs (for documents), and email addresses. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with the human-readable names of the entities at large scale. By exploiting the properties of unique identifiers in a few simple steps, we can filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73-96% and a very high coverage compared to existing knowledge bases. We use this database to compute statistics on the presence of products, people, and other entities on the Web.
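
One property of identifiers that lends itself to noise filtering is the check digit: GTINs (and therefore ISBN-13s) carry a mod-10 check digit, and ISBN-10s a mod-11 one. The Python sketch below illustrates how such a check could discard digit strings that merely look like identifiers. It is not the IBEX implementation; the function names and the candidate pattern are assumptions made for illustration only.

```python
import re

def gtin_is_valid(digits: str) -> bool:
    """Verify the mod-10 check digit of a GTIN-8/12/13/14 (an ISBN-13 is a GTIN-13)."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (8, 12, 13, 14):
        return False
    body, check = digits[:-1], int(digits[-1])
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

def isbn10_is_valid(isbn: str) -> bool:
    """Verify the mod-11 checksum of an ISBN-10 (final character may be 'X' = 10)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    # Weights run from 10 down to 1; the weighted sum must be divisible by 11.
    total = sum((10 - i) * (10 if c == "X" else int(c))
                for i, c in enumerate(isbn))
    return total % 11 == 0

# Hypothetical candidate pattern: any standalone run of 13 digits.
CANDIDATE_GTIN13 = re.compile(r"\b[0-9]{13}\b")

def extract_gtin13(text: str) -> list[str]:
    """Return the 13-digit strings in `text` whose check digit is consistent."""
    return [m for m in CANDIDATE_GTIN13.findall(text) if gtin_is_valid(m)]
```

For instance, `extract_gtin13` applied to a page snippet keeps a well-formed ISBN-13 but drops an arbitrary 13-digit order number whose last digit does not match the checksum. A check of this kind only removes syntactically impossible candidates; it cannot tell whether a valid-looking number actually refers to a real entity.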

Publications

Aliaksandr Talaika, Joanna “Asia” Biega, Antoine Amarilli, Fabian M. Suchanek:
“IBEX: Harvesting Entities from the Web Using Unique Identifiers”
Workshop paper at Web and Databases (WebDB) at SIGMOD, 2015
Technical report

Results

Chemical substances: gold standard, data
Chemical formulas: gold standard, data
Documents: gold standard, data
Emails: gold standard, data (too large; see sample)
Products: gold standard, data

Analyses

We provide here the raw data of our analyses. See the technical report for a detailed discussion of our results.