IBEX: Id-Based Entity Extraction
The goal of the IBEX project is to extract entities from the Web. We focus on entities that have unique identifiers, such as ISBNs (for books), GTINs (for commercial products), DOIs (for documents), and email addresses. We show how these identifiers can be harvested systematically from Web pages and how they can be associated with the human-readable names of the entities at large scale. By exploiting the properties of unique identifiers in a few simple steps, we filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73-96% and very high coverage compared to existing knowledge bases. We use this database to compute statistics on the presence of products, people, and other entities on the Web.
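To make the idea of harvesting identifiers and filtering noise more concrete, here is a minimal sketch in Python. It is not the IBEX pipeline itself: it assumes that one usable property of an identifier is its check digit, and it illustrates this with the standard GTIN-13 / ISBN-13 checksum. The function names `gtin13_is_valid` and `harvest_gtin13` and the candidate pattern are hypothetical, chosen only for this illustration.

```python
import re
from typing import List

# Hypothetical candidate pattern: exactly 13 ASCII digits that are not part
# of a longer run of digits. Real pages would also need hyphen/space handling.
CANDIDATE = re.compile(r"(?<![0-9])[0-9]{13}(?![0-9])")


def gtin13_is_valid(code: str) -> bool:
    """Return True if a 13-digit string carries a correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The 13th digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == digits[12]


def harvest_gtin13(text: str) -> List[str]:
    """Collect 13-digit candidates from text and keep only checksum-valid ones."""
    return [m for m in CANDIDATE.findall(text) if gtin13_is_valid(m)]
```

For example, `harvest_gtin13("ISBN 9780306406157, order no. 1234567890123")` keeps the first number and discards the second, because only the first has a matching check digit. A checksum test of this kind removes many random digit strings, but it cannot tell whether a syntactically valid identifier refers to a real entity.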
- Aliaksandr Talaika, Joanna “Asia” Biega, Antoine Amarilli, Fabian M. Suchanek:
“IBEX: Harvesting Entities from the Web Using Unique Identifiers”
Workshop paper at Web and Databases (WebDB) at SIGMOD, 2015
- Technical report
- Chemical substances: gold standard, data
- Chemical formulas: gold standard, data
- Documents: gold standard, data
- Emails: gold standard, (too large, see sample)
- Products: gold standard, data
We provide here the raw data of our analyses. See the technical report for a detailed discussion of our results.