EDRAK is an entity-centric resource that contains a catalog of 2.4M entities resolved from both the Arabic and English Wikipedias. EDRAK offers a comprehensive dictionary of Arabic names for entities. In addition, each entity has a contextual characteristic description in the form of keyphrases. Keyphrases and keywords are assigned scores based on their popularity and correlation with different entities. The dictionaries of EDRAK are built using automatic techniques that go beyond harvesting the manually crafted data in Wikipedia.
- edrak_en20150112_ar20141218.sql.bz2 (22GB) [MD5: 20ce8b446a0cbea5329a6c960dc841]
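Given the size of the archive, it may be worth checking the download against the MD5 value listed above before unpacking it. The following is a minimal sketch in Python, not part of EDRAK itself; the helper name `md5_of_file` and the chunk size are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; chunked reading keeps memory
    use flat even for a multi-gigabyte archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("edrak_en20150112_ar20141218.sql.bz2")` and comparing the result with the published checksum should reveal a corrupted or incomplete download.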
The dataset used in the experiments in our EMNLP 2011 paper, Robust Disambiguation of Named Entities in Text, can be downloaded here:
- aida-yago2-dataset.zip (419 KB)
- The dataset was updated on 2013-11-21, adding all but 7 Freebase MIDs, as well as Wikipedia IDs.
It contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase MID (thanks to Massimiliano Ciaramita from Google Zürich for creating the Wikipedia/Freebase mapping and making it available to us). The zip contains a README.txt with details about the format, as well as instructions on how to create the dataset from the original CoNLL 2003 dataset, which is required.
We also provide the mention-entity candidate mapping used in our experiments in Robust Disambiguation of Named Entities in Text. It is an extension of the YAGO2 means relation:
- aida_means.tsv.bz2 (156 MB)
This file contains two tab-separated columns. The first column is a quoted string denoting a potential mention that can be recognized in the input text, and the second column is one entity candidate for this mention. Both columns are encoded in the YAGO2 format; see the YAGO2 downloads for decoding utilities.
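As an illustration of the two-column layout described above, the following Python sketch groups the candidate entities by mention. It is a minimal example under stated assumptions, not the loader used in AIDA: the function name `load_means` is hypothetical, the file is assumed to be UTF-8 text, and the YAGO2 encoding of both columns is left untouched, so any decoding would still need the YAGO2 utilities.

```python
import bz2
from collections import defaultdict


def load_means(path):
    """Map each mention string to the list of its candidate entities.

    Hypothetical helper for illustration. Values stay in their YAGO2
    encoding; only the surrounding double quotes of the mention are removed.
    """
    candidates = defaultdict(list)
    # bz2.open in text mode decompresses on the fly, line by line.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            mention, sep, entity = line.partition("\t")
            if not sep:
                # Skip blank lines or lines without a tab separator.
                continue
            if len(mention) >= 2 and mention[0] == '"' and mention[-1] == '"':
                mention = mention[1:-1]
            candidates[mention].append(entity)
    return dict(candidates)
```

A lookup such as `load_means("aida_means.tsv.bz2").get("Paris", [])` would then return all entity candidates listed for that mention. Note that the full mapping is held in memory, which may be substantial for a file of this size.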
The dataset used in the experiments in our WWW 2014 paper, Discovering Emerging Entities with Ambiguous Names, can be downloaded here:
- AIDA-EE.tar.gz (119 KB)
The AIDA-EE Dataset contains 300 documents with 9,976 entity names linked to Wikipedia (2010-08-17 dump). The documents themselves are taken from the APW part of the GIGAWORD5 dataset, with 150 documents from 2010-10-01 (development data) and 150 documents from 2010-11-01 (test data). Due to licensing issues, we do not provide the document content, just the offsets with the entity annotations.
The datasets used in the experiments in our CIKM 2012 paper, KORE: Keyphrase Overlap Relatedness for Entity Disambiguation, can be downloaded here:
- KORE_entity_relatedness.tar.gz (5 KB): We selected 20 seed entities from 4 domains (IT companies, Hollywood celebrities, video games, television series). For each of these entities, we selected 20 entities linked from its Wikipedia article, which were then ranked by human annotators on Mechanical Turk.
- KORE50.tar.gz (5 KB): 50 hand-crafted, difficult sentences containing a large number of very ambiguous mentions.
All datasets are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.