We developed a new dataset, called MALT (for “Multi-token, Ambiguous, Long-Tailed facts”), with an emphasis on the long-tail challenge.
In the dataset, 65.3% of the triple facts have a multi-word phrase as the object (O) entity, and 58.6% are ambiguous facts, in which the subject (S) or object (O) entity shares an identical alias name with another entity in Wikidata.
For example, the two ambiguous entities “Birmingham, West Midlands (Q2256)” and “Birmingham, Alabama (Q79867)” have the same Label value “Birmingham”.
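
As a rough illustration of this kind of ambiguity, the sketch below groups entities by name and keeps the names shared by more than one entity. The function name `find_ambiguous_labels` and the in-memory `entity_labels` mapping are hypothetical and used only for illustration; they are not part of the MALT release, and the actual dataset construction may differ.

```python
from collections import defaultdict


def find_ambiguous_labels(entity_labels):
    """Return {name: sorted entity IDs} for every name used by 2+ entities.

    `entity_labels` maps an entity ID to its list of labels/aliases
    (a hypothetical input format, assumed for this sketch).
    """
    by_label = defaultdict(set)
    for qid, names in entity_labels.items():
        for name in names:
            by_label[name].add(qid)
    # Keep only names that point to more than one distinct entity.
    return {name: sorted(qids) for name, qids in by_label.items() if len(qids) > 1}


# Toy input: the two Birmingham entities from the text plus one
# unambiguous entity added only for contrast.
entity_labels = {
    "Q2256": ["Birmingham"],
    "Q79867": ["Birmingham"],
    "Q84": ["London"],
}
ambiguous = find_ambiguous_labels(entity_labels)
```

Here `ambiguous` contains a single entry for “Birmingham”, listing both entity IDs.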
In total, 87.0% of the sampled facts have entities in the long tail, where we define a long-tail entity as one with at most 13 Wikidata triples.
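
The long-tail criterion can be written as a simple threshold check. This is a minimal sketch: the names `LONG_TAIL_MAX_TRIPLES`, `is_long_tail`, and `fact_in_long_tail` are hypothetical, and treating a fact as long-tail when either its subject or its object is a long-tail entity is an assumption, since the text does not say which entity is counted.

```python
# Threshold taken from the definition above: at most 13 Wikidata triples.
LONG_TAIL_MAX_TRIPLES = 13


def is_long_tail(triple_count, threshold=LONG_TAIL_MAX_TRIPLES):
    """True if an entity with `triple_count` Wikidata triples is long-tail."""
    return triple_count <= threshold


def fact_in_long_tail(subject_triples, object_triples):
    """Assumption: a fact counts as long-tail if either entity is long-tail."""
    return is_long_tail(subject_triples) or is_long_tail(object_triples)
```

For instance, an entity with exactly 13 triples is long-tail under this check, while one with 14 is not.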
The dataset is available for download as a zip file (217.5 MB).