Databases and Information Systems

Knowledge Base Completion for Long-Tail Entities

In this paper, we propose an unsupervised framework for knowledge base completion and a new dataset named MALT for evaluation. The main benefits of our approach are:

  • It is fully prompt-based: to extract a new relation, the only thing needed is to design a prompt.
  • It can deal with multi-token and ambiguous entities.
  • It works well on long-tail entities.

Our method employs two different LMs in a two-stage pipeline, as shown in the figure above. The first stage generates candidate answers to input prompts and gives cues to retrieve informative sentences from Wikipedia and other sources. The second stage validates the candidates and maps the retained answer strings to entities in the underlying KG (e.g., mapping “Yves Desrosiers” to Yves Desrosiers (guitarist)).
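The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the code in two_stage_pipeline.py: the function run_two_stages, the Fact class, the four callables, the threshold, and the way the two scores are multiplied are all hypothetical, and the toy stand-ins at the bottom replace the actual LMs and retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    score: float


def run_two_stages(
    subject: str,
    relation: str,
    prompt: str,
    generate: Callable[[str], List[str]],
    retrieve: Callable[[str, str], List[str]],
    validate: Callable[[str, List[str]], float],
    disambiguate: Callable[[str], List[Tuple[str, float]]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the repo
) -> List[Fact]:
    facts = []
    # Stage 1: generate candidate answer strings and retrieve supporting sentences.
    for candidate in generate(prompt):
        evidence = retrieve(subject, candidate)
        # Stage 2: validate the candidate, then map it onto KG entities.
        confidence = validate(candidate, evidence)
        if confidence < threshold:
            continue
        for entity, probability in disambiguate(candidate):
            # Multiplying the two scores is an illustrative choice only.
            facts.append(Fact(subject, relation, entity, confidence * probability))
    return sorted(facts, key=lambda fact: fact.score, reverse=True)


# Toy stand-ins for the two LMs and the retriever (not real models).
SENTENCES = ["In May 2009, her collaboration with Patrick Watson was released."]


def toy_generate(prompt: str) -> List[str]:
    return ["Patrick Watson", "Some Other Artist"]


def toy_retrieve(subject: str, candidate: str) -> List[str]:
    return [sentence for sentence in SENTENCES if candidate in sentence]


def toy_validate(candidate: str, evidence: List[str]) -> float:
    return 1.0 if evidence else 0.0


def toy_disambiguate(candidate: str) -> List[Tuple[str, float]]:
    return [(candidate + " (musician)", 0.6), (candidate + " (producer)", 0.4)]


if __name__ == "__main__":
    for fact in run_two_stages(
        "Lhasa de Sela", "collaborator", "Lhasa de Sela collaborated with [MASK].",
        toy_generate, toy_retrieve, toy_validate, toy_disambiguate,
    ):
        print(fact)
```

Passing the stages in as callables keeps the sketch independent of any particular model; candidates without supporting sentences are dropped before disambiguation.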


The repo structure is shown below. The two_stage_pipeline.py file contains the code of our two-stage framework, the GENRE path is a submodule used for the corroboration step, and the data path contains the MALT dataset.


¦   candidate_generation.py
¦   corroboration.py
¦   evaluate.py
¦   README.md
¦   template.py
¦   two_stage_pipeline.py
¦   utils.py
¦       entity_name_qid.txt
¦       gold_wikidata.json
¦       malt_eval.txt
¦       malt_hold_out.txt
¦       wikipedia.json
¦       framework.png
¦       malt.png
¦       prompt.png

Data Preparation

We developed a new dataset with an emphasis on the long-tail challenge, called MALT (for “Multi-token, Ambiguous, Long-Tailed facts”). After downloading, put the MALT file in the root path. There are five files in the MALT dataset:

  • malt_eval.txt contains entity IDs for evaluation
  • malt_hold_out.txt contains entity IDs for adjusting the hyper-parameters
  • gold_wikidata.json contains the gold facts
  • wikipedia.json contains the corresponding Wikipedia pages
  • entity_name_qid.txt presumably maps entity names to their QIDs (judging by the file name)
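As a rough sketch, the files could be loaded as shown below. The internal layout of the files is not described here, so the code assumes one entity ID per line in the .txt files and plain JSON documents in the .json files; the helper names read_entity_ids, read_json, and load_malt are hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_entity_ids(path: Path) -> List[str]:
    """Read one entity ID per non-empty line (assumed layout)."""
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_json(path: Path) -> Any:
    """Parse a JSON file without assuming anything about its schema."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def load_malt(root: Path) -> Dict[str, Any]:
    """Collect the MALT files found under `root` into one dictionary."""
    return {
        "eval_ids": read_entity_ids(root / "malt_eval.txt"),
        "hold_out_ids": read_entity_ids(root / "malt_hold_out.txt"),
        "gold": read_json(root / "gold_wikidata.json"),
        "wikipedia": read_json(root / "wikipedia.json"),
    }
```

Calling load_malt(Path("data")) would then read the files from a data directory; adjust the path to wherever the dataset was placed.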

The table below shows the statistics of the MALT dataset, which contains many multi-token, ambiguous, and long-tail facts.

Run Example

Given the input document:

Lhasa de Sela said that the song was about inner happiness and "feeling my feet in the earth, having a place in the world, of things taking care of themselves." In May 2009, her collaboration with Patrick Watson was released.

We'd like to extract the collaborators for the singer "Lhasa de Sela".


python two_stage_pipeline.py -run_example True


The output is shown below:

( Lhasa de Sela, collaborator, Patrick Watson, 0.4763992584808626 )
( Lhasa de Sela, collaborator, Patrick Watson (musician), 0.3224404241174992 )
( Lhasa de Sela, collaborator, Patrick Watson (producer), 0.2401321410226018 )
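Each output line has the form ( subject, relation, object, score ). For downstream processing, such lines could be parsed with a small helper like the one below. The function parse_fact_line is hypothetical and not part of the repository, and it assumes that the subject and relation contain no commas.

```python
from typing import Tuple


def parse_fact_line(line: str) -> Tuple[str, str, str, float]:
    """Parse '( subject, relation, object, score )' into a typed tuple."""
    text = line.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"not a fact line: {line!r}")
    inner = text[1:-1]
    # The score is the last comma-separated field, so split it off from the right.
    head, score_text = inner.rsplit(",", 1)
    # Split the rest from the left; the object keeps any later commas or parentheses.
    parts = [part.strip() for part in head.split(",", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected subject, relation, object in: {line!r}")
    subject, relation, obj = parts
    return subject, relation, obj, float(score_text)


if __name__ == "__main__":
    print(parse_fact_line(
        "( Lhasa de Sela, collaborator, Patrick Watson (musician), 0.3224404241174992 )"
    ))
```

Only the outermost parentheses are stripped, so disambiguated entity names such as Patrick Watson (musician) are kept intact.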



python two_stage_pipeline.py -run_example False


Running with -run_example False instead produces two files:

  • extracted_facts.txt stores the facts extracted by our framework
  • score.txt stores the evaluation scores (precision, recall, and F1)
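For reference, precision, recall, and F1 over sets of facts are commonly computed as in the sketch below. This uses the standard set-based definitions and is only an assumption about the metric; evaluate.py may match or aggregate facts differently, and the function name precision_recall_f1 is hypothetical.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def precision_recall_f1(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1; empty inputs yield 0.0."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With three predicted facts, four gold facts, and two of them in common, this gives a precision of 2/3, a recall of 1/2, and an F1 of 4/7.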


This work was partially funded by ANR-20-CHIA0012-01 (“NoRDF”).


The MALT dataset is licensed under CC BY-SA 4.0, and the code is licensed under the MIT license.