Knowledge Extraction on Fictional Domains

Fiction and fantasy are a core part of human culture, spanning traditional literature, movies, TV series and video games. Fictional universes often contain hundreds or even thousands of entities and types, and they are the subject of search-engine queries, from fans as well as cultural analysts. For example, fans may search for Muggles who are students of the House of Gryffindor (within the Harry Potter universe). Analysts may be interested in understanding character relationships, learning story patterns or investigating gender bias in different cultures.

The long-term goal of this project is to extract interesting information from fictional stories, mainly about characters: personal information (e.g. name, birth/death, title), interpersonal relationships (e.g. family relations, business relations, ally/enemy) and narratives (e.g. battles, who kills whom). The output is modeled as triples or n-ary tuples. Our project comprises three main steps:

  1. Taxonomy induction, building type systems for fictional domains, using noisy category systems from fan wikis or text extraction as input. We developed a fiction-targeted approach, called TiFi, which consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories.
  2. Entity typing, fine-grained type labelling for entities in fictional texts. Our method, called ENTYFI, builds on 205 automatically induced high-quality type systems for popular fictional domains, and exploits the overlap and reuse of these fictional domains for fine-grained typing in previously unseen texts. ENTYFI comprises five steps: type system induction, domain relatedness ranking, mention detection, mention typing, and type consolidation. The recall-oriented typing module combines a supervised neural model, unsupervised Hearst-style and dependency patterns, and knowledge base lookups. The precision-oriented consolidation stage uses co-occurrence statistics to remove noise and to identify the most relevant types.
  3. Knowledge extraction, automatically extracting facts about entities (e.g. characters, locations, organizations, events, etc.) in fictional texts.
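
To make the idea of Hearst-style typing and triple output more concrete, the following is a minimal sketch, not the ENTYFI implementation. The two regular-expression patterns, the `Triple` type, the `type` predicate name, and the naive `singularize` helper are all assumptions chosen for illustration; the real system combines such patterns with a neural model, dependency patterns and knowledge base lookups.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A (subject, predicate, object) fact; hypothetical output format."""
    subject: str
    predicate: str
    obj: str


# Assumption: an entity mention is a run of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Two illustrative Hearst-style patterns (not taken from ENTYFI):
#   "<Name> is a <type>"   and   "<types> such as <Name>, <Name> and <Name>"
COPULA = re.compile(rf"(?P<name>{NAME}) (?:is|was) an? (?P<type>[a-z]+)")
SUCH_AS = re.compile(
    rf"(?P<type>[a-z]+) such as (?P<names>{NAME}(?:(?:, and |, | and ){NAME})*)"
)


def singularize(noun: str) -> str:
    """Very naive plural stripping, sufficient only for this sketch."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def extract_type_triples(text: str) -> List[Triple]:
    """Return (entity, 'type', type) triples found by the two patterns."""
    triples = []
    for match in COPULA.finditer(text):
        triples.append(Triple(match["name"], "type", match["type"]))
    for match in SUCH_AS.finditer(text):
        # Split the enumerated names on the same separators the pattern allows.
        for name in re.split(r", and |, | and ", match["names"]):
            triples.append(Triple(name, "type", singularize(match["type"])))
    return triples


if __name__ == "__main__":
    sample = (
        "Gandalf is a wizard. "
        "The fellowship met hobbits such as Frodo Baggins, Sam and Pippin."
    )
    for triple in extract_type_triples(sample):
        print(triple)
```

Patterns like these tend to favor recall over precision, which is why a separate consolidation stage, as described in step 2, is needed to filter noisy types.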

Approaches developed for fictional domains have great potential for being carried over to real-life settings, such as enterprise-specific domains, medieval history, neurodegenerative diseases or nanotechnology material science.

TIFI

TiFi: Taxonomy Induction for Fictional Domains

ENTYFI

ENTYFI: Entity Typing in Fictional Texts

Publications

TiFi: Taxonomy Induction for Fictional Domains
Cuong Xuan Chu, Simon Razniewski, Gerhard Weikum
In Proc. WWW 2019

ENTYFI: Entity Typing in Fictional Texts
Cuong Xuan Chu, Simon Razniewski, Gerhard Weikum
In Proc. WSDM 2020