Information extraction

Advanced lecture, 6 ECTS credits, winter semester 2019–20

Basic Information

This advanced lecture focuses on how to construct knowledge bases using information extraction techniques. Topics include automated information extraction using patterns, supervised extractors and open information extraction, infobox crawling, entity disambiguation and normalization, learning over knowledge bases, and their use in question answering. We will also touch upon crowdsourced KB construction, evaluation measures, and some state-of-the-art knowledge bases. In the labs, participants will implement core components of information extraction step by step, using Wikipedia and wikis from the Wikia fan community site as sources.

Schedule

| #  | Date          | Lecture                                                  | Lab                                                      |
|----|---------------|----------------------------------------------------------|----------------------------------------------------------|
| 1  | 15.10.        | Introduction (pdf)                                       | Dataset familiarization (pdf)                            |
| 2  | 22.10.        | Knowledge representation (pdf)                           | Domain modelling (pdf) (sample solution)                 |
| 3  | 29.10.        | Crawling and Scraping (pdf)                              | Scraping (pdf)                                           |
| 4  | 12.11.*       | Entity typing (pdf)                                      | Entity typing from Wikipedia first sentence (pdf, files) |
| 5  | 19.11.        | Taxonomy induction, coreference and disambiguation (pdf) | Taxonomy induction (pdf)                                 |
| 6  | 26.11.        | Relation extraction (pdf)                                | Relation extraction (pdf, files)                         |
| 7  | 3.12.         | Relation extraction II (pdf)                             | OpenIE coding (pdf, files)                               |
| 8  | 10.12.        | Knowledge consolidation (pdf)                            | Rule mining (pdf, file)                                  |
| 9  | 17.12.        | Applications (pdf)                                       | Exam preparation                                         |
|    | (7.1.2020)    | (Backup slot)                                            |                                                          |
|    | 14.+15.1.2020 | Oral exam (E1 4 room 433, schedule)                      |                                                          |
|    | 24.3.2020     | Reexam (online, schedule)                                |                                                          |

* Attention: No lecture/lab on 5.11.

Rules and Grading

Assignments

  • There will be 8 weekly assignments
  • Each assignment submission receives a binary pass/fail score
  • To be admitted to take the final exam, at least 6 assignments have to be passed.
  • Weekly timeline:
    • Assignments are posted on Tuesday morning
    • The lab on Tuesday afternoon is intended for getting started on the assignments
    • Assignments are due on Saturday of the same week, at 23:59
    • Assessments are available Tuesday morning
  • Assignment results (link)

Exam

Further reading

Industry relevance (lecture 1):

  • Industry-Scale Knowledge Graphs: Lessons and Challenges, Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, Jamie Taylor, CACM, 2019 (link)

Knowledge representation (lecture 2):

  • Knowledge Representation and Rule Mining in Entity-Centric Knowledge Bases, Fabian M. Suchanek, Jonathan Lajus, Armand Boschin, Gerhard Weikum, RW, 2019 (link)

Crawling and scraping (lecture 3):

  • Resource efficiency in web crawling: Optimizing Update Frequencies for Decaying Information, Simon Razniewski, CIKM, 2016 (link)
  • Large-scale scraping of Wikipedia: DBpedia: A nucleus for a web of open data, Auer et al., ISWC 2007 (link)

Typing (lecture 4):

  • Information extraction (Chapter 2), Sunita Sarawagi, FnT, 2007 (link)
  • ENTYFI: Entity Typing in Fictional Texts, Chu et al., WSDM 2020 (link)

Taxonomy induction (lecture 5):

  • TAXI at SemEval-2016 Task 13: A taxonomy induction method based on lexico-syntactic patterns, substrings and focused crawling, Panchenko et al., SemEval 2016 (link)
  • Taxonomy induction using hypernym subsequences, Gupta et al., CIKM 2017 (link)
  • TiFi: Taxonomy Induction for Fictional Domains, Chu et al., WWW 2019 (link)

Coreference (lecture 5):

Disambiguation (lecture 5):

  • Robust Disambiguation of Named Entities in Text, Hoffart et al., EMNLP 2011 (link)