Group Leader: Simon Razniewski
Commonsense knowledge about object properties, human behavior and general concepts is crucial for robust AI applications. However, automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources.
Structured knowledge is important for many AI applications. Commonsense knowledge, which is crucial for robust human-centric AI, is covered by a small number of structured knowledge projects. However, they lack knowledge about human traits and behaviors conditioned on socio-cultural contexts, which is crucial for situative AI. In this project, we present Candle, an end-to-end methodology for extracting high-quality cultural commonsense knowledge (CCSK) at scale. Candle extracts CCSK assertions from a huge web corpus and organizes them into coherent clusters, for 3 domains of subjects (geography, religion, occupation) and several cultural facets (food, drinks, clothing, traditions, rituals, behaviors). Candle includes judicious techniques for classification-based filtering and scoring of interestingness. Experimental evaluations show the superiority of the Candle CCSK collection over prior works, and an extrinsic use case demonstrates the benefits of CCSK for the GPT-3 language model.
The output of Candle is a set of 1.1M CCSK assertions, organized into 60K coherent clusters. The set is organized by 3 domains of interest – geography, religion, occupation – with a total of 386 instances, referred to as subjects (or cultural groups). Per subject, the assertions cover 5 facets of culture: food, drinks, clothing, rituals, traditions (for geography and religion) or behaviors (for occupations). In addition, we also annotate each assertion with its salient concepts.
- Demo: https://candle.mpi-inf.mpg.de
- Download: https://candle.mpi-inf.mpg.de/download
- Code: https://github.com/cultural-csk/candle
- Tuan-Phong Nguyen, Simon Razniewski, Aparna Varde, and Gerhard Weikum. Extracting Cultural Commonsense Knowledge at Scale. WWW 2023. [pdf]
Commonsense knowledge about everyday concepts is an important asset for AI applications, such as question answering and chatbots. Recently, we have seen an increasing interest in the construction of structured commonsense knowledge bases (CSKBs). An important part of human commonsense is about properties that do not apply to concepts, yet existing CSKBs only store positive statements. Moreover, since CSKBs operate under the open-world assumption, absent statements are considered to have unknown truth rather than being invalid. We present the UNCOMMONSENSE framework for materializing informative negative commonsense statements. Given a target concept, comparable concepts are identified in the CSKB, for which a local closed-world assumption is postulated. This way, positive statements about comparable concepts that are absent for the target concept become seeds for negative statement candidates. The large set of candidates is then scrutinized, pruned and ranked by informativeness.
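The candidate-generation step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the UNCOMMONSENSE implementation: the toy knowledge base `TOY_CSKB`, the function name `negative_candidates`, and the frequency-based score are all hypothetical, and the comparable concepts are simply passed in rather than identified automatically.

```python
from collections import Counter

# Toy CSKB (made-up content): concept -> set of positive (predicate, object) statements.
TOY_CSKB = {
    "gorilla": {("is", "mammal"), ("lives in", "forest"), ("has", "fur")},
    "lion": {("is", "mammal"), ("has", "fur"), ("has", "tail"), ("is", "carnivore")},
    "tiger": {("is", "mammal"), ("has", "fur"), ("has", "tail"), ("is", "carnivore")},
    "monkey": {("is", "mammal"), ("has", "fur"), ("has", "tail"), ("lives in", "forest")},
}


def negative_candidates(target, comparable, kb):
    """Rank statements that hold for comparable concepts but are absent for the target.

    Under a local closed-world assumption over the comparable concepts, such
    absent statements become candidate negations. The score is the fraction of
    comparable concepts that carry the statement, a simple stand-in (assumption)
    for an informativeness measure.
    """
    known = kb[target]
    counts = Counter()
    for concept in comparable:
        # Statements of the comparable concept that the target does not have.
        for statement in kb[concept] - known:
            counts[statement] += 1
    scored = [(stmt, n / len(comparable)) for stmt, n in counts.items()]
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for statement, score in negative_candidates(
        "gorilla", ["lion", "tiger", "monkey"], TOY_CSKB
    ):
        print(f"gorilla NOT {statement[0]} {statement[1]} (score {score:.2f})")
```

In this toy setting, "has tail" ranks above "is carnivore" for gorilla because more of the comparable concepts share it. A real system would additionally scrutinize and prune such candidates, as the paragraph above notes.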
Project page: https://www.mpi-inf.mpg.de/uncommonsense
- Hiba Arnaout, Simon Razniewski, Gerhard Weikum, and Jeff Z. Pan, UnCommonSense: Informative Negative Knowledge about Everyday Concepts. CIKM'22 [PDF]
- Hiba Arnaout, Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum, and Jeff Z. Pan, UnCommonSense in Action! Informative Negations for Commonsense Knowledge Bases. WSDM'23 [DEMO] [VIDEO] [PDF]
Ascent++, a successor of the previous Ascent method, is a pipeline for automatically collecting, extracting and consolidating commonsense knowledge (CSK) from any English text corpus. Ascent++ is capable of extracting facet-enriched assertions, overcoming the common limitations of the triple-based knowledge model in traditional knowledge bases (KBs). Ascent++ also captures composite concepts with subgroups and related aspects, supplying even more expressiveness to CSK assertions.
Ascent++ KB is a CSKB extracted from the C4 crawl using the Ascent++ pipeline. It consists of 2 million CSK assertions about 10K popular concepts. The CSKB comes with two variants: one with open predicates (e.g., "be", "have", "live in", etc.) and one with the established ConceptNet schema with 19 pre-specified predicates (e.g., AtLocation, CapableOf, HasProperty, etc.).
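To make the difference between a plain triple and a facet-enriched assertion concrete, here is a small Python sketch. The class name `FacetedAssertion`, its field names, and the example assertion are illustrative assumptions; they do not reflect the actual Ascent++ data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetedAssertion:
    """A subject-predicate-object assertion with optional qualifying facets.

    Field names are hypothetical and chosen for illustration only.
    """

    subject: str
    predicate: str
    obj: str
    # Each facet is a (label, value) pair, e.g. ("purpose", "to pick up food").
    facets: tuple = ()

    def as_triple(self):
        """Project onto the plain triple model, dropping all facets."""
        return (self.subject, self.predicate, self.obj)

    def verbalize(self):
        """Render the assertion, including facet values, as a short phrase."""
        parts = [self.subject, self.predicate, self.obj]
        parts.extend(value for _, value in self.facets)
        return " ".join(parts)


if __name__ == "__main__":
    assertion = FacetedAssertion(
        subject="elephant",
        predicate="uses",
        obj="its trunk",
        facets=(("purpose", "to pick up food"),),
    )
    print(assertion.as_triple())
    print(assertion.verbalize())
```

The `as_triple` projection shows what a triple-only model would keep; the facet carries the extra qualification that the triple loses.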
- Tuan-Phong Nguyen, Simon Razniewski, Julien Romero, Gerhard Weikum. Refined Commonsense Knowledge from Large-Scale Web Contents. In IEEE Transactions on Knowledge and Data Engineering, 2022, doi: 10.1109/TKDE.2022.3206505. [pdf]
Ascent (Advanced Semantics for Commonsense Knowledge Extraction) is a pipeline for automatically collecting, extracting and consolidating commonsense knowledge (CSK) from the web. Ascent is capable of extracting facet-enriched assertions, overcoming the common limitations of the triple-based knowledge model in traditional knowledge bases (KBs). Ascent also captures composite concepts with subgroups and related aspects, supplying even more expressiveness to CSK assertions.
- Demo: https://ascent.mpi-inf.mpg.de
- Download: https://ascent.mpi-inf.mpg.de/download
- Code: https://github.com/phongnt570/ascent
- Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum. Advanced Semantics for Commonsense Knowledge Extraction. WWW 2021. [pdf]
- Tuan-Phong Nguyen, Simon Razniewski, Gerhard Weikum. Inside ASCENT: Exploring a Deep Commonsense Knowledge Base and its Usage in Question Answering. ACL 2021 - System Demonstrations. [pdf]
For compiling CSK based on text extraction, many concerns revolve around the issue of reporting bias, i.e., that frequency in text sources is not a good proxy for relevance or truth, especially for fundamental pieces of knowledge. This paper explores whether children's texts hold the key to commonsense knowledge extraction, based on the hypothesis that such content might make fewer assumptions about the reader's knowledge and therefore spell out commonsense more explicitly. An analysis with several corpora shows that children's texts indeed contain many more, and more typical, commonsense assertions. Moreover, experiments show that this advantage can be leveraged in popular language-model-based commonsense knowledge extraction settings, where task-unspecific fine-tuning on small amounts of children's texts already yields significant improvements. This provides a refreshing perspective different from the common trend of deriving progress from ever larger models and corpora.
- Julien Romero and Simon Razniewski. Do Children Texts Hold The Key To Commonsense Knowledge? EMNLP 2022. [pdf]
Commonsense knowledge about object properties, human behavior and general concepts is crucial for robust AI applications. However, automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources. This paper presents Quasimodo, a methodology and tool suite for distilling commonsense properties from non-standard web sources. We devise novel ways of tapping into search-engine query logs and QA forums, and combining the resulting candidate assertions with statistical cues from encyclopedias, books and image tags in a corroboration step. Unlike prior work on commonsense knowledge bases, Quasimodo focuses on salient properties that are typically associated with certain objects or concepts. Extensive evaluations, including extrinsic use-case studies, show that Quasimodo provides better coverage than state-of-the-art baselines with comparable quality.
Commonsense knowledge (CSK) supports a variety of AI applications, from visual understanding to chatbots. Prior works on acquiring CSK, such as ConceptNet, have compiled statements that associate concepts, like everyday objects or activities, with properties that hold for most or some instances of the concept. Each concept is treated in isolation from other concepts, and the only quantitative measure (or ranking) of properties is a confidence score that the statement is valid. This paper aims to overcome these limitations by introducing a multi-faceted model of CSK statements and methods for joint reasoning over sets of inter-related statements. Our model captures four different dimensions of CSK statements: plausibility, typicality, remarkability and salience, with scoring and ranking along each dimension. For example, hyenas drinking water is typical but not salient, whereas hyenas eating carcasses is salient. For reasoning and ranking, we develop a method with soft constraints, to couple the inference over concepts that are related in a taxonomic hierarchy. The reasoning is cast into an integer linear program (ILP), and we leverage the theory of reduced costs of a relaxed LP to compute informative rankings. This methodology is applied to several large CSK collections. Our evaluation shows that we can consolidate these inputs into much cleaner and more expressive knowledge.
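The idea of scoring one statement along several dimensions, instead of a single confidence value, can be illustrated with a short Python sketch. The class `ScoredStatement`, the function `rank_by`, and all numeric scores below are invented for illustration; the sketch covers only per-dimension ranking, not the joint reasoning or the ILP described above.

```python
from dataclasses import dataclass

# The four dimensions named in the text.
DIMENSIONS = ("plausibility", "typicality", "remarkability", "salience")


@dataclass(frozen=True)
class ScoredStatement:
    """A CSK statement with one score per dimension (toy values in [0, 1])."""

    text: str
    plausibility: float
    typicality: float
    remarkability: float
    salience: float


def rank_by(statements, dimension):
    """Return the statements sorted by the given dimension, highest first."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension!r}")
    return sorted(statements, key=lambda s: getattr(s, dimension), reverse=True)


if __name__ == "__main__":
    # Scores are made up to mirror the hyena example; they are not real data.
    hyena = [
        ScoredStatement("hyenas drink water", 0.9, 0.9, 0.1, 0.2),
        ScoredStatement("hyenas eat carcasses", 0.9, 0.6, 0.7, 0.9),
    ]
    for dim in ("typicality", "salience"):
        print(dim, "->", [s.text for s in rank_by(hyena, dim)])
```

With these toy scores, the two statements swap places depending on whether they are ranked by typicality or by salience, which is the distinction a single confidence score cannot express.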
- Paper: Joint Reasoning for Multi-Faceted Commonsense Knowledge, Yohan Chalier, Simon Razniewski and Gerhard Weikum, AKBC, 2020 [pdf]
- Demo: https://dice.mpi-inf.mpg.de/
- Code: https://github.com/ychalier/dice
- Data: https://www.dropbox.com/sh/yqn3o1ngnx8c8fz/AADD2jHxBZm31IZ0n3U_Dnf8a?dl=0
WebChild is a large collection of commonsense knowledge, automatically extracted and disambiguated from Web contents. WebChild contains triples that connect nouns with adjectives via fine-grained relations like hasShape, hasTaste, evokesEmotion, etc. The arguments of these assertions, nouns and adjectives, are disambiguated by mapping them onto their proper WordNet senses.
Large-scale experiments demonstrate the high accuracy (more than 80 percent) and coverage (more than four million fine-grained disambiguated assertions) of WebChild.
HowToKB is the first large-scale knowledge base which represents how-to (task) knowledge. Each task is represented by a frame with attributes for parent task, preceding sub-task, following sub-task, required tools or other items, and linkage to visual illustrations.
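A task frame of this kind might be modeled as follows. This is a hedged sketch: the class `TaskFrame`, its field names, the helper `ordered_subtasks`, and the example tasks are assumptions made for illustration and are not taken from the HowToKB release.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskFrame:
    """One task with the attributes listed in the text (hypothetical field names)."""

    name: str
    parent: Optional[str] = None      # parent task
    preceding: Optional[str] = None   # preceding sub-task
    following: Optional[str] = None   # following sub-task
    items: list = field(default_factory=list)          # required tools or other items
    illustrations: list = field(default_factory=list)  # links to visual illustrations


def ordered_subtasks(frames, parent):
    """Return the names of the parent's sub-tasks in temporal order.

    Starts at the sub-task whose predecessor is not a sibling and then follows
    the `following` links; stops at the end of the chain or on a repeated task.
    """
    children = {f.name: f for f in frames if f.parent == parent}
    current = next(
        (f for f in children.values() if f.preceding not in children), None
    )
    order = []
    while current is not None and current.name not in order:
        order.append(current.name)
        current = children.get(current.following)
    return order


if __name__ == "__main__":
    frames = [
        TaskFrame("paint a wall", items=["paint", "roller"]),
        TaskFrame("apply paint", parent="paint a wall", preceding="apply primer"),
        TaskFrame("clean the wall", parent="paint a wall", following="apply primer"),
        TaskFrame(
            "apply primer",
            parent="paint a wall",
            preceding="clean the wall",
            following="apply paint",
        ),
    ]
    print(ordered_subtasks(frames, "paint a wall"))
```

Storing both the preceding and following links lets the sub-task order be recovered regardless of the order in which frames are stored.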
- Distilling Task Knowledge from How-to Communities, Cuong Xuan Chu, Niket Tandon, Gerhard Weikum, WWW 2017 [pdf]