Personal knowledge about users’ professions, hobbies, favorite food, and travel preferences, among others, is a valuable asset for individualized AI, such as recommenders or chatbots. In this line of research we explore the task of acquiring such knowledge from conversational utterances in order to build a Personal Knowledge Base (PKB) that is scrutable and explainable. This problem is more challenging than the established task of information extraction from scientific publications or Wikipedia articles, because dialogues or user-generated content in social media often give merely implicit cues about the user/speaker.
Prior work developed supervised methods to extract personal knowledge from conversational data, including users' utterances in social media, but these approaches cannot generalize beyond attribute values with ample labeled training samples. We overcome this limitation by devising CHARM: a zero-shot learning method that creatively leverages keyword extraction and document retrieval to predict attribute values that were never seen during training. Experiments with large datasets from Reddit show the viability of CHARM for open-ended attributes, such as professions and hobbies.
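The two-stage idea (extract keywords from a user's utterances, then retrieve documents to rank candidate attribute values) can be illustrated with a toy sketch. This is not the CHARM implementation: the frequency-based keyword picker, the log-scaled term-overlap scorer, the stopword list, and all names (`extract_keywords`, `rank_values`, `value_docs`) are assumptions made for illustration, standing in for the learned components described in the paper.

```python
from collections import Counter
import math
import re

# Hypothetical stopword list, just large enough for the toy example.
STOPWORDS = {"i", "the", "a", "an", "and", "my", "to", "of", "in", "at", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def extract_keywords(utterances, k=5):
    """Stand-in keyword extractor: the k most frequent non-stopword terms."""
    counts = Counter(
        tok for utt in utterances for tok in tokenize(utt) if tok not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]

def rank_values(keywords, value_docs):
    """Stand-in retrieval step: score each attribute value by how often the
    query keywords occur in the documents associated with that value."""
    ranking = []
    for value, docs in value_docs.items():
        tf = Counter(tok for doc in docs for tok in tokenize(doc))
        score = sum(math.log1p(tf[kw]) for kw in keywords)
        ranking.append((value, score))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

utterances = [
    "I spent the whole night shift treating patients",
    "Two patients came into the emergency room",
]
# Documents describing candidate values; no labeled users are needed for them.
value_docs = {
    "carpenter": ["Carpenters build furniture and wooden structures"],
    "nurse": ["Nurses care for patients in hospital wards and the emergency room"],
}
print(rank_values(extract_keywords(utterances), value_docs))
```

Because candidate values are represented only by their documents, a new value can be added to `value_docs` without any labeled training users, which is the property that makes a retrieval-based formulation suitable for zero-shot prediction.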
Anna Tigunova, Andrew Yates, Paramita Mirza and Gerhard Weikum. CHARM: Conversational Hidden Attribute Retrieval Model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 (to appear). [pdf] [code]
We propose methods for inferring personal knowledge, such as profession, age or family status, from conversations using deep learning. Specifically, we propose several Hidden Attribute Models, which are neural networks leveraging attention mechanisms and embeddings. Our methods are trained on a per-predicate basis to output rankings of object values for a given subject-predicate combination (e.g., ranking the doctor and nurse professions high when speakers talk about patients or emergency rooms). Experiments with various conversational texts, including Reddit discussions, movie scripts and a collection of crowdsourced personal dialogues, demonstrate the viability of our methods and their superior performance compared to state-of-the-art baselines.
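To make the combination of embeddings, attention, and ranking concrete, here is a minimal sketch of attention-weighted pooling followed by scoring of object values. It is not the authors' architecture: the two-dimensional vectors are hand-set rather than learned, and the names `attend`, `rank_objects`, `attention_vector`, and `object_vectors` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(term_vectors, attention_vector):
    """Weight each term embedding by its attention score and sum them."""
    weights = softmax([dot(v, attention_vector) for v in term_vectors])
    dim = len(term_vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, term_vectors)) for i in range(dim)]
    return pooled, weights

def rank_objects(pooled, object_vectors):
    """Rank candidate object values by similarity to the pooled representation."""
    scores = {obj: dot(pooled, vec) for obj, vec in object_vectors.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values): first axis ~ "medical", second ~ "music".
terms = ["patients", "emergency", "weekend"]
embeddings = {"patients": [1.0, 0.0], "emergency": [0.9, 0.1], "weekend": [0.1, 0.2]}
attention_vector = [2.0, 0.0]  # would be learned per predicate in a real model
object_vectors = {"musician": [0.0, 1.0], "doctor": [1.0, 0.0]}

pooled, weights = attend([embeddings[t] for t in terms], attention_vector)
print(rank_objects(pooled, object_vectors))
```

In this toy setting the attention weights concentrate on the medically flavored terms, so the pooled representation lies closer to the `doctor` vector than to `musician`.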
Anna Tigunova, Andrew Yates, Paramita Mirza and Gerhard Weikum. Listening between the Lines: Learning Personal Attributes from Conversations. In Proceedings of The Web Conference (WWW) 2019, pages 1818-1828, San Francisco, CA, United States. [pdf]
Social media is a rich source of assertions about personal attributes, such as "I am a doctor" or "my hobby is playing tennis". Precisely identifying explicit assertions is difficult, though, because of the users’ highly varied vocabulary and language expressions. Identifying personal attributes from implicit assertions like "I've been at work treating patients all day" is even more challenging.
We present RedDust, a large-scale annotated resource for user profiling that covers over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users’ personal attribute labels, along with users’ post IDs, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.
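As an illustration of how pattern-based labeling of explicit assertions might work, the sketch below matches two simple regular expressions and records the post ID alongside each extracted label. The patterns, the whitelist, and the function name `label_posts` are hypothetical and much cruder than the pattern set used to build RedDust.

```python
import re

# Hypothetical patterns for explicit self-assertions.
PATTERNS = {
    "profession": re.compile(r"\bi(?: am|'m) an? ([a-z]+)\b", re.IGNORECASE),
    "hobby": re.compile(r"\bmy hobby is ((?:playing )?[a-z]+)\b", re.IGNORECASE),
}

# Assumed whitelist that keeps precision high ("I am a bit tired" is rejected).
# None means that any matched value is accepted for that attribute.
ALLOWED = {"profession": {"doctor", "nurse", "teacher", "engineer"}, "hobby": None}

def label_posts(posts):
    """Return (post_id, attribute, value) triples for explicit assertions."""
    labels = []
    for post_id, text in posts:
        for attribute, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                value = match.group(1).lower()
                allowed = ALLOWED[attribute]
                if allowed is None or value in allowed:
                    labels.append((post_id, attribute, value))
    return labels

posts = [
    ("p1", "I am a doctor and I love my job."),
    ("p2", "Honestly, my hobby is playing tennis."),
    ("p3", "I am a bit tired today."),
    ("p4", "I've been at work treating patients all day."),
]
print(label_posts(posts))
```

The implicit assertion in post `p4` yields no label, which shows why pattern-labeled users are useful mainly as training data for models that must handle implicit cues.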
Anna Tigunova, Andrew Yates, Paramita Mirza and Gerhard Weikum. RedDust: a Large Reusable Dataset of Reddit User Traits. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). [pdf] [bib]
RedDust contains labeled Reddit users with the following attributes: profession, hobby, family status, age, and gender.
Please refer to the README file for more details.
License (for files): Creative Commons Attribution 4.0 International (CC BY 4.0)