RedDust: a Large Reusable Dataset of Reddit User Traits

Social media is a rich source of assertions about personal attributes, such as "I am a doctor" or "my hobby is playing tennis". Precisely identifying explicit assertions is difficult, though, because users vary widely in their vocabulary and phrasing. Identifying personal attributes from implicit assertions like "I've been at work treating patients all day" is even more challenging.

We present RedDust, a large-scale annotated resource for user profiling that covers over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models that handle implicit assertions. RedDust consists of users' personal attribute labels, along with the users' post IDs, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. To the best of our knowledge, RedDust is the first large-scale annotated language resource about Reddit users. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.
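To illustrate what a high-precision pattern for explicit assertions might look like, here is a minimal Python sketch. The patterns, the attribute vocabulary, and the function name `extract_attributes` are hypothetical and chosen only for illustration; they are not the patterns used to build RedDust.

```python
import re

# Hypothetical patterns for illustration only; NOT the patterns used for RedDust.
# Each pattern targets an explicit first-person assertion and captures the value.
PATTERNS = {
    "profession": re.compile(
        r"\bI(?: am|'m) an? (doctor|nurse|teacher|engineer)\b", re.IGNORECASE
    ),
    "hobby": re.compile(
        r"\bmy hobby is ((?:playing |collecting )?\w+)", re.IGNORECASE
    ),
}


def extract_attributes(post):
    """Return (attribute, value) pairs for explicit assertions found in a post."""
    found = []
    for attribute, pattern in PATTERNS.items():
        for match in pattern.finditer(post):
            found.append((attribute, match.group(1).lower()))
    return found


if __name__ == "__main__":
    for text in (
        "I am a doctor",
        "my hobby is playing tennis",
        "I've been at work treating patients all day",
    ):
        print(text, "->", extract_attributes(text))
```

Narrow patterns of this kind trade recall for precision: they match the explicit examples but return nothing for the implicit one, which is the case that learned models trained on such labels are meant to cover.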

Publications

Anna Tigunova, Andrew Yates, Paramita Mirza, and Gerhard Weikum. RedDust: a Large Reusable Dataset of Reddit User Traits. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). [pdf] [bib]

Downloads

RedDust contains labeled Reddit users with the following attributes: profession, hobby, family status, age, and gender.

Please refer to the README file for more details. 
License (for files): Creative Commons Attribution 4.0 International (CC BY 4.0)