Online content moderation

There is No War in Ba Sing Se: A Global Analysis of Content Moderation in Large Language Models

Investigators: Friedemann Lipphardt, Moonis Ali, Martin Banzer, Anja Feldmann in cooperation with Devashish Gosain (IIT Bombay, India) 

The recent growth of GenAI applications like ChatGPT and Gemini indicates that LLMs are increasingly becoming the de-facto primary information source for many users across the globe. It is therefore critical to examine whether the information offered by these systems is consistent and accurate. Given this important research frontier, we perform the first comprehensive analysis of content moderation patterns detected in over 700,000 replies from 15 leading LLMs, evaluated from 12 locations using 1,118 sensitive queries spanning five categories in 13 languages. We find substantial geographic variation, with moderation rates showing relative differences of up to 60% across locations; for instance, soft moderation (e.g., evasive replies) appears in 14.3% of German contexts versus 24.9% of Zulu contexts. Category-wise, misc. (generally unsafe), hate speech, and sexual content are more heavily moderated than political or religious content, with political content showing the most geographic variability. We also observe discrepancies between online and offline model versions, such as DeepSeek exhibiting a 15.2% higher relative soft moderation rate when deployed locally than via API. Our analysis of response length (and time) reveals that moderated responses are, on average, about 50% shorter than unmoderated ones. These findings have important implications for AI fairness and digital equity, as users in different locations receive inconsistent access to information. We provide the first systematic evidence of geographic and cross-language bias in LLM content moderation and showcase how model selection vastly impacts user experience.
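
To illustrate one step of such a measurement pipeline, the following minimal Python sketch (ours, not the study's released code) labels collected responses as soft-moderated via a simple refusal-phrase heuristic and aggregates rates per vantage point; the marker list, record fields, and sample data are illustrative assumptions.

```python
# A minimal sketch, not the study's pipeline: label responses as soft-moderated
# via a refusal-phrase heuristic and aggregate per location.
from dataclasses import dataclass
from statistics import mean

# Hypothetical English refusal markers; the actual study spans 13 languages.
REFUSAL_MARKERS = ["i can't help with", "i cannot assist", "i'm unable to"]

@dataclass
class Record:
    location: str  # vantage point the model was queried from, e.g. "DE"
    response: str  # model reply to a sensitive query

def is_soft_moderated(text: str) -> bool:
    t = text.lower()
    return any(marker in t for marker in REFUSAL_MARKERS)

def moderation_rate_by_location(records: list[Record]) -> dict[str, float]:
    flags_by_loc: dict[str, list[bool]] = {}
    for r in records:
        flags_by_loc.setdefault(r.location, []).append(is_soft_moderated(r.response))
    return {loc: sum(flags) / len(flags) for loc, flags in flags_by_loc.items()}

data = [
    Record("DE", "Here is a balanced overview of the topic ..."),
    Record("ZA", "I'm unable to discuss that."),
    Record("ZA", "I cannot assist with this request."),
]
print(moderation_rate_by_location(data))   # e.g. {'DE': 0.0, 'ZA': 1.0}
moderated = [len(r.response) for r in data if is_soft_moderated(r.response)]
unmoderated = [len(r.response) for r in data if not is_soft_moderated(r.response)]
print(mean(moderated), mean(unmoderated))  # length comparison, cf. the ~50% finding
```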

References:
Lipphardt, F., Ali, M., Banzer, M., Feldmann, A. and Gosain, D. 2026. There is No War in Ba Sing Se: A Global Analysis of Content Moderation in Large Language Models. To appear in the Proceedings of NDSS 2026.


From Isolation to Desolation: Investigating Self-Harm Discussions in Incel Communities

Investigators: Moonis Ali in cooperation with Savvas Zannettou (TU Delft)

Understanding online sub-cultures, especially extremely sensitive communities, is often a challenging task. Researchers therefore often rely on data-driven techniques to understand the prevalence of specific themes in these communities. To this end, we design a comparative study that focuses on self-harm-related linguistic signatures in Incel communities on the Internet, compared against mainstream Reddit mental-health communities. We observe that, over time, language related to self-harm evolves considerably more among Incels than in mainstream communities. We also observe that negative perception of their physical appearance is the most recurrent theme in Incels' self-harm conversations, a theme that does not feature in mainstream communities. Finally, by analyzing social factors, we find that substance abuse is the social factor most closely associated with self-harm in both Incel and mainstream communities, and that physical appearance is, over time, becoming increasingly closely related to self-harm discussions in Incel communities.
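
As a concrete illustration of this kind of prevalence analysis, the minimal sketch below tracks the yearly fraction of posts matching a small self-harm lexicon; the lexicon, data layout, and sample posts are our own placeholder assumptions, not the study's actual method.

```python
# A minimal sketch, not the study's method: yearly prevalence of posts matching
# a tiny self-harm lexicon. The lexicon and the sample posts are placeholders.
import re
from collections import defaultdict

LEXICON = re.compile(r"\b(self[- ]?harm|hurting myself)\b", re.IGNORECASE)

def yearly_prevalence(posts):
    """posts: iterable of (year, text); returns fraction of matching posts per year."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in posts:
        totals[year] += 1
        hits[year] += bool(LEXICON.search(text))
    return {year: hits[year] / totals[year] for year in sorted(totals)}

print(yearly_prevalence([
    (2018, "just venting about work"),
    (2019, "been thinking about self-harm lately"),
    (2019, "another ordinary day"),
]))  # -> {2018: 0.0, 2019: 0.5}
```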

References:
Ali, M. and Zannettou, S. 2024. From Isolation to Desolation: Investigating Self-Harm Discussions in Incel Communities. Proceedings of the International AAAI Conference on Web and Social Media. 18, 1 (May 2024), 43-56. DOI:https://doi.org/10.1609/icwsm.v18i1.31296.


Online content moderation

Investigators: Savvas Zannettou in cooperation with Shagun Jhaver (Rutgers University, USA), Jeremy Blackburn (Binghamton University, USA), Emiliano De Cristofaro (University College London, United Kingdom), Gianluca Stringhini (Boston University, USA), Robert West (EPFL, Switzerland), Krishna P. Gummadi (MPI-SWS, Germany)

Analyzing content moderation online is important for several reasons. First, social media and other online platforms have become major sources of information and communication, shaping public discourse and opinions. The content shared on these platforms can have significant consequences for individuals, communities, and society as a whole. Thus, understanding the challenges and complexities of moderating online content is crucial for ensuring that these platforms are safe and inclusive spaces for all users. Second, content moderation involves a range of technical, social, and ethical issues that require interdisciplinary expertise. Studying content moderation online involves understanding the technical mechanisms used to identify and remove harmful content, as well as the social and cultural contexts in which these mechanisms operate. Moreover, content moderation online is a constantly evolving field, as new technologies and social dynamics emerge. In this line of work, our goal is to analyze and understand multiple aspects of content moderation, including how soft moderation interventions (i.e., warning labels) are applied online and how effective they are [4, 2], what happens after online platforms take moderation action against specific online communities (i.e., community bans) [1], and how we can design systems to automatically identify accounts that are state-sponsored trolls involved in misinformation campaigns online [3].

Soft Moderation Interventions
Over the past few years, there has been a heated debate and serious public concern regarding online content moderation, censorship, and the principle of free speech on the Web. To ease these concerns, social media platforms like Twitter, Facebook, and TikTok refined their content moderation systems to support soft moderation interventions. Soft moderation interventions refer to warning labels attached to potentially questionable or harmful content to inform other users about the content and its nature while the content remains accessible, hence alleviating concerns related to censorship and free speech. In our work, we performed one of the first empirical studies on how soft moderation interventions are applied on Twitter [4] and TikTok [2]. In particular, our work on Twitter uses a mixed-methods approach to study the users who share tweets with warning labels on Twitter and their political leaning, the engagement that these tweets receive, and how users interact with tweets that have warning labels. Among other things, we find that 72% of the tweets with warning labels are shared by Republicans, while only 11% are shared by Democrats. By analyzing content engagement, we find that tweets with warning labels receive more engagement than tweets without warning labels. We also qualitatively analyze how users interact with content that has warning labels, finding that the most popular interactions are further debunking false claims, mocking the author or content of the disputed tweet, and further reinforcing or resharing false claims. Finally, we describe concrete examples of inconsistencies, such as warning labels that are incorrectly added or warning labels that are missing from tweets despite those tweets sharing questionable and potentially harmful information.

Our work on TikTok [2] focuses on the important problem of soft moderation interventions during important health-related events like the COVID-19 pandemic. In particular, we analyze the use of warning labels on TikTok, focusing on COVID-19 videos. First, we construct a set of 26 COVID-19-related hashtags and collect 41K videos that include those hashtags in their description. Second, we perform a quantitative analysis on the entire dataset to understand the use of warning labels on TikTok. Then, we perform an in-depth qualitative study, using thematic analysis, on 222 COVID-19-related videos to assess the content and the connection between the content and the warning labels. Our analysis shows that TikTok applies warning labels broadly, likely based on hashtags included in the description (e.g., 99% of the videos that contain #coronavirus have warning labels). More worrying is the addition of COVID-19 warning labels to videos whose actual content is not related to COVID-19 (23% of the cases in a sample of 143 English videos that are not related to COVID-19). Finally, our qualitative analysis of a sample of 222 videos shows that 7.7% of the videos share misinformation/harmful content without a warning label, 37.3% share benign information yet include warning labels, and 35% of the videos that share misinformation/harmful content (and need a warning label) are made for fun. Our study demonstrates the need to develop more accurate and precise soft moderation systems, especially on a platform like TikTok, which is extremely popular among younger audiences.
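
As an illustration of the quantitative step, the minimal sketch below computes the kind of per-hashtag label rate behind figures such as the 99% statistic above; the record fields ("hashtags", "has_label") and the sample data are our assumptions, not the actual dataset schema.

```python
# A minimal sketch of a per-hashtag warning-label tally; fields are illustrative.
from collections import Counter

def label_rate_per_hashtag(videos):
    with_label, total = Counter(), Counter()
    for v in videos:
        for tag in v["hashtags"]:
            total[tag] += 1
            with_label[tag] += v["has_label"]  # True counts as 1
    return {tag: with_label[tag] / total[tag] for tag in total}

sample = [
    {"hashtags": {"coronavirus"}, "has_label": True},
    {"hashtags": {"coronavirus", "fyp"}, "has_label": True},
    {"hashtags": {"fyp"}, "has_label": False},
]
print(label_rate_per_hashtag(sample))  # -> {'coronavirus': 1.0, 'fyp': 0.5}
```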

Community Bans
When toxic online communities on mainstream platforms face moderation measures, such as bans, they may migrate to other platforms with laxer policies or set up their own dedicated websites. Previous work suggests that within mainstream platforms, community-level moderation is effective in mitigating the harm caused by the moderated communities. It is, however, unclear whether these results also hold when considering the broader Web ecosystem. Do toxic communities continue to grow in terms of their user base and activity on the new platforms? Do their members become more toxic and ideologically radicalized? In our work [1], we report the results of a large-scale observational study of how problematic online communities progress following community-level moderation measures. We analyze data from r/The_Donald and r/Incels, two communities that were banned from Reddit and subsequently migrated to their own standalone websites. Our results suggest that, in both cases, moderation measures significantly decreased posting activity on the new platform, reducing the number of posts, active users, and newcomers. In spite of that, users in one of the studied communities (r/The_Donald) showed increases in signals associated with toxicity and radicalization, which justifies concerns that the reduction in activity may come at the expense of a more toxic and radical community. Overall, our results paint a nuanced portrait of the consequences of community-level moderation and can inform their design and deployment.
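
To make the activity measurement concrete, here is a minimal sketch of pre/post-migration metrics (posts, active users, newcomers); this is a simplification of the paper's observational design, and the field names and sample data are our own assumptions.

```python
# A minimal sketch of pre/post-ban community activity metrics; illustrative only.
from datetime import date

def activity_by_period(posts, ban_date):
    """posts: iterable of (date, author); splits activity before/after ban_date."""
    stats = {"pre": {"posts": 0, "users": set()},
             "post": {"posts": 0, "users": set()}}
    for day, author in posts:
        period = "pre" if day < ban_date else "post"
        stats[period]["posts"] += 1
        stats[period]["users"].add(author)
    newcomers = stats["post"]["users"] - stats["pre"]["users"]  # first seen after the ban
    return {"posts": (stats["pre"]["posts"], stats["post"]["posts"]),
            "active_users": (len(stats["pre"]["users"]), len(stats["post"]["users"])),
            "newcomers_post": len(newcomers)}

print(activity_by_period(
    [(date(2019, 1, 5), "a"), (date(2019, 3, 2), "a"), (date(2019, 3, 3), "b")],
    ban_date=date(2019, 2, 1)))
```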

Detecting State-Sponsored trolls
Growing evidence points to recurring influence campaigns on social media, often sponsored by state actors aiming to manipulate public opinion on sensitive political topics. Typically, campaigns are performed through instrumented accounts, known as troll accounts; despite their prominence, however, little work has been done to detect these accounts in the wild. In our work [3], we present TROLLMAGNIFIER, a detection system for troll accounts. Our key observation, based on analysis of known Russian-sponsored troll accounts identified by Reddit, is that they show loose coordination, often interacting with each other to further specific narratives. Therefore, troll accounts controlled by the same actor often show similarities that can be leveraged for detection. TROLLMAGNIFIER learns the typical behavior of known troll accounts and identifies additional accounts that behave similarly. We train TROLLMAGNIFIER on a set of 335 known troll accounts and run it on a large dataset of Reddit accounts. Our system identifies 1,248 potential troll accounts; we then provide a multi-faceted analysis to corroborate the correctness of our classification. In particular, 66% of the detected accounts show signs of being instrumented by malicious actors (e.g., they were created on the same exact day as a known troll, they have since been suspended by Reddit, etc.). They also discuss similar topics as the known troll accounts and exhibit temporal synchronization in their activity. Overall, we show that by using TROLLMAGNIFIER, one can grow the initial knowledge of potential trolls provided by Reddit by over 300%. We argue that our system can be used for identifying and moderating content originating from state-sponsored accounts that aim to perform influence campaigns on social media platforms.
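
As a rough illustration of similarity-based scoring, the minimal sketch below scores a candidate account on two of the signals discussed above (interactions with known trolls and creation-date proximity); it is not TROLLMAGNIFIER itself, and the fields, weights, and threshold are illustrative assumptions.

```python
# A minimal sketch, not TROLLMAGNIFIER: score a candidate account against
# known trolls using two behavioral signals. Weights are arbitrary.
from datetime import date

def troll_score(account, known_trolls):
    """account: {"created": date, "replied_to": set of usernames};
    known_trolls: mapping username -> creation date."""
    interactions = len(account["replied_to"] & known_trolls.keys())
    same_day = account["created"] in known_trolls.values()
    return interactions + (5 if same_day else 0)

known = {"troll_a": date(2016, 5, 1), "troll_b": date(2016, 5, 1)}
candidate = {"created": date(2016, 5, 1), "replied_to": {"troll_a", "user_x"}}
print(troll_score(candidate, known))  # 6; flag if above a tuned threshold
```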

References
• [1] M. Horta Ribeiro, S. Jhaver, S. Zannettou, J. Blackburn, E. De Cristofaro, G. Stringhini, and R. West. Do Platform Migrations Compromise Content Moderation? Evidence from r/The_Donald and r/Incels, 2020. arXiv: 2010.10397.
• [2] C. Ling, K. Gummadi, and S. Zannettou. "Learn the Facts About COVID-19": Analyzing the Use of Warning Labels on TikTok Videos, 2022. arXiv: 2201.07726.
• [3] M. H. Saeed, S. Ali, J. Blackburn, E. De Cristofaro, S. Zannettou, and G. Stringhini. TROLLMAGNIFIER: Detecting state-sponsored troll accounts on Reddit. In 43rd IEEE Symposium on Security and Privacy (SP 2022), San Francisco, CA, USA, 2022, pp. 2161–2175. IEEE.
• [4] S. Zannettou. “I won the election!”: An empirical analysis of soft moderation interventions on Twitter. In Proceedings of the Fifteenth International Conference on Web and Social Media (ICWSM 2021), Atlanta, GA, USA, 2021, pp. 865–876. AAAI.