As online harm rises, which moderation strategies work?
New research points to trade-offs between AI moderation, human review, and Community Notes
A wave of new data confirms what many already feel in their bones: being online is getting more dangerous — especially for those already at risk.
A 2025 global survey conducted by UltraViolet, All Out, and GLAAD gathered responses from over 7,000 users of Meta products across 86 countries. Among its findings:
77% had encountered harmful content on Meta platforms since January 2025.
92% felt less protected from being targeted by harmful content.
66% reported feeling less safe expressing themselves freely.
1 in 6 experienced gender-based or sexual harassment.
These results are presented alongside the 2025 GLAAD Social Media Safety Index (SMSI), which assessed six major platforms — Facebook, Instagram, TikTok, YouTube, Threads, and X — on LGBTQ safety. None scored higher than 56 out of 100. X received the lowest rating: 30.
These findings have sparked renewed debate about content moderation. What strategies are being used — and which are most effective at reducing harm?
Blaming automation
GLAAD attributes recent increases in reported harm to two major shifts in platform operations:
Policy rollbacks, particularly in hate speech enforcement, and
Increased automation in moderation processes.
The SMSI notes that while automation can expedite enforcement, it often fails to capture nuance — particularly in cases involving identity-based expression. At the same time, GLAAD argues that some LGBTQ content is being wrongly flagged, shadow-banned, or demonetized, raising concerns about over-enforcement. A recent NPR report on Meta’s internal changes highlighted that the company is replacing human-led privacy and societal risk assessments with automated review systems.
Enter Community Notes
One alternative to top-down or automated enforcement is Community Notes, a feature developed by X and more recently adopted by Meta. The system invites users to collaboratively add context to posts, aiming to surface explanations that are broadly agreed upon across ideological lines.
In her 2025 study From Twitter to X: demotion, community notes and the apparent shift from adjudication to consensus-building, researcher Emilie de Keulenaar explores the effect of what she describes as a “norm-agnostic” model — one that avoids explicit judgments and instead emphasizes user-driven consensus. The approach shifts moderation away from vertical decision-making and toward a model of crowdsourced consensus-building aimed at encouraging transparency and incorporating diverse viewpoints.
However, de Keulenaar’s research highlights significant limitations in how the system functions, particularly in polarized environments:
Inconsistent application: Notes are more likely to appear on certain types of content (e.g., political debate) than others.
Low visibility on divisive topics: In areas like immigration or gender identity, notes often fail to reach the consensus threshold needed to appear (a simplified sketch of this threshold logic follows this list).
Limited effectiveness on harmful content: Posts flagged with high “hate scores” by external tools are not always annotated or downranked.
Design-dependent outcomes: The system’s impact hinges on how platforms define ideological diversity and how prominently notes are displayed — both of which vary and lack transparency.
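To make the consensus-threshold limitation concrete, here is a deliberately simplified sketch in Python. X’s production ranker uses a more elaborate, open-sourced bridging model; the cluster labels, thresholds, and function names below are illustrative assumptions, not the platform’s actual code. The core idea is that a note surfaces only when raters from more than one viewpoint cluster agree it is helpful.

```python
from collections import defaultdict

# Simplified illustration only: the real Community Notes ranker is more
# elaborate. Here, a note is shown only if raters from at least two viewpoint
# clusters independently agree that it is helpful.

HELPFUL_SHARE = 0.6   # hypothetical per-cluster agreement threshold
MIN_RATERS = 5        # hypothetical minimum raters per cluster

def note_is_shown(ratings):
    """ratings: list of (rater_cluster, is_helpful) pairs, where the cluster
    label (e.g. 'left'/'right') is inferred from a rater's past behaviour."""
    by_cluster = defaultdict(list)
    for cluster, is_helpful in ratings:
        by_cluster[cluster].append(is_helpful)

    if len(by_cluster) < 2:                 # no cross-viewpoint signal at all
        return False
    for votes in by_cluster.values():
        if len(votes) < MIN_RATERS:         # too few raters in this cluster
            return False
        if sum(votes) / len(votes) < HELPFUL_SHARE:
            return False                    # this cluster does not find the note helpful
    return True

# On a divisive topic, one cluster rates the note helpful and the other does
# not, so it never crosses the threshold and stays invisible.
divisive = [("left", True)] * 6 + [("right", False)] * 6
print(note_is_shown(divisive))   # False
```

On polarized topics the cross-cluster agreement requirement is rarely met, which is precisely the low-visibility problem de Keulenaar describes.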
De Keulenaar notes that while Community Notes are sometimes seen as a substitute for traditional moderation, they serve a different purpose. Ultimately, the limits of Community Notes highlight a broader challenge: moderation systems are only as effective as the design choices that structure who participates, what counts as consensus, and what gets seen.
Design from the margins
Beyond moderation tools, some researchers are exploring how harm can be addressed through platform design itself — not just after harm occurs, but before it can take shape. One such framework is Design From the Margins (DFM), developed by researcher Afsaneh Rigot. DFM proposes that platforms should begin with those most likely to face harm — such as LGBTQ refugees, sex workers, and racialized communities — not as outliers, but as the starting point for design.
“When your most at-risk and disenfranchised are covered by your product, we are all covered.” — Afsaneh Rigot, Design From the Margins
Rather than relying on reactive moderation, DFM prioritizes structural features that anticipate harm. These include default privacy protections, user-controlled safety tools, and interface elements that guide interaction in safer, more intentional ways.
When platforms are designed to account for those most vulnerable, the overall system becomes more resilient. The recently published Blueprint on Prosocial Tech Design Governance by the Council on Tech and Social Cohesion observes that environments where marginalized users are better protected also tend to foster greater trust, constructive dialogue, and social cohesion. By contrast, when platforms fail to address the needs of at-risk users, that failure can feed wider patterns of distrust and polarization.
Make it harder to harm
Across these different approaches — automated enforcement, community-led annotation, and inclusive design — there is increasing recognition that moderation alone is not sufficient. GLAAD and other researchers point to a set of complementary strategies that may contribute to safer online environments:
Public transparency on enforcement decisions and moderation outcomes.
Mandatory training for content moderators, especially in areas related to identity-based harm.
Hybrid systems that combine AI with trained human oversight.
Proactive design tools, such as visibility filters, friction-based nudges, and personalized safety controls, that prevent harmful content from reaching its target in the first place (a simplified nudge sketch follows this list).
Ongoing measurement of user experiences with harm, especially among marginalized groups.
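As a rough illustration of what a friction-based nudge could look like in practice, the sketch below pauses a borderline post and asks the author to confirm before publishing. The classifier, thresholds, and return values are all hypothetical assumptions; no specific platform’s implementation is implied.

```python
# Hypothetical sketch of a friction-based nudge: a harm classifier (assumed to
# exist elsewhere) scores a draft post, and borderline drafts trigger a
# pause-and-confirm prompt instead of an outright block.

NUDGE_THRESHOLD = 0.6   # illustrative: above this, ask the author to reconsider
BLOCK_THRESHOLD = 0.9   # illustrative: above this, hold the post for review

def submit_post(draft_text, harm_score, confirmed=False):
    """harm_score: a 0-1 output from some toxicity/harassment model (assumed)."""
    if harm_score >= BLOCK_THRESHOLD:
        return "held_for_review"
    if harm_score >= NUDGE_THRESHOLD and not confirmed:
        # Friction: publishing is still possible, but only after the author
        # sees a warning and deliberately confirms a second time.
        return "nudge_shown"
    return "published"

print(submit_post("example draft", harm_score=0.7))                  # nudge_shown
print(submit_post("example draft", harm_score=0.7, confirmed=True))  # published
print(submit_post("example draft", harm_score=0.95))                 # held_for_review
```

The point of the design is not to block speech outright but to add a moment of deliberate choice before borderline content spreads.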
As online harm grows, it’s evident that our response needs to go beyond content takedowns. The real challenge lies in addressing the features of platforms that enable or even incentivize harmful behaviour. The future of safer digital spaces may depend less on what platforms take down, and more on how they shape interactions, set boundaries, and design for trust, safety and cohesion from the start.
Lena Slachmuijlder is Senior Advisor at Search for Common Ground and co-chairs the Council on Tech and Social Cohesion.
Just saw this excellent report, which gives further South Asian context about the performance of Community Notes: https://www.csohate.org/wp-content/uploads/2025/06/Xs-Community-Notes-and-the-South-Asian-Misinformation-Crisis.pdf. It highlights critical limitations of crowdsourced moderation systems in multilingual and politically complex environments. Despite South Asia’s scale and misinformation risk, the region accounts for less than 0.1% of all Community Notes — and most vernacular-language submissions never gather enough ratings to be published. The result is a persistent information gap where viral falsehoods in Hindi, Urdu, and other languages can circulate for days without correction. The report argues that design decisions — from static rating thresholds to limited language onboarding — are not neutral; they structurally disadvantage high-need regions.
To address these shortcomings, the report recommends several design and governance improvements: sustained local-language contributor recruitment, adaptable note thresholds for low-volume languages, geoboxed rating systems, and a public crisis-response protocol. The findings reinforce what others have said — that moderation tools like Community Notes are only as effective as the systems that support them. Without multilingual infrastructure and local context, even participatory models risk replicating the same exclusionary gaps found in top-down approaches.
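As a purely illustrative sketch of the “adaptable note thresholds” idea, the snippet below scales the number of ratings a note needs to the size of the active rater pool in its language. The formula, constants, and function name are my own assumptions for the sake of the example, not a proposal taken from the report.

```python
# Purely illustrative: one way an "adaptable note threshold" could work is to
# scale the ratings a note needs by the size of the active rater pool for its
# language, so low-volume languages are not held to a global bar.

GLOBAL_MIN_RATINGS = 5          # assumed global default
REFERENCE_RATER_POOL = 500_000  # assumed reference pool size

def required_ratings(active_raters_in_language, floor=2):
    """Lower the rating requirement for smaller language communities."""
    scale = (active_raters_in_language / REFERENCE_RATER_POOL) ** 0.5
    return max(floor, round(GLOBAL_MIN_RATINGS * scale))

print(required_ratings(500_000))  # 5: a large pool keeps the global bar
print(required_ratings(5_000))    # 2: a small vernacular-language pool gets a lower bar
```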