Hide. Nudge. Counter. What's best in dealing with toxic content?
In the search for a balance between civility and free speech on social media, several design features show promise—and trade-offs, writes Lena Slachmuijlder
Online toxicity is more than just a nuisance—it’s pushing people, especially women and girls, away from digital spaces. This “chilling effect” silences voices, reduces the diversity of viewpoints, and reinforces the perception that we are so polarized that collaboration across divides is impossible.
This is all the more pronounced in fragile democracies and conflict-affected contexts. According to Handling Harmful Content Online: Cross-national Perspectives of Users Affected by Conflict, a report by Search for Common Ground, many users in conflict-affected areas resort to ‘exit strategies,’ like blocking or unfollowing harmful accounts, to protect themselves from offensive content. This self-censorship profoundly impacts democratic debate and civic engagement, reducing the number of voices participating in critical conversations.
So what should we do? In addition to advocating for a shift away from purely engagement-based ranking and recommender algorithms, which often amplify divisive and sensational content, other ideas are being tried and tested:
hide the toxic content with filters,
nudge the users to be more civil at the time of writing their content, or
counter the toxic speech with counterspeech.
Filtering out toxicity
Several social media platforms allow users to filter or hide offensive content. Instagram’s comment filter lets each user automatically or manually filter out comments that don’t violate its Community Guidelines “but may be inappropriate, disrespectful, or offensive.” Users can also create a custom list of words, phrases, and emojis to hide, according to Instagram’s help center.
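At its simplest, a user-defined hidden-words list is just a membership test over incoming comments. Instagram hasn’t published its implementation, so the Python sketch below is purely illustrative; real filters also have to handle misspellings, leetspeak, and other obfuscations.

```python
# Illustrative only: Instagram has not published its filter internals.
def should_hide(comment: str, hidden_terms: list[str]) -> bool:
    """Return True when a comment contains any of the user's hidden terms.

    hidden_terms is the user's custom list of words, phrases, and emojis.
    casefold() gives a case-insensitive match; emojis pass through unchanged.
    """
    text = comment.casefold()
    return any(term.casefold() in text for term in hidden_terms)

# The comment is hidden only from the user who defined the list.
print(should_hide("Shut up already", ["loser", "shut up", "🤮"]))  # True
print(should_hide("Great photo!", ["loser", "shut up", "🤮"]))     # False
```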
Yet while these filters can reduce the visibility of toxic content, research shows that they don’t necessarily improve user behavior, according to Matt Katsaros of the Justice Collaboratory at Yale Law School. He shared findings on the effects of such filters at a May 2024 Symposium on Comment Section Research and Design hosted by the Plurality Institute and the Council on Tech and Social Cohesion. His research on Nextdoor revealed that while filtering offensive comments reduced their visibility by 12%, there was little change in how users behaved overall. “Hiding offensive comments reduces their visibility, but it doesn’t seem to encourage more productive or positive engagement,” Katsaros noted. This suggests that filtering alone may prevent users from seeing toxic content, but it may not change the culture of the platform.
Filtering often walks a fine line between protecting users and censoring speech. Many private companies are emerging that offer anti-toxicity filtering services, such as TrollWall AI. They argue that moderation through filtering maintains healthy online communities and leads to more engagement, as users feel safer online. Perspective API makes a similar argument, offering a toxicity filter for comment sections aimed at fostering healthier online discourse.
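As a concrete example, Perspective API exposes a REST endpoint that returns the probability a comment is toxic, and a comment section can hide anything above a chosen cutoff. The endpoint and payload below follow Google’s public documentation, but the 0.8 threshold and the surrounding helper code are assumptions for illustration, not anyone’s production pipeline.

```python
# Minimal sketch of a comment filter built on Perspective API.
# Endpoint and payload follow Google's public docs; the threshold is arbitrary.
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder; issued via Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY probability (0.0 to 1.0) for a comment."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def visible_comments(comments: list[str], threshold: float = 0.8) -> list[str]:
    """Hide comments whose toxicity probability exceeds the threshold."""
    return [c for c in comments if toxicity_score(c) <= threshold]
```

Where that cutoff sits is precisely the moderation-versus-censorship judgment call: lower it and more marginal speech disappears; raise it and more abuse gets through.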
Not all agree that less online civility harms democracy. Dayei Oh and John Downey, in their paper Does algorithmic content moderation promote democratic discourse?, write that it’s intolerance, not a lack of civility, that’s the issue. “The current algorithmic moderation does not promote democratic discourse, but rather deters it by silencing the uncivil but pro-democratic voices of the marginalised as well as by failing to detect intolerant messages whose meanings are embedded in nuances and rhetoric,” argue the authors. “New algorithmic moderation should focus on the reliable and transparent identification of hate speech and be in line with the feminist, anti-racist, and critical theories of democratic discourse.”
What’s clear is that this is a fertile space for tech innovation. Textgain, a Belgian AI company, is developing a model that claims not only to detect toxic language in all European languages, but also to understand the context in which it occurs. “A good example is that we monitor social media content around football players. Their insulting each other is actually part of the fun, and the threshold is a lot higher for what would be considered real hate speech. So having a large language model that can take context into account will be extremely valuable,” Textgain CEO Guy de Pauw said in a recent interview.
Nudging users
A more interactive solution is behavioral nudging. In 2022, Twitter embedded prompts asking users to reconsider potentially harmful tweets before posting. Katsaros studied what happened when Twitter users got a pop-up asking if they’d like to reconsider posting a potentially harmful tweet: 9% of users decided not to post, and 22% edited their tweet. The nudge also had a lasting impact, as recipients were less likely to post offensive content in the following weeks.
“The nudge changes the behavior in the moment, but more importantly, it has a lasting impact. People are more likely to rethink their approach in future interactions,” explained Katsaros at the symposium. Users who were nudged were less likely to receive offensive replies themselves, he added. “By cutting off one offensive comment at the start, you reduce the likelihood of a toxic back-and-forth.”
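Twitter never published the prompt’s internals, but the flow Katsaros describes is straightforward to sketch: score the draft before it’s published, and if it crosses a threshold, interrupt with a choice rather than a block. The sketch below reuses the hypothetical toxicity_score() helper from the filtering example; the 0.6 threshold and the console prompt are stand-ins for illustration.

```python
# A sketch of a pre-posting nudge, not Twitter's actual implementation.
# Reuses the hypothetical toxicity_score() helper from the sketch above.
NUDGE_THRESHOLD = 0.6  # assumed cutoff; the real value was never disclosed

def submit_post(draft: str) -> str | None:
    """Nudge the author before publishing a potentially harmful post."""
    if toxicity_score(draft) < NUDGE_THRESHOLD:
        return draft  # benign drafts publish without interruption
    # The nudge interrupts posting but leaves the final choice to the author.
    choice = input("This may be hurtful. [p]ost anyway, [e]dit, [d]elete? ")
    if choice == "e":
        return submit_post(input("Revised post: "))  # re-score the edit
    if choice == "d":
        return None  # author withdrew the post
    return draft  # author chose to post anyway
```

The key design choice is that nothing is removed: the friction is informational, which may be why its effects persist after the prompt is gone.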
It’s unclear whether the nudging feature is still in use on X, but it seems doubtful, given Elon Musk’s recent announcement that the platform will move towards disabling users’ ability to ‘block’ other users from engaging with their content.
Counter-speaking
While filters and nudges aim to limit the visibility of harmful content, the approach called ‘counterspeech’ actively engages with it. The Dangerous Speech Project defines this as “any direct response to hateful or harmful speech that seeks to undermine it.” This undermining happens in two ways: first, by convincing people to stop posting harmful speech, either because they’ve changed their minds or out of fear of criticism or social sanctions; and second, through the influence counterspeakers have on the ‘audience’ reading the comments.
“Counterspeech can encourage a silent audience to chime in and even become regular counterspeakers, thus gradually shifting discourse toward the views expressed in counterspeech, even if no beliefs change,” says The Dangerous Speech Project.
Cathy Buerger’s literature review of counterspeech highlights its effectiveness in shaping the tone of online conversations: “When users are exposed to civil responses, they are more likely to contribute civil comments themselves.” According to Buerger’s review, Reconquista Internet, formed in 2018 in Germany as a direct response to the far-right hate group Reconquista Germanica, led to a decrease in the intensity and proportion of hateful speech online. However, the review also acknowledges that counterspeech has the potential to make exchanges more confrontational, which can escalate tensions rather than foster understanding.
Striking the right balance
So, what’s the best approach to handling toxicity online? There may not be a single ‘right’ answer. Filtering and nudging can reduce the immediate spread of harmful content, while counterspeech can encourage a more civil atmosphere.
It’s clear to me that platforms need to do more than simply hide offensive and abusive content. They must foster environments where users—especially women and girls—feel safe enough to engage fully, without fear of harassment. A blend of filters, nudges, and counterspeech can create safer, more inclusive, and diverse digital spaces.
The line between moderation and censorship is thin, but with thoughtful, transparent policies, platforms can navigate it in ways that protect users without stifling important voices. The goal should always be to promote healthier interactions, ensure freedom of expression, and mitigate the chilling effects of online toxicity.
Lena Slachmuijlder is Co-Chair of the Council on Tech and Social Cohesion and Executive Director of Digital Peacebuilding at Search for Common Ground.