Why is content moderation so hard?

Sentropy · Published in Sentropy Technologies · 7 min read · Oct 1, 2020


Disclaimer: The text below references obscene language pertaining to hate speech.

It’s clear that abusive content is a problem. A full one-third of adults and nearly half of teens have been the target of severe online harassment. And that abuse has some chilling real-world consequences. Multiple studies have shown that children, teens, and young adults who were victims of online harassment were more than twice as likely as non-victims to self-harm, exhibit suicidal behaviors, and consider or attempt suicide.

But abusive content persists, and even grows. It’s become a real-life Hydra: cut off one head, and more sprout up in its place. Still, we live in the age of pocket-sized supercomputers, self-driving cars, and cloned sheep. Why is abusive content so difficult to get under control? Let’s take a look at the reasons this problem is so intractable — and what the future holds for content moderation.

A moving target

If it were a matter of creating a list of offensive terms and combining it with a handful of contextual signals, the problem might be manageable. But language is constantly evolving. And language spread on the internet evolves at warp speed. Some examples:

  • Misspellings and character substitutions. Much of the internet doesn’t use perfect spelling and grammar, meaning some language that’s still understandable to us can be easily misunderstood by software. For example, the homophobic slur faggot can appear as faget, fgt, phaggot, fa go tt, f@gg0t, f-a-g-g-o-t, or an infinite number of other variations.
  • New slurs. Anti-Asian slurs have spiked during the COVID-19 pandemic. By our count, dozens of new anti-Asian epithets — like kung flu and slant-eyed sickness — cropped up linking Asian people and those of Asian descent with coronavirus in just the first quarter of 2020.
  • Appropriation of innocuous terms. Beginning in early 2020, jogger became an anti-Black epithet following the tragic killing of Ahmaud Arbery. Seemingly ambiguous phrases are similarly and regularly weaponized across the web. Another example we’ve seen is remove kebab, a phrase likely overlooked by many detection systems that’s been used as a call for ethnic cleansing.
  • Ideograms. White supremacist groups employ a diverse numerical lexicon to communicate amongst themselves. Substituting the number 8 as a stand-in for the letter H (the eighth letter of the alphabet) allows the string 88 to become shorthand for Heil Hitler. The same groups have also been known to use the number 109 to refer to the number of locations they claim Jews have been expelled from, and the number 110 as a call to expel Jews from a new location.
  • Logograms. While sometimes laughably convoluted, emoji can be used to convey complex sentiments while avoiding detection by traditional moderation systems. Case in point: 🧀🍕 is sometimes used in place of “cheese pizza,” which is shortened to “CP,” which in turn is de-abbreviated into “child pornography.”

And that’s just English. While some phrases can be translated, many are specific to individual languages. And even within one language, meaning can vary by community. For example, talking about killing people generally raises flags, but in a video game community, it may simply refer to playing a first-person shooter.

Not to put too fine a point on it, but detecting abuse is infinitely more complex than matching keywords.
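
To make that concrete, here is a minimal sketch of keyword matching with a simple normalization pass. It is not a production system, and the placeholder term badword stands in for any real slur on a blocklist. The point is that character substitutions slip past naive matching, and that even a modest normalization fix immediately starts producing false positives of its own.

```python
# A minimal sketch, not a production moderation system. "badword" is a
# placeholder standing in for any slur a platform might put on a blocklist.
import re

BLOCKLIST = {"badword"}

# A handful of common character substitutions; real evasions go far beyond this.
SUBSTITUTIONS = str.maketrans({"@": "a", "0": "o", "1": "i", "3": "e", "$": "s"})

def naive_match(text: str) -> bool:
    """Flag text only if a blocklisted term appears verbatim as a word."""
    words = re.findall(r"\w+", text.lower())
    return any(word in BLOCKLIST for word in words)

def normalized_match(text: str) -> bool:
    """Undo a few character tricks before matching."""
    cleaned = text.lower().translate(SUBSTITUTIONS)
    cleaned = re.sub(r"[^a-z]", "", cleaned)          # drop spacing and punctuation tricks
    cleaned = re.sub(r"(.)\1{2,}", r"\1\1", cleaned)  # collapse long character repeats
    return any(term in cleaned for term in BLOCKLIST)

print(naive_match("you are a b@dw0rd"))             # False: the substitution evades the list
print(normalized_match("you are a b@dw0rd"))        # True: normalization catches this one
print(normalized_match("you are a b a d w o r d"))  # True: the spacing trick is caught too
print(normalized_match("what a bad word choice"))   # True: an innocent sentence gets flagged
```

Each layer of normalization widens the net in both directions, which is exactly the tension the next section describes.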

True or false

The above examples represent one half of the challenge: preventing false negatives. Any system needs to be able to detect these as abuse and not overlook them. But then there’s still the opposite challenge: avoiding false positives. This is where content is flagged for using abusive language, even if the intent is anything but.

Ever used a phrase like I just want to die to express embarrassment or I’m gonna kill you to show playful anger? Taken in isolation, these could easily be interpreted as having intent to harm. In those cases, flagging the content might lead to an unnecessary suspension. By no means ideal, but understandable nevertheless.

But what if someone uses a racial slur to describe their experience as the target of a verbal attack? Flagging them not only creates a “blame the victim” scenario, but also silences an innocent, and vulnerable, voice. This is the exact opposite of what online communities are trying to accomplish.

Context is everything

In many cases, what separates the false positives and negatives from legitimately abusive content is context. And many companies choose to attach that context by having a human review user-generated content. The false positives above, taken as part of a larger conversation and reviewed by a human, would more easily be identified as such. But to a machine indiscriminately plowing through isolated data sets, that might not be the case.

Human moderation, though, has its own challenges. Make it someone’s job to review abusive content every day, and they will suffer. In 2018, a former content moderator for Facebook sued the company for creating unsafe working conditions, citing psychological trauma and post-traumatic stress disorder. More recently, a former YouTube moderator brought similar claims against that company. For the moderators experiencing these effects day in and day out, there’s a huge emotional and psychological cost.

Further, the systems provided to human moderators are subpar. In our experience, traditional tools funnel content to moderators for review with little structure, categorization, or context, serving it up in a disjointed manner. This forces moderators to work row by row, jumping between different types of content with different levels of severity, making it difficult to identify emerging patterns. And if they’re fed too many false positives — whether from user reports or automated systems — the problem only compounds.

Even if you were able to solve these problems by shielding moderators from the intensity of the content and providing better filtering and categorization, there remains the issue of volume. At the absolute best, a content moderator could review 3,000 comment-style pieces of content a day (that equates to a judgment every ~10 seconds for eight hours straight). That’s a drop in the bucket for the larger platforms, which see millions or billions of pieces of content daily.
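
For a rough sense of scale, the arithmetic behind that ceiling looks like this; the daily platform volume below is a hypothetical figure chosen purely for illustration.

```python
# Back-of-the-envelope math on moderator throughput. The platform volume is a
# hypothetical number for illustration, not a figure from the article.
seconds_per_decision = 10              # one judgment every ~10 seconds
shift_seconds = 8 * 3600               # eight hours straight, no breaks
decisions_per_day = shift_seconds // seconds_per_decision
print(decisions_per_day)               # 2880, roughly the 3,000 ceiling cited above

daily_platform_volume = 500_000_000    # hypothetical volume for a large platform
moderators_needed = daily_platform_volume / decisions_per_day
print(f"{moderators_needed:,.0f}")     # ~173,611 full-time reviewers, before breaks or error
```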

An expensive problem

At its core, the content moderation problem is one of detection. Solving detection at scale is a very expensive endeavor, whether through human review or by hiring machine learning talent and attempting to build custom tools in-house. And even then, there’s no guarantee of success. Most companies simply don’t have the resources to invest to the extent the problem requires. As a platform’s user base grows, the issue becomes exponentially more difficult to tackle. And you’d be hard-pressed to find a company with a community platform that wants its primary focus to be content moderation tooling rather than building an awesome, unique community.

Even if a company hosting bullying, hateful, or otherwise abusive content had no moral qualms about it, its presence erodes user trust. As users lose confidence in a platform’s concern for their wellbeing, many leave. 30% of adults stop using a platform once they’ve been harassed on it. As much as we’d like to believe the human costs — the traumatized users and moderators — would be enough, that loss of engagement is a cost that speaks to all businesses.

The path forward

We believe that protecting large, diverse communities requires contextual understanding. And that’s the basis of the work we’re doing at Sentropy — providing a unifying, purpose-built solution that detects harmful content and defends digital communities against it.

We’ve done that by using our expertise in machine learning and human intelligence to create technology that learns from and adapts to the evolution of language online. Instead of recognizing abusive language after the fact, we’re learning it as it develops. We gather contextual signals from all corners of the internet — everything from the dark web to the most common social media platforms. And because bias can be a vulnerability, we’ve taken steps to actively combat it.

Just as other technology companies have created new avenues for businesses — think Stripe for payments or Twilio for communications — we’ve built the moderation infrastructure that can help communities thrive. We do this through our Detect API, which allows customers to integrate our cutting-edge detection capabilities into existing moderation workflows. No expensive and time-consuming in-house development needed, just world-class detection technology that’s ready to get to work empowering your moderators, community managers, and Trust & Safety team.
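
As a rough illustration of what that kind of integration can look like, here is a hypothetical sketch of calling an abuse-detection endpoint from a moderation pipeline. The URL, request fields, and response shape are invented for the example; they are not Sentropy’s documented API, which is described in the Detect documentation.

```python
# Hypothetical integration sketch. The endpoint, fields, and thresholds below
# are illustrative assumptions, not Sentropy's documented Detect API.
import requests

API_URL = "https://api.example.com/v1/classify"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"                          # placeholder credential

def classify(text: str) -> dict:
    """Send a piece of user-generated content for classification."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

def route(comment: str) -> str:
    """Route content to a moderation queue based on an assumed confidence score."""
    score = classify(comment).get("score", 0.0)
    if score > 0.9:
        return "auto-action"    # high-confidence abuse: act immediately
    if score > 0.5:
        return "human-review"   # ambiguous: give a moderator the context
    return "publish"            # likely benign: let it through
```

The structural point is that classification happens upstream of the queue, so scores become routing signals that decide what reaches a human reviewer at all.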

And for platforms seeking a complete solution, we also provide Defend, a browser-based interface for accessing Detect’s intelligence. With Defend, moderators can identify specific types of abuse, discover behavioral trends, and make more efficient, consistent decisions. Perhaps most importantly, as platforms use Sentropy to make moderation decisions, our technology learns in real time to tailor classifications to that community, helping bring attention where it’s needed most.

As much as communities struggle with abuse, we still see the good in them. They give voice to the underrepresented, provide safe places for shared interests to flourish, and help connect us across boundaries. Abusive content may present a daunting challenge, but know it can be overcome. And as we address the problem, one community at a time, we’ll all have an internet that’s safer and more welcoming for everyone.
