How to build a content moderation system

Sentropy Technologies · May 7, 2021

Anyone who is managing a platform with user-generated content knows the risks. Insults, spam, hate speech, profanity, and other types of abuse seem to appear as soon as you acquire users.

You were hoping this wouldn’t happen so quickly, but you have valid concerns about user churn — and your platform’s reputation. The last thing you need is to become known for having degrading or disgusting content.

How can you prevent toxic content in the first place? And how can you do it without spending an arm and a leg?

This guide will show you how to build a content moderation system as cheaply and efficiently as possible. It will outline several approaches, including profanity filters and machine learning. It will also provide estimates on cost & labor requirements.

Let’s dig in.

Auto-flagging inappropriate messages allows you to foster an engaged community.

Why does content moderation matter?

If your business relies on user-generated content, it’s extremely important to proactively detect abuse and toxic content. Content moderation is a key component for social communities, forums, dating & chat apps, multiplayer games, and marketplaces.

Without moderating your content, you run several risks:

  • Damage to reputation. Every user is a potential advocate. But when toxic messages go unchecked, users’ expectations change. They expect more low-quality content on your platform, which causes long-term damage to your brand’s reputation.
  • Toxic users undermine the bottom line. 27% of users disengage or churn when they see inappropriate content. Fewer users means slower growth.
  • Some content is illegal. For example, FOSTA-SESTA prohibits platforms from hosting content that sells sexual services, under penalty of imprisonment for up to 25 years.

First things first: Lay the groundwork with smart community guidelines

Community guidelines are the backbone of an engaged community. They lay out the rules in a way that users can understand. If all users follow your community guidelines, your community should continue to grow.

However, when people don’t follow the guidelines, platform admins need to act quickly. For the sake of the community. And the business.

If you don’t have them already, here’s a fantastic guide to creating smart community guidelines.

Building content moderation: Keywords or Machine Learning?

This is your first decision point. Before deciding which of the two paths to go down, consider these factors about your situation:

  • How important is UGC to your business?
  • How much content do you expect over the next 12 months?
  • What’s your budget?

There are two main options for detecting inappropriate content:

Keyword filters are lists of words or phrases matched with regular expressions (AKA regex). While they can be built easily and quickly, users often circumvent these filters by replacing “s” with “$” and so on. Additionally, they miss insults & threats that don’t contain a trigger word. For these reasons, keyword filters are best suited to non-social features, like contact forms.

Context-aware AI understands the nuance of natural language and flags toxic messages accordingly. While difficult to build, AI is far more effective in detecting toxic content, and it becomes smarter as you continue using it. This is the preferred method for every platform whose business model depends on user-generated content.

Due to its contextual understanding, Machine Learning is recommended for platforms that depend on user-generated content.

Option 1: Keyword filters

Pros

  • Inexpensive for basic filtering. You can even find some Python keyword filter packages online for free.
  • Good if you have zero tolerance for profanity. There’s a difference between “you’re a piece of shit” and “oh shit, I forgot the meeting.” But if you want to auto-block both, profanity filters are good for that.

Cons

Keyword filters don’t understand context. This causes a lot of problems:

  • Can’t detect phrases without trigger words. For example, “You people aren’t so bright” could be understood as potentially abusive. What keyword would you flag? None of these words function as effective triggers, because each word is commonly used in non-abusive situations.
  • Can’t distinguish between abusive and non-abusive profanity. For example, in “you people can go to hell” and “aw hell I missed again,” the profane word has very different meanings. Most platforms would want to flag the former.
  • Requires constant maintenance. New slang, insults, and slurs are constantly emerging. Keyword filters must be constantly updated.

How to build keyword moderation:

Building keyword filters requires two skill sets. A back-end developer is required to write the filtering code, and a front-end developer is needed to build an interface where non-technical users can interact with it.

Step 1: Use regular expressions (regex)

The complexity of written language (especially online) allows for seemingly endless variations of each word. For example, if you want to block the word “shit,” you probably also want to block “$hit,” “shiiiit,” “s h i t,” “sh!t,” “sht,” “5H|T,” and all the other variations that communicate the word.

These character variations make regex-based filtering time-consuming to build and maintain — a major drawback of this approach.

The fastest way to build a list of keyword filters is with a back-end engineer who has experience writing regular expressions. You can feasibly expect an experienced engineer to build a decent list in 2 months.
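To make this concrete, here’s a minimal sketch (in Python) of how a regex can be generated to tolerate common character substitutions, repeated letters, and separators. The substitution map is illustrative, not exhaustive — a real filter would need many more entries and many more words:

```python
import re

# Map each letter to a character class covering common substitutions.
# This substitution map is illustrative, not exhaustive.
SUBSTITUTIONS = {
    "s": "[s$5z]",
    "h": "[h#]",
    "i": "[i1!|]",
    "t": "[t7+]",
}

def build_pattern(word: str) -> re.Pattern:
    """Build a case-insensitive pattern that tolerates substitutions,
    repeated letters ("shiiiit"), and separators ("s h i t")."""
    parts = []
    for ch in word:
        cls = SUBSTITUTIONS.get(ch, re.escape(ch))
        parts.append(cls + r"+[\s._-]*")  # allow repeats and separators
    return re.compile("".join(parts), re.IGNORECASE)

PATTERN = build_pattern("shit")

for text in ["oh $hit", "s h i t", "shiiiit happens", "shirt", "a mishit shot"]:
    print(f"{text!r} -> {bool(PATTERN.search(text))}")
# 'shirt' is not flagged, but 'a mishit shot' is -- a classic false positive.
```

Note the last example: patterns this permissive inevitably flag innocent words that happen to contain the target letters, which is one more reason keyword filters need constant tuning.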

Step 2: Build a moderation interface

Now that you have a list of words and phrases that will be flagged, you need a way to interact with that data. A good moderation interface will allow you to add new words and variations, moderate the messages that were flagged, and build automations.
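To illustrate the data interactions behind such an interface, here’s a minimal backend sketch using Flask. The framework choice, endpoint names, and in-memory storage are assumptions for illustration only — a real system would persist everything in a database:

```python
# A minimal moderation-backend sketch; a front-end UI would call these endpoints.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stores for illustration; a real system would use a database.
keywords: set[str] = {"badword"}
flagged_messages: list[dict] = [
    {"id": 1, "text": "example flagged message", "status": "pending"},
]

@app.post("/keywords")
def add_keyword():
    # Add a new word or variation to the filter list.
    word = request.get_json()["word"].lower()
    keywords.add(word)
    return jsonify(sorted(keywords)), 201

@app.get("/flags")
def list_flags():
    # Show moderators the messages awaiting review.
    return jsonify([m for m in flagged_messages if m["status"] == "pending"])

@app.post("/flags/<int:flag_id>/resolve")
def resolve_flag(flag_id: int):
    # Mark a flagged message as approved, removed, etc.
    for m in flagged_messages:
        if m["id"] == flag_id:
            m["status"] = request.get_json().get("action", "approved")
            return jsonify(m)
    return jsonify({"error": "not found"}), 404

if __name__ == "__main__":
    app.run(debug=True)
```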

Building a moderation UI will require a front-end developer. You can feasibly expect a front-end developer to build a patchwork but effective interface in 1 month.

Bottom line:

Keyword filters don’t understand context. Their low overall effectiveness makes them suitable only for platforms that don’t rely heavily on user-generated content.

Free resources:

  • You can learn and test regular expressions with free online tools like RegexOne, Regexr, and Regex101.
  • There are some free regex package downloads, like the “badwords” project on GitHub (available under Creative Commons Attribution 3.0) and this Python library. These may be helpful to some degree, but they are free options and their usability and functionality cannot be guaranteed.

Option 2: Context-aware Machine Learning

Pros

  • Extremely accurate when optimized well
  • Understands context
  • Understands l33t speak and character substitutions

Cons

  • Expensive and time-consuming to build
  • Requires a data scientist or similar expertise

How to build a lean content moderation system using ML

Building ML can get expensive quickly, so let’s make this approach as lean and cheap as possible.

Step 1: Decide what topics you want to detect

User-generated conversations could be about anything: your dog, the gossip from the all-hands meeting, Obamacare, or the latest episode of Billions.

Similarly, abusive messages could be about anything: sexual aggression, physical violence, insults, hate speech, spam, self-harm, grooming, etc. The list goes on.

Each has different vocabulary and different contextual signals.

Therefore, it’s impractical to build one ML classifier that accurately detects the full range of abuse across the full range of user-generated conversations; a single catch-all classifier would be highly inaccurate.

That’s why ML works best when it monitors for one specific type of abuse. For example, ML is more accurate when it’s trained to look for physical violence (a specific type of abuse with a set of common characteristics), rather than abuse in general.

Therefore, you must decide: what’s most important to flag? For most platforms with social/chat/community features, the following four ML classifiers are often a good starting point:

  • Physical violence
  • Sexual aggression
  • Hate speech
  • Insult

Step 2: Crowd-source the test sets

Now that you’ve decided what to detect, you need to create a baseline “truth” for your ML.

You’ll need a large, accurately labeled test set to start training your ML models. Ideally, begin with at least 10,000 labeled rows of data. But we’re taking a lean approach here, so let’s estimate 5,000 labeled rows as a bare minimum. For reference, professional moderation services like Sentropy use millions of labels to fuel pinpoint-accurate detection.

The leanest way to annotate your data: use Amazon Mechanical Turk (MTurk) to crowdsource the test sets.

MTurk is very affordable, but the risk is that any individual worker could label your test set incorrectly. Therefore, it’s best to rely on a consensus instead of a single annotator. Let’s estimate that 5 MTurk workers annotate your test set, so you have a consensus label for each row.

Five MTurk annotators labeling 5,000 rows comes out to around $750 per classifier, or $3,000 for the four classifiers.
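Here’s a minimal sketch of how you might resolve those five labels per row into a single consensus label by majority vote, along with the cost arithmetic behind the estimate (the roughly $0.03 per-label rate is an assumption implied by these totals, not a quoted MTurk price):

```python
from collections import Counter

def consensus(labels_per_row: list[list[int]]) -> list[int]:
    """Resolve each row's annotator votes (1 = abusive, 0 = not) by majority."""
    return [Counter(votes).most_common(1)[0][0] for votes in labels_per_row]

# Example: 5 MTurk annotators per row, 3 rows.
rows = [
    [1, 1, 0, 1, 1],  # clear consensus: abusive
    [0, 0, 0, 1, 0],  # clear consensus: not abusive
    [1, 0, 1, 0, 1],  # 3-2 split: majority wins
]
print(consensus(rows))  # [1, 0, 1]

# Rough cost check. The ~$0.03 per-label rate is an assumption implied by
# the totals above, not a quoted MTurk price.
rows_needed, annotators, per_label = 5_000, 5, 0.03
per_classifier = rows_needed * annotators * per_label
print(per_classifier, per_classifier * 4)  # 750.0 per classifier, 3000.0 for four
```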

Of course, with new slang and new linguistic forms of abuse appearing every day, it’s a best practice to update this test set regularly. Professional moderation services like Sentropy update theirs every week. For a very lean (and less current) approach, you could choose to update it once every 2 months.

Fixed cost running total: $1,500/month ($3,000 every 2 months)

Step 3: Train the models

The next step is to run your ML against the test set and optimize its performance. For this, you will need a data scientist or someone with experience creating ML models.

In addition to technical expertise, you’ll have monthly costs associated with your ML infrastructure. Realistically for this scenario, you will want 4–5 GPUs, but let’s estimate conservatively at 3 GPUs. Given 4 classifiers running on 3 fully-utilized GPUs, you can expect $200 per month for ML-related infrastructure.
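As a rough illustration of what training one of these binary classifiers looks like, here’s a minimal sketch using scikit-learn’s TF-IDF and logistic regression as a stand-in. In practice you’d likely fine-tune a heavier (e.g. transformer-based) model, which is what the GPU costs above assume; the file and column names here are placeholders:

```python
# File name and the "text"/"label" column names are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Consensus-labeled rows from Step 2, for one classifier (e.g. hate speech).
data = pd.read_csv("hate_speech_labels.csv")  # columns: text, label (0/1)

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"],
    test_size=0.2, stratify=data["label"], random_state=42,
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)

# Evaluate on the held-out slice of the labeled data.
print(classification_report(y_test, model.predict(X_test)))
```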

Fixed cost running total: $1,700/month

Also consider: Sentropy provides a suite of high-performance content moderation tools for a fraction of the price of building. It’s helping companies of all sizes take a proactive approach to user-generated content, with one API and a moderation console.

Step 4: Production inference

Now that the models are trained, it’s time to deploy them into production. Each time a user submits a message, the ML models need computing power to deliver a classification. As your volume grows, so will your GPU requirements. Most smaller communities will require at least 5 GPUs and 5 CPUs for ongoing production, which can be estimated at $800 per month on Google Cloud.
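Here’s a minimal sketch of what serving those classifiers in production might look like, using FastAPI. The framework choice, model paths, and threshold are assumptions; the point is that each incoming message gets scored by every classifier before it’s posted:

```python
# Framework choice, model paths, and the threshold are assumptions.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# One trained model per abuse type from the previous step.
CLASSIFIERS = {
    name: joblib.load(f"models/{name}.joblib")
    for name in ["physical_violence", "sexual_aggression", "hate_speech", "insult"]
}
THRESHOLD = 0.8  # tune per classifier against your labeled test set

class Message(BaseModel):
    text: str

@app.post("/classify")
def classify(message: Message):
    # Score the message with every classifier and flag it if any score is high.
    scores = {
        name: float(model.predict_proba([message.text])[0][1])
        for name, model in CLASSIFIERS.items()
    }
    return {"scores": scores, "flagged": any(s >= THRESHOLD for s in scores.values())}
```

You could run this with uvicorn (e.g. `uvicorn moderation_api:app`, where the module name is a placeholder) and call it from your message pipeline before content goes live.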

Fixed cost running total: $2,500/month

Step 5: Build a moderation interface

Now that your classifiers are flagging messages, you need a way to interact with that data. A moderation interface will allow you to take action on flagged messages, create automations, and adjust the sensitivity of your classifiers. This allows non-technical admins to manage your community’s user-generated content.

Building a moderation UI will require a front-end developer. You can feasibly expect a front-end developer to build a patchwork moderation interface in 1 month.

Step 6: Calculate engineering hours

So far, we have created an extremely lean ML content moderation system by cutting costs wherever possible. However, we haven’t considered the cost of engineering. When estimating engineering hours, take into consideration:

  • Back-end: Gathering data & assembling the test set
  • Back-end: Setting up GCP GPUs and CPUs
  • Data science: Training and optimizing ML models
  • Back-end: Deploying ML models into production
  • Front-end: Building a moderation UI

Total costs: $2,500/month + hours from Back-end, Data Science, and Front-end engineers

Save time & money by using Sentropy

If you’re looking to save time & money on content moderation ML, consider a vendor like Sentropy, which makes the world’s most accurate content moderation tools. Their context-aware ML flags inappropriate messages in real-time — allowing you to swiftly identify toxic users and take action on inappropriate content. It’s a proven option for companies that need to moderate content effectively and efficiently.

We all deserve a better internet. Sentropy helps platforms of every size protect their users and their brands from abuse and malicious content.