Token Bias in Subword-Based Hate Speech Classifiers

Sentropy Technologies · Published in Sentropy · Sep 16, 2020

Disclaimer: this post references obscene language pertaining to hate speech.

By: Emma Peng

Introduction

Several studies (Kennedy et al., Wiegand et al., Dixon et al., among others) have pointed out the issue of word-level bias in hate speech datasets. For instance, Kennedy et al. suggest that group identifiers such as “gay” or “black” are overrepresented in the positive class of two hate speech datasets. They found that classifiers trained on such imbalanced datasets struggle on negative examples that contain the overrepresented group identifiers. Such biases manifest as undesirable false positives when these identifiers are present.

In this blog post, we show that bias caused by overrepresentation exists not only at the word level, but also at the subword level, where it can be much harder to detect and combat. We conclude by discussing two ways to overcome this issue.

Subword-Based Models

Many state-of-the-art language models such as BERT and GPT-2 are subword-based: instead of splitting (or tokenizing) input text into words, a tokenizer further splits words into subwords and uses them as inputs to the model. For instance, the base GPT-2 tokenizer tokenizes “Glimpses into the lives of Beirut residents” as shown below, where Ġ signifies a space before the token in the original input:
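A minimal sketch of this tokenization, assuming the Hugging Face transformers library (the exact splits come from GPT-2's learned BPE merges):

    # Inspect GPT-2's BPE subword splits (assumes the Hugging Face
    # transformers library is installed).
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    print(tokenizer.tokenize("Glimpses into the lives of Beirut residents"))
    # Rare words such as "Glimpses" and "Beirut" are split into multiple
    # subword pieces, while common words like "into" and "the" stay whole;
    # the Ġ prefix marks a preceding space in the original text.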

Subword tokenization algorithms rely on the principle that most common words should be left as is, but rare words should be decomposed into meaningful subword units. For instance, the word “libtard,” recently coined to insult political liberals in online communities and derived from the more common insult, “retard,” may be treated as a rarely observed word and thus decomposed as “lib”, “t” and “ard” during modeling. This allows the model to keep a reasonably-sized vocabulary while still learning useful representations for common words or subwords. The use of subword tokenization also enables the model to process words it has never seen before, by decomposing them into subwords it knows.

Subword Bias

Figure 1 shows two examples that are classified as hate speech by a fine-tuned GPT-2 classifier, although only the first example contains hateful content. Since “muslim” is a group identifier that appears much more frequently in the positive class of hate speech datasets, the classifier becomes overattentive to the subwords in “muslim”, namely “mus” and “lim”. The bottom example contains no group identifiers, but the word “glimpses” is broken into subwords that overlap with those of “muslim”, and this oversensitivity biased the model to incorrectly predict it as hate speech. To confirm the contribution of the word “glimpses” to the overall score, we neutralized it by masking it with a generic token and running the example through the same model. With this change, the positive-class probability dropped from 0.866 to close to 0.

Figure 1: Two examples classified as hate speech by a fine-tuned GPT-2 hate speech classifier. Overrepresented subwords are highlighted in dark blue. The second example, which is not hateful, illustrates the danger of subword overfitting.

In the rest of the post, we discuss how to discover subword imbalance within a labeled dataset, followed by ways to mitigate subword bias in datasets and models.

Discovering Subword Bias

Dataset

For our analysis, we use the Jigsaw Unintended Bias in Toxicity Classification dataset, which contains around 2 million annotated comments from the Civil Comments platform. Each comment is either annotated as not hateful, which we refer to as other, or belongs to one or more of six toxicity classes: severe_toxicity, identity_attack, obscene, threat, insult, and sexually_explicit. The identity_attack class contains comments that attack individuals or groups based on their membership in a protected or vulnerable group (e.g., ethnicity, gender, race, sexual orientation, and others). To simplify the analysis, we convert the dataset to a binary one by focusing only on identity_attack. The converted dataset contains a total of 8.4K positive examples. To avoid making the dataset too imbalanced, we sampled 30K negative examples (class other) from the original not-hateful comments.
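A rough sketch of this conversion, assuming pandas and the column names from the Jigsaw CSV release; the 0.5 threshold on the annotation is an illustrative choice, not necessarily the one used here:

    import pandas as pd

    df = pd.read_csv("train.csv")  # Jigsaw Unintended Bias training data

    # Binarize: a comment is positive if its identity_attack annotation
    # crosses a threshold (0.5 here, purely as an illustrative choice).
    df["label"] = (df["identity_attack"] >= 0.5).astype(int)

    positives = df[df["label"] == 1]                                   # ~8.4K examples
    negatives = df[df["label"] == 0].sample(n=30_000, random_state=0)  # sampled "other"
    binary_df = pd.concat([positives, negatives]).sample(frac=1, random_state=0)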

Detecting Overrepresented Subwords

To find potentially overrepresented subwords in the positive class, we first tokenize the training split of the Jigsaw dataset into subwords, using the BPE tokenizer associated with a pretrained GPT-2 language model. Then we calculate a “hate score” based on the approximate Point-wise Mutual Information (PMI) between each subword w and the positive class.
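As one plausible instantiation of this score (an assumption on our part, estimating PMI from document-level subword counts with add-one smoothing):

    import math
    from collections import Counter
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def hate_scores(texts, labels):
        """Approximate PMI between each subword w and the positive class:
        PMI(w, pos) = log P(w | pos) - log P(w), estimated from document-level
        subword counts with add-one smoothing."""
        pos_counts, all_counts = Counter(), Counter()
        n_pos, n_all = sum(labels), len(labels)
        for text, label in zip(texts, labels):
            subwords = set(tokenizer.tokenize(text))
            all_counts.update(subwords)
            if label == 1:
                pos_counts.update(subwords)
        vocab = len(all_counts)
        return {
            w: math.log((pos_counts[w] + 1) / (n_pos + vocab))
               - math.log((all_counts[w] + 1) / (n_all + vocab))
            for w in all_counts
        }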

Intuitively, the score captures whether a subword co-occurs with the positive class significantly more often than with the negative class. Table 1 shows the 20 subwords with the highest approximate PMI in the training set.

Table 1: The 20 subwords with the highest approximate PMI in the training set.

We hypothesize that subwords with high PMI are likely overrepresented in the dataset, and a classifier trained on the dataset is more likely to make a false positive prediction if an example contains one or more of those subwords.

Subword Bias in Hate Speech Classifiers

Model Training

We fine-tuned a pretrained GPT-2 language model on the training split of the Jigsaw dataset, with a linear classification layer added on top of the language model to predict the probability of an example being identity_attack.
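A minimal sketch of such an architecture, assuming PyTorch and the Hugging Face transformers library; the pooling strategy and any hyperparameters are our assumptions rather than the exact setup used in this analysis:

    import torch
    import torch.nn as nn
    from transformers import GPT2Model

    class GPT2HateClassifier(nn.Module):
        """GPT-2 backbone with a linear head predicting P(identity_attack)."""

        def __init__(self, model_name="gpt2", num_labels=2):
            super().__init__()
            self.gpt2 = GPT2Model.from_pretrained(model_name)
            self.head = nn.Linear(self.gpt2.config.n_embd, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.gpt2(input_ids, attention_mask=attention_mask).last_hidden_state
            # Summarize each sequence with the hidden state of its last
            # real (non-padding) token, assuming right padding.
            last_idx = attention_mask.sum(dim=1) - 1
            summary = hidden[torch.arange(hidden.size(0)), last_idx]
            return self.head(summary)  # logits; softmax gives class probabilities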

Jigsaw Test Set Performance

We first evaluated the classifier on the test split of the Jigsaw dataset, which contains 3,643 comments, of which 751 are positive. Below is the precision-recall curve as well as a detailed classification report of the classifier:
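As a reference point, here is a sketch of how such an evaluation can be produced with scikit-learn and matplotlib, where y_true and y_prob are assumed NumPy arrays of gold labels and predicted identity_attack probabilities:

    import matplotlib.pyplot as plt
    from sklearn.metrics import classification_report, precision_recall_curve

    def evaluate(y_true, y_prob, threshold=0.5):
        """Plot a precision-recall curve and print a classification report
        for gold labels y_true and predicted probabilities y_prob."""
        precision, recall, _ = precision_recall_curve(y_true, y_prob)
        plt.plot(recall, precision)
        plt.xlabel("Recall")
        plt.ylabel("Precision")
        plt.show()
        print(classification_report(y_true, (y_prob >= threshold).astype(int)))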

The model appears to be doing well based on the test set performance. Looking closely at the false positives, however, we can see that the model is oversensitive to word-level group identifiers such as “black”, “white”, and “muslim”. For instance, Figure 2 shows two high-probability false positive predictions from the classifier:

Figure 2: False positives from the Jigsaw test set. Group identifiers are highlighted in blue.

However, we do not see the same kind of bias at the subword level. We can think of two potential explanations:

  • The model is actually not suffering from subword bias
  • The test set shares the training set’s subword imbalance. In other words, we don’t see false positives with overrepresented subwords such as “igger” because the negative class of the test set contains no examples with those subwords.

To verify which one of the two hypotheses is true, we decided to build an adversarial dataset to further test the model’s robustness against overrepresented subwords.

Constructing an Adversarial Dataset

Adversarial datasets are usually created by manipulating standard benchmark datasets so that system performance degrades while human performance does not. Many such datasets have been built for various NLP tasks to test whether a system is robust to noisy data (Belinkov et al., Ebrahimi et al., Jia et al., Naik et al., among others). For text classification, common techniques for generating adversarial examples include:

  • Injecting word level noise such as typos and misspellings
  • Substituting named entities
  • Replacing sentiment words with synonyms (label-preserving) or antonyms (label-flipping)

In this analysis, we built an adversarial dataset of negative examples by sampling data from a different domain, to test if the model is robust against examples that:

  • Are slightly out-of-distribution, and
  • Contain over-represented subwords

Specifically, we sampled around 176K tweets from the following accounts on Twitter (excluding retweets):

  • Popular news accounts such as the New York Times, BBC, and The Economist. Tweets from those accounts are usually news headlines
  • Verified user accounts

Our assumption is that mainstream news outlets and verified users are unlikely to post hateful content on Twitter. As an extra precaution, we applied a filter to remove any tweets containing slurs from the sample; only 38 tweets were removed as a result.

We took the top 140 overrepresented subwords (approximate PMI > 0.7) and further filtered the dataset down by keeping only tweets that contain at least one subword in the target set. After the filtering, we were left with a total of 7,696 tweets, among which 60% are from news accounts, and the rest from verified users.
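A sketch of this filter follows; the target_subwords and tweets below are illustrative stand-ins for the ~140 high-PMI subwords and the cleaned Twitter sample described above:

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Illustrative stand-ins for the real target set and tweet sample.
    target_subwords = {"mus", "lim"}
    tweets = ["Glimpses into the lives of Beirut residents",
              "Markets opened higher this morning"]

    def contains_target_subword(text):
        # Strip the leading-space marker Ġ so a subword matches wherever it occurs.
        pieces = {tok.lstrip("Ġ").lower() for tok in tokenizer.tokenize(text)}
        return bool(pieces & target_subwords)

    adversarial_set = [t for t in tweets if contains_target_subword(t)]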

Discovering Subword Bias by Input Occlusion

We ran all the adversarial examples through the trained classifier. Out of the 7,696 examples, we found 1,447 false positive predictions (18.8%), with the probability threshold set to 0.5. Next, we wanted to identify which of these false positives were caused by subword bias.

Subword Importance via Input Occlusion

To measure the contribution of a subword to the positive-class score, we implemented a technique called input occlusion, proposed by Li et al. For each overrepresented subword in a false positive example, we generate a new example by masking that subword token with the padding token and run it through the classifier. We conclude that a false positive is caused by subword bias if any of the generated masked examples receives a score below 0.5.
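A minimal sketch of this occlusion check, reusing the classifier sketched earlier; model, tokenizer, and pad_id (the id of the padding token) are assumed to come from that setup:

    import torch

    def occlusion_check(text, target_subwords, model, tokenizer, pad_id):
        """Return the original P(identity_attack) and the probability obtained
        when each overrepresented subword in the text is masked with the pad token."""
        ids = tokenizer.encode(text)

        def positive_prob(token_ids):
            input_ids = torch.tensor([token_ids])
            attention_mask = torch.ones_like(input_ids)
            with torch.no_grad():
                logits = model(input_ids, attention_mask)
            return torch.softmax(logits, dim=-1)[0, 1].item()

        original = positive_prob(ids)
        masked = {}
        for i, tok_id in enumerate(ids):
            subword = tokenizer.convert_ids_to_tokens(tok_id).lstrip("Ġ")
            if subword in target_subwords:
                masked[subword] = positive_prob(ids[:i] + [pad_id] + ids[i + 1:])
        # The false positive is attributed to subword bias if the original score
        # is above 0.5 but any masked score drops below it.
        return original, masked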

Figure 3 shows some of the false positive predictions caused by subword bias — the original score shows the predicted positive class probability of the original example, while the masked score corresponds to the probability of the example with the highlighted subword token masked. The subwords are either from group identifiers (e.g. “Islam”, “blacks”, “black”), or terms that commonly appear in a hateful context (e.g. “niger”, “homophobia”, “transphobia”).

Figure 3: Examples of false positives with different subwords. Subwords in the examples are highlighted in blue.

Aggregated Bias Score

In addition to looking at individual false positive predictions, we compute an aggregated bias score S for each overrepresented subword.
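One plausible instantiation of S (an assumption on our part, not necessarily the exact formula used in the analysis) averages the occlusion-induced probability drop over the false positives that contain the subword:

    from collections import defaultdict

    def aggregated_bias_scores(occlusion_results):
        """occlusion_results: iterable of (subword, original_prob, masked_prob)
        tuples, one per overrepresented subword found in a false positive.
        Returns S(w): the average positive-class probability attributable to w."""
        drops = defaultdict(list)
        for subword, original, masked in occlusion_results:
            drops[subword].append(original - masked)
        return {w: sum(d) / len(d) for w, d in drops.items()}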

Intuitively, this score measures the overall contribution of a subword to false positive predictions. The following table shows the subwords with the highest bias scores:

Table 2: Subwords with the highest bias scores

Mitigating Subword Bias

Now that we have identified subword bias in our model, what can we do to mitigate the bias? Here we propose two directions:

  • Data Augmentation. More often than not, model bias is a result of dataset bias. To fix the model, we can try to make the dataset we train our model on more balanced, via data augmentation. As we have shown, when certain tokens, either at the word or subword level, are overrepresented in the dataset, models may be overattentive to such tokens and make false positive predictions. One simple data augmentation idea is to collect negative examples that contain those tokens, to compensate for their overrepresentation in the positive class.
  • Bias-Aware Model Training. In some cases, we might not have access to additional training data, or we cannot afford the extra training time that comes with more data. Bias-aware training incorporates various bias metrics into a model’s objective function. Those metrics rely on dividing the data into subgroups. For instance, Borkan et al. divide data into identity- and demographics-based subgroups, and compute metrics by comparing each subgroup to the rest of the data, which they call the “background data”. To mitigate subword bias, we can define similar metrics over subword-based subgroups (a sketch follows this list). Check out our blog post to learn more about bias-aware model training as well as other techniques that mitigate model bias.
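As an example of what a subword-based subgroup metric could look like, here is a sketch of a subgroup AUC in the spirit of Borkan et al., restricted to examples whose tokenization contains a given subword; scikit-learn is assumed, and texts, labels, and scores are the evaluation texts, gold labels, and model scores:

    from sklearn.metrics import roc_auc_score
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def subword_subgroup_auc(texts, labels, scores, subword):
        """AUC computed only on the subgroup of examples whose tokenization
        contains `subword` (both classes must be present in the subgroup)."""
        in_group = [
            subword in {t.lstrip("Ġ") for t in tokenizer.tokenize(text)}
            for text in texts
        ]
        y = [l for l, g in zip(labels, in_group) if g]
        s = [p for p, g in zip(scores, in_group) if g]
        return roc_auc_score(y, s)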

Let us know if you have encountered subword bias in your work, or have any feedback on the analysis we presented!
