The explosion of Generative AI has handed researchers and engineers a “magic wand” for data creation. Facing a shortage of labeled training data? Just ask a Large Language Model (LLM) to generate it for you. This promise of infinite, privacy-compliant, and low-cost data is revolutionizing Natural Language Processing (NLP).

But when we move away from objective tasks—like summarizing a news article—and into the murky, subjective waters of hate speech detection, does this magic still hold up?

A recent research paper, titled “Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection,” challenges the “more is better” narrative. The researchers conducted a rigorous study not just on how well synthetic data trains models, but on what that data actually looks like.

In this post, we will break down their work. We will explore how synthetic data helps models generalize better to new environments, but importantly, we will uncover the hidden costs: the sanitization of language, the erasure of minority identities, and the unexpected biases introduced by the very safety guardrails designed to protect us.

The Problem: Data Scarcity and the Subjectivity Trap

To train a machine learning model to detect hate speech, you need examples. Thousands of them. Traditionally, this involves scraping social media (Twitter, Reddit, YouTube) and paying humans to label posts as “hateful” or “not hateful.”

This process is fraught with issues:

  1. Scarcity & Obsolescence: Social media language evolves rapidly. A model trained on data from 2016 may not recognize the slurs or context of 2024.
  2. Privacy: Using real users’ posts raises ethical concerns.
  3. Trauma: Human annotators suffer psychological distress from reading toxic content for hours.

Synthetic data generated by LLMs (like Llama 2 or Mistral) offers a solution. It’s cheap, infinite, and doesn’t require exposing humans to toxicity. However, previous research has been mixed. Sometimes synthetic data boosts performance; other times, it fails to capture the nuance of human hatred.

The authors of this paper set out to answer a specific question: If we use LLMs to paraphrase existing hate speech datasets, do we get a robust training set, or do we introduce new, invisible problems?


The Methodology: Paraphrasing as Augmentation

The researchers did not ask LLMs to invent hate speech from scratch (a request most safety-aligned models refuse). Instead, they adopted a paraphrasing approach.

They started with the Measuring Hate Speech (MHS) corpus, a high-quality dataset of social media comments annotated for hatefulness and target identities (e.g., race, gender, religion).

The Process

  1. Input: A real comment from the MHS dataset.
  2. Prompting: They used three open-source LLMs—Llama-2 Chat 7B, Mistral 7B Instruct, and Mixtral 8x7B Instruct. The prompt was simple: “Paraphrase this text: {text}”.
  3. Filtering:
  • Fuzzy Matching: If the LLM’s output was identical, or nearly identical, to the original text, it was discarded.
  • Classifier Filtering: This is a crucial step. They used a separate classifier to check whether the new synthetic text kept the same label as the original. If a “hateful” post was paraphrased into something “non-hateful,” it was flagged (a sketch of the full generate-and-filter pipeline follows this list).
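
To make this concrete, here is a minimal sketch of the generate-and-filter loop, assuming the Hugging Face transformers library. The model ID is real, but the similarity cutoff, the classifier checkpoint, and the label-matching logic are illustrative assumptions rather than the authors’ exact implementation.

```python
# Minimal sketch of the paraphrase-and-filter pipeline, not the authors' exact code.
from difflib import SequenceMatcher

from transformers import pipeline

# One of the three open-source generators used in the paper (in practice you would
# wrap the prompt in the model's chat template).
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)
# Hypothetical hate-speech classifier used only to check label preservation.
label_checker = pipeline("text-classification", model="your-org/hate-speech-roberta")

def paraphrase_and_filter(text: str, gold_label: str, similarity_cutoff: float = 0.9):
    """Paraphrase `text`; return None if the output is near-verbatim or flips the label."""
    prompt = f"Paraphrase this text: {text}"
    output = generator(prompt, max_new_tokens=128, return_full_text=False)
    candidate = output[0]["generated_text"].strip()

    # 1) Fuzzy matching: drop outputs that are (nearly) identical to the original.
    if SequenceMatcher(None, text.lower(), candidate.lower()).ratio() > similarity_cutoff:
        return None

    # 2) Classifier filtering: drop paraphrases whose predicted label changed.
    if label_checker(candidate)[0]["label"] != gold_label:
        return None

    return candidate
```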

This setup allowed the researchers to compare three distinct training scenarios (sketched in code below):

  1. Training on Original Gold Data (Real humans).
  2. Training on Synthetic Data Only.
  3. Training on a Mixture (Gold + Synthetic).
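
Assuming the gold posts and their surviving paraphrases are stored as simple CSV files with matching columns, the three configurations can be assembled with the Hugging Face datasets library; the file paths and column names below are placeholders.

```python
# Sketch of the three training configurations. The CSV paths are placeholders;
# both files are assumed to share "text" and "label" columns.
from datasets import concatenate_datasets, load_dataset

gold = load_dataset("csv", data_files="gold_mhs.csv")["train"]
synthetic = load_dataset("csv", data_files="synthetic_paraphrases.csv")["train"]

training_sets = {
    "gold_only": gold,                       # 1. real, human-annotated data
    "synthetic_only": synthetic,             # 2. LLM paraphrases only
    "mixture": concatenate_datasets([gold, synthetic]).shuffle(seed=42),  # 3. both combined
}
```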

Extrinsic Evaluation: The Numbers Look Good

First, let’s look at the quantitative results. This is usually where most papers stop. The researchers trained classifiers (specifically RoBERTa Large) on the different data mixes and tested them on three datasets (a minimal fine-tuning sketch follows the list):

  • MHS (In-Distribution): The same source as the training data.
  • MDA (Out-of-Distribution): A dataset covering Black Lives Matter, Covid-19, and the 2020 US Elections.
  • HateCheck (Out-of-Distribution): A suite of functional tests designed to expose common failure modes in hate speech classifiers.
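
For reference, a bare-bones version of such a fine-tuning run with the transformers Trainer might look like the sketch below. The hyperparameters, file path, and held-out split are illustrative assumptions, not the paper’s configuration, which also evaluates on MDA and HateCheck.

```python
# Minimal RoBERTa Large fine-tuning sketch with macro F1 evaluation.
# Hyperparameters, file path, and column names are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

dataset = load_dataset("csv", data_files="training_mix.csv")["train"]  # gold, synthetic, or mixed
split = dataset.train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_data = split["train"].map(tokenize, batched=True)
eval_data = split["test"].map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3, per_device_train_batch_size=16),
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,          # enables dynamic padding via DataCollatorWithPadding
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())         # the paper additionally evaluates on MDA and HateCheck
```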

The results were surprisingly positive for synthetic data.

Table 1: Results of RoBERTa Large models trained on synthetic data only.

As shown in Table 1, models trained only on synthetic data (specifically the “No Filter” rows) performed admirably. While they lagged slightly behind gold data on the source dataset (MHS), they actually outperformed the gold-trained models on the out-of-distribution datasets (MDA and HateCheck).

Look at the HateCheck column. The original gold data achieved an F1 score of .507. The synthetic models (without filtering) jumped to scores around .675 - .687.

Why did this happen?

The researchers hypothesize that real-world datasets often contain specific keywords that models “overfit” to. For example, if a dataset contains many hateful comments about a specific politician, the model might learn that the politician’s name equals “hate.”

Synthetic paraphrasing increases lexical variety. It rewrites sentences in new ways, forcing the model to learn the structure and meaning of hate rather than just memorizing bad words. This makes the model more robust when facing completely new topics (like those in MDA).
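
A quick way to sanity-check this lexical-variety claim on your own data is to compare vocabulary richness, for example the type-token ratio, between the gold and synthetic corpora. The measurement below is an illustration of the idea, not an analysis taken from the paper.

```python
# Compare vocabulary richness between real and paraphrased corpora.
# The example lists are harmless placeholders for the actual posts.
import re

def type_token_ratio(texts: list[str]) -> float:
    """Distinct word types divided by total tokens (higher = more lexical variety)."""
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    return len(set(tokens)) / max(len(tokens), 1)

gold_posts = ["example real comment one", "another example real comment"]
synthetic_posts = ["an illustrative paraphrase", "a second, differently worded paraphrase"]

print("gold TTR:      ", type_token_ratio(gold_posts))
print("synthetic TTR: ", type_token_ratio(synthetic_posts))
```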

The Impact of Mixing Data

Table 2: Results of RoBERTa Large models trained on a mixture of synthetic and gold data.

Table 2 shows what happens when you combine real and synthetic data. The performance stabilizes. You get the best of both worlds: high performance on the original dataset and improved robustness on unseen data.

From a purely engineering standpoint, the conclusion seems clear: Use synthetic paraphrasing. It boosts generalization.

But this paper digs deeper. If the numbers look good, why do they look good? And is the synthetic data actually faithful to the reality of hate speech?


Intrinsic Evaluation: Peeling Back the Layers

This is the core contribution of the paper. The authors performed a qualitative analysis on 3,500 examples, annotating them by hand to check for realism, label preservation, and identity representation.

1. The “Uncanny Valley” of Text

Are these paraphrased tweets realistic? Can they pass for human writing?

To test this, human annotators were given a mix of real and synthetic tweets and asked to identify the source. The result? Humans correctly spotted the AI-generated text 88% of the time.

Table 3: Synthetic text realism annotations.

Table 3 highlights a different aspect: while the grammar is almost perfect (only 1-2% errors), the texts suffer from what we might call “polite toxicity.”

LLMs are trained to be helpful and harmless. When forced to paraphrase hate speech, they often adopt a bizarre, formal register.

  • Original: “Please stop this queer feminist bullsh*t.”
  • Synthetic: “Kindly halt this conduct characterized by the blending of unconventional gender identities and feminist ideologies.”

While a classifier might still recognize this as negative, it is not how humans speak on Twitter. This lack of realism means models trained on this data might learn to associate “complex vocabulary” with hate, rather than the colloquial slurs actually used by bigots.

2. The Great Sanitization (Label Redistribution)

The most alarming finding concerns label preservation. When you ask an LLM to paraphrase a hateful post, does it remain hateful?

The answer is often no.

Figure 1: Distribution of hateful and non-hateful texts in the manually labeled subset of gold and synthetic data created using the Mixtral 8x7B Instruct model.

Figure 1 is a Sankey diagram visualizing the flow of labels. On the left is the label of the original (Gold) text. On the right is the label of the paraphrased (Synthetic) text.

Follow the “Hateful” stream from the left. You will notice a massive chunk of it flows into “Non-hateful” on the right.

  • Observation: Almost half of the hateful examples were “sanitized” by the LLM during paraphrasing.
  • Cause: This is likely due to Model Alignment. Modern LLMs (especially Llama-2, and to a lesser extent Mistral/Mixtral) are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to refuse generating toxic content. Even when asked to simply “paraphrase,” the model’s safety guardrails kick in, softening the blow.

This explains why the “Classifier Filtering” step described in the Methodology section yielded far less data: the classifiers were rejecting thousands of examples that had lost their toxicity.
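
To quantify this sanitization on your own generations, you can cross-tabulate each gold label against the label of its paraphrase (obtained from manual annotation, as in the paper, or from a classifier). The toy lists below are placeholders; the crosstab gives you the numbers behind a Sankey diagram like Figure 1.

```python
# Cross-tabulate each gold label against the label of its paraphrase.
# The two lists are aligned placeholders; in practice they would come from
# manual annotation or a label classifier.
import pandas as pd

gold_labels = ["hateful", "hateful", "hateful", "non-hateful", "non-hateful"]
synthetic_labels = ["hateful", "non-hateful", "non-hateful", "non-hateful", "non-hateful"]

flow = pd.crosstab(
    pd.Series(gold_labels, name="gold"),
    pd.Series(synthetic_labels, name="synthetic"),
    normalize="index",  # fraction of each gold label that ends up in each synthetic label
)
print(flow)
# The same cross-tabulation, run over target-identity annotations instead of
# labels, yields the identity redistribution shown in Figure 2.
```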

3. Identity Erasure: Whitewashing Intersectionality

Perhaps the most ethically significant finding is how synthetic data handles target identities.

In hate speech detection, we care deeply about who is being targeted (e.g., Black people, women, Jewish people). We need models to be fair and accurate across different groups.

The researchers tracked how identity mentions shifted during paraphrasing.

Figure 2: Target identity redistribution with the Mixtral 8x7B Instruct model.

Figure 2 paints a concerning picture.

  1. Loss of Specificity: Look at the massive flow into “No Target” on the bottom right. Over one-third of examples that originally targeted a specific group lost that reference entirely in the synthetic version.
  2. Intersectionality Lost: The “Multiple” category (on the left) represents intersectional hate (e.g., hate directed at Black women). A huge portion of this flows into single categories or “No Target.”

The models have a tendency to generalize away the specifics of the target:

  • Original: “These [slur] need to go back to their country.” (Target: Race/Origin)
  • Synthetic: “These individuals should return home.” (Target: None/Generic).

This “identity bleaching” means that a model trained on synthetic data might become colorblind in the worst possible way—failing to recognize hate speech when it is specifically directed at marginalized groups.

4. Lexical Analysis: Where did the slurs go?

To confirm this sanitization, the authors analyzed the “Most Informative Tokens” (words most strongly associated with the hateful class) for both real and synthetic data.

Table 4: Top-k most informative tokens for the hateful class across targets of hate in GOLD and SYNTHETIC posts.

Table 4 is stark.

  • Gold Data (Top rows): The most informative words are explicit slurs and profanities (censored in the image, but clearly recognizable). This reflects the ugly reality of online harassment.
  • Synthetic Data (Bottom rows): The slurs are gone. Instead, the model relies on words like “individuals,” “behavior,” “promiscuous,” “ignorant,” and “intellectually.”

Why does this matter? If you train a model on the synthetic data, it learns that the word “individual” or “woman” in a certain context is a signal for hate. It doesn’t learn that specific racial slurs are hateful, because the LLM refused to write them.

When you deploy this model in the real world, it might fail to flag the most egregious, slur-filled racism because it was trained on a “polite,” sanitized version of hate. Conversely, it might falsely flag innocuous sentences just because they use formal words like “individual.”
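
If you want to reproduce this kind of lexical analysis, one common recipe is to fit a simple bag-of-words classifier and inspect the features with the largest weights for the hateful class. The sketch below illustrates that recipe on a toy corpus; the paper’s exact feature-scoring method may differ.

```python
# Illustrative "most informative tokens" analysis: bag-of-words logistic
# regression, then read off the highest-weighted features for the hateful
# class. The four toy documents are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "these individuals exhibit ignorant behavior",        # stand-ins for (sanitized) hateful posts
    "such conduct is intellectually dishonest and vile",
    "lovely weather in the park today",                   # stand-ins for non-hateful posts
    "the new album sounds great",
]
labels = [1, 1, 0, 0]  # 1 = hateful, 0 = non-hateful

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

vocab = np.array(vectorizer.get_feature_names_out())
top_tokens = vocab[np.argsort(clf.coef_[0])[::-1][:10]]
print("Most informative tokens for the hateful class:", top_tokens)
```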


Conclusion: A Double-Edged Sword

The paper “Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection” provides a crucial reality check for the AI industry.

The Good: Synthetic data does offer a computational advantage. By rewriting training data, LLMs break the model’s reliance on specific keywords, helping it generalize to new topics and datasets (as seen in the MDA and HateCheck results).

The Bad: This performance boost comes at a high qualitative cost.

  1. Sanitization: The data is stripped of the very toxicity we are trying to detect.
  2. Erasure: Specific identities and intersectional groups are washed away, replaced by generic terms.
  3. Spurious Correlations: Models may learn to detect “LLM writing styles” rather than actual hate speech.

The Takeaway for Students and Practitioners: Do not look at F1 scores in isolation. A model might perform well on a benchmark, but if the training data has systematically erased the existence of Black women or LGBTQIA+ individuals, that model is ethically compromised.

As we move forward, we cannot simply rely on “prompt and pray.” Using synthetic data for sensitive tasks requires human-in-the-loop validation. We must ensure that in our quest to build robust models, we don’t accidentally build systems that are blind to the very harm they are meant to prevent.