Introduction

In the world of Natural Language Processing (NLP), we often treat data labeling as a search for a single truth. If we ask five people to label a comment as “toxic” or “not toxic,” and three say it is, we typically take the majority vote and discard the dissenting opinions as noise. But is that disagreement really noise?

Consider the phrase: “You’re an idiot.”

To a close friend in a gaming chat, this might be playful banter. To a stranger in a political debate, it is an insult. What one group considers acceptable, another might find deeply offensive. By aggregating these distinct perspectives into a single “ground truth,” we strip the data of its inherent social variance. We lose the “who” behind the label.

This poses a significant challenge for safety systems. If an AI only learns the majority view, it may fail to protect minority groups or misunderstand specific cultural contexts.

In this post, we explore a fascinating research paper, “Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree,” which proposes a shift in perspective. Instead of asking “Is this sentence toxic?”, the researchers ask “How toxic would this specific person rate this sentence?”

We will dive into three distinct architectures designed to predict individual annotator ratings: Neural Collaborative Filtering (NCF), an Embedding-Based Architecture, and In-Context Learning (ICL). We will also uncover a surprising finding about the relationship between demographic data and survey responses, and what it means for user privacy.

Background: The Subjectivity Problem

Traditional supervised learning relies on the assumption that for every input \(x\), there is a correct label \(y\). In subjective tasks like hate speech detection, this assumption breaks down. Disagreement among human annotators is not necessarily an error; it is often a reflection of their background, lived experiences, and personal tolerance levels.

Recent work in “Perspectivist NLP” suggests that we should model these disagreements explicitly. Rather than training a model to output a single score (e.g., 0 for safe, 1 for toxic), we can train models to predict the specific rating a specific user would give.

To do this, the model needs more than just the text. It needs context about the annotator. The researchers in this study utilized a dataset where each text was rated on a toxicity scale of 0 (least toxic) to 4 (most toxic). Crucially, the dataset included metadata about the annotators:

  • Demographics: Race, gender, age, education, political stance, etc.
  • Survey Information: Their social media habits, whether they’ve been harassed online, and their views on technology.
  • Rating History: How they have rated other texts in the past.

The researchers set out to answer two main questions: Can we build architectures that accurately predict these personalized ratings? And is sensitive demographic data actually necessary, or can we infer preferences from less sensitive survey data?
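To make this concrete, here is a toy example of what a single labeled instance might look like when a text is paired with one annotator's rating and profile. The field names and values are illustrative only, not the dataset's actual schema:

```python
# One hypothetical labeled instance: a text, a single annotator's rating,
# and that annotator's metadata. Field names are illustrative, not the real schema.
example = {
    "text": "You're an idiot.",
    "rating": 3,  # toxicity on the 0 (least toxic) to 4 (most toxic) scale
    "annotator": {
        "id": "annotator_123",
        "demographics": {"gender": "female", "age_range": "25-34", "race": "Asian"},
        "survey": {
            "uses_social_media_daily": True,
            "has_been_harassed_online": True,
            "thinks_toxic_comments_are_a_problem": "strongly agree",
        },
        "rating_history": [("This is harmless", 0), ("Get lost, loser", 2)],
    },
}
```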

Core Method: Three Approaches to Personalization

The heart of this research lies in how we combine the text being read with the profile of the person reading it. The authors proposed and rigorously tested three distinct methodologies.

1. Neural Collaborative Filtering (NCF)

The first approach draws inspiration from recommendation systems—the same algorithms Netflix uses to predict if you’ll like a movie based on your viewing history. This is known as Neural Collaborative Filtering (NCF).

The hypothesis here is that an annotator’s “taste” in toxicity is a latent preference that can be learned. The architecture creates two parallel streams of information that merge to form a prediction.

Figure 1: Design of the paper’s neural collaborative filtering (NCF) architecture. The annotator information and the text being rated are passed into an embedding model, concatenated with the annotator embedding, and then fed through a series of dense layers to predict the rating.

As shown in Figure 1, the process works as follows:

  1. Text Encoding: The text to be rated (e.g., “You’re an idiot”) is passed through RoBERTa, a pre-trained transformer language model fine-tuned on toxicity data. This generates a dense vector representation of the sentence.
  2. Annotator Embedding: Simultaneously, the model maintains a learnable embedding for the annotator. In the most basic version, this is a random vector assigned to the user ID, which the model adjusts during training to represent that user’s latent behaviors.
  3. Concatenation & Prediction: The text embedding and the annotator embedding are concatenated (joined together). This combined vector—containing information about what was said and who is reading it—is passed through a series of dense neural network layers (the classification head) to output a final rating between 0 and 4.

The researchers experimented heavily with this architecture. They tried freezing the RoBERTa weights, adjusting the size of the annotator embeddings (from 8 dimensions up to 768), and even injecting demographic data directly into the RoBERTa input.
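To make the architecture concrete, here is a minimal PyTorch sketch of the NCF idea. This is not the authors’ code; the checkpoint name, embedding size, and head dimensions are placeholders:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class NCFToxicity(nn.Module):
    """Sketch of an NCF-style rater: text encoder + learned annotator embedding."""

    def __init__(self, num_annotators: int, annot_dim: int = 64,
                 encoder_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)       # text stream
        self.annotator_emb = nn.Embedding(num_annotators, annot_dim)  # user stream
        hidden = self.encoder.config.hidden_size + annot_dim
        self.head = nn.Sequential(          # dense layers over the concatenation
            nn.Linear(hidden, 256), nn.ReLU(),
            nn.Linear(256, 1),              # regression output on the 0-4 scale
        )

    def forward(self, input_ids, attention_mask, annotator_ids):
        text_vec = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        user_vec = self.annotator_emb(annotator_ids)
        return self.head(torch.cat([text_vec, user_vec], dim=-1)).squeeze(-1)

# Minimal usage example with placeholder sizes and IDs.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = NCFToxicity(num_annotators=5000)
batch = tok(["You're an idiot."], return_tensors="pt")
rating = model(batch["input_ids"], batch["attention_mask"], torch.tensor([123]))
```

Training such a model would minimize a regression loss (e.g., L1) against each annotator’s rating; the embedding table is what lets the model separate different raters’ tendencies.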

2. Embedding-Based Architecture

The second method moves away from the “black box” of latent user IDs and explicitly models the annotator using their data. This is the Embedding-Based Architecture, and as we will see in the results, this approach proved to be the powerhouse of the study.

Figure 2: Design of the paper’s embedding-based architecture.

This architecture, illustrated in Figure 2, treats the annotator’s profile as a text problem.

  • Step 1: Textual Input: The text to be rated is converted into an embedding using OpenAI’s embedding models (specifically text-embedding-3-small and text-embedding-3-large).
  • Step 2: Profile Input: The annotator’s information—their demographics, survey responses, and rating history—is converted into a descriptive string. For example: “The reader is a 25-34 year old Asian female who… thinks toxic comments are a serious problem.”
  • Step 3: Dual Embedding: This descriptive profile string is also passed through the text embedding model. Now, both the toxic comment and the user’s biography are represented in the same high-dimensional vector space.
  • Step 4: Fusion: These two embeddings are concatenated and fed into a custom Multi-Layer Perceptron (MLP). This is a neural network consisting of fully connected layers designed to learn the non-linear interactions between the user’s profile and the text.

The advantage of this method is that it is semantically rich. The model doesn’t just know “User 123”; it understands the attributes of User 123 semantically.
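A rough sketch of this pipeline is shown below, using the OpenAI embeddings API and a small PyTorch MLP. The embedding model name matches the paper, but the profile template and MLP sizes are assumptions:

```python
import torch
import torch.nn as nn
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str, model: str = "text-embedding-3-small") -> torch.Tensor:
    """Embed a comment or an annotator-profile string into the same vector space."""
    resp = client.embeddings.create(model=model, input=text)
    return torch.tensor(resp.data[0].embedding)

class FusionMLP(nn.Module):
    """MLP over the concatenated [comment embedding ; profile embedding]."""
    def __init__(self, emb_dim: int = 1536):  # 1536 dims for text-embedding-3-small
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1),  # predicted rating on the 0-4 scale
        )

    def forward(self, comment_vec, profile_vec):
        return self.net(torch.cat([comment_vec, profile_vec], dim=-1)).squeeze(-1)

# Hypothetical profile string built from demographics, survey answers, and history.
profile = ("The reader is a 25-34 year old Asian female who uses social media daily "
           "and thinks toxic comments are a serious problem.")
comment_vec = embed("You're an idiot.")
profile_vec = embed(profile)
model = FusionMLP()
predicted_rating = model(comment_vec.unsqueeze(0), profile_vec.unsqueeze(0))
```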

3. In-Context Learning (ICL)

The final approach leverages the power of Large Language Models (LLMs) like GPT-3.5 and Mistral. This method, known as In-Context Learning, does not involve training a new neural network structure. Instead, it relies on sophisticated prompting.

The researchers constructed a structured prompt that feeds the LLM all the necessary context. The prompt follows this pattern:

  1. System Prompt: Defines the role (e.g., “You are a model that predicts toxicity ratings…”).
  2. Annotator History: Examples of previous texts this specific annotator has rated (e.g., “‘This is harmless’ is rated 0”).
  3. Survey/Demographics: A natural language description of the annotator (similar to the embedding approach).
  4. Target Text: The actual sentence that needs a prediction.

The LLM is then asked to generate the rating. This tests the model’s ability to “roleplay” the specific annotator based on the provided biography and history.
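Here is one way such a prompt could be assembled and sent to a chat model. The wording and the choice of the OpenAI chat API are illustrative rather than the paper’s exact setup:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical annotator history, profile description, and target text.
history = [("This is harmless", 0), ("Get lost, loser", 2)]
profile = ("The reader is a 25-34 year old Asian female who uses social media daily "
           "and thinks toxic comments are a serious problem.")
target = "You're an idiot."

# Assemble the four-part prompt: role, rating history, annotator description, target.
examples = "\n".join(f'"{text}" is rated {score}' for text, score in history)
user_prompt = (
    f"Here are some of the reader's previous ratings:\n{examples}\n\n"
    f"About the reader: {profile}\n\n"
    f'On a scale of 0 (least toxic) to 4 (most toxic), how would this reader rate: "{target}"?\n'
    "Answer with a single number."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a model that predicts how a specific "
                                      "reader would rate the toxicity of a comment."},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)  # e.g. "3"
```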

Experiments & Results

The researchers evaluated these models using Mean Absolute Error (MAE). Since the ratings are on a scale of 0 to 4, a lower MAE means the predicted rating is closer to the actual human rating.
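Formally, for \(N\) predictions \(\hat{y}_i\) and true ratings \(y_i\):

\[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert \]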

Q1: Which architecture performs best?

The results showed a clear hierarchy among the methods.

The Neural Collaborative Filtering (NCF) approach struggled. Despite extensive tuning, it failed to significantly outperform baselines. As seen in the table below, even freezing the pre-trained model or adjusting embedding dimensions yielded an MAE of roughly 0.80 to 0.89.

Table 3: Significant Experiments and Their Impact on Mean Absolute Error (MAE)

The NCF model likely failed because the interactions between annotator ID and text are complex and hard to capture through simple concatenation in the classification head, especially when each annotator has rated only a small number of texts.

The Winner: Embedding-Based Architecture

The Embedding-Based Architecture was the clear winner, achieving an MAE of 0.61. This is a significant improvement over the NCF approach.

The In-Context Learning (ICL) models also performed well, with Mistral achieving an MAE of 0.69. However, the specialized embedding architecture consistently beat the general-purpose LLMs.

Q2: What information matters most?

The researchers performed an ablation study, systematically removing different inputs (demographics, history, survey info) to see what drove performance.

Figure 3: Comparison of MAE improvement with varying amounts of annotator input across selected models. The text-embedding-3-large model consistently outperforms the other models and shows the largest improvement over its own baseline.

Figure 3 illustrates the reduction in error (higher bars are better) when different data sources are added, relative to a text-only baseline.

  1. Text Only (Baseline): Predicting based only on the sentence itself yields the highest error.
  2. Adding Demographics: Gives a modest boost (the first set of bars).
  3. Adding History & Survey: This is where the magic happens. The combination of Demographics + History + Survey (the far-right bars) yields an error reduction of nearly 18% for the best model (text-embedding-3-large).

The “Imputed” Demographics Discovery

Perhaps the most interesting finding of the paper concerns the “Predict Demographics” ablation (the fourth group of bars in Figure 3).

Collecting demographic data (race, gender, sexual orientation) is often legally difficult or intrusive. The researchers asked: Can we just use the survey data instead?

They trained a separate model to predict an annotator’s demographics based solely on their survey responses (views on tech, social media usage) and rating history.

  • They found they could predict gender with 63% accuracy and race with 47% accuracy—significantly better than random guessing.
  • When they used these predicted (imputed) demographics in the toxicity model instead of the true demographics, the performance drop was negligible.

This suggests that survey responses implicitly capture the same signal as demographic data. If you know someone’s online habits and rating history, you don’t necessarily need to ask for their race or gender to predict how they will rate toxicity. The survey data is a sufficient proxy.
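As a rough illustration of this imputation step, one could train a simple classifier that maps survey and history features to a demographic attribute. The sketch below uses scikit-learn with placeholder features and labels, not the paper’s actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Each row: encoded survey answers plus rating-history statistics (hypothetical features,
# e.g. daily social media use, harassed online, tech optimism, mean rating given, ...).
X = np.random.rand(1000, 8)          # placeholder survey/history features
y = np.random.randint(0, 2, 1000)    # placeholder binary demographic label (e.g., gender)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The predicted ("imputed") demographics could then replace the true ones in the
# profile string that feeds the toxicity model.
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```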

Conclusion and Implications

This research highlights that toxicity is not an objective property of a text, but a relationship between a text and a reader. By moving beyond majority voting, we can build systems that respect individual variations in judgment.

The key takeaways are:

  1. Architecture Matters: A specialized embedding-based architecture, which fuses semantic representations of both the text and the user profile, significantly outperforms collaborative filtering and general LLM prompting.
  2. Context is King: Including annotator history and survey data dramatically reduces prediction error (MAE 0.61 vs baseline >0.75).
  3. The Privacy Paradox: The finding that survey data acts as a proxy for demographics is a double-edged sword. While it means we can build accurate personalized models without explicitly asking for sensitive data (Data Efficiency), it also implies that “anonymous” surveys might inadvertently reveal sensitive demographic traits (Privacy Risk).

As we move toward more personalized AI, understanding these nuances—how to model the user and how to protect their latent data—will be essential for building safer and more inclusive digital environments.