In the world of Natural Language Processing (NLP), we often cling to a comforting myth: the myth of the “Gold Label.”
Imagine you are training an AI to detect hate speech. You show a sentence to three human annotators. Two say it’s offensive; one says it’s satire. In traditional machine learning, we take a majority vote, label the sentence “offensive,” and move on. The dissenting voice is treated as noise—an error to be smoothed over.
But what if that disagreement is the signal?
In subjective tasks—like judging morality, identifying toxicity, or interpreting humor—there is rarely a single, objective truth. By flattening human disagreement into a single label, we risk erasing minority perspectives and training models that only represent the majority view.
This brings us to a pressing problem: accurately modeling the full spectrum of human opinion is expensive. It requires hiring many diverse annotators for every single data point. How can we capture this rich diversity without breaking the bank?
Enter Annotator-Centric Active Learning (ACAL), a novel framework proposed by researchers from Idiap, Leiden University, University of Stuttgart, and TU Delft. This approach flips the script on how we train models, focusing not just on what data to label, but who should label it.
In this deep dive, we will explore how ACAL works, the strategies it uses to prioritize fairness alongside accuracy, and why the future of AI might depend on listening to the “worst-off” voices in the room.
Part 1: The Problem with Passive Learning
To understand why ACAL is necessary, we first need to look at the limitations of standard supervised learning (often called “Passive Learning” in this context).
The Gold Standard vs. Soft Labels
In a standard NLP pipeline, we rely on datasets where every input \(x\) (e.g., a tweet) has a target label \(y\) (e.g., “Positive Sentiment”). Usually, \(y\) is a “hard label”—a definitive category derived from a majority vote among annotators.
However, researchers are increasingly moving toward Soft Label Prediction. Instead of predicting a single class, the model tries to predict the distribution of annotations.
For example, if 10 people look at a tweet and 7 say “Hate Speech” while 3 say “Not Hate Speech,” the target isn’t just “Hate Speech.” The target is a probability distribution: [0.7, 0.3].
The mathematical formulation for aggregating these judgments into a soft label \(\hat{y}_i(x)\) looks like this:

$$\hat{y}_i(x) = \frac{1}{|A(x)|} \sum_{a \in A(x)} \mathbb{1}\big[y_a(x) = i\big]$$

where \(A(x)\) is the set of annotators who judged item \(x\), \(y_a(x)\) is the label given by annotator \(a\), and \(\mathbb{1}[\cdot]\) is the indicator function. In words: the soft label for class \(i\) is simply the fraction of annotators who chose it.
Here, the model learns from the collective wisdom of the crowd rather than a winner-takes-all vote. While this is better for subjective tasks, it creates a resource bottleneck. To get a reliable distribution (a smooth 0.7 vs 0.3), you need many annotations per item. If you have 50,000 items, paying 10 annotators for each item results in half a million annotations. That is often prohibitively expensive.
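To make the soft-label target concrete, here is a minimal sketch (plain Python, with illustrative names) of turning one item's raw annotations into a distribution:

```python
from collections import Counter

def soft_label(annotations, classes):
    """Turn one item's raw annotator votes into a probability distribution over classes."""
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]

# 10 annotators judge one tweet: 7 say "hate", 3 say "not_hate".
votes = ["hate"] * 7 + ["not_hate"] * 3
print(soft_label(votes, ["hate", "not_hate"]))  # -> [0.7, 0.3]
```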
The Traditional Active Learning (AL) Solution
The standard industry solution to high labeling costs is Active Learning (AL).
In Active Learning, the model starts with a tiny amount of labeled data. It then looks at a massive pool of unlabeled data and asks, “Which of these examples am I most confused about?” It selects those specific confusing examples and sends them to an “Oracle” (usually a human expert) to be labeled.
The process looks like the left side of the diagram below:

[Figure 1: The traditional Active Learning loop (left) alongside the ACAL loop, which adds an annotator-selection step (right).]
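The "which examples am I most confused about?" step is typically implemented with an acquisition function such as entropy-based uncertainty sampling. The sketch below is a generic illustration of that idea, not the paper's exact acquisition function:

```python
import numpy as np

def select_most_uncertain(probs, batch_size):
    """Return the indices of the unlabeled items whose predicted label
    distributions have the highest entropy (i.e., the model is least sure)."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:batch_size]

# Four unlabeled items, two classes; the 0.5/0.5 item is the most uncertain.
preds = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.6, 0.4]])
print(select_most_uncertain(preds, batch_size=2))  # -> [1 3]
```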
The Flaw in AL: Traditional AL assumes the existence of an Oracle—a source of absolute truth. It assumes that if the model asks for a label, it will get the correct label.
But in subjective tasks, there is no Oracle. There is only a pool of human annotators, each with their own biases, values, and cultural backgrounds. If the AL system selects a controversial sentence and sends it to a random annotator, the label it receives depends entirely on who picks up the task.
If we want to model the full distribution of human opinion, simply picking the right data isn’t enough. We also need to pick the right people.
Part 2: Introducing Annotator-Centric Active Learning (ACAL)
The researchers propose ACAL (shown on the right side of Figure 1 above). It extends the traditional loop by adding a critical new step: Annotator Selection.
In ACAL, the system makes two decisions at every step:
- Sample Selection: Which text document should we label next? (Standard AL)
- Annotator Selection: Who from our pool of available humans should label this specific document?
The Algorithm
The process is iterative. The model selects a batch of data, chooses specific annotators for those items, trains on the new data, and repeats.

This simple addition changes the optimization goal. We are no longer just trying to reduce the model’s uncertainty about the label; we are trying to efficiently approximate the diversity of human judgments.
Ideally, we want to construct a dataset that reflects the views of the entire population (majority and minority) without having to ask every single person to annotate every single item.
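In code, the two-step loop can be sketched roughly as follows. The function names (`select_samples`, `select_annotator`, `train`) are placeholders for the strategies discussed in the next section, not the authors' implementation:

```python
def acal_loop(model, unlabeled_pool, annotator_pool, rounds, batch_size,
              select_samples, select_annotator, train):
    """Schematic ACAL loop: pick the data, then pick who labels it, then retrain."""
    labeled = []
    for _ in range(rounds):
        # 1. Sample selection: a standard AL step (e.g., uncertainty sampling).
        batch = select_samples(model, unlabeled_pool, batch_size)
        for item in batch:
            # 2. Annotator selection: decide who should judge this particular item.
            annotator = select_annotator(item, annotator_pool, labeled)
            labeled.append((item, annotator, annotator.annotate(item)))
            unlabeled_pool.remove(item)
        # 3. Retrain on everything collected so far, then repeat.
        model = train(model, labeled)
    return model, labeled
```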
Part 3: Strategies for Selecting Annotators
If you have a pool of 100 annotators, how do you decide which one should label the next data point? Random selection is the baseline, but it’s inefficient. It tends to over-represent majority views simply because there are more of them.
The paper introduces three specific annotator-selection strategies designed to capture diversity, inspired by Rawls' principle of fairness. This philosophical principle suggests that a fair society is one where the well-being of the "worst-off" members is maximized. In NLP terms, the "worst-off" are the annotators whose opinions are rarely heard—the minority voices.
Here are the four strategies tested (a short code sketch of two of the selectors follows the list):
1. Random Selection (\(T_R\))
This is the control baseline. Given a selected text sample, the system picks an annotator uniformly at random. Over time, this reflects the natural distribution of the annotator pool (biases and all).
2. Label Minority (\(T_L\))
This strategy focuses on outcome. It looks at the history of labels each annotator has given.
- The Logic: It identifies which label class is currently the “minority” in the training data (e.g., if “Toxic” appears less often than “Safe”).
- The Action: For the new sample, it selects an annotator who has a history of assigning that minority label.
- The Goal: To artificially balance the dataset labels, ensuring the model sees enough examples of the rare class.
3. Semantic Diversity (\(T_S\))
This strategy focuses on content coverage. It looks at what the annotator has read before.
- The Logic: It uses embeddings (mathematical vector representations of text) to understand the semantic meaning of the samples an annotator has already labeled.
- The Action: For the new sample, it calculates the “semantic distance” between this sample and each annotator’s history, then picks the annotator who has seen the least content of this kind.
- The Goal: To broaden the experience of every annotator, ensuring their unique perspective is applied to a wide range of topics.
4. Representation Diversity (\(T_D\))
This strategy focuses on annotator distinctiveness. It looks at how an annotator generally behaves compared to others.
- The Logic: It builds a profile for each annotator based on the text they annotated and the labels they gave. It then compares annotators to each other.
- The Action: It selects the annotator who is most dissimilar to the others available for that item.
- The Goal: To find the “outliers” or “contrarians.” If most people agree, this strategy specifically hunts for the person likely to disagree, ensuring the distribution captures the full range of subjectivity.
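To make the selection logic tangible, here is a simplified sketch of how two of these selectors might be implemented. The data structures (`history`, `profiles`) and function names are hypothetical, not the paper's code:

```python
import numpy as np
from collections import Counter

def label_minority_selector(candidates, history):
    """T_L sketch: favor the annotator who most often assigns the currently rarest label.

    candidates: annotator ids available for this item.
    history: annotator id -> list of labels that annotator has given so far.
    """
    counts = Counter(lab for labs in history.values() for lab in labs)
    minority_label = min(counts, key=counts.get)

    def minority_rate(a):
        labs = history.get(a, [])
        return labs.count(minority_label) / len(labs) if labs else 0.0

    return max(candidates, key=minority_rate)

def representation_diversity_selector(candidates, profiles):
    """T_D sketch: favor the annotator whose behavior profile is most unlike the others'.

    profiles: annotator id -> vector summarizing what they annotated and how they labeled it.
    """
    def mean_cosine_to_others(a):
        v = profiles[a]
        sims = [np.dot(v, profiles[b]) / (np.linalg.norm(v) * np.linalg.norm(profiles[b]))
                for b in candidates if b != a]
        return float(np.mean(sims))

    # The lowest average similarity marks the most distinct (contrarian) voice.
    return min(candidates, key=mean_cosine_to_others)
```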
Part 4: Measuring Success in a Subjective World
How do we know if these strategies work? In traditional AI, we use accuracy-style metrics such as the F1-score against a single gold label. But if the “truth” is subjective, agreeing with the majority vote might actually be a bad thing—it might mean the model has learned to ignore minority groups.
The researchers used a suite of metrics, divided into two categories:
Standard Metrics (The Utilitarian View)
- Macro F1: How well does the model predict the majority vote?
- Jensen-Shannon Divergence (JS): A statistical measure of how different the predicted probability distribution is from the true distribution of annotations. (Lower is better).
Annotator-Centric Metrics (The Egalitarian View)
To align with the Rawlsian fairness principle, the researchers introduced metrics that look at individual annotators (a short sketch of how they can be computed follows the list):
- Average Annotator F1 (\(F_1^a\)): We treat each annotator as the sole source of truth, calculate the model’s F1 score against them, and average the results. This tells us how well the model represents the “average” person.
- Worst-off Annotator F1 (\(F_1^w\)): We calculate F1 scores for all annotators, take the bottom 10% (the people the model represents most poorly), and average their scores.
- Why this matters: If this score is low, the model is failing the minority. If this score is high, the model has successfully learned to represent even the most unique or divergent viewpoints.
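Assuming per-annotator labels and hard model predictions are available, these scores can be computed roughly as follows (a sketch using scikit-learn's `f1_score`; the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

def per_annotator_f1(model_preds, annotator_labels):
    """Score the model against each annotator as if they were the sole source of truth.

    model_preds: item id -> the model's hard prediction.
    annotator_labels: annotator id -> {item id: that annotator's label}.
    """
    scores = {}
    for annotator, labels in annotator_labels.items():
        items = list(labels)
        y_true = [labels[i] for i in items]
        y_pred = [model_preds[i] for i in items]
        scores[annotator] = f1_score(y_true, y_pred, average="macro")
    return scores

def average_and_worst_off(scores, worst_fraction=0.10):
    """F1^a = mean over all annotators; F1^w = mean over the worst-scoring 10%."""
    vals = np.sort(np.array(list(scores.values())))
    k = max(1, int(len(vals) * worst_fraction))
    return float(vals.mean()), float(vals[:k].mean())
```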
Part 5: Experiments and Results
The team tested ACAL on three datasets:
- DICES: Conversations with a chatbot, rated for safety. (High number of annotators per item).
- MFTC: Tweets labeled with moral foundations (e.g., Care, Loyalty, Betrayal). (Highly polarized agreement).
- MHS: Hate speech comments. (Mixture of agreement levels).
Let’s look at the key findings.
Finding 1: ACAL is More Efficient
The primary goal of Active Learning is to save money. The results show that ACAL achieves similar or better performance than Passive Learning (training on everything) while using significantly less data.

[Table 1: Performance (F1, JS) and annotation-budget reduction (\(\Delta\%\)) relative to Passive Learning for each strategy and dataset.]
Take a look at the DICES section in Table 1 above. The column \(\Delta\%\) shows the reduction in annotation budget compared to Passive Learning (PL).
- ACAL strategies (like \(S_R T_S\)) achieved comparable F1 and JS scores while reducing the budget by ~30-38%.
- On the MHS dataset, the budget reduction was massive—up to 62.5%.
This confirms that we do not need to ask every annotator to label every item. By strategically picking who annotates what, we can build robust models at a fraction of the cost.
Finding 2: Learning Curves and Convergence
How fast does the model learn? The learning curves below compare traditional Active Learning (AL, left strategies) against ACAL.

[Figure: Learning curves for AL and ACAL strategies on DICES (top, JS) and MHS (bottom, F1).]
- Top Chart (DICES): The JS score (error) drops much faster for ACAL strategies (solid lines) compared to traditional AL. This means ACAL approximates the true distribution of human opinion with fewer training steps.
- Bottom Chart (MHS): Interestingly, for hate speech detection (Dehumanize), ACAL actually achieves a higher F1 score than Passive Learning (the yellow horizontal line). This suggests that selectively sampling diversity can sometimes yield better representations than blindly training on all noisy data.
Finding 3: The “Worst-Off” Trade-off
One of the most profound findings relates to the fairness metrics. The researchers found a trade-off between modeling the majority and protecting the minority.

[Figure: Worst-off annotator JS (\(JS^w\)) over training for AL and ACAL strategies.]
In the plots above, look at the Worst-off JS (\(JS^w\)) (lower is better). This measures how badly the model misrepresents the minority 10% of annotators.
- On datasets with high disagreement (like MFTC and MHS), ACAL strategies resulted in better (lower) \(JS^w\) scores than standard Active Learning.
- This proves that Annotator Selection effectively captures minority voices that standard sampling ignores.
However, notice that as the model sees more data, the “Worst-off” error sometimes increases. Why? Because the model begins to converge on the true distribution, which inherently contains disagreement. The model correctly learns that, say, 10% of people will disagree with the majority label, so its prediction can never perfectly match those 10% when each of them is treated as the sole ground truth. This is a feature, not a bug—it reflects the reality of subjective disagreement.
Finding 4: ACAL Requires a Large Crowd
The researchers noted a critical limitation. ACAL shines brightest on the DICES dataset. Why? Because DICES had a large pool of annotators available for every item (average of 73).
In contrast, datasets like MFTC only had about 3-4 annotators per item. When the pool is that shallow, “selecting” an annotator doesn’t give you much leverage—you essentially have to ask everyone anyway.
This is visualized in the cross-task comparison below:

[Figure: Cross-task comparison of annotator-selection strategies, one panel per dataset.]
In the top-left (MFTC), the bars are relatively even. But in tasks where the annotator pool was deeper or the disagreement was more complex, the distinction between strategies became more pronounced.
Finding 5: Managing Entropy
Finally, the team analyzed “Entropy”—a measure of chaos or disagreement. They wanted to see if their strategies were artificially inflating disagreement or accurately reflecting it.

[Figure: Entropy of the collected label distributions over time for each annotator-selection strategy on DICES, compared to the true entropy.]
This chart for the DICES dataset tracks how the strategies behave over time.
- \(T_D\) (Representation Diversity - Light Blue): This strategy (picking the “contrarian”) consistently over-estimates entropy. It hunts for disagreement so aggressively that it makes the world look more divided than it actually is.
- \(T_S\) (Semantic Diversity - Green): This strategy was more conservative, aligning closer to the “True” entropy of the dataset.
This gives practitioners a knob to turn: Do you want to rigorously hunt for edge cases (\(T_D\)), or do you want a balanced view of content (\(T_S\))?
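For reference, the entropy being tracked above is the Shannon entropy of an item's collected label distribution; a quick sketch of the quantity being compared:

```python
import numpy as np

def label_entropy(distribution):
    """Shannon entropy of a soft label, e.g. [0.7, 0.3] -> ~0.61 nats."""
    p = np.asarray(distribution, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Over-sampling contrarians (T_D) pushes collected label distributions toward
# the maximum-entropy case, making disagreement look larger than in the full pool.
print(label_entropy([0.7, 0.3]))  # ~0.61
print(label_entropy([0.5, 0.5]))  # ~0.69 (maximal for two classes)
```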
Part 6: Implications and Conclusion
The “Gold Label” era of NLP is fading. As we task AI with increasingly subjective responsibilities—moderating communities, analyzing political sentiment, or identifying safety risks—we must accept that human disagreement is a fundamental part of the data.
Annotator-Centric Active Learning (ACAL) offers a promising path forward. By treating the annotator as a variable just as important as the data, ACAL allows us to:
- Save Resources: Achieve high-performance models with 30-60% less annotation effort.
- Enhance Fairness: Specifically target and include minority perspectives (the “worst-off” annotators) that represent valid but less common viewpoints.
- Customize Training: Choose strategies (\(T_L, T_S, T_D\)) that tune the model toward stability or diversity depending on the application.
The takeaway for students and future researchers is clear: When designing AI systems for subjective tasks, do not just ask “What is the label?” Ask “Who is labeling it?” The answer to that question changes everything.
References: Van der Meer, M., Falk, N., Murukannaiah, P. K., & Liscio, E. (2024). Annotator-Centric Active Learning for Subjective NLP Tasks.