The “Wisdom of Crowds” is a concept as old as statistics itself. The idea is simple: if you ask enough people to guess the number of jellybeans in a jar, the average of their guesses is often startlingly close to the truth—closer, in fact, than the guess of any single expert.
In the world of Machine Learning (ML) and Natural Language Processing (NLP), we rely heavily on this principle. We use platforms like Amazon Mechanical Turk or Lancers to gather labeled data. When the task is simple, like clicking a button to say whether an image contains a cat, aggregating the answers is easy: just take the majority vote.
But what happens when the task is complex and open-ended? What if we ask ten different people to translate a Japanese sentence into English? We will get ten different valid sentences. Taking the “average” of text strings is impossible in the traditional sense. You cannot simply mathematically average the words “The wheelchair is unnecessary” and “I don’t need a wheelchair.”
This brings us to a cutting-edge intersection of crowdsourcing and Generative AI. A recent paper, “Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations”, proposes a fascinating solution. Instead of choosing between human workers or Large Language Models (LLMs), why not use LLMs to aggregate human wisdom?
In this post, we will tear down this paper to understand how a new framework called CAMS (Creator-Aggregator Multi-Stage) works, how LLMs perform as “aggregators,” and why the future of data annotation is likely a hybrid one.
The Problem: When Majority Voting Fails
Before we dive into the solution, we must clearly define the problem. In data annotation, we have two primary types of tasks: Categorical and Text-based.
- Categorical Labeling: This is close-ended. Is the sentiment Positive or Negative? Is the image a Dog or a Cat? Aggregation here is straightforward statistical analysis.
- Text Answer Annotation: This is the challenge. Tasks include translation, summarization, or explaining why a model made a decision.
The paper provides a clear visual comparison of these two worlds:

As shown in Section II of the image above, the Text Answer Aggregation Task involves inputs that vary significantly in syntax and vocabulary while sharing the same semantic meaning. One worker says, “For me it is not necessary,” while another says, “I never need wheelchair.”
The goal of an aggregation system is to produce an estimated answer (\(\hat{z}_i\)) that captures the “ground truth” better than any single worker’s attempt. In the example above, the aggregated answer became “I don’t need a wheelchair at all,” which synthesizes the crowd’s intent.
The Traditional Approach: Single-Stage Frameworks
Historically, researchers used a Single-Stage Framework. The process was linear:
- A requester sends a question (\(q\)) to the crowd.
- Multiple “Crowd Creators” (workers) submit their answers (\(a\)).
- A mathematical algorithm (the Model Aggregator) tries to pick the best one.

In this traditional setup (Figure 1), the “Model Aggregator” is usually an algorithm like Sequence Majority Voting (SMV) or Sequence Maximum Similarity (SMS). These algorithms convert text into vector embeddings and look for the answer that is geometrically “closest” to all other answers (the centroid).
The limitation is fundamental: the Model Aggregator is extractive. It can only pick one of the existing answers provided by the crowd. If every worker wrote a slightly flawed sentence, the algorithm picks the “least bad” one. It cannot rewrite the sentence to fix the grammar or combine the best parts of two different answers.
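To fix the shape of this pipeline in mind, here is a minimal sketch under this post's own assumptions (the names `single_stage` and `model_aggregator` are illustrative, not from the paper). The assertion makes the extractive constraint explicit: the output is always one of the crowd's own strings.

```python
from typing import Callable

def single_stage(question: str,
                 crowd: list[Callable[[str], str]],
                 model_aggregator: Callable[[list[str]], str]) -> str:
    """One pass of the traditional pipeline: crowd answers, then extractive selection."""
    answers = [worker(question) for worker in crowd]   # Crowd Creators answer q
    best = model_aggregator(answers)                   # e.g. SMV or SMS (sketched later in this post)
    assert best in answers                             # extractive: never a newly written sentence
    return best
```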
The New Approach: The CAMS Framework
The researchers propose a paradigm shift. With the advent of Large Language Models (LLMs) like GPT-4, we now have systems that are excellent at reading multiple texts and synthesizing them.
They introduce the Creator-Aggregator Multi-Stage (CAMS) framework. This framework treats the annotation process as a pipeline with distinct roles, and importantly, it allows both Humans and LLMs to participate in the aggregation phase.

Let’s break down the architecture shown in Figure 2:
1. The Crowd Creators (\(W_{C.C.}\))
This is the standard layer. Human workers generate the initial raw answers. For a translation task, ten different workers might provide ten different translations. These are the “raw materials.”
2. The Aggregators
This is the innovation layer. Instead of feeding raw answers directly into a mathematical selection algorithm, the system passes them to Aggregators. These aggregators read the raw answers and generate a new, refined answer in their own words. This is abstractive aggregation—they can create new sentences that didn’t exist in the input.
The paper introduces two types of workers for this stage:
- Crowd Aggregators (\(W_{C.A.}\)): Humans who are paid to look at the list of translations and write the best possible version.
- LLM Aggregators (\(W_{L.A.}\)): An LLM (like GPT-4) prompted to read the list of raw translations and infer the correct original meaning to generate a high-quality answer.
3. The Model Aggregator
Finally, all the answers—the raw ones from creators, the refined ones from human aggregators, and the synthesized ones from LLM aggregators—are pooled together. A mathematical Model Aggregator (like RASA or SMS) makes the final selection.
By combining these resources, the system creates a “Human-LLM Hybrid.” It doesn’t just rely on the LLM to do the job from scratch, nor does it rely solely on noisy human data. It uses the LLM to clean up the human data, and then uses math to pick the winner.
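To make the Aggregator stage concrete, here is a minimal sketch of an LLM Aggregator plus the hybrid answer pool, assuming the OpenAI Python client. The `llm_aggregate` helper and the prompt wording are illustrative choices of this post, not the exact prompt or setup used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_aggregate(question: str, crowd_answers: list[str], model: str = "gpt-4") -> str:
    """Ask an LLM to read the crowd's raw answers and write one refined answer."""
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(crowd_answers))
    prompt = (
        f"Task: {question}\n"
        f"Below are answers written by several crowd workers:\n{numbered}\n\n"
        "Infer the intended meaning and write a single, high-quality answer in your own words."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Hybrid pool: raw crowd answers plus one (or more) LLM-aggregated answers.
crowd_answers = [
    "For me it is not necessary.",
    "I never need wheelchair.",
    "The wheelchair is unnecessary.",
]
pool = crowd_answers + [
    llm_aggregate("Translate the Japanese sentence into English.", crowd_answers)
]
# `pool` is then handed to a Model Aggregator (SMV / SMS / RASA) for the final pick.
```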
The Mathematical Engines: How the Best Answer is Picked
Even with the CAMS framework, we eventually need a mathematical way to select the final output. The paper utilizes three specific “Model Aggregators.” It is important to understand these to grasp the results later.
1. Sequence Majority Voting (SMV)
This is the baseline. It uses a sentence encoder (like the Universal Sentence Encoder) to turn every text answer into a numerical vector. It calculates the average vector (centroid) of all answers. Then, it selects the specific answer that is closest to this average. It assumes the majority is right.
2. Sequence Maximum Similarity (SMS)
SMS is slightly more robust. For every answer, it calculates how similar it is to every other answer using cosine similarity. It sums up these similarity scores. The answer with the highest total similarity score wins. It’s like a popularity contest where every answer votes for its neighbors.
3. Reliability Aware Sequence Aggregation (RASA)
RASA is the most advanced method. It assumes that not all workers are equal. Some are consistently excellent; others are spammy or incompetent.
- Iterative Learning: RASA alternates between estimating the “true” answer and estimating the “reliability” of each worker.
- Weighting: If a worker consistently provides answers close to the estimated truth, their reliability score (\(\theta\)) goes up. In the next round, their answers carry more weight.
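For readers who like to see the mechanics, below is a minimal sketch of all three model aggregators. Two caveats: the paper uses the Universal Sentence Encoder, while this sketch substitutes a sentence-transformers model for convenience, and `rasa_like` is only a toy version of RASA's alternate-and-reweight idea, not the published update rules.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed(answers: list[str]) -> np.ndarray:
    """Encode answers and L2-normalize so dot products are cosine similarities."""
    emb = encoder.encode(answers)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def smv(answers: list[str]) -> str:
    """Sequence Majority Voting: return the answer closest to the centroid."""
    emb = embed(answers)
    centroid = emb.mean(axis=0)
    return answers[int(np.argmax(emb @ centroid))]

def sms(answers: list[str]) -> str:
    """Sequence Maximum Similarity: return the answer most similar to all others."""
    emb = embed(answers)
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)                          # ignore self-similarity
    return answers[int(np.argmax(sim.sum(axis=1)))]

def rasa_like(answers: list[str], n_iter: int = 10) -> str:
    """Toy version of RASA's idea: alternate truth estimation and worker weighting."""
    emb = embed(answers)
    theta = np.full(len(answers), 1.0 / len(answers))   # worker reliabilities, start uniform
    for _ in range(n_iter):
        truth = (theta[:, None] * emb).sum(axis=0)      # reliability-weighted "truth" estimate
        truth /= np.linalg.norm(truth)
        theta = np.clip(emb @ truth, 1e-6, None)        # closer to the estimate => more reliable
        theta /= theta.sum()
    return answers[int(np.argmax(emb @ truth))]
```

Note that all three remain extractive: they can only return an answer already in the pool, which is exactly why the abstractive human and LLM aggregators upstream matter.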
The authors hypothesized that applying these models to a hybrid pool of answers (Human + LLM) would yield superior results.
Experimental Setup
To test this, the researchers used real-world crowdsourcing datasets (J1, T1, T2) involving Japanese-to-English translations.
- Crowd Creators: Workers from the CrowdWSA dataset.
- Crowd Aggregators: New workers recruited from a platform called Lancers.
- LLM Aggregators: GPT-4 and Gemini Pro. (Note: We will focus primarily on the GPT-4 results as they were generally superior).
Metrics:
- GLEU & METEOR: These are standard NLP metrics used to evaluate translation quality by comparing the output to a “Gold Standard” (expert translation). Higher is better.
- Embedding Similarity: Measures how semantically close the output is to the gold standard.
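As a quick illustration of how such scores can be computed, here is a sketch using NLTK's implementations of GLEU and METEOR on a single item. Whitespace tokenization and a single reference are simplifying assumptions of this post; the paper's exact evaluation setup may differ.

```python
# METEOR needs WordNet data: run nltk.download("wordnet") once beforehand.
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score

gold = "I don't need a wheelchair at all".split()       # tokenized gold-standard reference
hypothesis = "I do not need a wheelchair".split()       # tokenized system output

print("GLEU:  ", sentence_gleu([gold], hypothesis))     # references first, then hypothesis
print("METEOR:", meteor_score([gold], hypothesis))
```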
Results: Man vs. Machine vs. Hybrid
The results provided by the paper are illuminating, challenging some common assumptions about Generative AI.
Question 1: Who is smarter individually?
First, the researchers looked at the average quality of a single answer produced by a Crowd Creator versus a Crowd Aggregator versus an LLM Aggregator.

Table 3 (above) reveals several key insights:
- LLMs are better on average: Look at the “MEAN” column. For dataset J1, Crowd Creators had a mean GLEU score of 0.1868. The LLM Aggregator (GPT-4) achieved 0.2729. The LLM is significantly better than the average human worker.
- Humans have higher ceilings: Look at the “MAX” column. The best Crowd Aggregator achieved a perfect score of 1.0000. The LLM peaked at 0.2756. This means that while humans are noisy, the best human is often better than the LLM.
- The Diversity Problem: Look at the “STD” (Standard Deviation) column. The LLM has a tiny standard deviation (0.0018). This means the LLM is extremely consistent. However, in aggregation tasks, consistency can be a weakness. We want diversity. If the LLM makes a mistake, it tends to make the same mistake repeatedly. Humans, with their high variance, provide a broader range of potential answers, increasing the chance that the “truth” is somewhere in the pile.
Question 2: Does the Hybrid approach work?
The core of the paper is determining if combining these forces yields a better result than using them in isolation.
The table below compares different configurations.
- Group I: Only Crowd Creators (The old way).
- Group II: Only Aggregators (Crowd or LLM).
- Group IV: The Hybrid (Creators + Aggregators + LLMs).

Focusing on the SMS and RASA columns (the stronger algorithms) in Table 5:
- Hybrid Wins: The rows representing Group IV (combinations of \(A_{C.C}\), \(A_{C.A}\), and \(A_{L.A}\)) consistently score the highest. For example, in dataset J1 using SMS, the hybrid model scores 0.3003, beating the LLM-only score of 0.2846 and the Crowd-Creator-only score of 0.2489.
- Synergy: This proves that the LLM benefits from the “noise” of the crowd. The raw human answers provide context and nuance that the Model Aggregator can use to steer the LLM’s outputs toward the truth.
- Model Choice Matters: SMV (Majority Voting) performed poorly across the board. This confirms that simple averaging doesn’t work well for complex text or hybrid data. You need sophisticated algorithms like SMS or RASA to find the needle in the haystack.
Sensitivity Analysis: How many LLMs do you need?
An interesting practical question is: “How many LLM agents should I run?” Since LLMs are non-deterministic (temperature > 0), you can ask the same prompt 5 times and get 5 slightly different answers.

Figure 3 shows the performance as we increase the number of LLM aggregators from 1 to 9.
- Performance varies: improvement is not monotonic. Sometimes more agents help; sometimes performance plateaus.
- Risk Mitigation: However, the data suggests that using only one LLM aggregator is risky (see the dip at x=1 for some lines). Using a small ensemble of LLM calls (e.g., 3 to 5) ensures stability.
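In practice, such an ensemble is just the same aggregation call repeated. A minimal sketch, reusing the illustrative `llm_aggregate` helper from earlier in this post (the API's default sampling temperature is above zero, so each call can return a slightly different answer):

```python
def llm_aggregator_ensemble(question: str, crowd_answers: list[str], n_agents: int = 3) -> list[str]:
    """Run the same aggregation prompt n_agents times to get a small, diverse ensemble."""
    return [llm_aggregate(question, crowd_answers) for _ in range(n_agents)]

# The ensemble's outputs join the candidate pool alongside the human answers:
# pool = crowd_answers + llm_aggregator_ensemble(question, crowd_answers)
```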
Discussion: Why does this matter?
This paper is significant because it reframes the narrative of “AI replacing jobs.” Instead of viewing the LLM as a replacement for the crowd, it views the LLM as a collaborator.
The “Needle in a Haystack” Effect
The authors note a crucial observation: “One purpose of answer aggregation methods is to estimate good workers and good answers from the raw crowd answers (looking for the needle in a haystack).”
Because the “MAX” performance of humans is so high (as seen in the individual results), the goal of the system is to ensure those brilliant human outliers are identified. The LLM raises the floor of quality (getting rid of bad answers), while the human crowd raises the ceiling (providing occasional flashes of perfect insight). The Model Aggregator (RASA/SMS) then acts as the bridge, selecting the high-quality human insights that align with the consistent reasoning of the LLM.
Cost Efficiency
While this post focuses on quality, the paper briefly touches on cost. LLM aggregators cost less than $0.01 per instance, while crowd aggregators cost around $0.36, a gap of more than thirtyfold. By using a hybrid model, perhaps many LLM calls plus a smaller number of high-quality human aggregators, requesters can balance budget and accuracy effectively.
Conclusion
The “Creator-Aggregator Multi-Stage” (CAMS) framework represents the next evolution in crowdsourcing. By treating Large Language Models as a specific type of “worker” within a broader pipeline, we can achieve results that neither humans nor AI could achieve alone.
For students of NLP and crowdsourcing, the takeaways are clear:
- Don’t trust the average: In text annotation, simple majority voting (SMV) is insufficient.
- Diversity is fuel: The low variance of LLMs is a limitation. Human noise is actually a feature, not just a bug, because it provides the diversity necessary for robust aggregation algorithms to work.
- Hybrid is the future: The strongest systems utilize the reliability of algorithms (RASA), the consistency of LLMs, and the peak performance of human creativity together.
As we move forward, we can expect to see more of these “Human-in-the-loop” (or rather, “AI-in-the-human-loop”) architectures defining how we build the ground truth datasets that power the next generation of models.