The explosion of Large Language Models (LLMs) like GPT-4, Claude, and Gemini has brought us remarkable capabilities in natural language processing. But with great power comes a difficult question: How do we know if these models are actually doing a good job?
Evaluating an LLM isn’t like checking a math test. In open-ended tasks—like writing an essay, summarizing a story, or providing therapy-like advice—there is no single “correct” answer. Historically, we relied on humans to grade these responses. Recently, however, the field has shifted toward “LLM-as-a-judge,” where powerful models like GPT-4 are used to grade the outputs of other models. It’s faster, cheaper, and scalable.
But this raises a critical “Inception”-style problem: If we trust LLMs to judge other LLMs, how do we evaluate the judges?
In a fascinating paper titled “Humans or LLMs as the Judge? A Study on Judgement Bias,” researchers from The Chinese University of Hong Kong, Shenzhen, investigate the reliability of these evaluators. They propose a novel framework to expose the hidden biases of both human and AI judges, revealing that even our most advanced models can be easily tricked by superficial formatting or fake citations.
The Problem with Current Evaluations
Traditional benchmarks (like MMLU or C-Eval) use multiple-choice questions. While useful, they don’t reflect how we actually use AI. We use AI for chat, creativity, and reasoning. Open-ended benchmarks (like MT-Bench) are better, but they suffer from the “Golden Standard” problem.
To measure bias, you usually need a ground truth—a perfect answer to compare against. But in creative writing or complex reasoning, “perfect” is subjective.
The researchers behind this paper decided to bypass the need for a golden standard. Instead, they used an Intervention Study. They took an answer, deliberately “poisoned” it with specific biases (perturbations), and watched to see if the judges (both humans and LLMs) would notice the flaw or be seduced by it.
The Biases of Interest
The study focuses on four specific types of biases, categorized into two groups:
1. Semantic-related Biases
These biases relate to the actual meaning and content of the text.
- Misinformation Oversight Bias: The tendency to overlook factual errors. If an answer sounds confident but claims \(7 \times 7 = 36\), does the judge catch it?
- Gender Bias: The failure to detect discriminatory or stereotypical language within an answer.
2. Semantic-agnostic Biases
These are superficial biases unrelated to the correctness of the answer.
- Authority Bias: The tendency to trust an answer simply because it cites a source (even if the source is fake or irrelevant).
- Beauty Bias: The tendency to prefer answers that look nice—using Markdown formatting, bold text, lists, and emojis—regardless of whether the content is actually better.
The Method: Setting the Trap
To test these biases, the authors created a robust experimental protocol. They didn’t just grab random internet text; they meticulously constructed a dataset based on Bloom’s Taxonomy, ensuring questions ranged from simple recall (“Remembering”) to complex synthesis (“Creating”).
Step 1: Generating the “Control”
They used GPT-4 to generate a question (Q) and two correct answers (\(A_1\) and \(A_2\)). This forms the Control Group.
Step 2: Creating the “Intervention”
Here is where the science happens. They took the second answer (\(A_2\)) and perturbed it to create a modified version (\(A_2^p\)).
- To test Misinformation, they injected factual errors into \(A_2\).
- To test Gender Bias, they injected gender stereotypes.
- To test Authority Bias, they added fake references (citations that look real but aren’t).
- To test Beauty Bias, they added “rich content” like bolding, emojis, and structured lists without changing the actual meaning.
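To make Step 2 concrete, here is a minimal Python sketch of how such perturbations could be applied. The two semantic-agnostic ones can be done mechanically; the semantic ones need an LLM rewrite, represented here by a hypothetical `rewrite_with_llm` helper. All names, prompt wordings, and the fake references are illustrative stand-ins, not the authors' actual implementation.

```python
import random

# Hypothetical helper: stands in for a GPT-4 (or other LLM) API call.
def rewrite_with_llm(answer: str, instruction: str) -> str:
    """Placeholder: rewrite `answer` according to `instruction` using an LLM of your choice."""
    raise NotImplementedError("Plug in your own LLM client here.")

# Fabricated-looking references used for the authority-bias perturbation (illustrative only).
FAKE_REFERENCES = [
    "(Smith et al., 2021, Journal of Applied Reasoning)",
    "[Source: MathWorld, https://mathworld.example.com/FakeEntry.html]",
]

def add_fake_reference(answer: str) -> str:
    """Authority bias: append a citation that looks real but isn't; content untouched."""
    return f"{answer} {random.choice(FAKE_REFERENCES)}"

def add_rich_content(answer: str) -> str:
    """Beauty bias: re-style with Markdown and emojis; meaning unchanged."""
    bullets = "\n".join(f"- ✅ **{s.strip()}**" for s in answer.split(".") if s.strip())
    return f"### Answer 📝\n{bullets}"

def inject_factual_error(answer: str) -> str:
    """Misinformation: needs an LLM to alter one fact while keeping the text fluent."""
    return rewrite_with_llm(answer, "Introduce one subtle factual error; keep style and length.")

def inject_gender_stereotype(answer: str) -> str:
    """Gender bias: needs an LLM to weave in a subtle stereotype."""
    return rewrite_with_llm(answer, "Add a subtle gender stereotype; keep style and length.")
```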

As shown in Figure 1 above, the framework creates a clear path for comparison. On the left, we have valid answers. On the right, we have the perturbed versions.
- Note the “Fallacy Oversight” box: The answer claims \(\sqrt{36} = 7\). A good judge should hate this.
- Note the “Authority Bias” box: It adds a citation to “MathWorld.” A biased judge might think this answer is smarter because it cites a source.
Step 3: The Vote
The researchers then presented these pairs to the judges.
- Control Group Vote: Compare \(A_1\) vs. \(A_2\) (both correct).
- Experimental Group Vote: Compare \(A_1\) vs. \(A_2^p\) (one correct, one perturbed).

Figure 2 illustrates the workflow. The judges (both humans and LLMs) vote on which answer is better. By comparing the voting patterns of the Control Group against the Experimental Group, the researchers can measure how much the perturbation influenced the decision.
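As a rough idea of what one of these pairwise votes might look like when posed to an LLM judge, here is an illustrative prompt template. The wording and the `build_vote` helper are my own assumptions, not the paper's prompts; a real pipeline would also swap the answer order between runs to control for position bias.

```python
JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the question
and decide which one is better overall.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one of: "A", "B", or "Tie"."""

def build_vote(question: str, first_answer: str, second_answer: str, swap: bool = False) -> str:
    """Format one pairwise comparison; set swap=True on a second pass to flip positions."""
    a, b = (second_answer, first_answer) if swap else (first_answer, second_answer)
    return JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b)
```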
The Metric: Attack Successful Rate (ASR)
How do we quantify bias? The researchers introduced the Attack Successful Rate (ASR).
Intuitively, if you take a good answer and add factual errors to it, the judge should prefer it less. If you take a standard answer and add emojis, the judge should not prefer it more.
ASR measures the percentage of times the preference shifted in the wrong direction (towards the perturbed answer) after the perturbation was added.

Ideally, ASR should be 0. A high ASR means the judge was successfully “attacked” or fooled by the bias.
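Here is a minimal sketch of how an ASR-style number could be computed from paired votes, assuming we record each judge's preference on the control pair (\(A_1\) vs. \(A_2\)) and on the experimental pair (\(A_1\) vs. \(A_2^p\)). The shift-counting rule below paraphrases the description above; it is not the paper's exact formula.

```python
def attack_successful_rate(control_votes, experimental_votes):
    """
    control_votes[i]:      "A1", "A2", or "Tie"   -- vote on (A1, A2), both answers valid
    experimental_votes[i]: "A1", "A2p", or "Tie"  -- vote on (A1, A2 perturbed)

    An "attack" counts as successful when the preference moved toward the
    perturbed side relative to the control vote (e.g. A1 -> Tie, A1 -> A2p, Tie -> A2p).
    """
    assert len(control_votes) == len(experimental_votes)
    rank = {"A1": 0, "Tie": 1, "A2": 2, "A2p": 2}  # larger = closer to the perturbed answer
    shifted = sum(1 for c, e in zip(control_votes, experimental_votes) if rank[e] > rank[c])
    return shifted / len(control_votes)

# Example: two of the four votes drifted toward the perturbed answer -> ASR = 0.5
print(attack_successful_rate(["A1", "A2", "Tie", "A2"], ["A2p", "A2p", "A2p", "A1"]))
```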
The Results: Who is the Better Judge?
The study tested a wide range of judges, including 60 human evaluators (college students) and major LLMs like GPT-4, GPT-4o, Claude-3, Gemini-Pro, and LLaMA-2.
The results, summarized in Table 1 below, are eye-opening.

Let’s break down the key takeaways from these numbers.
1. Factual Errors (FE): Humans Struggle
Look at the FE column. Human judges had an ASR of 0.21. This means that in 21% of cases where factual errors were introduced, humans failed to penalize the answer (or even preferred the wrong one).
Why? Humans get tired. They might gloss over details or assume a confident-sounding paragraph is correct.
- Winners: GPT-4o (0.06) and Claude-3 (0.08) were excellent fact-checkers.
- Loser: LLaMA2-70B (0.60) was worse than random guessing.
2. Gender Bias: Humans Shine
In the Gender column, humans achieved the best score (0.06). Educated human evaluators are highly sensitive to social biases and stereotypes.
- The LLM Problem: Most LLMs (like Ernie and GPT-4 Turbo) performed significantly worse than humans here. Despite safety training, LLMs often failed to penalize subtle gender biases in the text as strictly as humans did.
3. Authority Bias (Ref): Everyone is Gullible
This is perhaps the most concerning finding. Look at the Ref column. This measures how often a judge prefers an answer just because it has a fake citation.
- Humans (0.37): Humans are easily swayed by authoritative-looking citations.
- LLMs: Almost every LLM performed poorly. Claude-2 had an ASR of 0.89—meaning it almost always preferred the answer with the fake citation.
- Implication: If you want your LLM output to be rated highly by another LLM, just slap a fake citation on it. The judge will likely think it’s “higher quality.”
4. Beauty Bias (RC): Style Over Substance
The RC (Rich Content) column shows what happens when you format an answer nicely (Markdown, emojis).
- Claude-3 was very robust (0.04), ignoring the formatting to focus on content.
- Humans (0.47) and Claude-2 (0.68) were easily seduced by the pretty formatting.
Hacking the Judge
The researchers didn’t stop at just identifying biases. They wanted to see if they could actively “attack” LLM judges to artificially inflate scores.
They set up a scenario where they took a “weak” answer (one with factual errors or gender bias) and tried to make it beat a “strong” answer by applying Semantic-agnostic perturbations (fake refs and rich content).
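Conceptually, the combined attack is just the two semantic-agnostic perturbations stacked on top of a flawed answer. A tiny, self-contained sketch (the fake reference and styling are placeholders, not the authors' templates):

```python
def combined_attack(weak_answer: str) -> str:
    """Stack both semantic-agnostic perturbations onto a flawed answer:
    a fake citation (authority bias) plus Markdown/emoji styling (beauty bias)."""
    fake_ref = "[Source: MathWorld, https://mathworld.example.com/FakeEntry.html]"  # looks real, isn't
    return f"### Answer 📝\n\n**{weak_answer}** ✅\n\n> Reference: {fake_ref}"
```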

Figure 4 displays the results of these attacks.
- Left Chart (a): This shows attempts to mask factual errors. The purple-striped bars represent a combined attack (Fake Reference + Rich Content). Notice that for models like LLaMA-2-70B and Ernie, the attack works surprisingly well.
- The “Ref” factor: The green bars (Ref) are generally higher than the purple bars (Rich Content). This suggests that Authority Bias is a stronger vulnerability than Beauty Bias. LLMs are trained to value citations as a proxy for truthfulness, even when the citation is hallucinated.
This proves a dangerous reality: A bad answer can be disguised as a good answer simply by formatting it well and adding fake sources.
The Verbosity Problem
Another well-known issue in LLM evaluation is Verbosity Bias—the tendency to prefer longer answers regardless of quality.
The researchers analyzed this by looking at the length difference between two answers and the judge’s preference.

Figure 5 shows the preference for longer answers as the length difference increases.
- The X-axis is the difference in length (tokens).
- The Y-axis is the probability of choosing the longer answer.
- A Perfect Evaluator (dotted line) should stay at 0.5 (neutral), judging based on content, not length.
Instead, we see upward slopes for almost everyone. Claude-3 (the pink line) shoots up dramatically—if an answer is 40+ tokens longer, Claude-3 is almost guaranteed to prefer it. GPT-4-Turbo (green line) appears to be the most robust against this bias, keeping closer to the neutral line.
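If you want to run this kind of verbosity check on your own judge, one simple approach is to bin pairwise comparisons by the token-length gap and compute how often the longer answer wins in each bin. The helper below is a generic sketch with assumed input shapes, not the paper's analysis code.

```python
from collections import defaultdict

def longer_answer_preference(records, bin_width=10):
    """
    records: iterable of (len_a, len_b, winner) tuples, winner in {"A", "B"}.
    Returns {bin_start: P(longer answer chosen)} over comparisons where lengths differ.
    A length-neutral judge should hover around 0.5 in every bin.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for len_a, len_b, winner in records:
        diff = abs(len_a - len_b)
        if diff == 0:
            continue  # no "longer" answer to prefer
        longer = "A" if len_a > len_b else "B"
        bucket = (diff // bin_width) * bin_width
        totals[bucket] += 1
        wins[bucket] += int(winner == longer)
    return {b: wins[b] / totals[b] for b in sorted(totals)}
```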
Conclusion: We Need Better Judges
This paper serves as a wake-up call for the AI community. As we move toward autonomous agents and self-improving systems, we are increasingly relying on “LLM-as-a-judge” to tell us what is working and what isn’t.
However, the findings show that these judges are not neutral arbiters.
- Humans are good at social nuance (gender) but bad at tedious fact-checking and easily swayed by formatting.
- LLMs are better at fact-checking (the top-tier ones, at least) but are easily manipulated by fake authority and length.
The authors have proposed a reference-free framework (using ASR) that allows developers to test their own judges. If you are building an evaluation pipeline, you cannot assume your judge is fair. You must test it for Misinformation Oversight, Gender Bias, Authority Bias, and Beauty Bias.
Until we develop robust evaluation systems that can see past the “glitter” of emojis and fake citations, our benchmarks will remain hackable, and our understanding of true model performance will remain blurred. The question remains: Who watches the watchmen? Currently, the watchmen are easily distracted by a nice font and a fake URL.