Large Language Models (LLMs) have a well-known tendency to “hallucinate,” producing fluent but factually incorrect information. To mitigate this, researchers rely on Uncertainty Quantification (UQ). The goal of UQ is simple: we want the model to tell us when it is unsure, so we can flag those responses for human review or discard them entirely.

But how do we know if a UQ method is actually working? We have to test it. Typically, we generate an answer, ask the UQ method for a confidence score, and then check if the answer is actually correct. If the UQ method assigns low confidence to wrong answers and high confidence to right answers, it works.

However, a new paper titled “Revisiting Uncertainty Quantification Evaluation in Language Models” uncovers a critical flaw in this process. The researchers demonstrate that the tools we use to judge “correctness” are themselves biased—specifically regarding the length of the response. When this bias aligns with the biases inside the UQ methods, it creates a spurious interaction that rigs the evaluation, making certain methods look far better than they actually are.

In this deep dive, we will unpack how this “mutual bias” works, why standard metrics like ROUGE and BERTScore might be misleading us, and how we can fix the evaluation protocol using LLMs as judges.

The Toolkit: How We Measure Uncertainty

Before analyzing the flaw, we need to understand the standard setup for evaluating uncertainty.

1. The UQ Methods

A UQ method assigns a score to an LLM’s output (\(\hat{y}\)) given an input (\(x\)). One of the most fundamental approaches is Negative Sequence Probability. This method multiplies together the probabilities the model assigned to each token in the generated sequence and negates the product, so that higher scores signal higher uncertainty.

\[ U_{\text{NSP}}(x, \hat{y}) = -P(\hat{y} \mid x) = -\prod_{l=1}^{L} P(\hat{y}_l \mid \hat{y}_{<l}, x) \]

Here, \(L\) represents the length of the generated answer. As you can see in the equation, because probabilities are always less than 1, multiplying more of them together (a longer sequence) naturally results in a lower total probability. This means this specific UQ method has an inherent length bias: it naturally assigns higher “uncertainty” (lower probability) to longer answers.
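
To make that bias concrete, here is a minimal sketch (my own illustration, not the paper’s code) that computes the negated product of per-token probabilities; the toy probability values are invented:

```python
import numpy as np

def negative_sequence_probability(token_probs):
    """Uncertainty as the negated product of per-token probabilities.
    token_probs: P(y_l | y_<l, x) for each generated token.
    Higher (closer to zero) return values mean higher uncertainty."""
    log_probs = np.log(np.asarray(token_probs))   # log-space avoids underflow on long sequences
    return -np.exp(log_probs.sum())

# Two answers with identical per-token confidence, differing only in length:
short_answer = [0.9, 0.9, 0.9]   # 3 tokens
long_answer = [0.9] * 30         # 30 tokens
print(negative_sequence_probability(short_answer))  # ~ -0.729
print(negative_sequence_probability(long_answer))   # ~ -0.042  (looks far more "uncertain")
```

The longer answer is scored as more uncertain even though the model was equally confident about every token it produced.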

Other methods include Entropy (measuring the spread of the probability distribution) or Learned Probes (training a small classifier to predict if the model is right).

2. The Correctness Functions

To evaluate the UQ method, we need to know the ground truth: was the model’s answer actually correct? Since we can’t manually grade millions of answers, we use automated Correctness Functions.

These functions compare the model’s generated answer (\(\hat{y}\)) to a reference “gold standard” answer (\(y\)).

Table listing various correctness functions like ROUGE, SQuAD, BERTScore, and LM-as-a-judge.

As shown in the table above, these functions fall into three categories:

  • Lexical-based (e.g., ROUGE): These count how many words overlap between the model’s answer and the reference (see the sketch after this list).
  • Embedding-based (e.g., BERTScore): These use a smaller model (like BERT) to check if the two answers have similar semantic embeddings.
  • LM-as-a-judge: This uses a powerful LLM (like GPT-4 or Qwen) to read both answers and decide if they mean the same thing.
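
To see why lexical overlap penalizes verbosity, here is a minimal SQuAD-style token-F1 scorer. This is my own sketch for illustration, not the exact metric implementation used in the paper:

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """SQuAD-style lexical overlap: F1 over whitespace-separated tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A correct but verbose answer scores far lower than a terse one.
print(token_f1("Paris", "Paris"))                                       # 1.0
print(token_f1("The capital of France is the city of Paris", "Paris"))  # 0.2
```

The verbose answer is penalized purely for its extra words, even though it is just as correct; this is exactly the length sensitivity that matters later.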

3. The Performance Metric: AUROC

Finally, to score the UQ method, we typically use the AUROC (Area Under the Receiver Operating Characteristic curve).

\[ \text{AUROC} = P\big( U(x_i, \hat{y}_i) < U(x_j, \hat{y}_j) \;\big|\; h_i = 1,\, h_j = 0 \big) \]

The AUROC metric asks a simple probabilistic question: If I pick one random correct answer (\(h_i=1\)) and one random incorrect answer (\(h_j=0\)), what is the probability that the UQ method assigned a lower uncertainty score to the correct one?

A perfect UQ method has an AUROC of 1.0. A random guess has an AUROC of 0.5.
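
Here is the same definition written out as a pairwise comparison in Python. This is a minimal sketch with toy data (the function name and numbers are mine); in practice you would get the same value from a standard ROC-AUC routine applied to the negated uncertainty scores:

```python
import numpy as np

def pairwise_auroc(uncertainty, correct):
    """AUROC as defined above: P(U_i < U_j | h_i = 1, h_j = 0), ties counted as 0.5."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    u_correct = uncertainty[correct]    # uncertainty scores of correct answers
    u_wrong = uncertainty[~correct]     # uncertainty scores of incorrect answers
    wins = (u_correct[:, None] < u_wrong[None, :]).mean()
    ties = (u_correct[:, None] == u_wrong[None, :]).mean()
    return wins + 0.5 * ties

# Toy example: low uncertainty on correct answers, high on incorrect ones.
u = [0.1, 0.2, 0.8, 0.9]
h = [1,   1,   0,   0]
print(pairwise_auroc(u, h))  # 1.0, a perfect separator
```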

The Core Problem: The “Proxy” Trap

The problem lies in the fact that we don’t have the true correctness labels (\(h\)). We are relying on an estimated correctness (\(\hat{h}\)) provided by functions like ROUGE or BERTScore.

\[ \widehat{\text{AUROC}} = P\big( U(x_i, \hat{y}_i) < U(x_j, \hat{y}_j) \;\big|\; \hat{h}_i = 1,\, \hat{h}_j = 0 \big) \]

If our correctness function (\(\hat{h}\)) were perfect, there would be no issue. But we know automated metrics make mistakes. The researchers mathematically proved that the nature of these mistakes determines whether our evaluation is valid or broken.

Scenario A: Random Noise (The Good News)

If the correctness function makes random errors—sometimes marking right answers as wrong, and wrong answers as right, with no pattern—the evaluation is noisy but unbiased. The AUROC score might drop closer to 0.5 (random), but the ranking of different UQ methods remains largely stable.

Scenario B: Mutual Bias (The Bad News)

The danger arises when the errors in the correctness function are correlated with the UQ method.

Imagine a scenario where:

  1. The UQ Method tends to be more “uncertain” about long answers (because of the probability multiplication we saw earlier).
  2. The Correctness Function tends to mark long answers as “incorrect” (perhaps because extra words reduce the percentage of overlap with a short reference answer).

If both the judge (Correctness Function) and the student (UQ Method) share a bias against long answers, they will agree with each other even if the answer is actually correct.

Mathematically, the researchers show that when these errors correlate, the estimated probability of distinguishing correct from incorrect samples shifts:

\[ P\big( U(x_i, \hat{y}_i) < U(x_j, \hat{y}_j) \;\big|\; \hat{h}_i = 1,\, \hat{h}_j = 0 \big) \;\neq\; P\big( U(x_i, \hat{y}_i) < U(x_j, \hat{y}_j) \;\big|\; h_i = 1,\, h_j = 0 \big) \]

This inequality proves that any mutual bias introduces systematic distortions. It can artificially inflate the AUROC score, making a flawed UQ method look state-of-the-art simply because it shares a bias with the correctness metric.
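
This failure mode is easy to reproduce in a toy simulation. The sketch below is my own illustration, not the paper’s experiment, and every number in it is invented: a “UQ” score that only tracks response length looks useless under random label noise (Scenario A) but looks strong under a length-biased correctness proxy (Scenario B):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

length = rng.integers(5, 100, size=n)          # response length in tokens
h_true = rng.random(n) < 0.5                   # true correctness, independent of length
uq_score = length + rng.normal(0, 5, size=n)   # a "UQ" score that only tracks length

def pairwise_auroc(uncertainty, correct):
    correct = np.asarray(correct, dtype=bool)
    u_c, u_w = uncertainty[correct], uncertainty[~correct]
    return (u_c[:, None] < u_w[None, :]).mean()

# With the true labels, the length-tracking score carries no signal.
print("true AUROC:          ", round(pairwise_auroc(uq_score, h_true), 3))    # ~0.50

# Scenario A: the correctness proxy makes random mistakes (20% label flips).
flips = rng.random(n) < 0.2
h_noisy = np.where(flips, ~h_true, h_true)
print("AUROC, random noise: ", round(pairwise_auroc(uq_score, h_noisy), 3))   # still ~0.50

# Scenario B: the proxy is biased against long answers (flags them incorrect).
h_biased = h_true & (length < 50)
print("AUROC, length-biased:", round(pairwise_auroc(uq_score, h_biased), 3))  # well above 0.5
```

Nothing about the method improved between the first and third line; only the grader changed.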

Empirical Evidence: The Rankings are Unstable

To see if this theoretical danger exists in the real world, the authors ran extensive experiments across 4 datasets, 4 models, and 8 UQ methods.

If the choice of correctness function didn’t matter, the ranking of UQ methods should stay roughly the same regardless of whether we used ROUGE, BERTScore, or an LLM Judge.

Bar chart showing how the ranking of UQ methods changes drastically depending on the correctness function used.

Figure 1 above shows the results, and they are alarming.

  • Look at the Negative Sequence Probability (the orange/tan bar). When evaluated by ROUGE-L, it performs competitively.
  • However, when evaluated by LM-as-a-judge, its performance ranking drops significantly.
  • Similarly, simple baselines like Token Length (blue bar)—which literally just counts how long the answer is—perform shockingly well according to ROUGE and SentenceBERT.

This suggests that some correctness functions are rewarding methods simply for tracking length, rather than tracking semantic uncertainty.
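
One way to quantify this instability on your own results is to rank-correlate the AUROC scores that two correctness functions assign to the same set of UQ methods. The sketch below uses invented AUROC numbers purely to show the mechanics; it is not data from the paper:

```python
from scipy.stats import spearmanr

# AUROC of several UQ methods under two correctness functions (illustrative values only).
auroc_rouge = {"neg_seq_prob": 0.78, "entropy": 0.74, "probe": 0.70, "token_length": 0.76}
auroc_judge = {"neg_seq_prob": 0.62, "entropy": 0.71, "probe": 0.73, "token_length": 0.51}

methods = sorted(auroc_rouge)
rho, _ = spearmanr([auroc_rouge[m] for m in methods],
                   [auroc_judge[m] for m in methods])
print(f"Spearman rank correlation between the two rankings: {rho:.2f}")
# Near 1.0 would mean the choice of correctness function barely matters;
# a low or negative value means the evaluation itself is unstable.
```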

Validating the Judge: Humans vs. Machines

To determine which correctness function we should actually trust, the researchers conducted a human evaluation. They hired annotators to grade 450 samples and compared the human labels to the automated metrics.

Heatmap showing Cohen’s Kappa agreement between human annotations and various correctness functions.

The heatmap in Figure 2 shows the agreement (Cohen’s Kappa) between humans and the metrics.

  • Red/Low Scores: ROUGE and SQuAD show poor agreement with humans in many settings.
  • Blue/High Scores: LM-as-a-judge (Prompt) and AlignScore show the highest consistency with human annotators.

This confirms that lexical metrics (like ROUGE) and even some embedding metrics (like BERTScore) are poor proxies for truth in this context.

One reason for the poor performance of standard metrics is their sensitivity to thresholds. To calculate AUROC, we often need to turn a continuous score (such as a ROUGE score of 0.6) into a binary “Correct/Incorrect” label.

Graph showing how human agreement drops sharply as the threshold changes for metrics like ROUGE-L.

Figure 8 demonstrates that finding the right threshold is a nightmare. For ROUGE-L (top left), the agreement peaks at a specific threshold and then crashes. If a researcher picks the wrong threshold, their entire evaluation is invalid. In contrast, AlignScore (bottom row) is incredibly stable regardless of the threshold.
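
If you want to probe this sensitivity on your own data, a simple threshold sweep is enough. The sketch below assumes you have a continuous metric score and a binary human verdict for each sample (the arrays here are toy values) and measures agreement with scikit-learn’s Cohen’s kappa:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_vs_threshold(metric_scores, human_labels, thresholds=np.linspace(0.1, 0.9, 9)):
    """Agreement with human judgments as the binarization threshold varies."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_labels = np.asarray(human_labels, dtype=int)
    return {round(float(t), 1): cohen_kappa_score(human_labels, (metric_scores >= t).astype(int))
            for t in thresholds}

# Toy example: a continuous score (e.g., ROUGE-L) and binary human verdicts.
scores = [0.15, 0.40, 0.55, 0.62, 0.70, 0.92]
humans = [0,    0,    1,    0,    1,    1]
for t, k in kappa_vs_threshold(scores, humans).items():
    print(f"threshold={t}: kappa={k:.2f}")
```

A metric whose kappa collapses as soon as the threshold moves is a fragile proxy for correctness; a stable curve is what you want.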

The Culprit: Response Length Bias

The researchers hypothesized that response length is the hidden variable causing the “mutual bias” discussed in the theory section.

First, they checked the correctness functions. Do they unfairly penalize long answers?

Scatter plots showing that ROUGE-L scores drop as length increases, while AlignScore remains stable.

Figure 5 confirms the bias.

  • Plot (a) ROUGE-L: Notice the downward trend. As the response length (x-axis) increases, the ROUGE score (y-axis) naturally decreases. The metric struggles to handle “verbosity,” penalizing correct but long answers.
  • Plot (b) AlignScore: The distribution is much flatter. Long answers can still receive high correctness scores.

This proves the Correctness Function Bias. Now, what about the UQ methods?

Heatmap of correlations showing strong relationships between UQ methods and response length.

Figure 4 shows the correlation between UQ methods and length.

  • Negative Sequence Probability has a strong correlation with length (as expected from the math).
  • Token Length obviously correlates perfectly with length.

The Spurious Interaction: Because ROUGE penalizes long answers (marking them “Incorrect”), and Negative Sequence Probability assigns high uncertainty to long answers, the UQ method successfully “predicts” that long answers are “incorrect.”

It looks like the UQ method is detecting hallucinations. In reality, it’s just detecting that the sentence is long, and the grading rubric hates long sentences.
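
Before trusting an AUROC number, it is worth running this diagnostic on your own evaluation data. The sketch below (the function name and toy arrays are mine) checks whether both the uncertainty scores and the proxy correctness labels correlate with response length; two strong correlations of opposite sign are the warning sign for exactly this spurious interaction:

```python
import numpy as np
from scipy.stats import spearmanr

def length_bias_report(uncertainty, proxy_correct, response_lengths):
    """Spearman correlations of uncertainty and proxy correctness with length."""
    rho_uq, _ = spearmanr(response_lengths, uncertainty)
    rho_cf, _ = spearmanr(response_lengths, proxy_correct)
    return {"uncertainty_vs_length": round(rho_uq, 2),
            "correctness_vs_length": round(rho_cf, 2)}

# Toy data: uncertainty rises with length, proxy correctness falls with it.
lengths = np.array([5, 12, 20, 35, 60, 90])
uq      = np.array([0.1, 0.2, 0.3, 0.5, 0.8, 0.9])
correct = np.array([1, 1, 1, 1, 0, 0])
print(length_bias_report(uq, correct, lengths))
# uncertainty correlates +1.0 with length; proxy correctness is strongly negative
```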

A Better Path Forward: LM-as-a-judge

The findings of this paper serve as a stark warning: We cannot blindly trust standard NLP metrics for evaluating Uncertainty Quantification. Using ROUGE or BERTScore creates a feedback loop of length bias that obscures the true performance of our models.

The data points to a clear solution.

Scatter plots comparing various metrics against response length. LM-as-a-judge shows high stability.

As seen in Figure 7 (bottom right), LM-as-a-judge remains robust across different lengths. It does not blindly penalize a model for being verbose. Because it lacks the length bias, it breaks the spurious correlation.
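
The paper does not prescribe one specific judging prompt, so the sketch below is only an illustration of the LM-as-a-judge idea. The prompt wording is my own, and `ask_llm` is a hypothetical stand-in for whichever chat-completion call you use (e.g., to GPT-4 or Qwen):

```python
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Does the model answer convey the same meaning as the reference answer,
regardless of its length or phrasing? Reply with exactly one word: correct or incorrect."""

def judge_correctness(question, reference, candidate, ask_llm):
    """ask_llm: a callable that sends a prompt to a strong LLM and returns its text reply."""
    reply = ask_llm(JUDGE_PROMPT.format(question=question,
                                        reference=reference,
                                        candidate=candidate))
    return 1 if reply.strip().lower().startswith("correct") else 0
```

Explicitly asking the judge to ignore length and phrasing is what breaks the length bias that lexical metrics bake in.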

Conclusion

Evaluating Artificial Intelligence is becoming as complex as building it. This research highlights a “mutual bias” trap where the flaws in our evaluation metrics hide the flaws in our methods.

The key takeaways for students and researchers are:

  1. Distrust Lexical Metrics for UQ: ROUGE and similar metrics introduce systematic biases that distort AUROC rankings.
  2. Beware of Confounders: Length is the obvious confounder here, but other hidden variables (like vocabulary complexity) could cause similar spurious correlations.
  3. Adopt LM-as-a-judge: While computationally more expensive, using a strong LLM to evaluate correctness aligns best with human judgment and provides the fairest assessment of uncertainty methods.

By refining our evaluation protocols, we can stop chasing spurious correlations and focus on building UQ methods that genuinely understand when a model is hallucinating.