Introduction
We have all experienced the “hallucination” problem with Large Language Models (LLMs). You ask a model for a fact, and it confidently states something completely incorrect. This isn’t just annoying; in fields like medicine, law, or automated decision-making, it can be dangerous.
To mitigate this, we rely on confidence scores. Ideally, when a model is right, it should report a high confidence score. When it is unsure or likely to be wrong, it should report a low score. This relationship is known as calibration. If a model says it is 90% confident, it should be correct 90% of the time.
However, extracting reliable confidence scores from text generation models is surprisingly difficult. Traditional methods often interpret low probability on the top answer as uncertainty. But in open-ended text generation, there might be ten different ways to say the correct answer. If a model splits its probability mass across those ten valid answers, the probability for any single one appears low, falsely signaling that the model is confused.
In the paper Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics, researchers propose a clever, task-agnostic solution. Instead of looking at just the top answer, they analyze the “shape” of the entire output distribution—specifically its slope and the thinness of its tail—to determine if the model actually knows what it’s talking about.
The Problem: Classification vs. Generation
To understand why standard confidence metrics fail in text generation, we first need to look at how they work in classification.
In a classification task (like determining if an image contains a cat or a dog), there is usually one single correct label. If a model assigns 95% probability to “Cat,” it is confident. If it assigns 50% to “Cat” and 50% to “Dog,” it is uncertain.
Text generation is different. Suppose you ask a model to translate a sentence. There might be five valid translations, all slightly different in wording but identical in meaning. A smart, confident model will assign high probabilities to all five of these sequences. Consequently, the probability of the single best sequence might drop to, say, 20%.
If we use traditional metrics that only look at the top sequence’s probability, we would erroneously conclude the model is unconfident.
Figure 1: In classification (left), confidence is easy to spot: one spike dominates. In generation (right), a confident model might distribute probability across several valid sequences (the first three histograms), making the “top” probability lower.
As shown in Figure 1, the rightmost histogram represents a truly unconfident model—the probabilities are flat and spread across everything. However, the middle histograms represent confident generation where the model likes multiple options. Previous methods struggle to distinguish these middle cases from the unconfident case.
The Core Method: Analyzing the Distribution
The researchers propose that even if there are multiple valid answers, a confident model’s output distribution will share two geometric characteristics:
- A Steep Slope: The model assigns significantly higher probability to the “good” set of answers compared to the “average” or “bad” ones.
- A Thin Tail: The probability drops off quickly after the valid answers, leaving very little probability mass for the nonsensical sequences at the “tail” of the distribution.
Figure 2: The researchers hypothesize that a confident model separates “Good” sequences from “Mediocre” ones, resulting in a steep slope and a thin tail.
Based on these characteristics, the authors introduce two new metrics.
1. The Ratio Method
The first metric captures the “steep slope” intuition. It measures the gap between the model’s best guess and an “average” guess.
Specifically, the method computes the ratio between the probability of the top-ranked sequence and that of the \(k\)-th ranked sequence:

\[
\text{Ratio} = \frac{p_{\hat{y}^{(1)}}}{p_{\hat{y}^{(k)}}}
\]
Here, \(p_{\hat{y}^{(1)}}\) is the probability of the number one beam (sequence), and \(p_{\hat{y}^{(k)}}\) is the probability of the \(k\)-th beam.
- High Ratio: The top answer is vastly more likely than the \(k\)-th answer. This implies a steep drop-off, suggesting the model is confident in its top choices.
- Low Ratio: The top answer and the \(k\)-th answer have similar probabilities. The distribution is flat, suggesting the model is unsure.
The value of \(k\) is a hyperparameter tuned on a validation set. As we will see later, different tasks require different \(k\) values depending on how open-ended they are.
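To make this concrete, here is a minimal sketch of the Ratio score (not the authors' released code), assuming you already have the sequence log-probabilities of the beams sorted from most to least likely; the function name and the toy numbers are purely illustrative.

```python
import numpy as np

def ratio_confidence(beam_log_probs, k):
    """Steep-slope confidence: probability of the top beam divided by the
    probability of the k-th beam, computed in log space for stability."""
    return float(np.exp(beam_log_probs[0] - beam_log_probs[k - 1]))

# Toy example with k = 3: a steep distribution vs. a flat one.
steep = np.log([0.50, 0.20, 0.05, 0.03, 0.02])
flat = np.log([0.12, 0.11, 0.10, 0.10, 0.09])
print(ratio_confidence(steep, k=3))  # ~10.0 -> steep drop-off, confident
print(ratio_confidence(flat, k=3))   # ~1.2  -> flat distribution, unsure
```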
2. The Tail Thinness Method
The second metric focuses on the “tail.” A confident model should have a “thin” tail, meaning it assigns almost zero probability to bad sequences. An unconfident model has a “thick” tail, assigning non-negligible probability to many different sequences because it doesn’t know which ones are wrong.
To measure this, the authors adapt a Tail Thinness index:

\[
\text{Tail} = \sum_{i=1}^{N} \left( p_{\hat{y}^{(i)}} \right)^2
\]
This formula sums the squared probabilities of all \(N\) generated sequences.
- If one sequence has 1.0 probability and the rest are 0, the sum is \(1.0^2 = 1\) (Maximum thinness).
- If all 100 sequences have equal probability (0.01), the sum is \(100 \times 0.01^2 = 0.01\) (Very thick tail).
Figure 3: Visualizing Tail Indices. Graph B (one dominant answer) has the highest index (1.0). Graph A (flat distribution) has the lowest (0.01). Graphs C and D show realistic confident generation scenarios with indices in between.
This metric is robust because it doesn’t care if the probability is split between two answers or five; as long as the probability is concentrated on a few answers and drops off for the rest, the score remains relatively high.
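The index is easy to reproduce in a few lines. Below is a minimal sketch that recreates the two worked examples from the text; note that renormalizing over the \(N\) returned beams is an assumption of this sketch, and the paper's exact normalization may differ.

```python
import numpy as np

def tail_thinness(seq_probs):
    """Tail-thinness index: the sum of squared sequence probabilities over
    the N generated candidates. Near 1 when the mass sits on one or a few
    sequences; near 1/N when it is spread flat across all of them."""
    p = np.asarray(seq_probs, dtype=float)
    # Assumption: renormalize over the N returned beams so the index is
    # comparable across examples; the paper's normalization may differ.
    p = p / p.sum()
    return float(np.sum(p ** 2))

# The two worked examples from the text (N = 100):
print(tail_thinness([1.0] + [0.0] * 99))  # 1.0  -> maximally thin tail
print(tail_thinness([0.01] * 100))        # 0.01 -> very thick (flat) tail
```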
Experiments
The researchers tested their methods using two models, BART and Flan-T5, across three distinct types of tasks:
- Translation: WMT English-German/Russian, FLORES
- Question Answering (QA): SQuAD, HotpotQA
- Summarization: CNN-DailyMail, XSum, etc.
They compared their new metrics against several baselines, including:
- Average Token Probability (ATP): scoring confidence from the top sequence's probability alone (a minimal sketch follows this list).
- Entropy: Measuring the randomness of the output.
- Dropout Methods: Running the model multiple times with “dropout” noise to see if it remains consistent.
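For contrast with the distribution-shape metrics, here is a rough sketch of the ATP baseline as described above, taken to be the mean per-token probability of the single top-ranked hypothesis; the paper's exact formulation (e.g. arithmetic vs. geometric mean) may differ.

```python
import numpy as np

def average_token_probability(token_log_probs):
    """ATP baseline sketch: mean per-token probability of the top sequence.
    It ignores every other hypothesis in the beam, which is exactly why it
    can under-report confidence on open-ended tasks."""
    return float(np.mean(np.exp(token_log_probs)))

# Toy example: five tokens with fairly high individual probabilities.
print(average_token_probability(np.log([0.9, 0.8, 0.95, 0.7, 0.85])))  # ~0.84
```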
Evaluation Strategy
To determine if a confidence score is “good,” the authors measured the Spearman correlation between the confidence score and the actual quality of the generation (measured by BLEU for translation, ROUGE for summarization, etc.).
A high correlation means the confidence score accurately predicts the quality of the output.
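In code, the evaluation boils down to one call to `scipy.stats.spearmanr` over per-example (confidence, quality) pairs; the arrays below are made-up numbers, shown only to illustrate the shape of the computation.

```python
from scipy.stats import spearmanr

# Hypothetical per-example scores: one confidence value and one quality
# value (e.g. BLEU or ROUGE against the reference) for each test instance.
confidences = [0.91, 0.35, 0.72, 0.10, 0.64]
qualities = [0.88, 0.41, 0.69, 0.05, 0.70]

rho, p_value = spearmanr(confidences, qualities)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```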
Results
The proposed methods—specifically the Tail Thinness method—outperformed the baselines in the majority of cases.
Table 1: Spearman correlation between confidence scores and quality metrics. The “Tail” and “Ratio” rows (bottom) frequently show higher correlations (bolded/starred) compared to probability and variance-based baselines.
The results highlight a few key findings:
- Tail Thinness is Robust: The Tail method achieved the best performance in 10 out of 16 dataset-model pairs.
- Significant Gains in Open-Ended Tasks: The improvements were particularly notable in translation and QA tasks compared to summarization.
- Better Calibration: The new metrics correctly identified confident outputs even when the top beam had low probability.
Why Do Traditional Methods Fail?
The paper provides a compelling visual example using the SQuAD dataset to explain why the baseline methods underperform.
Figure 4: In the left example, there is only one valid answer, so the top probability is high. In the middle and right examples, there are multiple valid ways to answer. Standard metrics see the lower probability on the middle graph and assume “low confidence.” The Tail/Ratio methods see the shape of the distribution and correctly identify it as “high confidence.”
In the middle graph of Figure 4, the model predicts “2 weeks” but also considers “2 weeks each year” and “2 weeks each year in mid June” as valid options. Because it splits its vote, the probability of “2 weeks” drops. A standard metric flags this as uncertain. However, the Tail Thinness metric sees that the probability is concentrated entirely on variations of the correct answer, with a sharp drop-off afterwards, and correctly marks it as confident.
The Importance of \(k\)
For the Ratio method (Top vs. \(k\)-th), the choice of \(k\) matters. The experiments showed an interesting trend regarding the “open-endedness” of the task.
Figure 5: Correlation vs. \(k\). Open-ended tasks like summarization (Top row, A-C) benefit from a larger \(k\), while closed-ended tasks like QA (G-H) benefit from a smaller \(k\).
In strict QA tasks (where there are only a few right answers), we should compare the top answer to the 2nd or 3rd answer. In summarization (where there are dozens of valid ways to summarize a text), we should compare the top answer to the 80th or 90th answer to get a true sense of the distribution’s slope.
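One plausible way to tune \(k\), consistent with the paper's description of selecting it on a validation set (the function names and loop below are my own sketch, not the authors' procedure), is to sweep candidate values and keep the one whose Ratio scores correlate best with the observed output quality.

```python
import numpy as np
from scipy.stats import spearmanr

def ratio_confidence(beam_log_probs, k):
    # Probability of the top beam divided by that of the k-th beam.
    return float(np.exp(beam_log_probs[0] - beam_log_probs[k - 1]))

def tune_k(val_beam_log_probs, val_quality, candidate_ks):
    """Pick the k whose Ratio scores correlate best (Spearman) with the
    measured quality (BLEU/ROUGE/accuracy) on a held-out validation set."""
    best_k, best_rho = None, float("-inf")
    for k in candidate_ks:
        scores = [ratio_confidence(lp, k) for lp in val_beam_log_probs]
        rho, _ = spearmanr(scores, val_quality)
        if rho > best_rho:
            best_k, best_rho = k, rho
    return best_k, best_rho

# E.g. small k tends to win for closed-ended QA, large k for summarization:
# best_k, rho = tune_k(val_beams, val_rouge, candidate_ks=[2, 3, 5, 10, 50, 90])
```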
Conclusion and Implications
This research highlights a critical nuance in Generative AI: Uncertainty is not the same as diversity.
A model can be extremely confident that the answer lies within a specific set of five sentences, even if it isn’t sure which of those five is the absolute “best.” By moving away from single-sequence probabilities and analyzing the geometry of the output distribution (slopes and tails), we can extract much more reliable confidence scores.
The implications are significant for deploying AI in the real world:
- User Trust: Systems can accurately flag when a human needs to intervene.
- Safety: Models can be programmed to abstain from answering if the distribution is too flat (indicating true ignorance); a toy sketch follows this list.
- Efficiency: These metrics are computed using the standard output probabilities, requiring no expensive retraining or auxiliary models.
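As a toy illustration of the safety point, an abstention rule (not from the paper) could simply gate answers on the tail-thinness score; the threshold below is invented and would have to be tuned per task.

```python
def answer_or_abstain(candidate, tail_score, threshold=0.3):
    """Return the model's answer only when the output distribution is
    concentrated enough; otherwise defer to a human reviewer.
    The 0.3 threshold is purely illustrative."""
    if tail_score >= threshold:
        return candidate
    return "[abstained: distribution too flat, routing to a human]"
```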
While there are limitations—such as the computational cost of generating \(k\) beams to measure the ratio—this approach offers a mathematically grounded way to “read the room” of a neural network, distinguishing between a model that is creatively diverse and one that is simply lost.