Introduction

We have all experienced it: you ask a Large Language Model (LLM) a specific factual question—perhaps about an obscure historical figure or a specific coding error—and it responds with absolute conviction. The grammar is perfect, the tone is authoritative, and the delivery is decisive.

There is just one problem: the answer is completely wrong.

This phenomenon highlights a critical gap in modern Artificial Intelligence. LLMs are trained to generate fluent, persuasive text, often at the expense of accuracy. While we call these “hallucinations,” the danger isn’t just that the model is wrong; it is that the model is persuasively wrong. It mimics the cadence of an expert even when it is essentially guessing.

Ideally, an AI assistant should communicate like a responsible human expert. If it knows the answer, it should say so directly. If it is guessing, it should hedge its language using phrases like “I think,” “It’s possible that,” or “I’m not entirely sure.”

In the research paper “Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?”, researchers from Google Research and Tel Aviv University tackle this exact problem. They investigate whether we can get LLMs to align their “linguistic confidence” (what they say) with their “intrinsic confidence” (what they actually know).

Figure 1: We define faithful response uncertainty based on the gap between the decisiveness (blue) of the response and the model’s intrinsic confidence in it (hatched orange). We empirically show: (1) with standard decoding, models answer decisively even in the presence of uncertainty (top left); (2) when prompted to express uncertainty, generated hedges are not faithful to the model’s intrinsic uncertainty (bottom left).

As illustrated in Figure 1 above, the goal is simple but profound. If a model is unsure about a birthdate (top left), it shouldn’t state it as a fact. It should produce a hedged response (bottom left) that matches its internal uncertainty level.

This blog post will break down how the researchers formalized this problem, the metric they invented to measure it, and the surprising (and somewhat concerning) results they found when testing today’s most powerful models.

Background: The Two Types of Confidence

To understand the paper’s contribution, we first need to distinguish between two concepts that are often conflated:

  1. Intrinsic Uncertainty: This is the mathematical probability the model assigns to its output. If you ask an LLM the same question 20 times with a high “temperature” setting, does it give the same answer every time (high confidence)? Or does it give 20 different answers (high uncertainty)? This is the model’s internal state.
  2. Linguistic Decisiveness: This is how the model presents the information to the user. Does it use words like “definitely,” “is,” and “fact”? Or does it use “maybe,” “possibly,” and “unclear”?

The core hypothesis of this research is that for an LLM to be trustworthy, these two concepts must be faithful to each other. A high-confidence model should speak decisively. A low-confidence model should speak tentatively.

When there is a mismatch—specifically, when intrinsic confidence is low but linguistic decisiveness is high—we get the “confident hallucination” problem.

The Core Method: Measuring Faithfulness

The researchers propose a new framework to quantify this alignment, called Faithful Response Uncertainty.

At a high level, they treat a response \(R\) to a query \(Q\) as a set of assertions. They then calculate a score that penalizes the model if the “vibe” of the answer (decisiveness) doesn’t match the “math” of the answer (confidence).

The Faithfulness Metric

The mathematical definition of faithfulness is elegant in its simplicity. For a response generated by a model \(M\), the faithfulness score is defined as:

\[
\operatorname{faith}_{M}(\mathbf{R}; \mathbf{Q}) \;=\; 1 \;-\; \frac{1}{|\mathbf{R}|} \sum_{A \in \mathbf{R}} \bigl|\operatorname{dec}(A; \mathbf{R}, \mathbf{Q}) - \operatorname{conf}_{M}(A)\bigr|
\]

Let’s break this down:

  • \(\operatorname{dec}(A; \mathbf{R}, \mathbf{Q})\): The Decisiveness score. A value between 0 and 1 representing how certain the text sounds.
  • \(\operatorname{conf}_{M}(A)\): The Intrinsic Confidence score. A value between 0 and 1 representing how certain the model is.
  • The term \(|\operatorname{dec} - \operatorname{conf}|\) calculates the gap. If the model sounds 100% sure (\(1.0\)) but is only 20% consistent (\(0.2\)), the gap is \(0.8\).
  • We subtract this gap from 1. A perfect score (\(1.0\)) means the decisiveness perfectly matches the confidence.
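
To make the formula concrete, here is a minimal Python sketch of how the score could be computed once a response has been split into assertions, each annotated with a decisiveness and a confidence value. The function and variable names are mine, not the paper’s.

```python
def faithfulness(assertions):
    """Faithfulness score in [0, 1] for a single response.

    `assertions` is a list of (decisiveness, confidence) pairs, one per
    assertion in the response, with both values already in [0, 1].
    The score is 1 minus the mean absolute gap between the two: a perfect
    match gives 1.0, a maximal mismatch gives 0.0.
    """
    if not assertions:
        raise ValueError("response contains no assertions")
    gaps = [abs(dec - conf) for dec, conf in assertions]
    return 1.0 - sum(gaps) / len(gaps)


# The model sounds 100% sure but is only 20% self-consistent: a gap of 0.8.
print(faithfulness([(1.0, 0.2)]))  # 0.2
```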

To use this formula, the researchers needed concrete ways to measure both Decisiveness and Confidence.

Step 1: Quantifying Decisiveness

How do you measure how confident a sentence sounds? “I believe it is 1995” is clearly less decisive than “It is 1995,” but how much less?

The researchers chose not to rely on a simple list of keywords. Instead, they used the concept of perceived decisiveness. They defined decisiveness as the probability a third-party agent would assign to the statement being true, judging purely based on the language used.

\[
\operatorname{dec}(A; \mathbf{R}, \mathbf{Q}) \;=\; \Pr_{\text{agent}}\!\bigl[\,A \text{ is true} \mid \mathbf{R}, \mathbf{Q}\,\bigr]
\]

To implement this, they used a “Judge” LLM (Gemini Ultra). They fed the Judge the model’s response and asked it to score the decisiveness on a scale of 0.0 to 1.0.
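
A rough sketch of this judging step appears below. The prompt wording and the `judge` callable are illustrative stand-ins (the paper used Gemini Ultra with its own prompt), but the structure is the same: a separate model rates only how certain the language sounds, not whether the answer is correct.

```python
def score_decisiveness(question: str, answer: str, judge) -> float:
    """Ask a judge LLM how decisive `answer` sounds, ignoring correctness.

    `judge` is any callable that takes a prompt string and returns the
    judge model's text completion (a stand-in for a real LLM API call).
    """
    prompt = (
        "You will see a question and an answer. Judging ONLY by the wording "
        "of the answer (hedges such as 'maybe' or 'I believe'), estimate the "
        "probability the writer assigns to their own statement being true.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single number between 0.0 and 1.0."
    )
    reply = judge(prompt)
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.5  # treat an unparseable judge reply as "about even"
```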

Does an LLM judge align with human perception? Surprisingly, yes. The researchers compared their Judge’s scores against human surveys regarding probability words (e.g., how likely is “highly probable” vs. “unlikely”).

Figure 2: Our mean decisiveness score vs. IQR of human perceptions of probability.

As shown in Figure 2, the automated scores (stars) align very well with human intuition (blue bars). “Almost certain” scores near 1.0, while “About even” scores near 0.5.

Step 2: Quantifying Intrinsic Confidence

Next, they needed to measure how much the model actually “knows” the answer. They used a method called Self-Consistency.

The idea is straightforward: If you ask the model “When was Obama born?” and it really knows the answer, it should consistently say “1961” even if you sample the response multiple times. If it doesn’t really know, it might hallucinate “1962” one time and “1960” the next.

They quantified confidence as the percentage of sampled answers that do not contradict the original answer.

\[
\operatorname{conf}_{M}(A) \;=\; \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\bigl[\,\tilde{A}_{i} \text{ does not contradict } A\,\bigr]
\]

Here, \(k\) is the number of additional sampled answers \(\tilde{A}_{1}, \dots, \tilde{A}_{k}\) (they used 20). If the model generates 20 new answers and none of them contradict the main answer, the confidence score is 1.0. If half of them contradict, the score is 0.5.
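
The loop below sketches this procedure. `generate` and `contradicts` are hypothetical callables standing in for an LLM sampling call and an answer-contradiction check (which can itself be delegated to an LLM judge); the names are mine, not the paper’s.

```python
def intrinsic_confidence(question, answer, generate, contradicts, k=20):
    """Estimate the model's confidence in `answer` via self-consistency.

    `generate(question)` samples one fresh answer (e.g. at temperature > 0);
    `contradicts(a, b)` returns True if two answers disagree. Confidence is
    the fraction of the k resampled answers that do NOT contradict the
    original answer.
    """
    samples = [generate(question) for _ in range(k)]
    consistent = sum(1 for sample in samples if not contradicts(sample, answer))
    return consistent / k
```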

Experiments & Results

With the metric established, the researchers evaluated several state-of-the-art models, including the Gemini family (Nano, Pro, Ultra) and the GPT family (GPT-3.5, GPT-4).

They tested these models on two datasets:

  1. Natural Questions (NQ): Real queries issued by Google users.
  2. PopQA: A challenging dataset focusing on “tail entities”—obscure facts that LLMs often struggle with, making it a perfect test bed for uncertainty.

The Experimental Setup

They didn’t just ask the models to answer normally; they tried to “prompt” them into better behavior. They tested four specific methods:

  1. Vanilla: Standard “Answer the question” prompt.
  2. Granularity: Telling the model to give a broader answer if it is unsure (e.g., “1900s” instead of “1905”).
  3. Uncertainty: Explicitly instructing the model to “convey uncertainty linguistically” (hedge) if it isn’t sure.
  4. Uncertainty+: The same as above, but providing few-shot examples of what good hedging looks like.

Table 2: The specific instructions we use in the baselines we evaluate.
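
Since Table 2 is not reproduced here, the snippet below only sketches how such prompt variants might be wired up for an evaluation run. The instruction strings are loose paraphrases of the descriptions above, not the paper’s exact wording.

```python
# Loose paraphrases of the four prompting strategies; the exact instructions
# (and the few-shot hedging examples for Uncertainty+) are given in the paper.
PROMPTS = {
    "vanilla": "Answer the following question.",
    "granularity": (
        "Answer the following question. If you are unsure of the exact "
        "answer, give a broader answer that you are confident in."
    ),
    "uncertainty": (
        "Answer the following question. If you are not sure of the answer, "
        "convey your uncertainty linguistically (e.g. 'I think', 'possibly')."
    ),
    "uncertainty_plus": (
        "Answer the following question. If you are not sure of the answer, "
        "convey your uncertainty linguistically.\n"
        "Q: When was the town hall built?\n"
        "A: I'm not certain, but I believe it was built around 1905."
    ),
}


def build_prompt(strategy: str, question: str) -> str:
    """Assemble the full prompt for one of the four evaluated strategies."""
    return f"{PROMPTS[strategy]}\nQuestion: {question}\nAnswer:"
```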

Result 1: Models default to “Overconfident”

The first major finding is that without special prompting (the “Vanilla” setting), models are almost incapable of expressing doubt.

Take a look at Figure 3 below. The Green bars represent Decisiveness, and the Orange bars represent Confidence.

Figure 3: Standard decoding yields decisive answers, even under uncertainty.

Notice that for every single model, the Green bar is maxed out at 1.0. The models always sound completely sure of themselves. However, the Orange bars (actual confidence) are significantly lower, especially for the smaller models. This gap represents the “faithfulness failure.” The models are writing checks their internal probabilities can’t cash.

Result 2: Prompting helps the tone, but not the faithfulness

Can we fix this by just telling the model to be humble? The researchers found that prompting (using the “Uncertainty” and “Uncertainty+” methods) did successfully lower the decisiveness. The models started using words like “I think” and “It might be.”

We can see examples of this behavioral shift in Table 3.

Table 3: Random examples from PopQA and NQ of questions for which standard decoding answers decisively, but the uncertainty prompt induces hedged answers.

In the Vanilla column, the model states “The producer… was Carl Bessai.” In the Uncertainty+ column, it shifts to “I’m not certain, but I believe…”

However, there is a catch. Just because the model is saying “I’m not sure” doesn’t mean it is saying it at the right times.

The researchers measured the correlation between the model’s new hedged tone and its actual internal confidence. If the method worked, we would see a strong diagonal line on a graph: low confidence should equal low decisiveness, and high confidence should equal high decisiveness.

Instead, they found this:

Figure 4 & 5: Weak correlation between decisiveness and confidence.

In Figure 4 (the scatter plots), look at how scattered the blue dots are. There is a very weak relationship between how confident the model actually is (x-axis) and how decisive it sounds (y-axis).

  • Gemini Ultra (Left): Even at low confidence (0.2), the model often outputs highly decisive answers (1.0).
  • GPT-4 (Right): It tends to cluster, but not along a clean diagonal.

This implies that while we can force LLMs to use hedging words, they often hedge when they are actually right, or sound confident when they are actually guessing. They are mimicking the style of uncertainty without accessing the substance of it.
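
If you wanted to run the same sanity check on your own model, the analysis boils down to collecting one (confidence, decisiveness) pair per question and computing a rank correlation, as in the toy sketch below (the values are made up for illustration). A faithful model should show a strong positive correlation; the paper’s scatter plots show it is weak.

```python
from scipy.stats import spearmanr

# Per-question scores, e.g. produced by the intrinsic_confidence() and
# score_decisiveness() sketches above (toy values for illustration).
confidences  = [0.2, 0.4, 0.9, 1.0, 0.3, 0.7]
decisiveness = [1.0, 0.9, 1.0, 1.0, 0.8, 1.0]

rho, p_value = spearmanr(confidences, decisiveness)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```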

Result 3: The Faithfulness Scores remain low

Because of this mismatch, the overall “Faithfulness” scores (cMFG) improved only marginally over the baseline.

Table 1: State of the art models struggle at faithfully communicating uncertainty.

In Table 1, recall that the score is 1 minus the average gap between decisiveness and confidence, so a score near 0.5 means the gap is still large. Most models hovered around 0.52–0.54 in the Vanilla setting. Even with the best prompting strategies (“Uncertainty”), they only reached roughly 0.59 to 0.70. While Gemini Ultra showed the most improvement, the general trend indicates that faithful uncertainty is not a capability that emerges naturally just by asking for it.

Discussion: The Complexity of Uncertainty

Why is this so hard? Part of the issue is that “uncertainty” isn’t just one thing. The paper briefly touches upon the difference between Aleatoric (Data) uncertainty and Epistemic (Model) uncertainty.

  • Aleatoric/Data Uncertainty: The question itself is ambiguous (e.g., “When did Harry Potter come out?” could refer to the book or the movie).
  • Epistemic/Model Uncertainty: The question is clear, but the model lacks the knowledge (e.g., “When was the first airline meal served?”).

Table 4: The appropriate way to reflect uncertainty linguistically depends on the source of the uncertainty.

As shown in Table 4, a truly advanced AI needs to distinguish why it is confused to give a helpful answer. The experiments in this paper focused primarily on epistemic uncertainty (the model simply not knowing a fact), and even that proved difficult for current architectures.

Conclusion

The research presented in “Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?” serves as a reality check for the deployment of AI in high-stakes fields.

We often assume that as models get “smarter” (more parameters, more training data), they will naturally become better at knowing what they don’t know. This paper suggests that this assumption is flawed. Current LLMs are trained to be helpful and fluent, which biases them toward decisive answers regardless of their internal probability states.

Key Takeaways:

  1. Overconfidence is Default: Without intervention, LLMs will state incorrect information with 100% decisive language.
  2. Prompting isn’t a Silver Bullet: You can tell a model to “be humble,” and it will use humble words, but it won’t necessarily use them accurately. It might say “I’m not sure” about a fact it knows perfectly, or state a hallucination as absolute truth.
  3. Alignment is Needed: To fix this, we likely need new training or alignment methods (e.g., variants of RLHF) that specifically reward the model not just for being right, but for correctly calibrating its tone to its knowledge level.

Until then, when an AI sounds absolutely certain about a specific fact, remember the “faithfulness gap.” It might just be a very decisive guess.