Large Language Models (LLMs) like GPT-4 and LLaMA-2 have revolutionized how we interact with information. They are incredibly helpful, harmless, and creative. However, they have a notorious flaw: they don’t know when to shut up.
Or, more accurately, they don’t know how to admit they are unsure. You have likely experienced an LLM hallucinating a fact with the same supreme confidence it uses to state that \(2+2=4\). This is a failure of calibration.
In a perfectly calibrated system, if a model says it is 80% confident in an answer, it should be correct 80% of the time. Unfortunately, modern LLMs fine-tuned with Reinforcement Learning from Human Feedback (RLHF) tend to be “people pleasers”—they exhibit overconfidence because they are rewarded for generating helpful-sounding responses, not necessarily for being statistically cautious.
In this post, we will dive deep into a research paper titled “Calibrating the Confidence of Large Language Models by Eliciting Fidelity”. The researchers propose a clever, plug-and-play method called UF Calibration that forces a model to reveal its true confidence by testing two specific traits: its Uncertainty and its Fidelity.

As shown in Figure 1, this method brings the model’s confidence much closer to reality (the diagonal dotted line), significantly outperforming standard approaches. Let’s explore how it works.
The Calibration Crisis in RLHF Models
Before we fix the problem, we must understand it. Pre-trained base models (before the chat-bot fine-tuning) generally had decent calibration. If you looked at the raw probability scores (logits) of the next token, they roughly correlated with correctness.
However, the alignment process (RLHF) distorts this. By optimizing for human preference, models push probability distributions toward extremes. They become “overconfident.”
Existing Attempts to Fix Calibration
Researchers have tried several ways to get honest confidence scores out of LLMs:
Logit-Based Methods: This involves looking at the raw probability scores of the output tokens.
\[ \mathrm{Conf}(a_i) = \frac{\exp(\mathrm{logit}_{a_i} / t)}{\sum_{j=1}^{|\mathcal{A}|} \exp(\mathrm{logit}_{a_j} / t)}, \]
While accurate, this requires “white-box” access to the model internals, which isn’t always available for commercial APIs (like GPT-4).
Sampling Methods: For “black-box” models, you can ask the same question multiple times and see how often each answer repeats (a minimal sketch appears at the end of this list).
\[ \mathrm{Conf}(a_i) = \mathcal{P}_{\mathrm{sampled}}(a_i) = \frac{n_i}{K}, \quad a_i \in \mathcal{A} \]
Verbalized Confidence: Simply prompting the model: “Are you sure? Give me a number between 0 and 1.”
\[ (\mathrm{Answer}, \mathrm{Conf}) = \mathrm{LLM}(\mathrm{Question}), \]
The issue here is that LLMs often just say “0.9” or “High Confidence” regardless of the actual difficulty, because that’s what they think a helpful assistant sounds like.
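To make the sampling estimate concrete, here is a minimal sketch of the \(n_i / K\) confidence above; `ask_model` is a hypothetical callable standing in for whatever black-box API you query (the stub below is only for demonstration):

```python
from collections import Counter
import random

def sampled_confidence(ask_model, question, k=10):
    """Estimate Conf(a_i) = n_i / K by asking the same question K times.

    ask_model: hypothetical callable that sends `question` to a black-box LLM
               and returns a single option label such as "A".
    Returns a dict mapping each sampled answer to its empirical frequency.
    """
    counts = Counter(ask_model(question) for _ in range(k))
    return {answer: n / k for answer, n in counts.items()}

# Stub standing in for a real API call, biased toward "A".
stub_model = lambda question: random.choice(["A", "A", "A", "B", "C"])
print(sampled_confidence(stub_model, "Which planet is largest?", k=10))
```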
The Solution: UF Calibration
The authors of this paper argue that confidence isn’t a single metric. Instead, they decompose it into two distinct dimensions:
- Uncertainty (U): How confused is the model about the question?
- Fidelity (F): How committed is the model to the answer it chose?
Combining these gives us the UF Calibration score.
Step 1: Estimating Uncertainty
The first step is straightforward. Given a multiple-choice question, we sample the model’s output \(K\) times (e.g., 10 times).
If the model answers “A” every single time, the uncertainty is low. If it answers “A” 4 times, “B” 3 times, and “C” 3 times, the distribution is flat, and uncertainty is high.
To quantify this, the researchers use Information Entropy. They look at the frequency distribution of the sampled answers and calculate the normalized entropy:
\[ \mathbf{Uncertainty}(\mathcal{Q}) = -\frac{\sum_{i=1}^{M} p_i \cdot \log p_i}{\log M}, \]
This gives a value between 0 and 1 representing how “scattered” the model’s initial guesses are.
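A minimal sketch of this computation, assuming \(M\) is taken as the number of answer options (the helper name is ours):

```python
import math

def normalized_entropy(sample_probs, num_options):
    """Uncertainty(Q) = -sum_i p_i * log(p_i) / log(M).

    sample_probs: empirical frequencies p_i of the sampled answers
                  (e.g. the output of a sampling step), summing to 1.
    num_options:  M, here taken as the number of answer options.
    """
    entropy = -sum(p * math.log(p) for p in sample_probs if p > 0)
    return entropy / math.log(num_options)

# "A" 4 times, "B" 3 times, "C" 3 times out of 10 samples -> high uncertainty.
print(normalized_entropy([0.4, 0.3, 0.3], num_options=4))  # ~0.79
# "A" every single time -> zero uncertainty.
print(normalized_entropy([1.0], num_options=4))            # 0.0
```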
Step 2: Eliciting Fidelity (The Core Innovation)
This is the most innovative part of the paper. Just because a model picks answer “A” frequently doesn’t mean it truly understands why “A” is correct. It might just be guessing “A” because of a bias in the training data.
To test the model’s loyalty (Fidelity) to its answer, the researchers use a counterfactual trap.
Imagine the model chooses Option A for a question. The researchers then modify the prompt by replacing the text of Option A with the phrase: “All other options are wrong.”
They then ask the model to answer again.
- High Fidelity: If the model really believed in A, it should recognize that “All other options are wrong” is now the correct logical choice.
- Low Fidelity: If the model was just guessing or matching keywords, it might now drift to Option B or C because the original text of A is gone.

As shown in Figure 2, if the model switches from Option D to Option A just because the text changed, its fidelity to D is low.
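Concretely, here is a minimal sketch of the substitution step. The prompt template and helper names are ours, not an API prescribed by the paper:

```python
def substitute_option(options, chosen):
    """Replace the text of the model's chosen option with the counterfactual phrase.

    options: dict mapping option label -> option text, e.g. {"A": "Paris", ...}
    chosen:  the label the model picked in the previous round, e.g. "A".
    """
    new_options = dict(options)
    new_options[chosen] = "All other options are wrong."
    return new_options

def render_prompt(stem, options):
    """Render a simple multiple-choice prompt (illustrative template only)."""
    lines = [stem] + [f"{label}. {text}" for label, text in sorted(options.items())]
    return "\n".join(lines) + "\nAnswer with a single option letter."

options = {"A": "Paris", "B": "Lyon", "C": "Marseille", "D": "Nice"}
print(render_prompt("What is the capital of France?", substitute_option(options, "A")))
```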
The Fidelity Chain
Since there might be multiple plausible answers, the researchers perform this substitution iteratively to create a Fidelity Chain. They perform greedy decoding (taking the top choice) and generate a hierarchy of preferences.
For example, if the model prefers A, but if A is removed/replaced it prefers C, and if C is removed it prefers D, the chain is \(A \rightarrow C \rightarrow D\).
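A sketch of how such a chain might be elicited programmatically. The stopping rule (stop when the model re-picks an already-replaced option, or when every option has been replaced) is our assumption and the paper's exact procedure may differ; `ask_model` takes the current option dictionary (rendered into a prompt as in the previous sketch) and returns a single label:

```python
def build_fidelity_chain(ask_model, options, max_rounds=None):
    """Iteratively elicit a fidelity chain such as ["A", "C", "D"].

    Each round, the model's current choice has its text replaced with
    "All other options are wrong." and the question is asked again; options
    the model drifts to are appended to the chain.
    """
    chain, current = [], dict(options)
    for _ in range(max_rounds or len(options)):
        choice = ask_model(current)
        if choice in chain or choice not in current:
            break  # the model re-picked a replaced option, or answered off-menu
        chain.append(choice)
        current[choice] = "All other options are wrong."
    return chain
```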
The fidelity score is calculated by assigning weights to this chain. Answers that appear earlier in the chain (meaning the model defends them strongly) get higher scores.
\[ \mathbf{Fidelity}_{\mathcal{C}}(a_i) = \frac{\tau^{i}}{\sum_{j=1}^{|\mathcal{C}|} \tau^{j}}, \]
Here, \(\tau\) is a hyperparameter (usually set to 2). This formula ensures that the top choice in the chain gets the majority of the fidelity mass.
Because we sampled multiple times in Step 1, we might have different starting points. We calculate the overall fidelity by averaging the chain results weighted by how often those chains appeared:
\[ \mathbf{F}(a_i) = \sum_{j=1}^{|\mathcal{A}|} \mathcal{P}_{\mathrm{sampled}}(\mathcal{C}_j) \cdot \mathbf{Fidelity}_{\mathcal{C}_j}(a_i), \]
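A minimal sketch of both formulas, with \(\tau = 2\) as in the text. The indexing below is chosen so that the first answer in the chain receives the largest share, matching the described behavior; the paper's exact index convention may differ:

```python
def chain_fidelity(chain, tau=2.0):
    """Distribute fidelity mass over a chain such as ["A", "C", "D"].

    Earlier positions (answers the model defends most strongly) receive the
    larger geometric weights; weights are normalized to sum to 1.
    """
    n = len(chain)
    raw = [tau ** (n - i) for i in range(n)]  # first element gets tau**n, and so on
    total = sum(raw)
    return {answer: weight / total for answer, weight in zip(chain, raw)}

def aggregate_fidelity(chains_with_probs):
    """F(a_i) = sum_j P_sampled(C_j) * Fidelity_{C_j}(a_i).

    chains_with_probs: list of (chain, probability) pairs, where the probability
    is how often that chain's starting answer appeared during sampling.
    """
    fidelity = {}
    for chain, prob in chains_with_probs:
        for answer, weight in chain_fidelity(chain).items():
            fidelity[answer] = fidelity.get(answer, 0.0) + prob * weight
    return fidelity

# Chain A -> C -> D appeared in 70% of samples, chain C -> A in 30%.
print(aggregate_fidelity([(["A", "C", "D"], 0.7), (["C", "A"], 0.3)]))
```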
Step 3: Calculating Final Confidence
Finally, we combine the two metrics. The confidence in answer \(a_i\) is its Fidelity scaled down by the overall Uncertainty of the question.
\[ \mathbf{Conf}(\mathcal{Q}, a_i) = \left(1 - \mathbf{Uncertainty}(\mathcal{Q})\right) \cdot \mathbf{F}(a_i), \]
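As a small self-contained sketch of this final step (the fidelity values below are the output of the previous sketch):

```python
import math

def uf_confidence(sample_probs, fidelity, num_options):
    """Conf(Q, a_i) = (1 - Uncertainty(Q)) * F(a_i).

    sample_probs: answer -> empirical frequency from the K samples (Step 1).
    fidelity:     answer -> aggregated fidelity score F(a_i) (Step 2).
    num_options:  M, the number of answer options.
    """
    entropy = -sum(p * math.log(p) for p in sample_probs.values() if p > 0)
    uncertainty = entropy / math.log(num_options)
    return {answer: (1.0 - uncertainty) * f for answer, f in fidelity.items()}

# 7/10 samples said "A", 3/10 said "C"; fidelity taken from the previous sketch.
print(uf_confidence(
    sample_probs={"A": 0.7, "C": 0.3},
    fidelity={"A": 0.5, "C": 0.4, "D": 0.1},
    num_options=4,
))
```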

Figure 3 summarizes the whole pipeline:
- Sample to get the distribution (Uncertainty).
- Elicit Fidelity by replacing answers with “All other options are wrong” to see if the model sticks to its guns.
- Combine them for a final score.
Experiments and Results
The researchers tested UF Calibration against standard baselines (Verbalized, Sampling, etc.) using models like GPT-3.5-Turbo, GPT-4, LLaMA-2, and Baichuan2. They used challenging multiple-choice datasets like MMLU, ARC-Challenge, and TruthfulQA.
The Metrics
To evaluate success, they used three metrics:
- ECE (Expected Calibration Error): The standard metric. The lower, the better.
- IPR (Inverse Pair Ratio): A new metric proposed by the authors. It measures “monotonicity.” Ideally, higher confidence answers should always have higher accuracy than lower confidence ones. If a 60% confident answer is right more often than a 90% confident answer, that’s an “Inverse Pair” (bad); a minimal sketch of both new metrics appears after the CE equation below.
\[ \mathrm{IPR}_{M} = \frac{\mathrm{IP}}{C_{K}^{2}}, \]

- CE (Confidence Evenness): Another new metric. It checks if the model uses the full range of confidence scores (0-100%) or if it just clusters everything around 90%.
\[ \mathrm{CE}_{M} = -\frac{\sum_{i=1}^{M} p_i \cdot \log p_i}{\log M}, \]
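Here is a minimal sketch of the two new metrics under one plausible reading of the definitions above (the bin representation and helper names are our assumptions):

```python
import math
from itertools import combinations

def inverse_pair_ratio(bins):
    """IPR = IP / C(K, 2): the fraction of bin pairs whose ordering is inverted.

    bins: list of (mean_confidence, accuracy) tuples, one per confidence bin.
    A pair counts as an inverse pair when the bin with higher confidence has
    lower accuracy.
    """
    pairs = list(combinations(bins, 2))
    inversions = sum(
        1 for (c1, a1), (c2, a2) in pairs
        if (c1 - c2) * (a1 - a2) < 0  # confidence and accuracy disagree in order
    )
    return inversions / len(pairs)

def confidence_evenness(bin_counts):
    """CE = normalized entropy of how predictions spread over M confidence bins.

    Values near 1 mean the full range of confidence scores is used; values near
    0 mean everything clusters in a single bin (e.g. "always 90% sure").
    """
    total = sum(bin_counts)
    probs = [count / total for count in bin_counts if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(bin_counts))

# Three bins: the 0.9-confidence bin is less accurate than the 0.6 bin,
# which creates one inverse pair out of three possible pairs.
print(inverse_pair_ratio([(0.3, 0.35), (0.6, 0.70), (0.9, 0.55)]))  # ~0.33
# Most predictions land in the last confidence bin -> low evenness.
print(confidence_evenness([1, 2, 3, 40]))
```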

Performance Comparison
The results were compelling. In the tables below, blue indicates better performance (lower error). UF Calibration (labeled “Ours”) consistently beats Verbalized (“Verb”, “Ling”) and Sampling methods.
GPT-3.5 and GPT-4 Results:

Baichuan2-13B Results:

In Table 1, notice how “Ours” achieves significantly lower ECE scores (0.088 on MMLU for GPT-3.5) compared to Verbalized methods (0.215 on TruthfulQA).
Visualizing the Improvement
Numbers are great, but reliability diagrams tell the real story. These charts plot Confidence (x-axis) vs. Accuracy (y-axis). A perfectly calibrated model follows the diagonal dotted line (\(y=x\)).

In Figure 6, look at the plot for “Verb” (Verbalized). The bars are clustered at the high end (0.8-1.0), but the accuracy is low (the bars are below the diagonal). This is classic overconfidence.
Now look at “Ours” (UF Calibration). The bars are distributed more evenly across the x-axis, and their height matches the diagonal line much better. This means when UF Calibration says “I’m 60% sure,” the model is actually right about 60% of the time.
Robustness Tests
One might worry that this method is fragile. Does it break if we change the temperature (randomness) of the model?

Figure 7 shows that UF Calibration (the teal line) remains stable and keeps the lowest ECE error rate regardless of temperature changes, whereas sampling methods (orange line) fluctuate wildly.
Furthermore, does it work across different model sizes?

Figure 5 confirms that whether you use a 7B parameter model or a 70B parameter model, UF Calibration consistently provides the lowest calibration error.
Discussion: What is “True” Calibration?
One of the most interesting discussions in the paper is about the nature of confidence in LLMs.
The authors note that Verbalized methods (“Just ask the model”) fail partially because models have a “favorite” confidence level. They love to say they are 90% sure, regardless of the question.

Table 7 shows the raw counts of confidence scores output by LLaMA-2. Notice the massive spikes at 0.8 and 0.9. The model is essentially guessing a number that sounds good.
“True” calibration requires Confidence Evenness (CE). A model should be able to say “I have absolutely no idea (10%)” just as often as “I am certain (99%).” UF Calibration forces this spread by mathematically deriving the score from behavior (Fidelity) rather than asking for a self-assessment.
Conclusion
The paper “Calibrating the Confidence of Large Language Models by Eliciting Fidelity” introduces a significant step forward in making AI systems more honest. By acknowledging that confidence is a mix of initial uncertainty and the robustness (fidelity) of the answer, the UF Calibration method provides a more accurate probability score for model outputs.
Key takeaways:
- Don’t trust the model’s words: Verbalized confidence (“I am 95% sure”) is often a hallucination of helpfulness.
- Test for loyalty: By trying to trick the model with “All other options are wrong,” we can measure how strongly it actually holds its beliefs.
- Plug-and-Play: This method works on black-box models without needing access to internal weights.
For students and practitioners, this implies that building reliable AI applications isn’t just about prompt engineering for better answers—it’s about engineering better ways to measure when those answers might be wrong.