Can We Trust LLMs When They Say “I’m Fairly Certain”? A Deep Dive into Epistemic Markers
As Large Language Models (LLMs) like GPT-4 and Claude integrate deeper into high-stakes fields—medicine, law, and financial analysis—the question of reliability becomes paramount. It is not enough for a model to give an answer; we need to know how confident it is in that answer.
Traditionally, researchers have looked at numerical confidence scores (like log-probabilities or explicit percentage outputs). But let’s be honest: that is not how humans communicate. If you ask a doctor for a diagnosis, they rarely say, “I am 87.5% confident.” They use epistemic markers—phrases like “I am fairly certain,” “It is likely,” or “I’m not sure.”
A recent paper titled “Revisiting Epistemic Markers in Confidence Estimation” investigates a critical question: Do LLMs use these verbal markers reliably? If a model says “I am certain,” does that actually correspond to a higher probability of correctness than when it says “I think”?
In this post, we will break down this research, exploring the framework the authors developed to quantify these linguistic markers, the rigorous metrics they proposed, and the somewhat concerning results regarding how LLMs handle their own uncertainty across different domains.
The Problem: Words vs. Numbers
When an LLM generates text, it calculates the probability of the next token. This is a “white-box” numerical value. However, humans interact with LLMs via natural language. We expect the model to express uncertainty using words.
The challenge is that “certainty” is subjective. Does “highly likely” mean 80% or 95%? Furthermore, does an LLM maintain a consistent definition of “highly likely” when it switches from solving a math problem to answering a medical question?
The researchers argue that previous studies focused too much on whether LLMs use markers the same way humans do. This paper takes a different approach: it asks whether LLMs are internally consistent. Even if a model’s definition of “maybe” is different from ours, can we at least trust that its “maybe” consistently reflects a specific accuracy level?
The Framework: Defining “Marker Confidence”
To study this, the authors moved away from abstract semantic meanings and grounded their study in empirical data. They introduced the concept of Marker Confidence.
What is Marker Confidence?
Simply put, Marker Confidence is the observed accuracy of the model whenever it uses a specific phrase.
Imagine a model answers 100 questions.
- In 31 of those answers, it uses the phrase “fairly certain.”
- Out of those 31 “fairly certain” answers, 20 are correct and 11 are wrong.
- The Marker Confidence for the phrase “fairly certain” is \(20/31 \approx 64.5\%\).
The authors visualize this framework below:

This approach turns a linguistic quality (a phrase) into a quantifiable metric. It allows us to ignore what the dictionary says “fairly certain” means and focus on what the model actually achieves when it uses that phrase.
The mathematical formalization is straightforward. Let \(W_i\) be a marker, \(D\) be a dataset, and \(M\) be the model. The marker confidence is the average accuracy over the subset of questions \(Q_{W_i}\) whose answers contain that marker:

\[
\mathrm{Conf}_{D,M}(W_i) = \frac{1}{|Q_{W_i}|} \sum_{q \in Q_{W_i}} \mathbb{I}\big(M(q)\ \text{is correct}\big)
\]

Here, \(\mathbb{I}(\cdot)\) is an indicator function that equals 1 if the answer is correct and 0 otherwise.
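To make the definition concrete, here is a minimal Python sketch (my own illustration, not the paper's code) that computes marker confidence from a list of (marker, is_correct) records:

```python
from collections import defaultdict

def marker_confidence(records):
    """Observed accuracy per epistemic marker.

    `records` is a list of (marker, is_correct) pairs, where `marker` is the
    phrase the model used (e.g. "fairly certain") and `is_correct` says
    whether that answer was right.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        correct[marker] += int(is_correct)
    return {m: correct[m] / totals[m] for m in totals}

# Toy data mirroring the "fairly certain" example above: 20 correct, 11 wrong.
records = [("fairly certain", True)] * 20 + [("fairly certain", False)] * 11
print(marker_confidence(records))  # {'fairly certain': 0.645...}
```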
Study Design and Setup
To test this thoroughly, the researchers didn’t just use one model or one topic. They set up a comprehensive environment:
- 7 Models: Ranging from open-source models like Llama-3.1-8B and Mistral-7B to proprietary giants like GPT-4o.
- 7 Datasets: Covering diverse domains, including:
  - Commonsense: BoolQ, StrategyQA, CSQA
  - Math: GSM8K
  - Medicine: MedMCQA
  - Law: CaseHOLD
  - General Knowledge: MMLU
Interestingly, the researchers found that Instruction-Tuned models (models trained to follow instructions) were much better suited for this task than base models. As shown in the figure below, instruction-tuned models generated a much wider variety of epistemic markers, providing a richer dataset for analysis.

Prompting Strategy
To elicit these markers, they used a specific prompt structure. They didn’t just ask for an answer; they explicitly instructed the model to “incorporate only one epistemic marker to reflect your confidence level.”
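For illustration, a prompt in this spirit might look like the following. The wording below is a hypothetical reconstruction, not the paper's verbatim template:

```python
# Hypothetical prompt template in the spirit of the paper's setup;
# the authors' exact wording may differ.
PROMPT_TEMPLATE = (
    "Answer the following question. In your answer, incorporate only one "
    "epistemic marker (e.g., 'I am certain', 'probably', 'I'm not sure') "
    "to reflect your confidence level.\n\n"
    "Question: {question}\n"
    "Answer:"
)

print(PROMPT_TEMPLATE.format(question="Is 97 a prime number?"))
```

The generated answer is then parsed to extract the marker and checked for correctness, yielding the (marker, correct/incorrect) pairs that the rest of the analysis builds on.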

The Metrics: Measuring Consistency
The core contribution of this paper lies in how they evaluated the “goodness” of these markers. It is not enough to just calculate accuracy. We need to know if the markers are calibrated (do they reflect reality?) and consistent (do they work the same way everywhere?).
The authors proposed seven specific metrics, broken down into three categories:
1. Calibration Metrics (The “ECE” Family)
Expected Calibration Error (ECE) measures the gap between confidence and accuracy. If a model has a confidence of 70%, it should be right 70% of the time.
First, they established a baseline using numerical outputs (asking the model to state a confidence number between 0 and 100):

Then, they looked at marker calibration in two contexts:
In-Domain Average ECE (I-AvgECE): This measures how well markers are calibrated when marker confidence is estimated and evaluated on the same dataset type (e.g., confidence estimated on math questions and checked against other math questions).

Cross-Domain Average ECE (C-AvgECE): This is the stress test. It measures how well marker confidence transfers across different datasets. If the model learns “I am sure” means 99% accuracy on math, does “I am sure” still mean 99% accuracy on medical questions?
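The difference between I-AvgECE and C-AvgECE comes down to where the marker confidences are estimated versus where they are evaluated. Below is a rough sketch of such an ECE-style gap (my own simplification; the authors' exact binning and weighting may differ):

```python
from collections import defaultdict

def marker_ece(estimated_conf, eval_records):
    """Weighted |confidence - accuracy| gap, treating each marker as a bin.

    estimated_conf : dict marker -> confidence estimated on a reference dataset
    eval_records   : list of (marker, is_correct) pairs from an evaluation dataset
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for marker, ok in eval_records:
        if marker in estimated_conf:   # only markers we have an estimate for
            totals[marker] += 1
            correct[marker] += int(ok)
    n = sum(totals.values())
    if n == 0:
        return 0.0
    return sum(
        (totals[m] / n) * abs(correct[m] / totals[m] - estimated_conf[m])
        for m in totals
    )
```

Roughly speaking, the in-domain variant feeds `estimated_conf` and `eval_records` from the same dataset, while the cross-domain variant estimates confidences on one dataset and evaluates them on another, averaging over dataset pairs.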

2. Dispersion Metrics (The “CV” Family)
We want a model to use markers distinctively. If “maybe” and “definitely” both map to 60% accuracy, the markers are useless. The Coefficient of Variation (CV) measures how spread out the confidence values are.

In-Domain CV (I-AvgCV): Measures the dispersion within a dataset. A higher value is generally better here, as it means the model is distinguishing between high and low confidence scenarios.

Cross-Domain CV (C-AvgCV): This measures the dispersion of a single marker’s confidence across different datasets. Here, we want a low value. We want the marker “Unlikely” to mean roughly the same probability regardless of whether we are talking about law or physics.
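A toy illustration of the two CV views (the numbers below are invented purely to show the mechanics):

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean of a set of confidence values."""
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else 0.0

# I-AvgCV view: spread of *different markers* within one dataset.
# Higher is better: the markers actually discriminate confidence levels.
within_dataset = [0.95, 0.80, 0.55, 0.40]   # e.g. "certain", "likely", "maybe", "unsure"
print(coefficient_of_variation(within_dataset))

# C-AvgCV view: spread of *one marker* across datasets.
# Lower is better: the marker means roughly the same thing everywhere.
across_datasets = [0.80, 0.52, 0.91, 0.60]  # e.g. "fairly certain" on math, law, medicine, ...
print(coefficient_of_variation(across_datasets))
```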

3. Correlation Metrics
Finally, the authors checked if the rankings make sense.
Marker Rank Correlation (MRC): Based on Spearman correlation. If “Certain” > “Likely” > “Maybe” in one dataset, that order should hold true in another dataset.

Marker Accuracy Correlation (MAC): Based on Pearson correlation. This checks if higher marker confidence actually correlates with higher overall model accuracy.
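Both correlations are standard and easy to reproduce. Here is a sketch with invented numbers showing how each check could be run (the paper's exact aggregation across models and dataset pairs may differ):

```python
from scipy.stats import pearsonr, spearmanr

markers = ["certain", "likely", "maybe", "unsure"]

# Hypothetical marker confidences for the same markers on two datasets.
conf_on_math = [0.95, 0.78, 0.55, 0.40]
conf_on_law  = [0.70, 0.85, 0.50, 0.45]

# MRC-style check: does the *ranking* of markers agree across datasets?
rho, _ = spearmanr(conf_on_math, conf_on_law)
print(f"Spearman rank correlation: {rho:.2f}")

# MAC-style check: do higher marker confidences track higher observed accuracy?
observed_accuracy = [0.90, 0.75, 0.60, 0.50]
r, _ = pearsonr(conf_on_math, observed_accuracy)
print(f"Pearson correlation: {r:.2f}")
```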

Experimental Results: The Reliability Gap
So, how did the models perform? The results paint a picture of models that are capable but brittle.
1. In-Distribution vs. Out-of-Distribution
The most significant finding is the gap between in-domain and cross-domain performance. Look at the summary table below:

Notice the difference between I-AvgECE (In-domain) and C-AvgECE (Cross-domain). For almost every model, the cross-domain error is significantly higher.
- Interpretation: Models are decent at calibrating their language within a specific context. However, they fail to generalize. A marker like “fairly certain” might imply 80% accuracy in a logic puzzle but only 50% accuracy in a legal query. This makes trusting the markers dangerous in real-world, open-ended applications.
2. The Ranking Problem
The MRC (Marker Rank Correlation) scores are alarmingly low across the board (mostly between 10% and 37%).
- Interpretation: This means the hierarchy of confidence words flips between datasets. “Highly likely” might be a stronger indicator than “Probably” in one domain, but weaker in another. This inconsistency makes it nearly impossible for a user to learn the model’s “language of uncertainty.”
3. Visualizing the Instability
The authors provide a heatmap to visualize how confidence levels for specific markers shift across datasets.

In Figure 2, looking at GPT-4o (left) and Qwen2.5 (right), you can see the color intensity (representing confidence/accuracy) changes for the same markers across the x-axis (datasets). If the markers were stable, we would see consistent horizontal bands of color. Instead, we see fluctuations.
Furthermore, the ranking instability is visualized in the scatter plot below. If rankings were consistent, the shapes (representing different markers) would cluster horizontally. Instead, they bounce up and down the y-axis (normalized rank) depending on the dataset.

4. Robustness of Findings
One might argue that these results are noise—perhaps caused by markers that appeared only a few times. To counter this, the authors ran a robustness check, filtering for markers that appeared at least 10, 50, or 100 times.

Even with strict filtering (requiring 100 occurrences of a marker), the conclusions held firm: C-AvgCV remained high (indicating instability) and MRC remained low (indicating poor ranking consistency).
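The filtering step itself is simple; a minimal sketch of how such a minimum-count filter could be applied (illustrative, not the authors' code):

```python
from collections import Counter

def filter_by_min_count(records, min_count=100):
    """Keep only (marker, is_correct) records whose marker occurs >= min_count times."""
    counts = Counter(marker for marker, _ in records)
    return [(m, ok) for m, ok in records if counts[m] >= min_count]
```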
Why Does This Happen?
The authors suggest that models learn a “preference” for using certain markers based on the distribution of the data they are processing. When the distribution changes (e.g., shifting from easy math problems to complex multi-hop reasoning in StrategyQA), the model’s internal calibration shifts, but its vocabulary selection doesn’t adapt perfectly to that shift.
They found that multi-domain datasets (like MMLU) caused higher calibration errors than single-domain datasets (like GSM8K). The complexity and diversity of the data make it harder for the model to map its internal state to a consistent verbal marker.
Conclusion and Implications
This research serves as a reality check for the deployment of LLMs in critical systems. While it is tempting to parse an LLM’s response for phrases like “I am sure” to automate decision-making, this paper shows that epistemic markers are currently unreliable across different contexts.
Key Takeaways for Students and Practitioners:
- Context Matters: You cannot assume a confidence marker means the same thing in Medicine as it does in Math.
- Bigger is Better (Mostly): Larger models like GPT-4o and Qwen2.5-32B generally showed better stability (lower C-AvgCV) than smaller models, but they are not perfect.
- No Standard “Uncertainty Language”: Models do not currently possess a universal, stable mapping between their internal probability and human language.
The paper concludes that while LLMs are getting smarter, their ability to communicate their own limitations is still flawed. Achieving trustworthy AI will require new alignment techniques that specifically target the consistency of these epistemic markers, ensuring that when an AI says “I know,” it really does.