We use vague words every day. When you tell a friend, “It is likely to rain tomorrow,” or “It is doubtful I’ll make the party,” you aren’t performing a precise mathematical calculation. You are expressing a fuzzy degree of belief. Remarkably, despite the lack of precision, humans are generally on the same page. We instinctively know that “likely” represents a higher probability than “possible,” but a lower probability than “almost certain.”

But what happens when we ask Large Language Models (LLMs) to interpret these words?

As LLMs like GPT-4 and Llama 3 become integrated into decision-making pipelines—summarizing medical reports, analyzing legal briefs, or aggregating news—their ability to correctly interpret the uncertainty of others is critical. If a doctor writes that a side effect is “possible,” the AI must understand what “possible” implies numerically to accurately convey the risk.

A recent paper from the University of California, Irvine, titled “Perceptions of Linguistic Uncertainty by Language Models and Humans,” investigates this exact capability. The researchers uncovered a fascinating and somewhat concerning paradox: while LLMs are excellent at mapping uncertainty words to numbers in a vacuum, their ability to interpret these words collapses when they have “prior knowledge” about the topic. In effect, LLMs struggle to separate what they believe from what the speaker believes—a failure of “Theory of Mind.”

In this post, we will break down how researchers measured this, the stark differences between human and AI perceptions, and why this matters for the future of AI communication.

The Problem: My Belief vs. Your Belief

To understand the core issue, we must first define a concept from cognitive science called Theory of Mind. This is the ability to attribute mental states—beliefs, intents, desires—to oneself and others, and to understand that others have beliefs that are different from one’s own.

If I tell you, “John believes the Earth is flat,” you understand that John is wrong. However, if I ask you to describe John’s certainty, you should be able to objectively assess how confident he is, regardless of the fact that you know the Earth is round.

LLMs, however, seem to struggle with this separation.

Consider the example below. Researchers prompted ChatGPT to write a headline based on a short text. In one version, scientists claim it is “probable” that human activity drives climate change. In the second, scientists claim it is “probable” that early vaccination causes autism.

Figure 1: Two interactions with ChatGPT showing different headlines for the word “probable” based on the context.

As shown in Figure 1, despite the input text being structurally identical and using the exact same uncertainty word (“probable”), ChatGPT treats them differently. For climate change (a fact the model accepts), the headline uses strong words like “Conclude.” For the vaccine-autism link (a fact the model rejects based on scientific consensus), the headline is much weaker, using “Possible Link” and “Suggests.”

The model is allowing its own internal knowledge to color its interpretation of the speaker’s uncertainty. It is conflating the speaker’s “probable” with its own “improbable.”

Establishing a Human Baseline

Before judging the models, the researchers had to establish a “ground truth.” How do humans interpret these fuzzy words?

The team conducted a baseline study with 94 human participants, presenting them with Non-Verifiable (NV) statements: statements about events the participant has no prior knowledge or context to judge as true or false, forcing them to rely entirely on the uncertainty word.

For example:

“Catherine believes it is somewhat likely that the defendant will be found guilty.”

The participant doesn’t know the defendant or the case. They must judge the probability purely on the phrase “somewhat likely.”

Example of a non-verifiable statement provided to human participants.

The researchers tested 14 different expressions ranging from “almost certain” to “highly unlikely.” The participants were asked to assign a numerical probability (0% to 100%) to what the speaker believed.

The results, visualized in Figure 3, show a beautiful consistency in human language.

Figure 3: Heatmap of human empirical distributions of numerical responses.

While there is some spread—people don’t agree on a single exact number—there is a clear, monotonic progression. “Unlikely” clusters low on the scale, “Uncertain” hovers around 50%, and “Almost Certain” pushes toward 100%. This “heatmap” of human perception serves as the reference distribution against which the AI models were tested.
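
To make that aggregation concrete, here is a minimal sketch of how per-expression responses could be turned into the kind of empirical distributions plotted in the heatmap. The column names, bin width, and toy data are illustrative assumptions, not the study’s actual pipeline.

```python
# A minimal sketch: turn raw human judgments into per-expression empirical
# distributions (the data behind a heatmap like Figure 3).
import pandas as pd

# Hypothetical raw data: one row per (participant, expression) judgment,
# with responses on a 0-100 probability scale.
responses = pd.DataFrame({
    "expression":  ["likely", "likely", "doubtful", "doubtful", "uncertain", "uncertain"],
    "probability": [75, 80, 15, 20, 50, 55],
})

# Bin responses into 5-point buckets, then normalize within each expression
# so every row becomes an empirical distribution over the probability scale.
responses["bin"] = pd.cut(responses["probability"], bins=range(0, 106, 5), right=False)
heatmap = pd.crosstab(responses["expression"], responses["bin"], normalize="index")
print(heatmap.round(2))
```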

Methodology: Testing the Machines

The researchers evaluated 10 popular LLMs, including proprietary models like GPT-4 and Gemini, and open-source models like Llama 3 and Mixtral.

The goal was to see if these models could perform the same task as the humans: mapping a sentence containing an uncertainty expression to a specific probability number (0-100).

The experiment was divided into two distinct settings to test the “Theory of Mind” hypothesis:

  1. Non-Verifiable (NV) Setting: Just like the human baseline. The statements concerned unknown people and random events (e.g., “Mary thinks her boss owns a cat”). The model has no prior knowledge of these events.
  2. Verifiable (V) Setting: The statements were based on general knowledge facts (trivia and science).
  • True Statements: e.g., “The Mona Lisa is a famous painting by Leonardo da Vinci.”
  • False Statements: e.g., “The Mona Lisa is a famous painting by Tintoretto.”

In the Verifiable setting, the prompt might look like this:

“John believes it is likely that [True/False Statement].”

If the model has a robust Theory of Mind, it should assign the same numerical value to “likely” regardless of whether the statement itself is true or false. It should be measuring John’s belief, not the fact’s validity.
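
To make the setup concrete, here is a rough sketch of how prompts for the two settings could be built and the numeric answer parsed. The prompt wording is paraphrased from the examples above, the helper names are my own, and the model call itself is omitted; none of this is the paper’s exact protocol.

```python
# Sketch of prompt construction for the Non-Verifiable (NV) and Verifiable (V)
# settings, plus a simple parser for the model's numeric reply.
import re

def build_prompt(speaker: str, expression: str, statement: str) -> str:
    """Ask for the probability the *speaker* assigns, not the model's own belief."""
    return (
        f"{speaker} believes it is {expression} that {statement}\n"
        f"On a scale from 0 to 100, what probability do you think {speaker} "
        f"assigns to this event? Respond with a single number."
    )

def parse_probability(reply: str) -> float | None:
    """Extract the first 0-100 integer from a free-text model reply."""
    match = re.search(r"\b(\d{1,3})\b", reply)
    if match and 0 <= int(match.group(1)) <= 100:
        return float(match.group(1))
    return None

# Non-Verifiable: the model has no basis for judging the event itself.
nv_prompt = build_prompt("Mary", "likely", "her boss owns a cat.")

# Verifiable: the same expression attached to a fact the model can check
# against its own knowledge, in a true and a false variant.
v_true = build_prompt("John", "likely",
                      "the Mona Lisa is a famous painting by Leonardo da Vinci.")
v_false = build_prompt("John", "likely",
                       "the Mona Lisa is a famous painting by Tintoretto.")
```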

Result 1: LLMs Are Fluent in Uncertainty

The first major finding is positive. When tested on Non-Verifiable statements—where the model’s own knowledge isn’t triggered—modern LLMs are remarkably human-like.

The researchers used a metric called Proportional Agreement (PA) to measure how well the model’s responses aligned with the human population’s preferred values.
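
The paper’s exact PA formula isn’t reproduced in this post, but a simple illustrative variant, the fraction of human responses that fall in the same bin as the model’s answer, averaged across expressions, can be sketched as follows. Treat it as an assumption about the flavor of the metric, not its precise definition.

```python
# A hedged sketch of a proportional-agreement-style score (assumed variant).
def proportional_agreement(model_answers: dict[str, float],
                           human_answers: dict[str, list[float]],
                           bin_width: int = 5) -> float:
    """model_answers: expression -> model's numeric response (0-100).
    human_answers: expression -> list of human responses (0-100)."""
    scores = []
    for expression, model_value in model_answers.items():
        humans = human_answers.get(expression, [])
        if not humans:
            continue
        model_bin = int(model_value // bin_width)
        agreeing = sum(1 for h in humans if int(h // bin_width) == model_bin)
        scores.append(agreeing / len(humans))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: the model answers 75 for "likely"; two of three humans land in the same bin.
print(proportional_agreement({"likely": 75.0}, {"likely": [75, 78, 90]}))  # ~0.67
```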

Figure 4: Comparison of GPT-4o and OLMo model distributions against human data.

As seen in Figure 4, GPT-4o produces a distribution that closely mirrors the human heatmap. It understands that “highly likely” implies a very high probability and “doubtful” implies a low one.

However, not all models are created equal. The OLMo (7B) model (on the right of Figure 4) struggles significantly, failing to distinguish effectively between high and low certainty expressions. In general, though, the larger state-of-the-art models (GPT-4, Llama 3 70B, Gemini) displayed “human-like” calibration in this neutral setting.

In fact, statistically, the top models were often more consistent with the aggregate human population than individual humans were with each other.

Result 2: The Knowledge Bias

The story changes drastically when we introduce Verifiable statements. This is where the models’ “Theory of Mind” is put to the test.

The researchers analyzed how the models interpreted the same uncertainty expressions when applied to statements the model knew were True versus statements the model knew were False.

Remember, the prompt asks about the speaker’s belief. If a speaker says, “I believe it is possible that X,” the probability assigned to “possible” should theoretically be similar whether X is true or false.

The results, however, show a massive bias.

Figure 5: Bar chart showing the difference in mean numerical response for true vs. false statements.

Figure 5 illustrates this phenomenon clearly.

  • The “Humans” bars (far left): There is a very small difference between the blue bar (True statements) and the orange bar (False statements). Humans generally rate the speaker’s certainty consistently, regardless of the fact’s truth.
  • The LLM bars: Look at GPT-4o, ChatGPT, and Llama 3. There is a massive gap.

When the statement is True, the models assign a significantly higher probability to the uncertainty word. When the statement is False, they assign a much lower probability.

For example, if the model sees:

  1. “Bob believes it is possible that [True Fact].” -> Model rates this as ~70%.
  2. “Bob believes it is possible that [False Fact].” -> Model rates this as ~20%.

The word “possible” is being redefined by the model based on its own knowledge of the world. The model cannot separate the speaker’s uncertainty from its own conviction.
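
Measuring this bias is straightforward once the numeric responses are collected: group by expression and by whether the underlying statement is true, then compare the means. The sketch below uses a toy table with made-up numbers echoing the example above, not the paper’s data.

```python
# A minimal sketch of the "knowledge bias" measurement: per model and expression,
# compare the mean response on true statements with the mean on false statements.
import pandas as pd

results = pd.DataFrame({
    "model":      ["gpt-4o"] * 4,
    "expression": ["possible", "possible", "uncertain", "uncertain"],
    "is_true":    [True, False, True, False],
    "response":   [70.0, 20.0, 50.0, 48.0],  # toy numbers, for illustration only
})

means = results.pivot_table(index=["model", "expression"],
                            columns="is_true", values="response", aggfunc="mean")
means["gap"] = means[True] - means[False]  # a large positive gap signals knowledge bias
print(means)
```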

The “Possible” Problem

This bias isn’t spread evenly across all words. It is most severe with “middle-ground” words that allow for interpretation.

Figure 6: Breakdown of knowledge bias by specific uncertainty expressions.

Figure 6 breaks this down by specific words:

  • (a) “Possible”: This word shows an enormous sensitivity to truth. For ChatGPT, the difference is staggering—it interprets “possible” as highly probable if the fact is true, and highly improbable if the fact is false.
  • (b) “Uncertain”: Interestingly, this word is more resistant to bias. Models tend to rate “uncertain” consistently regardless of the statement’s truth.

This suggests that LLMs treat words like “possible,” “likely,” or “probable” as mechanisms to convey their own assessment of truth, rather than objective descriptions of a speaker’s state of mind.

Is This Just a Fluke?

To ensure these results weren’t an artifact of the particular trivia questions they had selected, the researchers ran a generalization study using a different dataset: the AI2-ARC dataset, which consists of grade-school science questions.

They observed the exact same pattern.

Figure 15: Generalization results on the AI2-ARC dataset showing the persistent knowledge bias.

As shown in Figure 15, the “knowledge gap” persists across different domains. Whether asking about history, geography, or science, if the model knows the statement is false, it systematically downgrades the intensity of the uncertainty expression.

Why This Matters

This paper highlights a subtle but dangerous limitation in current Generative AI. We often treat LLMs as neutral processors of text, but they are not. They are “knowledge-contaminated.”

When an LLM processes text, it does not simply analyze the syntax and semantics; it evaluates the content against its pre-training data.

Implications for Human-AI Interaction

  1. Summarization Bias: If you use an AI to summarize a news article or a scientific paper that presents a controversial or “minority” view (one that contradicts the model’s training data), the AI might subtly alter the tone. It might downgrade the author’s “probables” to “possibles,” or their “likelies” to “uncertains,” effectively misrepresenting the author’s confidence to align with the model’s worldview.
  2. Theory of Mind Failure: As we build agents intended to simulate human behavior or act as intermediaries in negotiation, this lack of Theory of Mind is a hurdle. An AI lawyer predicting a judge’s ruling, or an AI medical assistant interpreting a patient’s vague description of symptoms, needs to understand the uncertainty of the human, not substitute it with its own medical or legal database.
  3. Communication Breakdown: As shown in the climate change vs. vaccine example in the introduction, this bias changes the output the model generates. It leads to inconsistent communication standards where the same linguistic markers (like the word “probable”) result in vastly different downstream text generation.

Conclusion

The researchers from UC Irvine have provided a compelling look into the “black box” of LLM psychology. The good news is that LLMs have learned the human “dictionary” of uncertainty; they know that “likely” > “possible” > “unlikely.”

The bad news is that they struggle to use that dictionary objectively. Their perception is “poisoned” by their own priors. Unlike humans, who can entertain a hypothetical belief they know to be false, LLMs seem compelled to drag every statement back to their own ground truth.

As the paper concludes, this sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge than humans are. For students and developers working with these models, the lesson is clear: when an LLM tells you how “certain” something is, it’s often telling you about itself, not the text it’s reading.