Imagine you are told that “Tweety is a bird.” Based on your general knowledge, you logically infer that “Tweety flies.” But a moment later, you receive a new piece of information: “Tweety is a penguin.”
What happens in your brain? You immediately revise your belief. You retract the conclusion that Tweety flies, but you maintain the premise that he is a bird. You have just performed belief revision—the cognitive ability to update your understanding when new evidence contradicts or contextualizes what you previously thought was true.
This ability is fundamental to human intelligence because the real world is rarely static. Information evolves, context shifts, and exceptions to rules appear constantly. But do our current state-of-the-art Artificial Intelligence systems possess this same adaptability?
In the paper “Belief Revision: The Adaptability of Large Language Models Reasoning,” researchers from the Hong Kong University of Science and Technology investigate this precise question. They propose a new framework to test whether Large Language Models (LLMs) can rationally update their beliefs or whether they are “stubborn” static reasoners.

As illustrated in Figure 1, humans handle evolving constraints naturally, but when the robot (standing in for an LLM) is presented with a condition (“Ben only eats at home if he cooks”) that complicates a previous rule, it struggles to determine the outcome. This blog post dives deep into how the researchers quantified this struggle, the new dataset they built, and the surprising limitations they discovered in modern AI.
The Problem with Static Benchmarks
To understand why this research is necessary, we first need to look at how we currently test AI reasoning.
Most logical reasoning benchmarks for LLMs operate in a “closed world.” The model is given a set of facts (premises) and asked to derive a conclusion. The assumption is that the information provided is complete and consistent.
However, real-world NLP applications—like chatbots, legal assistants, or medical diagnostic tools—operate in open, evolving environments. An AI might read a document stating a policy, draw a conclusion, and then read a second document that adds an exception to that policy.
If we only test models on static snapshots of data, we miss a critical component of intelligence: Non-monotonic reasoning. This is a fancy logic term meaning that adding new premises can invalidate previous conclusions.
The researchers compared their new dataset, Belief-R, against existing popular benchmarks:

As shown in Table 1, while other datasets handle incomplete or contradictory info, they don’t explicitly test Belief Revision—the specific act of deciding whether to keep or discard a prior conclusion based on the significance of new information.
The Solution: The Delta Reasoning (\(\Delta R\)) Framework
The core contribution of this paper is a new evaluation framework dubbed Delta Reasoning (\(\Delta R\)).
The “Delta” (\(\Delta\)) represents change. Instead of asking a model a single question, the researchers probe the model at two distinct timesteps (\(t\) and \(t+1\)). This allows them to measure the change in the model’s reasoning state.
Here is how the framework functions, step-by-step.
Step 1: Establishing the Prior Belief (Time \(t\))
First, the model is presented with two premises (\(\gamma_1, \gamma_2\)) that satisfy a basic logical rule, such as Modus Ponens (If \(P\) then \(Q\); \(P\) happens; therefore \(Q\)).
- Premise 1: If she has an essay to finish, she will study late in the library. (\(P \to Q\))
- Premise 2: She has an essay to finish. (\(P\))
- Expected Conclusion: She will study late in the library. (\(Q\))
The researchers call this the Basic @t stage. If a model cannot get this right, it doesn’t understand basic logic, and testing it for belief revision would be pointless.
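To make this concrete, here is a minimal Python sketch of what a Basic @t probe could look like. The prompt wording, answer options, and helper names are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of the Basic @t check, assuming a simple text-prompt setup;
# the template wording and function names are illustrative assumptions.

BASIC_AT_T_TEMPLATE = """Premise 1: {p1}
Premise 2: {p2}
Question: {question}
Answer with exactly one of: Yes / No / It is not possible to tell."""

def build_basic_probe(p1: str, p2: str, question: str) -> str:
    """Format the two premises available at time t into a single prompt."""
    return BASIC_AT_T_TEMPLATE.format(p1=p1, p2=p2, question=question)

prompt_at_t = build_basic_probe(
    p1="If she has an essay to finish, she will study late in the library.",
    p2="She has an essay to finish.",
    question="Will she study late in the library?",
)
# A model that handles Modus Ponens should answer "Yes" here.
print(prompt_at_t)
```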
Step 2: Introducing New Evidence (Time \(t+1\))
Next, the model is given a third premise (\(\gamma_3\))—a new piece of information. The model must now decide if this new information conflicts with the original conclusion (\(Q\)).
The genius of the \(\Delta R\) framework lies in the type of new information introduced. The researchers utilized two specific categories of new premises, inspired by the “Suppression Task” from cognitive science:
Scenario A: Belief Update (BU) - The “Additional” Condition
In this scenario, the new premise introduces a necessary condition that casts doubt on the original conclusion.
- New Evidence: “If the library stays open, she will study late in the library.”
- The Logic: Even though she has an essay (\(P\)), if the library is closed, she can’t study there. The library being open is an additional requirement.
- Correct Action: The model should Update its belief. It should retract the definite conclusion “She will study late” and switch to “She might or might not study late.”
Scenario B: Belief Maintain (BM) - The “Alternative” Condition
In this scenario, the new premise introduces an alternative way the result could happen, which shouldn’t affect the original path.
- New Evidence: “If she has textbooks to read, she will study late in the library.”
- The Logic: She still has an essay to finish (\(P\)). The fact that textbooks (\(R\)) also make her study doesn’t stop \(P\) from causing \(Q\). This is just an alternative path.
- Correct Action: The model should Maintain its belief. The conclusion “She will study late” is still valid because of the essay.
The Challenge
The model isn’t told which scenario it is facing. It has to use commonsense reasoning to understand the relationship between the essay, the library hours, and the textbooks. It must determine if the new information is a blocker (necessitating an update) or just extra flavor (necessitating maintenance).
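Putting the two timesteps together, the sketch below shows how a \(\Delta R\)-style probe at \(t+1\) might be assembled for both scenarios. The BU/BM labels and expected answers follow the paper's definitions, but the prompt format itself is an assumption.

```python
# A sketch of the t+1 probe: reuse the time-t premises, add the new premise,
# and ask the same question again. Expected answers follow the BU/BM
# definitions described above; the exact prompt format is an assumption.

REVISION_TEMPLATE = """Premise 1: {p1}
Premise 2: {p2}
Premise 3: {p3}
Question: {question}
Answer with exactly one of: Yes / No / It is not possible to tell."""

SCENARIOS = {
    # Belief Update (BU): the new premise adds a necessary condition,
    # so the model should retract the definite conclusion.
    "BU": {
        "p3": "If the library stays open, she will study late in the library.",
        "expected": "It is not possible to tell",
    },
    # Belief Maintain (BM): the new premise only adds an alternative cause,
    # so the original conclusion still holds.
    "BM": {
        "p3": "If she has textbooks to read, she will study late in the library.",
        "expected": "Yes",
    },
}

for name, scenario in SCENARIOS.items():
    prompt_at_t_plus_1 = REVISION_TEMPLATE.format(
        p1="If she has an essay to finish, she will study late in the library.",
        p2="She has an essay to finish.",
        p3=scenario["p3"],
        question="Will she study late in the library?",
    )
    print(f"--- {name}: expected answer = {scenario['expected']} ---")
    print(prompt_at_t_plus_1)
```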
Building “Belief-R”: A Dataset for Dynamic Minds
To test this framework at scale, the authors created Belief-R. They didn’t just write random sentences; they used a rigorous, semi-automated process to ensure high quality.
- Seed Data: They started with ATOMIC, a massive atlas of machine commonsense (e.g., “If Person X pays a compliment, Person Y feels happy”). This provided grounded, realistic causal relationships.
- Generation: They used GPT-4 to generate the premises (\(P, Q, R\)) based on the ATOMIC seeds, strictly instructing it to create either “Alternative” or “Additional” conditions.
- Human Verification: This was the crucial quality control step. Crowd-sourced workers analyzed the generated logical problems. The researchers only kept samples where at least 4 out of 5 human annotators agreed on the correct logical outcome.

As detailed in Table 2, the final dataset contains roughly 2,000 high-quality samples, balanced between Modus Ponens and Modus Tollens, and split between “Event” causes and “Mental State” causes.
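For intuition, here is a minimal sketch of the kind of agreement filter the verification step implies: keep an item only if at least 4 of 5 annotators agree on its label. The field names and data layout are hypothetical.

```python
# A sketch of a 4-of-5 annotator agreement filter; item structure and field
# names are assumptions for illustration, not the authors' actual pipeline.
from collections import Counter
from typing import Optional

def majority_label(votes, threshold=4) -> Optional[str]:
    """Return the agreed label if it reaches the threshold, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= threshold else None

raw_items = [
    {"id": "mp-001", "votes": ["maintain", "maintain", "maintain", "maintain", "update"]},
    {"id": "mp-002", "votes": ["update", "maintain", "update", "maintain", "update"]},
]

kept = []
for item in raw_items:
    label = majority_label(item["votes"])
    if label is not None:  # 4-of-5 agreement reached
        kept.append({**item, "gold": label})

print(kept)  # only "mp-001" survives the filter
```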
Experiments: How Smart are LLMs, Really?
The researchers tested a wide range of models, from smaller open-source models (like Phi-2 and Llama-2) to massive proprietary models (like GPT-4 and Claude 3).
Result 1: You Must Be This Tall to Ride
Before testing belief revision, the models had to pass the basic logic test at time \(t\) (Basic @t).

Figure 3 shows a clear trend: small, pre-trained models fail miserably at basic logic. However, Instruction-Tuned models (the green bars) and larger models (Generative APIs) perform very well, often exceeding 90% accuracy. This confirms that modern LLMs possess the baseline logic required for the experiment.
Result 2: The Failure to Adapt
Once the researchers established that the big models could do basic logic, they hit them with the Belief Revision task. They used a metric called BREU (Belief Revision Evaluation Understudy), which averages the accuracy of updating beliefs (BU) and maintaining beliefs (BM).
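Since BREU is simply the mean of the two accuracies, it is easy to state precisely. The helper below is a sketch of that definition, not the authors' code.

```python
# BREU as described in the paper: the average of Belief Update (BU) accuracy
# and Belief Maintain (BM) accuracy. A tiny illustrative helper.

def breu(bu_accuracy: float, bm_accuracy: float) -> float:
    """Belief Revision Evaluation Understudy: mean of BU and BM accuracy."""
    return (bu_accuracy + bm_accuracy) / 2

# e.g. a model that updates correctly 70% of the time but maintains only 40%
print(breu(0.70, 0.40))  # 0.55
```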
The results were stark.

Look at Figure 4. The distinct colored bars (representing revision accuracy) are significantly lower than the total height of the bars (representing basic accuracy).
- The Gap: Even powerful models like GPT-4 Turbo, which ace the basic logic (near 100%), drop to around 50-60% on the BREU score.
- Random Guessing: A BREU score of ~50% is perilously close to random guessing in a binary decision framework. This suggests LLMs are struggling deeply with discerning when to change their minds.
Result 3: The Trade-off
The researchers discovered a fascinating inverse relationship between the two abilities.
- Models that were good at Updating (realizing a new premise blocked the conclusion) often became too skeptical—they would update even when they should have maintained their belief.
- Models that were good at Maintaining beliefs (ignoring irrelevant alternatives) were often too stubborn—they failed to update when a blocker was introduced.
There is currently no model that perfectly balances these two opposing cognitive requirements.
Result 4: Prompting Is Not a Magic Wand
A common defense in current AI research is, “Did you try Chain-of-Thought (CoT) prompting?” (i.e., asking the model to “think step by step”).
The authors tried Direct Prompting (DP), Chain-of-Thought (CoT), and Plan-and-Solve (PS).

Figure 5(c) (far right) reveals a disappointing reality: Better prompting methods did not significantly solve the problem. While CoT and PS helped slightly in some specific configurations, they didn’t bridge the gap. In some cases, as seen in Table A1 below, sophisticated prompting actually hurt performance (lowering the score from DP to PS for some models). This suggests the deficiency isn’t just about “thinking harder”—it’s a fundamental gap in how these models represent and update knowledge states.
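To make the three strategies concrete, here is an illustrative sketch of how the prompt suffixes might differ. The exact wording the authors used is not reproduced here, so treat these strings as assumptions.

```python
# Illustrative suffixes for the three prompting strategies; the exact phrasing
# in the paper may differ, so these strings are assumptions.

PROMPT_STYLES = {
    "DP":  "Answer with Yes, No, or It is not possible to tell.",
    "CoT": "Let's think step by step, then answer with Yes, No, "
           "or It is not possible to tell.",
    "PS":  "Let's first devise a plan for checking whether the new premise "
           "affects the conclusion, then carry it out and give the final answer.",
}

def apply_style(base_prompt: str, style: str) -> str:
    """Append the chosen prompting strategy to a Belief-R question."""
    return f"{base_prompt}\n{PROMPT_STYLES[style]}"
```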

Why Is This Happening?
The paper offers several insights into why belief revision is so hard for LLMs:
- Modus Tollens is Harder: As shown in Figure 5(a), models struggle more with Modus Tollens (inferring the cause was false because the effect didn’t happen) than Modus Ponens. This backward reasoning combined with belief revision is a cognitive heavy lift.
- Abstract Concepts: Figure 5(b) shows that models are more likely to “Maintain” (refuse to update) when the effect is a Mental State rather than a physical Event. Mental states are abstract and less observable, making the models less confident in overturning their prior predictions.
- Sensitivity to Noise: The models perform worse on the Belief Maintain tasks compared to the baseline. This means simply adding any extra sentence (even one that shouldn’t change the outcome) confuses the model. This “distractibility” is a major weakness for Retrieval-Augmented Generation (RAG) systems, which often retrieve noisy documents.
Conclusion and Implications
The introduction of Belief-R and the \(\Delta R\) framework highlights a critical “blind spot” in current AI evaluation. We have built models that are excellent at static reasoning—taking a snapshot of text and answering questions about it. But we have not yet mastered dynamic reasoning—building agents that can gracefully intake new, conflicting information and adjust their worldview accordingly.
For students and future researchers, this paper opens a new frontier. Improving a model’s score on a static benchmark might be a case of diminishing returns. The real challenge lies in adaptability: creating systems that know when to hold onto a belief, and when to let it go.
As the authors conclude, moving toward reliable AI systems requires us to solve this trade-off. Until LLMs can reliably pass the “Tweety the Penguin” test in complex, real-world scenarios, they remain fragile reasoners in a constantly changing world.