Large Language Models (LLMs) like GPT-4 and Claude 3 have dazzled the world with their ability to write code, compose poetry, and solve complex problems. When we see an LLM answer a classic riddle or a logic puzzle correctly, it is tempting to attribute human-like reasoning capabilities to the machine. We assume the model “understands” the logic.
But what if that understanding is brittle? What if the model isn’t solving the logic puzzle, but rather recognizing the specific words—the tokens—used in the puzzle?
A recent paper titled “A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners” investigates this uncomfortable question. The researchers propose that much of what looks like reasoning is actually token bias: the tendency of a model to rely on superficial patterns and specific keywords rather than the underlying logical structure of a task.
The “Twenty-Five Horses” Problem
To understand token bias, let’s look at a classic graph theory puzzle known as the “Twenty-Five Horses” problem: you have twenty-five horses, a track that fits only five at a time, and no stopwatch, and you must identify the three fastest horses in the minimum number of races. A genuine reasoner understands the math behind this regardless of the context.
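For context (the counting below refers to the classic twenty-five-horse version, not the paper’s perturbed variants), the standard solution takes seven races: five preliminary heats, one race among the five heat winners to find the fastest horse, and one final race among the five remaining candidates for second and third place:

\[
\underbrace{5}_{\text{heats}} \;+\; \underbrace{1}_{\text{winners' race}} \;+\; \underbrace{1}_{\text{final race}} \;=\; 7 \ \text{races.}
\]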
However, as shown below, when researchers simply changed the word “horses” to “bunnies” (and adjusted the count to thirty-six), state-of-the-art models like GPT-4 and Claude stumbled significantly.

This phenomenon raises a critical issue: if a model can solve a math problem about horses but fails the exact same math problem about bunnies, it isn’t reasoning—it’s reciting.
Understanding the Core Hypothesis
The central argument of this research is that LLMs are subject to token bias. This means that if we systematically change the description of a task (perturbing the tokens) while keeping the underlying logic exactly the same, we can predict a shift in the model’s performance.
If an LLM were a “genuine reasoner,” its performance should be invariant to these superficial changes. Logic is logic, whether it applies to Linda the bank teller or Bob the environmentalist.
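Stated slightly more formally (this notation is an informal paraphrase, not the paper’s): if \(\pi\) is a perturbation that changes only surface tokens while preserving the logical structure of a problem \(x\), a genuine reasoner should satisfy

\[
\Pr\big[\text{correct on } \pi(x)\big] \;=\; \Pr\big[\text{correct on } x\big],
\]

whereas token bias predicts a systematic drop whenever \(\pi\) strips away the familiar tokens.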
The “Linda” Problem and Conjunction Fallacy
The researchers focused heavily on the Conjunction Fallacy, a cognitive bias where people incorrectly assume that specific conditions are more probable than a single general one.
The most famous example is the “Linda Problem” from Tversky and Kahneman (1983):
Linda is 31, single, outspoken, and bright… She majored in philosophy… Which is more probable? (a) Linda is a bank teller. (b) Linda is a bank teller and is active in the feminist movement.
Logically, (a) can never be less probable than (b), because the probability of two events occurring together is always less than or equal to the probability of either one occurring alone.
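The underlying rule is just the chain rule of probability. For events \(A\) (“Linda is a bank teller”) and \(B\) (“Linda is active in the feminist movement”),

\[
\Pr(A \cap B) \;=\; \Pr(A)\,\Pr(B \mid A) \;\le\; \Pr(A), \qquad \text{because } \Pr(B \mid A) \le 1.
\]

No amount of vivid detail about Linda can make the conjunction more likely than either conjunct on its own.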
LLMs are trained on vast amounts of data, including psychology texts discussing the Linda problem. They often answer this correctly. But is that because they understand probability, or because they recognize the name “Linda” in this context?

As Figure 3 illustrates, simply changing “Linda” to “Bob” causes the model to lose its grip on the logic. The model has “overfitted” to the specific tokens of the classic problem.
The Methodology: A Hypothesis-Testing Framework
To prove this isn’t just anecdotal evidence, the authors developed a rigorous statistical framework. They didn’t just ask a few questions; they generated large-scale synthetic datasets to test the models systematically.
The process involves three steps:
- Synthetic Data Generation: Creating new logical problems based on templates.
- Token Perturbation: Creating a “twin” of each problem where non-logical words (names, objects, framing) are changed (see the sketch after this list).
- Statistical Testing: Using the McNemar test on the resulting contingency table to see if the performance drop is statistically significant.
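To make the perturbation step concrete, here is a minimal sketch of how an original/perturbed “twin” could be generated from a template. The template text and the names are illustrative placeholders, not the paper’s actual generation pipeline.

```python
# Minimal, illustrative sketch of template-based "twin" generation.
# The template and names are hypothetical placeholders, not the paper's code.
from dataclasses import dataclass

TEMPLATE = (
    "{name} is 31, single, outspoken, and bright. {name} majored in philosophy. "
    "Which is more probable? (a) {name} is a bank teller. "
    "(b) {name} is a bank teller and is active in the feminist movement."
)

@dataclass
class ProblemPair:
    original: str   # uses the classic, frequently seen token ("Linda")
    perturbed: str  # identical logic, unfamiliar token ("Bob")

def make_pair(classic_name: str = "Linda", generic_name: str = "Bob") -> ProblemPair:
    """Build a matched pair that differs only in the non-logical name token."""
    return ProblemPair(
        original=TEMPLATE.format(name=classic_name),
        perturbed=TEMPLATE.format(name=generic_name),
    )

pair = make_pair()
print(pair.original)
print(pair.perturbed)
```

Because the two prompts are logically identical, any systematic gap in accuracy between them can only come from the tokens themselves.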

The framework revolves around the contingency table below. The researchers are specifically looking for cases where the model gets the Original problem correct but the Perturbed problem wrong (\(n_{12}\)). If \(n_{12}\) is significantly higher than the reverse (\(n_{21}\)), it is strong evidence that the model is relying on the specific tokens of the original problem.
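As a rough illustration of the statistical step, the continuity-corrected McNemar statistic can be computed directly from \(n_{12}\) and \(n_{21}\). The counts used here are made up for the example, not results from the paper.

```python
# Minimal sketch of the matched-pair McNemar test; example counts are hypothetical.
from scipy.stats import chi2

def mcnemar(n12: int, n21: int) -> tuple[float, float]:
    """Continuity-corrected McNemar statistic and p-value (chi-square, 1 df).

    n12: original correct, perturbed wrong (the token-bias direction)
    n21: original wrong, perturbed correct
    """
    stat = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

# If a model flips from right to wrong on 40 perturbed twins but only
# improves on 5, the asymmetry is highly significant.
stat, p = mcnemar(n12=40, n21=5)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

A significant, positive asymmetry (\(n_{12} \gg n_{21}\)) is exactly the signature of token bias the authors look for.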

Experiment Results: Exposing the Bias
The study tested several hypotheses across major models, including GPT-4, Llama-3, and Claude 3 Opus. The results were consistent and revealing.
Hypothesis 1: Misleading Context
Real-world logic problems often contain irrelevant information. A true reasoner ignores it. The researchers tested if LLMs could withstand contextually misleading options. They found that when irrelevant context was added or swapped, performance degraded.

In the chart above, the salmon-colored bars (\(n_{12}\)) represent cases where the model solved the original problem but failed the perturbed one. The high salmon bars indicate that the models struggle when the “distractor” options are changed, showing they rely on specific contextual clues rather than pure logic.
Hypothesis 2: The “Linda” Fixation
We previously mentioned the Linda vs. Bob example. The researchers scaled this up, testing models on hundreds of variations where classic names (like Linda) were swapped for generic names (like Bob).

The results were stark. For most models, the salmon bars are huge. This confirms Hypothesis 2: LLMs possess a strong token bias toward names that appear frequently in the classic problem statements from the cognitive-science literature. They know “Linda” signals a trick question, but they don’t treat “Bob” with the same logical scrutiny.
Hypothesis 3: The Celebrity Effect
Does the mention of “Taylor Swift” change how an AI reasons? The researchers tested “Celebrity Bias” by replacing celebrity names with generic ones in conjunction fallacy problems.

The data suggests that LLMs are frequently misled by irrelevant celebrity names. The rich contextual background associated with a famous token seems to override the logical circuitry, leading the model to hallucinate relationships or probabilities based on the celebrity’s persona rather than the math.
Hypothesis 4: Syllogisms and Keywords
Classic syllogisms often use the words “All” and “Some” (e.g., “All roses are flowers. Some flowers fade quickly…”). The researchers tested if models were overfitting to these specific quantifiers by swapping them with synonyms (e.g., “All” \(\rightarrow\) “Every single”, “Some” \(\rightarrow\) “A subset”).

Here, we see a massive spike in \(n_{12}\) for models like GPT-4-Turbo (the top left graph in the figure above). This indicates that the model relies on the specific pattern “All… Some…” to trigger its reasoning module. When the wording changes, even if the meaning is identical, the reasoning often collapses.
Hypothesis 5: The Appeal to Authority
The researchers also tested if the “trustworthiness” of the source mattered. They framed syllogisms as coming from reputable sources (like the New York Times or MIT) versus generic or satirical sources.

The results showed that models were often misled by reputable names. If a logical fallacy is wrapped in a sentence starting with “Research from MIT supports…”, the model is more likely to accept the fallacy as true. This “Authority Bias” is a dangerous form of token bias where the token of a high-status entity bypasses the model’s logical filters.
Hypothesis 6: Leaking Hints
Finally, the researchers tested how much models rely on explicit hints. If the prompt explicitly mentions “Conjunction Fallacy,” the model performs well. If that hint is removed or weak, performance drops.

The overwhelming height of the teal bars (\(n_{21}\)) in the chart above tells the same story: in this particular setup, the “Perturbed” version is the one that adds the hint, and the model tends to answer it correctly while failing the hint-free “Original” version. This confirms that LLMs rely heavily on “hint tokens” to solve logical problems.
Conclusion: Not Yet Genuine Reasoners
The evidence presented in this paper is compelling. By using a rigorous matched-pair statistical test, the authors demonstrated that LLMs are not consistently applying abstract reasoning rules. Instead, they are highly sensitive to token bias.
They excel when the problem looks like the data they were trained on—using the name “Linda,” the word “horses,” or the quantifier “All.” But when we strip away these familiar tokens and leave only the naked logic, the models frequently fail.
This suggests that techniques like Chain-of-Thought prompting or In-Context Learning may not be eliciting actual reasoning in the way we hope. Instead, they might simply be providing semantic shortcuts that allow the model to pattern-match its way to the correct answer.
For students and researchers in AI, this highlights a major challenge. We cannot rely solely on accuracy benchmarks to judge reasoning. A model that scores 90% on a logic test might drop to 40% if we simply rename the variables. To build truly intelligent systems, we must move beyond token prediction and solve the problem of genuine, invariant reasoning.