Large Language Models (LLMs) like GPT-4 and Claude 3 have dazzled the world with their ability to write code, compose poetry, and solve complex problems. When we see an LLM answer a classic riddle or a logic puzzle correctly, it is tempting to attribute human-like reasoning capabilities to the machine. We assume the model “understands” the logic.

But what if that understanding is brittle? What if the model isn’t solving the logic puzzle, but rather recognizing the specific words—the tokens—used in the puzzle?

A recent paper titled “A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners” investigates this uncomfortable question. The researchers propose that much of what looks like reasoning is actually token bias: the tendency of a model to rely on superficial patterns and specific keywords rather than the underlying logical structure of a task.

The “Twenty-Five Horses” Problem

To understand token bias, let’s look at a classic graph theory puzzle known as the “Twenty-Five Horses” problem: given twenty-five horses, a track that fits only five at a time, and no stopwatch, find the fastest three horses in the minimum number of races (the well-known answer is seven). A genuine reasoner understands the math behind this regardless of the context.

However, as shown below, when researchers simply changed the word “horses” to “bunnies” (and adjusted the count to thirty-six), state-of-the-art models like GPT-4 and Claude stumbled significantly.

Figure 1: We illustrate token bias using the classic “twenty-five horses” problem in graph theory.

This phenomenon raises a critical issue: if a model can solve a math problem about horses but fails the exact same math problem about bunnies, it isn’t reasoning—it’s reciting.

Understanding the Core Hypothesis

The central argument of this research is that LLMs are subject to token bias. This means that if we systematically change the description of a task (perturbing the tokens) while keeping the underlying logic exactly the same, we can predict a shift in the model’s performance.

If an LLM were a “genuine reasoner,” its performance should be invariant to these superficial changes. Logic is logic, whether it applies to Linda the bank teller or Bob the environmentalist.
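
One way to make this invariance precise (the notation here is illustrative, not the paper’s exact formulation): let \(x\) be an original problem and \(x'\) a perturbed version with identical logical structure. A genuine reasoner would satisfy

\[
\Pr\big[\text{model answers } x' \text{ correctly}\big] \;=\; \Pr\big[\text{model answers } x \text{ correctly}\big],
\]

so any systematic gap between the two sides points to token bias rather than reasoning.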

The “Linda” Problem and Conjunction Fallacy

The researchers focused heavily on the Conjunction Fallacy, a cognitive bias where people incorrectly assume that specific conditions are more probable than a single general one.

The most famous example is the “Linda Problem” from Tversky and Kahneman (1983):

Linda is 31, single, outspoken, and bright… She majored in philosophy… Which is more probable? (a) Linda is a bank teller. (b) Linda is a bank teller and is active in the feminist movement.

Logically, (a) is always more probable than (b) because the probability of two events occurring together is always less than or equal to the probability of one occurring alone.
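
In probability terms, with \(A\) = “Linda is a bank teller” and \(B\) = “Linda is active in the feminist movement”:

\[
\Pr(A \cap B) \;=\; \Pr(A)\,\Pr(B \mid A) \;\le\; \Pr(A),
\]

since \(\Pr(B \mid A) \le 1\). Adding a condition can never make an event more likely.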

LLMs are trained on vast amounts of data, including psychology texts discussing the Linda problem. They often answer this correctly. But is that because they understand probability, or because they recognize the name “Linda” in this context?

Figure 3: What is token bias? Here is another example exhibited by GPT-4. On the left, GPT-4 correctly identifies the conjunction fallacy and answers the question correctly. On the right, the exemplar is rephrased by altering “Linda” to “Bob” while keeping the same logic, which surprisingly confuses the model.

As Figure 3 illustrates, simply changing “Linda” to “Bob” causes the model to lose its grip on the logic. The model has “overfitted” to the specific tokens of the classic problem.

The Methodology: A Hypothesis-Testing Framework

To prove this isn’t just anecdotal evidence, the authors developed a rigorous statistical framework. They didn’t just ask a few questions; they generated large-scale synthetic datasets to test the models systematically.

The process involves three steps:

  1. Synthetic Data Generation: Creating new logical problems based on templates.
  2. Token Perturbation: Creating a “twin” of each problem where non-logical words (names, objects, framing) are changed; a minimal code sketch of this step appears below.
  3. Statistical Testing: Using the McNemar test on the resulting contingency table to see if the performance drop is statistically significant.

Figure 2: An illustration of the overall framework. We generate synthetic data, perform systematic token perturbations, and evaluate an LLM for comparative studies.
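
To make the perturbation step concrete, here is a minimal sketch of template-based token perturbation. The template, helper function, and substitutions are illustrative assumptions, not the authors’ actual code; the same idea covers name swaps (“Linda” \(\rightarrow\) “Bob”) as well as quantifier swaps (“All” \(\rightarrow\) “Every single”).

```python
# Minimal sketch of logic-preserving token perturbation (illustrative only;
# the template and substitutions are assumptions, not the paper's code).

CONJUNCTION_TEMPLATE = (
    "{name} is 31, single, outspoken, and bright. {name} majored in philosophy. "
    "Which is more probable? (a) {name} is a bank teller. "
    "(b) {name} is a bank teller and is active in the feminist movement."
)

def make_pair(template: str, original: dict, perturbed: dict) -> tuple[str, str]:
    """Return (original, perturbed) problem texts that share the same logic."""
    return template.format(**original), template.format(**perturbed)

original_problem, perturbed_problem = make_pair(
    CONJUNCTION_TEMPLATE,
    original={"name": "Linda"},   # a token the model has likely memorized
    perturbed={"name": "Bob"},    # same logic, unfamiliar token
)
print(original_problem)
print(perturbed_problem)
```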

The framework revolves around the contingency table below. The researchers are specifically looking for cases where the model gets the Original problem correct but the Perturbed problem wrong (\(n_{12}\)). If \(n_{12}\) is significantly larger than the reverse (\(n_{21}\)), it provides strong statistical evidence that the model is relying on the specific tokens of the original problem.

Table 1: A template for the contingency table. Rows index the original problems and columns the perturbed problems: \(n_{11}\) counts pairs where both versions are answered correctly, \(n_{12}\) where the original is correct but the perturbed is wrong, \(n_{21}\) where the original is wrong but the perturbed is correct, and \(n_{22}\) where both are wrong.
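
Given the counts from such a table, McNemar’s test depends only on the discordant cells \(n_{12}\) and \(n_{21}\). Below is a minimal computation using the chi-squared form with continuity correction; the counts are placeholders for illustration, not numbers from the paper.

```python
# McNemar's test on the discordant cells of a matched-pair contingency table.
# The counts are illustrative placeholders, not results from the paper.
from scipy.stats import chi2

n12 = 40  # original correct, perturbed wrong (the signature of token bias)
n21 = 5   # original wrong, perturbed correct

# Chi-squared form with continuity correction, 1 degree of freedom.
stat = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
p_value = chi2.sf(stat, df=1)

print(f"McNemar statistic = {stat:.2f}, p = {p_value:.3g}")
# A small p-value rejects the hypothesis that n12 and n21 are balanced, i.e.
# the accuracy drop from original to perturbed problems is unlikely to be noise.
```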

Experiment Results: Exposing the Bias

The study tested several hypotheses across major models, including GPT-4, Llama-3, and Claude 3 Opus. The results were consistent and revealing.

Hypothesis 1: Misleading Context

Real-world logic problems often contain irrelevant information. A true reasoner ignores it. The researchers tested if LLMs could withstand contextually misleading options. They found that when irrelevant context was added or swapped, performance degraded.

Figure 5: Full experimental results for Hypothesis 1. The perturbed problems swap options that are contextually relevant to the problem statements for irrelevant ones.

In the chart above, the salmon-colored bars (\(n_{12}\)) represent cases where the model solved the original problem but failed the perturbed one. The high salmon bars indicate that the models struggle when the “distractor” options are changed, showing they rely on specific contextual clues rather than pure logic.

Hypothesis 2: The “Linda” Fixation

We previously mentioned the Linda vs. Bob example. The researchers scaled this up, testing models on hundreds of variations where classic names (like Linda) were swapped for generic names (like Bob).

Figure 7: Full experimental results for Hypothesis 2. The perturbed problems change the classic name “Linda” to “Bob” in in-context learning exemplars.

The results were stark. For most models, the salmon bars are huge. This confirms Hypothesis 2: LLMs possess a strong token bias toward names that appear frequently in classic problems from their training data. They know “Linda” signals a trick question, but they don’t treat “Bob” with the same logical scrutiny.

Hypothesis 3: The Celebrity Effect

Does the mention of “Taylor Swift” change how an AI reasons? The researchers tested “Celebrity Bias” by replacing celebrity names with generic ones in conjunction fallacy problems.

Figure 9: Full experimental results for Hypothesis 3. The perturbed problems change the celebrity name to a generic one in problem statements.

The data suggests that LLMs are frequently misled by irrelevant celebrity names. The rich contextual background associated with a famous token seems to override the logical circuitry, leading the model to hallucinate relationships or probabilities based on the celebrity’s persona rather than the math.

Hypothesis 4: Syllogisms and Keywords

Classic syllogisms often use the words “All” and “Some” (e.g., “All roses are flowers. Some flowers fade quickly…”). The researchers tested if models were overfitting to these specific quantifiers by swapping them with synonyms (e.g., “All” \(\rightarrow\) “Every single”, “Some” \(\rightarrow\) “A subset”).

Figure 11: Full experimental results for Hypothesis 4. The perturbed problems change the tokens “All” and “Some” to different but equivalent expressions in syllogisms.

Here, we see a massive spike in \(n_{12}\) for models like GPT-4-Turbo (the top left graph in the figure above). This indicates that the model relies on the specific pattern “All… Some…” to trigger its reasoning module. When the wording changes, even if the meaning is identical, the reasoning often collapses.

Hypothesis 5: The Appeal to Authority

The researchers also tested if the “trustworthiness” of the source mattered. They framed syllogisms as coming from reputable sources (like the New York Times or MIT) versus generic or satirical sources.

Figure 13: Full experimental results for Hypothesis 5. The perturbed problems add the names of trustworthy news agencies and universities to alter the narratives of syllogisms.

The results showed that models were often misled by reputable names. If a logical fallacy is wrapped in a sentence starting with “Research from MIT supports…”, the model is more likely to accept the fallacy as true. This “Authority Bias” is a dangerous form of token bias where the token of a high-status entity bypasses the model’s logical filters.

Hypothesis 6: Leaking Hints

Finally, the researchers tested how much models rely on explicit hints. If the prompt explicitly mentions “Conjunction Fallacy,” the model performs well. If that hint is removed or weak, performance drops.

Figure 17: Full experimental results for Hypothesis 6. The perturbed problems leak hint tokens, either weak or strong, into the problem statements.

In this experiment the perturbation adds the hint, so the towering teal bars (\(n_{21}\)) in the chart above represent cases where the model failed the original, hint-free problem but succeeded once the hint was leaked. In other words, when the hint is present the model gets it right; when it is absent, it gets it wrong. This confirms that LLMs rely heavily on “hint tokens” to solve logical problems effectively.

Conclusion: Not Yet Genuine Reasoners

The evidence presented in this paper is compelling. By using a rigorous matched-pair statistical test, the authors demonstrated that LLMs are not consistently applying abstract reasoning rules. Instead, they are highly sensitive to token bias.

They excel when the problem looks like the data they were trained on—using the name “Linda,” the word “horses,” or the quantifier “All.” But when we strip away these familiar tokens and leave only the naked logic, the models frequently fail.

This suggests that techniques like Chain-of-Thought prompting or In-Context Learning may not be eliciting actual reasoning in the way we hope. Instead, they might simply be providing semantic shortcuts that allow the model to pattern-match its way to the correct answer.

For students and researchers in AI, this highlights a major challenge. We cannot rely solely on accuracy benchmarks to judge reasoning. A model that scores 90% on a logic test might drop to 40% if we simply rename the variables. To build truly intelligent systems, we must move beyond token prediction and solve the problem of genuine, invariant reasoning.