When we interact with Large Language Models (LLMs) like GPT-4 or LLaMA, it is easy to be seduced by their apparent intelligence. You ask a complex multi-step question, and the model produces a coherent, logical answer. It feels like thinking.

But under the hood, is the model actually reasoning? Or is it simply engaging in a sophisticated form of pattern matching, stitching together cues from your prompt to hallucinate a logical structure?

This is the central question posed by researchers Mehrafarin, Eshghi, and Konstas in their paper, “Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs.” They strip away the hype to perform a forensic analysis of how models handle Transitive Reasoning—the logic chain where if \(A\) implies \(B\), and \(B\) implies \(C\), then \(A\) must imply \(C\).

In this post, we will break down their diagnostic experiments. We will discover that while LLMs can solve reasoning puzzles, the way they solve them is often surprisingly alien—relying on shortcuts and keywords rather than the logical deduction we assume they are using.

The Core Problem: Reasoning vs. Retrieval

At a high level, reasoning involves deriving new information that isn’t directly stored in memory. For an LLM, this distinguishes “knowing” that Paris is in France (retrieval) from figuring out that if “Alice is in Paris” and “Paris is in France,” then “Alice is in France” (reasoning).

The researchers focused specifically on Transitive Reasoning. This is a fundamental building block of logic governed by the following rule:

\(A \to B\) and \(B \to C\) together imply \(A \to C\).
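Stated as code rather than prose, the rule simply composes two relational facts that share a middle term. Here is a minimal sketch; the entities (borrowed from the Alice-in-Paris example above) and the helper name are purely illustrative:

```python
# Minimal sketch of the transitive rule: from (A, B) and (B, C), derive (A, C).
# The facts below reuse the illustrative "is in" example from earlier.

def compose(facts: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Derive every (a, c) for which (a, b) and (b, c) are both known."""
    derived = set()
    for a, b in facts:
        for b2, c in facts:
            if b == b2 and a != c:
                derived.add((a, c))
    return derived

facts = {("Alice", "Paris"), ("Paris", "France")}   # each pair reads "X is in Y"
print(compose(facts))                               # {('Alice', 'France')}
```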

To test this, the researchers utilized two datasets that rely on this \(A \to B \to C\) structure:

  1. QASC (Question Answering via Sentence Composition): A dataset of science questions that require combining two facts.
  2. Bamboogle: A dataset designed to test questions that models likely haven’t memorized, requiring them to “hop” between two facts to find an answer (e.g., finding a person’s birth year to determine who was president then).

The goal was simple: Feed the models two facts (the premises) and ask them to deduce the answer. Then, break the inputs in creative ways to see if the models stop working. If a model is truly reasoning, shuffling words into nonsense should break it. If it’s just pattern matching, it might not.

The Setup: Diagnostic Prompting

The researchers compared two major architectures (a sketch of loading them appears after the list):

  • LLaMA 2 (7B and 13B): Decoder-only models that are popular open-source standards.
  • Flan-T5 (XXL): An encoder-decoder model that has been instruction fine-tuned on various tasks, including reasoning datasets.
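Both model families are available through the Hugging Face hub, so a diagnostic study like this can be reproduced with standard tooling. The snippet below is a rough sketch of loading and querying them with the transformers pipeline API; the Hub IDs and generation settings are my assumptions, since the paper does not prescribe specific inference code (and the LLaMA 2 weights are gated behind a license acceptance):

```python
# Rough sketch of querying the two model families via Hugging Face transformers.
# Hub IDs, hardware assumptions, and generation settings are illustrative.
from transformers import pipeline

# Decoder-only LLaMA 2 (gated repo; needs accepted license and plenty of GPU memory).
llama = pipeline("text-generation", model="meta-llama/Llama-2-13b-hf")

# Encoder-decoder, instruction-tuned Flan-T5 XXL.
flan = pipeline("text2text-generation", model="google/flan-t5-xxl")

prompt = (
    "Fact 1: Alice is in Paris.\n"
    "Fact 2: Paris is in France.\n"
    "Question: Where is Alice?\nAnswer:"
)
print(llama(prompt, max_new_tokens=10)[0]["generated_text"])
print(flan(prompt, max_new_tokens=10)[0]["generated_text"])
```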

To understand how these models think, the authors didn’t just look at accuracy. They manipulated the “In-Context Learning” (ICL) prompts. An ICL prompt provides the model with a few examples (demonstrations) of how to solve a problem before asking the test question.

Figure 1 below illustrates the methodology. On the left (a), you see a standard “3-shot” prompt where the model is shown how to deduce an answer from two facts. On the right (b), you see the diagnostic manipulations—the “stress tests” designed to break the model’s reasoning.

Figure 1: (a) 3-shot In-Context Learning (ICL) prompt for the compositional question answering task… (b) We perform a series of manipulations…
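The paper's exact prompt wording is not reproduced here, but a 3-shot compositional prompt has roughly the following shape. The demonstration text and the field labels ("Fact 1", "Deduction", and so on) are invented for illustration:

```python
# Rough sketch of a few-shot ICL prompt for compositional QA.
# Demonstrations and field labels are illustrative placeholders.

DEMONSTRATIONS = [
    {
        "question": "What do clouds produce?",
        "fact1": "Clouds are made of water vapor.",
        "fact2": "Water vapor condenses and falls as rain.",
        "deduction": "Clouds are made of water vapor, which condenses and falls as rain.",
        "answer": "rain",
    },
    # ...two more demonstrations for a 3-shot prompt...
]

def build_prompt(demos, question, fact1, fact2):
    parts = []
    for d in demos:
        parts.append(
            f"Question: {d['question']}\n"
            f"Fact 1: {d['fact1']}\n"
            f"Fact 2: {d['fact2']}\n"
            f"Deduction: {d['deduction']}\n"
            f"Answer: {d['answer']}\n"
        )
    # The test item stops at "Deduction:" so the model completes the chain itself.
    parts.append(
        f"Question: {question}\n"
        f"Fact 1: {fact1}\n"
        f"Fact 2: {fact2}\n"
        f"Deduction:"
    )
    return "\n".join(parts)
```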

These manipulations, sketched in code after the list, included:

  1. Shuffling Words: Randomizing the word order within the facts (e.g., “clouds water form” instead of “clouds form water”).
  2. Keyword Removal: Deleting the specific words in the facts that overlap with the answer.
  3. Gibberish Entities: Replacing dates and proper names with nonsense strings to see if the model relies on recognizing famous entities.
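Here is a rough sketch of what these perturbations could look like in code. The tokenization, the overlap test, and the gibberish generator are simplified stand-ins, not the authors' implementation:

```python
import random
import string

def shuffle_words(fact: str) -> str:
    """Randomize word order inside a fact, destroying its syntax."""
    words = fact.split()
    random.shuffle(words)
    return " ".join(words)

def remove_keywords(fact: str, answer: str) -> str:
    """Drop every word of the fact that also appears in the answer."""
    answer_words = {w.lower() for w in answer.split()}
    kept = [w for w in fact.split() if w.lower().strip(".,") not in answer_words]
    return " ".join(kept)

def gibberish(entity: str) -> str:
    """Replace a date or proper name with a nonsense string of the same length."""
    return "".join(random.choices(string.ascii_lowercase, k=len(entity)))

print(shuffle_words("Climate is generally described in terms of temperature and moisture."))
print(remove_keywords("Water vapor condenses and falls as rain.", "rain"))
print(gibberish("1812"))   # e.g. "qzvt"
```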

Experiment 1: Do Facts Even Matter?

Before trying to break the models, the researchers established a baseline. Do the models actually use the provided facts, or do they just answer based on pre-trained memory?

They tested the models using several prompt types (a sketch of how the variants differ follows the list):

  • Full: The standard prompt with Question + Facts + Deduction steps.
  • QA: Question only (no facts provided).
  • QAF: Question + Facts (but no deduction steps).
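Assuming the same illustrative field labels as the prompt sketch above, the three variants differ only in which pieces of each example are kept:

```python
# Sketch of the three prompt variants: "Full" = question + facts + deduction,
# "QAF" = question + facts, "QA" = question only. Labels are illustrative.

def build_example(question, fact1=None, fact2=None, deduction=None, answer=None):
    lines = [f"Question: {question}"]
    if fact1 and fact2:               # present in Full and QAF
        lines += [f"Fact 1: {fact1}", f"Fact 2: {fact2}"]
    if deduction:                     # present only in Full
        lines.append(f"Deduction: {deduction}")
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

# Full: build_example(q, f1, f2, deduction=d, answer=a)
# QAF:  build_example(q, f1, f2, answer=a)
# QA:   build_example(q, answer=a)
```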

The results for the QASC dataset were telling:

QASC Dataset Table comparing accuracy across prompts.

Key Takeaways:

  • Facts are crucial: Look at the gap between QA (answering from memory) and Full (answering with facts). LLaMA 2-13b jumps from 55% to 90%. This proves the models are indeed utilizing the context provided.
  • The “Deduction” step helps LLaMA: LLaMA 2 performs much better when it sees examples of how to deduce (the “Full” prompt) compared to just seeing the facts (“QAF”).
  • Flan-T5 is a beast: It scores 97% on the Full prompt. However, note that Flan-T5 was fine-tuned on the QASC dataset during its training, so it has a “home field advantage.”

So far, so good. The models use the facts to get the right answer. But how are they using them?

Experiment 2: The “Word Salad” Surprise

Here is where things get strange. If a human reads the sentence “describes generally Climate terms in moisture and temperature of,” they would struggle to perform any logical deduction, because the syntax is broken and it is grammar that encodes how \(A\) relates to \(B\).

The researchers randomly shuffled the words inside the provided facts (the Shuffled Facts experiment) and fed them to the models.

Hypothesis: Performance should crash. Reality: It didn’t.

Figure 2: Accuracy of models prompted with the Shuffled Facts and Full diagnostic prompts.

As shown in the chart above, the orange bars (shuffled facts) are almost as high as the blue bars (grammatically correct facts).

  • LLaMA 2-13b dropped only slightly from 90% to 86%.
  • Flan-T5 dropped from 97% to 92%.

What this means: This result is profound. It suggests these LLMs are insensitive to word order. They aren’t parsing the sentence structure to understand that \(A\) leads to \(B\). Instead, they seem to be treating the sentence as a “bag of words”—a collection of keywords. If the tokens “Climate,” “temperature,” and “moisture” appear near each other, the model associates them, regardless of whether the sentence makes grammatical sense.

The researchers even tried asking the models to “un-shuffle” the sentences to see if they were mentally correcting the grammar. The models failed to do so, proving they weren’t fixing the sentence internally—they were just ignoring the syntax entirely.

Experiment 3: Hunting for Shortcuts

If the models aren’t reading for grammar, they must be hunting for specific cues. The researchers hypothesized that the models were relying on token overlap—simply matching words in the question to words in the facts, and then matching words in the facts to the answer choices.

To test this, they performed Ablation Studies, where they surgically removed specific connecting words from the facts.

  • F1Q / F2Q: Removing words that overlap between the Facts and the Question.
  • F1F2: Removing the “bridge” words shared between Fact 1 and Fact 2.
  • F1F2A Keyword Ablation: Removing the specific words in the facts that correspond to the Answer.

Table 2: Accuracy of LLaMA 2-13b, LLaMA 2-7b, and Flan-T5 XXL on QASC with different ablation prompts.

The results (Table 2 above) validate the “shortcut” theory:

  1. Connecting words don’t matter much: Removing the bridge words between Fact 1 and Fact 2 (F1F2 ablation) had very little impact. This is damning for “reasoning,” as the bridge is essential for the transitive property (\(B\) in \(A \to B \to C\)).
  2. Answer Keywords matter most: Look at the last row (F1F2A Keyword Ablation). When the answer keyword was removed from the text of the facts, performance dropped significantly (e.g., LLaMA 2-13b dropped by 15 points).

This indicates the models are largely playing a matching game. They look for the answer candidate that appears most prominently in the context text.
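That hypothesized shortcut can be caricatured in a few lines: score each answer option by how many of its tokens show up in the concatenated facts and pick the best match, ignoring word order entirely. This toy scorer is my own illustration of the hypothesis, not anything from the paper:

```python
def bag_of_words_guess(facts: str, options: list[str]) -> str:
    """Pick the option whose tokens overlap most with the context,
    ignoring syntax entirely (a caricature of the shortcut)."""
    context = {w.lower().strip(".,") for w in facts.split()}
    def overlap(option: str) -> int:
        return sum(w.lower() in context for w in option.split())
    return max(options, key=overlap)

facts = "Clouds are made of water vapor. Water vapor condenses and falls as rain."
print(bag_of_words_guess(facts, ["rain", "sunlight", "sand"]))   # "rain"
```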

Experiment 4: The Bamboogle Stress Test

The QASC dataset is multiple choice, which lets models guess or eliminate options. To test the models more rigorously, the authors moved to Bamboogle, a dataset that requires generating free-text answers. And since Flan-T5 was released before Bamboogle, the dataset cannot have appeared in its fine-tuning mix: unlike on QASC, it has no “home field advantage” here.

The researchers introduced a clever twist called “Gibberish Entities.”

In transitive reasoning, the logic should hold regardless of the nouns. If X happened in Year Y, and the President in Year Y was Z, then the President during X was Z. This is true whether the year is “1812” or “xxxx”.

The researchers replaced dates and names with gibberish (e.g., changing “1812” to “aavril”) to see if the models could still trace the logic without the semantic crutch of recognizing a famous date.

Table 4: Rouge-1 for LLaMA 2-13b and Flan-T5 on the Bamboogle Gibberish dataset…

The table above compares Rouge-1 scores on the “Gibberish” dataset (a quick sketch of the metric follows these takeaways):

  • LLaMA 2-13b (49%): It struggled significantly. This suggests LLaMA relies heavily on Named Entities. It uses dates and names as anchors. When “1812” becomes “aavril,” the model gets lost, even though the logical structure is identical.
  • Flan-T5 (97%): It remained incredibly robust.
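A quick note on the metric: Rouge-1 measures unigram overlap between the generated answer and the reference answer, which suits short free-text outputs where exact string match is too strict. Below is a minimal from-scratch version for intuition; real evaluations typically use a library such as rouge-score, which also handles tokenization and stemming details:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified Rouge-1: F1 over overlapping unigrams."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("George Washington", "Washington"))   # ~0.67
```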

Discussion: Why the Difference?

Why did Flan-T5 perform so well on the Gibberish test while LLaMA failed?

The authors suggest the secret lies in Fine-Tuning. Flan-T5 has been instruction fine-tuned on a massive variety of reasoning datasets. It appears to have learned the abstract pattern of transitive reasoning: if the prompt says “Event A happened at time [gibberish]” and “At time [gibberish], Person B was president,” it should output Person B.

LLaMA 2, which generally lacks that specific supervised fine-tuning on reasoning tasks, relies more on its pre-training priors. It looks for familiar dates and entities. When those are removed, the illusion of reasoning collapses.

Conclusion: A Semblance of Reasoning

This paper provides a sobering look at the capabilities of LLMs.

  1. It’s not human-like reasoning: The fact that the models perform well on “word salad” (shuffled facts) shows they process information fundamentally differently from humans. They do not rely on syntax or on the grammatical flow of an argument.
  2. It’s fragile: Remove the answer keywords from the context, or obscure the named entities (for non-fine-tuned models), and the performance drops.
  3. Fine-tuning simulates reasoning: The superior performance of Flan-T5 suggests that we can teach models to emulate reasoning patterns robustly, even if the underlying mechanism (keyword attention) remains distinct from human thought.

The next time you see an LLM solve a complex riddle, remember: it might not be deducing the answer. It might just be finding the best-fitting piece in a puzzle of keywords. The result is correct, but the “thought process” is merely a semblance of the real thing.