State-of-the-art Large Language Models (LLMs) like GPT-4 and Llama-2 are often celebrated for their reasoning capabilities. We see them pass bar exams, solve complex math problems, and generate code. But a lingering question remains in the NLP community: Are these models actually reasoning, or are they just sophisticated pattern matchers taking shortcuts?

In the era of smaller, fine-tuned models (like BERT), we knew the answer: they were shortcut takers. They would often ignore the logic of a sentence and simply match keywords to find an answer. LLMs, however, are presumed to be better. Because they are typically evaluated zero-shot (without fine-tuning on a specific benchmark’s training data), the assumption is that they never get the chance to learn these “cheap” tricks.

A fascinating research paper, “Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?”, challenges this assumption. The researchers developed a method to test whether LLMs are truly attentive readers or if they can be seduced by “seemingly plausible” but ultimately incorrect information. The results suggest that while LLMs are smarter than their predecessors, they still get “blinded by nuance.”

The Challenge of Multi-Hop Reasoning

To understand the paper’s contribution, we first need to understand multi-hop reasoning.

Simple question answering requires looking up a single fact. For example, “Who directed Titanic?” requires finding one document that links James Cameron to the movie.

Multi-hop reasoning requires integrating information from multiple sources to arrive at an answer. Consider this question:

“Who created the 2003 remake of the 1983 overhead view, vehicular combat game developed by Bally Midway?”

To answer this, a model (or a human) must perform two “hops”:

  1. Hop 1: Identify the 1983 game developed by Bally Midway (Answer: Spy Hunter).
  2. Hop 2: Identify who created the 2003 remake of Spy Hunter (Answer: Adam Dawes).

If a model skips Hop 1 and just guesses based on keywords like “2003 remake” and “vehicular combat,” it might land on the wrong game entirely.
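To make the two-hop structure concrete, here is a minimal Python sketch that treats each hop as a lookup whose output feeds the next lookup. The tiny knowledge base and helper names are purely illustrative; they are not taken from the paper or any benchmark.

```python
# A toy illustration of two-hop reasoning: each hop is a lookup whose output
# becomes the input of the next. The "knowledge base" is illustrative only.
knowledge_base = {
    ("1983 game developed by", "Bally Midway"): "Spy Hunter",    # Hop 1
    ("creator of 2003 remake of", "Spy Hunter"): "Adam Dawes",   # Hop 2
}

def answer_two_hop(hop1_relation, hop1_entity, hop2_relation):
    """Resolve hop 1, then feed its answer (the bridge entity) into hop 2."""
    bridge_entity = knowledge_base[(hop1_relation, hop1_entity)]
    final_answer = knowledge_base[(hop2_relation, bridge_entity)]
    return bridge_entity, final_answer

print(answer_two_hop("1983 game developed by", "Bally Midway",
                     "creator of 2003 remake of"))
# -> ('Spy Hunter', 'Adam Dawes')
```

A model that skips the bridge entity has to guess the second key directly from surface cues, which is exactly the shortcut behavior the paper probes.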

The Problem with Existing Benchmarks

In the past, researchers noticed that models could solve these multi-hop questions without actually doing the multi-hop work. If a paragraph contained the words “Bally Midway” and “2003,” the model might just grab the nearest name.

To test this, researchers previously used “adversarial attacks”—adding distracting paragraphs to the text to see if the model gets confused. Traditional attacks (like AddDoc) simply added paragraphs with high lexical overlap (lots of shared words). Modern LLMs are generally robust to these; they are smart enough to realize that a paragraph sharing just keywords isn’t necessarily the right answer.

But what if the distraction wasn’t just random keywords? What if the distraction was a plausible alternative reasoning path?

The Core Method: Creating the “Plausible Distractor”

The core contribution of this paper is a new framework for generating adversarial examples that are much harder to detect. Instead of messy word salads, the researchers generate coherent, logical, but factually incorrect reasoning chains—“plausible distractors.”

Here is how the researchers build these traps:

1. Question Decomposition

First, they take a multi-hop question and break it down into its constituent sub-questions.

Figure 2: Example of a decomposed multi-hop question.

As shown in Figure 2, a complex question about Guns N’ Roses is split into two logical steps. This decomposition allows the researchers to target specific parts of the reasoning chain.
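The paper does not ship its decomposition code, but the step can be sketched as a single LLM call that splits the question into numbered sub-questions. The prompt wording, the `decompose` helper, and the use of the `openai` Python client (v1+) below are assumptions for illustration only.

```python
# Sketch of LLM-based question decomposition (not the authors' exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DECOMPOSE_PROMPT = """Break the multi-hop question into its sub-questions.
Use #1 to refer to the answer of the first sub-question.

Question: {question}
Sub-questions:"""

def decompose(question: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": DECOMPOSE_PROMPT.format(question=question)}],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Expect one sub-question per line, e.g.
    # "1. Which 1983 game ..." / "2. Who created the 2003 remake of #1?"
    return [line.strip() for line in text.splitlines() if line.strip()]
```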

2. Identifying and Modifying Details

To create a trap, you can’t just change the answer; you have to change the context so the wrong answer looks right. The system identifies a “main entity” and a “modifiable detail” in the sub-questions.

For example, if the original question is about the arena where a team played its home games, the researchers might change the modifier “home” to “playoff”.
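Conceptually, this step is a targeted substitution on a sub-question. The sketch below hard-codes a single swap for illustration; the actual framework identifies main entities and modifiable details automatically rather than from a fixed table.

```python
# Illustrative only: replace the "modifiable detail" in a sub-question so the
# resulting distractor answers a subtly different question.
MODIFIER_SWAPS = {
    "home games": "playoff games",   # the example discussed above
}

def modify_detail(sub_question: str) -> str:
    for original, replacement in MODIFIER_SWAPS.items():
        if original in sub_question:
            return sub_question.replace(original, replacement)
    return sub_question

print(modify_detail("In which arena did the team play its home games?"))
# -> "In which arena did the team play its playoff games?"
```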

3. Generating the Distractor Paragraphs

This is where LLMs are used against themselves. The researchers feed the modified sub-questions into GPT-4 to generate fake Wikipedia-style paragraphs.

If the modified question asks about the “playoff” games (instead of home games), GPT-4 generates a convincing paragraph stating the team played playoff games at “Maple Leaf Arena” (a hallucinated or irrelevant detail in this context).
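The generation step can then be sketched as another GPT-4 call that turns the modified sub-question and a fabricated answer into a Wikipedia-style paragraph. Again, the prompt text and the `generate_distractor` helper are illustrative assumptions, not the authors' exact setup.

```python
# Sketch only: produce a convincing Wikipedia-style distractor paragraph
# for the modified sub-question. Prompt wording is assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

DISTRACTOR_PROMPT = """Write a short, factual-sounding, Wikipedia-style paragraph
that answers the question below with the given answer.

Question: {question}
Answer: {answer}
Paragraph:"""

def generate_distractor(modified_question: str, fake_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": DISTRACTOR_PROMPT.format(
                       question=modified_question, answer=fake_answer)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```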

The result is a test case where the model is presented with:

  1. The Gold Path: The correct paragraphs leading to the real answer.
  2. The Distractor Path: A high-quality, semantically consistent paragraph that answers a slightly different question (e.g., about playoff games rather than home games).

If the LLM is an “attentive reader,” it will notice the specific constraint in the user’s question (e.g., “home games”) and ignore the distractor. If the LLM is skimming or relying on general semantic similarity, it might fall for the trap.
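Putting the pieces together, an adversarial instance is just the original question plus a shuffled mix of gold and distractor paragraphs. A minimal sketch of that final assembly, with hypothetical argument names:

```python
import random

def build_adversarial_context(question, gold_paragraphs, distractor_paragraphs,
                              seed=0):
    """Mix correct and distractor paragraphs so position gives nothing away."""
    paragraphs = list(gold_paragraphs) + list(distractor_paragraphs)
    random.Random(seed).shuffle(paragraphs)
    context = "\n\n".join(paragraphs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```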

Experiments: Do LLMs Take Shortcuts?

Before unleashing their new attack, the researchers first established a baseline. They wanted to see if LLMs exhibit “reasoning shortcuts” even on standard data, similar to older models like BERT.

They used a dataset called SubQA to check whether the model could answer both the individual sub-questions and the original multi-hop question correctly.

Table 2: Results of Llama-2-13B on SubQA dataset

Table 2 reveals a discrepancy. Llama-2-13B performs significantly better on individual sub-questions (F1 ~0.74) than on the original multi-hop question (F1 ~0.42). This suggests that the difficulty lies in the integration of information—the “hop” itself.

More telling is the consistency analysis found in Table 3:

Table 3: Breakdown of the results on running SubQA

Look at the row “Correct but sub-questions wrong.” In 10.7% of cases, the model guessed the final answer correctly despite failing to answer the necessary sub-questions. This is a “smoking gun” for shortcut reasoning: the model gets the right answer for the wrong reasons. Conversely, in 25% of cases, the model knew both facts individually but failed to combine them, a failure to integrate information rather than a lack of knowledge.
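In principle, the breakdown behind Table 3 comes from cross-tabulating sub-question correctness against final-answer correctness. A schematic version, assuming a hypothetical per-example record format:

```python
from collections import Counter

def consistency_breakdown(records):
    """Each record is assumed to look like
    {'final_correct': bool, 'subs_correct': bool}."""
    if not records:
        return {}
    counts = Counter()
    for r in records:
        if r["final_correct"] and not r["subs_correct"]:
            counts["correct but sub-questions wrong"] += 1      # shortcut signal
        elif not r["final_correct"] and r["subs_correct"]:
            counts["sub-questions right but final wrong"] += 1  # integration failure
        elif r["final_correct"]:
            counts["both correct"] += 1
        else:
            counts["both wrong"] += 1
    total = len(records)
    return {label: count / total for label, count in counts.items()}
```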

The Main Result: Falling for the Trap

The researchers then evaluated several models (Llama-2, Mixtral, GPT-3.5) on their new benchmark containing the Plausible Distractors.

The results were stark. Unlike traditional “lexical” attacks which LLMs mostly ignore, these “semantic” distractors caused massive performance drops.

Table 4: Results of Llama-2-13B, Mixtral-8x7B-Instruct-v0.1, Llama-2-70B, GPT-3.5 and longformer…

Table 4 provides the comprehensive results. Let’s break down the key takeaways:

  • Significant Drops Across the Board: Look at the “Llama-2-70b” row. On the original dataset (“ori”), it scores 54.1 EM (Exact Match). On the adversarial dataset (“adv”), it drops to 40.4 EM, a relative decrease of roughly 25%.
  • The “Related” Factor Matters: The columns under “Paragraph Related” are crucial. When the distractor paragraphs form a cohesive, related reasoning chain (a complete fake story), the models perform worse (lower F1 scores) compared to when the distractors are unrelated. This confirms that LLMs are being seduced by the coherence of the fake path.
  • GPT-3.5 is not immune: Even proprietary models like GPT-3.5 saw their F1 scores plummet from 63.4 to 39.9.

The researchers also tested GPT-4 (though on a smaller sample size due to cost). While GPT-4 was more robust than the others, it still suffered a 14% relative decrease in F1 score when faced with four plausible distractor paragraphs. This suggests that “scaling up” does not automatically solve the problem of reasoning shortcuts.
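For reference, the relative decreases quoted in this section are just the drop expressed as a fraction of the original score, as the small calculation below shows.

```python
def relative_drop(original, adversarial):
    """Drop expressed as a fraction of the original score."""
    return (original - adversarial) / original

print(f"{relative_drop(54.1, 40.4):.0%}")  # Llama-2-70B EM: ~25%
print(f"{relative_drop(63.4, 39.9):.0%}")  # GPT-3.5 F1:    ~37%
```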

Why Is This Happening?

The paper suggests that LLMs aren’t necessarily “reading” in the way humans do. When a human reads “Where did they play home games?”, they specifically look for the word “home” and exclude “playoff.”

LLMs, however, operate on probabilistic associations. A paragraph discussing the team, the sport, and an arena has a very high probability weight, even if the specific modifier (“playoff” vs “home”) doesn’t match. The “Plausible Distractor” exploits this by creating a paragraph that is semantically perfect except for that one crucial constraint.

The authors call this behavior being “blinded by nuance.”
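One way to see this intuition for yourself (this is not an experiment from the paper) is to compare sentence embeddings: swapping “home” for “playoff” barely moves the vector, even though it changes the answer. A sketch assuming the sentence-transformers package:

```python
# Illustration only: a single-word modifier swap barely changes embedding similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
original   = "The team played its home games at the arena downtown."
distractor = "The team played its playoff games at the arena downtown."

embeddings = model.encode([original, distractor])
print(util.cos_sim(embeddings[0], embeddings[1]))
# Typically well above 0.9, despite the changed constraint.
```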

Can Prompt Engineering Fix It?

A common counter-argument in modern NLP is: “Just prompt it better.” The researchers tested this hypothesis by using advanced prompting techniques like Chain-of-Thought (CoT) and Self-Consistency (asking the model to reason multiple times and taking the most common answer).

Table 11: Effect of self-consistency on F1 score

Table 11 shows the results. While Self-Consistency provides a modest improvement (raising Llama-2-13B’s adversarial F1 from 20.4 to 23.9), it does not recover the lost performance. The models remain vulnerable. The failure seems to be fundamental to how the models attend to information, rather than just a lack of “thinking time.”
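Self-Consistency itself is straightforward to sketch: sample several answers at non-zero temperature and keep the majority vote. The `ask_model` callable below is a placeholder for whichever model is being probed.

```python
from collections import Counter

def self_consistency(ask_model, prompt, n_samples=5):
    """ask_model(prompt) -> answer string, sampled with temperature > 0."""
    answers = [ask_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority-voted answer
```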

Conclusion: The Illusion of Reasoning

This paper serves as a vital reality check. We often attribute human-level comprehension to LLMs because they speak so fluently. However, their inability to distinguish between the correct reasoning path and a “seemingly plausible” alternative suggests they are not yet fully attentive readers.

The implications are significant, particularly for Retrieval-Augmented Generation (RAG) systems. In RAG, we fetch documents to help an LLM answer a question. If the retrieval system pulls in a “plausible distractor” (a document that looks relevant but addresses a slightly different nuance), this paper suggests the LLM is highly likely to produce a confident but incorrect answer grounded in that distractor.
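For concreteness, here is a hypothetical illustration (not from the paper) of why RAG is exposed: if the retriever ranks documents purely by embedding similarity, a plausible distractor can score as high as the gold paragraph and land in the LLM’s context alongside it.

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, docs, k=2):
    """Rank documents by cosine similarity to the query embedding alone;
    nothing here checks the 'home' vs 'playoff' constraint."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:k]]
```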

The researchers have shown that while LLMs have moved past simple keyword matching, they have simply graduated to a more complex form of shortcut-taking. To build truly reliable AI, we need models that don’t just find the most likely path, but the one that strictly adheres to the facts.