The capabilities of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini have revolutionized Natural Language Understanding (NLU). We have seen them write poetry, summarize legal documents, and even generate code. It is easy to look at these feats and assume that these models possess a form of robust logical reasoning. After all, if a model can write a convincing essay on philosophy, surely it can solve a logic puzzle, right?
Not necessarily. There is a growing body of evidence suggesting that while LLMs are excellent at pattern matching and linguistic probability, they struggle significantly with multi-step logical deduction.
In this post, we are doing a deep dive into a fascinating paper titled “Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?” by researchers from Arizona State University and Microsoft Research. This paper moves beyond simple accuracy metrics to answer a fundamental question: When an LLM tries to solve a logic puzzle, how exactly does it fail?
To answer this, the researchers developed a comprehensive framework involving a new dataset, a detailed error taxonomy, and automated evaluation pipelines. Let’s unravel their work step-by-step.
The Problem with Current Evaluations
Most benchmarks for LLMs focus on the final answer. If you ask a model a math word problem and it gives the correct number, we mark it as “correct.” However, this binary pass/fail metric hides the messiness of the reasoning process. An LLM might arrive at the right answer through a hallucinated formula (a false positive), or it might reason perfectly for ten steps, fail on the eleventh, and receive no credit for any of the sound reasoning it did produce.
To truly test logical capabilities, we need a task that relies purely on deduction, constraint satisfaction, and elimination, with zero ambiguity. Enter the Logic Grid Puzzle.
What is a Grid Puzzle?
You have likely seen these in puzzle magazines. You are given a scenario (e.g., “Five friends bought five different ice creams on five different days”) and a set of clues (e.g., “The person who bought chocolate did so the day before Tom”). Your goal is to fill a grid to match every variable correctly.
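To make this concrete, here is a toy puzzle written out as plain Python data. This is purely illustrative: the names, clues, and schema are made up and are not the format used by the GridPuzzle dataset.

```python
# Purely illustrative: a toy grid puzzle as plain Python data.
# The schema and names are invented; this is not the GridPuzzle dataset format.
toy_puzzle = {
    "categories": {
        "friend": ["Tom", "Ana", "Raj"],
        "ice_cream": ["chocolate", "vanilla", "mango"],
        "day": ["Mon", "Tue", "Wed"],
    },
    "clues": [
        "The person who bought chocolate did so the day before Tom.",
        "Ana did not buy vanilla.",
        "The mango buyer went on Wednesday.",
        "Raj did not go on Monday.",
    ],
    # The single assignment consistent with all four clues.
    "solution": [
        ("Ana", "chocolate", "Mon"),
        ("Tom", "vanilla", "Tue"),
        ("Raj", "mango", "Wed"),
    ],
}
```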
Grid puzzles are the perfect testbed for LLMs because:
- They are self-contained: No external knowledge is needed.
- They are strict: There is only one correct solution.
- They require memory: You must hold multiple constraints in “working memory” simultaneously.
The researchers introduced GridPuzzle, a curated dataset of 274 puzzles ranging in size from \(3 \times 4\) (easy) to \(4 \times 6\) (hard). They used this dataset to stress-test models like GPT-4, Claude-3, Gemini, Mistral, and Llama-2.

As shown in Figure 1, the pipeline is extensive. It starts with the GridPuzzle dataset, feeds it into LLMs using Zero-Shot Chain-of-Thought (CoT) prompting, and then subjects the output to both manual and automated scrutiny.
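For intuition, a zero-shot CoT query for a single puzzle can be as simple as the sketch below. It assumes the OpenAI chat API; the paper’s exact prompt wording and decoding settings may differ.

```python
# A minimal sketch of zero-shot Chain-of-Thought prompting for one puzzle.
# The prompt text here is illustrative, not the paper's exact wording.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve_with_cot(puzzle_text: str, model: str = "gpt-4") -> str:
    """Ask the model to reason step by step and return its full reasoning chain."""
    prompt = (
        f"{puzzle_text}\n\n"
        "Fill in the grid so that every clue is satisfied.\n"
        "Let's think step by step."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep decoding as deterministic as possible for evaluation
    )
    return response.choices[0].message.content
```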
Dissecting the Reasoning Chain
The core contribution of this paper is not just finding out if models fail, but where. To do this, the authors scrutinized the “Reasoning Chains”—the step-by-step explanations generated by the models.
They performed a manual analysis of 150 reasoning chains, breaking them down sentence by sentence. For every sentence in a model’s output, they identified the Premise (the information the model is using) and the Conclusion (the deduction the model is making).
Based on this analysis, they proposed a new Error Taxonomy. This is a framework for classifying logical mistakes, which is incredibly useful for anyone studying AI interpretability.
The Error Taxonomy
The taxonomy is split into broad categories and fine-grained sub-categories.
Broad Categories:
- WW (Wrong Premise, Wrong Conclusion): The model starts with bad info and ends with a bad deduction.
- WR (Wrong Premise, Right Conclusion): The model gets lucky. It starts with false info but accidentally derives a true statement.
- RW (Right Premise, Wrong Conclusion): This is a critical failure of logic. The model has the correct facts but draws an invalid inference.
- RR (Right Premise, Right Conclusion): A correct reasoning step.
- NC (No Conclusion): Filler text or simple restatement of clues without new deduction.
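If you want to work with these labels programmatically, the broad categories reduce to a small mapping from two per-step judgments. The enum names mirror the paper’s codes; the helper function is our own sketch, not code from the paper.

```python
# A minimal encoding of the broad error categories, assuming each reasoning
# step has been judged on its premise and its conclusion.
from enum import Enum

class StepCategory(Enum):
    RR = "right premise, right conclusion"
    RW = "right premise, wrong conclusion"
    WR = "wrong premise, right conclusion"
    WW = "wrong premise, wrong conclusion"
    NC = "no conclusion (filler / restated clue)"

def categorize(premise_ok: bool, conclusion_ok: bool | None) -> StepCategory:
    """Map per-step judgments onto the broad taxonomy; None means no new deduction."""
    if conclusion_ok is None:
        return StepCategory.NC
    if premise_ok and conclusion_ok:
        return StepCategory.RR
    if premise_ok and not conclusion_ok:
        return StepCategory.RW
    if not premise_ok and conclusion_ok:
        return StepCategory.WR
    return StepCategory.WW
```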
The researchers went deeper, identifying specific sub-categories to explain why a premise or conclusion was wrong.

Table 2 details these sub-categories. Some notable ones include:
- Hallucination: The model invents information not present in the clues.
- Error Propagation: A step is wrong because a previous step was wrong (the “snowball effect”).
- Wrong Elimination: A specific failure in grid puzzles where the model fails to correctly rule out impossible options.
This taxonomy transforms the evaluation from a simple “wrong answer” to a diagnostic report. For example, if a model consistently makes “RW” errors (Right Premise, Wrong Conclusion), we know it struggles with the core logic engine. If it makes “WW” errors due to “Error Propagation,” we know it struggles with maintaining long-context consistency.
Automating the Critic: The Auto-Evaluator
Manually annotating reasoning chains is slow and expensive. To scale this up to thousands of steps, the researchers created an Auto-Evaluator. They prompted GPT-4o with their error taxonomy and strict instructions to act as a judge.
The prompt structure for this auto-evaluator is robust. It includes:
- System Instructions: The rules of the evaluation.
- Knowledge Base: The definitions of the error taxonomy (from Table 2).
- Exemplars: Examples of human-annotated reasoning chains.
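Putting those three pieces together might look roughly like the sketch below. The instruction text and the `taxonomy` and `exemplars` placeholders stand in for the paper’s actual prompt components, so treat this as an outline rather than the authors’ prompt.

```python
# A sketch of how the three-part evaluator prompt could be assembled.
# The wording is illustrative; the paper's real prompts differ.
SYSTEM_INSTRUCTIONS = (
    "You are a strict judge. For every sentence in the reasoning chain, "
    "identify the premise and the conclusion, decide whether each is correct "
    "given the puzzle clues, and assign one error category."
)

def build_evaluator_prompt(puzzle: str, reasoning_chain: str,
                           taxonomy: str, exemplars: str) -> list[dict]:
    """Return chat messages: system rules, knowledge base + exemplars, then the target chain."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": f"Error taxonomy:\n{taxonomy}\n\nExamples:\n{exemplars}"},
        {"role": "user", "content": f"Puzzle:\n{puzzle}\n\nReasoning chain to evaluate:\n{reasoning_chain}"},
    ]
```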

Figure 9 illustrates this process in action. You can see the original puzzle, the model’s (Llama-2) flawed attempt, and then the Auto-Evaluator’s JSON output. The evaluator breaks down a sentence, identifies the premise and conclusion, explains the flaw (e.g., “The conclusion is incorrect as it assumes Underwood is staying for 2 days without sufficient information”), and assigns the error code (RW - Right Premise, Wrong Conclusion).
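In code terms, each evaluated sentence comes back as a small record along these lines. The field names are illustrative placeholders, not the paper’s exact JSON schema; the explanation string is the one quoted above.

```python
# Illustrative shape of one evaluated step; field names are hypothetical.
evaluated_step = {
    "sentence": "<one sentence from the model's reasoning chain>",
    "premise": "<information the sentence relies on>",
    "conclusion": "<deduction the sentence makes>",
    "explanation": ("The conclusion is incorrect as it assumes Underwood is "
                    "staying for 2 days without sufficient information."),
    "category": "RW",  # Right Premise, Wrong Conclusion
}
```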
Validation showed that the Auto-Evaluator had an ~86% agreement rate with human annotators, making it a reliable proxy for large-scale analysis.
PuzzleEval: A New Metric for Reasoning
Identifying errors is great, but we also need a quantitative score. The researchers introduced PuzzleEval, a reference-free metric. “Reference-free” means you don’t need a “perfect” human-written explanation to compare against; you only need the final correct answer key (the Gold Solution).
How PuzzleEval Works: The metric calculates a Correctness Score for the reasoning chain by verifying if the intermediate deductions match the ground truth.
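One compact way to write this score (our notation, not the paper’s) is in terms of the claim pairs extracted from the chain in the stages described below: if the chain yields \(N\) pairs \(p_1, \dots, p_N\) and \(\mathcal{S}_{\text{gold}}\) denotes the set of pairs implied by the gold solution, then

\[
\text{Correctness} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, p_i \in \mathcal{S}_{\text{gold}} \,\right].
\]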

As visualized in Figure 2, the process has three stages:
- Conclusion Extraction: The LLM (acting as a parser) reads a reasoning step and extracts the logical claim (e.g., “Therefore, Sam is assigned 2013”).
- Pair-wise Extraction: This claim is converted into structured pairs (e.g., (Sam, 2013) or (Sam, NOT 2014)).
- Validation: These pairs are checked against the Gold Solution table. If the pair exists in the solution, it gets a 1; otherwise, a 0.
The final score is the average correctness across all steps. This allows us to give partial credit. A model might reason perfectly for 90% of the puzzle and fail at the very end. Standard accuracy would give it a 0, but PuzzleEval might give it a 0.9, accurately reflecting that it can reason, just not robustly enough to finish.
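As a rough sketch of the validation stage, consider the function below. It is our simplification: we score individual extracted claims rather than grouping them by step, and the tuple format is invented for illustration.

```python
# A minimal sketch of PuzzleEval-style validation, assuming conclusion
# extraction has already produced (entity, value, is_negated) claims.
def puzzle_eval_score(claims, gold_solution):
    """
    claims: list of (entity, value, negated) tuples extracted from the chain,
            e.g. ("Sam", "2013", False) or ("Sam", "2014", True) for "NOT 2014".
    gold_solution: dict mapping entity -> set of values true for that entity,
            e.g. {"Sam": {"2013"}}.
    Returns the fraction of extracted claims consistent with the gold solution.
    """
    if not claims:
        return 0.0
    correct = 0
    for entity, value, negated in claims:
        holds = value in gold_solution.get(entity, set())
        # A positive claim is right if the pair is in the solution;
        # a negated claim ("NOT 2014") is right if the pair is absent.
        if holds != negated:
            correct += 1
    return correct / len(claims)

# Example: two claims about Sam, one positive and one negated.
score = puzzle_eval_score(
    [("Sam", "2013", False), ("Sam", "2014", True)],
    {"Sam": {"2013"}},
)
print(score)  # 1.0
```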
Experimental Results: The Harsh Truth
So, how did the models perform? The results were sobering.
The researchers tested major models in a Zero-Shot Chain-of-Thought setting. They provided the puzzle and the instruction “Let’s think step by step.”
1. Final Answer Accuracy
If you look at pure accuracy (did the model fill the grid correctly?), the performance is abysmal.

Figure 3 shows the number of correctly solved puzzles out of 274.
- GPT-4: 14 puzzles solved (~5.1%).
- Claude-3: 10 puzzles.
- Gemini: 4 puzzles.
- Llama-2: 1 puzzle.
- Mistral: 0 puzzles.
For a task that requires strict logic, current LLMs are essentially guessing or failing completely. Even the most powerful model, GPT-4, failed 95% of the time.
2. PuzzleEval Scores vs. Accuracy
However, the PuzzleEval scores tell a more nuanced story.

Table 4 shows the Average Correctness Score (ACS). While accuracy was near zero, the PuzzleEval scores ranged from 0.27 to 0.59.
This indicates a “Logic Gap.” Models like GPT-4 are generating many correct reasoning steps (hence the 0.59 score), but they inevitably make a mistake that cascades, ruining the final answer. They are capable of local logical steps but struggle with global consistency across a long chain.
3. Where Exactly Do They Fail?
Using the taxonomy and the Auto-Evaluator, the researchers mapped the distribution of errors.

Figure 4 reveals the composition of the reasoning chains:
- NC (No Conclusion): A massive chunk of the generated text (especially for Gemini) is just filler or restating clues.
- RR (Right Premise, Right Conclusion): GPT-4 has the highest percentage of correct steps (blue bar), which aligns with its higher PuzzleEval score.
- RW/WW: These are the fatal errors.
The researchers also generated heatmaps to pinpoint the specific types of logical failures.

Figure 5 highlights that the most dominant error sub-categories are:
- Wrong Reasoning (RW-a): The premise is right, but the logic applied is flawed.
- Wrong Elimination (RW-c): Failing to cross out options that are no longer possible.
- Error Propagation (WW-4b): The most common “WW” error. Once a model makes one mistake, it treats that mistake as a fact for future steps, compounding the failure.
This confirms that LLMs behave like “greedy reasoners.” They make a deduction based on immediate probability without verifying if it conflicts with constraints set five steps ago.
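To see what that kind of global check would involve, here is a small sketch, not from the paper, of verifying a tentative deduction against every constraint before committing to it. The clue and the predicate are hypothetical.

```python
# A sketch of the global check a reasoner would need: before committing to a
# deduction, confirm the updated partial assignment still satisfies every
# constraint, not just the clue currently in focus.
def violates_any(assignment: dict, constraints) -> bool:
    """constraints: callables returning False when the partial assignment breaks them."""
    return any(not constraint(assignment) for constraint in constraints)

# Hypothetical clue: "Ana is not the chocolate buyer."
def ana_not_chocolate(assignment: dict) -> bool:
    return assignment.get("chocolate") != "Ana"

partial = {"chocolate": "Ana"}  # a 'greedy' deduction made several steps earlier
print(violates_any(partial, [ana_not_chocolate]))  # True -> must backtrack
```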
4. Do Prompting Strategies Help?
A common counter-argument in AI research is, “You just didn’t prompt it right.” The authors anticipated this and tested several advanced prompting techniques on a subset of the data:
- Plan-and-Solve: Ask the model to make a plan first.
- Self-Correct: Ask the model to verify its own answer.
- Self-Discover: A structured reasoning process.
- Program-of-Thought: Asking the model to write code to solve the puzzle.
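Of these, Self-Correct is the easiest to picture. A generic version of the loop looks like the sketch below; the `ask` helper is a stand-in for any chat-completion call, and the prompts are ours, not the authors’.

```python
# A generic self-correction loop (a sketch, not the authors' exact prompts):
# the model answers, is asked to verify its own answer against the clues,
# and optionally revises.
def self_correct(puzzle_text: str, ask, max_rounds: int = 2) -> str:
    answer = ask(f"{puzzle_text}\n\nSolve the puzzle. Let's think step by step.")
    for _ in range(max_rounds):
        critique = ask(
            f"{puzzle_text}\n\nProposed solution:\n{answer}\n\n"
            "Check every clue against this solution. If any clue is violated, "
            "explain the violation and produce a corrected solution; "
            "otherwise reply 'NO ISSUES'."
        )
        if "NO ISSUES" in critique:
            break
        answer = critique  # accept the revised solution for the next round
    return answer
```

As the results below show, this kind of loop gains little here: the verification pass suffers from the same logic gaps as the first attempt.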

The results in Table 5 are surprising. None of these strategies significantly improved performance.
- Plan-and-Solve actually dropped accuracy from 12 to 9.
- Self-Correct dropped it to 10.
- Self-Discover offered only a marginal gain (12 to 13).
This suggests that the deficit in logical reasoning is fundamental to the models’ current architecture and cannot simply be prompted away. Even when asked to self-correct, the models often hallucinate that their wrong answer is correct because they cannot fundamentally “check” their logic against the strict constraints of the grid.
Conclusion and Future Directions
The paper “Step-by-Step Reasoning to Solve Grid Puzzles” provides a sobering reality check for the AI community. While LLMs are linguistically fluent, their ability to perform sustained, error-free logical deduction is currently very poor.
Key Takeaways:
- Accuracy is Misleading: Models can have 0% accuracy but 60% step-wise correctness. We need metrics like PuzzleEval to understand the partial successes.
- The “Greedy” Problem: LLMs struggle to look ahead or backtrack. Once they make a “Wrong Reasoning” error, “Error Propagation” guarantees failure.
- Prompting isn’t a Magic Bullet: Standard tricks like Chain-of-Thought or Self-Correction do not fix fundamental logic gaps in grid puzzle contexts.
- Taxonomy Matters: By classifying errors (WW, RW, RR), we move from knowing that a model failed to understanding why it failed.
For students and researchers, this paper highlights a massive opportunity. We don’t just need bigger models; we need models that can verify their own logic, backtrack when a contradiction is found, and handle strict constraints without hallucinating. Until then, while an LLM might be able to write a poem about a logic puzzle, it certainly cannot solve one.