The capabilities of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini have revolutionized Natural Language Understanding (NLU). We have seen them write poetry, summarize legal documents, and even generate code. It is easy to look at these feats and assume that these models possess a form of robust logical reasoning. After all, if a model can write a convincing essay on philosophy, surely it can solve a logic puzzle, right?

Not necessarily. There is a growing body of evidence suggesting that while LLMs are excellent at pattern matching and linguistic probability, they struggle significantly with multi-step logical deduction.

In this post, we are doing a deep dive into a fascinating paper titled “Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?” by researchers from Arizona State University and Microsoft Research. This paper moves beyond simple accuracy metrics to answer a fundamental question: When an LLM tries to solve a logic puzzle, how exactly does it fail?

To answer this, the researchers developed a comprehensive framework involving a new dataset, a detailed error taxonomy, and automated evaluation pipelines. Let’s unravel their work step-by-step.

The Problem with Current Evaluations

Most benchmarks for LLMs focus on the final answer. If you ask a model a math word problem and it gives the correct number, we mark it as “correct.” However, this binary pass/fail metric hides the messiness of the reasoning process. An LLM might arrive at the right answer through a hallucinated formula (right answer, broken reasoning), or it might reason perfectly for ten steps and slip only on the eleventh (wrong answer, mostly sound reasoning).

To truly test logical capabilities, we need a task that relies purely on deduction, constraint satisfaction, and elimination, with zero ambiguity. Enter the Logic Grid Puzzle.

What is a Grid Puzzle?

You have likely seen these in puzzle magazines. You are given a scenario (e.g., “Five friends bought five different ice creams on five different days”) and a set of clues (e.g., “The person who bought chocolate did so the day before Tom”). Your goal is to fill a grid to match every variable correctly.

Grid puzzles are the perfect testbed for LLMs because:

  1. They are self-contained: No external knowledge is needed.
  2. They are strict: There is only one correct solution.
  3. They require memory: You must hold multiple constraints in “working memory” simultaneously.

The researchers introduced GridPuzzle, a curated dataset of 274 puzzles ranging in size from \(3 \times 4\) (easy) to \(4 \times 6\) (hard). They used this dataset to stress-test models like GPT-4, Claude-3, Gemini, Mistral, and Llama-2.

Figure 1: Schematic representation of the proposed pipeline. It begins with collection of the GridPuzzle dataset (top left) and evaluation of various LLMs in a zero-shot CoT setting (bottom left), then moves to manual analysis of the LLMs’ reasoning chains to identify error types (top right), and finally automates this analysis with an LLM that checks the correctness of a reasoning chain by finding errors (bottom right).

As shown in Figure 1, the pipeline is extensive. It starts with the GridPuzzle dataset, feeds it into LLMs using Zero-Shot Chain-of-Thought (CoT) prompting, and then subjects the output to both manual and automated scrutiny.
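To make that setup concrete, here is a minimal sketch of how a puzzle could be fed to a model in the zero-shot CoT setting. The prompt wording and the use of the `openai` client are illustrative assumptions, not the paper’s exact prompt:

```python
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

puzzle_text = """Five friends bought five different ice creams on five different days.
Clue 1: The person who bought chocolate did so the day before Tom.
Clue 2: ..."""  # a GridPuzzle instance would be pasted here

# Zero-shot CoT: no worked examples, just the puzzle plus the trigger phrase.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": puzzle_text
        + "\n\nLet's think step by step, then give the completed grid as the final answer.",
    }],
)
reasoning_chain = response.choices[0].message.content
```

The free-text `reasoning_chain` returned here is exactly the artifact the rest of the pipeline dissects.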

Dissecting the Reasoning Chain

The core contribution of this paper is not just finding out if models fail, but where. To do this, the authors scrutinized the “Reasoning Chains”—the step-by-step explanations generated by the models.

They performed a manual analysis of 150 reasoning chains, breaking them down sentence by sentence. For every sentence in a model’s output, they identified the Premise (the information the model is using) and the Conclusion (the deduction the model is making).

Based on this analysis, they proposed a new Error Taxonomy. This is a framework for classifying logical mistakes, which is incredibly useful for anyone studying AI interpretability.

The Error Taxonomy

The taxonomy is split into broad categories and fine-grained sub-categories.

Broad Categories:

  1. WW (Wrong Premise, Wrong Conclusion): The model starts with bad info and ends with a bad deduction.
  2. WR (Wrong Premise, Right Conclusion): The model gets lucky. It starts with false info but accidentally derives a true statement.
  3. RW (Right Premise, Wrong Conclusion): This is a critical failure of logic. The model has the correct facts but draws an invalid inference.
  4. RR (Right Premise, Right Conclusion): A correct reasoning step.
  5. NC (No Conclusion): Filler text or simple restatement of clues without new deduction.

The researchers went deeper, identifying specific sub-categories to explain why a premise or conclusion was wrong.

Table 2: Proposed error taxonomy for sub-categories based on manual analysis. These sub-categories are defined for cases where either the conclusion or the premise is incorrect (“RW” or “WR”) or both are incorrect (“WW”). For “WW”, the error sub-categories may appear in any combination of (1-6) and (a-c), such as ‘1a’, ‘4b’, or ‘6c’.

Table 2 details these sub-categories. Some notable ones include:

  • Hallucination: The model invents information not present in the clues.
  • Error Propagation: A step is wrong because a previous step was wrong (the “snowball effect”).
  • Wrong Elimination: A specific failure in grid puzzles where the model fails to correctly rule out impossible options.

This taxonomy transforms the evaluation from a simple “wrong answer” to a diagnostic report. For example, if a model consistently makes “RW” errors (Right Premise, Wrong Conclusion), we know it struggles with the core logic engine. If it makes “WW” errors due to “Error Propagation,” we know it struggles with maintaining long-context consistency.
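To see how this becomes a diagnostic report in practice, imagine tallying the per-step labels for a reasoning chain. The snippet below is a toy sketch of that bookkeeping (the labels are invented), not the authors’ code:

```python
from collections import Counter
from enum import Enum

class BroadError(Enum):
    RR = "right premise, right conclusion"
    RW = "right premise, wrong conclusion"
    WR = "wrong premise, right conclusion"
    WW = "wrong premise, wrong conclusion"
    NC = "no conclusion"

def diagnose(step_labels: list[BroadError]) -> dict[str, float]:
    """Turn per-step labels into a percentage breakdown per broad category."""
    counts = Counter(step_labels)
    total = len(step_labels)
    return {err.name: round(100 * counts[err] / total, 1) for err in BroadError}

# Hypothetical labels for one five-step reasoning chain:
chain = [BroadError.RR, BroadError.RR, BroadError.NC, BroadError.RW, BroadError.WW]
print(diagnose(chain))
# {'RR': 40.0, 'RW': 20.0, 'WR': 0.0, 'WW': 20.0, 'NC': 20.0}
```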

Automating the Critic: The Auto-Evaluator

Manually annotating reasoning chains is slow and expensive. To scale this up to thousands of steps, the researchers created an Auto-Evaluator. They prompted GPT-4o with their error taxonomy and strict instructions to act as a judge.

The prompt structure for this auto-evaluator is robust. It includes:

  • System Instructions: The rules of the evaluation.
  • Knowledge Base: The definitions of the error taxonomy (from Table 2).
  • Exemplars: Examples of human-annotated reasoning chains.
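As a rough illustration of how those three pieces fit together (the wording, field names, and placeholders below are assumptions, not the paper’s actual prompt):

```python
SYSTEM_INSTRUCTIONS = (
    "You are a strict evaluator. For every sentence in the reasoning chain, "
    "extract its premise and conclusion, assign one broad error category "
    "(RR, RW, WR, WW, NC) and, where applicable, a sub-category. "
    "Return a JSON list of objects with keys: sentence, premise, conclusion, "
    "error_category, sub_category."
)
TAXONOMY_DEFINITIONS = "..."  # the sub-category definitions from Table 2 would go here
EXEMPLARS = "..."             # a few human-annotated reasoning chains would go here

def build_judge_messages(puzzle: str, reasoning_chain: str) -> list[dict]:
    """Assemble the judge prompt: system instructions + knowledge base + exemplars."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "user", "content": (
            f"Error taxonomy:\n{TAXONOMY_DEFINITIONS}\n\n"
            f"Annotated examples:\n{EXEMPLARS}\n\n"
            f"Puzzle:\n{puzzle}\n\n"
            f"Reasoning chain to evaluate:\n{reasoning_chain}"
        )},
    ]
```

These messages would then be sent to GPT-4o in the same way as the zero-shot prompt shown earlier.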

Figure 9: The top left section of the figure consists of a 3x4 sample puzzle from the GridPuzzle dataset along with the Zero-shot-CoT prompt. Right below the prompt, we have the Gold solution for the corresponding puzzle. In the top right section of the figure, we have the Model-generated Reasoning chain to solve this puzzle along with the Final Answer. In this instance, the reasoning chain was generated by the Llama2-13b model. In the bottom half of the figure, we have the GPT-4o Auto-Evaluated Reasoning chain. The auto-evaluation is done sentence-wise and the output is in a JSON-structured format consisting of 5 components: the Sentence, the Premise, the Conclusion, the Error category and the Sub-category.

Figure 9 illustrates this process in action. You can see the original puzzle, the model’s (Llama-2) flawed attempt, and then the Auto-Evaluator’s JSON output. The evaluator breaks down a sentence, identifies the premise and conclusion, explains the flaw (e.g., “The conclusion is incorrect as it assumes Underwood is staying for 2 days without sufficient information”), and assigns the error code (RW - Right Premise, Wrong Conclusion).

Validation showed that the Auto-Evaluator had an ~86% agreement rate with human annotators, making it a reliable proxy for large-scale analysis.
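Percent agreement here is simply the fraction of steps on which the human and the auto-evaluator assign the same label; a minimal sketch with made-up labels:

```python
human = ["RR", "RW", "NC", "WW", "RR", "RR", "RW"]
auto  = ["RR", "RW", "NC", "RW", "RR", "RR", "RW"]

agreement = sum(h == a for h, a in zip(human, auto)) / len(human)
print(f"Agreement: {agreement:.0%}")  # 86% for this toy example
```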

PuzzleEval: A New Metric for Reasoning

Identifying errors is great, but we also need a quantitative score. The researchers introduced PuzzleEval, a reference-free metric. “Reference-free” means you don’t need a “perfect” human-written explanation to compare against; you only need the final correct answer key (the Gold Solution).

How PuzzleEval Works: The metric calculates a Correctness Score for the reasoning chain by verifying if the intermediate deductions match the ground truth.

Figure 2: The process of calculating PuzzleEval metrics is described above. The reasoning chains are produced by our five LLMs and the gold solution is taken from our GridPuzzle dataset.

As visualized in Figure 2, the process has three stages:

  1. Conclusion Extraction: The LLM (acting as a parser) reads a reasoning step and extracts the logical claim (e.g., “Therefore, Sam is assigned 2013”).
  2. Pair-wise Extraction: This claim is converted into structured pairs (e.g., (Sam, 2013) or (Sam, NOT 2014)).
  3. Validation: These pairs are checked against the Gold Solution table. If the pair exists in the solution, it gets a 1; otherwise, a 0.

The final score is the average correctness across all steps. This allows us to give partial credit. A model might reason perfectly for 90% of the puzzle and fail at the very end. Standard accuracy would give it a 0, but PuzzleEval might give it a 0.9, accurately reflecting that it can reason, just not robustly enough to finish.
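Putting the three stages together, the scoring step boils down to checking each extracted (entity, value) pair against the gold grid and averaging over steps. The sketch below is a simplified reading of that process, assuming conclusions have already been parsed into pairs (a negated pair asserts that a match should NOT hold):

```python
GoldSolution = dict[str, str]   # e.g. {"Sam": "2013", "Tom": "2014"}
Pair = tuple[str, str, bool]    # (entity, value, is_negated)

def step_correct(pairs: list[Pair], gold: GoldSolution) -> int:
    """A step scores 1 only if every pair it asserts agrees with the gold solution."""
    for entity, value, negated in pairs:
        holds = gold.get(entity) == value
        if holds == negated:    # asserted pair missing, or negated pair actually present
            return 0
    return 1

def puzzle_eval(steps: list[list[Pair]], gold: GoldSolution) -> float:
    """Average correctness over all reasoning steps that make a claim."""
    scores = [step_correct(pairs, gold) for pairs in steps if pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: two correct deductions, then a wrong one.
gold = {"Sam": "2013", "Tom": "2014", "Ava": "2015"}
steps = [
    [("Sam", "2013", False)],   # "Therefore, Sam is assigned 2013"  -> 1
    [("Tom", "2015", True)],    # "Tom is NOT 2015"                  -> 1
    [("Ava", "2014", False)],   # wrong deduction                    -> 0
]
print(puzzle_eval(steps, gold))  # 0.666...
```

In this toy example the chain earns roughly 0.67 even though its final deduction is wrong, which is exactly the kind of partial credit that standard accuracy throws away.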

Experimental Results: The Harsh Truth

So, how did the models perform? The results were sobering.

The researchers tested major models in a Zero-Shot Chain-of-Thought setting. They provided the puzzle and the instruction “Let’s think step by step.”

1. Final Answer Accuracy

If you look at pure accuracy (did the model fill the grid correctly?), the performance is abysmal.

Figure 3: Performance of five different LLMs in terms of accuracy on the GridPuzzle dataset.

Figure 3 shows the number of correctly solved puzzles out of 274.

  • GPT-4: 14 puzzles solved (~5.1%).
  • Claude-3: 10 puzzles.
  • Gemini: 4 puzzles.
  • Llama-2: 1 puzzle.
  • Mistral: 0 puzzles.

For a task that requires strict logic, current LLMs are essentially guessing or failing completely. Even the most powerful model, GPT-4, failed 95% of the time.

2. PuzzleEval Scores vs. Accuracy

However, the PuzzleEval scores tell a more nuanced story.

Table 4: The results for PuzzleEval on the different grid sizes available in the GridPuzzle dataset, in terms of ACS.

Table 4 shows the Average Correctness Score (ACS). While accuracy was near zero, the PuzzleEval scores ranged from 0.27 to 0.59.

This indicates a “Logic Gap.” Models like GPT-4 are generating many correct reasoning steps (hence the 0.59 score), but they inevitably make a mistake that cascades, ruining the final answer. They are capable of local logical steps but struggle with global consistency across a long chain.

3. Where Exactly Do They Fail?

Using the taxonomy and the Auto-Evaluator, the researchers mapped the distribution of errors.

Figure 4: The percentage distribution of the broad error categories across the combined reasoning steps of all five LLMs. The total number of steps generated by each model is provided inside the round brackets below the model names.

Figure 4 reveals the composition of the reasoning chains:

  • NC (No Conclusion): A massive chunk of the generated text (especially for Gemini) is just filler or restating clues.
  • RR (Right Premise, Right Conclusion): GPT-4 has the highest percentage of correct steps (blue bar), which aligns with its higher PuzzleEval score.
  • RW/WW: These are the fatal errors.

The researchers also generated heatmaps to pinpoint the specific types of logical failures.

Figure 5: The first five sub-figures show the error sub-category distribution for each of the five LLMs. The last sub-figure shows the distribution of the top 10 error sub-categories across the reasoning steps of all models.

Figure 5 highlights that the most dominant error sub-categories are:

  • Wrong Reasoning (RW-a): The premise is right, but the logic applied is flawed.
  • Wrong Elimination (RW-c): Failing to cross out options that are no longer possible.
  • Error Propagation (WW-4b): The most common “WW” error. Once a model makes one mistake, it treats that mistake as a fact for future steps, compounding the failure.

This confirms that LLMs behave like “greedy reasoners.” They make a deduction based on immediate probability without verifying if it conflicts with constraints set five steps ago.

4. Do Prompting Strategies Help?

A common counter-argument in AI research is, “You just didn’t prompt it right.” The authors anticipated this and tested several advanced prompting techniques on a subset of the data:

  • Plan-and-Solve: Ask the model to make a plan first.
  • Self-Correct: Ask the model to verify its own answer.
  • Self-Discover: A structured reasoning process.
  • Program-of-Thought: Asking the model to write code to solve the puzzle.

Table 5: The results for accuracy and PuzzleEval using GPT-4-Turbo, with and without mitigation strategies, for the 60 samples of the 3x4 grid size.

The results in Table 5 are surprising. None of these strategies significantly improved performance.

  • Plan-and-Solve actually dropped the number of solved puzzles from 12 to 9.
  • Self-Correct dropped it to 10.
  • Self-Discover offered only a marginal gain (12 to 13).

This suggests that the deficit in logical reasoning is fundamental to the models’ current architecture and cannot simply be prompted away. Even when asked to self-correct, the models often hallucinate that their wrong answer is correct because they cannot fundamentally “check” their logic against the strict constraints of the grid.
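For contrast, the kind of program that Program-of-Thought prompting hopes the model will produce is easy for a machine to execute once the constraints are encoded faithfully; the hard part is the encoding itself. Here is a toy brute-force solver for an invented three-person puzzle (the clues are mine, not from GridPuzzle):

```python
from itertools import permutations

people = ["Tom", "Sam", "Ava"]
days = [1, 2, 3]

def satisfies(assignment: dict[str, int]) -> bool:
    # Invented clues: Sam went the day before Tom; Ava did not go on day 3.
    return assignment["Sam"] == assignment["Tom"] - 1 and assignment["Ava"] != 3

solutions = []
for perm in permutations(days):
    assignment = dict(zip(people, perm))
    if satisfies(assignment):
        solutions.append(assignment)

print(solutions)  # a well-posed puzzle leaves exactly one assignment standing
```

A solver like this never propagates an error, because every candidate grid is checked against every clue; that global consistency check is precisely what the LLMs’ step-by-step chains lack.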

Conclusion and Future Directions

The paper “Step-by-Step Reasoning to Solve Grid Puzzles” provides a sobering reality check for the AI community. While LLMs are linguistically fluent, their ability to perform sustained, error-free logical deduction is currently very poor.

Key Takeaways:

  1. Accuracy is Misleading: Models can have near-zero accuracy yet roughly 60% step-wise correctness. We need metrics like PuzzleEval to capture these partial successes.
  2. The “Greedy” Problem: LLMs struggle to look ahead or backtrack. Once they make a “Wrong Reasoning” error, “Error Propagation” guarantees failure.
  3. Prompting isn’t a Magic Bullet: Standard tricks like Chain-of-Thought or Self-Correction do not fix fundamental logic gaps in grid puzzle contexts.
  4. Taxonomy Matters: By classifying errors (WW, RW, RR), we move from knowing that a model failed to understanding why it failed.

For students and researchers, this paper highlights a massive opportunity. We don’t just need bigger models; we need models that can verify their own logic, backtrack when a contradiction is found, and handle strict constraints without hallucinating. Until then, while an LLM might be able to write a poem about a logic puzzle, it certainly cannot solve one.