If you ask a student to solve the equation \(x^2 + x = 3\), they might grab a piece of paper, use the quadratic formula, and give you two precise irrational solutions involving square roots. But if you change the question slightly to “Find the integer solution of the equation \(x^2 + x = 3\),” the student’s behavior changes. They will solve it, realize that neither solution is an integer, and correctly answer: “There are no integer solutions.”
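
Carrying the calculation through makes the trap concrete:

\[
x^2 + x = 3 \;\Longrightarrow\; x^2 + x - 3 = 0 \;\Longrightarrow\; x = \frac{-1 \pm \sqrt{13}}{2} \approx 1.30 \text{ or } -2.30.
\]

Since \(\sqrt{13}\) is irrational, neither root is an integer, which is exactly the observation the trap hinges on.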

To do this, the human brain performs a feat called systematic compositionality. We combine two distinct pieces of knowledge: (1) the algebraic skill to solve a quadratic equation, and (2) the conceptual understanding of what an “integer” is. We don’t need to have seen this specific trick question before to solve it; we spontaneously combine our skills to handle the new logic.

Large Language Models (LLMs) like GPT-4 have stunned the world with their mathematical prowess. But do they possess this same systematic compositionality? Or are they merely reciting reasoning paths they’ve seen during training?

In the paper “Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems,” researchers from Fudan University take a deep dive into this question. They introduce a clever new dataset called MATHTRAP to test whether LLMs can spot logical traps that require combining different knowledge domains. The results reveal a fascinating—and somewhat concerning—gap between having knowledge and being able to use it.

The Problem: Knowledge vs. Composition

Human cognition is inherently compositional. As defined by Fodor and Pylyshyn in 1988, this is the algebraic capacity to generate infinitely many novel combinations from a finite set of learned components. In the context of math, if you know the rules of geometry and you know the properties of prime numbers, you should be able to solve a problem that combines both, even if you’ve never seen them mixed together before.

Recent LLMs have shown high accuracy on standard benchmarks like GSM8K (grade school math) and MATH (competition math). However, the researchers suspect that this success might be partly due to the models following familiar reasoning “paths” seen during training.

To test this, the authors propose a definition of Compositionality in Mathematical Reasoning. Simply put: If a model can solve Problem A (pure math) and possesses Knowledge B (a specific concept), it should be able to solve Problem C (Problem A + Knowledge B combined). If it fails Problem C while passing A and B, it lacks compositionality.
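
In schematic form (the notation here is mine, not the paper’s):

\[
\text{Solve}(A) \,\wedge\, \text{Know}(B) \,\wedge\, \neg\,\text{Solve}(C) \;\Longrightarrow\; \text{compositional deficiency},
\]

where Problem C is the composition of Problem A with Knowledge B.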

The Method: Setting the Trap

To evaluate this capability, the researchers constructed the MATHTRAP dataset. This isn’t just a collection of hard math problems; it is a systematic test of knowledge integration.

The dataset is built around problem triplets. For every data point, there are three distinct components:

  1. The Original Problem: A standard problem taken from existing datasets (MATH or GSM8K).
  2. The Conceptual Problem: A simple question testing the model’s understanding of a specific concept (the trap).
  3. The Trap Problem: The original problem modified with a logical flaw or contradiction based on that concept.

The goal is simple: if the model answers both the Original Problem and the Conceptual Problem correctly, it has all the necessary knowledge. A failure on the Trap Problem therefore demonstrates an inability to compose that knowledge spontaneously.
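
To make the triplet structure and the pass/fail logic concrete, here is a minimal Python sketch; the field names and the `shows_compositional_gap` helper are my own illustration, not the paper’s released format:

```python
from dataclasses import dataclass

@dataclass
class MathTrapTriplet:
    """One MATHTRAP data point: original problem, conceptual probe, and trap variant."""
    original: str    # standard problem taken from MATH or GSM8K
    conceptual: str  # simple question testing the specific concept behind the trap
    trap: str        # the original problem modified to contain a logical flaw

def shows_compositional_gap(solved_original: bool,
                            solved_conceptual: bool,
                            solved_trap: bool) -> bool:
    """True when the model has both pieces of knowledge but fails to combine them."""
    return solved_original and solved_conceptual and not solved_trap

# A model that solves the original and the conceptual probe but misses the trap:
print(shows_compositional_gap(True, True, False))  # -> True
```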

Five Types of Traps

The researchers didn’t just add random noise; they categorized the traps into five distinct types of logical pitfalls.

Table 1 showing the overview of the MATHTRAP Dataset with examples of original vs trap problems.

As shown in Table 1 above, the dataset covers a variety of logical pitfalls:

  • Concept Undefined (16%): The math looks solvable, but uses an undefined term, such as asking for \(\tan(90^{\circ})\).
  • Missing Condition (6%): The problem asks for a total (e.g., sales in April and June) but only provides data for part of it (April and May).
  • Direct Contradiction (24%): The premise contradicts itself immediately, like an equilateral triangle with a perimeter of 30 but a height of 10, which is geometrically impossible (see the quick check after this list).
  • Indirect Contradiction (38%): The contradiction only appears after solving the math, such as the integer quadratic equation mentioned in the introduction.
  • Violating Common Sense (15%): Asking for a probability of picking 5 different suits from a standard deck of cards (which only has 4 suits).
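
To see why the equilateral-triangle example is self-contradictory, a one-line check suffices:

\[
\text{perimeter} = 30 \;\Rightarrow\; \text{side} = 10 \;\Rightarrow\; \text{height} = \frac{\sqrt{3}}{2}\cdot 10 = 5\sqrt{3} \approx 8.66 \neq 10.
\]

A model locked into “compute the answer” mode will happily use the stated height of 10 anyway.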

These problems are “unseen” in the sense that they rarely appear in training data. Real-world math textbooks usually contain solvable problems. By breaking this pattern, the researchers force the model to think, not just recite.

The Experiment: Do LLMs Fall for It?

The researchers tested a wide range of models, including proprietary giants like GPT-4 and Claude 3 Opus, as well as open-source models like Llama-3 and MetaMath.

The evaluation metric was the Ratio: The accuracy on Trap Problems divided by the accuracy on Original Problems. A ratio of 100% would mean the model is just as good at spotting traps as it is at doing standard math.
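
In formula form:

\[
\text{Ratio} = \frac{\text{Accuracy}_{\text{Trap}}}{\text{Accuracy}_{\text{Original}}} \times 100\%.
\]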

The Compositionality Gap

The results were stark. While models performed well on the Original Problems and excellently on the Conceptual Problems, they crumbled when facing the Trap Problems.

Table 2 showing accuracy of various models on Conceptual, Original, and Trap problems.

Table 2 provides the breakdown. Let’s look at GPT-4-0125-preview:

  • Conceptual Accuracy: 90.0% (It clearly understands the relevant concepts in isolation).
  • Original Accuracy: 70.3% (It is very good at solving the math).
  • Trap Accuracy: 36.0% (It fails to apply the concept to the math).

This results in a ratio of just 51.2%. The model possesses the bricks (knowledge) but cannot build the house (solution) when the blueprint changes slightly.

Open-source models fared even worse. For example, MetaMath-7B, a model specifically fine-tuned for math, achieved a ratio of only 5.84%. This suggests that “math-specific” training might actually make models more prone to blindly following formulas rather than reasoning about the problem statement.

Visualizing the Failure

To understand what these failures look like, we can examine the model outputs. In many cases, the model will confidently calculate an answer that is mathematically impossible because it ignores the trap.

Table 14 showing specific responses of GPT-4 to original and conceptual problems.

In Table 14, we see GPT-4 correctly identifying that \(\tan(90^{\circ})\) is undefined in the Conceptual Problem. However, if a trap problem asks for calculations involving a right triangle that implies an angle of \(90^{\circ}\) (where the tangent would be undefined), the model often blindly proceeds with the Pythagorean theorem or trigonometric ratios, completely ignoring its own knowledge that the value doesn’t exist.
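
The conceptual fact itself is elementary:

\[
\tan\theta = \frac{\sin\theta}{\cos\theta}, \qquad \cos 90^{\circ} = 0,
\]

so \(\tan 90^{\circ}\) is undefined, and any problem whose premises force that value has no valid answer.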

Human vs. Machine

Are these trap problems just impossibly hard? Maybe humans would fail them too?

To verify this, the researchers recruited undergraduate students from top universities. They ensured the students didn’t know they were being tested on “trap” problems initially.

Table 3 showing human accuracy on MATHTRAP.

Table 3 highlights a massive difference in cognition.

  • Humans (without notice): 83.8% accuracy on Trap Problems.
  • Humans (with notice): 95.1% accuracy.
  • Human Ratio: ~86% (compared to GPT-4’s 51%).

Humans demonstrated strong spontaneous compositionality. Even without being warned, they noticed, “Hey, this doesn’t make sense.” When they did miss a trap, a simple “heads up” was enough to lift their accuracy above 95%. LLMs, on the other hand, struggle to “wake up” and question the premise.

Can We Fix It? Interventions and Solutions

The researchers explored three methods to mitigate this deficiency: Prompting, Few-Shot Learning, and Fine-Tuning.

1. Natural Language Prompts & Few-Shot Demos

The simplest approach is to warn the model. By adding a prompt like “Note that this problem may be unsolvable,” or providing 1 or 5 examples (shots) of trap problems, performance improved.
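
A minimal sketch of how such an intervention could be assembled; the warning sentence follows the paper’s wording, but the prompt layout and the placeholder demonstration are my own assumptions:

```python
# Build a few-shot prompt that warns the model about possible traps.
# The demonstration below is a placeholder in the spirit of MATHTRAP, not an actual dataset item.
WARNING = "Note that this problem may be unsolvable."

def build_prompt(question: str, demos: list[tuple[str, str]]) -> str:
    """Prepend k worked trap demonstrations and a warning before the real question."""
    parts = []
    for demo_q, demo_a in demos:  # k-shot demonstrations (k = 1 or 5 in the paper's setup)
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append(f"{WARNING}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)

demos = [(
    "What is the probability of drawing 5 cards of 5 different suits from a standard deck?",
    "A standard deck has only 4 suits, so the event is impossible; the question has no valid answer.",
)]
print(build_prompt("Find the integer solution of the equation x^2 + x = 3.", demos))
```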

Table 4 showing the impact of external intervention methods like prompts and few-shot learning.

As Table 4 shows, providing a 5-shot demonstration (showing the model 5 examples of how to handle traps) helped significantly. For Claude 3 Opus, the accuracy on trap problems jumped from 19.0% to 56.1%.

However, notice the trade-off. For some models, checking for traps confused them on original (solvable) problems, causing accuracy on standard math to drop. The models struggle to balance skepticism with capability.

2. Fine-Tuning

The researchers also tried fine-tuning the models on the MATHTRAP dataset (specifically a public subset of it).

Table 5 showing the impact of fine-tuning data configurations.

Table 5 shows that fine-tuning (seen in the MetaMath395K+MathTrap1K row) drastically improves Trap accuracy (from 6.36% to 29.1%). However, it comes at a cost: the model’s ability to solve standard Original problems decreases (from 41.4% to 33.3%). This suggests the model isn’t necessarily learning how to reason; it’s just overfitting to the idea that “some questions are traps,” making it overly cautious or confused.
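
The MetaMath395K+MathTrap1K configuration amounts to mixing roughly 1K trap examples into a 395K-example math instruction set. A rough sketch of that data preparation follows; the file names and JSONL layout are my assumptions, not the paper’s actual release:

```python
import json
import random

def load_jsonl(path: str) -> list[dict]:
    """Read one instruction-tuning example per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hypothetical file names standing in for the two data sources.
metamath = load_jsonl("metamath_395k.jsonl")      # ~395K standard math examples
mathtrap = load_jsonl("mathtrap_train_1k.jsonl")  # ~1K trap examples (public subset)

mixture = metamath + mathtrap
random.shuffle(mixture)  # interleave so trap examples are spread across training

with open("finetune_mix.jsonl", "w", encoding="utf-8") as f:
    for example in mixture:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```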

3. The “Slow Thinking” Hope (OpenAI o1)

One of the most promising findings in the paper involves OpenAI’s recently released o1 model. This model is designed for “System 2” thinking—it takes more time to process and reason before outputting an answer.

Referring back to Table 2, the o1-preview (Web) model achieved a ratio of 77.4%, which is much closer to human performance (85.9%) than any other model. This suggests that the “slow thinking” inference-time scaling might be the key to unlocking true compositionality, allowing the model to double-check its own logic against the problem constraints before committing to a calculation.

Conclusion and Key Takeaways

The MATHTRAP study exposes a critical limitation in current Large Language Models. While they act as incredibly comprehensive encyclopedias of mathematical methods (Original Problems) and definitions (Conceptual Problems), they lack the spontaneous ability to weave these two strands together when the path isn’t laid out for them.

Key takeaways for students and researchers:

  1. Don’t Mistake Knowledge for Reasoning: A model can know \(A\) and know \(B\) without knowing that \(A+B\) implies \(C\).
  2. The “Blind Calculation” Effect: LLMs have a strong bias toward calculating an answer even when the premises are flawed. They prioritize the “procedure” over the “logic.”
  3. Human Cognition is Distinct: The ease with which humans handle these traps highlights that biological reasoning is fundamentally more flexible and compositional than the reasoning exhibited by current transformer architectures.
  4. Hope in “Slow Thinking”: The success of the o1 model suggests that future architectures that allow for iterative reasoning (thinking before speaking) may eventually bridge this gap.

Systematic compositionality remains an open challenge. Until AI can spontaneously say, “Wait a minute, that question doesn’t make sense,” it is still merely imitating the shape of reasoning rather than truly understanding the content.