Introduction
In the rapidly evolving landscape of Artificial Intelligence, we have become accustomed to headlines declaring that Large Language Models (LLMs) have conquered yet another human milestone. We see models acing the Bar Exam, performing at graduate levels in physics, and solving complex code challenges. If you look at popular leaderboards, it seems we are approaching a saturation point where AI capabilities match or exceed specialized human performance.
But there is a catch.
A growing body of research suggests that this high performance might be partly due to “contamination.” Because LLMs are trained on vast swathes of the internet, they may have simply memorized the questions and answers to standard benchmarks like the Grade-School Math (GSM8K) dataset. They aren’t necessarily reasoning; they are remembering.
Enter Mathador-LM, a fascinating new study that challenges this paradigm. The researchers introduce a dynamic, game-based benchmark that cannot be memorized because it is generated on the fly. The results are startling: while state-of-the-art models like GPT-4 and Claude 3 score exceptionally high on standard tests, they fail spectacularly at Mathador-LM—scoring significantly lower than the average 3rd-grade student.
In this post, we will break down the Mathador-LM paper, explain the game mechanics that stump AI, and explore what this reveals about the true state of mathematical reasoning in Large Language Models.
The Problem with Static Benchmarks
To understand why Mathador-LM is necessary, we must first look at the current state of LLM evaluation. The most common metrics used today include benchmarks like MMLU (Massive Multitask Language Understanding) and GSM8K (Grade School Math).
As shown in the image below, modern models are reaching saturation on these benchmarks. The blue and green lines representing MMLU and GSM8K are skyrocketing toward 90-100% accuracy.

However, look at the bottom left of the chart. That flat line near zero? That is the performance on Mathador-LM.
The discrepancy is massive. Models that theoretically possess “specialized human” knowledge (according to MMLU) are performing worse than young children on Mathador. This supports the hypothesis of test-set leakage: models look smart on standard tests because they have seen the questions during training. Mathador-LM, being dynamically generated, offers a “clean” test of reasoning that models cannot cheat on.
The Mathador-LM Benchmark: How It Works
The benchmark is based on “Mathador,” a popular mathematical game in France used to teach arithmetic to students from 3rd to 8th grade.
The Rules of the Game
The premise is simple but requires genuine planning and combinatorial reasoning; a small rule-checking sketch follows the worked example below.
- Inputs: You are given a set of 5 “base numbers” (operands) and a “target number.”
- Goal: You must reach the target number using the base numbers.
- Constraints:
  - You can use the four basic operations: addition (+), subtraction (-), multiplication (\(\times\)), and division (\(\div\)).
  - You can use each base number at most once.
  - You are not required to use all numbers, but you get more points if you do.
  - Intermediate results must be non-negative integers (no fractions, no negative numbers).
The researchers use a structured prompt to feed these problems to the LLMs. The prompt includes the rules and the specific instance to solve.

As seen in the example above (Figure 3), if the target is 34 and the base numbers are 4, 2, 8, 11, 17:
- A Simple Solution might be \(2 \times 17 = 34\). It is correct, but it uses only two of the five numbers.
- A Mathador Solution (the optimal play) chains all five numbers and all four operations: \(8+4=12\), then \(12-11=1\), then \(17 \div 1=17\), then \(17 \times 2=34\).
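To make the rules concrete, here is a minimal Python sketch of a rule checker for a step-by-step solution, applied to the Mathador solution above. It is our own illustration of the constraints (each base number used at most once, only the four basic operations, non-negative integer intermediates), not code from the paper; the step-tuple format is an assumption of this sketch.

```python
def check_solution(steps, base_numbers, target):
    """Check a list of steps such as [(8, '+', 4, 12), (12, '-', 11, 1), ...]
    against the Mathador rules described above."""
    available = list(base_numbers)      # base numbers still unused, plus earlier results
    result = None
    for a, op, b, claimed in steps:
        for x in (a, b):
            if x not in available:
                return False            # operand is neither an unused base number nor a prior result
            available.remove(x)
        if op == '+':
            result = a + b
        elif op == '-':
            result = a - b
        elif op == '*':
            result = a * b
        elif op == '/':
            if b == 0 or a % b != 0:
                return False            # fractional (or undefined) intermediate result
            result = a // b
        else:
            return False                # operation outside +, -, *, /
        if result != claimed or result < 0:
            return False                # wrong arithmetic or negative intermediate
        available.append(result)        # the result can be reused in later steps
    return result == target

# The Mathador solution for the Figure 3 instance (target 34, base numbers 4, 2, 8, 11, 17):
steps = [(8, '+', 4, 12), (12, '-', 11, 1), (17, '/', 1, 17), (17, '*', 2, 34)]
print(check_solution(steps, [4, 2, 8, 11, 17], 34))   # True
```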
The Scoring System
Mathador is not just binary (pass/fail). It uses a point system that incentivizes complexity and the use of difficult operations like division. This allows researchers to grade the quality of the reasoning, not just the correctness of the result.

The ultimate goal is the Mathador Bonus: reaching the target using all 5 numbers and all 4 operations exactly once.
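The post does not spell out the exact point values, so the weights in the sketch below are placeholders; only the structure is taken from the description above (harder operations like division are worth more, and the full Mathador combination earns a bonus).

```python
# Placeholder weights -- illustrative only, not the official Mathador point values.
OP_POINTS = {'+': 1, '*': 1, '-': 2, '/': 3}   # division assumed hardest, per the text
MATHADOR_BONUS = 5                             # extra points for the full combination

def score_solution(steps, base_numbers):
    """Score a valid step list [(a, op, b, result), ...]: sum per-operation points and
    add a bonus if all 5 base numbers and all 4 operations are used exactly once."""
    ops_used = [op for _, op, _, _ in steps]
    points = sum(OP_POINTS[op] for op in ops_used)
    # With chained steps, using all 5 base numbers means exactly 4 steps.
    uses_all_numbers = len(steps) == len(base_numbers) - 1
    uses_all_ops = sorted(ops_used) == sorted(OP_POINTS)
    if uses_all_numbers and uses_all_ops:
        points += MATHADOR_BONUS
    return points
```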
Formalizing the Challenge
From a computational science perspective, Mathador is a search problem. The model must navigate a space of possible arithmetic expressions. The authors define this formally to ensure the benchmark is rigorous.
The set of valid expressions \(\mathcal{E}_P\) is defined by the permutations of the operands together with the placement of operators and parentheses.
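Schematically (a paraphrase of the idea rather than the paper's exact notation, with \(n_1, \dots, n_5\) denoting the base numbers), the definition amounts to something like:

\[
\mathcal{E}_P \;=\; \Bigl\{\, e\bigl(n_{\sigma(1)}, \dots, n_{\sigma(k)};\; \circ_1, \dots, \circ_{k-1}\bigr) \;\Bigm|\; 1 \le k \le 5,\;\; \sigma \text{ a permutation},\;\; \circ_i \in \{+, -, \times, \div\} \,\Bigr\},
\]

where \(e(\cdot)\) ranges over all possible parenthesizations, and an expression counts as a solution only if it evaluates to the target and every intermediate result is a non-negative integer.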

This mathematical formalization ensures that every generated instance has a solvable path and allows the difficulty to be calculated based on the density of solutions in the search space. Because the problem space is vast, the researchers can generate unique datasets for every single evaluation run, completely eliminating the risk of the model having “seen” the problem before.
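To see what "dynamically generated" means in practice, here is a small brute-force sketch of our own (not the authors' generator): it samples a random instance and keeps it only if at least one rule-respecting path to the target exists. The sampling ranges are illustrative guesses.

```python
import itertools
import random

# The four basic operations; division is only defined when it yields an integer.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
}

def reachable(numbers, target):
    """True if `target` can be reached by repeatedly combining two of `numbers`,
    keeping every intermediate result a non-negative integer."""
    if target in numbers:
        return True
    for i, j in itertools.permutations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        for op, fn in OPS.items():
            val = fn(a, b)
            if val is None or val < 0:      # skip fractional or negative intermediates
                continue
            if reachable(rest + [val], target):
                return True
    return False

def generate_instance(rng=random):
    """Sample base numbers and a target until the instance is solvable.
    (Ranges are illustrative, not the paper's exact sampling scheme.)"""
    while True:
        base = [rng.randint(1, 20) for _ in range(5)]
        target = rng.randint(10, 99)
        if target not in base and reachable(base, target):
            return base, target

base, target = generate_instance()
print("base numbers:", base, "target:", target)
```

Brute force is perfectly adequate at this scale (5 operands), which is also what makes generating a fresh dataset for every evaluation run cheap.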
Experiments and Results
The authors evaluated a wide range of open-source models (like Llama-3, Qwen2, Mistral) and closed-source models (GPT-4, Claude 3) on this new benchmark. The results were humbling for the AI systems.
The Performance Gap
While human 3rd graders achieve an average accuracy of roughly 43.7%, the most advanced AI models struggle to break 15%.

Key takeaways from the results (Figure 4):
- Small Models Fail Completely: Models with fewer than 3 billion parameters (like Qwen-1.5-0.5B) scored nearly 0%.
- Size Matters, But Not Enough: There is a clear correlation between model size and performance. The 70B+ parameter models (Llama-3-70B, Qwen2-72B) performed best, hovering around 10-15%.
- The “SOTA” Struggle: Even the heavyweights—GPT-4 and Claude 3 Opus—did not dominate. They performed comparably to the best open-source models but still fell far short of human children.
Stability and Reliability
A common critique of new benchmarks is that they are noisy. If you run the test twice, do you get the same score?
The researchers tested Llama-3-70B repeatedly on dynamically generated datasets of varying sizes.

As shown in Table 2, the “Mixed” difficulty dataset produced incredibly stable results (around 11.5% to 12.3%) regardless of whether 100 or 1500 samples were used. This stability confirms that Mathador-LM is a reliable metric for measuring reasoning capability.
Does “Few-Shot” Prompting Help?
In LLM engineering, “few-shot prompting” (giving the model a few examples of solved problems in the prompt) usually boosts performance significantly. The researchers tested if giving the models 2, 5, 10, or 20 examples would help them grasp the Mathador logic.

Surprisingly, increasing the number of shots had a negligible effect. The jump from 2 shots to 20 shots resulted in only a ~1% accuracy gain. This suggests that the models aren't failing because they don't understand the format; they are failing because they lack the fundamental planning and reasoning required to solve the puzzle.
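Mechanically, "k-shot" prompting just means prepending k solved examples to the query. A minimal sketch of how such a prompt might be assembled (illustrative formatting only, not the paper's actual template):

```python
def build_prompt(rules: str, solved_examples: list[tuple[str, str]], instance: str, k: int) -> str:
    """Assemble a k-shot prompt: the rules first, then k solved (problem, solution)
    pairs, then the new instance to solve."""
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in solved_examples[:k])
    return f"{rules}\n\n{shots}\n\nProblem: {instance}\nSolution:"
```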
Why Do They Fail? (Error Analysis)
This is perhaps the most revealing part of the paper. When an LLM fails a Mathador problem, how does it fail? The researchers categorized errors into four types (a rough classification sketch follows the list):
- Formatting Error: The model's answer did not follow the required solution format.
- Calculation Error: The math was wrong (e.g., saying \(5+5=12\)).
- Missed Target: The calculations were right, but the final number wasn’t the target.
- Illegal Operand: The model used a number that wasn’t in the list of 5 base numbers.
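Here is a rough sketch of how such a classification could be automated. It is our own illustration, not the paper's evaluation harness, and it assumes answers are written as lines like `8 + 4 = 12`:

```python
import re

STEP = re.compile(r"^\s*(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)\s*$")

def classify(lines, base_numbers, target):
    """Classify a failed answer into the four error types above."""
    if not lines:
        return "formatting_error"
    available = list(base_numbers)               # base numbers plus earlier results
    last = None
    for line in lines:
        m = STEP.match(line)
        if m is None:
            return "formatting_error"            # step not in "a op b = c" form
        a, op, b, claimed = int(m[1]), m[2], int(m[3]), int(m[4])
        for x in (a, b):
            if x in available:
                available.remove(x)
            else:
                return "illegal_operand"         # number not among base numbers or prior results
        actual = {"+": a + b, "-": a - b, "*": a * b,
                  "/": a // b if b and a % b == 0 else None}[op]
        if actual != claimed:
            return "calculation_error"
        available.append(actual)                 # results may be reused in later steps
        last = actual
    return "correct" if last == target else "missed_target"
```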

Table 4 reveals a shocking weakness. The vast majority of errors—over 60% for most models—are Illegal Operand errors.
This means the models are “hallucinating” numbers. If they need a “7” to solve the equation but don’t have a “7” in their base set, they simply invent it. This highlights a critical deficiency in current LLMs: they struggle to adhere to strict constraints (negative constraints) within a reasoning chain. They prioritize generating a plausible-looking mathematical equation over adhering to the strict rules of the game environment.
Score Distributions and Strategy
Not all models fail in the same way. The researchers plotted the distribution of scores to see if models were attempting complex solutions (aiming for the Mathador bonus) or playing it safe.

Figure 5 shows that Claude-3-Opus (the bottom row) manages to find higher-scoring solutions (scores of 9-10) more frequently than Llama-3-70B (top row), even if their average success rates are close. This indicates that some models have a slightly better “planning” horizon, allowing them to attempt more complex arithmetic chains, whereas others settle for the simplest path or fail trying.
Advanced Analysis
The authors dug deeper to see if standard “tricks” could improve performance.
Multiple Attempts (Self-Consistency)
If you let the model try 5 times and pick the best valid answer, does it do better?

Yes, but not by enough to close the gap. Table 5 shows a 19-29% relative gain, which is significant, but even with 5 attempts, Llama-3-70B only reaches 13.7% accuracy—still far below the human baseline.
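In code, "best of k attempts" is just repeated sampling plus a validity/score check. A minimal, library-agnostic sketch (the callables are placeholders, not the paper's harness):

```python
def best_of_k(generate, score, k=5):
    """Sample k candidate solutions and keep the highest-scoring valid one.
    `generate` is any callable returning one attempted solution (e.g. one LLM sample);
    `score` returns 0 for invalid attempts and a positive score otherwise."""
    best, best_score = None, 0
    for _ in range(k):
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```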
Decoding Strategies
They also tested whether changing the sampling temperature or the decoding method (greedy vs. nucleus sampling) mattered.

As Table 6 illustrates, the decoding strategy made almost no difference. Whether the model was creative (Nucleus) or deterministic (Greedy), the fundamental inability to plan a valid path remained.
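For readers who want to reproduce this kind of comparison, here is a minimal sketch using the Hugging Face transformers API. The model name, prompt, and generation settings are illustrative choices, not the paper's exact configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"   # example open model, not the paper's full lineup
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "..."  # the Mathador rules plus one instance, as in the benchmark prompt
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, always picks the most likely next token.
greedy = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Nucleus (top-p) sampling: stochastic, draws from the smallest token set
# whose cumulative probability exceeds top_p.
nucleus = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.7)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(nucleus[0], skip_special_tokens=True))
```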
Conclusion and Implications
The Mathador-LM paper serves as a vital reality check for the AI community. It demonstrates that while LLMs are becoming encyclopedic geniuses capable of passing medical boards, they still lack the fluid, constrained reasoning capabilities of an average 8-year-old solving a math puzzle.
Key Takeaways:
- Contamination is Real: The massive gap between GSM8K performance and Mathador-LM performance suggests current benchmarks are heavily contaminated.
- Reasoning vs. Retrieval: LLMs struggle with tasks that require strictly constrained planning (using only available numbers). The prevalence of “Illegal Operand” errors shows they struggle to “inhibit” the retrieval of outside information.
- The Future is Dynamic: Static benchmarks (fixed lists of questions) are obsolete the moment they are released. Dynamic benchmarks like Mathador-LM, which generate unique problems on the fly, are the only way to reliably measure progress.
For students and researchers in AI, Mathador-LM highlights a clear frontier: moving beyond simple instruction following and memorization toward systems that can plan, verify, and adhere to strict logical constraints in novel environments. Until LLMs can beat the 3rd graders at Mathador, we should remain cautious about claims of “super-human” reasoning.