Introduction
If you have used recent Large Language Models (LLMs) like GPT-4o or o3-mini, you know they have become incredibly proficient at mathematics. On standard benchmarks like GSM8K (grade school math) or the more advanced MATH dataset, these models often achieve near-human or even superhuman performance. They can solve complex equations, reason through multi-step word problems, and write out their “chain of thought” to justify the answer.
But there is a catch. These models are eager to please. In fact, they are often too eager.
What happens when you present an LLM with a math problem that cannot be answered? Imagine a word problem where a necessary piece of information is missing—like asking you to calculate the speed of a car given the distance it traveled, but withholding the time it took. A human would immediately say, “I don’t have enough information.” An LLM, however, often panics. Instead of admitting ignorance, it frequently hallucinates: it invents numbers, assumes relationships that don’t exist, or forcefully derives a confident, yet completely unfounded, answer.
This blog post explores a fascinating research paper titled “TREECUT: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation.” The researchers tackle the critical issue of LLM hallucination in mathematical reasoning. They introduce a novel method for generating infinite “unanswerable” math problems to test whether AI models can distinguish between a solvable puzzle and a broken one.
The Problem: Proficiency vs. True Reasoning
Why does this matter? If an AI system is to be trusted in high-stakes environments—like finance, engineering, or education—it must know its own limits. An incorrect answer is often worse than no answer at all.
While LLMs score 90%+ on standard math tests, skeptics argue this might be due to pattern matching rather than genuine reasoning. The models have seen millions of similar problems during training. But when the surface form of a problem changes, or when the logic is subtly broken, their performance often collapses.
Existing attempts to test this “unanswerability” have relied on taking real problems and manually deleting sentences. This approach is slow, limited in scale, and prone to “contamination” (where the model has already seen the original problem in its training data and answers from memory). To truly stress-test these models, we need a dataset that is synthetic (generated from scratch), infinite, and structurally controllable.
Enter TREECUT.
The Core Method: Growing and Pruning Trees
The heart of this paper is the TREECUT generation algorithm. Instead of writing word problems in English first, the authors represent math problems as trees.
Understanding the Tree Structure
To understand how this works, let’s visualize a math word problem as a directed graph.
- Nodes represent Variables: Imagine variables like the price of a burger (\(x_1\)), a salad (\(x_2\)), or a sandwich (\(x_3\)).
- Edges represent Formulas: The line connecting two nodes represents a mathematical relationship (e.g., “The burger costs $3 more than the salad”).
- The Root: There is a starting point (the root) that defines the initial values.
In a solvable (answerable) problem, there is a clear, unbroken path from the root node all the way down to the “questioned variable” (the leaf node you are trying to solve for). You simply follow the path, applying the math at each step, to get the answer.
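Before we get to the “cut,” it helps to see how compact such a problem is under the hood. Below is a minimal Python sketch of this idea, using a toy encoding of my own (not the paper’s code): the root value and the edge formulas are stored explicitly, and the questioned variable is solved by walking the path back to the root.

```python
# Toy encoding (not the paper's implementation): a problem is a root value
# plus formulas of the form child = parent + offset.

root_value = {"x1": 4}       # "A salad costs $4."  (the root)

edges = {
    "x2": ("x1", 3),         # "The burger costs $3 more than the salad."
    "x3": ("x2", 2),         # "The sandwich costs $2 more than the burger."
}

def solve(var, root_value, edges):
    """Follow the chain of formulas upward until a known root value is reached."""
    if var in root_value:
        return root_value[var]
    if var not in edges:
        return None          # no formula defines this variable
    parent, offset = edges[var]
    parent_value = solve(parent, root_value, edges)
    return None if parent_value is None else parent_value + offset

print(solve("x3", root_value, edges))   # 9 -- every step on the path is connected
```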
The “Cut”: Creating Unanswerability
This is where the ingenuity of TREECUT lies. To create an unanswerable problem, the algorithm generates a perfect, solvable tree first. Then, it strategically removes (cuts) an edge somewhere along the critical path.

As shown in Figure 1 above, look at the difference between the left and middle diagrams:
- Left (Answerable): You can trace a line from the Root (top) down to \(X_1\), then to \(X_2\), and finally to \(X_3\) (the answer). Every step is connected.
- Middle (Unanswerable): The researchers have introduced a “Cut” (the red scissors). The link between \(X_1\) and \(X_2\) is severed. If you try to solve for \(X_3\), you will hit a dead end. You might know how \(X_3\) relates to \(X_2\), but you have no way of knowing what \(X_2\) is.
- Right (The Text): The algorithm then translates these trees into natural language. The text describes prices of burgers and sandwiches. The “Cut” corresponds to deleting the sentence that links the burger’s price to the scrambled egg’s price.
This method allows the researchers to precisely control the difficulty (a toy generator sketch follows the list below). They can change:
- ansDepth: How deep is the tree? (How many steps to solve it?)
- numVars: How many total variables are distracting the model?
- Composite Names: Are we talking about “\(x\)” and “\(y\)”, or “a burger at Bistro Nice” vs. “a burger at Urban Plate”?
- cutDepth: Where exactly did the bridge break? Near the start or near the end?
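Continuing the toy sketch from above (the parameter names ansDepth and cutDepth follow the paper’s terminology, but this linear-chain generator is a simplification for illustration, not the authors’ algorithm, which grows full trees with distractor branches and composite names):

```python
import random

def generate_problem(ans_depth, cut_depth=None, seed=0):
    """Build a chain x1 -> x2 -> ... -> x_{ans_depth+1}, optionally cutting one edge.

    ans_depth: number of reasoning steps from the root to the questioned variable.
    cut_depth: if given, how many steps above the questioned variable the defining
               edge is removed, which makes the problem unanswerable.
    """
    rng = random.Random(seed)
    root_value = {"x1": rng.randint(1, 10)}
    edges = {f"x{i}": (f"x{i-1}", rng.randint(1, 5))      # x_i = x_{i-1} + offset
             for i in range(2, ans_depth + 2)}
    question = f"x{ans_depth + 1}"

    if cut_depth is not None:
        edges.pop(f"x{ans_depth + 1 - cut_depth}", None)  # sever the critical path

    return root_value, edges, question

# Reusing solve() from the earlier sketch:
root_value, edges, question = generate_problem(ans_depth=3)
print(solve(question, root_value, edges))                 # a number: answerable

root_value, edges, question = generate_problem(ans_depth=3, cut_depth=1)
print(solve(question, root_value, edges))                 # None: the chain is broken
```

Because every problem is built from such a structure rather than scraped from the web, the generator can produce unlimited fresh instances with exactly the depth, branching, and cut position the experimenter asks for.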
Experiments and Results
The researchers tested several state-of-the-art models, including Llama 3.1, Qwen2.5, GPT-4o, and the reasoning-heavy o3-mini. They used a “zero-shot” prompt, meaning they didn’t give the models examples beforehand; they simply asked the model to solve the problem or state “Answer: unknown” if conditions were insufficient.
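To make the evaluation setup concrete, here is a rough sketch of a zero-shot harness. The prompt wording is paraphrased rather than quoted from the paper, and query_model is a placeholder for whatever client call you use to reach the model under test:

```python
def build_prompt(problem_text):
    """Zero-shot prompt: no worked examples, just the problem and an instruction
    to say 'Answer: unknown' when the conditions are insufficient (paraphrased)."""
    return (
        f"{problem_text}\n\n"
        "Solve the problem. If the given conditions are insufficient to "
        "determine the answer, reply with 'Answer: unknown'."
    )

def is_hallucination(response, answerable):
    """For an unanswerable problem, any response that never admits 'unknown'
    is counted as a hallucinated answer."""
    admitted_unknown = "unknown" in response.lower()
    return (not answerable) and (not admitted_unknown)

# Usage sketch (query_model is hypothetical, not a real API):
# response = query_model(build_prompt(problem_text))
# print(is_hallucination(response, answerable=False))
```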
The results were stark.

Table 1 (above) shows the percentage of hallucinations—how often the model made up an answer when it should have said “unknown.”
- Llama-8B failed almost completely, hallucinating over 80% of the time regardless of complexity.
- GPT-4o, widely considered one of the best general-purpose models, struggled significantly as the problems got deeper. At ansDepth = 8 (a problem requiring 8 logical steps), GPT-4o hallucinated 64.0% of the time.
- o3-mini, a model designed specifically for reasoning, performed better but was surprisingly unstable on simpler problems (ansDepth = 2), hallucinating 44.0% of the time.
The “Reasoning” Trap
Why did o3-mini perform poorly on simple problems? The paper uncovers a fascinating behavior. Reasoning models are trained to find solutions. When o3-mini encountered a missing link (e.g., knowing the price of a “Burger at Shop A” but needing the price of a “Burger at Shop B”), it often invented an assumption.
It would reason: “Typically, in math problems, items with the same name cost the same. Therefore, I assume the price is equal.”
This is a hallucination. In the context of these strictly defined word problems, assuming two different variables are equal just because they share a name is logically invalid. The model was “smart” enough to realize the data was missing, but “biased” enough to force a solution anyway.
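As a hypothetical worked example (the variable names are mine, not taken from the paper’s figures): suppose the problem fixes the price of the burger at Shop A, relates the target to the burger at Shop B, and the sentence defining the Shop B burger has been cut.

\[
x_{A} = 5, \qquad x_{3} = x_{B} + 2, \qquad x_{B} \text{ left undefined by the problem.}
\]

The hallucinated step is \(x_{B} = x_{A}\), which yields the confident but unfounded answer \(x_{3} = 7\); nothing in the problem states that the two burgers share a price.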
Analysis: What Triggers Hallucination?
Because TREECUT is synthetic, the researchers could tweak parameters to see exactly what confuses the AI. They analyzed GPT-4o’s performance against varying conditions.
1. Complexity and Distraction
The first major finding is that complexity invites hallucination.

Figure 2 illustrates how hallucination rates climb as the problems get “deeper” (the X-axis). But look at the separation between the lines:
- Orange Lines (Complex Structure): These represent trees with extra branches—distractor variables that aren’t part of the main solution path. The hallucination rate is consistently higher here than in the Blue lines (simple paths).
- Solid Lines (Composite Names): This is the most interesting psychological insight. The solid lines represent problems that use composite names like “A Greek Salad at Texas BBQ,” while the dashed lines use simple names. The solid lines sit significantly higher than the dashed lines.
- Takeaway: If you simply give variables complex names and add a few irrelevant facts, GPT-4o becomes much more likely to hallucinate an answer to an unsolvable problem. The semantic noise distracts the model’s reasoning.
2. The “Valley” of Confusion
The researchers also investigated where the missing information was located. Does it matter if the missing link is at the beginning of the chain or the end?

Figure 3 plots hallucination against cutDepth (how far the cut is from the questioned variable).
- Low CutDepth (Left of graph): The cut is near the questioned variable (the end of the chain). The model usually spots this.
- High CutDepth (Right of graph): The cut is near the root (the start of the chain). The model usually spots this too.
- The Middle (The Peak): Look at the massive spike in the middle (around depth 3-5).
When the logical break happens in the middle of a long reasoning chain, the model becomes confused. It starts the reasoning process successfully and can see the goal at the end, but it glosses over the gap in the middle, hallucinating a bridge to connect the two valid halves of the derivation.
Conclusion
The TREECUT paper provides a sobering check on the “reasoning” capabilities of modern LLMs. While they can perform arithmetic and follow patterns beautifully, they lack the robust self-awareness to recognize when a logical chain is broken—especially when that break is hidden deep within a complex problem or obscured by wordy descriptions.
The key takeaways from this research are:
- High Hallucination Rates: Even top-tier models like GPT-4o and o3-mini frequently invent answers to unanswerable math problems.
- Structural Vulnerability: Models are most vulnerable when problems are deep, contain distractor information, or use composite names (semantic complexity).
- The Middle-Cut Effect: Missing information is hardest for AI to detect when it occurs in the middle of a reasoning chain.
TREECUT offers a valuable tool for developers and researchers. By using synthetic datasets that systematically probe these weaknesses, we can move beyond simple accuracy metrics and start evaluating whether our AI models are truly reasoning, or just very good at guessing. Future work will likely focus on training models specifically to recognize these “cuts” in logic, moving us one step closer to AI that knows what it doesn’t know.