Large Language Models (LLMs) like GPT-4 and LLaMA have become eerily good at answering complex scientific and mathematical questions. If you ask an LLM, “Why does ReLU activation train faster than sigmoid?”, it will likely give you a coherent, textbook-quality paragraph about gradients and saturation.
But this capability triggers a nagging question for researchers and students alike: Is the model actually reasoning, or is it just parroting a memorized block of text?
Does the model understand the underlying concepts—like what a gradient is or how backpropagation works—that are required to construct that answer? Or did it simply see the “ReLU vs. Sigmoid” comparison enough times in its training data to autocomplete the response?
In a fascinating paper titled Hierarchical Deconstruction of LLM Reasoning, researchers from KAIST and NAVER AI Lab propose a new framework to audit the “brain” of an LLM. By deconstructing complex questions into a graph of simpler, foundational sub-questions, they expose exactly where models fail to reason (forward discrepancy) and where they are likely just hallucinating competence through memorization (backward discrepancy).
In this post, we will break down their framework, the DEPTHQA dataset, and what their experiments reveal about the fragile nature of machine reasoning.
The Black Box of Reasoning
To understand if a student truly understands a complex topic, a teacher rarely accepts a simple correct answer. They ask, “Show your work,” or “Why is that the case?”
Standard benchmarks for LLMs often miss this nuance. They check if the final answer is correct, but they don’t check if the model possesses the precursor knowledge required to derive that answer.
The researchers address this by adopting Webb’s Depth of Knowledge (DOK), a framework widely used in education assessment. They categorize knowledge into three distinct depths:
- \(D_1\) (Conceptual Knowledge): Basic recall. What is the information? (e.g., “What is a gradient?”)
- \(D_2\) (Procedural Knowledge): Application. How is knowledge used? (e.g., “How do gradients affect training speed?”)
- \(D_3\) (Strategic Knowledge): Reasoning and analysis. Why is this applicable? (e.g., “Why does ReLU training take less time than sigmoid?”)
The core hypothesis is simple: To genuinely answer a \(D_3\) question, you must master the underlying \(D_2\) procedures and \(D_1\) concepts.
The Graph-Based Framework
The researchers developed a method to deconstruct a complex “target” question (\(D_3\)) into a dependency graph of simpler questions.
Imagine the target question as the top node of a tree. To answer it, you need the supporting nodes below it (\(D_2\)), and to answer those, you need the leaf nodes (\(D_1\)).

As shown in Figure 1, answering the complex question about ReLU vs. Sigmoid (at the top) relies on understanding procedural questions about gradient effects, which in turn rely on defining terms like “backpropagation” and “vanishing gradient.”
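To make the structure concrete, here is a minimal sketch of how such a dependency graph could be represented in code. The node texts mirror the Figure 1 example, but the class and helper names are illustrative rather than taken from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """A node in the knowledge graph: a question at a given depth of knowledge."""
    text: str
    depth: int  # 1 = conceptual, 2 = procedural, 3 = strategic
    predecessors: list["Question"] = field(default_factory=list)  # simpler questions this one depends on

# Illustrative nodes, loosely following the ReLU-vs-sigmoid example.
d1_a = Question("What is a gradient?", depth=1)
d1_b = Question("What is the vanishing gradient problem?", depth=1)
d2 = Question("How do gradients affect training speed?", depth=2, predecessors=[d1_a, d1_b])
d3 = Question("Why does ReLU training take less time than sigmoid?", depth=3, predecessors=[d2])

def foundations(question: Question) -> list[Question]:
    """Collect the simpler questions the target ultimately depends on."""
    collected, stack = [], list(question.predecessors)
    while stack:
        node = stack.pop()
        collected.append(node)
        stack.extend(node.predecessors)
    return collected

print([q.text for q in foundations(d3)])  # the D2 and D1 questions beneath the D3 target
```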
Constructing the DEPTHQA Dataset
To test this at scale, the team created DEPTHQA. They took high-quality, complex scientific questions (from the TutorEval dataset) and used GPT-4 to recursively deconstruct them.
However, simply generating sub-questions isn’t enough. The team enforced three critical criteria for the edges connecting these nodes:
- Comprehensiveness: The lower-level questions must cover all the background info needed for the higher level. No missing links.
- Implicitness: The lower-level questions must not give away the answer to the higher level. They should provide the ingredients, not the cake.
- Non-binary Questioning: No Yes/No questions. This forces the model to generate an explanation, preventing it from guessing the right answer by chance (or bias).
This resulted in a dataset where every complex problem is supported by a “reasoning graph” of foundational knowledge.
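A rough sketch of what that recursive deconstruction loop could look like is below. The prompt wording and the `ask_llm` helper are purely illustrative stand-ins (the paper used GPT-4; any chat-completion client would slot in here), not the authors' actual prompts.

```python
def ask_llm(prompt: str) -> list[str]:
    """Stand-in for a GPT-4-style chat-completion call that returns generated sub-questions."""
    raise NotImplementedError  # wire up your own client and response parsing here

DECOMPOSE_PROMPT = (
    "Deconstruct the following depth-{d} question into simpler depth-{d_minus_one} questions.\n"
    "Constraints:\n"
    "1. Comprehensiveness: together, they must cover all background needed for the original question.\n"
    "2. Implicitness: none of them may give away the original answer.\n"
    "3. Non-binary: no yes/no questions.\n\n"
    "Question: {question}"
)

def deconstruct(question: str, depth: int) -> dict:
    """Recursively expand a question into a graph that bottoms out at depth-1 (conceptual) nodes."""
    node = {"question": question, "depth": depth, "children": []}
    if depth > 1:
        sub_questions = ask_llm(
            DECOMPOSE_PROMPT.format(d=depth, d_minus_one=depth - 1, question=question)
        )
        node["children"] = [deconstruct(sub, depth - 1) for sub in sub_questions]
    return node

# graph = deconstruct("Why does ReLU training take less time than sigmoid?", depth=3)
```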

As seen in the table above, the questions require diverse reasoning types—ranging from comparative and causal analysis at Depth 3 to procedural steps at Depth 2.
Measuring the “Reasoning Gap”
With this hierarchy in place, the researchers could measure two specific failure modes in LLMs, which they define as discrepancies.
1. Forward Discrepancy (The Reasoning Failure)
This occurs when a model answers the simpler sub-questions correctly but fails the complex target question.
- Scenario: The model knows what a gradient is (\(D_1\)) and how it works (\(D_2\)), but it still cannot explain why ReLU is faster (\(D_3\)).
- Implication: The model has the knowledge but lacks the reasoning capability to synthesize it into a complex conclusion.
The researchers quantify this with a score that, in essence, takes the positive gap between the model's average score on the predecessor (simpler) questions and its score on the target:

\[
\Delta_{\text{fwd}}(q) \;=\; \max\!\left(0,\; \frac{1}{|P(q)|}\sum_{q' \in P(q)} s(q') \;-\; s(q)\right)
\]

where \(P(q)\) is the set of predecessor questions of the target \(q\), and \(s(\cdot)\) is the graded score on a question.
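For instance, if a model scores 0.9 and 0.8 on the two predecessor questions but only 0.3 on the target, the forward discrepancy under this simplified formulation is \(\max(0,\, 0.85 - 0.3) = 0.55\).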
2. Backward Discrepancy (The Memorization/Hallucination Indicator)
This is the more surprising phenomenon. It occurs when a model answers the complex question correctly but fails the simpler sub-questions.
- Scenario: The model gives a perfect explanation of ReLU vs. Sigmoid (\(D_3\)), but when asked “What is a vanishing gradient?” (\(D_1\)), it hallucinates or answers incorrectly.
- Implication: The model did not reason its way to the answer. It likely memorized the complex answer from its training data (rote memorization) without understanding the constituent parts.
This is calculated as the mirror image of the forward case:

\[
\Delta_{\text{bwd}}(q) \;=\; \max\!\left(0,\; s(q) \;-\; \frac{1}{|P(q)|}\sum_{q' \in P(q)} s(q')\right)
\]

Here, we are looking at cases where the predecessor (simpler) questions score lower than the complex target node \(q\).
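Under the simplified formulations above, both gaps are a few lines of code once every question has a graded score. The function names and inputs below are illustrative, not the paper's evaluation code.

```python
def forward_discrepancy(target_score: float, predecessor_scores: list[float]) -> float:
    """Positive gap when the simpler predecessor questions are answered better than the target."""
    avg_predecessor = sum(predecessor_scores) / len(predecessor_scores)
    return max(0.0, avg_predecessor - target_score)

def backward_discrepancy(target_score: float, predecessor_scores: list[float]) -> float:
    """Positive gap when the target is answered better than its simpler predecessors."""
    avg_predecessor = sum(predecessor_scores) / len(predecessor_scores)
    return max(0.0, target_score - avg_predecessor)

# Knows the ingredients, fails the cake -> forward discrepancy of 0.55.
print(forward_discrepancy(0.3, [0.9, 0.8]))
# Recites the cake, fails the ingredients -> backward discrepancy of 0.6.
print(backward_discrepancy(0.9, [0.2, 0.4]))
```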
The visualization below summarizes these two concepts beautifully:

In Figure 2, the red arrows represent Forward Discrepancy (ingredients present, cake failed), and the blue arrows represent Backward Discrepancy (cake present, ingredients missing).
Experimental Results: What Do the Models Reveal?
The researchers tested several open-source models, including LLaMA 2, LLaMA 3, Mistral, and Mixtral, ranging from 7B to 70B parameters.
1. Size Matters for Consistency
Unsurprisingly, larger models (like LLaMA 3 70B) generally had higher accuracy and lower discrepancies than smaller models. Smaller models (7B) were highly inconsistent—they often failed to connect the dots (Forward Discrepancy) or lucked into high-level answers they couldn’t support (Backward Discrepancy).
However, even the best models weren’t immune. While LLaMA 3 70B showed the lowest discrepancy, the phenomenon of “knowing the complex but failing the simple” persisted.
2. The Memorization Factor
To probe whether Backward Discrepancy is driven by memorization, the researchers used a metric called Min-K% probability, which scores a text by the model's likelihood on its least probable tokens. Without getting too bogged down in the math: a higher Min-K% suggests the text was likely present in the training data (memorized), while a lower value suggests the model is generating more novel text. (A short sketch of the computation appears after the bullets below.)

Look at the top row of Figure 3, which plots the Min-K% distributions at each depth (\(D_1 \to D_3\)). The pattern suggests that models rely less on memorization for complex (\(D_3\)) questions than for simple (\(D_1\)) ones.
However, the bottom row tells a nuanced story about the gaps.
- Forward Discrepancies (Negative values): These often occur in samples that are less memorized. When the model can’t rely on rote recall, it has to reason, and it often fails to bridge the gap from \(D_2\) to \(D_3\).
- Backward Discrepancies (Positive values): These are strongly linked to memorization. The model has seen the “textbook answer” to the complex question so many times it can recite it, but the specific, isolated procedural questions (\(D_2\)) might be less common in the training set or require variable manipulation that the model can’t fake.
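For reference, Min-K% probability itself is straightforward to compute from per-token log-probabilities. This is a generic sketch of the metric as defined by Shi et al., not the paper's exact evaluation script.

```python
def min_k_percent_prob(token_logprobs: list[float], k: float = 20.0) -> float:
    """Average log-probability of the k% least likely tokens in a text.

    Text the model saw during training tends not to contain extreme low-probability
    outlier tokens, so a higher score is the usual signal of memorization.
    """
    n = max(1, int(len(token_logprobs) * k / 100))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Toy example: per-token log-probs of a model-scored answer; k=50 averages the 3 least likely tokens.
print(min_k_percent_prob([-0.1, -0.3, -2.5, -0.2, -4.0, -0.15], k=50))
```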
3. Qualitative Failures
The paper provides qualitative examples that are quite revealing. In one case of Backward Discrepancy, LLaMA 3 70B correctly recalled a complex formula for a specific math problem (\(D_3\)). However, when asked to explain the procedural steps to solve that very equation (\(D_2\)), it hallucinated non-existent methods and incorrect steps.
It’s the equivalent of a student memorizing “The answer is 42” but writing nonsense when asked “How did you calculate that?”
Can We Fix It? The Power of Structured Interaction
If models struggle to link simple concepts to complex answers, can we help them? The researchers tested “scaffolding” the models. Instead of asking the \(D_3\) question immediately, they guided the model through the graph: asking \(D_1\), then \(D_2\), and finally \(D_3\).
They tested three methods:
- Prompt (Gold): Giving the model the sub-questions and their correct answers in the prompt.
- Prompt (Pred): Giving the model the sub-questions and its own predicted answers.
- Multi-turn: Asking the questions sequentially in a conversation format.
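As a rough illustration of the multi-turn setup, the conversation can be built up one depth at a time so that the model's own earlier answers stay in context; the chat-message format is the standard role/content list, and `chat` is a placeholder for whichever model client is under test.

```python
def chat(messages: list[dict]) -> str:
    """Placeholder for a chat-completion call to the model being evaluated."""
    raise NotImplementedError

def multi_turn_scaffold(d1_questions: list[str], d2_questions: list[str], d3_question: str) -> str:
    """Ask the simpler questions first, keep them in the conversation history,
    then ask the complex target question last."""
    messages: list[dict] = []
    for question in list(d1_questions) + list(d2_questions):
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": chat(messages)})
    messages.append({"role": "user", "content": d3_question})
    return chat(messages)  # the model answers D3 with its own D1/D2 answers in context
```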

Figure 4 illustrates the results of this intervention. The y-axis shows the improvement in score compared to a standard zero-shot attempt.
- Small Models (7B): They benefit massively (the tall blue bars). Guiding a smaller model through the reasoning steps significantly helps it answer the complex question. It needs the “hand-holding.”
- Large Models (70B): The effect is mixed. Sometimes, forcing a large model to answer simple sub-questions first actually hurts performance (negative bars).
- Why? Large models often have their own internal reasoning pathways. Forcing them to follow an external, possibly rigid structure might disrupt their superior internal representations.
- Multi-turn wins: The Multi-turn approach (blue bars) was generally the most stable and effective method across the board. By establishing context in a conversation history, the model naturally builds up the “state” required to answer the final question accurately.
Conclusion
The “Hierarchical Deconstruction” paper offers a sobering but optimistic view of LLMs. It moves us away from binary “Correct/Incorrect” evaluations and toward a diagnostic view of how knowledge is constructed.
Key Takeaways:
- Reasoning is Hierarchical: Real-world questions are graphs, not points. Evaluating them requires looking at the dependencies.
- Beware the Parrot: A correct answer to a hard question doesn’t imply mastery. Backward discrepancy reveals that models often memorize outcomes without understanding processes.
- Scaffolding Helps: For smaller, more efficient models, structured Chain-of-Thought or multi-turn interaction isn’t just a prompt engineering trick—it’s a necessary bridge to connect conceptual knowledge to strategic reasoning.
As we move toward agents and autonomous AI, this kind of discrepancy analysis will be vital. We don’t just need models that get the answer right; we need models that get the answer right for the right reasons.