Reinforcement Learning (RL) has been a game-changer for Large Language Models (LLMs), dramatically boosting their ability to solve complex reasoning problems. Yet as models improve, a fundamental question remains unanswered: how exactly does this improvement happen?
The training process often feels like a black box, producing curious phenomena such as sudden “aha moments” where a model appears to acquire a new emergent skill, or “length-scaling,” where longer, more detailed solutions lead to higher accuracy.
Are these just random artifacts of a complex system, or are they clues to a deeper underlying mechanism?
A recent paper, Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning, argues for the latter. The researchers propose that RL doesn’t simply improve every skill simultaneously. Instead, it guides the LLM to rediscover a powerful, human-like strategy for problem-solving: hierarchical reasoning — the cognitive approach humans use, separating high-level strategic planning (“What’s my plan of attack?”) from low-level procedural execution (“Now I’ll add these two numbers.”).
In this article, we’ll unpack how this reasoning hierarchy emerges during RL training, why it explains puzzling phenomena like “aha moments” and “length-scaling,” and how this insight led to a more efficient RL algorithm: Hierarchy-Aware Credit Assignment (HICRA).
The Ghost in the Machine: Human Priors and Hierarchical Thinking
LLMs are not trained from scratch — they are pre-trained on massive datasets of human-generated text, including many step-by-step problem solutions. This text encodes the patterns of human reasoning: planning, strategizing, and execution.
The authors hypothesize that RL fine-tuning doesn’t invent a new form of reasoning. Instead, it enables the model to leverage the pre-existing hierarchical structure from pre-training, revealing that separating planning from execution is an effective way to solve complex problems.
Figure 1: (Left) Human-like hierarchical reasoning: high-level strategic planning and low-level execution. (Right) Emergence during RL training: Phase ① consolidates low-level skills (drop in execution token entropy), Phase ② shifts learning to strategic planning (rise in semantic diversity, improved accuracy, and longer reasoning chains).
To study this, the researchers needed a method to automatically distinguish between:
- High-level Planning Tokens: Strategic moves guiding the reasoning, e.g., “First, I need to understand…”, “Let’s try a different approach,” “But wait…”.
- Low-level Execution Tokens: Operational steps, such as calculations, substitutions, and formula applications.
A token’s function depends on its context, making automatic classification challenging.
Finding the Scaffolding: Strategic Grams
The researchers introduced Strategic Grams (SGs) — n-grams of 3–5 words that act as semantic units steering logical flow. Examples: “let’s consider the case,” “the key insight is.”
These SGs have a distinct statistical signature: they appear often across different solutions but rarely multiple times in a single solution. This makes them ideal for identifying planning tokens.
SG identification pipeline (a minimal code sketch follows the list):
- Semantic Clustering: Extract all n-grams from a large corpus of correct solutions, embed them with a pre-trained sentence transformer, and cluster semantically similar n-grams (e.g., “try another way” and “an alternative path is”).
- Frequency Analysis: Calculate how many different solutions contain an n-gram from each cluster.
- SG Construction: Take the top 20% most frequent clusters. Any token belonging to an SG in this set is a planning token; all others are execution tokens.
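To make the three steps concrete, here is a minimal Python sketch assuming a small in-memory corpus of correct solutions. The embedding model (`all-MiniLM-L6-v2`), the cosine-distance clustering threshold, and plain whitespace tokenization are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the SG identification pipeline: cluster n-grams semantically,
# rank clusters by how many different solutions they appear in, keep the top 20%.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering


def extract_ngrams(solution: str, n_min: int = 3, n_max: int = 5) -> set[str]:
    """All whitespace-tokenized n-grams of 3-5 words in one solution."""
    words = solution.split()
    return {
        " ".join(words[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def identify_strategic_grams(solutions: list[str], top_frac: float = 0.2) -> set[str]:
    # 1) Semantic clustering: embed every unique n-gram and group near-duplicates.
    per_solution_ngrams = [extract_ngrams(s) for s in solutions]
    vocab = sorted(set().union(*per_solution_ngrams))
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = embedder.encode(vocab, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.3,  # assumed threshold
        metric="cosine", linkage="average",
    ).fit_predict(embeddings)
    cluster_of = dict(zip(vocab, labels))

    # 2) Frequency analysis: count how many *different* solutions hit each cluster.
    doc_freq = defaultdict(int)
    for ngrams in per_solution_ngrams:
        for cluster in {cluster_of[g] for g in ngrams}:
            doc_freq[cluster] += 1

    # 3) SG construction: keep n-grams from the top 20% most frequent clusters.
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    keep = set(ranked[: max(1, int(top_frac * len(ranked)))])
    return {g for g in vocab if cluster_of[g] in keep}
```

Any token that falls inside one of the returned SGs is then labeled a planning token; everything else is treated as an execution token.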
Figure 2: A reasoning trace from Qwen3-4B-GRPO with planning tokens highlighted. These phrases represent high-level moves such as deduction, branching, and backtracking.
The Two-Phase Emergence of Hierarchical Reasoning
Across experiments with eight different LLMs and VLMs, reasoning improvement followed a consistent two-phase pattern.
Phase 1: Procedural Consolidation
Initially, the model focuses on mastering low-level skills. One wrong calculation can sink an answer, so RL pressures the model to achieve procedural reliability.
Two metrics for execution tokens reveal this phase (sketched in code after the list):
- Relative Perplexity: Measures surprise — lower values indicate higher confidence. Execution token perplexity drops sharply early in training.
- Token Entropy: Measures uncertainty over next-token predictions — execution token entropy starts low and decreases further.
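As an illustration, here is a minimal sketch of how both quantities could be computed for each token group from a single forward pass. The tensor names and shapes are assumptions, and the sketch reports plain perplexity; the paper's relative perplexity applies an additional normalization not shown here.

```python
# Per-group token entropy and perplexity from next-token logits over one trace.
import torch
import torch.nn.functional as F


def phase1_metrics(logits: torch.Tensor,        # (seq_len, vocab_size) next-token logits
                   token_ids: torch.Tensor,     # (seq_len,) tokens actually generated
                   planning_mask: torch.Tensor  # (seq_len,) bool, True = planning token
                   ) -> dict[str, float]:
    log_probs = F.log_softmax(logits, dim=-1)

    # Token entropy: uncertainty of the full next-token distribution at each position.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

    # Negative log-likelihood of the emitted tokens; exp(mean NLL) gives perplexity.
    nll = -log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    exec_mask = ~planning_mask
    return {
        "execution_entropy": entropy[exec_mask].mean().item(),
        "execution_perplexity": nll[exec_mask].mean().exp().item(),
        "planning_entropy": entropy[planning_mask].mean().item(),
        "planning_perplexity": nll[planning_mask].mean().exp().item(),
    }
```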
Figure 3: Phase ① — execution token perplexity and entropy drop (procedural consolidation). Phase ② — semantic diversity of planning tokens increases (strategic exploration).
Takeaway 1: The first phase builds a robust low-level skillset, paving the way for improvements driven by strategic reasoning.
Phase 2: Strategic Exploration
Once procedural skills plateau, improvement comes from diversifying strategic plans.
To track this, the authors measured Semantic Entropy — diversity in SG usage — and Conditional Entropy for procedural tokens following a strategic move.
Figure 4: Semantic entropy measures diversity of ideas, distinct from token entropy’s focus on single-word uncertainty.
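One simple way to estimate semantic entropy is to look at which SG clusters the model actually invokes across a batch of rollouts and take the Shannon entropy of that usage distribution. The sketch below assumes the `cluster_of` mapping produced by the pipeline sketched earlier and uses crude substring matching; the paper's exact estimator may differ.

```python
# Semantic entropy: diversity of the Strategic-Gram clusters used across rollouts.
import math
from collections import Counter


def semantic_entropy(rollouts: list[str], strategic_grams: set[str],
                     cluster_of: dict[str, int]) -> float:
    counts = Counter()
    for text in rollouts:
        for gram in strategic_grams:
            if gram in text:                      # record which strategies were invoked
                counts[cluster_of[gram]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # High entropy: probability mass spread over many distinct strategies.
    # Low entropy: the model keeps replaying the same plan.
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```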
Results:
- Semantic entropy of planning tokens rises steadily (Figure 3, Column 3), showing active exploration of new strategies.
- This correlates with increases in accuracy and solution length (Column 4): more complex plans naturally require longer reasoning chains to execute.
Takeaway 2: Sustained reasoning gains after procedural mastery come from expanding strategic diversity — explaining “aha moments” and “length-scaling.”
HICRA: Focused Credit Assignment
This two-phase insight reveals a flaw in standard RL methods like GRPO: they assign the same sequence-level advantage to every token in a response. Because most tokens are low-level execution tokens, the learning signal for the few pivotal planning tokens is diluted.
Hierarchy-Aware Credit Assignment (HICRA) modifies this:
For planning tokens \( t \in S_i \):
\[ \hat{A}_{i,t}^{\mathrm{HICRA}} = \hat{A}_{i,t} + \alpha \cdot |\hat{A}_{i,t}| \]
Otherwise:
\[ \hat{A}_{i,t}^{\mathrm{HICRA}} = \hat{A}_{i,t} \]
where \(\alpha\) (e.g., 0.2) controls the strength of the amplification.
The modified advantages then enter the usual policy-gradient objective in place of the originals:
\[ \mathcal{J}(\theta) = \mathbb{E}\big[ \hat{A}_{i,t}^{\mathrm{HICRA}} \big], \qquad \nabla_\theta \mathcal{J}(\theta) = \mathbb{E}\big[ \hat{A}_{i,t}^{\mathrm{HICRA}} \cdot \nabla_\theta \log \pi_{\theta}(o_{i,t} \mid \dots) \big] \]
This channels optimization pressure toward strategic elements, accelerating the discovery and reinforcement of effective high-level reasoning patterns.
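In practice the modification is a one-line change on top of whatever token advantages the base algorithm already computes. A minimal sketch, assuming token-level advantage and mask tensors (names and shapes are illustrative):

```python
# HICRA advantage amplification: A + alpha * |A| on planning tokens, A elsewhere.
import torch


def hicra_advantages(advantages: torch.Tensor,     # (batch, seq_len) GRPO token advantages
                     planning_mask: torch.Tensor,  # (batch, seq_len) bool, True = planning token
                     alpha: float = 0.2) -> torch.Tensor:
    return torch.where(planning_mask, advantages + alpha * advantages.abs(), advantages)
```

The returned tensor simply replaces the original advantages in the surrogate loss: execution tokens still receive the usual gradient, while planning tokens get an amplified share of the credit or blame.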
Experimental Proof
Main Results
Table 1: Text-only benchmarks — HICRA consistently beats GRPO and base models across multiple LLMs.
Table 2: Multimodal benchmarks — HICRA shows similar gains for VLMs.
Error Analysis
Figure 5: Largest improvements come from correcting high-level strategic faults — procedural error rates change little.
Targeted vs. Indiscriminate Exploration
Figures 6 & 7: HICRA sustains higher semantic entropy, which correlates with accuracy gains; indiscriminate token-entropy bonuses waste learning capacity.
Semantic Entropy as Progress Compass
Figure 8: On VLMs, token entropy misleads — semantic entropy reveals continued strategic exploration.
Planning Tokens vs. High-Entropy “Fork” Tokens
Figure 9: Many high-entropy tokens lack strategic function.
Figure 10: High token entropy ≠ strategic importance; functional identification is more precise.
Conclusion and Future Directions
Core findings:
- Reasoning is Hierarchical: RL shifts the bottleneck from procedural skill to strategic planning, rediscovering a human-like thinking pattern.
- Focused Credit Works Better: HICRA amplifies learning for high-impact planning tokens, yielding superior results.
- Measure What Matters: Semantic entropy better tracks meaningful exploration than aggregate token entropy.
Implications:
- Shift from treating text as flat token sequences to recognizing semantic/strategic units in RL optimization.
- Develop process-oriented rewards valuing correct strategic moves, even if final answers are flawed.
- Extend to other reasoning-intensive domains like code generation and tool-using agents.
By uncovering the emergent hierarchy in LLM reasoning, this research not only explains existing phenomena but also provides a roadmap for building more capable, efficient AI reasoners in the future.