Reinforcement Learning (RL) has been a game-changer for Large Language Models (LLMs), dramatically boosting their ability to solve complex reasoning problems. As models improve, a fundamental question has remained unanswered: how exactly does this improvement happen?

The training process often feels like a black box, producing curious phenomena such as sudden “aha moments” where a model appears to acquire a new emergent skill, or “length-scaling,” where longer, more detailed solutions lead to higher accuracy.

Are these just random artifacts of a complex system, or are they clues to a deeper underlying mechanism?

A recent paper, Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning, argues for the latter. The researchers propose that RL doesn’t simply improve every skill simultaneously. Instead, it guides the LLM to rediscover a powerful, human-like strategy for problem-solving: hierarchical reasoning — the cognitive approach humans use, separating high-level strategic planning (“What’s my plan of attack?”) from low-level procedural execution (“Now I’ll add these two numbers.”).

In this article, we’ll unpack how this reasoning hierarchy emerges during RL training, why it explains puzzling phenomena like “aha moments” and “length-scaling,” and how this insight led to a more efficient RL algorithm: Hierarchy-Aware Credit Assignment (HICRA).


The Ghost in the Machine: Human Priors and Hierarchical Thinking

LLMs are not trained from scratch — they are pre-trained on massive datasets of human-generated text, including many step-by-step problem solutions. This text encodes the patterns of human reasoning: planning, strategizing, and execution.

The authors hypothesize that RL fine-tuning doesn’t invent a new form of reasoning. Instead, it enables the model to leverage the pre-existing hierarchical structure from pre-training, revealing that separating planning from execution is an effective way to solve complex problems.

Figure 1: (Left) Human-like hierarchical reasoning: high-level strategic planning and low-level execution. (Right) Emergence during RL training: Phase ① consolidates low-level skills (a drop in execution-token entropy), then Phase ② shifts learning to strategic planning (a rise in semantic diversity, improved accuracy, and longer reasoning chains).

To study this, the researchers needed a method to automatically distinguish between:

  • High-level Planning Tokens: Strategic moves guiding the reasoning, e.g., “First, I need to understand…”, “Let’s try a different approach,” “But wait…”.
  • Low-level Execution Tokens: Operational steps, such as calculations, substitutions, and formula applications.

A token’s function depends on its context, making automatic classification challenging.


Finding the Scaffolding: Strategic Grams

The researchers introduced Strategic Grams (SGs) — n-grams of 3–5 words that act as semantic units steering logical flow. Examples: “let’s consider the case,” “the key insight is.”

These SGs have a distinct statistical signature: they appear often across different solutions but rarely multiple times in a single solution. This makes them ideal for identifying planning tokens.

SG identification pipeline (a code sketch follows the steps below):

  1. Semantic Clustering: Extract all n-grams from a large corpus of correct solutions, embed them with a pre-trained sentence transformer, and cluster semantically similar n-grams (e.g., “try another way” and “an alternative path is”).
  2. Frequency Analysis: Calculate how many different solutions contain an n-gram from each cluster.
  3. SG Construction: Take the top 20% most frequent clusters. Any token belonging to an SG in this set is a planning token; all others are execution tokens.
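
Below is a minimal sketch of this pipeline, assuming `solutions` is a list of correct solution strings. The embedding model (`all-MiniLM-L6-v2`), the clustering distance threshold, and the helper names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering


def extract_ngrams(text, n_range=(3, 5)):
    """All word n-grams of length 3-5 from one solution."""
    words = text.lower().split()
    grams = []
    for n in range(n_range[0], n_range[1] + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def identify_strategic_grams(solutions, top_frac=0.20):
    # Collect candidate n-grams, remembering which solutions each occurs in
    # (counted at most once per solution, matching the SG signature above).
    gram_to_docs = defaultdict(set)
    for doc_id, sol in enumerate(solutions):
        for g in set(extract_ngrams(sol)):
            gram_to_docs[g].add(doc_id)
    grams = list(gram_to_docs)

    # 1. Semantic clustering: embed and merge near-synonymous n-grams.
    model = SentenceTransformer("all-MiniLM-L6-v2")      # assumed embedding model
    emb = model.encode(grams, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.3,         # assumed threshold
        metric="cosine", linkage="average",              # sklearn >= 1.2
    ).fit_predict(emb)

    # 2. Frequency analysis: how many different solutions contain each cluster.
    cluster_docs = defaultdict(set)
    for g, lab in zip(grams, labels):
        cluster_docs[lab] |= gram_to_docs[g]

    # 3. SG construction: keep the top 20% most widespread clusters; any token
    #    covered by an n-gram in this set counts as a planning token.
    ranked = sorted(cluster_docs, key=lambda c: len(cluster_docs[c]), reverse=True)
    keep = set(ranked[: max(1, int(top_frac * len(ranked)))])
    return {g for g, lab in zip(grams, labels) if lab in keep}
```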

Figure 2: A reasoning trace from Qwen3-4B-GRPO with planning tokens (Strategic Grams) highlighted in color. These phrases steer the logical flow through high-level moves such as deduction, branching, and backtracking.


The Two-Phase Emergence of Hierarchical Reasoning

Across experiments with eight different LLMs and VLMs, reasoning improvement followed a consistent two-phase pattern.

Phase 1: Procedural Consolidation

Initially, the model focuses on mastering low-level skills. One wrong calculation can sink an answer, so RL pressures the model to achieve procedural reliability.

Metrics for execution tokens reveal this phase (a code sketch of both follows the list):

  • Relative Perplexity: Measures surprise — lower values indicate higher confidence. Execution token perplexity drops sharply early in training.
  • Token Entropy: Measures uncertainty over next-token predictions — execution token entropy starts low and decreases further.
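
A minimal sketch of how these two diagnostics could be computed, assuming access to the policy's per-position logits and, for relative perplexity, a reference model's log-probabilities over the same tokens; the exact definitions (e.g., the choice of reference) are assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-position entropy of the next-token distribution. logits: [seq_len, vocab]."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)            # [seq_len]


def relative_perplexity(policy_logp: torch.Tensor,     # log-prob of each realized token
                        ref_logp: torch.Tensor,        # same, under a frozen reference model
                        mask: torch.Tensor) -> torch.Tensor:
    """Policy perplexity divided by reference perplexity, averaged over the
    tokens selected by `mask` (e.g., execution tokens only)."""
    n = mask.sum().clamp(min=1)
    ppl_policy = torch.exp(-(policy_logp * mask).sum() / n)
    ppl_ref = torch.exp(-(ref_logp * mask).sum() / n)
    return ppl_policy / ppl_ref


# Usage (assumed tensors): exec_mask marks execution tokens, i.e. everything
# not matched by a Strategic Gram.
# exec_entropy = token_entropy(logits)[exec_mask.bool()].mean()
```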

Figure 3: Training dynamics for three different LLMs reveal the same two-phase process. Phase ① shows a sharp drop in perplexity and entropy for execution tokens (gray), indicating procedural consolidation. Phase ② shows a rise in semantic diversity for planning tokens (red), correlating with accuracy gains and longer reasoning chains.

Takeaway 1: The first phase builds a robust low-level skillset, paving the way for improvements driven by strategic reasoning.


Phase 2: Strategic Exploration

Once procedural skills plateau, improvement comes from diversifying strategic plans.

To track this, the authors measured Semantic Entropy — diversity in SG usage — and Conditional Entropy for procedural tokens following a strategic move.

Figure 4: Token entropy vs. semantic entropy. Token entropy is computed over next-token probabilities at a single position; semantic entropy measures diversity across meaningful n-grams, capturing the variety of ideas rather than single-word uncertainty.
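
To make the distinction concrete, here is a minimal sketch of semantic entropy as the Shannon entropy of the distribution over Strategic-Gram clusters used across a batch of rollouts. The `sg_cluster_of` mapping is assumed to come from the SG pipeline sketched earlier; this is an illustration, not the paper's exact estimator.

```python
import math
from collections import Counter


def semantic_entropy(rollouts, sg_cluster_of, n_range=(3, 5)):
    """Shannon entropy of the distribution over Strategic-Gram clusters that
    appear in a batch of rollouts. `sg_cluster_of(ngram)` returns a cluster id
    for SG n-grams and None otherwise."""
    counts = Counter()
    for text in rollouts:
        words = text.lower().split()
        for n in range(n_range[0], n_range[1] + 1):
            for i in range(len(words) - n + 1):
                cluster = sg_cluster_of(" ".join(words[i:i + n]))
                if cluster is not None:
                    counts[cluster] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```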

Results:

  • Semantic entropy of planning tokens rises steadily (Figure 3, Column 3), showing active exploration of new strategies.
  • This correlates with increases in accuracy and solution length (Column 4), since more complex plans naturally take more tokens to spell out and execute.

Takeaway 2: Sustained reasoning gains after procedural mastery come from expanding strategic diversity — explaining “aha moments” and “length-scaling.”


HICRA: Focused Credit Assignment

This two-phase insight reveals a flaw in standard RL methods like GRPO: they spread a single outcome-based advantage uniformly across every token in a response. Since most tokens are low-level execution tokens, the learning signal for the few pivotal planning tokens gets diluted.

Hierarchy-Aware Credit Assignment (HICRA) modifies this:

For planning tokens \( t \in S_i \):

\[ \hat{A}_{i,t}^{\mathrm{HICRA}} = \hat{A}_{i,t} + \alpha \cdot |\hat{A}_{i,t}| \]

Otherwise:

\[ \hat{A}_{i,t}^{\mathrm{HICRA}} = \hat{A}_{i,t} \]

Where \(\alpha\) (e.g., 0.2) controls amplification.

\[ \mathcal{J}(\theta) = \mathbb{E}\big[ \hat{A}_{i,t}^{\mathrm{HICRA}} \, \log \pi_{\theta}(o_{i,t} \mid \dots) \big], \quad \nabla_{\theta} \mathcal{J}(\theta) = \mathbb{E}\big[ \hat{A}_{i,t}^{\mathrm{HICRA}} \cdot \nabla_{\theta} \log \pi_{\theta}(o_{i,t} \mid \dots) \big] \]

This channels optimization pressure toward strategic elements, accelerating discovery and reinforcement of effective high-level reasoning patterns.
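
A minimal sketch of this credit assignment in PyTorch, assuming group-relative advantages and a 0/1 planning-token mask from the SG matcher are already available; the clipping, importance ratios, and KL terms of the full GRPO-style objective are omitted for brevity.

```python
import torch


def hicra_advantages(adv: torch.Tensor,
                     planning_mask: torch.Tensor,
                     alpha: float = 0.2) -> torch.Tensor:
    """Amplify advantages on planning tokens: A + alpha * |A| where the mask
    is 1, unchanged elsewhere. Shapes: [batch, seq_len]."""
    return adv + alpha * planning_mask * adv.abs()


def hicra_policy_loss(logp: torch.Tensor,          # log pi_theta(o_t | ...), [batch, seq_len]
                      adv: torch.Tensor,           # group-relative advantages
                      planning_mask: torch.Tensor, # 1 on Strategic-Gram tokens, else 0
                      valid_mask: torch.Tensor,    # 1 on generated (non-padding) tokens
                      alpha: float = 0.2) -> torch.Tensor:
    """Simplified policy-gradient surrogate with HICRA advantages; the clipped
    GRPO machinery (ratios, KL penalty) is left out of this sketch."""
    a = hicra_advantages(adv, planning_mask, alpha).detach()
    per_token = -a * logp                          # minimize -E[A_hat * log pi]
    return (per_token * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```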


Experimental Proof

Main Results

Table 1: Text-only math benchmarks. HICRA consistently outperforms GRPO and the base models across multiple LLMs, including Qwen3-4B and Llama-3.1-8B.

Table 2: Multimodal reasoning benchmarks. HICRA shows similar gains for Vision-Language Models (VLMs) such as MiMO-VL and Qwen2.5-VL-7B.

Error Analysis

Figure 5: Planning & Strategy errors decrease far more sharply than other error types during training; the largest improvements come from correcting high-level strategic faults, while procedural error rates change little.


Targeted vs. Indiscriminate Exploration

HICRA achieves higher semantic entropy and validation accuracy than GRPO (Figure 6, left), while Entropy Regularization boosts token entropy but fails to improve accuracy (Figure 7, right).

Figures 6 & 7: HICRA sustains higher semantic entropy, which tracks its accuracy gains; blanket boosts to token entropy waste learning capacity without improving accuracy.
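
For contrast, the "indiscriminate" baseline here is a standard entropy-regularization bonus added to the loss, which rewards spreading probability mass over the whole vocabulary at every position rather than exploring new strategies. A minimal sketch, with the weight `beta` and mask conventions as assumptions:

```python
import torch
import torch.nn.functional as F


def entropy_bonus(logits: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """Mean next-token entropy over generated tokens. Maximizing this inflates
    token entropy everywhere, without favoring genuinely new strategies."""
    logp = F.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(dim=-1)             # [batch, seq_len]
    return (ent * valid_mask).sum() / valid_mask.sum().clamp(min=1)


# Blanket exploration:  loss = pg_loss - beta * entropy_bonus(logits, valid_mask)
# Targeted exploration: HICRA instead amplifies advantages only on planning tokens.
```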


Semantic Entropy as Progress Compass

Figure 8: Training dynamics on a Vision-Language Model. Token entropy collapses and Pass@8 saturates, signals that would suggest exploration has stalled, while semantic entropy remains predictive of HICRA's accuracy gains, revealing continued strategic exploration.


Planning Tokens vs. High-Entropy “Fork” Tokens

Figure 9: A reasoning example with planning tokens (blue/purple) and high-entropy tokens (red/purple) highlighted. Many high-entropy tokens carry no strategic function.

Figure 10: (Left) The majority of planning tokens are high-entropy, but (Right) only a small fraction of high-entropy tokens are planning tokens. High token entropy is therefore not a reliable proxy for strategic importance; identifying planning tokens by function is more precise.
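
The asymmetry in Figure 10 reduces to two overlap fractions over the same trace. A minimal sketch, assuming boolean masks from the SG matcher (planning tokens) and from an entropy threshold (high-entropy tokens):

```python
import torch


def overlap_fractions(planning_mask: torch.Tensor,
                      high_entropy_mask: torch.Tensor) -> tuple[float, float]:
    """Left panel: share of planning tokens that are also high-entropy.
    Right panel: share of high-entropy tokens that are also planning tokens."""
    planning = planning_mask.bool()
    high_ent = high_entropy_mask.bool()
    both = (planning & high_ent).sum().float()
    planning_that_are_high_ent = (both / planning.sum().clamp(min=1)).item()
    high_ent_that_are_planning = (both / high_ent.sum().clamp(min=1)).item()
    return planning_that_are_high_ent, high_ent_that_are_planning
```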


Conclusion and Future Directions

Core findings:

  1. Reasoning is Hierarchical: RL shifts the bottleneck from procedural skill to strategic planning, rediscovering a human-like thinking pattern.
  2. Focused Credit Works Better: HICRA amplifies learning for high-impact planning tokens, yielding superior results.
  3. Measure What Matters: Semantic entropy better tracks meaningful exploration than aggregate token entropy.

Implications:

  • Shift from treating text as flat token sequences to recognizing semantic/strategic units in RL optimization.
  • Develop process-oriented rewards valuing correct strategic moves, even if final answers are flawed.
  • Extend to other reasoning-intensive domains like code generation and tool-using agents.

By uncovering the emergent hierarchy in LLM reasoning, this research not only explains existing phenomena but also provides a roadmap for building more capable, efficient AI reasoners in the future.