Large Language Models (LLMs) have grown remarkably adept at complex reasoning, successfully tackling competitive mathematics problems, logic puzzles, and intricate coding tasks. A core driver of this progress has been Reinforcement Learning with Verifiable Rewards (RLVR) — a training approach where solutions are automatically checked, and correct outputs earn rewards while incorrect ones incur penalties. This creates a powerful feedback loop for learning.

Yet there’s a persistent challenge. After an initial improvement phase, RLVR-trained models often hit a stubborn plateau in performance — and then collapse. This collapse is accompanied by a sharp drop in policy entropy, a metric reflecting how much the model explores alternative ideas. In practical terms, when entropy falls, the model stops experimenting, becomes overconfident in familiar solution paths, and loses its creative reasoning ability.

Traditionally, researchers have tried countering entropy collapse by forcing models to be more random — adding penalties for low entropy to keep the “mind” open. But as recent work from Tencent Hunyuan shows, this brute-force randomness can be harmful. It inflates the probabilities of noisy, irrelevant tokens alongside genuinely useful ones, destabilizing training and sometimes accelerating collapse.

The paper, Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward, shifts the focus to a subtler dynamic. It identifies the real culprit: the gradual elimination of rare, low-probability reasoning tokens — what the authors call Reasoning Sparks. These are tokens like “Wait…”, “Perhaps…”, or “Alternatively…” that, though infrequent, open entirely new lines of reasoning.

To protect these crucial sparks without boosting meaningless noise, the team introduces Low-probability Regularization (Lp-Reg) — a precise, targeted mechanism that preserves valuable rare tokens and fosters sustained exploration.

An illustration of a reasoning spark, and graphs showing how standard RL training collapses while Lp-Reg sustains performance by preserving valuable exploratory tokens and ignoring noise.

Figure 1: (a) A “reasoning spark” like “Wait…” can initiate a new, potentially correct reasoning path despite its low probability. (b) Standard GRPO training collapses; adding indiscriminate entropy bonuses accelerates failure. Lp-Reg remains stable. (c,d) GRPO suppresses valuable sparks, entropy bonuses amplify irrelevant noise, whereas Lp-Reg’s selective approach keeps exploration productive.


Background: Training LLMs to Reason with Rewards

Before diving into Lp-Reg, let’s recap the RLVR training pipeline and the baseline method it builds upon.

Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR capitalizes on tasks with verifiable final answers. For each problem:

  1. The LLM receives a prompt.
  2. It generates a chain-of-thought reasoning process and a final answer.
  3. An automated verifier — like a math solution checker — grades the final answer.
  4. The model receives a reward: positive if correct, negative if incorrect.

Formally:

\[ \mathcal{J}_{\mathrm{RL}}(\boldsymbol{\theta}) = \mathbb{E}_{(q,a) \sim D, o \sim \pi_{\boldsymbol{\theta}}(\cdot|q)} \left[ r(o,a) \right] \]

Here, the model’s policy \(\pi_{\boldsymbol{\theta}}\) generates an output \(o\) for a question \(q\), and the objective is to maximize the expected reward \(r(o,a)\), where the verifier checks \(o\) against the reference answer \(a\).
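As a minimal sketch of this reward rule (the function names, the `####` answer marker, and the string-match verifier below are illustrative stand-ins, not the paper's actual checker):

```python
def verify_answer(model_output: str, reference: str) -> bool:
    """Extract the text after a final-answer marker and compare it to the reference.
    The '####' marker and exact string match are simplifying assumptions; a real
    verifier would parse and check the math properly."""
    final = model_output.split("####")[-1].strip()
    return final == reference.strip()


def rlvr_reward(model_output: str, reference: str) -> float:
    """Verifiable reward: positive for a correct final answer, negative otherwise."""
    return 1.0 if verify_answer(model_output, reference) else -1.0


print(rlvr_reward("3 apples plus 4 apples is 7. #### 7", "7"))  # 1.0
print(rlvr_reward("3 apples plus 4 apples is 8. #### 8", "7"))  # -1.0
```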

Group-Relative Policy Optimization (GRPO)

GRPO is the RL algorithm that Tencent uses as the baseline. For one prompt, it generates multiple responses (\(o_1, ..., o_G\)), scores each, and normalizes the scores to create an advantage for every token:

\[ A_{i,t} = \frac{R(o_i) - \operatorname{mean}(\mathcal{G})}{\operatorname{std}(\mathcal{G})} \]

where \(\mathcal{G}\) contains the rewards of the group. These advantages guide policy updates, encouraging high-reward outputs while discouraging low-reward ones.
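A minimal sketch of this normalization (the function name and the small epsilon guard against a zero-variance group are my own additions):

```python
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Turn a group's rewards into per-response advantages, shared by every
    token of that response, following the A_{i,t} formula above."""
    g = np.asarray(rewards, dtype=np.float64)
    return (g - g.mean()) / (g.std() + 1e-8)  # epsilon avoids division by zero

# Example: four responses to one prompt; only the first two are verified correct.
print(group_relative_advantages([1.0, 1.0, -1.0, -1.0]))  # ~[ 1.  1. -1. -1.]
```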

However, GRPO’s updates can overly punish low-probability reasoning sparks when they appear in unsuccessful outputs. Over time, their probability collapses toward zero.


The Core Method: Low-probability Regularization (Lp-Reg)

Lp-Reg directly addresses the suppression of reasoning sparks by introducing a selective protective mechanism. It works by leveraging the model’s own predictions to create a less-noisy reference distribution and using it to gently regularize training updates.

Step 1: Creating a “Less-Noisy” Proxy Distribution

Lp-Reg constructs a filtered variant of the current policy: \(\pi_{\text{proxy}}\). This happens in two steps:

  1. Filter out probable noise: Tokens with \(\pi_\theta(o|\cdot) \leq \tau\) are presumed irrelevant and discarded. The threshold \(\tau\) can be:

    • Fixed — a constant value (e.g., 0.02).
    • Dynamic (min-p) — a fraction \(\kappa\) of the highest-probability token’s probability, adapting to distribution sharpness.
  2. Renormalize probabilities: Remaining tokens’ probabilities are rescaled so they sum to 1, amplifying the relative weight of preserved reasoning sparks.

A diagram showing how the proxy distribution is created by filtering out low-probability tokens and renormalizing the rest.

Figure 2: Constructing \(\pi_{\text{proxy}}\). Tokens below threshold \(\tau\) are removed; the rest are renormalized.

Mathematically:

\[ \pi_{\text{proxy}}(o|\cdot) = \begin{cases} \frac{\pi_{\theta}(o|\cdot)}{\sum_{o'} \pi_{\theta}(o'|\cdot) \mathbb{I}[\pi_{\theta}(o'|\cdot) > \tau]} & \text{if }\pi_{\theta}(o|\cdot) > \tau \\ 0 & \text{otherwise} \end{cases} \]
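A minimal sketch of this construction over a single next-token distribution (the function name and default values are illustrative; the paper's actual hyperparameters may differ):

```python
import numpy as np

def proxy_distribution(probs: np.ndarray, tau: float = 0.02,
                       kappa: float | None = None) -> np.ndarray:
    """Zero out presumed-noise tokens below a threshold and renormalize the rest.

    With `kappa` set, a min-p style threshold kappa * max(probs) is used;
    otherwise the fixed threshold `tau` applies.
    """
    threshold = kappa * probs.max() if kappa is not None else tau
    keep = probs > threshold
    proxy = np.where(keep, probs, 0.0)
    return proxy / proxy.sum()  # surviving tokens are rescaled to sum to 1

# Example: a 0.05-probability "Wait" token survives; 0.01-level noise is dropped.
p = np.array([0.90, 0.05, 0.03, 0.01, 0.01])
print(proxy_distribution(p, tau=0.02))  # [0.918..., 0.051..., 0.030..., 0., 0.]
```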

Step 2: Conditional Regularization

The proxy is integrated into GRPO with a forward KL divergence penalty:

\[ \mathcal{D}_{KL}(\pi_{\text{proxy}} \ \|\ \pi_{\theta}) \]

This penalty is selective: it applies only to tokens that satisfy all three of the following conditions:

  1. The token’s sampling probability is in the bottom \(\rho\%\) of the batch.
  2. It survives filtering (\(\pi_{\text{proxy}} > 0\)).
  3. It received a negative advantage (\(A_{i,t} < 0\)).

This prevents the model from eliminating reasoning sparks entirely while letting normal learning proceed elsewhere.
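A minimal sketch of the gating logic and the forward KL term for a single sampled token (function and argument names are my own; the cutoff corresponding to the bottom \(\rho\%\) is assumed to be computed over the batch beforehand):

```python
import numpy as np

def conditional_kl_penalty(pi_theta: np.ndarray, pi_proxy: np.ndarray,
                           token_id: int, advantage: float,
                           low_prob_cutoff: float) -> float:
    """Forward KL(pi_proxy || pi_theta) for one position, applied only when the
    sampled token is low-probability, survives the proxy filter, and carries a
    negative advantage; otherwise the penalty is zero."""
    gated = (pi_theta[token_id] < low_prob_cutoff      # bottom rho% of the batch
             and pi_proxy[token_id] > 0.0              # survived noise filtering
             and advantage < 0.0)                      # would otherwise be punished
    if not gated:
        return 0.0
    support = pi_proxy > 0.0                           # KL over the proxy's support
    return float(np.sum(pi_proxy[support] * np.log(pi_proxy[support] / pi_theta[support])))
```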

The full objective function for Lp-Reg, combining the GRPO policy gradient with the conditional KL divergence penalty.

Lp-Reg’s objective: Standard GRPO update plus a targeted KL penalty to protect valuable low-probability tokens.


Experiments and Results

State-of-the-Art Accuracy

On five math reasoning benchmarks, Lp-Reg delivers consistently superior performance. On the Qwen3-14B model, Lp-Reg reaches 60.17% average accuracy — 2.66% higher than the next-best method.

Table showing the main results on five math benchmarks. Lp-Reg (on-policy) achieves the highest average accuracy on both the 14B and 32B models.

Table 1: Across two model scales, Lp-Reg leads in average performance.

Stability in Training

Training dynamics reveal Lp-Reg’s stability: it sustains on-policy learning for ~1,000 steps, while baselines plateau or collapse.

Training dynamic curves showing response length, entropy, and accuracy over training steps. Lp-Reg shows stable and superior accuracy.

Figure 3: On Qwen3-14B, Lp-Reg (purple) maintains higher, steadier accuracy versus GRPO (teal) and Clip-Higher (orange). Entropy follows a healthy adaptive curve.


Why It Works: Ablations & Analysis

Noise Filtering is Crucial

Regularizing all low-probability tokens without filtering leads to collapse and entropy spikes.

Ablation study results showing that removing the noise filter leads to performance collapse.

Figure 4: Removing the \(\tau\) filter destabilizes training. Key takeaway: protect sparks, ignore noise.


Anatomy of Exploration

Low Probability vs. High Entropy

Word clouds illustrate why targeting low-probability tokens works better than targeting high-entropy ones. High-entropy tokens are often mundane function words or symbols (sqrt, \n, times), while low-probability tokens are rich with exploratory language: “But”, “Wait”, “Perhaps”, “Alternatively”.

Word clouds comparing high-entropy tokens with low-probability tokens.

Figure 5: High-entropy ≠ meaningful exploration. Low-probability tokens house reasoning sparks.

Token Dynamics Under Different Methods

Scatter plots track the token “wait” under three training methods:

  • GRPO: Forced into high-probability, low-entropy usage — deterministic, non-exploratory.
  • GRPO + Entropy Loss: Scattered, noisy usage; exploration is random and incoherent.
  • Lp-Reg: Balanced probabilities and entropies — enabling both confident and exploratory use.

Scatter plots showing the probability-entropy distribution of exploratory tokens for different training methods.

Figure 6: Lp-Reg preserves a healthy diversity of contexts for exploratory tokens.

Frequency charts confirm sustained use of sparks during training.

Bar chart showing the frequency of exploratory tokens over training steps for GRPO and Lp-Reg.

Figure 7: Lp-Reg consistently uses exploratory tokens more than GRPO.

Statistical Basis for Filtering

Analysis shows meaningful sparks have consistently higher average next-token probabilities than irrelevant noise in the low-probability range.

A line chart showing that the average probability of exploratory tokens is consistently higher than that of irrelevant tokens.

Figure 8: Measurable probability gap enables effective filtering.


Conclusion & Takeaways

This research reframes the exploration collapse problem in RLVR: it’s not just about maintaining high entropy, but about preventing the loss of rare, valuable reasoning sparks.

Key insights:

  1. Collapse stems from spark elimination, not just entropy reduction.
  2. Selective protection is essential — indiscriminate randomness amplifies noise.
  3. Lp-Reg’s filter + conditional KL preserves sparks without destabilizing training.
  4. Quality matters more than quantity in exploration; meaningful diversity beats blind randomness.

By safeguarding the subtle tokens that inspire fresh reasoning paths, Lp-Reg pushes LLMs toward richer, more creative problem-solving — setting a new standard for stable, high-performance training in advanced reasoning tasks.