Large Language Models (LLMs) are getting remarkably good at complex reasoning tasks, from solving math competition problems to writing code. A key technique driving this progress is Reinforcement Learning (RL), specifically a paradigm called Reinforcement Learning with Verifiable Rewards (RLVR). In RLVR, we treat an LLM’s reasoning process—its chain of thought—as a sequence of actions. If the final answer is correct, the model gets a reward. It’s a simple yet powerful way to teach models to “think” better.

But there’s a catch. Most RLVR methods are on-policy. Imagine you’re practicing for a math test: you solve a problem, check the answer, learn a little something—then crumple up your scratch paper and throw it away, never to be seen again. That’s essentially what on-policy training does: it generates a batch of reasoning attempts (experiences), uses them for a single gradient update, and then discards them. This is incredibly inefficient, wasting massive amounts of computation and missing valuable learning opportunities from past successes and failures.

What if, instead of throwing away the scratch paper, we kept a well-organized notebook of our attempts? What if we could revisit our most insightful solutions and learn from them again? This is the core idea behind experience replay, a classic RL technique. But for the complex, nuanced world of LLM reasoning, a simple replay buffer isn’t enough. We need to answer a fundamental question:

What makes a reasoning experience valuable?

A new paper, ExGRPO: Learning to Reason from Experience, tackles this question head-on. The researchers first investigate which properties make a reasoning experience valuable and then use those insights to build ExGRPO (Experiential Group Relative Policy Optimization)—a framework that intelligently manages and reuses past reasoning trajectories. Their results are striking: ExGRPO not only boosts reasoning performance significantly but also stabilizes training for models where standard on-policy methods completely fail.

Let’s dive in.

Background: The Building Blocks of RL for Reasoning

Before we get to the secret sauce of ExGRPO, let’s briefly review the two foundational concepts it’s built upon: RLVR and GRPO.

Reinforcement Learning with Verifiable Rewards (RLVR)

In RLVR:

  • Agent: The LLM.
  • Action: Generating the next token of the reasoning chain (a proof step, for example, is built up token by token).
  • Trajectory: The completed chain-of-thought solution.
  • Reward: The “verifiable” part—tasks like math let us check answers automatically. The reward function is usually binary: +1 for correct, 0 for incorrect.

Figure: A simple binary reward function for verifiable tasks — +1 for a correct answer, 0 otherwise.

This setup enables RL algorithms to optimize the model’s policy (its token-generation strategy) to maximize those +1 rewards.
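To make the reward concrete, here is a minimal sketch of a verifier-style reward for math answers. It assumes the final answer appears in a \boxed{...} expression (a common convention, but an assumption here, not something the paper specifies); a real verifier would normalize formats and check answers far more robustly.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a chain-of-thought string."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary RLVR reward: +1 if the extracted final answer matches the reference, else 0."""
    predicted = extract_boxed(completion)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0

# A correct trajectory earns 1.0; anything else earns 0.0.
print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
```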

Group Relative Policy Optimization (GRPO)

Estimating how much better a given action is than average (its advantage) often requires a separate value model, which adds complexity. GRPO is a clever alternative: it generates a group of \(K\) solutions for the same problem, then compares each solution’s reward to the group’s mean.

If a trajectory earns a reward of 1 while the group average is low, it gets a high advantage signal. Formally:

Figure: GRPO estimates the advantage by normalizing each trajectory’s reward against the group’s mean.
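Concretely, the group-relative advantage is usually written as the trajectory’s reward minus the group mean, often divided by the group’s standard deviation (a sketch of the standard formulation, not copied from the paper):

\[
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\left(\{r_1, \dots, r_K\}\right)}{\operatorname{std}\left(\{r_1, \dots, r_K\}\right)}
\]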

The on-policy GRPO objective increases the probability of generating high-advantage trajectories:

Figure: The on-policy GRPO objective function.
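Schematically, this is a PPO-style clipped surrogate averaged over the \(K\) trajectories and their tokens (details such as KL regularization vary by implementation; this is a sketch of the common form rather than the paper’s exact equation):

\[
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{K} \sum_{i=1}^{K} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big( w_{i,t}\,\hat{A}_i,\ \operatorname{clip}\big(w_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right],
\qquad
w_{i,t} = \frac{\pi_\theta\!\left(o_{i,t} \mid q,\, o_{i,<t}\right)}{\pi_{\theta_{\text{old}}}\!\left(o_{i,t} \mid q,\, o_{i,<t}\right)}
\]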

GRPO is strong—but still on-policy. Each group of solutions is generated, used once, then discarded. That’s the inefficiency ExGRPO aims to solve.

What Makes a Reasoning Experience Valuable? A Preliminary Study

Before building a replay buffer, the ExGRPO authors asked: what should go into it?

They trained a model using standard on-policy RLVR and analyzed thousands of reasoning trajectories.

Finding #1: The “Goldilocks Zone” of Difficulty

They bucketed questions by the model’s real-time success rate:

  • Easy: 75–100% success.
  • Medium: 25–75%.
  • Hard: 0–25%.
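A tiny helper makes those thresholds concrete (the boundary handling here is my assumption; the text above only specifies the ranges):

```python
def difficulty_bucket(success_rate: float) -> str:
    """Map a question's real-time success rate (0.0-1.0) to a difficulty bucket."""
    if success_rate >= 0.75:
        return "easy"      # 75-100% of rollouts succeed
    if success_rate >= 0.25:
        return "medium"    # 25-75%
    return "hard"          # 0-25%
```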

They then trained separate models, each restricted to a single difficulty bucket; as Figure 1a shows, the Medium bucket yielded the best performance.

Figure 1: Question difficulty and entropy in on-policy RLVR training. (a) Medium-difficulty questions lead to the best performance. (b) Correct reasoning chains have lower entropy than wrong ones. (c) Medium difficulty yields the tightest low-entropy distributions.

Easy questions give little new information; Hard ones offer sparse, noisy signals. Medium sits in the sweet spot—the “zone of proximal development”—where the model is challenged yet still able to learn.

Finding #2: Low Entropy Marks High-Quality Reasoning

Final correctness doesn’t guarantee the reasoning was sound—the model could have guessed correctly. To check, the authors used Qwen3-32B as an external judge of logical validity.

They also measured entropy (the model’s uncertainty at each generated token). Correct reasoning chains consistently showed lower entropy than incorrect ones. Hence, when multiple correct solutions exist, the lowest-entropy one is the most likely to reflect genuinely sound reasoning.

Medium difficulty questions also concentrated these low-entropy correct solutions, reinforcing their value.
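In practice, “entropy” here is the average token-level entropy of a trajectory under the policy. The sketch below (illustrative names, computed from per-token probability distributions) shows how one might score stored solutions and keep the most confident correct one:

```python
import math

def mean_token_entropy(token_distributions: list[list[float]]) -> float:
    """Average Shannon entropy (in nats) over a trajectory's per-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / max(len(entropies), 1)

def lowest_entropy_solution(correct_trajectories: list[dict]) -> dict:
    """Among correct solutions to the same question, keep the one the model was surest about."""
    return min(correct_trajectories, key=lambda t: mean_token_entropy(t["token_dists"]))
```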

From these findings:

  1. Prioritize medium difficulty questions.
  2. Within them, replay the lowest-entropy successful trajectories.

The ExGRPO Framework: Smart Experience Management + Optimization

Armed with these insights, the team designed ExGRPO (Figure 2), with two main phases: Experience Management and Mixed-Policy Experiential Optimization.

Figure 2: Overview of ExGRPO. (a) The Experience Management pipeline; (b) policy optimization integrating on-policy rollouts with replayed data.

Phase 1: Experience Management

The buffer is not a simple FIFO queue—it’s structured to surface the most valuable data at the right moment.

  1. Collection: During training, every successful trajectory is stored: question + trajectory + latest success rate.
  2. Partition: The buffer is divided into buckets by current success rate (0–25%, 25–50%, etc.). Questions fully mastered are moved to a Retired Set to avoid wasting updates on trivially solved items.
  3. Selection:
    • Question Sampling: Gaussian sampling centered at a 50% success rate biases toward Medium difficulty.
    • Trajectory Selection: Among stored successes, choose the one with lowest entropy under the current policy.
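Putting the pipeline together, here is a compact sketch of an experience buffer with success-rate bookkeeping, retirement, Gaussian question sampling around 50%, and lowest-entropy trajectory selection. The class, its attribute names, and the Gaussian width are illustrative assumptions, not the paper’s implementation:

```python
import math
import random
from collections import defaultdict

class ExperienceBuffer:
    def __init__(self, retire_threshold: float = 1.0, sigma: float = 0.2):
        self.store = defaultdict(list)   # question_id -> successful trajectories
        self.success_rate = {}           # question_id -> latest rollout success rate
        self.retired = set()             # fully mastered questions, excluded from replay
        self.retire_threshold = retire_threshold
        self.sigma = sigma               # width of the Gaussian sampling curve

    def add(self, question_id: str, trajectory: dict, success_rate: float) -> None:
        """Store a successful trajectory with the question's latest success rate."""
        self.success_rate[question_id] = success_rate
        if success_rate >= self.retire_threshold:
            self.retired.add(question_id)   # don't waste updates on solved questions
        else:
            self.store[question_id].append(trajectory)

    def sample_question(self) -> str:
        """Bias sampling toward medium difficulty via a Gaussian centered at 50% success."""
        candidates = [q for q in self.store if q not in self.retired]
        weights = [
            math.exp(-((self.success_rate[q] - 0.5) ** 2) / (2 * self.sigma ** 2))
            for q in candidates
        ]
        return random.choices(candidates, weights=weights, k=1)[0]

    def sample_trajectory(self, question_id: str, entropy_fn) -> dict:
        """Replay the stored success with the lowest entropy under the current policy."""
        return min(self.store[question_id], key=entropy_fn)
```

Note that entropy is recomputed under the current policy (hence the entropy function argument rather than a cached score), which matches the selection rule described above.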

Phase 2: Experiential Policy Optimization

Each training batch mixes on-policy items (fresh exploration) with experiential items (replayed from the buffer), with the ratio \( \rho \) controlling the fraction of the batch drawn from replay. The final loss combines the standard GRPO term for the on-policy portion with an importance-weighted term for the replayed data.

Figure: The on-policy component of the ExGRPO objective (the standard GRPO loss).

Figure: The experiential (off-policy) component, which importance-weights replayed trajectories to correct for data generated under a stale policy.
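Read at a high level, the combined objective is a mix of the two terms, with replayed tokens reweighted by how much the current policy has drifted from the one that generated them (a simplified sketch; the exact weighting, clipping, and shaping follow the figures above):

\[
\mathcal{L}_{\text{ExGRPO}} \;=\; (1-\rho)\,\mathcal{L}_{\text{on-policy}} \;+\; \rho\,\mathcal{L}_{\text{experience}},
\qquad
\text{with importance weights } \; w = \frac{\pi_\theta(o \mid q)}{\pi_{\text{behavior}}(o \mid q)} \; \text{applied to the replayed data.}
\]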

Two stabilizers:

  • Policy Shaping: Apply a non-linear transform to importance weights, dampening overly confident signals to preserve exploration.
  • Delayed Start: Replay begins only after a performance threshold is met, avoiding low-quality early data.
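As an illustration of both stabilizers, the sketch below dampens importance weights with a clip followed by a square root and gates replay behind a reward threshold. The specific transform and threshold value are illustrative choices, not the paper’s exact functions:

```python
import torch

def shape_importance_weights(weights: torch.Tensor, clip_max: float = 2.0) -> torch.Tensor:
    """Clip, then compress toward 1.0, so a few over-confident replayed tokens can't dominate."""
    return weights.clamp(max=clip_max).sqrt()

def replay_fraction(avg_reward: float, start_threshold: float = 0.3, rho: float = 0.5) -> float:
    """Delayed start: use no replayed data until average reward clears a (hypothetical) threshold."""
    return rho if avg_reward >= start_threshold else 0.0
```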

Results: Better, Faster, and More Stable

The team tested ExGRPO on five backbone models (Qwen and Llama families, 1.5B–8B parameters) across nine math and general reasoning benchmarks.

Consistent Gains

In Table 1, ExGRPO beats on-policy RLVR across the board. On Qwen2.5-Math-7B it gains an average of +3.5 points in-distribution and +7.6 points out-of-distribution, with particularly strong boosts on hard benchmarks like AIME.

Figure: Overall performance of Qwen2.5-Math-7B. ExGRPO shows significant gains over on-policy RLVR.

And across architectures and scales (Figure 3), ExGRPO stays robust:

Figure 3: Benchmark performance across model families, scales, and tuning states. ExGRPO (pink) consistently surpasses on-policy RLVR (blue).

Stabilizing Weak Models

For the Llama-3.1 8B base model, on-policy training collapsed: rewards hovered near zero, entropy exploded, and responses ballooned in length. ExGRPO prevented the collapse by replaying the model’s early “lucky hits,” which provided a stable learning signal.

Figure 4: Training dynamics. On-policy training collapses (blue); ExGRPO stabilizes it (pink).

It’s Not Just What You Replay — But How

Ablations (Figure 7) show that removing question selection, trajectory selection, or policy shaping each hurts performance. The replay ratio matters too: \( \rho = 0.5 \) (a 50/50 mix) hits the sweet spot. Too high (75%) stifles exploration; too low (25%) underuses the stored experience.

Figure 6: Dynamics of the experience buffer. Replay efficiency hinges on principled selection and balancing, not sheer volume.

Figure 7: Ablation results. Full ExGRPO (pink) beats variants missing individual components; each component matters for peak performance.

Conclusion: The Era of Experience

The ExGRPO paper makes a powerful case: for LLMs to become better reasoners, they must learn more effectively from their own experiences. By breaking from wasteful on-policy paradigms, ExGRPO shows that principled experience management unlocks efficiency, stability, and scalability.

Key takeaways:

  • Not all experience is equal: Medium difficulty problems with low-entropy successes are premium training fuel.
  • Structure matters: Organized replay with difficulty/entropy-based prioritization pays dividends.
  • Balance is vital: The right mix of past and new data stabilizes and accelerates learning, even rescuing failing runs.

As models grow, learning to reason from experience won’t just be nice—it’ll be necessary.