Large Language Models (LLMs) have shown an extraordinary capacity to improve themselves through reinforcement learning (RL). By generating solutions, receiving feedback, and adjusting their strategy, they can learn to tackle complex problems such as advanced mathematical reasoning. This process hinges on a critical step: exploration—trying many different approaches, or “rollouts,” to discover what works.

The catch? Exploration is computationally expensive. Generating thousands of possible solutions for thousands of different problems consumes massive amounts of GPU time. To keep costs manageable, current methods typically assign a small, fixed exploration budget to every problem—often 8 possible solutions per task.

This one-size-fits-all approach hides a serious flaw:

  • For an easy problem, 8 attempts may be overkill, wasting compute.
  • For a hard problem, 8 attempts may be hopelessly insufficient, leading to repeated failure.

In both cases, the model can end up always succeeding or always failing on a given task. This produces a “zero gradient”—no learning signal—wasting precious computational resources.

A recent paper, “Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation”, tackles this problem head-on. The researchers propose: treat exploration budget allocation as a classic optimization problem. Frame each task as an item with a cost (computation) and value (learning potential). Then, use the well-known knapsack problem to intelligently distribute a fixed budget, giving more resources to the tasks that need them most.

Figure 1 The Knapsack RL framework for allocating exploration budgets among tasks: tasks are modeled as items with learning value and computational cost, and allocation is optimized via knapsack. Compared with uniform (homogeneous) allocation, GPU resources are shifted from easy tasks to hard ones to maximize learning value.


Background: The Zero-Gradient Problem in Policy Optimization

To see the significance of this idea, let’s look at RL fine-tuning for LLMs. The goal is to train model parameters \(\theta\) to generate responses \(y\) for prompt \(x\) that maximize a reward:

\[ \max_{\theta} \ \mathbb{E}_{y \sim \pi_{\theta}(\cdot|x)}\,[\,r(x,y)\,] \]

In mathematical reasoning tasks, the reward is often binary:

\[ r(x,y) = \mathbb{I}(\text{answer is correct}) \]

A popular RL algorithm is Group Relative Policy Optimization (GRPO). For a batch of \(M\) prompts, the model generates \(N\) rollouts for each prompt. The gradient update is:

\[ g(\theta) = \sum_{i=1}^{M} \sum_{j=1}^{N} \nabla_{\theta} \log \pi_{\theta}(y_{ij}|x_i) \cdot (\,r(x_i,y_{ij}) - b_i\,) \cdot c_i \]

Here:

  • \(b_i\) — average reward for prompt \(i\) (baseline).
  • \(r(x_i, y_{ij}) - b_i\) — the “advantage” of rollout \(j\).
  • \(c_i\) — a per-prompt scaling factor (in GRPO, the advantage is typically normalized by the group’s reward standard deviation).

If a response’s reward is above average, the model increases its probability; if below average, it decreases it.


Critical issue: If all \(N\) rollouts for a prompt have the same reward (all successes or all failures):

  1. All successes: Rewards = 1, baseline \(b_i\) = 1, advantage = 0 → gradient = 0.
  2. All failures: Rewards = 0, baseline \(b_i\) = 0, advantage = 0 → gradient = 0.

In both cases, the model spends computation but learns nothing.
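To make the zero-gradient case concrete, here is a minimal NumPy sketch (our own illustration with our own function names, not the paper’s code) of a group-relative advantage computation: when all rewards in a group are identical, every advantage is zero and the policy-gradient contribution vanishes.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: subtract the group mean and divide by the group std."""
    baseline = rewards.mean()            # b_i: average reward for this prompt
    scale = rewards.std() + eps          # c_i-style normalization; eps avoids division by zero
    return (rewards - baseline) / scale

mixed    = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # some successes, some failures
all_fail = np.zeros(8)                                          # the "all failures" case

print(group_relative_advantages(mixed))     # non-zero advantages -> useful gradient
print(group_relative_advantages(all_fail))  # all zeros -> no learning signal
```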


Measuring the Inefficiency of Uniform Budgets

The authors introduce the effective gradient ratio: the fraction of generated responses that actually contribute a non-zero gradient:

\[ \text{effective-gradient-ratio} = \frac{1}{M N} \sum_{i=1}^{M} \sum_{j=1}^N \mathbb{I}(g_{i,j} \neq 0) \]
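As a quick self-contained illustration (our own sketch, assuming the binary-reward setting above), the ratio can be computed directly from a batch of rollout rewards:

```python
import numpy as np

def effective_gradient_ratio(rewards: np.ndarray) -> float:
    """rewards: (M, N) matrix of binary rewards -- M prompts, N rollouts each."""
    mean_per_prompt = rewards.mean(axis=1)
    # A rollout contributes a gradient only if its prompt's group has both successes and
    # failures, so the per-rollout ratio equals the fraction of prompts with mixed outcomes.
    mixed = (mean_per_prompt > 0) & (mean_per_prompt < 1)
    return float(mixed.mean())

rewards = np.array([[1, 1, 1, 1],    # all successes -> zero gradient
                    [0, 0, 0, 0],    # all failures  -> zero gradient
                    [1, 0, 0, 1]])   # mixed outcome -> effective gradient
print(effective_gradient_ratio(rewards))  # 0.333...
```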

Tracking this during training for a 7B-parameter math model (uniform budget \(N=8\)) revealed worrying trends.

Figure 2 Effective gradient ratio (blue) and zero-gradient breakdown during training: the ratio declines over iterations, with all-failure prompts (green) dominating early training and all-success prompts (orange) dominating late training, indicating that many tasks are too easy or too hard.

Observations:

  • Early Training: Model fails most tasks; all-failure prompts dominate.
  • Mid Training: Mix of successes and failures yields highest efficiency.
  • Late Training: Easy tasks dominate all-success prompts; hard tasks persist as all-failure; efficiency drops sharply.

How Many Rollouts Do You Really Need?

For a prompt with success rate \(p_i\), the probability that \(N\) rollouts produce a non-zero gradient is:

\[ \text{Prob} = 1 - p_i^{N} - (1 - p_i)^{N} \]

For example:

  • \(p_i = 0.5\): about 3 rollouts are needed on average before a non-zero gradient appears.
  • \(p_i = 0.01\): about 100 rollouts on average, and roughly 229 rollouts for 90% certainty of a non-zero gradient.

Figure 3 Exploration budget required for a non-zero gradient, grouped by success rate: tasks with success rates near 0 or 1 need hundreds of rollouts, while mid-range tasks need only a few.

A uniform budget of \(N = 8\) is therefore only effective for tasks with success rates roughly in \([0.1, 0.9]\).
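These figures follow directly from the formula above; here is a short arithmetic check (our own sketch, assuming independent rollouts with success probability \(p\)):

```python
def prob_nonzero_gradient(p: float, n: int) -> float:
    """Probability that n independent rollouts contain both a success and a failure."""
    return 1 - p**n - (1 - p)**n

def expected_rollouts_for_mixed(p: float) -> float:
    """Expected number of rollouts until both a success and a failure have appeared."""
    return 1 + p / (1 - p) + (1 - p) / p

print(expected_rollouts_for_mixed(0.5))             # 3.0
print(round(expected_rollouts_for_mixed(0.01)))     # 100
print(round(prob_nonzero_gradient(0.01, 229), 3))   # ~0.9: roughly 229 rollouts for 90% certainty
print(round(prob_nonzero_gradient(0.1, 8), 2))      # ~0.57: uniform N=8 is marginal even at p=0.1
```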

This is the computation–exploration dilemma:

  • Raise \(N\) uniformly → huge cost.
  • Filter easy/hard tasks → lose training opportunities on challenging problems.

The Knapsack RL Solution

Instead of uniform budgets, reallocate fixed total compute more intelligently.

RL Exploration Setting         | Knapsack Analogy
Task with exploration budget   | Item in knapsack
Number of rollouts             | Item weight
Learning benefit from budget   | Item value
Total available computation    | Knapsack capacity
Available GPUs                 | Knapsack physical container

Table 1 Connection between RL exploration and the knapsack problem.


Defining Task “Value”

The value of allocating \(N_i\) rollouts to task \(i\):

\[ \text{Value}(N_i, p_i) = \text{ProbNonZeroGradient}(N_i, p_i) \times \text{InfoGain}(p_i) \]

  • \(\text{ProbNonZeroGradient}\): \(1 - p_i^{N_i} - (1-p_i)^{N_i}\) — likelihood of mixed successes/failures.
  • \(\text{InfoGain}(p_i)\): Estimated learning potential, approximated as \(p_i(1-p_i)^2\).

Implications:

  • Low value for very easy (\(p_i \to 1\)) or very hard (\(p_i \to 0\)) tasks.
  • Peak value near \(p_i = 1/3\) — “learning sweet spot.”
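Given per-task success-rate estimates \(p_i\) and a total budget of \(B\) rollouts, the allocation can then be written (in our notation, treating each rollout as one unit of cost; the paper’s exact formulation may differ in detail) as a knapsack-style program:

\[ \max_{N_1, \ldots, N_M \,\ge\, 0} \ \sum_{i=1}^{M} \text{Value}(N_i, p_i) \quad \text{subject to} \quad \sum_{i=1}^{M} N_i \le B \]

This mirrors Table 1: each additional rollout adds one unit of weight, and its value depends on the task’s current success rate.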

Figure 4 Learning value as a function of success rate and exploration budget: the sweet spot lies near p = 1/3 at low N, while harder and easier tasks need larger budgets to reach comparable value.


Implementation Highlights

  • Estimating success rates: Use the previous step’s observed success rates as estimates of \(p_i\) (see the allocation sketch after this list).
  • Fallback strategy: Reallocate freed budget from easy tasks to extremely hard ones (\(p=0\)) to keep them in play.
  • Rollout balancing: Distribute allocated rollouts evenly among workers to avoid idle GPUs.
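Putting these pieces together, here is a minimal sketch (our own, not the paper’s implementation) of a greedy allocator: because \(\text{Value}(N_i, p_i)\) is increasing and concave in \(N_i\), repeatedly granting the next rollout to the task with the largest marginal gain solves this unit-cost knapsack. The fallback for extremely hard tasks and the GPU-level rollout balancing described above are omitted.

```python
import heapq

def info_gain(p: float) -> float:
    """Estimated learning potential of a task with success rate p (peaks near p = 1/3)."""
    return p * (1 - p) ** 2

def value(n: int, p: float) -> float:
    """Expected learning value of spending n rollouts on a task with success rate p."""
    if n <= 0:
        return 0.0
    return (1 - p**n - (1 - p)**n) * info_gain(p)

def allocate_budget(success_rates, total_budget, n_min=1):
    """Greedily assign rollouts to tasks by marginal value.

    value(n, p) is increasing and concave in n, so always granting the next rollout to the
    task with the largest marginal gain is optimal for this unit-cost knapsack. The paper
    additionally reallocates leftover budget to extremely hard tasks (p near 0), omitted here.
    """
    alloc = [n_min] * len(success_rates)                 # floor of n_min rollouts per task
    remaining = total_budget - n_min * len(success_rates)
    heap = [(-(value(n_min + 1, p) - value(n_min, p)), i)
            for i, p in enumerate(success_rates)]        # max-heap via negated marginal gains
    heapq.heapify(heap)
    for _ in range(max(remaining, 0)):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        p, n = success_rates[i], alloc[i]
        heapq.heappush(heap, (-(value(n + 1, p) - value(n, p)), i))
    return alloc

# Example: success rates estimated from the previous step, total budget of 32 rollouts.
rates = [0.9, 0.5, 0.33, 0.05]
print(allocate_budget(rates, total_budget=32))  # easy task gets few rollouts; mid/hard get most
```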

Experiments: Proof of Concept

Setup: Compare Knapsack-GRPO to baseline GRPO across multiple LLMs for math-intensive benchmarks (AIME, AMC, MATH, MINERVA, OLYMPIAD) and GPQA.

Model                  | AIME | AMC  | MATH | MINERVA | OLYMPIAD | GPQA | Avg
DPSK-R1-Distill-1.5B   | 25.3 | 62.1 | 81.4 | 25.8    | 41.7     | 39.1 | 42.9
  + GRPO               | 27.6 | 71.1 | 84.0 | 27.6    | 46.4     | 36.7 | 45.9
  + Knapsack-GRPO      | 34.0 | 75.1 | 86.7 | 28.5    | 49.7     | 40.3 | 49.7
Qwen3-4B-Base          |  6.6 | 29.9 | 48.0 | 19.4    | 23.1     | 26.4 | 22.9
  + GRPO               | 20.7 | 56.9 | 80.6 | 31.9    | 44.9     | 46.6 | 43.2
  + Knapsack-GRPO      | 20.8 | 66.0 | 81.0 | 35.7    | 46.2     | 45.5 | 45.1
Qwen2.5-Math-7B        | 12.3 | 41.0 | 61.2 | 11.8    | 26.1     | 22.0 | 26.7
  + GRPO               | 23.9 | 70.6 | 81.7 | 33.6    | 41.9     | 40.8 | 45.2
  + Knapsack-GRPO      | 24.3 | 77.4 | 83.9 | 34.5    | 44.1     | 43.8 | 47.5


Table 2 Performance comparison across models and benchmarks. Knapsack-GRPO consistently leads.


Understanding the Gains

Dynamic allocation: While the average budget remains \(N = 8\), hard tasks sometimes receive up to 93 rollouts.

Figure 5 Distribution of exploration budgets during training: most tasks receive small budgets, with a long tail of challenging tasks receiving up to N = 93.

Gradient efficiency boost: Effective gradient ratio improves by 20–40%.

Figure 6 Effective gradient ratio over training: Knapsack-GRPO (blue) maintains high efficiency compared with GRPO (orange).

Better learning dynamics: More transitions from hard to easy tasks.

Figure 7 Prompt transition matrices: under Knapsack-GRPO, a larger proportion of tasks reaches the “extremely-easy” category.

Figure 8 Final distribution of task difficulty: Knapsack-GRPO leaves fewer extremely-hard tasks and yields more mastered (extremely-easy) tasks.


The 2× Efficiency “Free Lunch”

Figure 9 Performance versus total exploration budget: Knapsack-GRPO at a budget of 2048 matches GRPO at 4096, i.e., comparable performance with half the compute.


Conclusion and Future Directions

The Knapsack RL framework delivers a clear message: optimize exploration allocation to match task difficulty, and you unlock both performance and efficiency — without extra compute.

Future opportunities:

  • Richer value functions: Model learning potential more precisely.
  • Multi-dimensional costs: Include text length or conversation depth.
  • Advanced exploration integration: Combine with techniques like tree search (AlphaGo).

By treating computation as a resource to optimize rather than just spend, Knapsack RL offers a sustainable path to training smarter, more capable language models.