Fine-tuning large language models (LLMs) is a critical step in making them useful for specific, real-world tasks. After a model is pre-trained on a vast corpus of text, fine-tuning adapts it to follow instructions, align with human preferences, or master specialized domains like coding, medicine, or scientific reasoning.

For years, the undisputed champion of this process has been Reinforcement Learning (RL), particularly Reinforcement Learning from Human Feedback (RLHF), which powered landmark systems like ChatGPT.

But RL isn’t a perfect solution. It often struggles with:

  • Low sample efficiency, meaning it needs huge amounts of training data to improve.
  • Instability across runs, with inconsistent performance even under identical setups.
  • A tendency toward reward hacking—gaming its reward system to score highly without truly solving the problem.

These challenges make fine-tuning costly, brittle, and sometimes frustrating.

What if there were another way?
A new paper—“Evolution Strategies at Scale: LLM Fine-tuning Beyond Reinforcement Learning”—revives an old idea from optimization and proves it can work wonders for today’s LLMs. The authors show that Evolution Strategies (ES), once dismissed as too simple and inefficient for billion-parameter models, can now match or surpass RL in accuracy, stability, and efficiency. This discovery challenges long-standing assumptions and opens a bold new path for LLM optimization.


RL vs. ES: The Fundamental Difference

Before diving into the method, let’s clarify how RL and ES differ:

Reinforcement Learning (RL):
RL treats the LLM as an agent exploring an action space; in language, these “actions” are token choices. The model generates tokens one by one, receives a reward at the end for the whole output, and must then figure out which token choices led to success. This credit assignment problem becomes harder when rewards arrive only at the end (“long-horizon rewards”).

Evolution Strategies (ES):
ES skips the action space entirely and works directly in parameter space: the billions of weights inside the model. In each iteration:

  1. Start with a base model (“parent”).
  2. Perturb: Make a “population” of slightly different models by adding small Gaussian noise to the parent’s parameters.
  3. Evaluate each perturbed model on the task and assign a “fitness” reward.
  4. Update the parent by averaging the noise vectors, weighted by reward—nudging it toward good solutions.
  5. Repeat until convergence.

For decades, many believed ES couldn’t efficiently explore the astronomical dimensions of LLM parameter space—it seemed like “searching for a needle in a cosmic haystack.” This paper proves otherwise.


Scaling ES for Billion-Parameter LLMs

The authors’ main contribution is a memory-efficient, parallelizable ES tailored for fine-tuning huge models.

The Basic ES Loop

Given parameters \(\theta_{t-1}\), we:

  • Sample \(N\) Gaussian noise vectors \(\varepsilon_n\)
  • Compute rewards \(R_n\) for each perturbed model
  • Update: \[ \theta_t \leftarrow \theta_{t-1} + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n\, \varepsilon_n \] Here, \(\alpha\) is the learning rate.

Algorithm 1 shows the basic structure of an Evolution Strategies loop.
Figure: Algorithm 1: A high-level view of the ES loop—perturb, evaluate, and update.
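For intuition, here is a minimal NumPy sketch of this loop on a toy problem. It illustrates the generic ES update above, not the paper’s implementation; `reward_fn`, the population size, and the step sizes are placeholder choices.

```python
import numpy as np

def es_finetune(theta, reward_fn, iterations=1000, pop_size=30,
                sigma=0.01, alpha=0.02):
    """Minimal Evolution Strategies loop (illustrative sketch)."""
    for _ in range(iterations):
        # Perturb: one Gaussian noise vector per population member.
        eps = np.random.randn(pop_size, theta.size)

        # Evaluate: score each perturbed copy of the parent.
        rewards = np.array([reward_fn(theta + sigma * e) for e in eps])

        # Normalize rewards within the generation (z-scores).
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Update: nudge the parent toward reward-weighted noise.
        theta = theta + (alpha / pop_size) * rewards @ eps
    return theta

# Toy usage: a simple concave reward whose optimum is the all-ones vector.
theta0 = np.zeros(16)
theta_star = es_finetune(theta0, lambda th: -np.sum((th - 1.0) ** 2))
```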

Making It Feasible at Scale

Storing and running \(N\) full copies of a 7B-parameter model at once is infeasible on typical GPU hardware. The authors made it practical with several innovations:

Algorithm 2 details the memory-efficient and parallelized implementation of ES for LLM fine-tuning.
Figure: Algorithm 2: ES fine-tuning optimized for memory and parallel compute.

  1. Noise Seeds:
    Instead of storing massive \(\varepsilon_n\) vectors, store only their random seeds. This allows on-demand, exact reproduction of the noise later.

  2. Parallelized Evaluation:
    Perturbed models are independent—perfect for distributing across GPUs or cluster nodes.

  3. Layer-by-Layer, In-Place Perturbation:
    Noise is added to the parameters one layer at a time, so only a single layer’s noise tensor is ever materialized. After the forward pass and reward are recorded, the same noise is regenerated and subtracted to restore the original weights, all in place. This dramatically reduces memory overhead.

  4. Deterministic Decoding:
    Use greedy decoding during evaluation to ensure differences come solely from parameter changes, not token sampling randomness.

  5. Reward Normalization:
    Convert rewards to z-scores within each generation to maintain consistent scaling.

These engineering tricks make billion-parameter ES fine-tuning not only possible but efficient.
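A rough PyTorch-style sketch of how tricks 1, 3, 4, and 5 could fit together is shown below. This is a simplified illustration under stated assumptions, not the paper’s code: `evaluate_greedy` stands in for a greedy-decoding evaluation routine, the hyperparameters are placeholders, and distribution of candidates across GPUs is left out.

```python
import torch

@torch.no_grad()
def noise_for(param, gen, sigma):
    """Regenerate one layer's Gaussian noise from a seeded CPU generator."""
    noise = torch.randn(param.shape, generator=gen) * sigma
    return noise.to(device=param.device, dtype=param.dtype)

@torch.no_grad()
def apply_noise(model, seed, sigma, sign=1):
    """Add (sign=+1) or remove (sign=-1) a seed's noise, layer by layer,
    so only one layer's noise tensor exists in memory at a time."""
    gen = torch.Generator().manual_seed(seed)
    for param in model.parameters():
        param.add_(sign * noise_for(param, gen, sigma))

@torch.no_grad()
def es_generation(model, seeds, evaluate_greedy, sigma=1e-3, alpha=5e-4):
    """One ES generation: perturb, evaluate with greedy decoding, restore, update."""
    rewards = []
    for seed in seeds:                            # independent -> parallelizable
        apply_noise(model, seed, sigma, sign=+1)
        rewards.append(evaluate_greedy(model))    # deterministic decoding
        apply_noise(model, seed, sigma, sign=-1)  # restore the parent

    # Reward normalization: z-scores within the generation.
    r = torch.tensor(rewards, dtype=torch.float32)
    r = (r - r.mean()) / (r.std() + 1e-8)

    # Update: reward-weighted combination of the regenerated noise vectors.
    gens = [torch.Generator().manual_seed(s) for s in seeds]
    for param in model.parameters():
        for rn, gen in zip(r, gens):
            param.add_((alpha / len(seeds)) * rn.item() * noise_for(param, gen, sigma))
```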


Experiment 1: Countdown Reasoning Task

The Countdown task challenges a model to form an arithmetic expression from given numbers to exactly reach a target.
Example: With \(\{100, 50, 6, 3\}\), produce 950. One solution:

\[ 100 \times (6+3) + 50 = 950 \]

This is a long-horizon, sparse-reward task: the model is rewarded only if the entire expression is correct, exactly the setting where RL’s credit assignment struggles.
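To make the sparse reward concrete, a Countdown-style verifier might look roughly like this (a hypothetical sketch, not the paper’s evaluation code; the expression filtering and number-usage rules are simplified assumptions):

```python
import re

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Sparse reward: 1.0 only if `expression` uses the given numbers and hits `target`."""
    try:
        # Allow only digits, whitespace, parentheses, and + - * / operators.
        if not re.fullmatch(r"[\d\s()+\-*/]+", expression):
            return 0.0
        # Each given number must be used exactly once (simplified multiset check).
        if sorted(int(tok) for tok in re.findall(r"\d+", expression)) != sorted(numbers):
            return 0.0
        value = eval(expression, {"__builtins__": {}}, {})
        return 1.0 if abs(value - target) < 1e-9 else 0.0
    except Exception:
        return 0.0

# The example above scores 1.0; anything malformed or off-target scores 0.0.
print(countdown_reward("100 * (6 + 3) + 50", [100, 50, 6, 3], 950))  # 1.0
print(countdown_reward("100 * 6 + 3 + 50", [100, 50, 6, 3], 950))    # 0.0
```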

Table 1 shows accuracy of models on the Countdown task after fine-tuning with PPO, GRPO, and ES. ES consistently achieves highest accuracy across all sizes.
Figure: Table 1: Accuracy (%) on Countdown across Qwen & LLaMA models—ES wins every time.

Key Findings

  • ES Outperforms RL Everywhere:
    From smallest (Qwen-0.5B) to largest (LLaMA-8B), ES produced higher accuracy.

  • Small Models Benefit Too:
    PPO/GRPO barely improved Qwen-0.5B (0.3% accuracy). ES boosted it to 14.4%, unlocking reasoning even in weak bases.

  • Sample Efficiency:
    Despite exploring billions of parameters, ES hit higher accuracy with fewer total samples.

  • Small Population Suffices:
    Earlier ES work on million-parameter networks relied on populations of 10,000+ candidates. Here, a population of just N=30 is enough for billions of parameters.

Figure 6 displays the training curves for ES and RL methods. ES rises faster, reaching higher accuracy in all cases.
Figure: Training curves—ES climbs faster and higher.
Figure 5 shows percentage improvement over base models; ES bars are tallest in every case.
Figure: Relative improvements—ES leads all models consistently.


Experiment 2: Fine-Tuning for Conciseness

The second test targeted behavioral tuning: making answers shorter. The reward was based solely on response length, with no correctness check, a setup that is ripe for reward hacking.

Performance metrics:

  • Conciseness reward
  • KL divergence from base model (low KL means base capabilities are preserved)
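As a rough sketch of how these two metrics could be computed per response (illustrative assumptions only: `conciseness_reward` uses a simple length cap that may differ from the paper’s exact shaping, and the KL is a mean per-token estimate from the two models’ logits over the same token sequence):

```python
import torch
import torch.nn.functional as F

def conciseness_reward(response_tokens: list[int], max_len: int = 256) -> float:
    """Length-only reward (assumed shaping): shorter responses score higher."""
    return 1.0 - min(len(response_tokens), max_len) / max_len

def kl_from_base(tuned_logits: torch.Tensor, base_logits: torch.Tensor) -> float:
    """Mean per-token KL(tuned || base) over one response.

    Both tensors have shape (seq_len, vocab_size) and come from scoring the same
    token sequence with the fine-tuned model and the frozen base model.
    """
    tuned_logp = F.log_softmax(tuned_logits, dim=-1)
    base_logp = F.log_softmax(base_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input), so base goes first.
    kl = F.kl_div(base_logp, tuned_logp, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean().item()
```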

Results are plotted on a Pareto front (reward vs. KL trade-offs):

Figure 1 shows Pareto fronts for ES (blue) vs GRPO (black). ES achieves higher reward at lower KL.
Figure: ES Pareto front dominates—better reward/KL trade-offs.

Observations:

  • Better trade-offs: ES front beats GRPO’s across the board.
  • No reward hacking: Without a KL penalty, GRPO often produced nonsense to game the length reward; ES never did, even without explicit constraints.
  • Higher reliability: Across runs, ES had far lower variance in both reward and KL.

Table 2 shows mean & std deviation for conciseness reward and KL across runs. ES is far more stable; GRPO shows reward hacking (*).
Figure: Run-to-run stability—ES is consistently better.


Why Does ES Shine? The Jagged Landscape Hypothesis

The authors attribute ES’s success to how it explores:

  • RL:
    Noise is injected at every token (action-level sampling), so gradient estimates have high variance, credit assignment is messy, and the signal-to-noise ratio degrades for long responses. This also leaves RL susceptible to reward hacking.

  • ES:
    Noise is injected once per candidate model and rollouts are deterministic, yielding low-variance estimates. Because ES optimizes a distribution over solutions rather than individual sampled trajectories, it is harder to exploit through reward hacking.

Jagged Reward Landscapes:
The mapping from parameters to reward is often wildly irregular for large LLMs, so gradient-based updates can get trapped on tiny local spikes. ES’s Gaussian parameter noise effectively averages over a neighborhood of each candidate, smoothing the landscape and helping the search reach genuinely high peaks.
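One way to make this intuition precise (a standard ES identity, stated in the notation of the update rule above rather than taken from the paper): ES does not climb \(R(\theta)\) directly but the Gaussian-smoothed objective

\[
J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,\,\sigma^2 I)}\big[R(\theta + \varepsilon)\big],
\qquad
\nabla_\theta J(\theta) = \frac{1}{\sigma^2}\,\mathbb{E}_{\varepsilon}\big[R(\theta + \varepsilon)\,\varepsilon\big],
\]

so the population average \(\frac{1}{N}\sum_n R_n\,\varepsilon_n\) in the update is, up to scale, a Monte Carlo estimate of the gradient of this smoothed landscape, whose narrow local spikes have already been averaged away.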


Implications: Rethinking LLM Fine-Tuning

This work shows ES isn’t a relic—it’s a potent alternative to RL, with tangible advantages:

  • Accuracy & Efficiency:
    ES beats RL on challenging, sparse-reward reasoning tasks.

  • Safety & Stability:
    Less hacking, more consistency—critical for production fine-tuning.

  • Parameter-Space Exploration is Back:
    This revives a once-abandoned avenue, potentially transforming large-scale post-training.

By leveraging evolution’s simple yet powerful principles, the authors have demonstrated that even billion-parameter, cutting-edge LLMs can be optimized through ES—efficiently, robustly, and safely.

This paper doesn’t just introduce a method—it invites the field to reconsider how we train AI at scale. Evolution, it turns out, still has a few tricks up its sleeve.