Training a large neural network can sometimes feel like alchemy. Beyond designing the architecture itself, you face a tangle of hyperparameters—learning rates, regularization strengths, optimizer momentum, and dozens more. Finding the right balance is often a painful process of trial and error.
But what if we could teach our models to tune these hyperparameters automatically?
That’s the promise of meta-learning, a branch of machine learning focused on “learning to learn.” By treating hyperparameters as learnable “meta-parameters,” we can use gradient-based optimization to discover optimal settings rather than manually searching for them. The catch: most existing gradient-based meta-learning methods are computationally expensive. They rely on second-order derivatives, which demand huge amounts of memory and drastically slow training. This makes them impractical for modern, deep architectures.
A paper presented at NeurIPS 2021, “EvoGrad: Efficient Gradient-Based Meta-Learning and Hyperparameter Optimization”, introduces a strikingly simple idea. The authors—Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales—propose EvoGrad, a meta-learning method that borrows ideas from evolutionary optimization to avoid second-order derivatives entirely. The result? A faster, lighter, and more scalable approach to meta-learning.
In this post, we’ll unpack the core intuition behind EvoGrad, walk through its evolutionary-inspired update, and explore how it performs on a range of real-world machine learning challenges.
The Challenge: Meta-Learning as a Two-Level Optimization Game
Meta-learning problems can be framed as bilevel optimization: two interdependent optimization loops, one nested inside the other.
- Inner Loop — The standard model training. Given a set of hyperparameters \( \boldsymbol{\lambda} \), the model parameters \( \boldsymbol{\theta} \) are optimized to minimize training loss \( \ell_T \).
- Outer Loop — The meta-learning step. We optimize the hyperparameters themselves so that, after training completes, the resulting model achieves minimal validation loss \( \ell_V \).
We’re looking for the hyperparameters \( \boldsymbol{\lambda}^* \) that minimize validation loss after the model has finished learning.
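Written as a nested objective, the search is:
\[ \boldsymbol{\lambda}^* = \arg\min_{\boldsymbol{\lambda}} \ell_V\big(\boldsymbol{\lambda}, \boldsymbol{\theta}^*(\boldsymbol{\lambda})\big) \quad \text{s.t.} \quad \boldsymbol{\theta}^*(\boldsymbol{\lambda}) = \arg\min_{\boldsymbol{\theta}} \ell_T(\boldsymbol{\lambda}, \boldsymbol{\theta}) \]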

Figure 1: Meta-learning as a bilevel optimization: the outer loop (meta-parameters) oversees the inner training loop (model parameters).
To update these hyperparameters via gradient descent, we need the hypergradient—the gradient of validation loss with respect to \( \boldsymbol{\lambda} \):
\[ \frac{\partial \ell_V^*(\boldsymbol{\lambda})}{\partial \boldsymbol{\lambda}} = \frac{\partial \ell_V(\boldsymbol{\lambda}, \boldsymbol{\theta}^*(\boldsymbol{\lambda}))}{\partial \boldsymbol{\lambda}} + \frac{\partial \ell_V(\boldsymbol{\lambda}, \boldsymbol{\theta}^*(\boldsymbol{\lambda}))}{\partial \boldsymbol{\theta}^*(\boldsymbol{\lambda})} \frac{\partial \boldsymbol{\theta}^*(\boldsymbol{\lambda})}{\partial \boldsymbol{\lambda}} \]

Figure 2: The hypergradient expands through the model parameters—they mediate how hyperparameters affect validation loss.
The bottleneck lies in that last term, \( \frac{\partial \boldsymbol{\theta}^*(\boldsymbol{\lambda})}{\partial \boldsymbol{\lambda}} \). Since \( \boldsymbol{\theta}^* \) is the result of gradient-based training, computing this derivative involves differentiating through an optimization step. That requires second-order derivatives such as Hessians, which are computationally costly and demand extended computational graphs.
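To see concretely where the second-order terms come from, approximate the inner loop by a single gradient step of size \( \alpha \) (a shortcut in the spirit of T1–T2):
\[ \boldsymbol{\theta}^*(\boldsymbol{\lambda}) \approx \boldsymbol{\theta} - \alpha \frac{\partial \ell_T(\boldsymbol{\lambda}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \quad\Rightarrow\quad \frac{\partial \boldsymbol{\theta}^*(\boldsymbol{\lambda})}{\partial \boldsymbol{\lambda}} \approx -\alpha \frac{\partial^2 \ell_T(\boldsymbol{\lambda}, \boldsymbol{\theta})}{\partial \boldsymbol{\lambda}\, \partial \boldsymbol{\theta}} \]
Even this one-step view leaves a mixed second derivative of the training loss, which must be evaluated on an extended computation graph.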
Traditional methods—like the well-known T1–T2 algorithm—must compute these higher-order gradients at every update. The result is intense memory pressure that prevents scaling meta-learning to large models.
The EvoGrad Solution: An Evolutionary Inner Step
The authors of EvoGrad introduced a surprisingly effective twist: What if we estimate the hypergradient using an evolutionary process instead of differentiating through gradients?
Instead of calculating how the model’s parameters change due to gradient descent, EvoGrad imagines a hypothetical inner loop inspired by evolutionary optimization—no gradient computation required.

Figure 3: A single EvoGrad update. Randomly perturbed model copies compete to form a new, weighted average model used for hypergradient estimation.
Here’s how a single EvoGrad update unfolds:
1. Create a Population
Start from your current model parameters \( \boldsymbol{\theta} \). Generate a small population of \( K \) perturbed models:
\[ \boldsymbol{\theta}_k = \boldsymbol{\theta} + \boldsymbol{\epsilon}_k, \quad \boldsymbol{\epsilon}_k \sim \mathcal{N}(0, \sigma^2 I) \]
The authors typically use a population size of K = 2, which keeps computation minimal.
2. Evaluate Fitness
Each candidate model \( \boldsymbol{\theta}_k \) is evaluated on training data to compute a training loss \( \ell_k \). Lower loss means higher fitness.
3. Assign Weights
Convert these losses into normalized weights using a softmax over the negative losses:
\[ w_1, w_2, \dots, w_K = \text{softmax}([-\ell_1, -\ell_2, \dots, -\ell_K]/\tau) \]
where \( \tau \) is a temperature parameter controlling how sharply the weights differ.

Figure 4: Lower losses yield higher weights—the temperature parameter τ adjusts sensitivity.
4. Form the Hypothetical Update
Combine the candidate models into a single “offspring” by taking the weighted average:
\[ \boldsymbol{\theta}^* = \sum_{k=1}^{K} w_k \boldsymbol{\theta}_k \]
Figure 5: The new hypothetical parameter set \( \boldsymbol{\theta}^* \) combines perturbed models by their relative fitness.
5. Compute the Hypergradient
Finally, evaluate the validation loss \( \ell_V = f(\mathcal{D}_V | \boldsymbol{\theta}^*) \) and compute its gradient with respect to the hyperparameters:
\[ \frac{\partial \ell_V}{\partial \boldsymbol{\lambda}} \]
Figure 6: The EvoGrad hypergradient—computed on the validation loss evaluated at the evolutionary “offspring” \( \boldsymbol{\theta}^* \).
Here’s the key insight: Since the inner step used random perturbations—not gradients—computing \( \frac{\partial \ell_V}{\partial \boldsymbol{\lambda}} \) involves only first-order derivatives. We no longer need second-order terms or extended graphs. The real model training still uses standard gradient descent, but the hypergradient estimation is radically simplified.
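To make the whole update concrete, here is a minimal PyTorch-style sketch of one EvoGrad hypergradient step. It assumes the training loss depends differentiably on a meta-parameter tensor `hparams` (for example, a loss-shaping or re-weighting parameter); the helper names `train_loss_fn` and `val_loss_fn` are placeholders of ours, not the authors' reference implementation.

```python
import torch

def evograd_step(model, hparams, train_batch, val_batch,
                 train_loss_fn, val_loss_fn,
                 K=2, sigma=0.01, tau=0.05, meta_lr=1e-3):
    """One EvoGrad hypergradient step (illustrative sketch).

    Assumes:
      train_loss_fn(candidate_params, hparams, batch) -> scalar loss that
          depends differentiably on both the candidate parameters and hparams.
      val_loss_fn(candidate_params, batch) -> scalar validation loss.
      hparams is a tensor with requires_grad=True.
    """
    params = list(model.parameters())

    # 1. Create a small population of randomly perturbed copies of theta.
    #    Detaching keeps the graph short: we only need gradients w.r.t. hparams.
    population = [[p.detach() + sigma * torch.randn_like(p) for p in params]
                  for _ in range(K)]

    # 2. Evaluate each candidate's training loss (its "fitness").
    losses = torch.stack([train_loss_fn(cand, hparams, train_batch)
                          for cand in population])

    # 3. Lower loss -> higher weight, via a softmax over negative losses.
    weights = torch.softmax(-losses / tau, dim=0)

    # 4. Form the hypothetical "offspring" as the weighted average of the population.
    theta_star = [sum(weights[k] * population[k][i] for k in range(K))
                  for i in range(len(params))]

    # 5. Validation loss at theta_star; backprop gives a purely first-order
    #    hypergradient w.r.t. hparams (gradients flow through the weights).
    val_loss = val_loss_fn(theta_star, val_batch)
    hyper_grad = torch.autograd.grad(val_loss, hparams)[0]

    # Simple gradient-descent meta-update on the hyperparameters.
    with torch.no_grad():
        hparams -= meta_lr * hyper_grad
    return val_loss.item()
```

In practice this hypergradient step is interleaved with ordinary SGD updates of the model parameters themselves; the population is only a device for estimating the hypergradient, and real code would typically hand `hyper_grad` to an optimizer such as Adam rather than applying raw gradient descent.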
Why It Matters: Efficiency Unlocked
Comparing EvoGrad to the classic T1–T2 algorithm highlights its efficiency advantages.

Figure 7: EvoGrad’s hypergradient approximation sidesteps costly second-order computations required by T1–T2.

Figure 8: EvoGrad achieves comparable results at lower asymptotic complexity—both time and memory.
By eliminating second-order gradient computation:
- Memory usage drops since shorter computation graphs are used.
- Runtime improves because we skip expensive Hessian-like operations.
- Scalability increases, enabling meta-learning on larger models with standard GPUs.
Experimental Results: From Toy Problems to Real Applications
To verify the method, the authors tested EvoGrad on a diverse suite of tasks—from toy problems to computer vision and natural language processing.
1. Proof of Concept: 1D Hyperparameter Optimization
EvoGrad was first evaluated on a simple 1D function where the true hypergradient can be computed analytically.

Figure 9: EvoGrad tracks the true hypergradient trend even with noisy estimates; larger populations reduce variance.
Even with a small population of just two candidate models, EvoGrad reproduces the overall trend of the true gradient. Increasing the population size smooths the estimate further.
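To get a feel for this kind of sanity check, here is a hypothetical 1D toy problem of our own (not necessarily the function used in the paper): training loss \( (\theta - \lambda)^2 \) and validation loss \( (\theta - 1)^2 \), so the inner optimum is \( \theta^*(\lambda) = \lambda \) and the true hypergradient is \( 2(\lambda - 1) \). The EvoGrad estimate is noisy and its scale depends on \( \sigma \) and \( \tau \), but its sign and trend follow the analytic gradient.

```python
import torch

def true_hypergrad(lam):
    # Inner optimum is theta* = lam, so l_V = (lam - 1)^2 and d l_V / d lam = 2 (lam - 1).
    return 2.0 * (lam - 1.0)

def evograd_hypergrad(theta, lam, K=2, sigma=0.1, tau=0.05):
    lam = torch.tensor(lam, requires_grad=True)
    thetas = theta + sigma * torch.randn(K)        # perturbed population
    losses = (thetas - lam) ** 2                   # training losses depend on lam
    weights = torch.softmax(-losses / tau, dim=0)  # fitness weights
    theta_star = (weights * thetas).sum()          # weighted "offspring"
    val_loss = (theta_star - 1.0) ** 2             # validation loss at theta_star
    return torch.autograd.grad(val_loss, lam)[0].item()

lam, theta = 0.3, 0.3  # pretend the inner loop currently tracks lam
samples = [evograd_hypergrad(theta, lam) for _ in range(1000)]
print(f"true: {true_hypergrad(lam):+.3f}   EvoGrad (mean of 1000): {sum(samples)/len(samples):+.3f}")
```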
They also examined optimization trajectories:

Figure 10: Trajectories guided by EvoGrad’s estimated hypergradient converge similarly to those led by the true hypergradient.
Parameters trained using EvoGrad converge to the validation optimum, validating that its approximate hypergradient works effectively in practice.
2. Learning Rotations in MNIST
Next, EvoGrad was tested in a small vision task—learning to classify rotated MNIST digits. When the validation set is rotated by 30°, a regular CNN trained on upright images performs poorly. EvoGrad meta-learns the rotation angle so that the model generalizes correctly.

Figure 11: EvoGrad accurately learns the 30° rotation, matching baseline performance on unrotated data.
EvoGrad successfully discovers the correct rotation, raising test accuracy from 82% to nearly 98%—proving its capability to learn meaningful hyperparameters.
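How can a rotation angle receive a gradient at all? The trick is to apply the rotation differentiably. The snippet below is one illustrative way to do this in PyTorch (our own sketch; the paper's implementation may differ), using an affine sampling grid so that gradients can flow from the training losses of the perturbed models back into the angle.

```python
import torch
import torch.nn.functional as F

angle = torch.zeros(1, requires_grad=True)  # meta-parameter: rotation in radians

def rotate(images, angle):
    """Differentiably rotate an (N, C, H, W) batch by `angle`."""
    cos, sin = torch.cos(angle), torch.sin(angle)
    zero = torch.zeros_like(angle)
    theta = torch.stack([
        torch.cat([cos, -sin, zero]),
        torch.cat([sin,  cos, zero]),
    ]).unsqueeze(0).expand(images.size(0), -1, -1)
    grid = F.affine_grid(theta, list(images.size()), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)
```

Because the rotation is differentiable, the training losses of the perturbed model copies depend on `angle`, and EvoGrad's first-order hypergradient can nudge it toward the 30° that matches the validation set.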
Real-World Impact: Large-Scale Meta-Learning
With its effectiveness confirmed on simple cases, EvoGrad was deployed in three computationally demanding, real-world applications.
1. Cross-Domain Few-Shot Learning (LFT)
The Learned Feature-Wise Transformation (LFT) method regularizes few-shot learners to improve generalization under domain shift. The authors replaced LFT’s T1–T2 optimizer with EvoGrad.

Figure 12: EvoGrad matches the accuracy of second-order LFT across four unseen datasets.
EvoGrad not only matches accuracy but drastically reduces time and memory consumption.

Figure 13: EvoGrad’s memory and runtime improvements enable scaling from ResNet10 to ResNet34 on standard GPUs.
With EvoGrad’s efficiency, the team could scale from ResNet-10 to ResNet-34—an impossible leap using standard meta-learning methods limited by GPU memory. This improvement led to new state-of-the-art results on multiple few-shot benchmarks.
2. Learning from Noisy Labels (Meta-Weight-Net)
Training data is often imperfect; mislabeled examples can degrade model performance. Meta-Weight-Net (MWN) combats this by learning a meta-network that adjusts per-sample weights. Replacing its second-order optimizer with EvoGrad yields comparable or better results.
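Before the numbers, here is a rough sketch of the weighting-network idea (our own simplification, not the authors' exact architecture): a tiny MLP maps each sample's training loss to a weight, and the MLP's parameters are the meta-parameters that EvoGrad updates.

```python
import torch
import torch.nn as nn

# A tiny weighting network: per-sample loss -> weight in (0, 1).
# Its parameters play the role of EvoGrad's meta-parameters.
weight_net = nn.Sequential(
    nn.Linear(1, 100),
    nn.ReLU(),
    nn.Linear(100, 1),
    nn.Sigmoid(),
)

def weighted_training_loss(per_sample_losses):
    """per_sample_losses: shape (N,) losses from one (perturbed) model copy."""
    w = weight_net(per_sample_losses.unsqueeze(1)).squeeze(1)
    return (w * per_sample_losses).mean()
```

Each perturbed copy in the EvoGrad population is scored with this weighted loss, so the validation loss at the weighted-average offspring yields a first-order gradient for `weight_net`'s parameters, teaching it to down-weight samples that look mislabeled.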

Figure 14: EvoGrad matches or surpasses performance under noisy-label scenarios.
EvoGrad halves memory use and cuts runtime by about a third:

Figure 15: Efficiency gains of EvoGrad over T1–T2 in label-noise experiments.
3. Low-Resource Cross-Lingual Learning (MetaXL)
Finally, EvoGrad was tested in Natural Language Processing through MetaXL, which enhances cross-lingual representation transfer between resource-rich and low-resource languages.

Figure 16: EvoGrad achieves higher average F1 scores on low-resource language NER tasks.

Figure 17: EvoGrad accelerates training and reduces GPU memory usage significantly in MetaXL experiments.
Across all eight target languages, EvoGrad matches or surpasses accuracy while being faster and lighter—demonstrating its versatility beyond vision tasks.
Scaling Analysis: Efficiency Grows with Model Size
The researchers further investigated scalability using Meta-Weight-Net. As base models grew from 0.5M to 11M parameters, EvoGrad’s advantage widened.

Figure 18: EvoGrad scales gracefully—the efficiency gap increases with model size.
For large models, EvoGrad ran nearly three times faster and consumed less than half the memory. The bigger the model, the more pronounced the savings—a critical advantage for modern deep learning workloads.
Key Takeaways
EvoGrad introduces a refreshing approach to meta-learning:
- Efficiency: By replacing gradient-based inner-loop differentiation with simple evolutionary perturbations, EvoGrad eliminates second-order gradient computations, cutting memory and time costs dramatically.
- Performance: Despite using approximate hypergradients, EvoGrad consistently matches or exceeds accuracy achieved by traditional second-order methods.
- Scalability: Its lightweight nature enables meta-learning on much larger architectures—previously infeasible on standard GPUs.
The Bottom Line
EvoGrad bridges theory and practice in meta-learning. It shows that clever approximations inspired by evolutionary principles can retain accuracy while making large-scale meta-learning feasible.
With EvoGrad, tuning thousands of hyperparameters online no longer feels out of reach—even on deep architectures like ResNet-34 or transformer-based NLP models.
For the community, this method is more than an efficiency improvement—it’s a step toward democratizing meta-learning, bringing “learning to learn” approaches into everyday deep learning pipelines.
As the authors note, code is available on GitHub. EvoGrad opens the door to smarter, scalable hyperparameter optimization—one evolutionary step at a time.