Humans are remarkably fast learners. We can often grasp a new concept from just a handful of examples—show a child a picture of a zebra, and they can spot a zebra again even in a completely different scene. In contrast, deep learning models, despite their superhuman performance on many benchmarks, are notoriously data-hungry. They typically require thousands or even millions of examples to reach similar levels of accuracy. This gap reveals a fundamental challenge in artificial intelligence: how can we build models that learn quickly and efficiently from limited data?
This is where meta-learning, or “learning to learn,” comes in. Rather than training a single model on one dataset for one specific task, meta-learning trains a model on a distribution of tasks. The goal isn’t to master any one of them individually, but to learn how to learn them efficiently—developing internal representations that can be adapted rapidly when faced with a new, unseen task.
One of the most influential meta-learning algorithms is Model-Agnostic Meta-Learning (MAML). MAML seeks a parameter initialization that makes fine-tuning on new tasks fast and effective. However, the full MAML algorithm requires computing second-order derivatives—an expensive and technically complex process.
In their paper “On First-Order Meta-Learning Algorithms” (Nichol, Achiam, and Schulman, OpenAI, 2018), the authors explore a simpler and more efficient alternative: first-order methods that rely only on standard gradients. They analyze a streamlined variant called First-Order MAML (FOMAML) and introduce a new algorithm named Reptile. This blog post takes a deep dive into their work—showing how these first-order methods achieve remarkable results, and what they reveal about the nature of learning itself.
The Quest for a Good Starting Point: MAML
Before we jump into Reptile, let’s revisit the idea behind MAML—the foundation of several meta-learning methods.
The central objective of MAML is to find an initial set of parameters, \( \phi \), that enables rapid learning on any task sampled from a given distribution. Imagine a collection of tasks, such as image classification problems across different sets of classes. MAML optimizes for an initialization that minimizes the expected loss after a few gradient updates on each sampled task.

\[
\min_{\phi} \; \mathbb{E}_{\tau}\!\left[ L_{\tau}\!\left( U_{\tau}^{k}(\phi) \right) \right]
\]
The meta-learning objective: minimize the expected loss after \( k \) gradient updates \( U_{\tau}^{k} \) on tasks \( \tau \) sampled from a distribution.
The MAML process uses a clever setup for generalization. For each sampled task, the algorithm draws a small training set (A) and a test set (B). The network performs a few inner-loop gradient steps on A, then computes a meta-loss on B. In effect, every task contains a mini cross-validation experiment.

\[
\min_{\phi} \; \mathbb{E}_{\tau}\!\left[ L_{\tau,B}\!\left( U_{\tau,A}(\phi) \right) \right]
\]
MAML's design encourages strong generalization by splitting each task's data into a train subset (A) for the inner updates and a test subset (B) for the meta-loss.
To optimize this objective, MAML differentiates through the entire inner training process. The resulting meta-gradient contains second-order terms: the Jacobian of the update operator \( U_{\tau,A} \), which for SGD updates involves Hessians of the loss.

\[
g_{\text{MAML}} = \frac{\partial}{\partial \phi} L_{\tau,B}\!\left( U_{\tau,A}(\phi) \right) = U'_{\tau,A}(\phi)\, L'_{\tau,B}(\tilde{\phi}), \qquad \tilde{\phi} = U_{\tau,A}(\phi)
\]
Full MAML backpropagates through the update operator's Jacobian \( U'_{\tau,A}(\phi) \), relying on expensive higher-order derivatives that make it powerful but computationally heavy.
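To make this cost concrete, here is a minimal PyTorch sketch of the full MAML meta-gradient. It is an illustration, not the authors' code; `loss_fn(params, batch)` is an assumed helper that evaluates the model functionally with a given parameter list.

```python
import torch

def maml_meta_grad(phi, batch_A, batch_B, loss_fn, inner_lr, k):
    """Full MAML: differentiate through the inner-loop SGD steps.

    phi: list of leaf tensors with requires_grad=True (the initialization).
    loss_fn(params, batch): assumed helper returning a scalar loss.
    """
    params = phi
    for _ in range(k):
        grads = torch.autograd.grad(loss_fn(params, batch_A), params,
                                    create_graph=True)  # keep the graph: Hessians
        params = [p - inner_lr * g for p, g in zip(params, grads)]
    # Backpropagating the test loss on B through all k inner steps back to
    # phi is what drags in the Jacobian/Hessian terms.
    return torch.autograd.grad(loss_fn(params, batch_B), phi)
```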
First-Order Simplification: FOMAML
The FOMAML approximation simplifies this by ignoring second derivatives altogether, treating the Jacobian \( U'_{\tau,A}(\phi) \) as the identity matrix. In practice, this translates into four simple steps (sketched in code after the list):
- Start with the initialization \( \phi \).
- Apply \( k \) gradient steps on the training data A to get \( \tilde{\phi} = U_{\tau,A}(\phi) \).
- Compute the gradient of the test loss on B using \( \tilde{\phi} \).
- Use that gradient to update the initialization \( \phi \).
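Under the same assumed `loss_fn` helper as in the MAML sketch above, the first-order version simply detaches the inner updates, so no second derivatives are ever formed; again, this is a sketch, not the paper's implementation.

```python
import torch

def fomaml_meta_grad(phi, batch_A, batch_B, loss_fn, inner_lr, k):
    """FOMAML: the inner updates are treated as constants w.r.t. phi."""
    # Steps 1-2: k SGD steps on train batch A, re-leafing after each update
    # (this detach is exactly the "Jacobian = identity" approximation).
    params = [p.detach().clone().requires_grad_(True) for p in phi]
    for _ in range(k):
        grads = torch.autograd.grad(loss_fn(params, batch_A), params)
        params = [(p - inner_lr * g).detach().requires_grad_(True)
                  for p, g in zip(params, grads)]
    # Steps 3-4: the gradient of the test loss on B at the adapted weights
    # is applied directly to the initialization phi.
    return torch.autograd.grad(loss_fn(params, batch_B), params)
```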
Despite being an approximation, FOMAML was shown to perform nearly as well as full MAML on benchmarks such as Mini-ImageNet. This surprising result inspired further exploration into how far first-order methods can go.
Enter Reptile: Meta-Learning at Its Simplest
The star of the paper, Reptile, takes simplicity even further. It is a purely first-order meta-learning algorithm that looks deceptively like standard multi-task training—but contains a subtle trick that allows it to behave as a true meta-learner.
Here’s the entire Reptile algorithm:

Initialize \( \phi \), the initial parameter vector
for iteration = 1, 2, 3, … do
    Sample task \( \tau \), corresponding to loss \( L_{\tau} \)
    Compute \( \tilde{\phi} = U_{\tau}^{k}(\phi) \), denoting \( k \) steps of SGD or Adam
    Update \( \phi \leftarrow \phi + \epsilon\,(\tilde{\phi} - \phi) \)
end for

Reptile performs repeated inner optimization on sampled tasks and nudges the initialization toward the task-optimized weights.
Let’s unpack the steps:
- Initialize the model parameters \( \phi \).
- Repeat for each meta-iteration:
  - Sample a task \( \tau \).
  - Train on that task for \( k \) gradient steps using SGD or Adam, starting from \( \phi \). Denote the resulting weights as \( \tilde{\phi} \).
  - Update \( \phi \leftarrow \phi + \epsilon(\tilde{\phi} - \phi) \), where \( \epsilon \) is the meta step-size.
That’s all. Reptile doesn’t require train/test splits within each task—it directly moves \( \phi \) toward the parameters obtained by partially training on sampled tasks.
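As a concrete sketch, the whole loop fits in a few lines of PyTorch. Here `sample_task()` is an assumed helper returning a batch iterator and a loss criterion for one task, and the model is assumed to have no non-float buffers (a plain MLP, say):

```python
import copy
import torch

def reptile(model, sample_task, k=5, inner_lr=0.02, meta_lr=0.1, iters=1000):
    for _ in range(iters):
        batches, criterion = sample_task()        # assumed task sampler
        phi = copy.deepcopy(model.state_dict())   # snapshot of phi
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(k):                        # inner loop on this task
            x, y = next(batches)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
        phi_tilde = model.state_dict()            # task-adapted weights
        # Meta-update: phi <- phi + eps * (phi_tilde - phi)
        model.load_state_dict({name: phi[name] +
                               meta_lr * (phi_tilde[name] - phi[name])
                               for name in phi})
```

Note that the inner loop is ordinary training; only the final interpolation step distinguishes Reptile from fine-tuning on a stream of tasks.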
The method can also be parallelized by sampling multiple tasks per iteration and averaging their updates:

\[
\phi \leftarrow \phi + \epsilon\, \frac{1}{n} \sum_{i=1}^{n} \left( \tilde{\phi}_i - \phi \right)
\]
Batched Reptile updates the shared initialization by averaging the task-specific weight changes across \( n \) sampled tasks.
At first glance, Reptile resembles joint training, where a single model learns across all tasks simultaneously. Indeed, with only one inner update (\( k=1 \)), Reptile is exactly equivalent to joint training.

\[
\mathbb{E}_{\tau}\!\left[ g_{\text{Reptile}} \right] = \mathbb{E}_{\tau}\!\left[ \nabla_{\phi} L_{\tau}(\phi) \right] = \nabla_{\phi}\, \mathbb{E}_{\tau}\!\left[ L_{\tau}(\phi) \right] \qquad (k = 1)
\]
When only one inner step is used, the expected Reptile update is plain gradient descent on the expected loss, i.e., standard joint optimization across tasks.
However, when \( k > 1 \), things change dramatically. Multiple inner gradient steps capture the curvature of each task’s loss surface, introducing implicit higher-order information without explicitly computing it. This shadow of second-order behavior is what transforms Reptile into a true meta-learning algorithm.
A Visual Example: Few-Shot Sine Wave Regression
To build intuition, consider a simple case: regression on 1D sine waves. Each task corresponds to learning a sine wave with a random amplitude and phase. The model sees only 10 sample points and must reconstruct the whole curve.
This setup is perfect for meta-learning because the average of all possible sine waves—with random phases—is flat, \( f(x)=0 \). Traditional joint training would simply learn that trivial zero output.
Meta-learning, however, finds an initialization that starts near zero but encodes internal features enabling fast adaptation to any sine wave after just a few gradient steps.

Figure: After Reptile or MAML training, a network can reconstruct a sine wave accurately after only a few updates, demonstrating rapid adaptation.
Before meta-training, a random initialization fails even after 32 gradient steps. But a Reptile- or MAML-trained initialization allows near-perfect fits after the same number of updates. The network has learned to learn sine waves efficiently.
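This toy experiment is easy to reproduce. In the sketch below, the amplitude and phase ranges follow the common sine-wave benchmark convention (amplitude in [0.1, 5.0], phase in [0, π]) and should be treated as illustrative rather than the paper's exact setup:

```python
import numpy as np
import torch
import torch.nn as nn

def sample_sine_task(n_points=10, rng=np.random):
    """One task: n_points samples from a sine wave with random amplitude/phase."""
    amp, phase = rng.uniform(0.1, 5.0), rng.uniform(0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=(n_points, 1)).astype(np.float32)
    y = (amp * np.sin(x + phase)).astype(np.float32)
    return torch.from_numpy(x), torch.from_numpy(y)

def adapt(model, x, y, steps=32, lr=0.01):
    """Fine-tune a (meta-trained) model on one task's sample points."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model

# A small MLP suffices for this problem; after meta-training it with the
# Reptile loop above, adapt(net, *sample_sine_task()) recovers the curve.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
```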
Why Does Reptile Work? Two Complementary Views
1. The Hidden Second-Order Term
The first explanation comes from a Taylor series analysis. Even though Reptile uses only first-order gradients, its updates implicitly contain a second-order effect—similar to MAML’s explicit higher-order term.
The expected gradient updates across tasks can be decomposed into two terms:

\[
\text{AvgGrad} = \mathbb{E}_{\tau,1}\!\left[ \bar{g}_1 \right]
\]
AvgGrad: the gradient of the expected loss over tasks (the joint-training component), where \( \bar{g}_i \) denotes the gradient of mini-batch \( i \)'s loss evaluated at the initial point \( \phi \).
The second term is:

\[
\text{AvgGradInner} = \mathbb{E}_{\tau,1,2}\!\left[ \bar{H}_2\, \bar{g}_1 \right] = \tfrac{1}{2}\, \mathbb{E}_{\tau,1,2}\!\left[ \frac{\partial}{\partial \phi} \left( \bar{g}_1 \cdot \bar{g}_2 \right) \right]
\]
AvgGradInner: the meta-learning component, where \( \bar{H}_i \) is mini-batch \( i \)'s Hessian at \( \phi \); following it maximizes the expected inner product between gradients of different mini-batches within the same task.
The AvgGradInner term encourages within-task generalization. If gradients from different mini-batches of the same task point in similar directions, learning from one batch improves performance on others. Reptile, MAML, and FOMAML all include this term in their effective update dynamics.

\[
\begin{aligned}
\mathbb{E}\!\left[ g_{\text{MAML}} \right] &= \text{AvgGrad} - 2\alpha\, \text{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}\!\left[ g_{\text{FOMAML}} \right] &= \text{AvgGrad} - \alpha\, \text{AvgGradInner} + O(\alpha^2) \\
\mathbb{E}\!\left[ g_{\text{Reptile}} \right] &= 2\,\text{AvgGrad} - \alpha\, \text{AvgGradInner} + O(\alpha^2)
\end{aligned}
\]
For \( k = 2 \) inner steps with learning rate \( \alpha \), all three algorithms optimize a combination of the joint-training and within-task generalization objectives.
MAML places the greatest relative weight on the second-order meta-learning term (the ratio of the AvgGradInner to AvgGrad coefficients is highest for MAML, then FOMAML, then Reptile), but Reptile still contains the same effect implicitly, which explains its surprisingly strong performance.
2. Finding a Point Near All Solutions
The second explanation is geometric. Each task \( \tau \) has a solution manifold \( \mathcal{W}_{\tau} \), the set of all optimal parameter configurations for that task. Meta-learning can be viewed as trying to find an initialization \( \phi \) that is close, on average, to all manifolds.

Figure: Meta-learning seeks an initialization close to the solution manifolds of all tasks.
Formally, the goal is to minimize the expected squared distance to these manifolds:

\[
\min_{\phi} \; \mathbb{E}_{\tau}\!\left[ \tfrac{1}{2}\, D\!\left( \phi, \mathcal{W}_{\tau} \right)^{2} \right]
\]
Reptile indirectly minimizes the average squared Euclidean distance between the initialization and each task's solution manifold.
A gradient step on this objective moves \( \phi \) slightly toward its projection onto the manifold \( P_{\mathcal{W}_{\tau}}(\phi) \):

\[
\nabla_{\phi}\, \tfrac{1}{2} D\!\left( \phi, \mathcal{W}_{\tau} \right)^{2} = \phi - P_{\mathcal{W}_{\tau}}(\phi), \qquad \phi \leftarrow \phi + \epsilon \left( P_{\mathcal{W}_{\tau}}(\phi) - \phi \right)
\]
Reptile approximates this geometric update by replacing the true projection with partial optimization on each sampled task.
Since \( P_{\mathcal{W}_{\tau}}(\phi) \) can’t be computed exactly, Reptile approximates it by performing several SGD steps on the task to obtain \( \tilde{\phi} \), and then nudging \( \phi \) toward \( \tilde{\phi} \). Thus, Reptile performs stochastic gradient descent on an implicit distance-to-solution objective—a geometric interpretation of “learning to learn.”
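This view can be checked on a toy problem where the projection is available in closed form. In the sketch below (purely an illustration, not from the paper), each "task" manifold is a line \( \{w : a \cdot w = b\} \) in the plane:

```python
import numpy as np

rng = np.random.default_rng(0)
tasks = [(rng.normal(size=2), rng.normal()) for _ in range(5)]  # lines a.w = b

def project(phi, a, b):
    """Exact Euclidean projection of phi onto the line {w : a.w = b}."""
    return phi - ((a @ phi - b) / (a @ a)) * a

phi, eps = np.zeros(2), 0.1
for _ in range(5000):
    a, b = tasks[rng.integers(len(tasks))]
    phi += eps * (project(phi, a, b) - phi)  # Reptile-style nudge toward W_tau
print(phi)  # hovers near the minimizer of the average squared distance

# In real Reptile, the exact projection is replaced by a few SGD steps,
# but the update rule phi += eps * (phi_tilde - phi) is the same.
```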
Putting It to the Test: Experiments and Results
The authors evaluated Reptile, MAML, and FOMAML on two standard benchmarks for few-shot classification: Omniglot and Mini-ImageNet. Both are widely used to measure how well algorithms learn under limited data.
Few-Shot Classification Performance
All three algorithms perform strongly. Reptile matches or surpasses the others depending on the dataset.

Table: On Mini-ImageNet, Reptile slightly outperforms both MAML and FOMAML.

Table: On Omniglot, Reptile is slightly behind MAML but remains highly competitive.
All methods see performance boosts in the transductive setting (where all test samples for a task are processed simultaneously), indicating that shared batch normalization statistics help across the board.
Ablation Studies: What Drives Meta-Learning?
The paper also explores what happens when we vary the inner-loop gradient combinations. Each inner loop computes gradients \( g_1, g_2, g_3, g_4 \) from different mini-batches. Using their sum resembles Reptile; using only the last one resembles FOMAML.

Figure: Combining more inner-loop gradients improves few-shot learning performance.
The results show clear trends:
- Using only \( g_1 \) (joint training) performs poorly.
- Using only \( g_2 \) (FOMAML with two inner steps) performs noticeably better.
- Summing or averaging multiple gradients (Reptile-style) performs best as the number of inner steps increases.
Another study examines the overlap between mini-batches in the inner loop. If FOMAML's meta-gradient is computed on a final batch that overlaps with the earlier inner-loop batches (shared-tail FOMAML), performance degrades. Reptile and separate-tail FOMAML (which draws the final batch from fresh, disjoint data) remain stable.

Figure: Overlapping inner-loop batches harm shared-tail FOMAML but not Reptile, highlighting Reptile's robustness to how inner-loop batches are sampled.
Conclusion: The Subtle Power of Simplicity
“On First-Order Meta-Learning Algorithms” makes a compelling case for simple gradient-based meta-learning. By introducing Reptile, the authors show that an algorithm requiring only standard first-order updates can achieve state-of-the-art few-shot learning performance.
Two core insights explain its success:
- Taylor-Series View: Multiple gradient steps on each task implicitly introduce a MAML-like second-order effect that promotes within-task generalization.
- Geometric View: Reptile performs stochastic gradient descent on an objective that minimizes distance to each task’s solution manifold.
Perhaps the most profound implication lies beyond meta-learning: the analysis suggests that even ordinary stochastic gradient descent naturally encourages generalization between mini-batches. This helps explain why fine-tuning pre-trained networks, such as those trained on ImageNet, works so well. By repeatedly adapting to diverse mini-batches, SGD effectively masters the art of fast adaptation itself.
In the end, Reptile is far more than a clever algorithm. It’s a lens revealing the hidden meta-learning behavior inside the optimization methods we use every day—a reminder that sometimes, the simplest tricks in learning are the most powerful.