In machine learning, optimization is everywhere. Algorithms like Stochastic Gradient Descent (SGD), Adam, and RMSprop are the engines that power model training — from simple regressions to massive deep networks. We spend endless hours adjusting their hyperparameters — learning rates, momentum, decay factors — treating them as sophisticated but ultimately fixed tools.

But what if we applied machine learning’s own philosophy to these tools? What if, instead of hand-crafting optimizers, we could learn them?

That’s the question behind a fascinating research direction known as meta-learning, or “learning to learn.” The goal is to design algorithms that can improve their own learning process. In the paper “Learning to Optimize Neural Nets” (Li & Malik, 2017), researchers Ke Li and Jitendra Malik took this idea a step further: building on their earlier “Learning to Optimize” framework (2016), they tackled one of the most challenging problems in machine learning, training neural networks.

Their approach led to a learned optimizer that, after being trained on a single simple task (a small MNIST network), generalized to outperform well-known, hand-designed optimizers on completely different datasets, architectures, and noise levels. This is a powerful demonstration: the process of optimization itself can be learned.


Optimization as a Reinforcement Learning Problem

Li and Malik reframed optimization as a Reinforcement Learning (RL) task — a clever and intuitive insight.

Think of an optimizer as an intelligent agent navigating a landscape:

  • Agent: The optimization algorithm we seek to learn.
  • Environment: The loss landscape of the model being trained.
  • State (\(s_t\)): Information available at each step — current parameters \(x^{(t)}\), recent gradients \(\nabla \hat{f}(x^{(t)})\), and past updates.
  • Action (\(a_t\)): The step \(\Delta x\) taken to update parameters, producing \(x^{(t+1)} = x^{(t)} + \Delta x\).
  • Policy (\(\pi\)): A function mapping states to actions — the optimizer itself (\(\pi(a_t | s_t)\)).
  • Cost (\(c(s_t)\)): The objective value \(f(x^{(t)})\); lower values are better.

The RL goal is to find the optimal policy \(\pi^*\) that minimizes the total expected cost across the optimization trajectory — that is, learning how to take efficient steps toward lower loss values.

Formally, the objective is

\[
\pi^{*} = \arg\min_{\pi} \; \mathbb{E}\!\left[\sum_{t=1}^{T} c(s_t)\right],
\]

where the expectation is taken over trajectories \((s_1, a_1, s_2, a_2, \ldots, s_T)\). Each optimization run is one such trajectory, a sequence of states and actions whose probability is determined by the initial state, the policy, and the environment’s transition dynamics:

\[
p(s_1, a_1, \ldots, s_T) = p(s_1) \prod_{t=1}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).
\]

In this way, the RL framework describes optimization as sequential decision-making over iterations, governed jointly by the learned policy (the optimizer) and the underlying geometry of the loss surface.

Once optimization is framed as policy search, learning an optimizer becomes an RL problem. The policy, which is the optimizer itself, is modeled as a recurrent neural network (RNN), specifically an LSTM, since a good optimizer needs memory of past gradients and updates, much as momentum and Adam maintain running statistics.
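
To make the framing concrete, here is a minimal NumPy sketch of the agent-environment loop described above. Everything in it is illustrative: the `rollout` helper, the toy quadratic landscape, and the hand-written `sgd_policy` stand in for the paper’s learned LSTM policy.

```python
import numpy as np

def rollout(policy, loss_fn, grad_fn, x0, horizon=100):
    """One optimization 'episode': the policy (agent) proposes parameter
    updates (actions) and the loss landscape (environment) responds with
    new gradients and costs."""
    x, past_steps, total_cost = x0.copy(), [], 0.0
    for t in range(horizon):
        state = {"x": x, "grad": grad_fn(x), "past": past_steps[-3:]}  # s_t
        dx = policy(state)          # action a_t = Delta x
        x = x + dx                  # transition: x^(t+1) = x^(t) + Delta x
        past_steps.append(dx)
        total_cost += loss_fn(x)    # cost c(s_t): the objective value
    return x, total_cost

# A hand-written policy recovers plain gradient descent; the paper instead
# *learns* this mapping (with an LSTM) by minimizing the total cost of rollouts.
sgd_policy = lambda s: -0.1 * s["grad"]

loss_fn = lambda x: float(np.sum(x ** 2))   # toy quadratic "environment"
grad_fn = lambda x: 2.0 * x
x_final, cost = rollout(sgd_policy, loss_fn, grad_fn, np.ones(5))
```

The whole point of the paper is to replace the hand-coded `sgd_policy` with a policy whose behavior is learned by minimizing the accumulated cost over many such rollouts.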


The Challenge: High-Dimensional Optimization

While the RL formulation is elegant, its application to neural networks introduces a massive obstacle — high dimensionality. Even small neural networks have thousands or millions of parameters. The state space (parameter configurations) and action space (the parameter update vector) are huge.

Standard RL algorithms can’t handle this scale efficiently. For example, Guided Policy Search (GPS) — the method Li and Malik use — has computational cost that grows cubically with the size of the state space. For neural networks, this would be prohibitively expensive.

To overcome this, the authors introduced their main technical innovation: Convolutional Guided Policy Search (Convolutional GPS).


A Primer on Guided Policy Search (GPS)

Guided Policy Search is an RL technique designed for continuous, high-dimensional problems with complex non-linear policies, like our RNN-based optimizer. It cleverly combines two types of policies:

  1. Complex policy (\(\pi\)) – The expressive, non-linear policy we ultimately want (the learned optimizer).
  2. Guiding policy (\(\psi\)) – A simpler, time-varying linear-Gaussian policy that is easier to solve analytically.

In GPS, both policies interact in a loop:

  • The guiding policy generates trajectories given a locally linear model of dynamics.
  • The complex policy is trained to mimic the guiding policy using supervised learning.

The resulting constrained optimization problem minimizes the expected total cost under the guiding policy, subject to the guiding policy agreeing with the complex policy:

\[
\min_{\theta, \eta} \;\; \mathbb{E}_{\psi}\!\left[\sum_{t=1}^{T} c(s_t)\right]
\quad \text{subject to} \quad
\psi(a_t \mid s_t; \eta) = \pi(a_t \mid s_t; \theta) \quad \forall\, a_t, s_t, t.
\]

Because enforcing complete equality between \(\pi\) and \(\psi\) is intractable, the problem is relaxed to matching expected actions at each time step. This yields a tractable algorithm using Bregman ADMM, which alternates updates for:

  • \(\eta\): parameters of the guiding policy (\(\psi\))
  • \(\theta\): parameters of the complex policy (\(\pi\))
  • \(\lambda_t\): dual variables enforcing agreement between the policies

Concretely, each iteration updates the guiding policy analytically (using a locally linear-Gaussian model of the dynamics), then updates the complex policy, then adjusts the dual variables. For the complex (non-linear) policy, the subproblem reduces to supervised learning: a regression that matches the policy’s mean actions (step directions) to those of the guiding policy at the states it visits. This regression-based update lets GPS leverage standard deep learning tools inside an RL framework.

This combination transforms a difficult RL problem into a sequence of tractable optimization and supervised learning steps. However, it still struggles when dimensionality is massive — which is where Convolutional GPS comes in.
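
The structure of that loop can be summarized in a short Python sketch. It is only schematic: `fit_guiding`, `rollout_guiding`, and `fit_complex` are hypothetical callables standing in for the analytic trajectory optimizer, the rollout machinery, and the supervised regression step, and the dual update shown is a placeholder rather than the actual Bregman ADMM rule.

```python
def gps_alternation_sketch(fit_guiding, rollout_guiding, fit_complex, n_iters=10):
    """Schematic GPS loop (not the paper's exact BADMM updates).

    Each iteration:
      1. refit the simple, time-varying linear-Gaussian guiding policy psi
         analytically, using a locally linear model of the dynamics and the
         current agreement penalty (dual variables lam);
      2. train the complex policy pi by supervised regression, matching
         psi's mean actions at the states psi visits;
      3. adjust the dual variables that push pi and psi to agree in expectation.
    """
    pi, lam = None, 0.0
    for _ in range(n_iters):
        psi = fit_guiding(pi, lam)              # analytic (LQG-style) update of psi
        states, actions = rollout_guiding(psi)  # trajectories under the guiding policy
        pi = fit_complex(states, actions)       # supervised regression of pi onto psi
        lam = lam + 0.1                         # placeholder for the per-step dual update
    return pi
```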


Convolutional GPS: Exploiting Structure in Neural Networks

The key insight: neural network parameters are structured. They aren’t just a long vector of unrelated values.

For instance, the hidden units of a layer can be permuted, along with their incoming and outgoing weights, without changing the function the network computes, so no individual weight in a layer plays a privileged role. An optimizer should therefore be invariant to such permutations; that is, it should treat all weights in the same layer the same way.

Li and Malik imposed this idea formally through coordinate groups:

  • Each group corresponds to a set of parameters with equivalent roles (e.g., weights or biases of a layer).
  • The optimizer learns a shared update rule for all parameters within a group.

Convolutional GPS enforces this structure throughout:

  1. The local dynamics models used by GPS tie their parameters across all coordinates in a group.
  2. The guiding policy (\(\psi\)) uses the same parameters (gains and noise) for every coordinate in a group.
  3. The learned RNN policy (\(\pi\)) applies one shared update rule to every coordinate in a group, so coordinates with equivalent roles are treated identically even though each receives its own update based on its own state.

This structured sharing drastically reduces the dimensionality of the problem. Instead of learning millions of independent update rules, the optimizer learns only a few — one per coordinate group — making high-dimensional training tractable.
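
A small sketch can illustrate what “one update rule per coordinate group” means in practice. The group names, the toy layer shapes, and the hand-written rules below are all hypothetical; in the paper the per-group rules are the learned policy, not fixed formulas.

```python
import numpy as np

def grouped_update(params_by_group, grads_by_group, group_policies):
    """Apply one optimizer step with a shared update rule per coordinate group
    (e.g., one rule for a layer's weights, one for its biases). Each rule is
    applied coordinate-wise with shared parameters, so every coordinate still
    gets its own update, computed by the same rule. Illustrative sketch."""
    new_params = {}
    for name, w in params_by_group.items():
        rule = group_policies[name]                       # shared rule for this group
        new_params[name] = w + rule(w, grads_by_group[name])
    return new_params

# Two groups with their own (here: hand-written) shared rules; the paper
# learns these rules instead of fixing them.
group_policies = {
    "layer1.weights": lambda w, g: -0.05 * g,
    "layer1.biases":  lambda b, g: -0.10 * g,
}
params = {"layer1.weights": np.random.randn(48, 48),
          "layer1.biases":  np.zeros(48)}
grads = {k: np.ones_like(v) for k, v in params.items()}
params = grouped_update(params, grads, group_policies)
```

Because each rule is shared across its group, adding more units to a layer adds parameters to optimize but no new rules to learn.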


What the Optimizer Sees: Designing Informative Features

A smart optimizer needs rich information to make good decisions. Li and Malik carefully engineered features for both the training state and the observations that feed the learned policy.

Rather than rely solely on the raw current gradient, they computed summary statistics over the most recent few iterations, helping to smooth out the noise inherent in stochastic training.

Averaging recent objective values and gradients over the last three iterations yields summary statistics that are far more stable under mini-batch noise than any single measurement.

From these summaries, they constructed compound features that capture the dynamics of training:

  • Relative change in the objective function.
  • Gradients normalized by previous gradient magnitude (for scale invariance).
  • Ratios of past step magnitudes (to capture momentum-like behaviour).

Together, these features capture historical trends in the loss, gradients, and parameter updates, giving the optimizer a time-aware, multi-step view of the evolving loss landscape. That context is what allows it to make adaptive, context-sensitive updates rather than reacting to a single noisy gradient.
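
Below is a rough sketch of this style of feature computation, assuming histories of scalar losses, gradient vectors, and update vectors are available. The window size, normalizations, and exact ratios are illustrative; the paper’s precise feature set differs in its details.

```python
import numpy as np

def optimizer_features(loss_hist, grad_hist, step_hist, window=3, eps=1e-8):
    """Illustrative features in the spirit described above: averages over the
    last few iterations, relative change in the objective, gradients
    normalized by a recent gradient magnitude, and ratios of step sizes."""
    # Summary statistics over the most recent `window` iterations.
    avg_loss = np.mean(loss_hist[-window:])
    avg_grad = np.mean(grad_hist[-window:], axis=0)

    # Relative change in the objective between the two most recent windows.
    prev_avg_loss = (np.mean(loss_hist[-2 * window:-window])
                     if len(loss_hist) > window else avg_loss)
    rel_loss_change = (avg_loss - prev_avg_loss) / (abs(prev_avg_loss) + eps)

    # Gradient normalized by the previous gradient magnitude (scale invariance).
    prev_grad_norm = np.linalg.norm(grad_hist[-2]) if len(grad_hist) > 1 else 1.0
    normalized_grad = avg_grad / (prev_grad_norm + eps)

    # Ratio of recent step magnitudes (captures momentum-like behaviour).
    step_ratio = (np.linalg.norm(step_hist[-1]) /
                  (np.linalg.norm(step_hist[-2]) + eps)) if len(step_hist) > 1 else 1.0

    return rel_loss_change, normalized_grad, step_ratio
```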


Experiments: Testing the Learned Optimizer

The team trained their optimizer — dubbed Predicted Step Descent — on a single meta-training task: a small two-layer neural network (48 input units, 48 hidden units, 10 output units) on a reduced MNIST dataset. Meta-training used a horizon of 400 iterations and mini-batches of 64.
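
For reference, the base model being optimized during meta-training corresponds roughly to the following PyTorch sketch. The layer sizes come from the description above; the activation and any input preprocessing are assumptions, since they are not specified here.

```python
import torch.nn as nn

# Meta-training task: a small two-layer MNIST classifier with 48 inputs,
# 48 hidden units, and 10 outputs. The learned optimizer is trained by
# unrolling 400 update steps on this model with mini-batches of 64.
base_model = nn.Sequential(
    nn.Linear(48, 48),
    nn.ReLU(),          # assumption: the paper's choice of nonlinearity may differ
    nn.Linear(48, 10),
)
```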

Then came the real test: seeing if this optimizer generalizes to completely new tasks.

They compared Predicted Step Descent against seven standard optimizers (SGD, Momentum, Conjugate Gradient, L-BFGS, AdaGrad, Adam, RMSprop) and one learned optimizer from Andrychowicz et al. (2016), referred to as L2LBGDBGD.


1. Generalizing to New Datasets

First, they evaluated on new datasets — the Toronto Faces Dataset (TFD), CIFAR-10, and CIFAR-100 — using the same network architecture as training.

Figure 1: Performance on new datasets with the original network architecture. Predicted Step Descent, trained only on MNIST, consistently converges fastest on TFD (a), CIFAR-10 (b), and CIFAR-100 (c).

Although it was meta-trained on MNIST alone, Predicted Step Descent converged fastest and most stably on all three datasets, while the relative ranking of the hand-designed optimizers shifted from task to task. That is a remarkable degree of generalization.


2. Scaling to Larger Architectures

Next, they increased the network size eightfold (100 input units, 200 hidden units). Despite never being trained on such large networks, Predicted Step Descent outperformed all baselines.

Figure 2: Performance on a larger network architecture (8x more parameters). Predicted Step Descent again shows the best performance across all three datasets, demonstrating it can generalize to different model sizes.

Even with drastically more parameters, Predicted Step Descent maintains its edge, showcasing architectural generalization.

Although some oscillation occurred early on, the learned optimizer quickly adapted — an indication of dynamic correction behavior.


3. Robustness to Gradient Noise

To probe robustness, the researchers reduced the mini-batch size from 64 to 10, introducing significant noise into gradient estimates.

Figure 3: Performance on the original architecture with a smaller mini-batch size (more noise). The learned optimizer handles the increased stochasticity well, outperforming baselines that struggle or diverge.

Despite the noisier gradients, Predicted Step Descent still converges efficiently.

Figure 4: Performance on the larger architecture with a smaller mini-batch size. Even in this very challenging setting, Predicted Step Descent achieves a better final objective value than all other methods.

Predicted Step Descent remained stable and effective at noise levels that caused several of the traditional methods to struggle or diverge, showing strong resilience to extreme stochasticity.


4. Behavior Beyond Its Training Horizon

An intriguing final experiment ran the optimizer for 800 iterations, double the 400-iteration horizon used during meta-training. Would it “fall apart” once it stepped beyond the range it was trained on?

Figure 5: Performance when run for 800 iterations, double the training horizon. The learned optimizer continues to make steady progress, unlike an algorithm that has simply memorized a 400-step trajectory.

Predicted Step Descent kept making steady progress well beyond its training horizon, suggesting it learned generalizable optimization strategies rather than memorizing a fixed 400-step sequence of updates.


Conclusion: Toward Learned Optimization

Li and Malik’s work stands as a milestone in the quest for self-improving learning systems. They showed that:

  1. Optimization can be reframed as a reinforcement learning task.
  2. The curse of dimensionality can be tackled by leveraging structure in parameter spaces through Convolutional GPS.
  3. A learned optimizer trained on a single task can generalize across new datasets, architectures, and noise settings.

Predicted Step Descent’s success points toward a future where optimizers are not painstakingly handcrafted but instead learned directly from data — potentially surpassing our best existing methods.

While this study focused on shallow networks, its implications are far-reaching. It invites us to imagine a machine learning ecosystem where even the core tools — optimizers, architectures, hyperparameter schedules — evolve through learning, not engineering.

Can we learn better ways to learn? Li and Malik’s answer is an emphatic yes — and it may just transform how we build the next generation of intelligent systems.