If you’ve ever trained a deep learning model, you’ve almost certainly encountered the Adam optimizer. Since its introduction in 2014, it has become one of the most popular—and often the default—optimization algorithms for training neural networks. But what exactly is Adam? How does it work, and why is it so effective?

In this article, we’ll unpack the original paper that introduced it: “Adam: A Method for Stochastic Optimization” by Diederik P. Kingma and Jimmy Lei Ba. We’ll break down the core concepts, walk through the algorithm step-by-step, and explore the experiments that demonstrated its power. Whether you’re a student just starting in machine learning or a practitioner looking for a deeper understanding, this guide will demystify one of the most fundamental tools in the modern deep learning toolkit.


The Challenge of Optimization in Deep Learning

Training a machine learning model is fundamentally an optimization problem. We define a loss function that measures how poorly our model is performing, and our goal is to find the set of model parameters (weights and biases) that minimizes this loss.

The most common method to achieve this is gradient descent: calculate the gradient of the loss function with respect to the parameters, and take a small step in the opposite direction. For large datasets, computing the gradient over the entire dataset is computationally expensive, so we use Stochastic Gradient Descent (SGD), which estimates the gradient using random subsets of data called mini-batches. This speeds up training and often improves generalization.
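
To make that concrete, here is what a single SGD step looks like in plain NumPy. This is a minimal sketch, not tied to any framework; the parameter shapes and gradients are placeholders standing in for whatever backpropagation would produce:

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """One vanilla SGD update: step each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Placeholder parameters and gradients; in practice the gradients would come
# from backpropagation on a mini-batch.
params = [np.random.randn(784, 10), np.zeros(10)]
grads = [np.random.randn(784, 10), np.random.randn(10)]
params = sgd_step(params, grads, lr=0.01)
```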

However, vanilla SGD comes with challenges:

  1. Choosing the learning rate: Too small, and training drags; too large, and training diverges.
  2. Uniform learning rate for all parameters: Every parameter is updated with the same step size, which isn’t always optimal.
  3. Navigating complex loss landscapes: Deep networks have loss surfaces full of ravines, plateaus, and saddle points. Some parameters require aggressive updates, while others need fine-tuning.

Over the years, several enhancements to SGD emerged: Momentum, which accelerates traversal through ravines; AdaGrad, which tailors learning rates to individual parameters (well suited to sparse gradients); and RMSProp, which adapts learning rates dynamically for non-stationary problems.

Adam combines the strengths of these methods into one robust approach.


The Core Idea: Adaptive Moment Estimation

The name “Adam” comes from Adaptive Moment Estimation. Adam adapts learning rates for each parameter using the first and second moments of the gradients:

  • First moment — the mean of the gradients, estimated with an exponentially decaying average. This is similar to momentum.
  • Second moment — the uncentered variance of the gradients (average of squared gradients), also estimated with an exponentially decaying average.

These two statistics give Adam a per-parameter scaling for updates, blending acceleration toward minima with adaptivity in step size.


The Adam Algorithm Step-by-Step

At each timestep \(t\):

  1. Compute the gradient of the loss with respect to parameters \(\theta\):

    \[ g_t = \nabla_{\theta} f_t(\theta_{t-1}) \]
  2. Update the first moment estimate (momentum-like term):

    \[ m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t \]

    Here, \(\beta_1\) (commonly 0.9) is the decay rate for the moving average.

  3. Update the second moment estimate (variance-like term):

    \[ v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 \]

    This uses element-wise squares \(g_t^2\), with \(\beta_2\) typically 0.999.
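
In code, these bookkeeping steps amount to two exponential moving averages. Here is a minimal sketch in NumPy, with an illustrative stand-in gradient:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
g_t = np.array([0.1, -0.3, 0.05])   # stand-in gradient for one timestep

# Both moment estimates start at zero, with the same shape as the parameters.
m = np.zeros_like(g_t)   # first moment: decaying mean of gradients
v = np.zeros_like(g_t)   # second moment: decaying mean of squared gradients

m = beta1 * m + (1 - beta1) * g_t        # step 2: momentum-like average
v = beta2 * v + (1 - beta2) * g_t ** 2   # step 3: element-wise squares
```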


Bias Correction — The Secret Sauce

Both \(m_t\) and \(v_t\) are initialized at zero, which biases them toward zero early in training, especially when \(\beta_1\) and \(\beta_2\) are close to 1. The paper remedies this with bias-corrected estimates:

  1. Expand \(v_t\) to see its dependence on past gradients:

    \[ v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \]

  2. Take expectations to reveal the bias factor:

    \[ \mathbb{E}[v_t] = \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t) + \zeta \]

    Here \(\zeta\) is small when the gradient statistics are roughly stationary, and the factor \((1 - \beta_2^t)\) is the bias introduced by the zero initialization.

  3. Divide out the bias:

    \[ \widehat{m}_t \leftarrow \frac{m_t}{1 - \beta_1^t}, \quad \widehat{v}_t \leftarrow \frac{v_t}{1 - \beta_2^t} \]

This correction is critical when \( \beta_2 \) is close to 1, ensuring stable updates for sparse gradients.
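
To see why this matters, look at how small the correction factor \(1 - \beta_2^t\) is early in training for the default \(\beta_2 = 0.999\). A quick back-of-the-envelope check in plain Python:

```python
beta2 = 0.999
for t in (1, 10, 100, 1000):
    print(t, 1 - beta2 ** t)
# t = 1    -> 0.001   (v_t is ~1000x too small without correction)
# t = 10   -> ~0.010
# t = 100  -> ~0.095
# t = 1000 -> ~0.632
```

Dividing by this factor restores the estimate to the right scale from the very first step.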


Final Parameter Update

The parameters are updated as:

\[ \theta_t \leftarrow \theta_{t-1} - \alpha \cdot \frac{\widehat{m}_t}{\sqrt{\widehat{v}_t} + \epsilon} \]

Where:

  • \(\alpha\) is the learning rate (default 0.001).
  • \(\epsilon\) (small constant, e.g. \(10^{-8}\)) ensures numerical stability.

This step blends momentum (via \(\widehat{m}_t\)) with adaptive scaling (\(1/\sqrt{\widehat{v}_t}\)), giving each parameter its own effective learning rate.
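
Putting the whole algorithm together, here is a minimal, framework-free sketch of an Adam update in NumPy. The function and the toy quadratic problem below are illustrative choices, not the paper’s reference pseudocode:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to parameters `theta` given gradient `g`.

    `m` and `v` are the running moment estimates; `t` is the 1-based timestep.
    """
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize ||theta - target||^2, whose gradient is 2 * (theta - target).
# alpha is raised above the default so this tiny demo converges quickly.
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    g = 2 * (theta - target)
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.1)
print(theta)   # should end up close to the target
```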


Why Adam Excels

The authors highlight key properties:

  • Adaptive learning rates per parameter, using both mean and variance of gradients.
  • Stable step sizes bounded by \(\alpha\), simplifying learning rate tuning.
  • Good defaults (\(\alpha\)=0.001, \(\beta_1\)=0.9, \(\beta_2\)=0.999, \(\epsilon\)=\(10^{-8}\)) work well across diverse tasks (see the snippet after this list).
  • Robustness to noisy, non-stationary, and sparse gradients.
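
These defaults have aged remarkably well. For instance, PyTorch’s built-in Adam ships with the same values; the tiny model below is just a placeholder so there are parameters to optimize:

```python
import torch

model = torch.nn.Linear(784, 10)   # placeholder model, just to have parameters
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # alpha
    betas=(0.9, 0.999),  # beta_1, beta_2
    eps=1e-8,
)
# A typical step: compute the loss, call loss.backward(), then optimizer.step().
```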

Experimental Validation

Adam’s power is demonstrated across multiple tasks.

1. Logistic Regression (Convex)

Adam was compared against SGD with Nesterov momentum and AdaGrad:

Comparison of optimizers on logistic regression for MNIST and IMDB datasets.

  • On MNIST, Adam matches SGD with momentum.
  • On IMDB (sparse bag-of-words features), Adam and AdaGrad outperform SGD significantly.

This shows Adam inherits AdaGrad’s strength in handling sparsity.


2. Multi-Layer Neural Networks (Non-Convex)

Adam trained fully connected networks with dropout on MNIST:

Training a multi-layer neural network with dropout on MNIST. Adam (purple) shows fastest convergence.

Adam (purple) converges faster than AdaGrad, RMSProp, and AdaDelta. It also surpasses the SFO quasi-Newton method in both iteration efficiency and wall-clock time.


3. Convolutional Neural Networks (CIFAR-10)

For deep CNNs, weight sharing causes varying gradient scales across layers:

Training a CNN on CIFAR-10. Adam converges faster than AdaGrad in the long run.

Early training sees Adam and AdaGrad progress rapidly, but AdaGrad slows due to aggressive learning rate decay. Adam maintains momentum and converges faster, adapting learning rates per layer without manual tuning.


4. Bias Correction Ablation

Bias correction was put to the test on a Variational Autoencoder:

The effect of bias correction. With bias correction (red), training is more stable and achieves better loss than without (green), especially for β₂ close to 1.

Without bias correction (green), training destabilizes for high \(\beta_2\) values. With correction (red), performance is consistently better—validating its necessity.


Extension: AdaMax

The paper also proposes AdaMax, an \(L^\infty\) norm variant of Adam. Instead of tracking squared gradients, AdaMax tracks the maximum magnitude of past gradients:

\[ u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|) \]

The parameter update:

\[ \theta_t \leftarrow \theta_{t-1} - \frac{\alpha}{1 - \beta_1^t} \cdot \frac{m_t}{u_t} \]

Because updates are normalized by this infinity-norm term, AdaMax needs no \(\epsilon\) for numerical stability and admits a simpler bound on the magnitude of each step, \(|\Delta_t| \le \alpha\).
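
For completeness, here is a minimal sketch of an AdaMax step in the same NumPy style as before. The function name and signature are illustrative; the paper suggests a slightly larger default step size of \(\alpha = 0.002\) for AdaMax:

```python
import numpy as np

def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update: Adam with an infinity-norm-based second moment."""
    m = beta1 * m + (1 - beta1) * g        # same first moment as Adam
    u = np.maximum(beta2 * u, np.abs(g))   # exponentially weighted max of |g|
    # No epsilon and no bias correction for u; the 1 / (1 - beta1^t) factor
    # corrects the first moment and is folded into the step size.
    return theta - (alpha / (1 - beta1 ** t)) * m / u, m, u
```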


Conclusion and Legacy

Adam elegantly blends momentum and adaptive learning rates with bias correction, yielding an optimizer that is fast, stable, and broadly applicable.

Empirically, Adam consistently performs well—whether on simple convex problems like logistic regression, sparse data tasks, complex multi-layer nets, or deep CNNs. Its robustness to a variety of training challenges has made it a default choice for practitioners.

Today, Adam remains a benchmark in optimizer research. Understanding Adam is more than academic—it provides insight into the foundations of modern deep learning optimization.