Training massive deep learning models—especially Large Language Models (LLMs) with billions of parameters—is a monumental task. One of the biggest bottlenecks isn’t just computation, but the sheer memory required. A significant chunk of that memory is consumed not by the model weights themselves, but by the optimizer’s state—the extra data it needs to track in order to update the model effectively.

The Adam optimizer, a go-to choice for training LLMs, is particularly memory-hungry. It stores two additional values for every parameter in the model, effectively doubling the memory footprint compared to simpler methods like SGD. For a model like LLaMA-2 7B, this means over 25 GB of memory for the optimizer alone!

This has led to a race to create more memory-efficient optimizers. We’ve seen clever solutions like 8-bit Adam, which quantizes optimizer states, and GaLore, which applies low-rank gradient projections. While useful, these methods are mostly heuristics, often lacking rigorous mathematical guarantees of convergence. Sometimes that trade-off leads to notable drops in model accuracy.

This raises a critical question:
Can we have the best of both worlds—drastic memory reductions and strong convergence guarantees, without sacrificing performance?

A recent paper from ISTA and KAUST researchers, “MICROADAM: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence,” answers with a resounding yes. They introduce MicroAdam, a new optimizer that cleverly compresses gradient information to slash memory use—while, for the first time, offering solid theoretical convergence guarantees.


Background: The Cost of Adaptation and the Promise of Compression

To understand MicroAdam, we need to see why Adam is so memory-intensive—and where we might slim it down.

Adam maintains two moving averages for each parameter \( \theta \):

  1. First moment (\( m_t \)): an exponential moving average of gradients, acting like momentum to smooth and accelerate updates.
  2. Second moment (\( v_t \)): an exponential moving average of squared gradients, adapting the learning rate per parameter. Parameters with consistently large gradients get smaller effective steps; parameters with small or infrequent gradients get larger ones.
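For reference, the standard Adam recursions (with the usual bias correction) are:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]

Here \( g_t \) is the gradient at step \( t \), \( \eta \) the learning rate, and \( \beta_1, \beta_2, \epsilon \) the usual hyperparameters.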

Storing both \( m_t \) and \( v_t \) for all parameters is what makes Adam effective—but expensive. For a model with \(d\) parameters stored as 16-bit floats (so \(2d\) bytes of weights), the optimizer state alone takes:

  • float32 states: \( 8d \) bytes (\(m\) and \(v\) at 4 bytes each)
  • float16/bfloat16 states: \( 4d \) bytes (2 bytes each)
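
To make this concrete: with \( d \approx 6.7 \times 10^9 \) (roughly LLaMA-2 7B), 16-bit optimizer states alone already come to about \( 4d \approx 27 \) GB, which is where the ≈25 GB figure quoted above comes from once expressed in GiB (exact numbers depend on the precise parameter count).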

One natural idea: don’t store the full gradient.
Gradient compression—in particular, sparsification—keeps only a subset of “important” gradient entries. For example, a Top-K method picks the \(k\) entries with largest magnitude and zeroes the rest. In massive models, compression can be extreme: 99% sparsity means storing just 1% of the gradient.
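
As a concrete illustration, here is a minimal Top-K sparsifier sketch in PyTorch (not the paper’s implementation; the 1% density is just an example value):

```python
import torch

def topk_sparsify(grad: torch.Tensor, density: float = 0.01):
    """Keep only the largest-magnitude entries of `grad` (e.g. the top 1%).

    Returns the kept indices and values; all other entries are treated as zero.
    """
    flat = grad.flatten()
    k = max(1, int(density * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)   # positions of the k largest |g_i|
    return indices, flat[indices]

# Example: a dense gradient reduced to 1% of its entries.
g = torch.randn(1_000_000)
idx, vals = topk_sparsify(g)
print(idx.numel())   # 10000
```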

However, naive sparsification has a big problem: you systematically ignore small gradient values. This introduces bias, potentially leaving certain weights untouched or steering updates off-course.


Enter Error Feedback

Error Feedback (EF), a concept from distributed ML, elegantly addresses this: whatever gradient components you drop during compression are saved in an error buffer and added back to the gradient at the next step. Over time, no information is truly lost—just delayed.

For distributed training, EF works wonders. But in our single-machine scenario, standard EF has a frustrating paradox: the error buffer is the same size as the full gradient! This erases your memory savings.

MicroAdam’s breakthrough is simple but profound:
Compress the error buffer itself.


How MicroAdam Works

MicroAdam combines Top-K sparsification with error feedback—but crucially, compresses the error feedback using low-precision quantization. This allows massive memory savings while preserving convergence.

Algorithm pseudocode for MicroAdam and its helper functions. The left side shows the main loop, while the right side details the ADAMSTATS and Quantization procedures.

1. Correct, Then Compress

At step \( t \):

  1. Start with the gradient \( g_t \).
  2. Correct it using the dequantized error feedback \( e_t \):
\[ a_t = g_t + \text{dequantize}(e_t) \]
  3. Apply Top-K sparsification (e.g., top 1%):
\[ I_t, V_t = \text{TopK}(a_t) \]

Here \(I_t\) are indices and \(V_t\) are values of the largest-magnitude entries. This sparse vector updates Adam’s moment estimates.
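
Below is a rough sketch of this “correct, then compress” step, reusing the `topk_sparsify` helper from the earlier sketch; the dense `error` tensor stands in for the dequantized 4-bit buffer described in the next step:

```python
import torch

def correct_then_compress(grad: torch.Tensor, error: torch.Tensor, density: float = 0.01):
    """Sketch of MicroAdam's first step: fold the error buffer back into the
    gradient, then keep only the top-k entries of the corrected gradient.

    `error` is a dense float tensor here for clarity; MicroAdam stores it in
    4-bit form and dequantizes it on the fly.
    """
    a = (grad + error).flatten()            # a_t = g_t + dequantize(e_t)
    idx, vals = topk_sparsify(a, density)   # (I_t, V_t) = TopK(a_t)
    residual = a.clone()
    residual[idx] = 0.0                     # everything Top-K dropped...
    return idx, vals, residual              # ...becomes the new (raw) error
```

The sparse pair `(idx, vals)` feeds the moment estimates, while `residual` is what gets quantized into the new error buffer.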


2. Compress the Error

The residual after removing top-K entries:

\[ e_{t+1}^{\text{raw}} = a_t - \text{TopK}(a_t) \]

Instead of storing this dense vector, MicroAdam quantizes it to 4 bits per value:

\[ e_{t+1} = \text{quantize}(e_{t+1}^{\text{raw}}, \text{bits}=4) \]

This reduces error buffer size by a factor of 8 compared to float32—making EF’s cost negligible.
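
A minimal sketch of such a 4-bit quantizer is shown below. It uses a single per-tensor scale and deterministic rounding for readability; MicroAdam’s actual quantizer is unbiased (as the theory later requires) and more sophisticated, and a real implementation packs two 4-bit codes per byte:

```python
import torch

def quantize_4bit(x: torch.Tensor):
    """Map a float tensor to 4-bit integer levels in [-8, 7] with one shared scale.

    Deterministic rounding is used for simplicity; an unbiased quantizer would
    use stochastic rounding, and per-block scales are common in practice.
    """
    scale = x.abs().max().clamp(min=1e-12) / 7.0
    codes = torch.clamp(torch.round(x / scale), -8, 7).to(torch.int8)
    return codes, scale   # int8 is used here only as a container for 4-bit codes

def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.to(torch.float32) * scale
```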


3. Dynamic Statistics with a Sparse Window

Adam normally stores dense \( m_t \) and \( v_t \) for all \(d\) parameters. MicroAdam skips this:

  • Keep only a sliding window of recent sparse gradients \((I_t, V_t)\), size \( m \) (e.g., 10 steps).
  • Recompute \( m_t \) and \( v_t \) on the fly by unrolling the moment formulas over this sparse history (see the sketch after this list).
  • With extreme sparsity (99% zeros) and a custom CUDA implementation, recomputation is fast and memory-light.
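
A naive sketch of this on-the-fly recomputation is shown below. It materializes dense \(m\) and \(v\) purely for readability; the whole point of the fused CUDA kernel is that these dense buffers never need to exist in GPU memory:

```python
import torch

def moments_from_window(window, d: int, beta1: float = 0.9, beta2: float = 0.999):
    """Rebuild Adam-style moments from a sliding window of sparse gradients.

    `window` is a list of (indices, values) pairs, oldest first, as produced by
    the `topk_sparsify` sketch above. This version allocates dense m and v for
    clarity only; MicroAdam's kernel computes the update without storing them.
    """
    m = torch.zeros(d)
    v = torch.zeros(d)
    for idx, vals in window:            # replay the EMA recursions over the window
        m.mul_(beta1)
        v.mul_(beta2)
        m.index_add_(0, idx, (1.0 - beta1) * vals)
        v.index_add_(0, idx, (1.0 - beta2) * vals * vals)
    return m, v
```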

Error Feedback in Action

Optimization trajectories of Adam, TopK-Adam and TopK-Adam with EF applied on the Rosenbrock function. The standard Adam optimizer takes a smooth path. TopK-Adam without EF takes a jagged, inefficient path. TopK-Adam with EF recovers the smooth, efficient path of the original Adam.

In the classic Rosenbrock function minimization:

  • Adam (left): Smooth, direct trajectory to optimum.
  • Top-K Adam (center): Jagged path—without EF, information loss severely hurts convergence.
  • Top-K Adam + EF (right): Smooth path restored. EF accumulates dropped gradient components and reintroduces them, preserving trajectory.

MicroAdam applies this principle at massive scale—efficiently.


Memory Footprint

For LLaMA-2 7B (\(d\) parameters):

  • AdamW (bf16): \(4d\) bytes → 25.10 GB
  • AdamW-8bit: \(2d\) bytes → 12.55 GB
  • MicroAdam (m=10, 1% density): \(0.5d\) bytes (4-bit EF) + \(4mk\) bytes (sparse window, \(k = 0.01d\)) = \(0.9d\) bytes → 5.65 GB

That’s less than half the memory of 8-bit Adam—and over 4× lighter than standard AdamW.
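
These figures can be reproduced with a few lines of arithmetic, assuming roughly \(6.74 \times 10^9\) parameters for LLaMA-2 7B and GiB units (which appears to be the convention behind the table’s GB values):

```python
# Back-of-the-envelope check of the optimizer-state numbers above.
d = 6.74e9                 # approximate parameter count of LLaMA-2 7B (assumption)
GiB = 2**30

adamw_bf16 = 4 * d                           # two 16-bit states per parameter
adamw_8bit = 2 * d                           # two 8-bit states per parameter
m, density = 10, 0.01
microadam  = 0.5 * d + 4 * m * density * d   # 4-bit error buffer + sparse window

for name, nbytes in [("AdamW (bf16)", adamw_bf16),
                     ("AdamW-8bit",   adamw_8bit),
                     ("MicroAdam",    microadam)]:
    print(f"{name:14s} ~{nbytes / GiB:5.2f} GiB")
# Lands within rounding of the 25.10 / 12.55 / 5.65 GB quoted above.
```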


Convergence Guarantees

MicroAdam isn’t just practical—it’s theoretically sound.

Two compression operators matter here:

  1. Gradient Compressor \( \mathcal{C} \) — Top-K sparsifier, q-contractive:
\[ \|\mathcal{C}(x) - x\| \le q \|x\| \]

For Top-K, \(q\) is determined by the sparsity level: the more entries you keep, the smaller \(q\) becomes.

  2. Error Compressor \(Q\) — 4-bit quantizer, unbiased and bounded:
\[ \mathbb{E}[Q(x)] = x, \quad \|Q(x) - x\| \le \omega \|x\| \]

The key condition:

\[ (1 + \omega)q < 1 \]

This ensures compression doesn’t lose too much cumulative information.
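
As a purely illustrative example (the numbers are not from the paper): if the Top-K compressor satisfies \( q = 0.9 \) and the quantizer satisfies \( \omega = 0.05 \), then \( (1+\omega)q = 0.945 < 1 \) and the guarantee applies; with a much noisier quantizer, say \( \omega = 0.2 \), the product would be \( 1.08 > 1 \) and the condition would fail.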

Under this, the authors prove:

  • Theorem 1 (Non-convex): Convergence rate \( \mathcal{O}(1/\sqrt{T}) \), matching AMSGrad.

Equation showing the convergence rate for MicroAdam on non-convex functions.

  • Theorem 2 (PL condition): Convergence rate \( \mathcal{O}(\log T / T) \) for functions meeting the Polyak-Łojasiewicz property.

Equation showing the convergence rate for MicroAdam under the PL condition.

Asymptotic rates identical to uncompressed Adam—compression only affects constants.


Experiments: Real-World Proof

Fine-tuning BERT and OPT

BERT-Base (110M), BERT-Large (335M), OPT-1.3B — compared against Adam, Adam-8bit, CAME, and GaLore.

Table 1 shows finetuning results on GLUE/MNLI. MicroAdam achieves competitive or better accuracy and loss compared to baselines, with memory usage similar to other efficient optimizers.

MicroAdam matches or beats Adam-8bit’s accuracy, with similar memory use—significantly outperforming GaLore and CAME.

Loss curves:

Training loss curves for BERT-Base, BERT-Large, and OPT-1.3B. MicroAdam’s curve is consistently among the lowest.


Fine-tuning LLaMA-2 7B and 13B

GSM-8k Math Reasoning

Table 2 shows results for finetuning Llama-2 on the GSM-8k math reasoning dataset. MicroAdam (m=10) matches the accuracy of full Adam while fitting into a 40GB GPU.

MicroAdam enabled full fine-tuning of LLaMA-2 7B on a single 40 GB GPU:

  • Accuracy: 34.72% (MicroAdam m=10) vs. 34.50% (Adam)
  • Memory: fits comfortably on the 40 GB GPU, where full Adam did not.

Open-Platypus Instruction Tuning

Table 3 shows results on the Open-Platypus dataset. MicroAdam achieves the highest average accuracy while using the least memory among the compared optimizers (AdamW, Adam-8bit, and MicroAdam).

Highest average accuracy across tasks, lowest memory use among Adam variants.


Pre-training on ImageNet

ResNet-18 and ResNet-50, trained from scratch.

Table 4 shows pre-training results for ResNets on ImageNet. MicroAdam achieves the highest validation accuracy for both ResNet-18 and ResNet-50, surpassing even the highly-tuned SGD baseline.

MicroAdam achieved the top validation accuracies—even surpassing the tuned SGD baseline—while using the least optimizer state memory.

Training curves:

Training and validation curves for ResNet-18 and ResNet-50 on ImageNet. MicroAdam (blue) consistently achieves high validation accuracy.

The authors hypothesize an implicit regularization effect: sparse updates (about 10% of weights per step) reduce overfitting.


Conclusion & Implications

MicroAdam is more than just another optimizer. It’s a principled, memory-efficient design with provable convergence—bridging the longstanding gap between efficiency and rigor.

Key takeaways:

  1. Compressed Error Feedback — The EF buffer itself is quantized to 4 bits, slashing its memory footprint to almost nothing.
  2. Massive Memory Savings — Less than half the memory of Adam-8bit; >4× lighter than standard AdamW.
  3. Strong Theory — Converges at the same asymptotic rate as uncompressed Adam; compression affects only constants.
  4. Outstanding Practical Results — Matches or exceeds Adam on LLM fine-tuning; beats SGD on CV pre-training in validation accuracy.

By proving that error feedback remains effective—even heavily compressed—MicroAdam paves the way for future optimizers that are both memory-lean and mathematically sound.
For practitioners, this could mean fully fine-tuning state-of-the-art models on accessible hardware, lowering costs and opening cutting-edge AI research to a wider community.