Training massive deep learning models—especially Large Language Models (LLMs) with billions of parameters—is a monumental task. One of the biggest bottlenecks isn’t just computation, but the sheer memory required. A significant chunk of that memory is consumed not by the model weights themselves, but by the optimizer’s state—the extra data it needs to track in order to update the model effectively.
The Adam optimizer, a go-to choice for training LLMs, is particularly memory-hungry. It stores two additional values for every parameter in the model, effectively doubling the memory footprint compared to simpler methods like SGD. For a model like LLaMA-2 7B, this means over 25 GB of memory for the optimizer alone!
This has led to a race to create more memory-efficient optimizers. We’ve seen clever solutions like 8-bit Adam, which quantizes optimizer states, and GaLore, which applies low-rank gradient projections. While useful, these methods are mostly heuristics, often lacking rigorous mathematical guarantees of convergence. Sometimes that trade-off leads to notable drops in model accuracy.
This raises a critical question:
Can we have the best of both worlds—drastic memory reductions and strong convergence guarantees, without sacrificing performance?
A recent paper from ISTA and KAUST researchers, “MICROADAM: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence,” answers with a resounding yes. They introduce MicroAdam, a new optimizer that cleverly compresses gradient information to slash memory use—while, for the first time, offering solid theoretical convergence guarantees.
Background: The Cost of Adaptation and the Promise of Compression
To understand MicroAdam, we need to see why Adam is so memory-intensive—and where we might slim it down.
Adam maintains two moving averages for each parameter \( \theta \):
- First moment (\( m_t \)): an exponential moving average of gradients, acting like momentum to smooth and accelerate updates.
- Second moment (\( v_t \)): an exponential moving average of squared gradients, used to adapt the learning rate per parameter. Parameters with consistently large gradients take smaller effective steps; parameters with small or infrequent gradient updates take larger ones.
Storing both \( m_t \) and \( v_t \) for every parameter is what makes Adam effective, but also expensive. For a model with \(d\) parameters (the weights themselves are typically kept in 16-bit floats, i.e. \(2d\) bytes), Adam's optimizer state adds:
- float32 state: \( 8d \) bytes (\(m\) and \(v\) at 4 bytes each)
- float16/bf16 state: \( 4d \) bytes (2 bytes each)
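To make that cost concrete, here is a minimal NumPy sketch of the textbook Adam update; the two float32 buffers `m` and `v` are exactly the state discussed above (8 bytes per parameter). This is generic Adam for illustration, not the paper's code.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step (in-place). m and v are float32 arrays with the
    same shape as theta, so together they add 8 bytes of state per parameter."""
    m[:] = beta1 * m + (1 - beta1) * grad         # first moment: smoothed gradient
    v[:] = beta2 * v + (1 - beta2) * grad**2      # second moment: smoothed squared gradient
    m_hat = m / (1 - beta1**t)                    # bias corrections
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta

d = 1_000                                         # toy parameter count
theta = np.random.randn(d).astype(np.float32)
m = np.zeros(d, dtype=np.float32)                 # +4d bytes
v = np.zeros(d, dtype=np.float32)                 # +4d bytes -> 8d bytes of optimizer state
theta = adam_step(theta, np.random.randn(d).astype(np.float32), m, v, t=1)
```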
One natural idea: don’t store the full gradient.
Gradient compression—in particular, sparsification—keeps only a subset of “important” gradient entries. For example, a Top-K method picks the \(k\) entries with largest magnitude and zeroes the rest. In massive models, compression can be extreme: 99% sparsity means storing just 1% of the gradient.
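As a point of reference, a Top-K sparsifier takes only a few lines. The sketch below (plain NumPy, purely illustrative) keeps the \(k\) largest-magnitude entries as an (indices, values) pair, which is the compact form MicroAdam works with.

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep the k largest-magnitude entries of x, returned as (indices, values)."""
    idx = np.argpartition(np.abs(x), -k)[-k:]   # O(d) selection, order not guaranteed
    return idx, x[idx]

g = np.random.randn(1_000_000).astype(np.float32)
k = g.size // 100                               # 1% density, i.e. 99% sparsity
idx, vals = topk_sparsify(g, k)                 # only ~1% of the entries are stored
```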
However, naive sparsification has a big problem: you systematically ignore small gradient values. This introduces bias, potentially leaving certain weights untouched or steering updates off-course.
Enter Error Feedback
Error Feedback (EF), a technique borrowed from distributed ML, elegantly addresses this: whatever gradient components you drop during compression, you save in an error buffer and add back to the gradient at the next step. Over time, no information is truly lost, only delayed.
For distributed training, EF works wonders. But in our single-machine scenario, standard EF has a frustrating paradox: the error buffer is the same size as the full gradient! This erases your memory savings.
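In pseudocode, classic error feedback looks roughly like the sketch below (my paraphrase of the standard EF recipe, not the paper's algorithm). Note that `err` is a dense float32 array with the same shape as the gradient, which is exactly the overhead that motivates MicroAdam.

```python
import numpy as np

d, k = 1_000_000, 10_000                      # 1% density
err = np.zeros(d, dtype=np.float32)           # dense buffer, same size as the gradient

def ef_topk(grad):
    """Classic error feedback around Top-K: compress (gradient + carried-over error),
    then stash whatever was dropped so it can be re-applied at the next step."""
    acc = grad + err                          # re-inject previously dropped information
    idx = np.argpartition(np.abs(acc), -k)[-k:]
    err[:] = acc                              # residual = acc with its top-k entries removed
    err[idx] = 0.0
    return idx, acc[idx]                      # sparse (indices, values) update

idx, vals = ef_topk(np.random.randn(d).astype(np.float32))
```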
MicroAdam’s breakthrough is simple but profound:
Compress the error buffer itself.
How MicroAdam Works
MicroAdam combines Top-K sparsification with error feedback—but crucially, compresses the error feedback using low-precision quantization. This allows massive memory savings while preserving convergence.
1. Correct, Then Compress
At step \( t \):
- Start with the gradient \( g_t \).
- Correct it using the dequantized error feedback \( e_t \): \( a_t = g_t + \text{dequantize}(e_t) \).
- Apply Top-K sparsification (e.g., keep the top 1% of entries by magnitude): \( (I_t, V_t) = \text{TopK}(a_t) \).
Here \(I_t\) are indices and \(V_t\) are values of the largest-magnitude entries. This sparse vector updates Adam’s moment estimates.
2. Compress the Error
The residual after removing top-K entries:
\[ e_{t+1}^{\text{raw}} = a_t - \text{TopK}(a_t) \]

Instead of storing this dense vector, MicroAdam quantizes it to 4 bits per value:

\[ e_{t+1} = \text{quantize}(e_{t+1}^{\text{raw}}, \text{bits}=4) \]

This shrinks the error buffer by a factor of 8 relative to float32, making EF's memory cost negligible.
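Putting steps 1 and 2 together, here is a simplified NumPy rendering of the correct-then-compress loop. The uniform 4-bit quantizer below is an illustrative stand-in (the paper uses a bucketed quantizer with fused CUDA kernels, and a real implementation would pack two 4-bit codes per byte, which is omitted here for clarity).

```python
import numpy as np

def quantize4(x):
    """Uniform 4-bit quantization over the tensor's range: 16 levels.
    Codes are kept in uint8 here; a real implementation packs two per byte."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)    # values in 0..15
    return codes, lo, scale

def dequantize4(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

def correct_then_compress(grad, ef_state, k):
    """Steps 1-2 (simplified): error-correct the gradient, Top-K it,
    then re-quantize the residual into the 4-bit error buffer."""
    codes, lo, scale = ef_state
    a = grad + dequantize4(codes, lo, scale)               # a_t = g_t + dequantize(e_t)
    idx = np.argpartition(np.abs(a), -k)[-k:]              # (I_t, V_t) = TopK(a_t)
    vals = a[idx].copy()
    a[idx] = 0.0                                           # residual: a_t - TopK(a_t)
    return (idx, vals), quantize4(a)                       # new 4-bit EF state

d, k = 1_000_000, 10_000
ef_state = quantize4(np.zeros(d, dtype=np.float32))        # error buffer starts at zero
grad = np.random.randn(d).astype(np.float32)
(idx, vals), ef_state = correct_then_compress(grad, ef_state, k)
```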
3. Dynamic Statistics with a Sparse Window
Adam normally stores dense \( m_t \) and \( v_t \) for all \(d\) parameters. MicroAdam skips this:
- Keep only a sliding window of the \( m \) most recent sparse gradients \((I_t, V_t)\) (e.g., \( m = 10 \) steps).
- Recompute \( m_t \) and \( v_t \) on the fly by unrolling the moment formulas over this sparse history (see the sketch after this list).
- With extreme sparsity (99% zeros) and a custom CUDA implementation, recomputation is fast and memory-light.
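The sketch below spells out that unrolling in dense NumPy for readability (the actual implementation never materializes dense vectors; it operates directly on the sparse pairs inside a custom CUDA kernel): given the last \( m \) sparse gradients, the two statistics are rebuilt by expanding the exponential moving averages and dropping terms older than the window.

```python
import numpy as np

def rebuild_moments(window, d, beta1=0.9, beta2=0.999):
    """Rebuild Adam's first/second moments from a window of sparse gradients.

    `window` is a list of (indices, values) pairs, oldest first. Contributions
    older than the window are treated as fully decayed -- the approximation
    the sliding-window scheme makes. Dense arrays are used only for clarity."""
    m = np.zeros(d, dtype=np.float32)
    v = np.zeros(d, dtype=np.float32)
    for age, (idx, vals) in enumerate(reversed(window)):   # age 0 = most recent step
        m[idx] += (1 - beta1) * beta1**age * vals
        v[idx] += (1 - beta2) * beta2**age * vals**2
    return m, v

# Hypothetical usage with a 10-step window of 1%-dense gradients:
d, k = 1_000_000, 10_000
window = [(np.random.choice(d, k, replace=False),
           np.random.randn(k).astype(np.float32)) for _ in range(10)]
m_t, v_t = rebuild_moments(window, d)
```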
Error Feedback in Action
In the classic Rosenbrock function minimization:
- Adam (left): Smooth, direct trajectory to optimum.
- Top-K Adam (center): Jagged path—without EF, information loss severely hurts convergence.
- Top-K Adam + EF (right): Smooth path restored. EF accumulates dropped gradient components and reintroduces them, preserving trajectory.
MicroAdam applies this principle at massive scale—efficiently.
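The toy experiment is easy to reproduce. The sketch below (illustrative only, not the paper's exact setup) runs Adam on the 2-D Rosenbrock function with top-1 sparsified gradients, once without and once with error feedback, so the two end points can be compared against the optimum at (1, 1).

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def sparse_adam(use_ef, steps=5000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    p = np.array([-1.5, 2.0])                       # starting point
    m, v, err = np.zeros(2), np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p) + err                # err stays zero when EF is off
        sparse = np.zeros(2)
        i = np.argmax(np.abs(g))                    # "top-1" sparsification in 2-D
        sparse[i] = g[i]
        if use_ef:
            err = g - sparse                        # remember the dropped coordinate
        m = b1 * m + (1 - b1) * sparse
        v = b2 * v + (1 - b2) * sparse**2
        p = p - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
    return p

for ef in (False, True):
    p = sparse_adam(use_ef=ef)
    print(f"error feedback={ef}: final point {p}, f(p)={rosenbrock(p):.4f}")
```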
Memory Footprint
For LLaMA-2 7B (\(d\) parameters):
- AdamW (bf16): \(4d\) bytes → 25.10 GB
- AdamW-8bit: \(2d\) bytes → 12.55 GB
- MicroAdam (m = 10, 1% density, i.e. \( k = 0.01d \)): \( 0.5d \) bytes (4-bit EF buffer) + \( 4mk \) bytes (sparse window) = \( 0.9d \) bytes → 5.65 GB
That’s less than half the memory of 8-bit Adam—and over 4× lighter than standard AdamW.
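The arithmetic behind these numbers is easy to reproduce; a quick sanity check (assuming d ≈ 6.74 billion parameters for LLaMA-2 7B and reading the figures above as GiB, i.e. 2^30 bytes) gives:

```python
d = 6.74e9                 # approximate parameter count of LLaMA-2 7B
m, density = 10, 0.01      # window length and Top-K density used above
GiB = 1024**3

state_bytes = {
    "AdamW (bf16)": 4 * d,                          # two bf16 moment buffers, 2 bytes each
    "AdamW-8bit":   2 * d,                          # two 8-bit moment buffers
    "MicroAdam":    0.5 * d + 4 * m * density * d,  # 4-bit EF buffer + sparse window (4 bytes/entry)
}

for name, b in state_bytes.items():
    print(f"{name:14s} {b / GiB:6.2f} GiB")         # roughly matches 25.10 / 12.55 / 5.65 above
```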
Convergence Guarantees
MicroAdam isn’t just practical—it’s theoretically sound.
Two compression operators matter here:
- Gradient compressor \(C\): the Top-K sparsifier, assumed to be \(q\)-contractive, i.e. \( \|C(x) - x\|^2 \le q\,\|x\|^2 \) with \( q \in [0, 1) \). For Top-K on a \(d\)-dimensional vector, \( q = 1 - k/d \), so it is tied directly to the sparsity level.
- Error compressor \(Q\): the 4-bit quantizer, assumed unbiased with bounded relative variance, i.e. \( \mathbb{E}[Q(x)] = x \) and \( \mathbb{E}\,\|Q(x) - x\|^2 \le \omega\,\|x\|^2 \).
The key condition:
\[ (1 + \omega)\,q < 1 \]

This ensures that the two compressions combined do not lose too much cumulative information.
Under this, the authors prove:
- Theorem 1 (Non-convex): Convergence rate \( \mathcal{O}(1/\sqrt{T}) \), matching AMSGrad.
- Theorem 2 (PL condition): Convergence rate \( \mathcal{O}(\log T / T) \) for functions meeting the Polyak-Łojasiewicz property.
Asymptotic rates identical to uncompressed Adam—compression only affects constants.
Experiments: Real-World Proof
Fine-tuning BERT and OPT
BERT-Base (110M), BERT-Large (335M), OPT-1.3B — compared against Adam, Adam-8bit, CAME, and GaLore.
MicroAdam matches or beats Adam-8bit’s accuracy, with similar memory use—significantly outperforming GaLore and CAME.
Fine-tuning LLaMA-2 7B and 13B
GSM-8k Math Reasoning
MicroAdam enabled full fine-tuning of LLaMA-2 7B on a single 40 GB GPU:
- Accuracy: 34.72% (MicroAdam m=10) vs. 34.50% (Adam)
- Memory: MicroAdam fit comfortably on the single 40 GB GPU, while standard Adam did not.
Open-Platypus Instruction Tuning
MicroAdam delivered the highest average accuracy across tasks, with the lowest memory use among the Adam-style optimizers compared.
Pre-training on ImageNet
ResNet-18 and ResNet-50, trained from scratch.
MicroAdam achieved the top validation accuracies, even surpassing a tuned SGD baseline, while using the least optimizer-state memory.
The authors hypothesize an implicit regularization effect: sparse updates (about 10% of weights per step) reduce overfitting.
Conclusion & Implications
MicroAdam is more than just another optimizer. It’s a principled, memory-efficient design with provable convergence—bridging the longstanding gap between efficiency and rigor.
Key takeaways:
- Compressed Error Feedback — The EF buffer itself is quantized to 4 bits, slashing its memory footprint to almost nothing.
- Massive Memory Savings — Less than half the memory of Adam-8bit; >4× lighter than standard AdamW.
- Strong Theory — Converges at the same asymptotic rate as uncompressed Adam; compression affects only constants.
- Outstanding Practical Results — Matches or exceeds Adam on LLM fine-tuning; beats SGD on CV pre-training in validation accuracy.
By proving that error feedback remains effective—even heavily compressed—MicroAdam paves the way for future optimizers that are both memory-lean and mathematically sound.
For practitioners, this could mean fully fine-tuning state-of-the-art models on accessible hardware, lowering costs and opening cutting-edge AI research to a wider community.