The world of Artificial Intelligence is in an arms race, but the weapons aren’t missiles—they’re parameters. From BERT (340 million) to GPT-2 (1.5 billion) and T5 (11 billion), we’ve seen a clear trend: bigger models tend to deliver better accuracy. But this relentless growth comes at a steep price—training these behemoths demands an astronomical amount of memory, far exceeding what a single GPU can handle.

Consider this: even a modest 1.5-billion-parameter model, like GPT-2, requires more than 24 GB of memory just for training states when using standard methods. That already pushes the limits of a high-end 32 GB GPU—and that’s before you account for the activations and all the temporary data. So how can we possibly train models with tens, hundreds, or even a trillion parameters?

The standard approach—data parallelism—scales computation but not memory. It replicates the entire model on every GPU, so you hit a memory wall no matter how many GPUs you add. Model parallelism, where you split the model across GPUs, can bypass this wall but is notoriously difficult to implement, and its performance plummets when scaling beyond a single machine due to slower inter-node communication.

This is the challenge a team of researchers at Microsoft set out to solve. Their groundbreaking paper, ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, introduces a novel and elegant solution that fundamentally changes the economics of training massive models. ZeRO, short for Zero Redundancy Optimizer, is a set of memory optimization techniques that allows model size to scale linearly with the number of GPUs—while maintaining high computational efficiency.


The Memory Hog: Where Does All the GPU RAM Go?

Before we can appreciate ZeRO’s brilliance, we need to understand exactly why training a model consumes so much more memory than just storing its parameters. Training memory usage breaks down into two main categories:

  • Model States
  • Residual States

Model States

This is the biggest consumer of memory for large models, and it consists of:

  1. Parameters — the weights and biases learned during training.
  2. Gradients — computed during backprop to update parameters, matching them in size.
  3. Optimizer States — extra variables maintained by optimizers like Adam, such as momentum and variance, enabling adaptive learning rates.

Mixed-precision training compounds the problem. To leverage Tensor Cores on modern GPUs, training is often done in FP16. But stability demands an FP32 master copy and FP32 versions of optimizer states.
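
To make that bookkeeping concrete, here is a minimal PyTorch-style sketch of the master-copy pattern (not the paper's code; the layer size, learning rate, and loss scale are placeholder assumptions):

```python
import torch

# FP16 working copy used for forward/backward; FP32 master copy used for the update.
model = torch.nn.Linear(4096, 4096).cuda().half()                                   # FP16 weights: 2 bytes/param
master = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]  # FP32 master copy: 4 bytes/param
opt = torch.optim.Adam(master, lr=1e-4)                                              # FP32 momentum + variance: 8 bytes/param

def step(x, y, loss_scale=1024.0):
    # x, y: FP16 CUDA tensors
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss * loss_scale).backward()                       # FP16 gradients: 2 bytes/param
    for p16, p32 in zip(model.parameters(), master):
        p32.grad = p16.grad.float() / loss_scale         # unscale into FP32
        p16.grad = None
    opt.step()                                           # the update happens entirely in FP32
    opt.zero_grad(set_to_none=True)
    with torch.no_grad():                                # refresh the FP16 working copy
        for p16, p32 in zip(model.parameters(), master):
            p16.copy_(p32)
```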

For a model with \( \Psi \) parameters using Adam:

  • Parameters: \( 2\Psi \) bytes (FP16)
  • Gradients: \( 2\Psi \) bytes (FP16)
  • Optimizer States:
    • FP32 copy of parameters: \( 4\Psi \) bytes
    • Momentum: \( 4\Psi \) bytes
    • Variance: \( 4\Psi \) bytes

That totals:

\[ 2\Psi + 2\Psi + (4\Psi + 4\Psi + 4\Psi) = 16\Psi \ \text{bytes} \]

For GPT-2 with 1.5B parameters: \( 16 \times 1.5 \ \text{B} \approx 24\ \text{GB} \). The optimizer states alone account for \( 12\Psi \) of those bytes, six times the memory of the FP16 parameters themselves.
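
A quick sanity check of that arithmetic in plain Python (the per-parameter byte counts are simply the breakdown above):

```python
def model_state_bytes(num_params: int) -> float:
    """Bytes of model states for Adam + mixed precision: 2 + 2 + (4 + 4 + 4) per parameter."""
    fp16_params = 2 * num_params          # FP16 weights
    fp16_grads = 2 * num_params           # FP16 gradients
    optimizer_states = 12 * num_params    # FP32 master copy + momentum + variance
    return fp16_params + fp16_grads + optimizer_states

print(model_state_bytes(1_500_000_000) / 1e9)   # ~24.0 GB for GPT-2 (1.5B parameters)
```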

Residual States

This is everything else:

  • Activations — outputs from each layer during the forward pass, kept because the backward pass needs them. For long sequences and large batch sizes, activations can easily reach tens of GB. Even with activation checkpointing (recomputing instead of storing; see the sketch after this list), the cost remains significant.
  • Temporary Buffers — used in operations like gradient aggregation; can be as large as the model itself.
  • Memory Fragmentation — allocation/deallocation cycles leave small non-contiguous free blocks. Even if total free memory is sufficient, lacking a contiguous block leads to out-of-memory errors.
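
Activation checkpointing itself is easy to sketch; here is a minimal example using PyTorch's built-in utility (the depth and layer sizes are arbitrary assumptions, and `use_reentrant=False` assumes a recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A hypothetical deep stack; real transformer blocks would be much larger.
layers = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()) for _ in range(24)]
)
x = torch.randn(8, 4096, requires_grad=True)

# Plain forward: every intermediate activation stays alive until the backward pass.
y_full = layers(x)

# Checkpointed forward: only activations at the 4 segment boundaries are stored;
# the rest are recomputed during backward, trading extra compute for memory.
y_ckpt = checkpoint_sequential(layers, 4, x, use_reentrant=False)
y_ckpt.sum().backward()
```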

ZeRO’s Core Idea: Killing Memory Redundancy

Here’s the key insight: in standard data parallelism, most of this memory is redundant. If you have \( N_d \) GPUs, you have \( N_d \) identical copies of parameters, gradients, and optimizer states. ZeRO eliminates that redundancy.

It consists of two groups of optimizations:

  1. ZeRO-DP — cuts down model state memory.
  2. ZeRO-R — trims residual state memory.

ZeRO-DP: Partitioning Model States

ZeRO-DP enhances traditional data parallelism by partitioning the model states across GPUs instead of replicating them. This happens in three cumulative stages:

A schematic comparison of memory usage across ZeRO-DP stages for parameters, gradients, and optimizer states.

Figure 1: Memory savings from ZeRO-DP. The baseline replicates all states; each stage partitions more components (green = optimizer states, orange = gradients, blue = parameters), drastically cutting per-GPU usage.

Stage 1 — Optimizer State Partitioning (\(P_{os}\))

Optimizer states are the largest portion. ZeRO partitions them across GPUs so each holds only \( 1/N_d \) of the states.

  • Memory Reduction: ~4× — from \( 16\Psi \) to ~\( 4\Psi \) bytes.
  • Comm. Cost: Same as standard data parallelism.
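
A conceptual sketch of what Stage 1 does (this is not DeepSpeed's implementation; it assumes an initialized torch.distributed process group on GPUs, a single flattened parameter tensor, and a parameter count divisible by the world size, and it keeps everything in FP32 for clarity):

```python
import torch
import torch.distributed as dist

def build_sharded_adam(flat_params: torch.Tensor):
    """Each rank keeps Adam state (momentum, variance, its weight slice) for only 1/N_d of the parameters."""
    world, rank = dist.get_world_size(), dist.get_rank()
    shard = flat_params.numel() // world
    my_shard = flat_params[rank * shard:(rank + 1) * shard].detach().clone().requires_grad_(True)
    return torch.optim.Adam([my_shard], lr=1e-4), my_shard

def sharded_step(flat_params, flat_grads, opt, my_shard):
    """Gradients arrive via the usual data-parallel all-reduce; only the local slice is updated."""
    world, rank = dist.get_world_size(), dist.get_rank()
    shard = flat_params.numel() // world
    my_shard.grad = flat_grads[rank * shard:(rank + 1) * shard]
    opt.step()
    opt.zero_grad(set_to_none=True)
    # Every rank updated a different slice, so gather the slices back into the full replica.
    with torch.no_grad():
        dist.all_gather_into_tensor(flat_params, my_shard)
```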

Stage 2 — Gradient Partitioning (\(P_{os+g}\))

Each GPU only needs gradients for its optimizer state slice. Instead of all-reduce, ZeRO uses reduce-scatter to sum and distribute partitioned gradients.

  • Memory Reduction: ~8× — gradients shrink by \( N_d \) factor, total per-GPU memory ~\( 2\Psi \).
  • Comm. Cost: Unchanged from baseline (reduce-scatter + all-gather volume equals all-reduce).
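
The communication change is small. A minimal torch.distributed sketch of the idea (assumes an initialized NCCL process group; the gradient size is a placeholder, and real implementations bucket this):

```python
import torch
import torch.distributed as dist

world = dist.get_world_size()
flat_grads = torch.randn(world * 1_000_000, device="cuda")        # this rank's local gradients

# Standard data parallelism: every rank ends up holding the full summed gradient.
#   dist.all_reduce(flat_grads)

# ZeRO Stage 2: each rank receives only the summed slice matching its
# optimizer-state partition; the full-size gradient buffer can then be freed.
my_grad_shard = torch.empty(flat_grads.numel() // world, device="cuda")
dist.reduce_scatter_tensor(my_grad_shard, flat_grads)             # defaults to SUM
```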

Stage 3 — Parameter Partitioning (\(P_{os+g+p}\))

The parameters themselves are partitioned. GPUs fetch parameter slices as needed for each layer’s computation, then free them after use.

  • Memory Reduction: Linear in \( N_d \); per-GPU = \( 16\Psi / N_d \).
  • Comm. Cost: ~50% higher than baseline, but enables colossal models.
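
Conceptually, the forward pass of a partitioned layer looks something like the following sketch (again not DeepSpeed's code; it assumes an initialized process group, that each rank stores an equally sized, possibly padded shard of the layer's flattened weight, and that the shard shares a dtype with the activations):

```python
import torch
import torch.distributed as dist

def partitioned_linear_forward(x, my_weight_shard, out_features, in_features):
    world = dist.get_world_size()
    # Materialize the full weight just-in-time by gathering every rank's shard.
    full_flat = torch.empty(world * my_weight_shard.numel(), device=x.device, dtype=x.dtype)
    dist.all_gather_into_tensor(full_flat, my_weight_shard)
    weight = full_flat[: out_features * in_features].view(out_features, in_features)
    y = torch.nn.functional.linear(x, weight)
    # Drop the gathered copy immediately so only the 1/N_d shard stays resident.
    del weight, full_flat
    return y
```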

With all three stages enabled, ZeRO can theoretically train a 1 trillion-parameter model on 1024 GPUs, with only ~16 GB of model states per GPU.
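
Plugging the per-stage formulas into a few lines of Python makes the claim easy to check (\(\Psi\) is the parameter count, \(N_d\) the data-parallel degree):

```python
def per_gpu_model_state_gb(psi: float, n_d: int) -> dict:
    """Per-GPU model-state memory in GB for baseline DP and the three ZeRO-DP stages (Adam, mixed precision)."""
    return {
        "baseline":  16 * psi / 1e9,
        "P_os":      (4 * psi + 12 * psi / n_d) / 1e9,
        "P_os+g":    (2 * psi + 14 * psi / n_d) / 1e9,
        "P_os+g+p":  (16 * psi / n_d) / 1e9,
    }

print(per_gpu_model_state_gb(1e12, 1024))
# P_os+g+p comes out to ~15.6 GB of model states per GPU for a
# 1-trillion-parameter model on 1024 GPUs (activations and buffers are extra).
```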


ZeRO-R: Tackling Residual Memory

Once model state redundancy is gone, residual memory becomes the new bottleneck. ZeRO-R addresses this with:

  • Partitioned Activation Checkpointing — partitions activation checkpoints across GPUs instead of replicating them, reconstructing activations on demand; for very large models, checkpoints can also be offloaded to CPU memory.
  • Constant-Size Buffers — caps the size of temporary fusion buffers so they no longer grow with the model (a toy sketch follows this list).
  • Memory Defragmentation — separates short- and long-lived tensors, keeping contiguous space available.
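
The constant-size-buffer idea is simple enough to sketch: process tensors in fixed-size buckets rather than fusing everything into one buffer that grows with the model. A toy illustration (the bucket size and the callback are placeholder assumptions; a real implementation would also copy results back out of each bucket):

```python
import torch

BUCKET_ELEMS = 50_000_000   # hypothetical fixed bucket size (~200 MB in FP32)

def fused_apply_in_buckets(tensors, collective):
    """Apply `collective` (e.g. an all-reduce) to flattened chunks of bounded size,
    so temporary buffer memory stays constant regardless of model size."""
    bucket, count = [], 0
    for t in tensors:
        bucket.append(t.reshape(-1))
        count += t.numel()
        if count >= BUCKET_ELEMS:
            collective(torch.cat(bucket))   # the temporary buffer is at most ~one bucket
            bucket, count = [], 0
    if bucket:
        collective(torch.cat(bucket))
```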

Putting ZeRO to the Test

The team implemented ZeRO-100B, which includes ZeRO-DP Stage 2 (\(P_{os+g}\)) plus all of the ZeRO-R optimizations. Tests ran on a 400-GPU NVIDIA V100 cluster, compared against Megatron-LM, the state-of-the-art large-model framework at the time.

Unprecedented Scale and Speed

Performance comparison chart showing ZeRO’s high throughput vs. baseline’s steep drop beyond 40B parameters.

Figure 2: ZeRO maintains ~38 TFLOPS per GPU even for 100B+ models, while baseline collapses beyond 40B due to inefficient cross-node model parallelism.

Results: ZeRO trains models of up to 170B parameters, roughly an 8× increase over Megatron-LM's limit, and delivers up to 10× higher throughput.


Super-Linear Scalability

Bar and line chart showing per-GPU throughput increasing as GPUs scale from 64 to 400.

Figure 3: ZeRO’s super-linear scaling—more GPUs not only increase total performance, but also raise per-GPU throughput.

Why? As GPUs increase, ZeRO-DP reduces per-GPU memory footprint, enabling larger batches per GPU. Larger batches => greater arithmetic intensity => better utilization => higher TFLOPS.


Democratizing Large Model Training

Chart showing ZeRO-DP training up to 13B parameters with no model parallelism.

Figure 4: ZeRO-DP can train 13B-parameter models without model parallelism. Standard DP fails beyond ~1.4B parameters.

ZeRO lets data scientists train huge models without complex model refactoring, using vanilla DP workflows.


Real-World Impact: Turing-NLG

Validation perplexity chart showing ZeRO-trained Turing-NLG (17B) outperforming Megatron-LM (8.3B).

Figure 5: ZeRO enabled training of Turing-NLG (17B params), achieving state-of-art perplexity.

Turing-NLG—a 17B-parameter language model—set new accuracy records thanks to ZeRO-100B’s efficiency.


Ablation Study: Where Do Gains Come From?

Table of the five ZeRO configurations combining different DP and R optimizations.

Table 3: Five configs used to isolate impact of different ZeRO optimizations.

Max Model Size

Bar chart showing jump from 40B to 140B between configs when enabling gradient & activation partitioning.

Figure 6: Largest trainable model size increases dramatically with more ZeRO features.

Memory Usage

Chart comparing max cached memory for 40B & 100B models under different configs.

Figure 7: Peak memory declines as optimizations are added, freeing space for larger batches.

Throughput

Chart showing per-GPU throughput for 60B & 170B models under different configs.

Figure 8: Throughput generally rises as memory savings allow bigger batches; CPU offload helps only when it’s the sole way to fit.


Conclusion: The Path to a Trillion Parameters

ZeRO signals a paradigm shift for large model training. By systematically eliminating memory redundancy, it smashes the replication bottleneck of data parallelism and sidesteps the scaling problems of model parallelism.

Even the partial ZeRO-100B implementation yields an 8× model size increase and 10× speedup today. Full ZeRO-DP makes per-device memory inversely proportional to device count:

Table showing theoretical per-GPU memory for different model sizes & DP degrees under three ZeRO-DP stages.

Table 1: With \(P_{os+g+p}\), a 1T-parameter model fits in 15.6 GB per GPU with 1024-way DP.

Yes, training a trillion-parameter model still demands enormous compute power—likely next-gen hardware. But ZeRO solves the memory problem, providing the system technology to make it feasible when the compute catches up.

ZeRO is open-sourced as part of Microsoft’s DeepSpeed library, democratizing access to large-scale AI and paving the way for the next generation of massive, powerful, and world-changing models.