![Cover image](https://deep-paper.org/en/papers/2025-10/1910.02054/images/cover.png)
# ZeRO to Trillion: A Deep Dive into the Memory Optimizations Behind Massive AI Models
The world of Artificial Intelligence is in an arms race, but the weapons aren't missiles; they're parameters. From BERT (340 million) to GPT-2 (1.5 billion) and T5 (11 billion), the trend is clear: bigger models tend to deliver better accuracy. But this relentless growth comes at a steep price. Training these behemoths demands an astronomical amount of memory, far exceeding what a single GPU can hold.

Consider this: even a modest 1.5-billion-parameter model like GPT-2 needs more than 24 GB of memory just for its model states (parameters, gradients, and optimizer states) under standard training methods. That already pushes the limits of a high-end 32 GB GPU, and it doesn't even count the activations and other temporary data. So how can we possibly train models with tens, hundreds, or even a trillion parameters? ...
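
Where does that 24 GB figure come from? With the common mixed-precision Adam setup, each parameter carries roughly 16 bytes of model state: 2 for the fp16 weight, 2 for the fp16 gradient, and about 12 for fp32 optimizer state (a master copy of the weight, plus momentum and variance). The sketch below is a back-of-the-envelope check under those assumptions; activations and temporary buffers are deliberately left out.

```python
def model_state_bytes(num_params: int,
                      bytes_fp16: int = 2,
                      bytes_optimizer: int = 12) -> int:
    """Rough model-state footprint for mixed-precision Adam training.

    Assumes 2 bytes for fp16 weights, 2 bytes for fp16 gradients,
    and ~12 bytes of fp32 optimizer state per parameter (master copy,
    momentum, variance), i.e. about 16 bytes per parameter in total.
    Activations and temporary buffers are not included.
    """
    weights = num_params * bytes_fp16       # fp16 parameters
    grads = num_params * bytes_fp16         # fp16 gradients
    optim = num_params * bytes_optimizer    # fp32 copy + momentum + variance
    return weights + grads + optim


# GPT-2 scale: 1.5 billion parameters -> ~24 GB of model states alone
gpt2_params = 1_500_000_000
print(f"{model_state_bytes(gpt2_params) / 1e9:.1f} GB")  # prints 24.0 GB
```

Scale the same arithmetic to 100 billion parameters and the model states alone would need about 1.6 TB, which is why simply buying a bigger GPU is not an option.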