[ZeRO: Memory Optimizations Toward Training Trillion Parameter Models 🔗](https://arxiv.org/abs/1910.02054)

ZeRO to Trillion: A Deep Dive into the Memory Optimizations Behind Massive AI Models

The world of Artificial Intelligence is in an arms race, but the weapons aren’t missiles—they’re parameters. From BERT (340 million) to GPT-2 (1.5 billion) and T5 (11 billion), we’ve seen a clear trend: bigger models tend to deliver better accuracy. But this relentless growth comes at a steep price—training these behemoths demands an astronomical amount of memory, far exceeding what a single GPU can handle. Consider this: even a modest 1.5-billion-parameter model, like GPT-2, requires at least 24 GB of memory just for its model states (parameters, gradients, and optimizer states) when trained with standard methods. That already pushes the limits of a high-end 32 GB GPU—and that’s before you account for activations and other temporary data. So how can we possibly train models with tens, hundreds, or even a trillion parameters? ...
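
Where does the 24 GB figure come from? With mixed-precision Adam training, each parameter carries fp16 weights and gradients plus fp32 master weights, momentum, and variance, roughly 16 bytes per parameter before activations are even counted. A quick back-of-the-envelope check in Python (a sketch of the arithmetic, not the paper's code):

```python
# Rough memory estimate for the "model states" of mixed-precision Adam training,
# the setup analyzed in the ZeRO paper (arithmetic sketch, not their code).
def model_state_bytes(num_params: int) -> int:
    fp16_weights  = 2 * num_params   # fp16 parameters
    fp16_grads    = 2 * num_params   # fp16 gradients
    fp32_weights  = 4 * num_params   # fp32 master copy of the parameters
    fp32_momentum = 4 * num_params   # Adam first moment
    fp32_variance = 4 * num_params   # Adam second moment
    return fp16_weights + fp16_grads + fp32_weights + fp32_momentum + fp32_variance

gpt2_params = 1_500_000_000
print(model_state_bytes(gpt2_params) / 1e9)  # ~24 GB, before activations and buffers
```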

2019-10
[Scaling Laws for Neural Language Models 🔗](https://arxiv.org/abs/2001.08361)

More is Different — The Surprising Predictability of Language Model Performance

In the world of artificial intelligence, Large Language Models (LLMs) can seem like a form of modern alchemy. We mix massive datasets, gargantuan neural networks, and mind-boggling amounts of computation—and out comes something that can write poetry, debug code, and explain complex topics. But why does this work? And if we had ten times the resources, how much better could we make it? Is there a method to this madness, or are we just hoping for the best? ...
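
The paper's headline answer is that performance is remarkably predictable: test loss falls as a smooth power law in model size, dataset size, and training compute. As a toy illustration of what such a law lets you do, here is a loss-versus-parameters extrapolation in the form the paper uses, L(N) = (N_c / N)^alpha_N; the constants below are illustrative stand-ins quoted from memory, not values to rely on:

```python
# Toy power-law extrapolation of loss versus parameter count, L(N) = (N_c / N) ** alpha_N.
# The constants are illustrative stand-ins, roughly in the range the paper reports.
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f} (nats per token)")
```

The point is not the specific numbers but the shape: on a log-log plot the curve is a straight line, so you can budget a 10x scale-up before ever running it.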

2020-01

T5 Explained: How Google's Text-to-Text Transformer Pushed the NLP Frontier

The recent explosion of progress in Natural Language Processing (NLP) largely rides on one lesson: pre-train large models on lots of text, then adapt them to specific tasks. Models like BERT, GPT-2, RoBERTa, and XLNet all lean on this transfer-learning paradigm, but they differ in architecture, pre-training objectives, and datasets — and those differences can be hard to disentangle. In “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” the Google Brain team took a different tack. Instead of proposing just another tweak, they built a unified experimental playground and ran a massive, principled sweep of variables: architectures, unsupervised objectives, pre-training corpora, fine-tuning strategies, and scaling regimes. The result is both a thorough empirical guide and a family of state-of-the-art models called T5 (Text-to-Text Transfer Transformer). ...
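
The unifying move is mechanical: every task, from translation to acceptability classification to summarization, is rephrased as "feed the model a text string, train it to emit a text string", with a short task prefix telling it what to do. A minimal sketch of that framing (example pairs in the spirit of the ones shown in the paper; the loop is just for illustration):

```python
# In the text-to-text framing, one encoder-decoder model handles every task;
# only the task prefix in the input changes. Examples follow the style used in T5.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews after the storm ...", "storm damage reported ..."),
]

for source_text, target_text in examples:
    # The same maximum-likelihood objective is used for all pairs, so adding a new
    # task means adding new (input text, target text) examples, nothing else.
    print(f"input : {source_text}\noutput: {target_text}\n")
```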

Before ChatGPT: How Generative Pre-Training Revolutionized NLP (The GPT-1 Paper Explained)

In the world of AI today, models like ChatGPT seem almost magical. They can write code, compose poetry, and answer complex questions with remarkable fluency. But this revolution didn’t happen overnight—it was built on a series of foundational breakthroughs. One of the most crucial was a 2018 paper from OpenAI titled “Improving Language Understanding by Generative Pre-Training”. This paper introduced what we now call GPT-1, presenting a simple yet profoundly effective framework that changed the trajectory of Natural Language Processing (NLP). The core idea: first let a model learn the patterns of language from a massive amount of raw text, and then fine-tune that knowledge for specific tasks. ...
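
In code terms, the first stage is ordinary next-token prediction over unlabeled text, and the second stage reuses the same network with a small task-specific head. A minimal runnable sketch of the pre-training objective (the toy model below is only a stand-in for the paper's 12-layer transformer decoder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1, generative pre-training: maximize the likelihood of each next token
# given its preceding context. The tiny embedding+linear model is a stand-in
# for the transformer decoder used in the paper.
vocab_size, dim = 100, 32
toy_lm = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

token_ids = torch.randint(0, vocab_size, (8, 16))   # a batch of raw text, as token ids
logits = toy_lm(token_ids[:, :-1])                  # predict every next token
targets = token_ids[:, 1:]
pretrain_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(pretrain_loss.item())

# Stage 2, fine-tuning: keep the pretrained body, add a linear head on the final
# hidden state, and train on labeled task data; the paper also keeps the
# language-modeling loss as an auxiliary objective during fine-tuning.
```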

[Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism 🔗](https://arxiv.org/abs/1909.08053)

Megatron-LM: Scaling Language Models to Billions of Parameters with Elegant PyTorch Parallelism

The world of Natural Language Processing (NLP) has entered an era of giant models. From GPT-2 to BERT and beyond, one trend is crystal clear: the bigger the model, the better the performance. These massive transformers can generate coherent articles, answer complex questions, and parse language with unprecedented nuance. But this capability comes at a steep engineering cost. These models have billions—and increasingly, trillions—of parameters. How can such colossal networks fit into the memory of a single GPU? ...
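
Megatron's answer is intra-layer (tensor) model parallelism written with plain PyTorch primitives: in the transformer MLP, the first weight matrix is split by columns across GPUs and the second by rows, so the block needs only a single all-reduce in the forward pass. The sketch below simulates the two shards in one process instead of using torch.distributed, but the algebra is the same:

```python
import torch
import torch.nn.functional as F

# Megatron-style tensor parallelism for the MLP block: Z = GeLU(X @ A) @ B,
# with A split by columns and B split by rows across "GPUs" (simulated here).
torch.manual_seed(0)
hidden, ffn, world_size = 8, 32, 2
X = torch.randn(4, hidden)
A = torch.randn(hidden, ffn)
B = torch.randn(ffn, hidden)

A_shards = A.chunk(world_size, dim=1)   # each rank holds a column slice of A
B_shards = B.chunk(world_size, dim=0)   # ...and the matching row slice of B

partials = []
for a_shard, b_shard in zip(A_shards, B_shards):
    y_local = F.gelu(X @ a_shard)       # GeLU is elementwise, so no sync is needed here
    partials.append(y_local @ b_shard)  # each rank produces a partial sum of Z

Z = sum(partials)                       # in real training: one all-reduce across ranks
print(torch.allclose(Z, F.gelu(X @ A) @ B, atol=1e-5))
```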

2019-09
[AutoAugment: Learning Augmentation Strategies from Data 🔗](https://arxiv.org/abs/1805.09501)

Beyond Flipping and Cropping: How AutoAugment Teaches AI to Augment Its Own Data

Deep learning models are notoriously data-hungry. The more high-quality, labeled data you can feed them, the better they perform. But what happens when you can’t just collect more data? You get creative. For years, the go-to technique has been data augmentation: taking your existing images and creating new, slightly modified versions—flipping them, rotating them, shifting colors—to expand your dataset for free. This approach works wonders. It teaches the model which features are truly important and which are just artifacts of a specific image. A cat is still a cat whether it’s on the left or right side of the frame. This concept of invariance—knowing which changes don’t alter the label—is key to building robust models. ...
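
AutoAugment's twist is to stop hand-picking these transformations and instead search for an augmentation policy (which operations to apply, with what probability and magnitude), using the validation accuracy of a child model as the search reward. For contrast, here is the kind of hand-designed pipeline it aims to replace, written with standard torchvision transforms:

```python
from torchvision import transforms

# A typical hand-crafted augmentation pipeline, the kind AutoAugment replaces
# with policies learned directly from data.
manual_pipeline = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # a cat mirrored is still a cat
    transforms.RandomCrop(32, padding=4),                  # small translations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild color shifts
    transforms.ToTensor(),
])

# The learned CIFAR-10 policy from the paper ships with recent torchvision releases:
# learned_pipeline = transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10)
```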

2018-05
[Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation 🔗](https://arxiv.org/abs/2509.25849)

Knapsack RL: A Computational 'Free Lunch' for Training Smarter Language Models

Large Language Models (LLMs) have shown an extraordinary capacity to improve themselves through reinforcement learning (RL). By generating solutions, receiving feedback, and adjusting their strategy, they can learn to tackle complex problems such as advanced mathematical reasoning. This process hinges on a critical step: exploration—trying many different approaches, or “rollouts,” to discover what works. The catch? Exploration is computationally expensive. Generating thousands of possible solutions for thousands of different problems consumes massive amounts of GPU time. To keep costs manageable, current methods typically assign a small, fixed exploration budget to every problem—often 8 possible solutions per task. ...
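
The paper's framing treats the total rollout budget as a knapsack to be allocated across problems: instead of giving every task the same 8 rollouts, spend more on tasks where extra exploration is likely to produce learning signal and less on tasks the model already aces or hopelessly fails. The greedy allocator below is a toy illustration of that idea under an assumed variance-style value function, not the paper's exact objective:

```python
# Toy rollout-budget allocator in the knapsack spirit: favor problems whose estimated
# success rate is neither ~0 nor ~1, since all-fail or all-pass rollouts give a
# near-zero policy-gradient signal. The p*(1-p) score is an assumption for illustration.
def allocate_rollouts(success_rates, total_budget, min_per_task=1):
    values = [p * (1 - p) for p in success_rates]
    alloc = [min_per_task] * len(success_rates)
    for _ in range(total_budget - min_per_task * len(success_rates)):
        # Give the next rollout to the task with the best value per rollout already
        # spent, so the budget spreads out as a task accumulates samples.
        best = max(range(len(values)), key=lambda i: values[i] / alloc[i])
        alloc[best] += 1
    return alloc

# Five tasks with estimated success rates from "never solved" to "always solved".
print(allocate_rollouts([0.0, 0.1, 0.5, 0.9, 1.0], total_budget=40))
```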

2025-09
[LongCodeZip: Compress Long Context for Code Language Models 🔗](https://arxiv.org/abs/2510.00446)

LongCodeZip: Making LLMs Read Your Entire Codebase Without Breaking the Bank

Large Language Models (LLMs) are transforming software development. From autocompleting entire functions to answering complex, repository-level questions, these AI assistants are quickly becoming indispensable. But they have an Achilles’ heel: context length. When you ask an LLM to work with a large project, you often end up feeding it tens of thousands of lines of code. This long context creates a perfect storm of problems:

- Lost in the middle: models can struggle to identify relevant pieces as important tokens get buried.
- Slow generation: attention mechanisms scale quadratically, so long inputs cause latency to skyrocket.
- High costs: commercial APIs charge by the token, and long contexts quickly run up the bill.

For source code, this is particularly problematic. Unlike prose, code has intricate interdependencies. A function in one file might be essential to logic spread across dozens of others. Randomly chopping off text to fit a budget risks breaking compile-ready structure and losing critical constraints. ...
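
At a high level, budget-aware context compression works by splitting the repository into function-level chunks, scoring each chunk's relevance to the query, and keeping only the most valuable chunks that fit the token budget. The sketch below uses a crude lexical-overlap score purely for illustration; LongCodeZip itself relies on a model-based relevance measure plus a finer-grained second stage:

```python
# Toy budget-constrained context selection: chunk code by function, score each chunk
# against the query, and keep the best chunks that fit. The lexical-overlap score is
# a stand-in for illustration, not LongCodeZip's actual relevance measure.
def select_chunks(chunks, query, token_budget):
    query_terms = set(query.lower().split())
    def score(chunk):
        return len(query_terms & set(chunk.lower().split()))
    kept, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = len(chunk.split())          # crude token count
        if used + cost <= token_budget:
            kept.append(chunk)
            used += cost
    return kept

chunks = [
    "def load_config(path):\n    # read settings from disk\n    ...",
    "def tokenize(text):\n    # split text into tokens\n    ...",
    "class Cache:\n    # in-memory result cache\n    ...",
]
print(select_chunks(chunks, query="where is text split into tokens", token_budget=12))
```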

2025-10
[StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions 🔗](https://arxiv.org/abs/2510.02314)

Hiding in the Void: How StealthAttack Poisons 3D Scenes

The world of 3D graphics is undergoing a revolution. For decades, creating photorealistic 3D scenes was the domain of skilled artists using complex software. But modern techniques like Neural Radiance Fields (NeRF) and, more recently, 3D Gaussian Splatting (3DGS) have profoundly changed the game. These methods can learn a stunningly accurate 3D representation of a scene from just a handful of 2D images, enabling applications from virtual reality and digital twins to advanced visual effects. ...

2025-10
[ModernVBERT: Towards Smaller Visual Document Retrievers 🔗](https://arxiv.org/abs/2510.01149)

Small is Mighty: How ModernVBERT Redefines Visual Document Retrieval

Imagine searching for a specific chart hidden in hundreds of pages of financial reports, or trying to locate a product in a sprawling digital catalog using both its image and a brief description. In today’s increasingly multimedia world, documents are more than just text—they are rich ecosystems of words, images, layouts, charts, and tables. Traditional text-only search engines often fail to capture the meaning locked inside these visual elements, missing vital context. ...

2025-10