In the world of artificial intelligence, Large Language Models (LLMs) can seem like a form of modern alchemy. We mix massive datasets, gargantuan neural networks, and mind-boggling amounts of computation—and out comes something that can write poetry, debug code, and explain complex topics.

But why does this work? And if we had ten times the resources, how much better could we make it? Is there a method to this madness, or are we just hoping for the best?

In 2020, a team of researchers from OpenAI and Johns Hopkins University published a landmark paper, Scaling Laws for Neural Language Models, that brought remarkable clarity to this chaotic field. They discovered that language model performance isn’t random at all. Instead, it follows simple, predictable mathematical rules—specifically, power laws—that hold true across an astonishing seven orders of magnitude.

This paper provides something akin to a cheat sheet for building LLMs. It tells us how to scale our models, how much data we need, and how to best allocate a fixed computational budget. The insights are not just practical; they’re profound, suggesting that qualitative leaps in AI capability may emerge from smooth, predictable quantitative scaling.

In this article, we’ll unpack the core findings of the paper and explore the simple laws that govern one of the most complex technologies of our time.


Background: Measuring a Model’s “Goodness”

Before diving in, let’s align on the basics. The models in this study are primarily Transformers—the neural network architecture that powers nearly all modern LLMs. They are trained on a straightforward task: predicting the next word (more precisely, the next token) in a sequence of text.

How do we measure how well a model performs this task? The primary metric here is cross-entropy loss—think of it as a measure of surprise.

  • If the model strongly predicts the next word will be “sky” after “the big blue …” and the actual word is indeed “sky,” the loss is low.
  • If the actual next word is “house,” the model is surprised, and the loss is high.

Lower loss means better predictions, which correlates with stronger language understanding. The goal of training is always to drive the loss down.
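
To make the “surprise” intuition concrete, here is a minimal Python sketch (not from the paper) that computes the cross-entropy loss for a single next-token prediction; the vocabulary and probabilities are invented for illustration.

```python
import math

# Hypothetical probabilities a model might assign to the next token
# after "the big blue ..." (made-up numbers, for illustration only).
predicted_probs = {"sky": 0.70, "sea": 0.15, "car": 0.10, "house": 0.05}

def cross_entropy(probs, actual_token):
    """Cross-entropy loss (in nats) for one prediction: the negative log of
    the probability the model assigned to the token that actually occurred."""
    return -math.log(probs[actual_token])

print(cross_entropy(predicted_probs, "sky"))    # ~0.36: low surprise
print(cross_entropy(predicted_probs, "house"))  # ~3.00: high surprise
```

Averaging this quantity over every token in a held-out test set gives the test loss that the scaling laws below describe.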

The researchers trained hundreds of Transformer models, varying numerous factors:

  • the number of parameters
  • the amount of training data
  • the training duration
  • and even the model’s shape (depth vs. width).

They then measured the final test loss to see what mattered most.


The Three Pillars of Scale: Model Size, Data, and Compute

The central finding of the paper is that a language model’s performance is primarily determined by three factors:

  1. Model Size (N): The number of trainable non-embedding parameters in the network.
  2. Dataset Size (D): How many tokens the model sees during training.
  3. Compute (C): The total amount of processing power used for training.

Crucially, the relationship between loss and each of these factors follows a power law. In simple terms, as you increase N, D, or C, the loss decreases in a smooth, predictable curve. This is beautifully illustrated in the paper’s main summary figure:

Test loss decreases following a power law as compute, dataset size, and number of non-embedding parameters increase.

Figure 1: Language modeling performance improves smoothly as we increase model size, dataset size, and compute. Each factor exhibits a power-law relationship with test loss when not bottlenecked by the other two.

When plotted on a log–log scale, the results form strikingly straight lines—the signature of a power-law relationship. This predictability is powerful: it means we can train smaller models, observe their performance, and extrapolate to predict how much better a much larger model will be before training it.

The simplified versions of these scaling relationships are:

  1. Limited by Model Size (N):

    \[ L(N) = \left( \frac{N_c}{N} \right)^{\alpha_N} \]
  2. Limited by Dataset Size (D):

    \[ L(D) = \left( \frac{D_{\rm c}}{D} \right)^{\alpha_D} \]
  3. Limited by Compute (C):

    \[ L(C_{\min}) = \left( \frac{C_c^{\min}}{C_{\min}} \right)^{\alpha_C^{\min}} \]

Here, the \(\alpha\) values are the slopes of the lines in log–log space, and \(N_c\), \(D_c\), and \(C_c^{\min}\) are fitted constants whose exact values depend on the vocabulary and tokenization. The key insight: the form of the relationship is universal.
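
As a rough illustration (this is not code from the paper), the sketch below evaluates the first two laws using the approximate constants reported in the paper: \(\alpha_N \approx 0.076\) with \(N_c \approx 8.8 \times 10^{13}\) non-embedding parameters, and \(\alpha_D \approx 0.095\) with \(D_c \approx 5.4 \times 10^{13}\) tokens. The exact constants depend on the dataset and tokenizer, so treat the outputs as ballpark figures.

```python
def loss_from_model_size(n_params, alpha_n=0.076, n_c=8.8e13):
    """L(N): loss when limited only by non-embedding parameter count."""
    return (n_c / n_params) ** alpha_n

def loss_from_data_size(n_tokens, alpha_d=0.095, d_c=5.4e13):
    """L(D): loss when limited only by the number of training tokens."""
    return (d_c / n_tokens) ** alpha_d

# Extrapolation in action: predict the loss of ever-larger models (or datasets)
# before training them. Outputs are ballpark values, not exact paper numbers.
for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e} params -> predicted L(N) ~ {loss_from_model_size(n):.2f}")
for d in (1e9, 1e10, 1e11):
    print(f"D = {d:.0e} tokens -> predicted L(D) ~ {loss_from_data_size(d):.2f}")
```

Because each law is a straight line on a log–log plot, fitting its two constants to a handful of small training runs is enough to extrapolate the curve to much larger scales before anyone pays to train them.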


It’s Not the Shape — It’s the Size That Counts

One of the more surprising findings concerns what doesn’t matter much: should you build a deep, narrow Transformer or a shallow, wide one? Within reasonable limits, it makes little difference, as long as the total non-embedding parameter count \(N\) is the same.

Loss is remarkably stable across a wide range of architectural shapes for fixed parameter counts.

Figure 5: For fixed non-embedding parameter count, model performance changes by only a few percent across large variations in architecture (aspect ratios, feed-forward ratios, etc.).

This greatly simplifies model design. Instead of painstakingly tuning hyperparameters like layer count or attention head count, researchers can focus on one thing: scaling up the total parameter count.


A Crucial Detail: Count Non-Embedding Parameters

A key refinement was realizing that the “model size” \(N\) that matters should only include non-embedding parameters—those in the Transformer layers, not in the token embeddings.

A language model has:

  1. Core model: Transformer layers that process input and learn patterns.
  2. Embedding matrix: Maps each word/token to a vector representation.

Including embedding parameters in \(N\) muddles the scaling laws. Restricting \(N\) to non-embedding parameters produces cleaner, universal trends.

Including embeddings obscures trends, excluding them yields a clean power-law curve across depths.

Figure 6: Left: With embeddings included, performance depends on depth as well as size. Right: Without embeddings, trends collapse to a single curve regardless of depth.

Practical implication: reducing embedding matrix size can improve efficiency without hurting core performance.
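
To get a concrete feel for the two parameter pools, here is a small sketch (my own illustration, with invented layer sizes) that uses the paper’s rule of thumb that a standard Transformer has roughly \(12 \cdot n_{\rm layer} \cdot d_{\rm model}^2\) non-embedding parameters when the feed-forward width is \(4 \cdot d_{\rm model}\):

```python
def non_embedding_params(n_layer, d_model):
    """Approximate non-embedding parameter count (attention + feed-forward
    weights) via the 12 * n_layer * d_model^2 rule of thumb; biases and
    layer norms are ignored."""
    return 12 * n_layer * d_model ** 2

def embedding_params(vocab_size, d_model):
    """Parameters in the token-embedding matrix."""
    return vocab_size * d_model

# Hypothetical small model: 12 layers, d_model = 768, 50k-token vocabulary.
n = non_embedding_params(n_layer=12, d_model=768)
e = embedding_params(vocab_size=50_000, d_model=768)
print(f"non-embedding: {n / 1e6:.0f}M parameters, embedding: {e / 1e6:.0f}M parameters")
```

For a small model like this, the embedding matrix is a sizeable fraction of the total, which is exactly why lumping it into \(N\) muddies the trends.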


The Data Bottleneck: How Much Is Enough?

The scaling laws for model size and dataset size raise a critical question: as models grow, how much more data do we need? If dataset size stays fixed while model size increases, the model eventually overfits, and performance plateaus.

The paper presents a unified formula combining \(N\) and \(D\):

\[ L(N,D) = \left[ \left( \frac{N_c}{N} \right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D} \]

Performance follows a power law with model size for large datasets but flattens for small datasets. Overfitting grows predictably with N^α / D.

Figure 9: Left: For large D, performance scales smoothly with N. For smaller D, gains flatten out as overfitting sets in. Right: Overfitting correlates strongly with \(N^{\alpha_N / \alpha_D} / D\).

From this, the authors derive an elegant rule of thumb for avoiding data bottlenecks:

\[ D \propto N^{0.74} \]

Meaning: for every 10× increase in model size, data need only grow ~5.5× to avoid overfitting. Data requirements grow much more slowly than model size—welcome news for practitioners.
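
Here is a hedged sketch of both results, again using the approximate constants from the paper (\(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(N_c \approx 8.8 \times 10^{13}\), \(D_c \approx 5.4 \times 10^{13}\)); the 1B-parameter, 20B-token reference point in the rule-of-thumb helper is my own assumption, not a number from the paper.

```python
ALPHA_N, ALPHA_D = 0.076, 0.095   # approximate exponents from the paper
N_C, D_C = 8.8e13, 5.4e13         # approximate fitted constants

def loss(n_params, n_tokens):
    """Combined scaling law L(N, D)."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

def tokens_needed(n_params, reference_n=1e9, reference_d=20e9):
    """D proportional to N^0.74 rule of thumb, anchored at a hypothetical
    reference point (1B parameters trained on 20B tokens is an assumption,
    not a paper value)."""
    return reference_d * (n_params / reference_n) ** 0.74

# Fixed dataset, growing model: the gains shrink as overfitting sets in.
for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}, D = 10B tokens -> L ~ {loss(n, 10e9):.2f}")

print(f"A 10x larger model needs ~{tokens_needed(1e10) / 20e9:.1f}x the data")
```

With the dataset held at 10B tokens, each 10× jump in model size buys a smaller improvement than the last, which is the overfitting plateau visible in the figure.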


The Efficiency Puzzle: Bigger Models Learn Faster

Another key finding concerns sample efficiency—how quickly a model learns from data. Larger models are substantially more sample-efficient.

Larger models achieve lower loss with fewer tokens processed and less compute than smaller ones.

Figure 2: Bigger models reach low loss faster—both in tokens processed and total compute—than smaller models, making them far more sample-efficient.

A billion-parameter model might reach a loss of 4.0 after just a few billion tokens; a tiny model might never reach that level no matter how long it trains.


The Optimal Strategy: Train Giant Models, Stop Early

Given a fixed compute budget \(C\), how should we spend it?

Options:

  • Train a small model to convergence.
  • Train a medium model for a moderate time.
  • Train a huge model for a short period.

The scaling laws reveal a counter-intuitive truth: it’s optimal to train the largest model you can afford and stop far short of full convergence.

Optimal allocation of a billion-fold compute increase: mostly to model size, less to batch size/steps.

Figure 3: Most extra compute should go to larger models. A modest data/batch size increase suffices, and training steps barely increase.

Under optimal allocation:

\[ N \propto C^{0.73}, \quad B \propto C^{0.24}, \quad S \propto C^{0.03} \]

where \(B\) is the batch size and \(S\) is the number of training steps.

Optimal model size grows rapidly with compute; optimal steps barely increase.

Figure 14: Left: Optimal model size grows ~5× for each 10× compute increase. Right: Optimal number of steps grows negligibly.

This means as compute budgets grow, almost all extra capacity should be spent on bigger models—not longer training.
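
As a purely illustrative sketch, the snippet below scales a hypothetical baseline run according to these exponents; only the exponents come from the paper, while the baseline of 1B parameters, 1M-token batches, and 100k steps is invented.

```python
def scale_allocation(compute_multiplier, base_n=1e9, base_batch=1e6, base_steps=1e5):
    """Split a compute increase using the paper's compute-efficient exponents:
    N ~ C^0.73, batch size ~ C^0.24, training steps ~ C^0.03.
    The baseline configuration here is hypothetical."""
    return {
        "parameters": base_n * compute_multiplier ** 0.73,
        "batch_tokens": base_batch * compute_multiplier ** 0.24,
        "steps": base_steps * compute_multiplier ** 0.03,
    }

# With 100x more compute, nearly all of the increase goes into model size.
for name, value in scale_allocation(100).items():
    print(f"{name}: {value:.2e}")
```

With 100× more compute, the optimal model comes out roughly 29× larger while the number of training steps barely moves, hence “train giant models, stop early.”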


Peering into the Future: A Contradiction and a Conjecture

Power laws are consistent across scales—but they can’t continue forever. Language has inherent entropy, so loss can never be zero; curves must flatten eventually.

The authors extrapolate and find a contradiction: compute-efficient training predicts performance improvements that eventually exceed what should be possible given the slow growth in data usage (\(D \propto C^{0.27}\)).

Projected intersection where compute scaling surpasses data-limited performance.

Figure 15: Intersection where compute-based scaling surpasses what’s possible given dataset growth. Beyond this, the scaling laws must break down.

This crossover—around \(10^{12}\) parameters—may mark the scale at which Transformers extract most of the predictive information from text. The loss at this point (~1.7 nats/token) could approximate the irreducible entropy of natural language.


Conclusion: A Blueprint for AI’s Future

The Scaling Laws for Neural Language Models paper transformed the field from craft into science. It replaced guesswork with predictable, empirical laws governing language model performance at scale.

Key takeaways:

  1. Performance is Predictable: Model loss scales as a power law with size, dataset, and compute—enabling forecasts for future models.
  2. Scale Trumps Shape: Non-embedding parameter count drives performance. Depth/width choices matter little within ranges.
  3. Bigger Is Better (and Faster): Large models learn more from the same data.
  4. Train Large, Stop Early: Maximize size for your compute budget, even if you can’t fully converge.

These principles underpinned the design of models like GPT-3 and beyond. They give us confidence that pushing the boundaries of scale will yield more capable systems.

The smooth, quantitative gains these laws describe may conceal spectacular qualitative leaps in capability—a phenomenon aptly captured by physicist P.W. Anderson: “More is different.” This paper gives us the map to explore just how different things can get.