The Mathematical Paradox of LLM Scaling: How Exponential Success Creates Power Laws
In the fast-paced world of Artificial Intelligence, “scaling” is the magic word. We usually talk about scaling in terms of training—adding more parameters to the model or throwing more data at it. But recently, a new frontier has opened up: inference-time compute scaling.
The idea is simple but profound: what if, instead of making the model bigger, we just let it “think” longer? Or, more specifically, what if we let it try a problem multiple times?
Recent research has uncovered a fascinating and somewhat mysterious phenomenon. When you let a Large Language Model (LLM) attempt a task multiple times (a technique often called “Best-of-N”), the average success rate improves following a very specific mathematical pattern: a power law.
But here lies a paradox. Basic probability theory suggests that for any single problem, your chance of success should improve exponentially with more attempts, not as a power law. Exponential growth is much faster than power-law growth. So, why does the aggregate performance of these “Large Language Monkeys” slow down to a power law?
In this post, we will take a deep dive into the research paper “How Do Large Language Monkeys Get Their Power (Laws)?”. We will unravel this mathematical puzzle, explore the statistical distributions hiding inside your favorite LLMs, and learn how to predict model performance using a fraction of the compute usually required.
1. The Observation: Power Laws Everywhere
First, let’s establish the context. We are looking at a specific setup known as repeated sampling.
Imagine you give an LLM a math problem. You ask it to generate a solution. It might get it wrong. You ask it again. It might get it wrong again. You ask it 100 times. If any one of those 100 attempts is correct, we consider the problem “solved.”
Researchers Brown et al. (2024) and Hughes et al. (2024) ran this experiment across massive datasets—mathematical problem solving (MATH benchmark) and “jailbreaking” attempts (trying to bypass safety filters). They defined the success rate as pass@k: the probability that at least one attempt out of \(k\) is successful.

When they plotted the negative log of the average success rate against the number of attempts (\(k\)), they found a striking linear relationship on a log-log plot. In physics and statistics, a straight line on a log-log plot is the signature of a power law.
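In symbols, writing \(\text{pass}_{\mathcal{D}}@k\) for the success rate averaged over an entire benchmark \(\mathcal{D}\), that straight line corresponds to a relationship of the form
\[
-\log\big(\text{pass}_{\mathcal{D}}@k\big) \;\approx\; c \cdot k^{-b},
\]
where \(c > 0\) and \(b > 0\) are constants fit to the data, and \(b\) (the scaling exponent) sets the slope of the line on the log-log plot.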

This relationship held true across different model sizes (from small Pythia models to massive GPT-4 class models) and different domains (math and jailbreaking security).

As shown in Figure 1 above, whether it’s solving calculus or finding a security loophole, the error rate drops predictably as \(k\) increases. This regularity is beautiful, but for a mathematician, it’s also deeply disturbing. Let’s find out why.
2. The Paradox: Why Not Exponential?
To understand why a power law is surprising, we have to look at what is happening at the level of a single problem.
Let’s say we have a specific math problem, Problem \(i\). The model has a fixed probability of getting this specific problem right on a single try. Let’s call this probability \(\text{pass}_i@1\).
If the model is just rolling the dice every time it generates an answer (assuming independent attempts), the math for calculating the success after \(k\) attempts is straightforward. The only way to fail after \(k\) attempts is to fail every single time.

If the probability of success on one try is \(p\), the probability of failure is \((1-p)\). The probability of failing \(k\) times in a row is \((1-p)^k\).
\[
\Pr[\text{all } k \text{ attempts fail}] = (1-p)^k
\quad\Longrightarrow\quad
\text{pass}_i@k = 1 - \big(1 - \text{pass}_i@1\big)^k
\]
This formula, \((1-p)^k\), describes exponential decay of the failure rate. Exponential decay is incredibly fast. If you have a 50% chance of success (\(p=0.5\)), after just 10 attempts, your chance of failing every single one is \(0.5^{10} \approx 0.001\) (about 0.1%).
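To make this concrete, here is a minimal Python sketch of the per-problem calculation; the \(\text{pass}_i@1\) values below are made up purely for illustration.

```python
# Per-problem scaling: the failure rate after k independent attempts is (1 - p)^k,
# so pass_i@k = 1 - (1 - p)^k. The probabilities below are illustrative only.
single_attempt_rates = [0.9, 0.5, 0.1, 1e-5]  # easy -> nearly impossible

for p in single_attempt_rates:
    for k in (1, 10, 100, 1000):
        pass_at_k = 1 - (1 - p) ** k  # chance that at least one attempt succeeds
        print(f"pass@1={p:<8g}  k={k:<5d}  pass@k={pass_at_k:.6f}")
```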
The researchers verified this empirically. They looked at individual problems and plotted the success rate as \(k\) increased.

In Figure 3, the colored lines represent individual problems. Notice how they curve downward rapidly? That is exponential scaling.
The Paradox:
- Individually, every single problem scales exponentially (fast).
- In Aggregate (averaged over the whole dataset), performance scales as a power law (slow).
How can averaging a bunch of fast exponential functions produce a slow power-law function?
3. The Resolution: The Heavy Tail
The answer lies in the distribution of difficulty.
Not all problems are created equal. Some are easy (high single-attempt success probability, say \(p=0.9\)). Some are hard (\(p=0.1\)). And some are nearly impossible (\(p=0.00001\)).
To calculate the aggregate success rate for a benchmark (like the MATH dataset), we essentially average the pass@k of every problem in that dataset. Mathematically, this is an expectation over the distribution \(\mathcal{D}\) of single-attempt probabilities.
\[
\text{pass}_{\mathcal{D}}@k
\;=\;
\mathbb{E}_{\text{pass}_i@1 \,\sim\, \mathcal{D}}\!\left[\text{pass}_i@k\right]
\;=\;
\int_0^1 \Big(1 - (1-p)^k\Big)\, p_{\mathcal{D}}(p)\, dp
\]
This integral is the key. Equivalently, the aggregate failure rate \(1 - \text{pass}_{\mathcal{D}}@k\) is an average of the exponential functions \((1-p)^k\), each weighted by how common that value of \(p\) is in the dataset, as captured by the density \(p_{\mathcal{D}}(\text{pass}_i@1)\).
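To see the mechanism numerically, here is a small Monte Carlo sketch of that expectation. The \(\text{pass}_i@1\) values are drawn from a hypothetical Beta distribution with a heavy left tail, a stand-in for real benchmark data rather than anything fitted to the paper’s results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed difficulty distribution: alpha < 1 piles mass near p = 0.
# These parameters are illustrative, not fitted to any real benchmark.
alpha, beta = 0.3, 3.0
pass_at_1 = rng.beta(alpha, beta, size=100_000)

for k in (1, 10, 100, 1_000, 10_000):
    # Aggregate success rate: average over problems of 1 - (1 - p)^k.
    agg_pass_at_k = np.mean(1.0 - (1.0 - pass_at_1) ** k)
    print(f"k={k:<6d}  -log(pass_D@k) = {-np.log(agg_pass_at_k):.4f}")
```

For large \(k\), each tenfold increase in \(k\) shrinks \(-\log(\text{pass}_{\mathcal{D}}@k)\) by roughly the same factor (about \(10^{-\alpha}\)), which is exactly the straight-line, power-law behavior on a log-log plot, even though every individual problem decays exponentially.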
Visualizing the Mechanism
The authors provide a brilliant schematic to visualize this.

Look at Figure 2:
- Left: The aggregate power law we observe.
- Center: The individual problems scaling exponentially.
- Right: The distribution of single-attempt probabilities.
The “magic” happens in the right panel. Notice how the curve shoots up on the left side of the x-axis? This indicates that there are a vast number of problems with extremely low success probabilities (near 0).
These “hard” problems are the bottleneck. The easy problems are solved almost instantly (within a few attempts). The medium problems are solved quickly. But the hard problems—the ones where the model has a one-in-a-million chance—take a massive number of attempts to crack.
As \(k\) increases, we “exhaust” the easy and medium problems, leaving only the hardest ones. The dominance of these hard problems slows down the aggregate improvement from exponential to polynomial (power law).
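A small simulation (again with made-up, Beta-distributed difficulties) illustrates this exhaustion effect: as \(k\) grows, the problems that remain unsolved are increasingly the ones with tiny \(\text{pass}_i@1\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative heavy-tailed difficulty distribution (not fitted to real data).
pass_at_1 = rng.beta(0.3, 3.0, size=10_000)

for k in (1, 10, 100, 1_000):
    # Simulate k independent attempts per problem; a problem stays unsolved
    # only if all k attempts fail.
    successes = rng.binomial(k, pass_at_1)
    unsolved = successes == 0
    median_p = np.median(pass_at_1[unsolved]) if unsolved.any() else float("nan")
    print(f"k={k:<5d}  unsolved={unsolved.mean():6.1%}  "
          f"median pass@1 among unsolved={median_p:.2e}")
```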
The Math of Distributions
The researchers tested different theoretical distributions to see which ones would produce a power law.
Delta Distribution: If all problems were equally hard (e.g., every problem has a 10% chance), the aggregate would be exponential. No power law.
\[
p_{\mathcal{D}}(p) = \delta(p - p^{*})
\quad\Longrightarrow\quad
1 - \text{pass}_{\mathcal{D}}@k = (1 - p^{*})^{k}
\]
Uniform Distribution: If difficulty is spread evenly (same number of easy, medium, and hard problems), the scaling becomes \(1/k\). This is a power law with exponent 1.
\[
p_{\mathcal{D}}(p) = 1 \ \text{on } [0,1]
\quad\Longrightarrow\quad
1 - \text{pass}_{\mathcal{D}}@k = \int_0^1 (1-p)^{k}\, dp = \frac{1}{k+1} \approx \frac{1}{k}
\]
Beta and Kumaraswamy Distributions: These are flexible probability distributions often used to model probabilities. If the distribution looks like \(p^{\alpha-1}\) near zero (a “heavy tail” of hard problems), it produces a power law with exponent \(\alpha\).
\[
p_{\mathcal{D}}(p) \propto p^{\alpha-1} \ \text{as } p \to 0^{+}
\quad\Longrightarrow\quad
1 - \text{pass}_{\mathcal{D}}@k \;\propto\; k^{-\alpha} \ \text{for large } k
\]
The table below summarizes these findings. The crucial takeaway is that the shape of the distribution determines the scaling law.
| Distribution of \(\text{pass}_i@1\) | Density near \(p = 0\) | Aggregate failure rate \(1 - \text{pass}_{\mathcal{D}}@k\) (large \(k\)) |
|---|---|---|
| Delta (all problems equally hard) | point mass at \(p^{*}\) | \((1-p^{*})^{k}\): exponential, no power law |
| Uniform | constant | \(\approx 1/k\): power law with exponent 1 |
| Beta / Kumaraswamy | \(\propto p^{\alpha-1}\) | \(\propto k^{-\alpha}\): power law with exponent \(\alpha\) |
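For the Beta row specifically, the claim can be checked directly with a standard Beta-function identity (this derivation is general, not specific to the paper): if \(\text{pass}_i@1 \sim \text{Beta}(\alpha, \beta)\), then
\[
1 - \text{pass}_{\mathcal{D}}@k
= \int_0^1 (1-p)^k \,\frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}\,dp
= \frac{B(\alpha,\,\beta+k)}{B(\alpha,\,\beta)}
\;\sim\; \frac{\Gamma(\alpha+\beta)}{\Gamma(\beta)}\,k^{-\alpha}
\quad\text{as } k \to \infty,
\]
so the tail exponent \(\alpha\) of the difficulty distribution becomes the power-law exponent of the aggregate failure rate.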
Theorem: The Power Law Equivalence
The authors formalized this intuition into a rigorous theorem. They proved that aggregate performance follows a power law \(k^{-b}\) if and only if the density of single-attempt success rates behaves like \(p^{b-1}\) near zero.
\[
-\log\big(\text{pass}_{\mathcal{D}}@k\big) \;\sim\; c\,k^{-b}
\quad\Longleftrightarrow\quad
p_{\mathcal{D}}(p) \;\sim\; c'\,p^{\,b-1} \;\;\text{as } p \to 0^{+}
\]
This means that if we know the distribution of problem difficulties (specifically the left tail of hard problems), we can predict exactly how the model will scale with more compute.
4. Empirical Evidence: Do Models Actually Behave This Way?
Theory is great, but does it match reality? The researchers took the data from Pythia (on MATH) and frontier models (on jailbreaking) and fitted distributions to their single-attempt success rates (\(\text{pass}_i@1\)).

In Figure 4, look at the histograms. The x-axis is the single-attempt success probability (log scale).
- Top (Pythia on MATH): Notice the massive spike on the left side. Most problems are very hard for these small models. This heavy tail corresponds perfectly to the power law scaling observed.
- Bottom (Jailbreaking): Most models show a similar curve.
The Case of Llama 3 8B IT
There is a fascinating outlier in Figure 4 (bottom right, pink curve): Llama 3 8B IT. Notice that its curve dives down on the left side. It lacks the heavy tail of hard problems.
According to the theory, if there is no heavy tail of hard problems, there should be no power law. And indeed, looking back at Figure 1 (Bottom), the pink line for Llama 3 8B IT drops off like a cliff—it scales much faster than a power law!
This validates the theory: the power law isn’t a magical property of neural networks; it’s a statistical artifact of the mix of problem difficulties in the dataset.
5. A New Tool: The Distributional Estimator
Why does this matter for engineers and researchers? It comes down to efficiency.
To measure the scaling law using the traditional method (the “Least Squares” method), you need to run the model thousands of times per problem (\(k=10,000\)) to see the trend. This is incredibly expensive in terms of compute and money.
However, the authors’ discovery provides a shortcut. Since the scaling law depends entirely on the distribution of single-attempt probabilities (\(\text{pass}_i@1\)), we can just estimate that distribution!
The Two Methods Compared
Method A: Standard Least Squares (The Hard Way)
- Run the model 10,000 times per problem.
- Calculate the pass rate at \(k=1, 10, 100, 1000...\)
- Fit a line to the log-log plot.
Method B: Distributional Estimator (The Smart Way)
- Run the model a small number of times (e.g., 100).
- Estimate the single-attempt probability for each problem.
- Fit a Beta or Kumaraswamy distribution to these probabilities.
- Analytically calculate what the scaling law will be based on the distribution’s parameters.

The researchers derived a closed-form expression for this estimator in terms of the fitted distribution’s parameters.

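As a rough illustration of the workflow (a simplified sketch, not the paper’s exact estimator), one can fit a Beta distribution to per-problem \(\text{pass}_i@1\) estimates obtained from a modest number of samples; the fitted tail parameter \(\hat{\alpha}\) then predicts the power-law exponent, and the Beta-function identity from Section 3 gives the full predicted curve. All data below is synthetic.

```python
import numpy as np
from scipy import special, stats

rng = np.random.default_rng(0)

# Pretend benchmark: true per-problem pass@1 values (synthetic, illustrative only).
true_p = rng.beta(0.3, 3.0, size=500)

# Cheap measurement: only n attempts per problem instead of ~10,000.
n = 100
successes = rng.binomial(n, true_p)
p_hat = (successes + 0.5) / (n + 1.0)  # smoothed so estimates stay strictly inside (0, 1)

# Fit a Beta distribution to the estimated single-attempt probabilities.
alpha_hat, beta_hat, _, _ = stats.beta.fit(p_hat, floc=0, fscale=1)
print(f"predicted power-law exponent b ~= alpha_hat = {alpha_hat:.3f}")

# Predicted aggregate curve via the identity 1 - pass_D@k = B(a, b + k) / B(a, b).
for k in (10, 100, 1_000, 10_000):
    log_fail = special.betaln(alpha_hat, beta_hat + k) - special.betaln(alpha_hat, beta_hat)
    predicted = -np.log1p(-np.exp(log_fail))  # -log(pass_D@k)
    print(f"k={k:<6d}  predicted -log(pass_D@k) = {predicted:.4f}")
```

With so few samples per problem, the very smallest \(\text{pass}_i@1\) values are measured imprecisely, so the fit should be read as an approximation; the paper’s backtests (Figure 7 below) quantify how accurate this style of estimator actually is.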
Results: Accuracy with Less Compute
Does the smart way work? The researchers compared the exponents (\(b\)) estimated by both methods.

As shown in Figure 6, the estimates line up almost perfectly along the diagonal. The Distributional Estimator recovers the same scaling law as the brute-force method.
But the real victory is in sample efficiency. In Figure 7 below, the authors perform a “backtest.” They simulate having fewer samples per problem.

The pink line (Least Squares) has high error when the number of samples is low. The green and blue lines (Distributional Estimators) achieve low error orders of magnitude faster. This implies that researchers can forecast how a model will scale with 2-4 orders of magnitude less inference compute.
6. Conclusion and The “Dark Matter” of Scaling
This research demystifies one of the most intriguing behaviors of Large Language Models. “Large Language Monkeys” don’t get their power laws from magic; they get them from the diversity of the problems they face.
- Individual problems represent exponential struggles.
- The dataset represents a mix of difficulties.
- The Power Law is simply the mathematical result of averaging exponential successes over a heavy-tailed distribution of difficulty.
The implications extend beyond just saving money on benchmarks. The authors conclude with a provocative thought about pre-training scaling laws (the famous laws that say more parameters/data = lower loss).
They suggest that pre-training loss might behave similarly. Perhaps the overall “loss” is just the sum of many different capabilities being learned at different exponential rates.

In this view, the “Power Law” is just the dominant term that remains after the easy things are learned and before the impossible things are tackled. The authors poetically refer to the non-dominant terms as the “dark matter of neural scaling laws”—functions that are hiding in the data, decaying at different rates, waiting to be discovered as we push compute further.
By understanding the statistics of success, we move from treating LLMs as black boxes to understanding them as predictable statistical engines—a crucial step for building reliable AI systems.