Shrinking Giants: Can We Predict How Neural Network Pruning Will Behave?
Modern neural networks are giants. Models like GPT‑3 and Stable Diffusion have revolutionized what’s possible with AI—but their immense size comes at a cost. They require massive computational power to train and deploy, making them inaccessible for many applications and environmentally expensive.
One of the most popular solutions is neural network pruning: systematically removing parts of a trained network to make it smaller and faster without sacrificing too much accuracy. Over the past decade, researchers have produced more than eighty pruning techniques. Yet despite this progress, we still lack a fundamental understanding of how pruning behaves across architectures and scales.
Imagine you’re a machine learning engineer who must deploy a model with less than 10% error on a smartphone. You have a family of architectures to choose from—like ResNets of varying depths and widths. Which should you pick? And how much should you prune it?
The brute‑force option is to train and prune every possible combination, but that would take months and immense computing resources. To be smarter, we must assume that pruning’s effects follow some predictable structure. If such a structure exists, we could use it to pinpoint the optimal pruned model analytically rather than by exhaustive search.
That leads directly to the question studied in the paper “On the Predictability of Pruning Across Scales” from MIT CSAIL. The authors demonstrate that for a standard pruning technique, the relationship between a network’s size, pruning level, and final error follows a surprisingly simple and predictable mathematical law. This scaling law provides deep insight into the behavior of pruning and gives a practical framework for reasoning about model efficiency.
In this article, we’ll explore how the authors uncovered these hidden rules—and what they mean for building efficient deep learning systems.
Background: What Is Iterative Magnitude Pruning?
Before turning to the scaling laws, let’s review the specific pruning technique used: Iterative Magnitude Pruning (IMP). It’s a powerful, widely adopted method built on a straightforward idea: treat a weight’s magnitude as a proxy for its importance, and remove the smallest-magnitude weights first.
The “iterative” part means pruning and retraining occur repeatedly rather than just once. The procedure looks like this:
- Train a network to completion.
- Prune: Remove a fixed percentage (e.g., 20%) of weights with the smallest magnitudes across all layers.
- Rewind: Reset the remaining weights to their values at an earlier training epoch (such as epoch 10). This “weight rewinding” step is more effective than simply fine‑tuning after pruning.
- Retrain: Train again from the rewound state with pruned weights permanently masked out.
- Repeat: Continue the prune–rewind–retrain cycle until achieving the desired sparsity.
Each iteration produces a model of a different density—the fraction of weights that remain. A density of 1.0 corresponds to the unpruned network; a density of 0.01 means that 99% of the weights are gone.
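To make the loop concrete, here is a minimal PyTorch-style sketch of IMP with weight rewinding. It is an illustration of the steps above, not the authors’ implementation: `train_to_completion` is a hypothetical helper assumed to train the masked model to completion and return a snapshot of the weights as they were at the rewind epoch.

```python
import torch

def iterative_magnitude_pruning(model, train_to_completion,
                                rewind_epoch=10, prune_frac=0.2, rounds=20):
    """Sketch of IMP with weight rewinding (illustrative, not the paper's code).

    train_to_completion(model, mask, rewind_epoch) is assumed to train `model`
    while keeping weights with mask == 0 fixed at zero, and to return a copy of
    the weights as they were at `rewind_epoch`.
    """
    mask = {name: torch.ones_like(p) for name, p in model.named_parameters()}

    for _ in range(rounds):
        # 1. Train to completion, remembering the weights at the rewind epoch.
        rewind_state = train_to_completion(model, mask, rewind_epoch)

        # 2. Prune: drop the smallest-magnitude surviving weights, globally.
        scores = torch.cat([(p.abs() * mask[n]).flatten()
                            for n, p in model.named_parameters()])
        surviving = scores[scores > 0]
        k = max(1, int(prune_frac * surviving.numel()))
        threshold = torch.kthvalue(surviving, k).values
        for n, p in model.named_parameters():
            mask[n] = torch.where(p.abs() <= threshold,
                                  torch.zeros_like(p), mask[n])

        # 3. Rewind survivors to their early-training values (pruned weights stay zero).
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.copy_(rewind_state[n] * mask[n])
        # 4. The next loop iteration retrains from this rewound, masked state.

    density = (sum(m.sum() for m in mask.values())
               / sum(m.numel() for m in mask.values()))
    return model, mask, float(density)
```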
The MIT team applied IMP across entire families of architectures (primarily ResNets) on datasets such as CIFAR‑10 and ImageNet, systematically varying four key dimensions:
- Depth (l): number of layers.
- Width (w): scaling factor for channels per layer.
- Dataset size (n): number of training examples.
- Density (d): fraction of weights remaining.
Their goal: find a unified formula that predicts the test error, \(\varepsilon(d, l, w, n)\), for any combination of those four variables.
Modeling the Error of a Single Pruned Network
Before approaching entire network families, the authors first asked: How does one network’s error change as we prune it?
They plotted test error versus density on logarithmic axes. The result—shown in the figure below—revealed a remarkably consistent three‑phase pattern.

Figure 1. Relationship between density and error for CIFAR‑10 ResNets. The curves reveal three distinct regimes: the low‑error plateau, the power‑law region, and the high‑error plateau.
Let’s unpack the three regions visible in the center panel:
Low‑Error Plateau. When the network remains dense, its error is nearly identical to the unpruned model’s error \(\varepsilon_{np}\). Small amounts of pruning barely affect performance.
Power‑Law Region. As pruning intensifies, error begins to rise—and on the log‑log plot, this rise is linear. A straight line on such axes indicates a power‑law relationship:
\[ \varepsilon(d)\approx c\,d^{-\gamma}, \]
where the exponent \(\gamma\) sets the slope of the line (taking logs of both sides gives \(\log\varepsilon \approx \log c - \gamma\log d\), a straight line in \(\log d\)). Rather than chaos, pruning follows a consistent mathematical pattern.
High‑Error Plateau. Extreme pruning eventually cripples the network, flattening its error near a maximum value \(\varepsilon^{\uparrow}\). Beyond this, the network can no longer learn.
This consistent shape inspired a functional approximation from the rational family, which neatly captures transitions between different power‑law behaviors:
\[ \hat{\varepsilon}(\varepsilon_{np}, d \mid l, w, n) = \varepsilon_{np}\left\| \frac{d - j p \left(\frac{\varepsilon^{\uparrow}}{\varepsilon_{np}}\right)^{1/\gamma}} {d - j p} \right\|^{\gamma},\quad j=\sqrt{-1}. \]
Here,
- \(\varepsilon_{np}\) and \(\varepsilon^{\uparrow}\) define the plateaus,
- \(\gamma\) governs the slope in the power‑law region,
- \(p\) controls where the transition occurs.
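Despite the imaginary unit \(j\), the expression is real-valued: the complex norm is simply a convenient way to interpolate smoothly between the two plateaus. A minimal NumPy sketch of the formula makes the limiting behavior easy to check:

```python
import numpy as np

def predicted_error(d, eps_np, eps_up, gamma, p):
    """Rational-family approximation of a single network's pruned error.

    d       : density (fraction of weights remaining)
    eps_np  : error of the unpruned network      -> low-error plateau
    eps_up  : error of a fully crippled network  -> high-error plateau
    gamma   : slope of the power-law region on log-log axes
    p       : controls where the transition to the power law occurs
    """
    j = 1j  # imaginary unit; the absolute value below keeps the result real
    num = d - j * p * (eps_up / eps_np) ** (1.0 / gamma)
    den = d - j * p
    return eps_np * np.abs(num / den) ** gamma

# Sanity checks on the limiting behaviour (no measured data involved):
#   d well above the transition: |num/den| -> 1, so error -> eps_np  (low-error plateau)
#   d -> 0: |num/den| -> (eps_up/eps_np)^(1/gamma), so error -> eps_up  (high-error plateau)
```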
After fitting this function to experimental data for each network configuration, the predictions matched almost perfectly.

Figure 2. Model fit for CIFAR‑10 ResNets. Across thousands of pruned networks, the mean relative deviation between predicted and measured error is below 2%.
With this, the authors achieved a precise mathematical description of how any single IMP‑pruned network behaves.
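Estimating the four parameters from a handful of (density, error) pairs is an ordinary nonlinear least-squares problem. Below is a hedged sketch using SciPy’s `curve_fit` together with the `predicted_error` function from the previous snippet; the “measurements” are synthetic, generated from the formula itself with made-up parameter values, purely to exercise the fitting code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic measurements: generated from predicted_error with arbitrary,
# illustrative parameters plus noise (not data from the paper).
rng = np.random.default_rng(0)
true_params = dict(eps_np=0.08, eps_up=0.9, gamma=1.5, p=0.02)
densities = 0.8 ** np.arange(25)                         # 20% pruned per IMP round
errors = predicted_error(densities, **true_params)
errors *= 1 + 0.02 * rng.standard_normal(errors.shape)   # ~2% measurement noise

# Recover eps_np, eps_up, gamma, p from the noisy curve.
popt, _ = curve_fit(
    predicted_error, densities, errors,
    p0=[errors[0], errors[-1], 1.0, 0.05],   # rough initial guesses
    bounds=(1e-6, [1.0, 1.0, 10.0, 1.0]),
)
print(dict(zip(["eps_np", "eps_up", "gamma", "p"], popt)))
```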
The Joint Scaling Law and the Error‑Preserving Invariant
Next came a more ambitious goal: a single joint scaling law spanning every depth, width, dataset size, and density. To unify these variables, the authors searched for an underlying invariant.
By plotting contour maps of constant error as functions of density and architecture dimensions, they discovered straight‑line contours on log‑log axes—evidence of a new power‑law trade‑off.

Figure 3. Contours of constant error for CIFAR‑10 ResNets. Straight contours indicate interchangeable relationships among depth, width, and pruning density.
These patterns led to an error‑preserving invariant:
\[ m^{*} = l^{\phi}\,w^{\psi}\,d. \]
Here \(\phi\) and \(\psi\) capture the trade‑off rates between depth, width, and density. Two networks with the same \(m^{*}\)—regardless of their individual architecture or sparsity—should exhibit identical error. A deep, narrow, sparse network can mirror the performance of a shallow, wide, dense one as long as their values of \(m^{*}\) match.
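The invariant is easy to experiment with numerically. The exponents below are purely illustrative placeholders (the paper fits \(\phi\) and \(\psi\) per network family); the point is only that different depth/width/density combinations can share the same \(m^{*}\) and therefore, according to the law, the same predicted error.

```python
def m_star(l, w, d, phi, psi):
    """Error-preserving invariant m* = l**phi * w**psi * d."""
    return l**phi * w**psi * d

# Illustrative exponents only (not the values fitted in the paper).
phi, psi = 0.85, 1.4

# A deep, sparse network and a shallower, denser one with the same m*:
deep_sparse    = m_star(l=56, w=1.0, d=0.10, phi=phi, psi=psi)
shallow_denser = m_star(l=20, w=1.0, d=0.10 * (56 / 20) ** phi, phi=phi, psi=psi)
print(deep_sparse, shallow_denser)  # equal by construction -> same predicted error
```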
Substituting this invariant into the earlier single‑network formula yielded the joint scaling law:
\[ \hat{\varepsilon}(l, w, n, d) = \varepsilon_{np}\left\| \frac{l^{\phi}w^{\psi}d - j\,p' \left(\frac{\varepsilon^{\uparrow}}{\varepsilon_{np}}\right)^{1/\gamma}} {l^{\phi}w^{\psi}d - j\,p'} \right\|^{\gamma},\quad j=\sqrt{-1}, \]
where \(\varepsilon_{np} = \varepsilon_{np}(l, w, n)\) is the unpruned error of the corresponding configuration.
Equation 2. The unified scaling law expresses error across depth, width, dataset size, and density using the invariant \(m^{*}\).
In this expression, the parameters \(\varepsilon^{\uparrow}\), \(p'\), \(\gamma\), \(\phi\), and \(\psi\) are constants for the entire family (for example, all CIFAR‑10 ResNets). Only \(\varepsilon_{np}\) changes per individual network.
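Under the same assumptions, the joint law is a one-line extension of the earlier single-network sketch: evaluate it at the invariant \(m^{*}\) instead of the raw density, with \(p'\) taking the role that \(p\) played before and \(\varepsilon_{np}\) supplied per configuration.

```python
def joint_predicted_error(l, w, d, eps_np, eps_up, gamma, p_prime, phi, psi):
    """Joint scaling law sketch: the single-network form evaluated at m*.

    eps_np is the unpruned error of this particular (depth, width, dataset size)
    configuration; eps_up, gamma, p_prime, phi, psi are shared across the family.
    """
    return predicted_error(m_star(l, w, d, phi, psi),
                           eps_np, eps_up, gamma, p_prime)
```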
Empirical evidence strongly supports this assumption. When error is plotted against \(m^{*}\), curves for different architectures and dataset sizes align almost perfectly.

Figure 4. Error versus invariant \(m^{*}\) for varying widths, depths, and dataset sizes—all follow a similar shape, validating the constancy of core parameters.
How Accurate Is the Joint Scaling Law?
The researchers fit their unified formula jointly across thousands of data points for CIFAR‑10 and ImageNet ResNets. The resulting predictions are strikingly accurate.

Figure 5. Predicted vs. actual test error for all CIFAR‑10 and ImageNet configurations. The mean deviation is under 2%, comparable to natural variation across random seeds.
Across 4,301 CIFAR‑10 configurations and 274 ImageNet configurations, the mean relative error was less than 2%, with typical variance similar to training noise itself. Five parameters suffice to describe pruning’s behavior over orders of magnitude in size and sparsity.
How Much Data Do We Need to Fit It?
Although the full evaluation used thousands of networks, in practical scenarios we rarely have that luxury. Fortunately, the scaling law is remarkably data‑efficient.
The authors randomly sampled increasingly small subsets of their data and refit the model each time.

Figure 6. Accuracy of the fit versus number of training points. Only a handful of pruned configurations are needed for stable parameter estimation.
With as few as 15 network configurations pruned across all densities, the fitted parameters stabilized and produced accurate predictions. In other words, engineers can run just a few inexpensive experiments on small models to extrapolate pruning behavior for much larger systems.
Using the Scaling Law: Finding the Most Efficient Model
Now we’re ready to solve the original design question analytically.
Given a family of networks, which should we prune—and by how much—to get the smallest possible model for a target accuracy?
Using the scaling law, this becomes a straightforward optimization:
\[ \min_{l,w,d}\;l\,w^{2}\,d \quad\text{s.t.}\quad \varepsilon_{np}\left\| \frac{l^{\phi}w^{\psi}d - j\,p' \left(\frac{\varepsilon^{\uparrow}}{\varepsilon_{np}}\right)^{1/\gamma}} {l^{\phi}w^{\psi}d - j\,p'} \right\|^{\gamma} = \varepsilon_{k}. \]
Figure 7. Optimization formulation for finding minimal‑parameter networks under an error constraint.
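In practice this constrained problem can be handed to an off-the-shelf solver or, since candidate depths and widths usually form a small discrete set, solved by brute-force evaluation of the fitted law. Below is a sketch of the latter, reusing `joint_predicted_error` from the earlier snippet; `eps_np_of` and all parameter values are assumed inputs, not quantities from the paper.

```python
def smallest_model(target_error, depths, widths, densities,
                   eps_np_of, eps_up, gamma, p_prime, phi, psi):
    """Grid search for the pruned network with the fewest parameters whose
    predicted error is at most `target_error`.

    eps_np_of(l, w) is assumed to return the unpruned error of configuration
    (l, w), e.g. from a lookup of measured values or a separate fit.
    Parameter count is approximated as l * w**2 * d, matching the objective above.
    """
    best = None
    for l in depths:
        for w in widths:
            eps_np = eps_np_of(l, w)
            for d in densities:
                err = joint_predicted_error(l, w, d, eps_np, eps_up,
                                            gamma, p_prime, phi, psi)
                size = l * w**2 * d
                if err <= target_error and (best is None or size < best[0]):
                    best = (size, l, w, d, err)
    return best  # (approx. parameter count, depth, width, density, predicted error)
```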
Solving this reveals a counterintuitive insight: starting with a large, accurate network and pruning until its error increases to the target value often yields a smaller model than starting with a smaller, low‑error network.

Figure 8. Optimal pruning strategy for CIFAR‑10 ResNets. The dotted line marks the efficient frontier—the minimal parameter count for each achievable error.
On the plots above, the black dotted line is the efficient frontier. Notice how the flat low‑error plateaus of individual networks never intersect it. Only by pruning larger models into their power‑law region—where error begins to rise—can we reach the frontier. Starting too small prevents reaching optimal efficiency; starting excessively large overshoots the optimum.
This conclusion provides a principled foundation for “grow‑and‑prune” strategies that engineers previously approached empirically.
Key Takeaways
The study “On the Predictability of Pruning Across Scales” turns pruning from heuristic practice into predictive science. Its discoveries reveal deep structure in how neural networks shed parameters.
- Pruning Has Order. IMP‑pruned networks exhibit a consistent three‑phase error curve—low‑error plateau, power‑law region, high‑error plateau.
- An Invariant Governs Trade‑offs. Depth, width, and sparsity can be exchanged while preserving error through the invariant \(m^{*}=l^{\phi}w^{\psi}d\).
- A Simple Law Predicts Performance. A five‑parameter scaling law models error accurately across families of architectures and datasets.
- Practical Advantages. Once fitted using a handful of small experiments, the law enables fast analytic reasoning about pruning and model efficiency.
As models continue to expand, understanding how to shrink them intelligently will be critical for sustainable AI. Scaling laws like this one illuminate the hidden rules guiding that process—offering a roadmap to build leaner, faster, and equally capable neural networks.