Modern neural networks are behemoths. Models like GPT-3 contain hundreds of billions of parameters, demanding enormous datasets and staggering computational resources. It’s widely acknowledged in the deep learning community that these networks are overparameterized—they have far more connections than they truly need to solve their tasks.
For years, researchers have used a technique called pruning to trim huge networks after training. By removing up to 90% of the weights—generally those with the smallest magnitudes—they create compact networks that run faster and take up far less storage, with little to no loss in accuracy. This optimization is invaluable for deploying models on mobile devices or embedded hardware.
But that success raises a perplexing question: if models can perform well with only 10–20% of their original parameters, why don’t we just train these smaller networks from the start? In principle, this could save immense time and energy. The frustrating reality is that training sparse networks from scratch doesn’t work well—they tend to learn more slowly and end up with worse accuracy than their dense counterparts.
This puzzle inspired a seminal 2019 paper from MIT’s Jonathan Frankle and Michael Carbin: “The Lottery Ticket Hypothesis.” Their insight was revolutionary—hidden inside large, randomly initialized networks are tiny subnetworks that are already perfectly suited to learn. These special subnetworks are the winning tickets. This idea has since transformed how researchers think about initialization, optimization, and overparameterization in deep learning.
In this article, we’ll unpack the Lottery Ticket Hypothesis, explore how the authors validated it through meticulous experiments, and discuss what these findings mean for building and understanding neural networks.
The Problem: Why Sparse Networks Struggle
Before jumping to the solution, let’s visualize the problem. Modern training strategies use optimization algorithms such as Stochastic Gradient Descent (SGD) or Adam to update millions of weights. Pruning is typically applied after training—removing small weights that contribute least to performance—to produce a sparse network structure.
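As a concrete illustration, here is a minimal sketch of post-training magnitude pruning in PyTorch. The 90% pruning rate and the random weight tensor are placeholders for this article, not the paper's exact setup:

```python
import torch

def magnitude_prune(weight: torch.Tensor, prune_frac: float = 0.9) -> torch.Tensor:
    """Return a binary mask that keeps only the largest-magnitude weights."""
    k = int(prune_frac * weight.numel())                   # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return (weight.abs() > threshold).float()

w = torch.randn(300, 100)       # stand-in for a trained weight matrix
mask = magnitude_prune(w)       # 1 = keep, 0 = prune
sparse_w = w * mask             # roughly 10% of the weights survive
```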
The challenge arises when trying to train that sparse architecture anew. Take a pruned network design, apply a random initialization, and train from scratch—it usually fails to match the performance of the original model.
Frankle and Carbin illustrated this clearly. They took several network architectures, randomly removed a fraction of the weights before training to mimic the structure produced by pruning, and then measured training speed and final accuracy.

Figure 1: Early-stopping iteration (left) and test accuracy (right) for several network architectures on MNIST and CIFAR-10. Dashed lines represent randomly sampled sparse networks. Solid lines are subnetworks (“winning tickets”) identified through pruning.
As shown above, randomly sampled sparse networks (dashed lines) degrade sharply in accuracy and require far more iterations to reach their best performance. Dense networks learn quickly and achieve higher final accuracy. Clearly, sparsity alone doesn’t guarantee successful training—the missing ingredient lies elsewhere.
The Core Idea: The Lottery Ticket Hypothesis
Frankle and Carbin proposed that the key is not just the structure of which weights remain, but also their original initialization values. Some weights, by chance, start out perfectly aligned for effective learning. Pruning reveals these fortuitous configurations.
They articulated the Lottery Ticket Hypothesis:
A randomly initialized neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations.
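Stated a little more formally (a paraphrase of the paper's notation, where the dense network \(f(x; \theta_0)\) trains to accuracy \(a\) in \(j\) iterations):

\[
\exists\, m \in \{0,1\}^{|\theta|} \;\text{such that training}\; f(x;\, m \odot \theta_0) \;\text{in isolation reaches accuracy}\; a' \ge a \;\text{within}\; j' \le j \;\text{iterations, with}\; \|m\|_0 \ll |\theta|.
\]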
In simpler terms, initializing a big model is like buying millions of lottery tickets. Each possible subnetwork is one ticket. Most are losers—but a few have the lucky combination of connections and initial weights that learn exceptionally well. Training the full dense network is like running the lottery—it discovers and amplifies those winning tickets. Pruning afterward unveils which subnetworks won.
Finding the Winning Ticket
How do you identify a winning ticket inside a neural network?
Frankle and Carbin devised a four-step algorithm:
- Randomly initialize a dense network \(f(x; \theta_0)\) with parameters \(\theta_0\).
- Train it for \(j\) iterations, producing trained parameters \(\theta_j\).
- Prune \(p\%\) of the lowest-magnitude weights from \(\theta_j\), creating a binary mask \(m\) (1 = keep, 0 = prune).
- Reset the remaining parameters back to their original values in \(\theta_0\), forming the subnetwork \(f(x; m \odot \theta_0)\)—the candidate winning ticket.
This final step, “rewinding” to the original initialization, is crucial. It merges the learned sparse structure with its lucky initial values.
The authors tested two pruning strategies:
- One-shot pruning: prune once and rewind.
- Iterative pruning: repeatedly train, prune small fractions, and rewind several times.
Iterative pruning proved far more effective, yielding smaller tickets that still achieved strong accuracy.
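Putting the four steps and the iterative variant together, a minimal PyTorch sketch might look like the following. The `train_fn` callback, the five pruning rounds, and the 20% per-round rate are illustrative assumptions; a full implementation would also keep pruned weights frozen at zero during training:

```python
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train_fn, rounds: int = 5, prune_frac: float = 0.2):
    """Iterative magnitude pruning with rewinding to the original initialization."""
    theta_0 = copy.deepcopy(model.state_dict())                    # step 1: remember theta_0
    masks = {n: torch.ones_like(p, dtype=torch.bool)
             for n, p in model.named_parameters() if p.dim() > 1}  # prune weight matrices only

    for _ in range(rounds):
        train_fn(model)                                            # step 2: train for j iterations
        for n, p in model.named_parameters():                      # step 3: prune p% of the
            if n not in masks:                                     #         surviving weights
                continue
            alive = p.data[masks[n]].abs()
            k = int(prune_frac * alive.numel())
            if k > 0:
                cutoff = alive.kthvalue(k).values                  # magnitude threshold
                masks[n] &= p.data.abs() > cutoff
        model.load_state_dict(theta_0)                             # step 4: rewind to theta_0
        for n, p in model.named_parameters():
            if n in masks:
                p.data *= masks[n].float()                         # apply m ⊙ theta_0
    return model, masks
```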
Experiment 1: The First Win — LeNet on MNIST
The researchers began with LeNet, a classic fully connected network trained on the MNIST handwritten digit dataset. Using iterative pruning, they extracted subnetworks with decreasing fractions of remaining weights.

Figure 3: Test accuracy of LeNet subnetworks during training. Smaller subnetworks (e.g., 51.3%, 21.1%) learn faster and achieve equal or higher test accuracy.
The patterns are striking.
- The full network (100%) is the baseline.
- A winning ticket with 51.3% remaining weights learns faster and achieves slightly higher test accuracy.
- A smaller ticket (21.1%) outperforms both in speed and accuracy.
Even down to extreme sparsities (~3.6%), subnetworks nearly match the original performance.
The Critical Control: Re-initializing the Ticket
To isolate the role of initialization, the authors repeated the experiment but replaced the ticket’s original weights with a fresh random initialization.

Figure 4: Comparison between original winning tickets (blue) and randomly reinitialized subnetworks (orange). At the same sparsity levels, the reinitialized subnetworks learn more slowly and lose accuracy dramatically.
The results were decisive: when the same sparse structure was reinitialized randomly, performance collapsed. The original initialization—the ticket’s “lucky numbers”—was essential. The architecture alone could not explain success.
Winning tickets also generalized better: even though both the tickets and the dense network reached 100% training accuracy, the tickets achieved higher test accuracy, indicating that pruning removed redundant parameters without harming learning capacity.
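The control is essentially a one-line change to the rewinding step: keep the mask, but draw fresh random weights instead of restoring \(\theta_0\). A sketch, reusing the hypothetical `masks` dictionary from the earlier pruning sketch:

```python
import torch.nn as nn

def reinitialize_ticket(model: nn.Module, masks: dict) -> nn.Module:
    """Same sparse structure, but with a fresh random initialization."""
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()            # draw new random weights
    for name, p in model.named_parameters():
        if name in masks:
            p.data *= masks[name].float()        # keep only the pruned structure
    return model
```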
Experiment 2: Winning Tickets in Convolutional Networks
To challenge the hypothesis further, the authors tested deeper convolutional networks (Conv-2, Conv-4, and Conv-6—simplified VGG variants) on the CIFAR-10 image dataset.

Figure 5: Early-stopping iteration and test accuracy for Conv-2/4/6 networks when iteratively pruned. Solid lines = winning tickets; dashed lines = reinitialized networks.
The outcomes mirrored those from LeNet, but amplified:
- Winning tickets learned up to 3.5× faster than the original dense models.
- Test accuracies improved by 3–4 percentage points, even after pruning up to 98% of the weights.
Randomly reinitialized subnetworks once again underperformed—confirming the importance of the original initialization.
The Power of Dropout
Could regularization techniques like dropout already be exploiting winning ticket-like behavior? Dropout randomly disables units during training, effectively sampling subnetworks every iteration.
Frankle and Carbin tested this by combining dropout with their pruning procedure.

Figure 6: Test accuracy and learning speed for Conv-2/4/6 networks with dropout (solid lines) vs. without dropout (dashed lines). Dropout increases accuracy; iterative pruning enhances it further.
Dropout improved baseline accuracy, and iterative pruning raised it even higher. This suggests that dropout helps highlight subnetworks suitable for pruning—an intriguing synergy between regularization and the lottery ticket phenomenon.
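For context on the mechanism being compared, the snippet below shows standard dropout sampling a different random subnetwork on every forward pass (plain PyTorch; the layer sizes and 0.5 rate are arbitrary):

```python
import torch
import torch.nn as nn

# In training mode, dropout zeroes a random subset of units on each forward pass,
# so every iteration effectively trains a different subnetwork.
net = nn.Sequential(nn.Linear(300, 100), nn.ReLU(), nn.Dropout(p=0.5))
net.train()

x = torch.randn(8, 300)
y1, y2 = net(x), net(x)
print(torch.equal(y1, y2))   # False: the two passes used two different random masks
```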
Experiment 3: Deep Architectures — VGG-19 and ResNet-18
Finally, the authors examined whether winning tickets existed in large, modern architectures—VGG-19 and ResNet-18—trained on CIFAR-10 with techniques like batch normalization and learning rate scheduling.
At standard high learning rates, iterative pruning failed to find winning tickets. The pruned models performed no better than if randomly reinitialized.
However, when the learning rate was reduced—or when the high rate was combined with learning rate warmup, in which the rate ramps up gradually at the start of training—the pattern returned.

Figure 7: Test accuracy for VGG-19 at different learning rates and with warmup. Warmup enables the discovery of winning tickets at high learning rates and extreme pruning levels.
With warmup, the researchers recovered highly accurate subnetworks—winning tickets as small as 1.5% of the original VGG-19 that still matched the full network’s performance. ResNet-18 exhibited similar trends.
These results indicate that training dynamics in very deep models are sensitive; warmup stabilizes early optimization, preserving the conditions under which the lottery ticket phenomenon can emerge.
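A generic linear-warmup schedule can be expressed with a standard PyTorch scheduler; the sketch below is illustrative only (the warmup length, target learning rate, and model are placeholders, not the paper's exact configuration):

```python
import torch

model = torch.nn.Linear(10, 10)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # assumed target learning rate

warmup_iters = 10_000                                    # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_iters),  # ramp up to lr, then hold
)

for step in range(warmup_iters):
    # ...forward/backward pass would go here...
    optimizer.step()
    scheduler.step()
```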
What the Lottery Ticket Hypothesis Reveals
The Lottery Ticket Hypothesis reframes how we understand overparameterized models. It suggests that massive networks are “overcomplete” not merely for capacity, but because sheer size increases the chance of containing a fortuitously initialized, trainable subnetwork.
In effect, SGD discovers and develops one of these tickets.
This leads to several profound implications:
1. Improving Training Efficiency
If we can identify winning tickets early—perhaps even at initialization—we could train only those subnetworks, drastically reducing computation. This goal drives ongoing research into early-ticket-detection methods and efficient pruning algorithms.
2. Designing Better Networks and Initializations
Winning tickets show that certain structures and initial weight patterns are particularly conducive to learning. Studying them could reveal design principles for leaner architectures or smarter initialization schemes.
3. Understanding Why Overparameterization Helps
The hypothesis provides a fresh lens on why large models train easily: their overparameterization gives SGD more chances to land on a well-initialized subnetwork. Success isn’t purely about parameter count but about the combinatorial diversity of candidate subnetworks.
Conclusion
The Lottery Ticket Hypothesis elegantly explains a long-standing mystery in deep learning. Hidden inside massive networks are sparse, well-initialized subnetworks that are capable of learning just as effectively as the full model. These winning tickets prove that the seeds of success are sown at initialization—they simply need to be revealed.
Frankle and Carbin’s work bridges theory and practice, inspiring a new generation of research into pruning, optimization, and model efficiency. The challenge ahead is learning how to find these tickets faster—to train networks that are smaller, smarter, and more sustainable from the start.
In the world of neural networks, sometimes success really does come down to a lucky ticket.