Deploying machine learning models in the real world is a messy business. The perfect model for a high-end cloud GPU might be a terrible fit for a smartphone, which in turn is overkill for a tiny microcontroller. Each device has its own unique constraints — on latency, memory, and power — and this diversity has sparked rapid growth in Neural Architecture Search (NAS), a field dedicated to automatically designing neural networks tailored for specific hardware.

While NAS has produced impressive results, it often comes with a hidden cost: complexity and computational expense. A modern, popular approach called one-shot NAS tries to streamline the process by training a massive “super-network” that contains countless smaller “child” architectures within it. Searching for the best architecture then becomes as simple as finding the optimal path through this super-network.

However, there’s a catch: child models extracted directly from this super-network tend to perform much worse than if they were trained from scratch. Their shared weights are a compromise across vastly different architectures and are rarely optimal for any one of them. This has led to the entrenched “rule” in NAS: you must retrain your discovered architecture from scratch or apply complex post-processing to get good performance. This final step can be a huge bottleneck, especially if you need models for a wide variety of devices.

But what if that rule was wrong? What if you could train a single, massive model and instantly slice off high-performance, ready-to-deploy child models of any size, with no retraining needed?

That’s the paradigm-shifting idea behind BigNAS, from researchers at Google Brain. They challenge conventional wisdom and show how to train a single model that serves as a universal source of state-of-the-art models across diverse computational budgets.

Figure 1: The BigNAS workflow (right) simplifies deployment compared to previous one-shot and progressive-shrinking methods (left), which require retraining or distillation steps. BigNAS trains a single-stage model from which child models of various sizes can be instantly deployed without retraining.


The Problem with One-Shot NAS

In the standard one-shot NAS workflow, you define a search space—a set of possible configurations such as convolution kernel sizes, network depths, and channel widths. You construct a super-network encompassing all possibilities. During training, you sample different child architectures and update the shared weights.

This method is elegant: you only train one set of weights. But it’s also extremely difficult. A single set of parameters must work well for tiny, shallow networks and massive, deep ones simultaneously — architectures with completely different learning dynamics. The result is a compromise: shared weights that aren’t optimal for any specific child model. In practice, the accuracy of a sliced child is useful only as a ranking signal, and the chosen architecture still has to be retrained from scratch to reach its full performance.
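To make the setup concrete, here is a minimal Python sketch of a one-shot search space and random child sampling. The dimensions and choice values are illustrative placeholders, not the exact space used by any particular method.

```python
import random

# Illustrative search space: per-stage choices for kernel size, depth, and width.
# These specific values are placeholders, not BigNAS's exact configuration.
SEARCH_SPACE = {
    "kernel_size": [3, 5],          # depthwise conv kernel options
    "depth":       [2, 3, 4],       # number of blocks per stage
    "width":       [32, 48, 64],    # output channels per stage
    "resolution":  [192, 224, 256], # input image resolution
}

def sample_child(space, num_stages=5, rng=random):
    """Sample one child architecture by picking a value for every choice."""
    return {
        "resolution": rng.choice(space["resolution"]),
        "stages": [
            {
                "kernel_size": rng.choice(space["kernel_size"]),
                "depth":       rng.choice(space["depth"]),
                "width":       rng.choice(space["width"]),
            }
            for _ in range(num_stages)
        ],
    }

print(sample_child(SEARCH_SPACE))
```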

BigNAS aims to eliminate this gap. The goal is to train one single-stage model so effectively that any child model sliced from it is already at peak performance.


The Core Method: Training a High-Quality Single-Stage Model

The core challenge is managing conflicting needs: the smallest models need higher learning rates and less regularization to learn effectively, while the largest models need lower learning rates and more regularization to avoid overfitting.

BigNAS introduces five key techniques to reconcile these differences.

1. The Sandwich Rule — Bounding Performance

Training explicitly covers both extremes of the search space as well as architectures in between.
In each step, instead of sampling only random architectures, BigNAS always samples:

  1. Largest model — maximum depth, width, kernel sizes, resolution
  2. Smallest model — minimal values for all dimensions
  3. Random models — a few varied architectures in between

Gradients from all these models are aggregated to update the weights. By explicitly training the largest and smallest models, the process ensures both ends of the performance spectrum improve.
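Below is a minimal PyTorch-style sketch of one sandwich-rule training step. The helpers `max_config()`, `min_config()`, `sample_config()`, and `set_config()` are hypothetical stand-ins for whatever mechanism activates a child architecture over the shared weights; per-child input resolutions are omitted for brevity.

```python
import torch.nn.functional as F

def sandwich_step(supernet, images, labels, optimizer, num_random=2):
    """One sandwich-rule training step over shared weights (sketch)."""
    optimizer.zero_grad()

    # Always include the largest and smallest children, plus a few random ones.
    configs = [supernet.max_config(), supernet.min_config()]
    configs += [supernet.sample_config() for _ in range(num_random)]

    for cfg in configs:
        supernet.set_config(cfg)                 # activate this child over the shared weights
        loss = F.cross_entropy(supernet(images), labels)
        loss.backward()                          # gradients from every child accumulate
    optimizer.step()                             # single update with the aggregated gradients
```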


2. Inplace Distillation — The Big Guiding the Small

Knowledge distillation uses a large “teacher” network’s output probabilities (soft labels) to train a smaller “student” network.

In BigNAS, the largest model acts as the teacher within every training step thanks to the Sandwich Rule. It learns from ground-truth labels, and its predictions supervise all other models in that batch.

The authors also ensure that the teacher and students see the same image patch, resized to each model’s input resolution. This makes the supervision signal more consistent and yields roughly a +0.3% accuracy improvement for the child models.
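Here is a sketch of how inplace distillation can be folded into the sandwich-rule step, using the same hypothetical supernet helpers as above. Per-child resolution handling (resizing the same patch for each model) is omitted for brevity.

```python
import torch.nn.functional as F

def inplace_distillation_step(supernet, images, labels, optimizer, num_random=2):
    """Sandwich rule with inplace distillation (sketch)."""
    optimizer.zero_grad()

    # 1) The largest child learns from the ground-truth labels and provides soft labels.
    supernet.set_config(supernet.max_config())
    teacher_logits = supernet(images)
    F.cross_entropy(teacher_logits, labels).backward()
    soft_labels = teacher_logits.detach().softmax(dim=-1)  # no gradient flows back to the teacher

    # 2) The smallest and the random children train only on the teacher's soft labels.
    student_configs = [supernet.min_config()]
    student_configs += [supernet.sample_config() for _ in range(num_random)]
    for cfg in student_configs:
        supernet.set_config(cfg)
        student_log_probs = F.log_softmax(supernet(images), dim=-1)
        # Soft cross-entropy against the teacher's distribution (KL divergence up to a constant).
        (-(soft_labels * student_log_probs).sum(dim=-1)).mean().backward()

    optimizer.step()
```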


3. Smart Initialization — Taming the Explosion

Training large single-stage models initially suffered from exploding losses. Lowering the learning rate stabilized training but degraded accuracy (~–1.0% top-1).

The fix was elegant: initialize the learnable scaling parameter \(\gamma\) of the last BatchNorm layer in each residual block to zero. The residual branch then outputs zero at the start of training, leaving the skip connection as the main signal path.

The network starts “simpler” and can progressively exploit residual paths during training. This stabilized learning, enabled higher learning rates, and improved ImageNet accuracy by ~+1.0%.
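A minimal sketch of this initialization in PyTorch, assuming each residual block exposes its final BatchNorm layer under a hypothetical `last_bn` attribute.

```python
import torch.nn as nn

def zero_init_residual_bn(model):
    """Set gamma of each residual block's last BatchNorm to zero (sketch)."""
    for block in model.modules():
        last_bn = getattr(block, "last_bn", None)  # hypothetical attribute name
        if isinstance(last_bn, nn.BatchNorm2d):
            nn.init.zeros_(last_bn.weight)  # gamma = 0: the residual branch starts at zero
```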


4. Modified Learning Rate — Helping Everyone Converge

Smaller models converge slowly while large models peak and overfit early.

Figure 2: (a) Big models (orange) peak before small models (blue) finish learning, illustrating the convergence dilemma of single-stage training. (b) The proposed schedule decays exponentially and then flattens to a small constant rate.

The improved schedule, “exponential decay with a constant ending”, decays the learning rate as usual but stops decaying once it reaches 5% of its initial value, holding it constant for the rest of training.
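Here is a minimal sketch of such a schedule; the decay rate and interval are illustrative placeholders, and only the 5% floor reflects the schedule described in the paper.

```python
def lr_with_constant_ending(step, base_lr, decay_rate=0.97, decay_every=2500,
                            floor_frac=0.05):
    """Exponentially decaying learning rate that flattens out at a constant floor.

    `decay_rate` and `decay_every` are illustrative; the key idea is that the
    rate never drops below `floor_frac` (here 5%) of its initial value.
    """
    decayed = base_lr * (decay_rate ** (step // decay_every))
    return max(decayed, floor_frac * base_lr)

# Example: a base rate of 0.256 decays toward, and then stays at, 0.0128.
```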

Benefits:

  • Small models get extra capacity to finish converging
  • Larger models see mild weight oscillations late in training, which act as a regularizer and reduce overfitting

5. Targeted Regularization — Less is More

Big models overfit; small models underfit. Applying identical regularization (dropout, weight decay) to both is suboptimal.

BigNAS regularizes only the largest model. Smaller models remain unregularized, maximizing their ability to fit the data.
This improved small model accuracy by +0.5% and even gave large models a modest +0.2% boost.
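One simple way to implement this, sketched below, is to gate the regularization terms on which child is currently active. Adding weight decay as an explicit L2 penalty (rather than through the optimizer) keeps it out of the smaller children’s updates; the same flag would enable or disable any dropout layers.

```python
import torch.nn.functional as F

def child_loss(logits, labels, parameters, is_largest_child, weight_decay=1e-5):
    """Cross-entropy plus regularization applied only to the largest child (sketch)."""
    loss = F.cross_entropy(logits, labels)
    if is_largest_child:
        # Explicit L2 penalty so weight decay only affects the largest child's update.
        loss = loss + weight_decay * sum(p.pow(2).sum() for p in parameters)
    return loss
```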


Batch Norm Calibration

After training, one final step remains: recompute batch normalization statistics for any chosen child model by passing a few hundred batches of data through it. This is fast, requires no weight updates, and ensures stable inference performance.
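A sketch of this step in PyTorch: reset every BatchNorm layer’s running statistics, then push a few hundred batches through the sliced child in training mode with gradients disabled.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def calibrate_batch_norm(child_model, data_loader, num_batches=200):
    """Recompute BatchNorm running statistics for a sliced child model (sketch)."""
    for m in child_model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()   # start re-estimating mean/var from scratch
            m.momentum = None         # use a cumulative average over the calibration batches

    child_model.train()               # BN updates running stats only in train mode
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        child_model(images)           # forward passes only; no weight updates
    child_model.eval()
```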


Finding the Best Architectures

Searching the enormous space of more than \(10^{12}\) candidate architectures for the optimal child model requires efficiency. BigNAS uses a two-step coarse-to-fine search (a minimal sketch follows the steps below):

  1. Coarse Search — Sweep over a small grid of network-wide settings (resolution, depth, width, kernel sizes) to find promising “skeleton” architectures.
  2. Fine Search — For each promising skeleton, locally mutate details such as stage-wise width or individual kernel sizes, homing in on the best candidate.
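Here is a minimal sketch of the two stages, assuming hypothetical helpers: `evaluate(cfg)` returns the calibrated child’s validation accuracy, `flops_of(cfg)` its cost, and `mutate(cfg)` a locally perturbed copy (for example, one stage’s width or kernel size changed).

```python
def coarse_to_fine_search(evaluate, flops_of, mutate, coarse_grid, budget_flops,
                          top_k=3, num_mutations=20):
    """Coarse-to-fine architecture selection under a FLOPs budget (sketch)."""
    # Coarse stage: sweep a small grid of network-wide settings (resolution,
    # depth, width, kernel sizes) and keep the best skeletons.
    candidates = [cfg for cfg in coarse_grid if flops_of(cfg) <= budget_flops]
    candidates.sort(key=evaluate, reverse=True)
    best = candidates[:top_k]

    # Fine stage: locally mutate the best skeletons, keep anything that still
    # fits the budget, and return the top performer overall.
    for cfg in list(best):
        for _ in range(num_mutations):
            new_cfg = mutate(cfg)
            if flops_of(new_cfg) <= budget_flops:
                best.append(new_cfg)
    return max(best, key=evaluate)
```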

Figure 8: The coarse-to-fine search. Left: a broad coarse sweep locates a good candidate (red dot). Right: a focused local search refines the architecture near that candidate to find the best configuration under the budget.

Because BigNAS child models are deployment-ready, the accuracy measured here is their final accuracy.


Astonishing Results on ImageNet

The BigNAS single-stage model was trained on ImageNet over a space of architectures from ~200 MFLOPs (like MobileNetV3-Small) to ~1 GFLOPs (like EfficientNet-B2). From it, the team sliced four representative models: S, M, L, XL.

Figure 3: BigNASModels (red line) achieve consistently higher top-1 accuracy across computational budgets than previous models, setting a new Pareto frontier of accuracy versus MFLOPs.

Highlights:

  • BigNASModel-M — 78.9% top-1 accuracy at ~400 MFLOPs (+1.6% vs EfficientNet-B0)
  • BigNASModel-S — 76.5% at ~240 MFLOPs (+1.3% vs MobileNetV3)
  • BigNASModel-XL — 80.9% at ~1 GFLOPs, more accurate and ~4× cheaper than ResNet-50

All achieved without retraining or fine-tuning.


Ablation Studies: Why It Works

Each of the five techniques was validated through ablation tests.

Initialization: Without the modified initialization, training was unstable at high learning rates; with it, training was faster and final accuracy improved.

Figure 5: With the modified initialization, training at 100% of the learning rate (red/green) yields the best results for both small (left) and big (right) child models; without it (blue/orange), training is unstable or requires a lower learning rate.

Regularization: Applying dropout and weight decay only to the largest model boosts small and big model performance alike.

Figure 7: Targeted regularization of only the largest model (orange) improves both small (left) and big (right) child model accuracy over the naive approach of regularizing every child (blue).


Does Fine-Tuning Help?

What happens if we fine-tune sliced BigNAS models?
The results show negligible gains and, in some cases, even drops in accuracy.

Table 2: Fine-tuning sliced child models provides no significant improvement in most cases and sometimes even degrades accuracy.

This confirms BigNAS slices are already at their performance peak.


Conclusion — Train Once, Deploy Anywhere

BigNAS offers a streamlined, scalable NAS approach. By directly tackling the unique training dynamics of a massive weight-sharing super-network, it removes the need for expensive post-search retraining.

The result is a single, expertly trained model that functions as an entire “model zoo in one set of weights”:

  • Need a compact model for a microcontroller? Slice it.
  • Need a larger, more accurate model for a flagship phone? Slice it — from the same parent.

This innovation not only delivers a family of state-of-the-art models but also points toward a more efficient future for NAS: one where we can truly train once and deploy anywhere.