Designing a high-performing neural network has long been part art, part science, and a whole lot of trial and error. For years, the best deep learning models were forged through immense human effort, intuition, and countless hours of GPU-powered experimentation. This manual design process is a significant bottleneck—one that sparked the rise of an exciting field: Neural Architecture Search (NAS).
The goal of NAS is straightforward: automate the design of neural networks. Instead of a human painstakingly choosing layers, connections, and operations, a NAS algorithm explores a vast space of possible architectures to find the best one for a given task. Early NAS methods were revolutionary, discovering state-of-the-art models like NASNet. But they came with staggering computational costs. The original NAS paper required 800 GPUs running for 28 days straight—over 60 GPU-years—for a single search.
Subsequent research made NAS faster using clever optimizations, but a core problem remains: most NAS algorithms follow a “generate-and-train” loop. They generate a candidate architecture, train it (or a proxy version) to measure its performance, and use that signal to guide the search. This training step is the main cost driver.
But what if we could skip training altogether? What if we could examine a freshly initialized network—before a single gradient step—and predict how well it will perform after training?
This radical idea is explored in Neural Architecture Search without Training. The authors propose a scoring method that predicts an architecture’s potential without any training, enabling top-tier model discovery in seconds on a single GPU. It’s a paradigm shift that challenges the long-held assumption that training is necessary to evaluate a network’s promise.
The High Cost of Finding Architectures
To appreciate this breakthrough, let’s recap the NAS landscape. The pioneering work of Zoph & Le used a controller network (often an RNN) to generate architecture descriptions. Each proposed architecture was trained from scratch, and its final accuracy was used as a reward to update the controller via reinforcement learning. This process was incredibly slow and resource-intensive.
Researchers later improved NAS efficiency with two key ideas:
- Cell-based search — Optimize smaller, reusable building blocks called cells instead of whole networks.
- Weight sharing — Train a single large “supernetwork” with shared weights among candidate architectures.
These innovations reduced search times from weeks on GPU clusters to hours on a single GPU. But trade-offs emerged—weight sharing can distort results, sometimes making random search surprisingly competitive. Search spaces also remain enormous, often making it impossible to prove that the best architecture has been found.
To enable rigorous comparison of search algorithms, researchers created NAS benchmarks like NAS-Bench-101 and NAS-Bench-201—tractable search spaces where every single architecture has been exhaustively trained and evaluated. They act as “lookup tables” for the true performance of any architecture. The authors of Neural Architecture Search without Training tested their training-free method on these benchmarks.
The Secret Life of Untrained Networks
The core insight: an architecture’s structure imparts properties predictive of its final performance, even before training begins. The key lies in how networks with Rectified Linear Unit (ReLU) activations partition the input space.
Each ReLU neuron either “fires” (positive pre-activation) or stays silent (non-positive pre-activation). For a given input, the firing pattern across every ReLU in the network forms a unique binary signature: its activation code.
A network essentially carves the input space into many linear regions. Within each region, the network behaves as a simple linear function. The binary activation code identifies which region a given input occupies.
Figure 2. Each ReLU splits the space into active/inactive regions labeled as binary codes. The intersection of codes across layers defines unique linear regions before training.
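To make this concrete, here is a minimal PyTorch sketch of how such activation codes could be extracted from an untrained network. The hook-based approach and the name `activation_codes` are our own illustration (the paper does not prescribe an implementation), and it assumes the model’s nonlinearities are `nn.ReLU` modules.

```python
import torch
import torch.nn as nn

def activation_codes(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Binary code per input: 1 where a ReLU fires, 0 where it stays silent."""
    signs = []

    def hook(_module, inputs, _output):
        # A ReLU fires exactly when its pre-activation input is positive.
        signs.append((inputs[0] > 0).flatten(start_dim=1))

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(x)  # one forward pass of an *untrained* network
    for h in handles:
        h.remove()

    # Concatenate the firing pattern of every ReLU unit: shape (batch, N_A).
    return torch.cat(signs, dim=1)
```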
The hypothesis: if many different inputs map to the same linear region at initialization, the network will struggle to differentiate them during training. A good architecture should naturally separate inputs into distinct regions even without training.
Quantifying Input Separation
To measure this, the authors:
- Take a mini-batch of \( N \) inputs.
- Compute each input’s activation code \( \mathbf{c}_i \).
- Calculate the Hamming distance \( d_H(\mathbf{c}_i, \mathbf{c}_j) \) between pairs of codes.
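For intuition, the Hamming distance simply counts the positions where two codes disagree; for two four-unit codes, for example:

\[ d_H(0110,\, 0011) = 2 \]

since the codes differ in their second and fourth bits.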
They then form a kernel matrix:
\[ \mathbf{K}_H = \begin{pmatrix} N_A - d_H(\mathbf{c}_1, \mathbf{c}_1) & \cdots & N_A - d_H(\mathbf{c}_1, \mathbf{c}_N) \\ \vdots & \ddots & \vdots \\ N_A - d_H(\mathbf{c}_N, \mathbf{c}_1) & \cdots & N_A - d_H(\mathbf{c}_N, \mathbf{c}_N) \end{pmatrix} \]
Here, \( N_A \) is the number of ReLU units in the network. Each entry measures the activation overlap between two inputs; the diagonal entries equal \( N_A \), because every code has zero Hamming distance to itself.
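The kernel translates directly into code. This sketch assumes `codes` is the 0/1 tensor produced by the `activation_codes` helper above; the vectorized Hamming-distance trick is our choice, not the authors’.

```python
def hamming_kernel(codes: torch.Tensor) -> torch.Tensor:
    """K_H[i, j] = N_A - d_H(c_i, c_j), computed for all pairs at once."""
    c = codes.float()                 # (N, N_A) matrix of 0/1 entries
    n_a = c.shape[1]                  # N_A: total number of ReLU units
    # d_H counts positions where exactly one of the two codes is 1:
    d_h = c @ (1.0 - c).T + (1.0 - c) @ c.T
    return n_a - d_h                  # (N, N) kernel matrix
```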
When visualizing \( \mathbf{K}_H \) for networks in different accuracy ranges, striking differences emerged:
Figure 1. High-performing architectures exhibit more diagonal-dominant matrices—inputs are highly similar to themselves but less similar to others.
The authors condensed this into a single score:
\[ s = \log |\mathbf{K}_H| \]
where \( |\cdot| \) denotes the determinant. Highly similar codes produce large off-diagonal entries and nearly linearly dependent rows, which shrink the determinant; a higher score therefore indicates inputs that fall into more distinct linear regions.
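Putting the pieces together, here is a sketch of the score itself, reusing the helpers above. We use `torch.linalg.slogdet` rather than a raw determinant, a standard trick for numerical stability with large matrices; the paper does not mandate this.

```python
def naswot_score(model: nn.Module, x: torch.Tensor) -> float:
    """Score an untrained network: s = log |K_H| on a single mini-batch."""
    k_h = hamming_kernel(activation_codes(model, x))
    # slogdet returns (sign, log|det|); the log-determinant is the score s.
    _, logdet = torch.linalg.slogdet(k_h)
    return logdet.item()
```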
Results: Predictive Power Across Benchmarks
The authors sampled thousands of untrained architectures, computed their scores \( s \), and looked up their true final validation accuracy from benchmark data.
Figure 3. Across NAS-Bench-101, NAS-Bench-201, and various NDS search spaces, higher initial scores correlate strongly with higher final accuracy.
The score’s correlation (Kendall’s Tau, \( \tau \)) is particularly strong for NAS-Bench-201 and NDS-DARTS. Compared to other training-free metrics like `grad_norm` and `synflow`, the proposed method is consistently predictive:
Figure 4. The proposed score outperforms `grad_norm` and `synflow` in correlation stability across varied design spaces.
Robustness: Ablation Studies
Is the score robust? The authors ran ablations on three factors:
- Data independence: rankings stay stable across different mini-batches of real images, and even with Gaussian-noise inputs.
- Initialization robustness: random initialization introduces some noise, but the ranking of good versus bad architectures persists.
- Batch size: rankings remain consistent across batch sizes.
Figure 5. The score’s relative ranking is stable across different input types, initializations, and batch sizes.
They also tracked score changes during training:
Figure 6. Scores rise sharply in the first few epochs, then plateau. Relative rankings remain stable.
NASWOT: Neural Architecture Search Without Training
Armed with this robust, fast score, the authors propose NASWOT:
- Randomly sample \( N \) architectures.
- Compute \( s \) for each using a single mini-batch.
- Select the highest-scoring architecture.
With \( N = 100 \), NASWOT scores all candidates and selects the best in about 30 seconds on a single GPU.
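The full search then fits in a dozen lines. In this sketch, `sample_architectures` and `build_model` are hypothetical stand-ins for whatever search-space API is in use (e.g. a NAS-Bench-201 wrapper), and `naswot_score` is the helper sketched earlier.

```python
def naswot_search(sample_architectures, build_model, batch, n=100):
    """Randomly sample n architectures, score each once, keep the best."""
    best_arch, best_score = None, float("-inf")
    for arch in sample_architectures(n):
        model = build_model(arch)            # freshly initialized, never trained
        score = naswot_score(model, batch)   # one mini-batch, no gradients
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch
```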
Table 2. NASWOT outperforms all weight-sharing baselines in accuracy while being orders of magnitude faster. It approaches the performance of top non-weight-sharing methods.
The trade-off between search time and accuracy is clear:
Figure 7. NASWOT sits at the sweet spot: high accuracy for minimal search time compared to traditional NAS methods.
AREA: Assisting Existing Algorithms
The score isn’t just for standalone search. The authors built AREA (Assisted Regularized EA)—a variant of Regularized Evolution (REA) that starts with a larger random sample, scores it, and seeds REA with the best candidates. AREA yields consistent performance boosts, showing the scoring method’s adaptability.
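A rough sketch of that warm start, again using the hypothetical search-space helpers from above: score a large random pool once, then hand the top scorers to REA as its starting population.

```python
def area_warm_start(sample_architectures, build_model, batch,
                    pool_size=1000, population_size=100):
    """Seed Regularized Evolution with the best-scoring architectures
    from a larger random pool (the AREA idea, sketched here)."""
    scored = [(naswot_score(build_model(a), batch), a)
              for a in sample_architectures(pool_size)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arch for _, arch in scored[:population_size]]
```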
Conclusion: A New Path for NAS
Neural Architecture Search without Training shows that you can predict a network’s potential before training by analyzing how it separates inputs at initialization. This unlocks:
- Accessibility: NASWOT can run on a single GPU in seconds.
- Efficiency: Enables specialized architecture searches for different devices or datasets in real time.
- Hybrid potential: Can warm-start existing NAS algorithms for better efficiency.
While this work focuses on convolutional architectures for image classification, the implications reach far beyond. This proof-of-concept challenges the necessity of training in NAS evaluation, paving the way for faster, cheaper, and more accessible architecture search—putting powerful design capabilities into the hands of more practitioners.