Designing a high-performing neural network has long been part art, part science, and a whole lot of trial and error. For years, the best deep learning models were forged through immense human effort, intuition, and countless hours of GPU-powered experimentation. This manual design process is a significant bottleneck—one that sparked the rise of an exciting field: Neural Architecture Search (NAS).
The goal of NAS is straightforward: automate the design of neural networks. Instead of a human painstakingly choosing layers, connections, and operations, a NAS algorithm explores a vast space of possible architectures to find the best one for a given task. Early NAS methods were revolutionary, discovering state-of-the-art models like NASNet. But they came with staggering computational costs. The original NAS paper required 800 GPUs running for 28 days straight—over 60 GPU-years—for a single search.
Subsequent research made NAS faster using clever optimizations, but a core problem remains: most NAS algorithms follow a “generate-and-train” loop. They generate a candidate architecture, train it (or a proxy version) to measure its performance, and use that signal to guide the search. This training step is the main cost driver.
But what if we could skip training altogether? What if we could examine a freshly initialized network—before a single gradient step—and predict how well it will perform after training?
This radical idea is explored in Neural Architecture Search without Training. The authors propose a scoring method that predicts an architecture’s potential without any training, enabling top-tier model discovery in seconds on a single GPU. It’s a paradigm shift that challenges the long-held assumption that training is necessary to evaluate a network’s promise.
The High Cost of Finding Architectures
To appreciate this breakthrough, let’s recap the NAS landscape. The pioneering work of Zoph & Le used a controller network (often an RNN) to generate architecture descriptions. Each proposed architecture was trained from scratch, and its final accuracy was used as a reward to update the controller via reinforcement learning. This process was incredibly slow and resource-intensive.
Researchers later improved NAS efficiency with two key ideas:
- Cell-based search — Optimize smaller, reusable building blocks called cells instead of whole networks.
- Weight sharing — Train a single large “supernetwork” with shared weights among candidate architectures.
These innovations reduced search times from weeks on GPU clusters to hours on a single GPU. But trade-offs emerged—weight sharing can distort results, sometimes making random search surprisingly competitive. Search spaces also remain enormous, often making it impossible to prove that the best architecture has been found.
To enable rigorous comparison of search algorithms, researchers created NAS benchmarks like NAS-Bench-101 and NAS-Bench-201—tractable search spaces where every single architecture has been exhaustively trained and evaluated. They act as “lookup tables” for the true performance of any architecture. The authors of Neural Architecture Search without Training tested their training-free method on these benchmarks.
The Secret Life of Untrained Networks
The core insight: an architecture’s structure imparts properties predictive of its final performance, even before training begins. The key lies in how networks with Rectified Linear Unit (ReLU) activations partition the input space.
Each ReLU neuron either “fires” (positive pre-activation) or stays silent (non-positive pre-activation). For a given input, the firing pattern across every ReLU in the network forms a unique binary signature: its activation code.
A network essentially carves the input space into many linear regions. Within each region, the network behaves as a simple linear function. The binary activation code identifies which region a given input occupies.
Figure 2. Each ReLU splits the space into active/inactive regions labeled as binary codes. The intersection of codes across layers defines unique linear regions before training.
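To make this concrete, here is a minimal PyTorch sketch of how such activation codes could be extracted from an untrained network. The hook-based approach and the name `activation_codes` are our own illustration (the paper does not prescribe an implementation), and it assumes the model’s nonlinearities are `nn.ReLU` modules.

```python
import torch
import torch.nn as nn

def activation_codes(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Binary code per input: 1 where a ReLU fires, 0 where it stays silent."""
    signs = []

    def hook(_module, inputs, _output):
        # A ReLU fires exactly when its pre-activation input is positive.
        signs.append((inputs[0] > 0).flatten(start_dim=1))

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(x)  # one forward pass of an *untrained* network
    for h in handles:
        h.remove()

    # Concatenate the firing pattern of every ReLU unit: shape (batch, N_A).
    return torch.cat(signs, dim=1)
```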
The hypothesis: if many different inputs map to the same linear region at initialization, the network will struggle to differentiate them during training. A good architecture should naturally separate inputs into distinct regions even without training.
Quantifying Input Separation
To measure this, the authors:
- Take a mini-batch of \( N \) inputs.
- Compute each input’s activation code \( \mathbf{c}_i \).
- Calculate the Hamming distance \( d_H(\mathbf{c}_i, \mathbf{c}_j) \) between pairs of codes.
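For intuition, the Hamming distance simply counts the positions where two codes disagree; for two four-unit codes, for example:

\[ d_H(0110,\, 0011) = 2 \]

since the codes differ in their second and fourth bits.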
They then form a kernel matrix:
\[ \mathbf{K}_H = \begin{pmatrix} N_A - d_H(\mathbf{c}_1, \mathbf{c}_1) & \cdots & N_A - d_H(\mathbf{c}_1, \mathbf{c}_N) \\ \vdots & \ddots & \vdots \\ N_A - d_H(\mathbf{c}_N, \mathbf{c}_1) & \cdots & N_A - d_H(\mathbf{c}_N, \mathbf{c}_N) \end{pmatrix} \]
Here, \( N_A \) is the number of ReLU units in the network. Each entry measures the activation overlap between two inputs; the diagonal entries equal \( N_A \), because every code has zero Hamming distance to itself.
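The kernel translates directly into code. This sketch assumes `codes` is the 0/1 tensor produced by the `activation_codes` helper above; the vectorized Hamming-distance trick is our choice, not the authors’.

```python
def hamming_kernel(codes: torch.Tensor) -> torch.Tensor:
    """K_H[i, j] = N_A - d_H(c_i, c_j), computed for all pairs at once."""
    c = codes.float()                 # (N, N_A) matrix of 0/1 entries
    n_a = c.shape[1]                  # N_A: total number of ReLU units
    # d_H counts positions where exactly one of the two codes is 1:
    d_h = c @ (1.0 - c).T + (1.0 - c) @ c.T
    return n_a - d_h                  # (N, N) kernel matrix
```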
When visualizing \( \mathbf{K}_H \) for networks in different accuracy ranges, striking differences emerged:
Figure 1. High-performing architectures exhibit more diagonal-dominant matrices—inputs are highly similar to themselves but less similar to others.
The authors condensed this into a single score:
\[ s = \log |\mathbf{K}_H| \]
where \( |\cdot| \) denotes the determinant. Highly similar codes produce large off-diagonal entries and nearly linearly dependent rows, which shrink the determinant; a higher score therefore indicates inputs that fall into more distinct linear regions.
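Putting the pieces together, here is a sketch of the score itself, reusing the helpers above. We use `torch.linalg.slogdet` rather than a raw determinant, a standard trick for numerical stability with large matrices; the paper does not mandate this.

```python
def naswot_score(model: nn.Module, x: torch.Tensor) -> float:
    """Score an untrained network: s = log |K_H| on a single mini-batch."""
    k_h = hamming_kernel(activation_codes(model, x))
    # slogdet returns (sign, log|det|); the log-determinant is the score s.
    _, logdet = torch.linalg.slogdet(k_h)
    return logdet.item()
```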
Results: Predictive Power Across Benchmarks
The authors sampled thousands of untrained architectures, computed their scores \( s \), and looked up their true final validation accuracy from benchmark data.
Figure 3. Across NAS-Bench-101, NAS-Bench-201, and various NDS search spaces, higher initial scores correlate strongly with higher final accuracy.
The score’s correlation (Kendall’s Tau, \( \tau \)) is particularly strong for NAS-Bench-201 and NDS-DARTS. Compared to other training-free metrics like `grad_norm` and `synflow`, the proposed method is consistently predictive:
Figure 4. The proposed score outperforms `grad_norm` and `synflow` in correlation stability across varied design spaces.
Robustness: Ablation Studies
Is the score robust? The authors ran ablations on three factors:
- Data independence: rankings stay stable across different mini-batches of real images, and even with Gaussian-noise inputs.
- Initialization robustness: random initialization introduces some noise, but the ranking of good versus bad architectures persists.
- Batch size: rankings remain consistent across batch sizes.
Figure 5. The score’s relative ranking is stable across different input types, initializations, and batch sizes.
They also tracked score changes during training:
Figure 6. Scores rise sharply in the first few epochs, then plateau. Relative rankings remain stable.
NASWOT: Neural Architecture Search Without Training
Armed with this robust, fast score, the authors propose NASWOT:
- Randomly sample \( N \) architectures.
- Compute \( s \) for each using a single mini-batch.
- Select the highest-scoring architecture.
With \( N = 100 \), NASWOT scores all candidates and selects the best in about 30 seconds on a single GPU.
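The full search then fits in a dozen lines. In this sketch, `sample_architectures` and `build_model` are hypothetical stand-ins for whatever search-space API is in use (e.g. a NAS-Bench-201 wrapper), and `naswot_score` is the helper sketched earlier.

```python
def naswot_search(sample_architectures, build_model, batch, n=100):
    """Randomly sample n architectures, score each once, keep the best."""
    best_arch, best_score = None, float("-inf")
    for arch in sample_architectures(n):
        model = build_model(arch)            # freshly initialized, never trained
        score = naswot_score(model, batch)   # one mini-batch, no gradients
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch
```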
Table 2. NASWOT outperforms all weight-sharing baselines in accuracy while being orders of magnitude faster. It approaches the performance of top non-weight-sharing methods.
The trade-off between search time and accuracy is clear:
Figure 7. NASWOT sits at the sweet spot: high accuracy for minimal search time compared to traditional NAS methods.
AREA: Assisting Existing Algorithms
The score isn’t just for standalone search. The authors built AREA (Assisted Regularized EA)—a variant of Regularized Evolution (REA) that starts with a larger random sample, scores it, and seeds REA with the best candidates. AREA yields consistent performance boosts, showing the scoring method’s adaptability.
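A rough sketch of that warm start, again using the hypothetical search-space helpers from above: score a large random pool once, then hand the top scorers to REA as its starting population.

```python
def area_warm_start(sample_architectures, build_model, batch,
                    pool_size=1000, population_size=100):
    """Seed Regularized Evolution with the best-scoring architectures
    from a larger random pool (the AREA idea, sketched here)."""
    scored = [(naswot_score(build_model(a), batch), a)
              for a in sample_architectures(pool_size)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arch for _, arch in scored[:population_size]]
```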
Conclusion: A New Path for NAS
Neural Architecture Search without Training shows that you can predict a network’s potential before training by analyzing how it separates inputs at initialization. This unlocks:
- Accessibility: NASWOT can run on a single GPU in seconds.
- Efficiency: Enables specialized architecture searches for different devices or datasets in real time.
- Hybrid potential: Can warm-start existing NAS algorithms for better efficiency.
While this work focuses on convolutional architectures for image classification, the implications reach far beyond. This proof-of-concept challenges the necessity of training in NAS evaluation, paving the way for faster, cheaper, and more accessible architecture search—putting powerful design capabilities into the hands of more practitioners.