Neural Architecture Search (NAS) is one of the most exciting frontiers in deep learning. Its promise is simple yet profound: to automatically design the best possible neural network for a given task, freeing humans from the tedious and often intuition-driven process of manual architecture design. But this promise has always come with a hefty price tag—traditional NAS methods can consume thousands of GPU-hours, scouring vast search spaces by training and evaluating countless candidate architectures. This immense computational cost has limited NAS to a handful of well-funded research labs.
What if we could sidestep the most expensive part of this process entirely?
What if we could identify a top-performing architecture without ever training it?
This is the radical question posed by researchers from the University of Texas at Austin in their paper, “Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective”. They introduce a framework called TE-NAS (Training-free Neural Architecture Search) that leverages deep learning theory to evaluate a network’s potential at the moment of its birth—its random initialization.
The results are staggering: TE-NAS can discover state-of-the-art architectures on CIFAR-10 in about 30 minutes and on the massive ImageNet dataset in just 4 hours, all on a single GPU.
In this article, we’ll unpack the theory and methods behind TE-NAS. We’ll explore the two key indicators it uses to judge a network’s quality, how it balances them in a clever pruning-based search, and why this work represents a major shift toward making NAS efficient and accessible.
The Bottleneck of Traditional NAS
To appreciate the breakthrough of TE-NAS, we first need to understand why conventional NAS is so slow. At its core, any NAS algorithm has to answer two questions:
- How to evaluate? Given a candidate architecture, how do we determine if it’s “good”?
- How to optimize? How do we efficiently explore the massive space of possible architectures to find the best one?
Most NAS methods answer the first question using validation accuracy—train the architecture for a while (or to full convergence) and measure its performance. This is reliable but also the main bottleneck. Training even one deep neural network can be time-consuming; training thousands during a search is prohibitively expensive.
To address this, researchers developed shortcuts:
- Supernet approaches: A giant network containing all candidate architectures, from which sub-networks are sampled and evaluated using shared weights. This saves time but can mislead the search—performance within the supernet doesn’t always match performance when trained from scratch.
- Proxy evaluations: Train candidates for fewer epochs or on smaller datasets. This speeds up evaluation but introduces bias toward architectures that learn fast early but don’t generalize well.
TE-NAS takes a radically different route. It proposes a way to evaluate an architecture that requires no training, no labels, and almost no time.
The Two Pillars of a Good Network: Trainability and Expressivity
The TE-NAS authors argue that any high-performing network rests on two fundamental qualities:
- Trainability – The network must be easy to optimize with gradient descent. If the loss landscape is too chaotic, optimizers will struggle.
- Expressivity – The network must be capable of representing complex functions to model patterns in the data.
The key idea of TE-NAS is that both properties can be scored with theoretical indicators computed right at initialization, before a single gradient step.
Measuring Trainability with the Neural Tangent Kernel (NTK)
How can we know if a randomly initialized network is “trainable”?
Recent theory offers a powerful tool: the Neural Tangent Kernel (NTK).
The NTK characterizes the training dynamics of very wide networks. In the infinite-width limit, the network's outputs under gradient descent evolve as:
\[ \mu_t(\mathbf{X}_{\text{train}}) = (\mathbf{I} - e^{-\eta \Theta_{\text{train}} t}) \mathbf{Y}_{\text{train}} \]

Here:
- \(\mu_t\) = network outputs at time \(t\)
- \(\mathbf{Y}_{\text{train}}\) = true labels
- \(\Theta_{\text{train}}\) = NTK matrix evaluated on the training data
- \(\eta\) = learning rate
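To see why the eigenvalues of \(\Theta_{\text{train}}\) matter, decompose the kernel as \(\Theta_{\text{train}} = \sum_i \lambda_i \mathbf{v}_i \mathbf{v}_i^\top\). Plugging this into the formula above (a standard step in the infinite-width analysis, sketched here only for intuition) shows that each eigendirection of the labels is fitted independently:

\[ \mathbf{Y}_{\text{train}} - \mu_t(\mathbf{X}_{\text{train}}) = \sum_i e^{-\eta \lambda_i t} \, (\mathbf{v}_i^\top \mathbf{Y}_{\text{train}}) \, \mathbf{v}_i \]

Directions with large \(\lambda_i\) are learned almost immediately, while directions with tiny \(\lambda_i\) take exponentially longer, so a wide spread of eigenvalues makes optimization painful.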
The spectrum of \(\Theta_{\text{train}}\) therefore determines convergence speed, and its condition number summarizes the spread:

\[ \kappa_N = \frac{\lambda_0}{\lambda_m} \]

the ratio of the largest eigenvalue to the smallest. A large \(\kappa_N\) implies very slow learning along certain directions, which hinders training; a lower \(\kappa_N\) suggests better trainability.
The authors evaluated \(\kappa_N\) across NAS-Bench-201 architectures and found a clear negative correlation: architectures with lower \(\kappa_N\) tend to have better final accuracy.
Figure 1: In NAS-Bench-201 (CIFAR-10), lower NTK condition numbers (\(\kappa_N\)) align with higher test accuracy.
Crucially, calculating \(\kappa_N\) requires only one mini-batch of unlabeled data and a forward/backward pass—no training.
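As a concrete illustration, here is a minimal PyTorch sketch of how such a score could be computed; it is not the authors' implementation. It forms the empirical NTK Gram matrix from per-example gradients of a summed scalar output (a simplifying assumption) and takes the ratio of the extreme eigenvalues.

```python
import torch

def ntk_condition_number(net, x):
    """Estimate the NTK condition number of `net` at initialization.

    Row i of the Jacobian holds the gradient of the summed output for
    example i with respect to all parameters; the empirical NTK is the
    Gram matrix J @ J.T and the score is lambda_max / lambda_min.
    """
    params = [p for p in net.parameters() if p.requires_grad]
    rows = []
    for i in range(x.shape[0]):
        out = net(x[i:i + 1]).sum()                      # scalar output for example i
        grads = torch.autograd.grad(out, params, allow_unused=True)
        flat = [g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
                for g, p in zip(grads, params)]          # unused params contribute zeros
        rows.append(torch.cat(flat))
    jac = torch.stack(rows)                              # shape: (batch, n_params)
    ntk = jac @ jac.t()                                  # empirical NTK Gram matrix
    eigvals = torch.linalg.eigvalsh(ntk)                 # real eigenvalues, ascending
    return (eigvals[-1] / eigvals[0]).item()
```

Called on a freshly initialized candidate with a single unlabeled mini-batch, e.g. `ntk_condition_number(net, torch.randn(16, 3, 32, 32))`, this returns a number that can be compared across architectures, with lower assumed better as in the paper.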
Measuring Expressivity with Linear Regions
For ReLU networks, there’s a natural measure of expressivity: the number of linear regions into which the network partitions its input space.
A ReLU network is a piecewise linear function:
- Each ReLU introduces a linear boundary, splitting the input space.
- Combining them yields many distinct regions, each with a linear mapping.
The number of linear regions a network can create measures its complexity: more regions mean greater capacity to approximate complex functions.
Figure 2: Example of a ReLU network’s partitioning of input space into numerous distinct linear regions.
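For intuition, here is a tiny illustrative example (not from the paper) with a one-dimensional input and two ReLU units:

\[ f(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x - 1) \]

This \(f\) is 0 for \(x < 0\), equals \(x\) for \(0 \le x < 1\), and is constant at 1 for \(x \ge 1\): two ReLU boundaries carve the input line into three linear regions. Deeper and wider networks compose many such splits, so the number of regions can grow rapidly with architecture size.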
The authors estimate this count \(\hat{R}_N\) by feeding thousands of random inputs to an initialized network and counting unique activation patterns.
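A rough PyTorch sketch of that estimate might look like the following. It is a simplification rather than the paper's code, and it assumes the relevant activations can be captured with forward hooks on `nn.ReLU` modules.

```python
import torch
import torch.nn as nn

def estimate_linear_regions(net, n_samples=1000, input_shape=(3, 32, 32)):
    """Crude estimate of R_hat: the number of distinct ReLU activation
    patterns the initialized network produces on random inputs."""
    patterns = set()
    activations = []

    def hook(_module, _inputs, output):
        # Record which units fire (post-ReLU output > 0) for this input.
        activations.append((output > 0).flatten())

    handles = [m.register_forward_hook(hook)
               for m in net.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        for _ in range(n_samples):
            activations.clear()
            net(torch.randn(1, *input_shape))
            pattern = torch.cat(activations).to(torch.int8)
            patterns.add(tuple(pattern.tolist()))
    for h in handles:
        h.remove()
    return len(patterns)
```

Because everything runs under `torch.no_grad()` on random inputs, the estimate needs no labels and no training, in the same spirit as \(\kappa_N\).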
On NAS-Bench-201, they observed a positive correlation: architectures with more linear regions tend to have higher test accuracy.
Figure 3: In NAS-Bench-201 (CIFAR-100), more linear regions (\(\hat{R}_N\)) align with higher test accuracy.
A Tale of Two Metrics
Which is better, trainability or expressivity? The authors found that neither alone is enough: the two are complementary.
Operator preferences differ for each metric:
Figure 4: Operator preferences diverge: NTK condition number (\(\kappa_N\)) favors skip-connect for gradient flow, while linear region count (\(\hat{R}_N\)) favors 1×1 convolutions for expressivity.
Balance is key: both metrics value 3×3 convolutions, but \(\kappa_N\) pushes toward skip connections, and \(\hat{R}_N\) toward 1×1 convolutions.
The TE-NAS Search Strategy: Pruning-by-Importance
Armed with \(\kappa_N\) and \(\hat{R}_N\), TE-NAS opts for an efficient, deterministic pruning-based search instead of random sampling.
Steps:
- Start with a supernet: Every possible operator resides on every edge—maximally expressive but poorly trainable.
- Measure importance: For each operator, compute the change in \(\kappa_N\) and \(\hat{R}_N\) after its removal.
- Rank instead of scale: Because \(\kappa_N\) and \(\hat{R}_N\) live on very different numeric scales, their raw changes are combined by rank: importance score = rank(Δ\(\kappa_N\)) + rank(Δ\(\hat{R}_N\)). An operator whose removal most improves trainability (largest drop in \(\kappa_N\)) and best preserves expressivity (smallest drop in \(\hat{R}_N\)) ends up with the lowest score.
- Prune iteratively: On each edge, remove the operator with the lowest importance score. Repeat until each edge has a single operator.
This turns the search from exploring an exponentially large space of architectures into a number of measurements that grows only linearly with the number of edges and candidate operators.
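To make the ranking concrete, here is a hypothetical sketch of a single pruning round. The supernet is represented simply as a dict mapping each edge to its remaining operator names, and `compute_kappa` / `compute_regions` stand in for the \(\kappa_N\) and \(\hat{R}_N\) measurements above; none of these names come from the paper.

```python
def prune_one_round(supernet, compute_kappa, compute_regions):
    """One pruning round: on every undecided edge, drop the operator whose
    removal most lowers kappa_N while least lowering R_hat_N."""
    base_kappa = compute_kappa(supernet)
    base_regions = compute_regions(supernet)

    def without(edge, op):
        # Hypothetical helper: a copy of the supernet with `op` removed from `edge`.
        pruned = {e: list(ops) for e, ops in supernet.items()}
        pruned[edge].remove(op)
        return pruned

    for edge, ops in supernet.items():
        if len(ops) <= 1:
            continue  # this edge is already decided
        d_kappa = {op: compute_kappa(without(edge, op)) - base_kappa for op in ops}
        d_regions = {op: compute_regions(without(edge, op)) - base_regions for op in ops}

        # Combine by rank, not raw value, since the two metrics have
        # incomparable scales. Rank 0 goes to the removal that lowers
        # kappa_N the most / lowers R_hat_N the least.
        by_kappa = sorted(ops, key=lambda o: d_kappa[o])
        by_regions = sorted(ops, key=lambda o: d_regions[o], reverse=True)
        score = {op: by_kappa.index(op) + by_regions.index(op) for op in ops}

        ops.remove(min(ops, key=lambda o: score[o]))   # prune lowest-importance operator
    return supernet
```

Each round removes one operator from every undecided edge; repeating it until every edge keeps a single operator produces the final architecture.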
Figure 5 visualizes pruning: from a high-\(\kappa_N\), high-\(\hat{R}_N\) supernet toward lower \(\kappa_N\) while retaining \(\hat{R}_N\).
Figure 5: Pruning on NAS-Bench-201 and DARTS. Point “0” is the initial supernet—high expressivity, poor trainability. Early pruning rapidly improves trainability.
Jaw-Dropping Results: SOTA Performance at Unprecedented Speed
The true test: performance and speed. TE-NAS was evaluated on NAS-Bench-201 and the DARTS search space.
NAS-Bench-201: TE-NAS finds the best architectures across CIFAR-10, CIFAR-100, and ImageNet-16-120—at 5×–19× lower search cost.
Table 1: TE-NAS outperforms random and other training-free methods in accuracy, with significant efficiency gains.
DARTS search space (CIFAR-10):
Test error of 2.63% in just 0.05 GPU-days (about an hour) on a 1080Ti—competitive with state-of-the-art gradient-based NAS.
Table 2: CIFAR-10 results: TE-NAS matches top NAS methods while being 10×–100× faster.
ImageNet (mobile setting):
Top-1 error of 24.5% with a search cost of 4 GPU-hours—remarkably efficient for a dataset where searches usually run for days.
Table 3: ImageNet results: TE-NAS achieves competitive performance with minimal search time.
The final architectures reflect the balanced trade-off TE-NAS enforces:
Figure 6: TE-NAS-discovered CIFAR-10 cells: a mix of operations balancing trainability and expressivity.
Figure 7: TE-NAS-discovered ImageNet cells tailored for mobile constraints.
Conclusion: A New Paradigm for NAS
TE-NAS is more than a fast NAS algorithm—it’s a conceptual shift. By bridging deep learning theory and practical search, it shows we can predict much about a network’s eventual performance from its initialization.
Key takeaways:
- Training-free evaluation is viable: Indicators like \(\kappa_N\) and \(\hat{R}_N\) offer “zero-cost” performance proxies.
- Trainability + Expressivity = Success: Both must be balanced for top performance.
- Pruning-based search is efficient: Deterministic, fast, and effective.
TE-NAS democratizes NAS, making it practical without massive resources. It opens the door to finding other theoretical properties that can guide architecture design—ushering in an era of accessible, theory-driven NAS.