Neural Architecture Search (NAS) is one of the most exciting frontiers in deep learning. Imagine an algorithm that can automatically design a state-of-the-art neural network for you—perfectly tailored to your specific task. The promise of NAS is to replace the tedious, intuition-driven process of manual network design with a principled, automated search.

For years, however, this promise came with a colossal price tag. Early NAS methods required tens of thousands of GPU hours to discover a single architecture—a cost so prohibitive that it was out of reach for most researchers and engineers. To make NAS feasible, the community developed a clever workaround: instead of searching directly on the massive target task (like ImageNet), researchers would search on a smaller, more manageable proxy task—such as using CIFAR-10 instead of ImageNet, training for fewer epochs, or searching for a single reusable block rather than an entire network.

But this raises a critical question: is an architecture optimized for a small, proxy task also optimal for the real target task? Increasingly, the answer seems to be “not necessarily.” A choice that works wonders on CIFAR-10 might be suboptimal for the complexity of ImageNet. More importantly, when you factor in real-world constraints such as inference latency on a specific mobile phone or GPU, the architecture found on a proxy task might be far from ideal.

This is precisely the problem that “ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware” sets out to solve. The researchers propose a method to completely eliminate the proxy, enabling search directly on large-scale tasks like ImageNet while optimizing for hardware-specific metrics like latency. As we’ll see, their approach achieves state-of-the-art results—at a tiny fraction of the computational cost compared to previous methods.

A comparison between the old proxy-based approach and the direct ProxylessNAS approach. The proxy method has an intermediate step, while ProxylessNAS learns directly on the target task and hardware.

Figure 1: Proxy-based NAS relies on an intermediate proxy task before transferring architectures to the target task. ProxylessNAS removes the proxy, learning architectures directly on the target dataset and hardware.


The Problem with Differentiable NAS: A Memory Hog

NAS saw a leap in efficiency with the introduction of differentiable NAS methods like DARTS. The idea was brilliant: instead of training thousands of separate candidate networks, create one giant, over-parameterized network that contains every possible architecture in the search space.

Imagine a single layer in a network. In a normal design, you choose whether it should be a 3×3 convolution, a 5×5 convolution, a pooling layer, or something else. In differentiable NAS, you include all of them in parallel. This “mixed operation” outputs a weighted sum of each candidate’s output:

\[ m_{\mathcal{O}}^{\text{DARTS}}(x) = \sum_{i=1}^{N} \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}\, o_i(x) \]

Here, \(o_i(x)\) is the output of the \(i\)-th operation (e.g., the 3×3 convolution), and the weights \(\alpha_i\) are learned architecture parameters passed through a softmax. The search then becomes a standard optimization problem: use gradient descent to train both the network’s weights and the architecture parameters.
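
To make this concrete, here is a minimal PyTorch-style sketch of such a mixed operation. The class name `MixedOp` and the three-operation candidate set are illustrative choices, not the paper’s code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style mixed operation: a softmax-weighted sum over all candidates."""

    def __init__(self, channels):
        super().__init__()
        # Illustrative candidate set; real search spaces contain more operations.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
        ])
        # One architecture parameter alpha_i per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Every candidate runs, so every output feature map must be kept in memory.
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```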

This reduced the search cost from tens of thousands of GPU hours to just a few days. But it introduced a crippling bottleneck: GPU memory. Computing the weighted sum requires calculating and storing the output feature maps for every candidate operation in memory. With 7 candidate operations per layer, memory usage is roughly 7× that of a normal network. On large-scale tasks like ImageNet, this memory explosion is simply infeasible—forcing researchers back to proxies.


The Core Idea of ProxylessNAS: Path Binarization

ProxylessNAS solves the memory problem with a deceptively simple trick: path binarization. Instead of computing a weighted sum of all candidate paths, the network samples exactly one path to activate at a time.

Architecture parameters \(\alpha_i\) are still converted to probabilities \(p_i\) via softmax. But instead of using these as continuous weights, the probabilities are used to sample a single active path, producing a binary gate vector \(g\):

\[ g = \mathrm{binarize}(p_1, \cdots, p_N) = \begin{cases} [1, 0, \cdots, 0] & \text{with prob.\ } p_1 \\ \vdots & \\ [0, 0, \cdots, 1] & \text{with prob.\ } p_N \end{cases} \]

With binary gating, the mixed operation’s output becomes:

\[ m_{\mathcal{O}}^{\text{Binary}}(x) = \sum_{i=1}^{N} g_i\, o_i(x) \]

At any moment during training, the network executes only a single operation per block—bringing its memory usage down to that of a normal compact network. This one idea unlocks direct, proxyless search on large datasets like ImageNet.
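
Here is a minimal sketch of how path binarization changes the forward pass, again in illustrative PyTorch rather than the authors’ implementation. Only the sampled path is executed, so only one feature map is materialized:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMixedOp(nn.Module):
    """Mixed operation with path binarization: one sampled candidate runs per pass."""

    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)  # p_1, ..., p_N
        # Sampling the binary gate is non-differentiable, so detach before sampling.
        idx = torch.multinomial(probs.detach(), num_samples=1).item()
        # Only op `idx` executes, so memory matches that of a compact network.
        return self.ops[idx](x)
```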


Training the Untrainable: Learning Binarized Architectures

Sampling discrete paths introduces a new challenge: how do you train the architecture parameters \(\alpha_i\) through this stochastic process, given that you can’t directly backpropagate through a random choice?

ProxylessNAS addresses this with an alternating two-step training process:

The two-step training process of ProxylessNAS. In one step, weight parameters are updated for a sampled architecture. In the other step, the architecture parameters are updated.

Figure 2: Alternating updates between weight parameters and binarized architecture parameters keep memory usage low while learning both sets of parameters.

  1. Update Weight Parameters
    Freeze the architecture parameters. For each batch of training data, sample one active path for each mixed operation block (creating a compact network for the batch). Update the network’s weights via standard gradient descent on the training set.

  2. Update Architecture Parameters
    Freeze the weights. Using a validation set, update the architecture parameters with a gradient-based method inspired by BinaryConnect. The core gradient estimate is:

    \[ \frac{\partial L}{\partial \alpha_i} \approx \sum_{j=1}^N \frac{\partial L}{\partial g_j} \, p_j \, (\delta_{ij} - p_i) \]

    To keep memory low, only two paths are sampled in each architecture update step: all other paths are masked out, the two sampled paths’ architecture parameters are updated against each other (one enhanced, one attenuated), and the result is rescaled so that the probabilities of the unsampled paths remain unchanged. A small numeric sketch of this gradient estimate appears below.

By alternating these two steps throughout training, ProxylessNAS learns both the network’s weights and its discrete architectural configuration efficiently.
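
For concreteness, here is a small numeric sketch of the gradient estimate from step 2, assuming the gradient of the loss with respect to the binary gates (`grad_g`) has already been obtained from backpropagation. The function name and the numbers in the usage example are made up for illustration:

```python
import torch

def alpha_gradient(grad_g: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """BinaryConnect-style estimate of the architecture-parameter gradient.

    grad_g : (N,) gradient of the loss with respect to the binary gates g_j
    probs  : (N,) softmax probabilities p_j derived from the architecture parameters
    Returns dL/dalpha_i ~ sum_j dL/dg_j * p_j * (delta_ij - p_i).
    """
    delta = torch.eye(probs.shape[0])
    # term[i, j] = grad_g[j] * p[j] * (delta_ij - p[i]); sum over j for each i.
    return ((delta - probs.unsqueeze(1)) * (grad_g * probs)).sum(dim=1)

# Illustrative usage with made-up numbers.
probs = torch.softmax(torch.tensor([0.5, 0.2, -0.1]), dim=0)
grad_g = torch.tensor([0.3, -0.1, 0.05])
print(alpha_gradient(grad_g, probs))
```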


Making Hardware a First-Class Citizen

A standout feature of ProxylessNAS is direct optimization for hardware metrics like latency, which are normally non-differentiable. The researchers propose two approaches:

1. Latency Regularization Loss

They train a simple latency prediction model for each target hardware platform. For the \(i\)-th block, with candidate operations \(o_j^i\) chosen with probabilities \(p_j^i\), the expected latency is:

\[ \mathbb{E}[\text{latency}_i] = \sum_j p_j^i \times F(o_j^i) \]

where \(F(o_j^i)\) is the predicted latency of operation \(o_j^i\) on the target hardware. Since the total network latency is just the sum across blocks, this expectation is differentiable and can be added directly to the loss:

\[ Loss = Loss_{CE} + \lambda_1 \|w\|_2^2 + \lambda_2\, \mathbb{E}[\text{latency}] \]

A diagram showing how the expected latency is calculated from the probabilities of choosing different operations and added to the loss function.

Figure 3: Latency modeled as a differentiable regularization term guides search toward faster architectures.

Here, \(\lambda_2\) tunes the trade-off between accuracy and latency.
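
Below is a small sketch of how this latency term could be wired into the loss, using a per-op latency lookup table in place of the authors’ full prediction model. The block count, latency values, and \(\lambda_2\) are made-up illustrations:

```python
import torch
import torch.nn.functional as F

def expected_latency(alpha: torch.Tensor, op_latency_ms: torch.Tensor) -> torch.Tensor:
    """E[latency_i] = sum_j p_j^i * F(o_j^i) for a single block.

    alpha         : (N,) architecture parameters of the block
    op_latency_ms : (N,) predicted latency of each candidate op on the target device
    """
    probs = F.softmax(alpha, dim=0)
    return (probs * op_latency_ms).sum()

# Illustrative setup: 4 blocks, 3 candidate ops each, made-up latency predictions (ms).
alphas = [torch.zeros(3, requires_grad=True) for _ in range(4)]
latency_tables = [torch.tensor([2.1, 3.4, 0.8]) for _ in range(4)]

total_latency = sum(expected_latency(a, t) for a, t in zip(alphas, latency_tables))

# The expectation is differentiable in alpha, so it joins the loss directly.
lambda2 = 0.1                 # accuracy/latency trade-off (hypothetical value)
ce_loss = torch.tensor(2.3)   # placeholder for the cross-entropy term
loss = ce_loss + lambda2 * total_latency
loss.backward()               # gradients now flow into the architecture parameters
print([a.grad for a in alphas])
```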

2. REINFORCE-Based Alternative

Alternatively, the REINFORCE algorithm samples architectures, evaluates their rewards (accuracy and latency combined), and updates probabilities based on this reward signal. This avoids the need for latency to be differentiable, offering greater flexibility.
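
Here is a minimal sketch of such a policy-gradient update, assuming `reward_fn` returns a scalar that already combines validation accuracy and measured latency. The structure and names are illustrative rather than the paper’s implementation:

```python
import torch
import torch.nn.functional as F

def reinforce_step(alphas, reward_fn, optimizer, baseline=0.0):
    """One REINFORCE update of the architecture parameters.

    alphas    : list of (N,) tensors with requires_grad=True, one per block
    reward_fn : maps sampled op indices to a scalar reward that may combine
                accuracy and measured latency (neither needs to be differentiable)
    """
    log_prob = 0.0
    sampled = []
    for alpha in alphas:
        probs = F.softmax(alpha, dim=0)
        idx = torch.multinomial(probs.detach(), num_samples=1).item()
        sampled.append(idx)
        log_prob = log_prob + torch.log(probs[idx])
    # Policy-gradient surrogate: raise the log-probability of high-reward samples.
    loss = -(reward_fn(sampled) - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sampled

# Illustrative usage: a fake reward standing in for "accuracy minus latency penalty".
alphas = [torch.zeros(3, requires_grad=True) for _ in range(4)]
optimizer = torch.optim.Adam(alphas, lr=0.01)
reinforce_step(alphas, reward_fn=lambda ops: 0.7 - 0.01 * sum(ops), optimizer=optimizer)
```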


Results: ProxylessNAS in Action

State-of-the-Art on CIFAR-10

On CIFAR-10, the gradient-based ProxylessNAS (Proxyless-G) achieves a 2.08% test error with only 5.7M parameters—beating AmoebaNet-B’s 2.13% error while using over 6× fewer parameters.

A table comparing the test error and parameter count of various models on CIFAR-10. ProxylessNAS achieves the lowest error with one of the smallest models.

Table 1: ProxylessNAS achieves best-in-class performance with far fewer parameters than previous NAS models.

Dominating ImageNet on Mobile Devices

On ImageNet, with a mobile latency constraint of under 80 ms, ProxylessNAS reaches 74.6% top-1 accuracy at only 78 ms latency—beating MobileNetV2 (72.0%) and MnasNet (74.0%), while using only 200 GPU hours vs. MnasNet’s 40,000.

A table showing ImageNet performance under a mobile latency constraint. ProxylessNAS achieves higher accuracy than competitors with significantly less search cost.

Table 2: Mobile-specific ProxylessNAS models surpass manually designed and previous NAS-generated architectures with drastically reduced search cost.

A plot of Top-1 Accuracy vs. Mobile Latency. The ProxylessNAS curve is consistently above the MobileNetV2 curve, indicating a better trade-off.

Figure 4: Across latency settings, ProxylessNAS consistently dominates MobileNetV2, achieving the same accuracy at much lower latency.

To estimate mobile latency without maintaining a farm of physical devices, the authors trained an accurate, lightweight latency prediction model whose estimates closely match real measurements.

A scatter plot showing the strong correlation between predicted latency and real measured latency on a mobile device.

Figure 5: Predicted latency correlates tightly with real-world measurements, enabling device-aware NAS without costly setups.


One Size Does Not Fit All: Specialized Models for Each Hardware

ProxylessNAS also searched for architectures specialized for three platforms: mobile phone, CPU, and GPU.

A table showing the performance of models optimized for one hardware platform when run on others. The best performance for each platform is achieved by the model specialized for it.

Table 4: Hardware-optimized models perform best only on their target platform, underscoring the need for specialization.

The architectures discovered by ProxylessNAS for GPU, CPU, and mobile platforms. The models show distinct structural patterns.

Figure 6: GPU-targeted models tend to be shallow and wide; CPU-targeted ones are deeper and narrower; mobile-focused designs balance the two.

Key insights from the architectures:

  • GPU Models: Favor shallower, wider designs with large kernels (e.g., 7×7 MBConvs), leveraging massive parallelism.
  • CPU Models: Favor deeper, narrower designs with smaller kernels, matching CPU sequential processing.
  • Mobile Models: Find a balance given mobile resource constraints.

Conclusion: A New Era for Practical NAS

ProxylessNAS represents a major step forward in practical, efficient NAS. By introducing path binarization, it eliminates the crippling memory bottleneck in differentiable NAS, enabling direct architecture search on large-scale datasets like ImageNet—for the first time—at the cost of training just one regular network.

Key Takeaways:

  1. Proxies aren’t required anymore — architectures can be searched directly on the target task, narrowing the “proxy gap.”
  2. Search cost is reduced drastically — down to a few hundred GPU hours, comparable to training a single deep network.
  3. Hardware optimization is essential — the best architecture depends on where it will run; ProxylessNAS finds these specialized designs automatically.

This work democratizes Neural Architecture Search, moving it from being an exclusive tool for well-funded industrial labs to something accessible to everyday researchers and engineers. It shifts the NAS paradigm toward generating families of specialized models—each perfectly tuned for its intended deployment environment.