Designing a state-of-the-art neural network has often been described as a “dark art.” It requires deep expertise, countless hours of experimentation, and a healthy dose of intuition. From AlexNet and VGGNet to ResNet and DenseNet, each breakthrough architecture has been the product of painstaking human design.

But what if we could automate this process? What if, instead of manually designing architectures, we could design an algorithm that learns to design architectures for us?

This is the groundbreaking idea behind “Neural Architecture Search with Reinforcement Learning”, a 2017 paper from researchers at Google Brain. The authors propose a system where an AI—called the Controller—learns to generate high-performing neural network architectures entirely from scratch. The Controller improves over time, exploring a vast design space to discover novel architectures that rival—and sometimes surpass—the best human-invented designs.

In this article, we’ll dive into how you can use one neural network to design another, explore the reinforcement learning techniques that make this possible, and examine the impressive results that helped kick off a new era in automated machine learning.

The Challenge of Architecture Engineering

Before we explore the NAS solution, let’s understand the problem. In deep learning, progress often comes hand-in-hand with architectural innovation. While hyperparameter tuning (finding the right learning rate, batch size, optimizer) is already challenging, designing the architecture of a network is even more complex.

An architect must decide:

  • Depth: How many layers should the model have?
  • Layer Types: Convolutional? Recurrent? Pooling? Some combination?
  • Parameters per Layer: Filter sizes, strides, number of units.
  • Connectivity: Sequential stack? Skip connections like ResNet? Dense paths like DenseNet?

The space of possibilities is enormous. Traditional hyperparameter optimization methods—random search, Bayesian optimization—handle fixed-length parameter sets well, but struggle to design complex, conditional, variable-length architectures.

Neural Architecture Search (NAS) reframes architecture design as a learning problem itself.

The Core Idea: An AI Architect

The authors’ approach creates a feedback loop between two core components: a Controller that proposes architectures, and the child networks built from its proposals. Each iteration of the loop proceeds as follows:

  1. Generation: The Controller, a recurrent neural network (RNN), generates a sequence of tokens describing a network architecture, a blueprint.
  2. Training: The child network defined by that blueprint is built and trained on real data such as CIFAR-10.
  3. Reward Assessment: Once trained, the child network’s accuracy on a validation set becomes the reward signal.
  4. Controller Update: Using reinforcement learning, the Controller’s parameters are updated to favor architectural choices that produce better rewards.

This loop, illustrated in Figure 1, is repeated thousands of times. Over time, the Controller becomes a more skilled “AI architect.”

Figure 1: Overview of the NAS loop. The Controller RNN samples an architecture, the child network defined by it is trained, and the child’s validation accuracy serves as a reward used to update the Controller via reinforcement learning, so it learns to generate better architectures.
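
To make the loop concrete, here is a minimal, self-contained Python sketch of a few iterations. The controller, the search space, and the child training step are toy stand-ins (uniform sampling and a fake accuracy), intended only to show how information flows around the loop, not to reproduce the authors’ system.

```python
import random

# Toy search space; the real Controller predicts far more per-layer choices.
SEARCH_SPACE = {
    "num_layers":  [6, 12, 20],
    "filter_size": [3, 5, 7],
    "num_filters": [24, 36, 48, 64],
}

def sample_architecture():
    """Stand-in for the Controller RNN proposing a blueprint."""
    return {name: random.choice(options) for name, options in SEARCH_SPACE.items()}

def train_child_and_evaluate(arch):
    """Stand-in for building the child network, training it on CIFAR-10,
    and measuring validation accuracy (the reward R). In the paper this
    step takes hours of GPU time per architecture."""
    return random.random()

history = []
for step in range(10):                       # thousands of iterations in practice
    arch = sample_architecture()             # 1. Controller proposes a blueprint
    reward = train_child_and_evaluate(arch)  # 2./3. train the child, obtain reward R
    history.append((reward, arch))
    # 4. In the real system, (arch, reward) pairs drive a REINFORCE update of
    #    the Controller's weights, as described in the sections below.

print("best architecture sampled:", max(history, key=lambda pair: pair[0])[1])
```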

The Controller: Generating Blueprints with an RNN

Why use an RNN? Because architecture generation is inherently sequential: parameter choices for one layer may influence the next. An RNN can predict sequences conditioned on previous outputs.

For a simple convolutional network (CNN), the Controller predicts, for each layer:

  1. Filter height
  2. Filter width
  3. Stride height
  4. Stride width
  5. Number of filters

Each of these predictions is made one step at a time, with the output of one step fed as input to the next. The process continues until a predefined maximum number of layers is reached.

Figure 2: The Controller RNN sequentially samples hyperparameters for each convolutional layer (filter dimensions, strides, number of filters), with each prediction fed as input to the next time step.

This autoregressive design enables variable-length architectures and captures complex dependencies between hyperparameter choices.
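
Below is a hypothetical PyTorch sketch of such a controller. The per-layer choice lists mirror the paper’s CNN search space, but the hidden sizes, module names, and sampling loop are illustrative assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

# Discrete options the controller picks from for every convolutional layer.
CHOICES = {
    "filter_height": [1, 3, 5, 7],
    "filter_width":  [1, 3, 5, 7],
    "stride_height": [1, 2, 3],
    "stride_width":  [1, 2, 3],
    "num_filters":   [24, 36, 48, 64],
}

class Controller(nn.Module):
    def __init__(self, hidden=64, embed=32):
        super().__init__()
        self.cell = nn.LSTMCell(embed, hidden)
        self.embed = nn.ModuleDict(
            {name: nn.Embedding(len(opts), embed) for name, opts in CHOICES.items()})
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, len(opts)) for name, opts in CHOICES.items()})
        self.start = nn.Parameter(torch.zeros(1, embed))  # input at the first step

    def sample(self, num_layers=3):
        """Autoregressively sample a blueprint and its total log-probability."""
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        inp, arch, log_prob = self.start, [], 0.0
        for _ in range(num_layers):
            layer = {}
            for name, options in CHOICES.items():
                h, c = self.cell(inp, (h, c))
                dist = torch.distributions.Categorical(logits=self.heads[name](h))
                idx = dist.sample()
                log_prob = log_prob + dist.log_prob(idx)   # needed later for REINFORCE
                layer[name] = options[idx.item()]
                inp = self.embed[name](idx)                # feed the choice to the next step
            arch.append(layer)
        return arch, log_prob

controller = Controller()
blueprint, logp = controller.sample()
print(blueprint)
```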

Training the Controller: Learning with REINFORCE

Once the Controller defines an architecture, the corresponding child network is built and trained. After training, the child’s validation accuracy \( R \) is used as the reward.

The reward \( R \) is non-differentiable. You cannot directly backpropagate it through the entire generation-training process. Instead, the authors use REINFORCE, a classic policy gradient algorithm.

The objective is to maximize:

\[ J(\theta_c) = E_{P(a_{1:T};\theta_c)}[R] \]

Here \(\theta_c\) are the Controller’s parameters, and \(a_{1:T}\) are the sequence of predicted actions (architecture choices).

REINFORCE estimates the gradient:

\[ \nabla_{\theta_c} J(\theta_c) = \sum_{t=1}^T E_{P(a_{1:T};\theta_c)} \big[ \nabla_{\theta_c} \log P(a_t | a_{(t-1):1}; \theta_c) R \big] \]

In plain terms: if a sequence of choices leads to high accuracy, adjust \(\theta_c\) to make those choices more probable; choices that lead to low accuracy become less probable.

High variance in this estimator can make training unstable. To reduce it, the authors subtract a baseline \( b \), an exponential moving average of previous architectures’ rewards, so each reward contributes as \((R_k - b)\): architectures that beat the running average are reinforced, while below-average ones are discouraged. The resulting gradient estimate is

\[ \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta_c} \log P(a_t \mid a_{(t-1):1}; \theta_c) (R_k - b) \]

where \( m \) is the number of architectures sampled in one batch and \( T \) is the number of hyperparameters the Controller predicts per architecture.
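
The toy PyTorch snippet below shows this update on a made-up five-way “architecture” choice: the rewards stand in for child-network validation accuracies, and the learning rate and decay factor are arbitrary. Over a few hundred steps, the probability mass shifts toward the highest-reward choice.

```python
import torch

logits = torch.zeros(5, requires_grad=True)              # toy policy over 5 "architectures"
optimizer = torch.optim.Adam([logits], lr=0.1)
true_quality = torch.tensor([0.2, 0.4, 0.9, 0.5, 0.1])   # pretend validation accuracies

baseline, decay, m = 0.0, 0.95, 8                         # EMA baseline, m samples per batch
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((m,))                           # m sampled "architectures"
    rewards = true_quality[actions]                       # stand-in for rewards R_k
    # REINFORCE with baseline: ascend sum_k log P(a_k) * (R_k - b)
    loss = -(dist.log_prob(actions) * (rewards - baseline)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = decay * baseline + (1 - decay) * rewards.mean().item()

print(torch.softmax(logits, dim=0))   # most probability mass ends up on choice 2
```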

Scaling NAS: Parallelism at Massive Scale

Training each child network to convergence takes hours, and thousands of candidates must be evaluated, so the authors leaned on massive parallelism.

Figure 3: Distributed NAS training. A parameter server distributes the Controller weights to many Controller replicas; each replica samples multiple child architectures, which are trained in parallel, and the resulting accuracies are used to compute gradients that update the parameter server.

They used:

  • Parameter Servers storing the Controller’s shared weights.
  • Controller Replicas pulling weights and generating multiple architectures each.
  • Child Replicas training each sampled architecture concurrently.

In the experiments, they used 20 parameter-server shards, 100 Controller replicas, and \( m = 8 \) architectures per replica, so 800 child networks were trained in parallel on 800 GPUs.
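
As a rough illustration of this pattern (not the authors’ infrastructure), the sketch below replaces GPUs and parameter servers with a thread pool and a fake training function: each “replica” samples \( m \) architectures and evaluates them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_child(arch_id):
    """Stand-in for training one child network on its own GPU."""
    return arch_id, random.random()            # (architecture, validation accuracy)

def controller_replica(replica_id, m=8):
    """One Controller replica: sample m blueprints and train them in parallel."""
    blueprints = [f"replica{replica_id}-arch{k}" for k in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        results = list(pool.map(train_child, blueprints))
    return results                             # rewards feed this replica's gradient

# 4 replicas here; the paper ran 100 replicas with m = 8, i.e. 800 children at once.
all_results = [controller_replica(r) for r in range(4)]
print(all_results[0])
```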

Expanding the Search Space: Skip Connections

Modern networks often use skip connections (ResNet, DenseNet) to ease gradient flow and enable depth. The Controller was upgraded with an attention-like mechanism:

For each layer \( i \), it predicts a sigmoid probability of connecting to each prior layer \( j < i \):

\[ P(\text{Layer j is input to layer i}) = \sigma( v^T \tanh( W_{prev} h_j + W_{curr} h_i)) \]

Multiple connections are allowed; outputs are concatenated depth-wise.
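
A small PyTorch sketch of this scoring function follows; the hidden size and the independent yes/no sampling of each connection are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 32                                   # size of the controller's anchor states
W_prev = nn.Linear(hidden, hidden, bias=False)
W_curr = nn.Linear(hidden, hidden, bias=False)
v = nn.Linear(hidden, 1, bias=False)

def skip_connection_probs(h_prev_layers, h_i):
    """P(layer j is an input to layer i) for every earlier layer j."""
    scores = v(torch.tanh(W_prev(h_prev_layers) + W_curr(h_i)))   # (num_prev, 1)
    return torch.sigmoid(scores).squeeze(-1)                      # (num_prev,)

h_prev = torch.randn(4, hidden)               # anchor states of 4 earlier layers
h_cur = torch.randn(1, hidden)                # anchor state of the current layer
probs = skip_connection_probs(h_prev, h_cur)
connect = torch.bernoulli(probs)              # sample which skip connections to use
print(probs, connect)                         # selected outputs get concatenated depth-wise
```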

Figure 4: The Controller uses anchor points and an attention-like mechanism to form skip connections, allowing it to create complex, non-sequential network topologies.

Beyond CNNs: Designing New Recurrent Cells

What about RNNs? Standard LSTM cells are human-designed. Could NAS find something better?

The authors model RNN cell computation as a tree, with inputs \( x_t \) and \( h_{t-1} \). For each node, the Controller predicts:

  1. Combination method (e.g., addition, elementwise multiply)
  2. Activation function (e.g., tanh, sigmoid, relu)

This builds a computation tree that outputs \( h_t \) (and optionally \( c_t \) memory state).

Figure 5: Example of how the Controller designs a recurrent cell: it predicts a sequence of combination methods and activation functions for the nodes of a computation tree, which are then assembled into the cell’s final computation graph.

With a base size of 8 leaf inputs, the search space contained \(\approx 6 \times 10^{16}\) possible architectures.
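
To illustrate how a sequence of predictions becomes a working cell, here is a minimal Python sketch that assembles a two-leaf tree (“base 2” rather than the paper’s base 8) from made-up combination and activation choices. The learned weight matrices the real cell applies to \( x_t \) and \( h_{t-1} \), and the \( c_t \) memory path, are omitted for brevity.

```python
import torch

COMBINE = {"add": torch.add, "mul": torch.mul}
ACTIVATE = {"tanh": torch.tanh, "sigmoid": torch.sigmoid,
            "relu": torch.relu, "identity": lambda x: x}

def build_cell(node_choices):
    """node_choices: one (combination, activation) pair per tree node,
    leaves first and the root last."""
    def cell(x_t, h_prev):
        leaf0 = ACTIVATE[node_choices[0][1]](COMBINE[node_choices[0][0]](x_t, h_prev))
        leaf1 = ACTIVATE[node_choices[1][1]](COMBINE[node_choices[1][0]](x_t, h_prev))
        root = ACTIVATE[node_choices[2][1]](COMBINE[node_choices[2][0]](leaf0, leaf1))
        return root                            # the new hidden state h_t
    return cell

# Made-up controller predictions for the three nodes of a base-2 tree.
choices = [("add", "tanh"), ("mul", "sigmoid"), ("add", "relu")]
cell = build_cell(choices)
h_t = cell(torch.randn(5, 16), torch.randn(5, 16))   # batch of 5, hidden size 16
print(h_t.shape)
```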

Experiments and Results

NAS was tested on two demanding benchmarks:

CIFAR-10: Discovering a State-of-the-Art ConvNet

After evaluating roughly 12,800 architectures, NAS found a CNN with a 3.65% error rate, slightly better than the comparable DenseNet’s 3.74% while being 1.05x faster.

Table 1: NAS-discovered models vs. other state-of-the-art architectures on CIFAR-10. The NAS model reaches a top-tier error rate of 3.65%.

The discovered architecture preferred rectangular filters (like 7×5) and short skip connections.

Penn Treebank: Inventing a Better LSTM

NAS designed a new RNN cell with 62.4 test perplexity, beating the best prior model’s 66.0.

Table 2: The NAS-discovered recurrent cell vs. LSTMs and other models on Penn Treebank language modeling. The NAS cell achieves a new state-of-the-art perplexity of 62.4.

Applied to a character-level PTB task, the same cell also set a new benchmark.

Table 3: The NAS cell also achieves state-of-the-art performance on character-level PTB modeling, demonstrating that it generalizes to new tasks.

When dropped into Google’s Neural Machine Translation system without tuning, the cell improved BLEU score by 0.5—evidence of robust transferability.

Could this just be luck from trying many random architectures? No.

Figure 6: NAS with policy gradients vs. random search. The policy-gradient Controller consistently finds better architectures, with the gap widening over time, showing that a genuine learning process is taking place.

NAS’s Controller clearly learned to search more effectively over time.

Conclusion and Lasting Impact

“Neural Architecture Search with Reinforcement Learning” transformed how researchers think about network design. Key takeaways:

  1. Automation Works: Architecture design can be framed as a reinforcement learning problem, enabling a Controller to discover novel, high-performing architectures.
  2. State-of-the-Art Results: NAS matched or exceeded human-designed architectures across diverse tasks.
  3. New Research Era: Although early RL-based NAS was computationally expensive, it sparked massive research into more efficient NAS methods (e.g., differentiable search, one-shot models).

This work was a milestone toward meta-learning—machines learning how to learn. By automating one of AI’s most challenging tasks, NAS frees human researchers to pursue higher-level innovations. The NAS-discovered cell, released in TensorFlow as NASCell, remains a testament to machines building themselves.