For nearly a decade, the Rectified Linear Unit (ReLU) has been the undisputed champion of activation functions in deep learning. Its elegant simplicity—outputting the input if it’s positive and zero otherwise—was a breakthrough that unlocked practical training for very deep neural networks. Fast, effective, and easy to implement, ReLU quickly became the default choice across the AI community.

Many rivals have tried to dethrone ReLU. Alternatives like Leaky ReLU, ELU, and SELU promised improvements by tweaking how ReLU handles negative inputs. Yet none managed to replace it. Gains were often inconsistent across models and datasets, leaving practitioners to fall back on ReLU’s reliable simplicity.

This raises a compelling question: Is ReLU truly the best we can do, or have we just not found the right contender? And instead of hand-designing a better function guided by intuition, could we automatically search the mathematical universe for one?

This is precisely the idea behind the Google Brain paper “Searching for Activation Functions.” The authors turned activation design into a search problem, using reinforcement learning to explore a vast space of possible functions. The result? They discovered Swish, an activation defined as:

\[ f(x) = x \cdot \sigma(\beta x) \]

where \(\sigma\) is the sigmoid function and \(\beta\) is a constant or trainable parameter. Swish consistently outperformed ReLU on deep and challenging models. In this article, we’ll unpack how they found it, what makes it special, and why it might become your new go-to activation.


From Vanishing Gradients to the Reign of ReLU

In a neural network, each layer first applies a linear transformation (a matrix multiplication) to its input. Without activation functions, stacking many layers would collapse into a single large linear transformation, incapable of learning complex, nonlinear patterns like those in images, audio, or text.

Activation functions introduce essential non-linearity. Early choices like sigmoid and tanh faded in popularity due to the vanishing gradient problem: as gradients backpropagated through many layers, they shrank exponentially, making training deep networks infeasible.

ReLU, defined as \(f(x) = \max(x, 0)\), was a game changer. For positive inputs, its derivative is exactly 1, letting gradients flow without decay. This single property powered breakthroughs like AlexNet in 2012, marking the start of the deep learning boom. Despite issues like “dying ReLUs” (neurons stuck outputting zero), its effectiveness and speed cemented ReLU as the standard.
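
To see this concretely, here is a toy PyTorch sketch (no weight matrices, just 50 stacked activations; both the depth and the input value are arbitrary choices for illustration) comparing how much of a gradient survives a deep chain of sigmoids versus ReLUs:

```python
import torch

def input_gradient(activation, depth: int = 50) -> float:
    """Gradient of a chain of `depth` activations with respect to the input."""
    x = torch.tensor(0.5, requires_grad=True)
    y = x
    for _ in range(depth):
        y = activation(y)  # no linear layers: this isolates the activation's effect
    y.backward()
    return x.grad.item()

print(input_gradient(torch.sigmoid))  # vanishingly small: the gradient has all but disappeared
print(input_gradient(torch.relu))     # 1.0: the gradient passes through unchanged
```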


The Automated Search for a Better Activation

Instead of guessing and testing activation designs manually, the authors automated the process—requiring two key components:

1. Defining the Search Space

You can’t just tell a computer to “find the best activation function”; you need well-defined building blocks. The authors based their search space on a repeating core unit: two inputs, each processed by a unary function (one input, one output), then combined via a binary function (two inputs, one output).

A diagram showing the structure of a core unit used to build activation functions. Two inputs (x) are fed into separate unary functions, which are then combined by a binary function. This output can be fed into subsequent units.

Figure 1: Structure of a core unit for constructing activation functions.

Unary functions included:

  • Basic operations: \(x, -x, x^2, x^3, \sqrt{x}, \exp(x), \sin(x)\)
  • Existing activations: sigmoid \(\sigma(x)\), ReLU \(\max(x, 0)\)
  • Trainable constants like \(\beta\)

Binary functions combined inputs in various ways:

  • Addition: \(x_1 + x_2\)
  • Multiplication: \(x_1 \cdot x_2\)
  • Maximum: \(\max(x_1, x_2)\)

By composing these core units, the search could span a huge variety of candidate functions.
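
To make the search space concrete, here is a minimal Python sketch that samples a single core unit from a small subset of the unary and binary operations listed above; the operation set and the restriction to one unit are simplifications of the paper’s full space:

```python
import math
import random

# A small subset of the unary and binary building blocks described above.
UNARY = {
    "identity": lambda x: x,
    "negate":   lambda x: -x,
    "square":   lambda x: x * x,
    "sigmoid":  lambda x: 1.0 / (1.0 + math.exp(-x)),
    "relu":     lambda x: max(x, 0.0),
    "sin":      math.sin,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
}

def random_core_unit():
    """Sample one core unit: binary(unary1(x), unary2(x))."""
    u1, u2 = random.choice(list(UNARY)), random.choice(list(UNARY))
    b = random.choice(list(BINARY))
    fn = lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))
    return f"{b}({u1}(x), {u2}(x))", fn

name, act = random_core_unit()
print(name, act(1.0))
# Swish with beta = 1 corresponds to the unit mul(identity(x), sigmoid(x)).
```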

2. The Search Algorithm

With roughly \(10^{12}\) possible combinations, exhaustive search is infeasible. The authors instead used reinforcement learning, in the spirit of Neural Architecture Search (NAS).

They trained an RNN controller to generate candidate activation definitions—like a language model where the “words” are unary and binary ops.

A schematic showing how the RNN controller autoregressively predicts components of the activation function, feeding each prediction back as input for the next step.

Figure 2: RNN controller predicting activation function components.

The loop worked as follows:

  1. Generate: Controller outputs a function definition.
  2. Test: Build a “child network” (ResNet-20) using the new function and train it on CIFAR-10.
  3. Reward: Measure validation accuracy—used as the RL reward signal.
  4. Update: Reinforcement learning (Proximal Policy Optimization) nudges the controller to generate functions with higher accuracy.

Distributed training with many worker machines sped up exploration.
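
The sketch below is a heavily simplified stand-in for this loop: it reuses random_core_unit from the previous sketch, replaces the RNN controller and PPO update with naive random search plus greedy bookkeeping, and uses a placeholder train_and_evaluate where the paper actually trains a ResNet-20 child network on CIFAR-10:

```python
import random

def train_and_evaluate(activation) -> float:
    """Placeholder: in the paper this builds and trains a ResNet-20 child
    network on CIFAR-10 with `activation` and returns validation accuracy.
    Here it returns a random score so the loop runs end to end."""
    return random.random()

num_search_steps = 100                         # assumed search budget
best_reward, best_name = -1.0, None
for _ in range(num_search_steps):
    name, activation = random_core_unit()      # 1. generate a candidate function
    reward = train_and_evaluate(activation)    # 2.-3. test it and compute the reward
    if reward > best_reward:                   # 4. greedy bookkeeping; the paper instead
        best_reward, best_name = reward, name  #    updates the RNN controller with PPO
print(f"best candidate: {best_name} (reward {best_reward:.3f})")
```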


Findings: Simplicity Wins

After extensive searching, the team analyzed the top novel functions.

Two plots showing the graphs of the top novel activation functions discovered by the search. Functions include x·σ(βx), min(x, sin(x)), and cos(x) − x.

Figure 3: Graphs of the best activation functions discovered.

Key observations:

  • Simplicity: Best performers often used just one or two core units. Overly complex functions hurt optimization.
  • Common Structure: Many top functions took the form \(b(x, g(x))\), where the raw preactivation \(x\) is passed directly to the final binary function; ReLU itself fits this form as \(\max(x, 0)\).
  • Periodic Potential: Functions incorporating \(\sin\) or \(\cos\) showed promise—an underexplored area.
  • Division Traps: Functions with division often failed due to near-zero denominators causing exploding outputs.

To test generality, the top functions were evaluated on larger CIFAR models.

Table showing CIFAR-10 accuracy for several discovered activation functions across ResNet, Wide ResNet, and DenseNet models.

Table 1: CIFAR-10 accuracy for top discovered functions.

Table showing CIFAR-100 accuracy for several discovered activation functions across ResNet, Wide ResNet, and DenseNet models.

Table 2: CIFAR-100 accuracy for top discovered functions.

Two functions stood out: \(x \cdot \sigma(\beta x)\) and \(\max(x, \sigma(x))\). The former, dubbed Swish, showed slightly better generalization and became the focus for deeper study.


Meet Swish: \(f(x) = x \cdot \sigma(\beta x)\)

Swish is defined as:

\[ f(x) = x \cdot \sigma(\beta x), \quad \sigma(z) = \frac{1}{1 + e^{-z}} \]

with \(\beta\) as a constant or trainable parameter.
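
In code, Swish is a one-liner. A minimal PyTorch sketch with \(\beta\) passed as a fixed argument (Swish with \(\beta = 1\) is what PyTorch ships as nn.SiLU / F.silu):

```python
import torch

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Swish activation: f(x) = x * sigmoid(beta * x)."""
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-3.0, 3.0, steps=7)
print(swish(x))            # beta = 1: equivalent to torch.nn.functional.silu(x)
print(swish(x, beta=0.5))  # smaller beta: a flatter, more linear curve
```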

A plot of the Swish activation function (left) and its first derivative (right) for different values of the parameter β.

Figure 4: Swish curves and derivatives for various \(\beta\) values.

Properties:

  1. Smoothness: No sharp corner at zero—fully differentiable.
  2. Non-Monotonicity: For negative \(x\), Swish dips before rising—a unique “bump” most activations lack.
  3. Unbounded Above, Bounded Below: Like ReLU, Swish grows without bound for positive inputs and stays bounded on the negative side, but with a smooth curve.
  4. Interpolates Between Linear and ReLU: As \(\beta \to 0\), Swish approaches the linear function \(x/2\); as \(\beta \to \infty\), it approaches ReLU (see the quick check after this list).
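
The interpolation property is easy to verify numerically. A quick check, reusing the swish function from the sketch above (the tolerance and the use of \(\beta = 50\) as a stand-in for “large \(\beta\)” are arbitrary choices):

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9)

# beta -> 0: sigmoid(beta * x) -> 1/2, so Swish approaches the linear function x/2.
print(torch.allclose(swish(x, beta=1e-6), x / 2, atol=1e-4))          # True

# Large beta: sigmoid(beta * x) -> step function, so Swish approaches ReLU.
print(torch.allclose(swish(x, beta=50.0), torch.relu(x), atol=1e-4))  # True
```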

Why the Bump Matters

A histogram showing the distribution of preactivation values in a trained ResNet. A significant portion of values falls in the negative range where Swish has its characteristic bump.

Figure 6: Preactivation distribution for Swish.

A large share of preactivations land in the bump’s range (\(-5 \le x \le 0\)), implying the network leverages this shape during learning.

Trainable \(\beta\)

Swish’s \(\beta\) can be learned, letting each neuron adjust its curve.

A histogram showing the distribution of learned β values in a Mobile NASNet-A model. The values are spread out but centered near 1.0.

Figure 7: Distribution of learned \(\beta\) in Mobile NASNet-A.

Learned values clustered near \(\beta \approx 1\), but varied enough to suggest useful adaptability.
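
A sketch of how a trainable \(\beta\) might be wired up, assuming one \(\beta\) per channel for NCHW feature maps (the paper’s exact parameterization is not reproduced here):

```python
import torch
import torch.nn as nn

class TrainableSwish(nn.Module):
    """Swish with one learnable beta per channel (assumes NCHW tensors)."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(num_channels))  # initialised at beta = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        beta = self.beta.view(1, -1, 1, 1)  # broadcast over batch and spatial dims
        return x * torch.sigmoid(beta * x)

act = TrainableSwish(num_channels=64)
x = torch.randn(8, 64, 32, 32)
print(act(x).shape)            # torch.Size([8, 64, 32, 32])
print(act.beta.requires_grad)  # True: beta is updated alongside the network's weights
```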


Swish vs. The Field

The authors compared Swish against strong baselines:

  • LReLU (Leaky ReLU): Small fixed slope for \(x<0\).
  • PReLU (Parametric ReLU): Learnable slope for \(x<0\).
  • Softplus: Smooth \(\log(1+\exp(x))\) approximation to ReLU.
  • ELU: Uses exponentials for \(x<0\).
  • SELU: Scaled ELU for self-normalizing nets.
  • GELU: Smooth, non-monotonic curve similar to Swish.

A summary table showing how many times Swish outperformed, was equivalent to, or underperformed various baseline activation functions across all experiments.

Table 3: Aggregate comparison results—Swish wins most matchups.

CIFAR Results

On CIFAR-10/100, Swish (trainable or fixed \(\beta=1\)) matched or beat ReLU across all models, often matching the strongest baseline.

Table of CIFAR-10 results comparing Swish to other activation functions on three different models.

Table 4: CIFAR-10 accuracy across models.

Table of CIFAR-100 results comparing Swish to other activation functions on three different models.

Table 5: CIFAR-100 accuracy across models.

ImageNet Results

Simply swapping ReLU for Swish in ImageNet architectures yielded gains:

Training curves for Mobile NASNet-A on ImageNet, showing Swish (orange) achieving higher validation accuracy than ReLU (blue).

Figure 8: Mobile NASNet-A training curves—Swish tops ReLU.

On Mobile NASNet-A, Swish improved top-1 accuracy by +0.9%. On Inception-ResNet-v2, the boost was +0.6%.

Table of ImageNet results for Mobile NASNet-A, showing Swish consistently outperforming ReLU and other baselines across three runs.

Table 6: Mobile NASNet-A ImageNet results.

Table of ImageNet results for Inception-ResNet-v2, showing Swish outperforming ReLU across three runs.

Table 7: Inception-ResNet-v2 ImageNet results.


Machine Translation

To test cross-domain applicability, Swish was used in a Transformer model for WMT14 English→German translation. Again, it was top-tier.

Table of BLEU scores for machine translation, showing Swish-1 (Swish with β=1) achieving the best performance on several test sets.

Table 11: BLEU scores across newstest sets—Swish excels.


Conclusion: Swish as the New Default?

This work exemplifies the power of meta-learning, here applied to learning an activation function, to overturn long-held assumptions. The team didn’t rely on hunches; they searched a vast mathematical space and found an activation that works better than the reigning champion.

Swish, \(f(x) = x \cdot \sigma(\beta x)\), is:

  • Simple: A one-line change in most deep learning frameworks (see the sketch after this list).
  • Effective: Consistent performance boosts across tasks from vision to translation.
  • Drop-in Ready: Gains were achieved without retuning architectures; there is likely even more headroom for models designed around Swish from the start.
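
As a concrete example of how cheap the swap is, here is one way to retrofit an existing PyTorch model by replacing every nn.ReLU module with the built-in nn.SiLU (Swish with \(\beta = 1\)); torchvision’s resnet18 is used purely as a stand-in for any ReLU-based architecture, assuming torchvision is installed:

```python
import torch.nn as nn
from torchvision.models import resnet18  # any ReLU-based model will do

def swap_relu_for_swish(module: nn.Module) -> None:
    """Recursively replace every nn.ReLU in a model with nn.SiLU (Swish, beta = 1)."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.SiLU())
        else:
            swap_relu_for_swish(child)

model = resnet18()
swap_relu_for_swish(model)  # the rest of the architecture is left untouched
```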

Swish’s success challenges the notion that ReLU’s constant unit gradient is essential for training deep networks, at least now that residual connections and normalization layers are standard. Its smooth, non-monotonic profile may offer advantages that ReLU cannot.

For practitioners seeking a quick win, swapping ReLU for Swish may be one of the easiest, most promising changes to try.

The king is dead; long live Swish.