For nearly a decade, the Rectified Linear Unit (ReLU) has been the undisputed champion of activation functions in deep learning. Its elegant simplicity—outputting the input if it’s positive and zero otherwise—was a breakthrough that unlocked practical training for very deep neural networks. Fast, effective, and easy to implement, ReLU quickly became the default choice across the AI community.
Many rivals have tried to dethrone ReLU. Alternatives like Leaky ReLU, ELU, and SELU promised improvements by tweaking how ReLU handles negative inputs. Yet none managed to replace it. Gains were often inconsistent across models and datasets, leaving practitioners to fall back on ReLU’s reliable simplicity.
This raises a compelling question: Is ReLU truly the best we can do, or have we just not found the right contender? And instead of hand-designing a better function guided by intuition, could we automatically search the mathematical universe for one?
This is precisely the idea behind the Google Brain paper “Searching for Activation Functions.” The authors turned activation design into a search problem, using reinforcement learning to explore a vast space of possible functions. The result? They discovered Swish, an activation defined as:
\[ f(x) = x \cdot \sigma(\beta x) \]
where \(\sigma\) is the sigmoid function and \(\beta\) is a constant or trainable parameter. Swish consistently outperformed ReLU on deep and challenging models. In this article, we’ll unpack how they found it, what makes it special, and why it might become your new go-to activation.
From Vanishing Gradients to the Reign of ReLU
In a neural network, each layer first applies a linear transformation (a matrix multiplication) to its input. Without activation functions, stacking many layers would collapse into a single large linear transformation, incapable of learning complex, nonlinear patterns like those in images, audio, or text.
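A quick way to see this collapse is with two weight matrices and no activation between them. The NumPy sketch below (illustrative only, not tied to any particular network) shows that the stacked layers are exactly one linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # a batch of 4 inputs with 8 features
W1 = rng.normal(size=(8, 16))    # first linear layer
W2 = rng.normal(size=(16, 5))    # second linear layer

# Two stacked linear layers with no activation in between...
two_layers = (x @ W1) @ W2

# ...are exactly one linear layer whose weight matrix is W1 @ W2.
collapsed = x @ (W1 @ W2)

print(np.allclose(two_layers, collapsed))  # True: depth adds no expressive power here
```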
Activation functions introduce essential non-linearity. Early choices like sigmoid and tanh faded in popularity due to the vanishing gradient problem: as gradients backpropagated through many layers, they shrank exponentially, making training deep networks infeasible.
ReLU, defined as \(f(x) = \max(x, 0)\), was a game changer. For positive inputs, its derivative is exactly 1, letting gradients flow without decay. This single property powered breakthroughs like AlexNet in 2012, marking the start of the deep learning boom. Despite issues like “dying ReLUs” (neurons stuck outputting zero), its effectiveness and speed cemented ReLU as the standard.
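To get a feel for the difference, here is a rough back-of-the-envelope sketch (it ignores weights and biases, so it only illustrates the scale of the effect): chaining sigmoid derivatives through many layers shrinks the gradient toward zero, while ReLU's unit slope leaves it intact for positive inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, reached at z = 0

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)  # exactly 1 for positive inputs

z = 0.5       # a positive pre-activation, repeated layer after layer
depth = 30

# Chain rule through `depth` identical layers (weights ignored for simplicity):
print(sigmoid_grad(z) ** depth)   # ~1e-19: the gradient has all but vanished
print(relu_grad(z) ** depth)      # 1.0: the gradient passes through untouched
```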
The Automated Search for a Better Activation
Instead of guessing and testing activation designs manually, the authors automated the process. Doing so required two key components:
1. Defining the Search Space
You can’t just tell a computer “find the best activation function.” You have to give it concrete building blocks to combine. The authors based their search space on a repeating core unit: two inputs, each processed by a unary function (one input, one output), then combined via a binary function (two inputs, one output).
Figure 1: Structure of a core unit for constructing activation functions.
Unary functions included:
- Basic operations: \(x, -x, x^2, x^3, \sqrt{x}, \exp(x), \sin(x)\)
- Existing activations: sigmoid \(\sigma(x)\), ReLU \(\max(x, 0)\)
- Trainable constants like \(\beta\)
Binary functions combined inputs in various ways:
- Addition: \(x_1 + x_2\)
- Multiplication: \(x_1 \cdot x_2\)
- Maximum: \(\max(x_1, x_2)\)
By composing these core units, the search could span a huge variety of candidate functions.
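The construction is easy to mimic in code. The sketch below covers only a subset of the ops, and the helper `core_unit` is my own illustration rather than the paper's implementation; note that Swish itself lives in this space as \(\text{mul}(x, \sigma(x))\):

```python
import numpy as np

# A small subset of the unary and binary building blocks.
UNARY = {
    "identity": lambda x: x,
    "negate":   lambda x: -x,
    "square":   lambda x: x ** 2,
    "sigmoid":  lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu":     lambda x: np.maximum(x, 0.0),
    "sin":      np.sin,
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
}

def core_unit(unary1, unary2, binary):
    """Build one core unit: binary(unary1(x), unary2(x))."""
    u1, u2, b = UNARY[unary1], UNARY[unary2], BINARY[binary]
    return lambda x: b(u1(x), u2(x))

# Swish (with beta fixed at 1) is one point in this space: x * sigmoid(x).
swish_like = core_unit("identity", "sigmoid", "mul")
print(swish_like(np.array([-2.0, 0.0, 2.0])))  # [-0.2384  0.  1.7616]
```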
2. The Search Algorithm
With potentially \(10^{12}\) combinations, exhaustive search is infeasible. The authors instead used reinforcement learning, similar to Neural Architecture Search (NAS).
They trained an RNN controller to generate candidate activation definitions—like a language model where the “words” are unary and binary ops.
Figure 2: RNN controller predicting activation function components.
The loop worked as follows:
- Generate: Controller outputs a function definition.
- Test: Build a “child network” (ResNet-20) using the new function and train it on CIFAR-10.
- Reward: Measure validation accuracy—used as the RL reward signal.
- Update: Reinforcement learning (Proximal Policy Optimization) nudges the controller to generate functions with higher accuracy.
Distributed training with many worker machines sped up exploration.
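The full machinery (an RNN controller updated with PPO, ResNet-20 child networks, many distributed workers) is heavy, but the outer loop is easy to sketch. The toy below substitutes random sampling for the learned controller and a tiny synthetic regression task for CIFAR-10, so it illustrates only the generate → test → reward pattern, not the paper's actual search:

```python
import random
import torch
import torch.nn as nn

UNARY = {"identity": lambda x: x, "negate": torch.neg,
         "sigmoid": torch.sigmoid, "relu": torch.relu, "sin": torch.sin}
BINARY = {"add": torch.add, "mul": torch.mul, "max": torch.maximum}

def sample_candidate():
    """'Generate': random sampling as a stand-in for the RNN controller."""
    u1, u2 = random.choice(list(UNARY)), random.choice(list(UNARY))
    b = random.choice(list(BINARY))
    return f"{b}({u1}(x), {u2}(x))", lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))

def reward(act_fn, steps=200):
    """'Test': train a tiny stand-in child network on a toy regression task.
    The negative final training loss plays the role of the validation-accuracy reward."""
    torch.manual_seed(0)
    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    y = torch.sin(2 * x) + 0.5 * x                    # toy target function
    layer1, layer2 = nn.Linear(1, 32), nn.Linear(32, 1)
    opt = torch.optim.Adam(list(layer1.parameters()) + list(layer2.parameters()), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((layer2(act_fn(layer1(x))) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return -loss.item()

# 'Update' is omitted: the paper nudges the controller with PPO; here we just keep the best.
best = max((reward(fn), name) for name, fn in (sample_candidate() for _ in range(10)))
print("best candidate:", best[1], "reward:", best[0])
```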
Findings: Simplicity Wins
After extensive searching, the team analyzed the top novel functions.
Figure 3: Graphs of the best activation functions discovered.
Key observations:
- Simplicity: Best performers often used just one or two core units. Overly complex functions hurt optimization.
- Common Structure: Many top functions followed the form \(b(x, g(x))\), where the raw preactivation \(x\) is passed directly as one input to the final binary function; ReLU itself fits this form (\(\max(x, 0)\)).
- Periodic Potential: Functions incorporating \(\sin\) or \(\cos\) showed promise—an underexplored area.
- Division Traps: Functions with division often failed due to near-zero denominators causing exploding outputs.
To test generality, the top functions were evaluated on larger CIFAR models.
Table 1: CIFAR-10 accuracy for top discovered functions.
Table 2: CIFAR-100 accuracy for top discovered functions.
Two functions stood out: \(x \cdot \sigma(\beta x)\) and \(\max(x, \sigma(x))\). The former, dubbed Swish, showed slightly better generalization and became the focus for deeper study.
Meet Swish: \(f(x) = x \cdot \sigma(\beta x)\)
Swish is defined as:
\[ f(x) = x \cdot \sigma(\beta x), \quad \sigma(z) = \frac{1}{1 + e^{-z}} \]
with \(\beta\) as a constant or trainable parameter.
Figure 4: Swish curves and derivatives for various \(\beta\) values.
Properties:
- Smoothness: No sharp corner at zero—fully differentiable.
- Non-Monotonicity: For negative \(x\), Swish dips before rising—a unique “bump” most activations lack.
- Unbounded Positive, Bounded Negative: Like ReLU, but smoother.
- Interpolates Between Linear & ReLU: As \(\beta \to 0\), Swish → linear (\(x/2\)). As \(\beta \to \infty\), it → ReLU.
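A minimal NumPy sketch of the definition and the two limiting cases (the specific \(\beta\) values are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

print(swish(x, beta=1.0))    # Swish-1, with its small negative "bump" left of zero
print(swish(x, beta=1e-3))   # beta -> 0: approaches the scaled linear function x / 2
print(swish(x, beta=100.0))  # beta -> infinity: approaches ReLU, max(x, 0)
```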
Why the Bump Matters
Figure 6: Preactivation distribution for Swish.
A large share of preactivations land in the bump’s range (\(-5 \le x \le 0\)), implying the network leverages this shape during learning.
Trainable \(\beta\)
Swish’s \(\beta\) can be learned, letting each neuron adjust its curve.
Figure 7: Distribution of learned \(\beta\) in Mobile NASNet-A.
Learned values clustered near \(\beta \approx 1\), but varied enough to suggest useful adaptability.
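In a framework like PyTorch, a trainable \(\beta\) amounts to one extra parameter per activation module. The sketch below uses a single scalar \(\beta\) shared across the layer, which is a simplification rather than the authors' exact parameterization:

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish with a learnable beta: f(x) = x * sigmoid(beta * x)."""
    def __init__(self, beta_init=1.0):
        super().__init__()
        # A single scalar beta, updated by backprop along with the other weights.
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

# Drop it into a model like any other activation module.
model = nn.Sequential(nn.Linear(64, 128), Swish(), nn.Linear(128, 10))
print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 10])
print(model[1].beta)                    # starts at 1.0, then learned during training
```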
Swish vs. The Field
The authors compared Swish against strong baselines:
- LReLU: Fixed slope for \(x<0\).
- PReLU: Learnable slope for \(x<0\).
- Softplus: Smooth \(\log(1+\exp(x))\) approximation to ReLU.
- ELU: Uses exponentials for \(x<0\).
- SELU: Scaled ELU for self-normalizing nets.
- GELU: Smooth, non-monotonic curve similar to Swish.
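For reference, here is a NumPy sketch of these baselines using their commonly published formulas and constants; the paper's experiments may use different parameter settings (for example, the leaky slope), so treat the defaults below as illustrative:

```python
import numpy as np

def lrelu(x, alpha=0.01):               # Leaky ReLU: small fixed slope for x < 0
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):                    # PReLU: same shape, but alpha is learned
    return np.where(x > 0, x, alpha * x)

def softplus(x):                        # smooth approximation to ReLU
    return np.log1p(np.exp(x))

def elu(x, alpha=1.0):                  # exponential for x < 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):  # scaled ELU with self-normalizing constants
    return lam * elu(x, alpha)

def gelu(x):                            # x * Phi(x), tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-3, 3, 7)
print(np.round(swish(x) - gelu(x), 3))  # similar shapes, but not the same function
```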
Table 3: Aggregate comparison results—Swish wins most matchups.
CIFAR Results
On CIFAR-10/100, Swish (trainable or fixed \(\beta=1\)) matched or beat ReLU across all models, often matching the strongest baseline.
Table 4: CIFAR-10 accuracy across models.
Table 5: CIFAR-100 accuracy across models.
ImageNet Results
Simply swapping ReLU for Swish in ImageNet architectures yielded gains:
Figure 8: Mobile NASNet-A training curves—Swish tops ReLU.
On Mobile NASNet-A, Swish improved top-1 accuracy by +0.9%. On Inception-ResNet-v2, the boost was +0.6%.
Table 6: Mobile NASNet-A ImageNet results.
Table 7: Inception-ResNet-v2 ImageNet results.
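Reproducing this kind of drop-in swap in your own PyTorch code can be as simple as replacing every ReLU module. The helper below is a generic sketch (the name `swap_relu_for_swish` is mine, not from the paper), using Swish with \(\beta = 1\) so no new parameters are introduced; newer PyTorch versions also ship this exact function as `nn.SiLU`:

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish-1: f(x) = x * sigmoid(x), i.e. beta fixed at 1 for a pure drop-in swap."""
    def forward(self, x):
        return x * torch.sigmoid(x)

def swap_relu_for_swish(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU in `model` with Swish, leaving weights untouched."""
    for name, child in model.named_children():
        if isinstance(child, nn.ReLU):
            setattr(model, name, Swish())
        else:
            swap_relu_for_swish(child)
    return model

# Example: a small convolutional block originally built with ReLU.
block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
print(swap_relu_for_swish(block))  # both ReLU modules are now Swish
```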
Machine Translation
To test cross-domain applicability, Swish was used in a Transformer model for WMT14 English→German translation. Again, it was top-tier.
Table 11: BLEU scores across newstest sets—Swish excels.
Conclusion: Swish as the New Default?
This work exemplifies the power of meta-learning, here applied to learning an activation function, to overturn long-held assumptions. The team didn’t rely on hunches; they explored a vast mathematical space and found an activation that works better than the reigning champion.
Swish, \(f(x) = x \cdot \sigma(\beta x)\), is:
- Simple: One-line change in most deep learning frameworks.
- Effective: Consistent performance boosts across tasks from vision to translation.
- Drop-in Ready: Gains were achieved without retuning architectures; there is likely more to gain from models designed around Swish from the start.
Swish’s success challenges the notion that ReLU’s constant gradient is crucial for deep nets in the era of residual connections and normalization layers. Its smooth, non-monotonic profile may bring advantages ReLU cannot.
For practitioners seeking a quick win, swapping ReLU for Swish may be one of the easiest, most promising changes to try.
The king is dead; long live Swish.