In the history of deep learning, certain papers mark a turning point—a moment when a seemingly simple idea unlocks a new level of performance and understanding. The 2011 paper “Deep Sparse Rectifier Neural Networks” by Xavier Glorot, Antoine Bordes, and Yoshua Bengio is one such work. Before this paper, training deep neural networks was a notoriously tricky process, often requiring complex, multi-stage unsupervised pre-training to achieve good results.
The standard activation functions of the time—the logistic sigmoid and hyperbolic tangent (tanh)—suffered from a critical flaw: the vanishing gradient problem. As error signals propagated backward through many layers, the gradients would shrink exponentially, making it nearly impossible for the early layers of the network to learn.
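As a simplified scalar illustration (not the paper's notation): for a chain of sigmoid units \(h_k = \sigma(z_k)\) with pre-activations \(z_k = w_k h_{k-1}\), the backpropagated gradient of the loss \(\mathcal{L}\) picks up a factor \(w_k\,\sigma'(z_k)\) per layer, and since \(\sigma'(z) = \sigma(z)(1-\sigma(z)) \le \tfrac{1}{4}\),
\[ \left| \frac{\partial \mathcal{L}}{\partial h_0} \right| = \left| \frac{\partial \mathcal{L}}{\partial h_n} \right| \prod_{k=1}^{n} |w_k|\,\sigma'(z_k) \;\le\; \left| \frac{\partial \mathcal{L}}{\partial h_n} \right| \prod_{k=1}^{n} \frac{|w_k|}{4}. \]
Unless the weights are unusually large, this factor shrinks exponentially with depth \(n\).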
This paper introduced a refreshingly simple alternative inspired by neuroscience: the rectifier activation function, defined as:
\[ \text{rectifier}(x) = \max(0, x) \]
This function, now ubiquitously known as the Rectified Linear Unit (ReLU), not only sidestepped the vanishing gradient problem but also introduced a powerful property called sparsity. The authors demonstrated that deep networks using rectifiers could be trained effectively with standard supervised learning, often outperforming their tanh-based counterparts without the need for unsupervised pre-training.
Let’s explore why this elegant idea changed the game.
Background: The Brain, Sparsity, and Activation Functions
To appreciate the contribution, we need to understand the state of neural network research at the time—particularly the gap between machine learning models and principles observed in computational neuroscience.
Lessons from Neuroscience
Neuroscientists have long observed that neurons in the brain fire sparsely: at any given moment, only a small fraction of neurons (estimated at 1–4%) are active. This keeps energy expenditure low while still providing rich representations.
Furthermore, biological neurons behave differently from the popular sigmoid or tanh functions. A common biological model—the Leaky Integrate-and-Fire (LIF)—has a one-sided response curve, firing only for sufficiently high inputs. In contrast, tanh is antisymmetric: strong negative input produces strong negative output, a property absent from real neurons.
The Allure of Sparsity
Sparsity isn’t just a biological curiosity—it’s highly desirable in machine learning models:
- Information Disentangling: Sparse representations change little with small variations in the input, making it easier for models to separate causal factors of variation.
- Efficient Representation: Inputs of varying complexity can be represented by varying the number of active neurons.
- Linear Separability: Sparse, high-dimensional representations are often more easily separable.
- Computational Efficiency: Zero outputs mean skipped computations.
Earlier techniques to induce sparsity (e.g., \(L_1\) penalties) often resulted in small but nonzero activations. This paper showcased a way to get true zeros naturally.
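The contrast is easy to see numerically. Here is a minimal NumPy sketch (random pre-activations standing in for a layer at initialization; not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 256))   # illustrative pre-activations

tanh_out = np.tanh(z)
relu_out = np.maximum(0.0, z)

# Smooth activations produce many small values but essentially no exact zeros;
# the rectifier clips every negative pre-activation to a hard zero.
print("tanh exact zeros:", np.mean(tanh_out == 0.0))   # ~0.0
print("ReLU exact zeros:", np.mean(relu_out == 0.0))   # ~0.5
```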
The Core Idea: Deep Rectifier Networks
The authors replaced sigmoid and tanh with the simple rectifier:
\[ \text{rectifier}(x) = \max(0, x) \]
It's piecewise linear: zero for negative inputs, linear for positive inputs.
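Its derivative is just as simple, which is what matters for backpropagation:
\[ \frac{d}{dx}\,\text{rectifier}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases} \]
with the value at \(x = 0\) fixed by convention (typically 0 or 1).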
Advantages
- True Sparsity: Neurons receiving negative input output hard zeros. With weights initialized near zero, roughly 50% of neurons are inactive for any given input.
- No Vanishing Gradients: Active neurons have a derivative of 1, allowing gradients to pass unchanged during backpropagation along active paths.
- Computational Simplicity: A \(\max\) operation is far cheaper than the exponentials used in sigmoid/tanh (see the sketch after this list).
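A minimal NumPy sketch of a single rectifier layer, forward and backward, illustrating the sparsity and gradient behavior above (names, shapes, and the initialization scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 100))           # a batch of inputs
W = rng.standard_normal((100, 200)) * 0.01   # weights initialized near zero
b = np.zeros(200)

z = x @ W + b                  # pre-activations
h = np.maximum(0.0, z)         # rectifier: hard zeros for negative pre-activations

print("fraction of inactive units:", np.mean(h == 0.0))   # ~0.5 at initialization

# Backward pass through the rectifier: the gradient is passed through unchanged
# (factor 1) for active units and blocked (factor 0) for inactive ones.
grad_h = rng.standard_normal(h.shape)   # gradient arriving from the layer above
grad_z = grad_h * (z > 0)               # no shrinking along active paths
```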
Potential Problems — and Why They Don’t Derail Training
- Non-differentiable at Zero: The kink at \(x = 0\) seldom causes issues in practice; one simply picks a subgradient, e.g., treating the derivative at zero as 0 or 1.
- Dying ReLU: "Dead" neurons that never activate receive no gradient and stop updating their weights. The authors hypothesize that hard zeros do not hurt optimization as long as gradients can still flow along the paths of active neurons, concentrating credit assignment on those paths.
- Unbounded Activations: Activations can grow arbitrarily large, so the authors add a small \(L_1\) penalty on the activations to encourage stability and extra sparsity (a sketch follows below).
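A sketch of how such an activation penalty can enter the training objective (the helper name and the coefficient are illustrative, not the paper's exact setup):

```python
import numpy as np

def l1_activation_penalty(hidden_activations, lam=1e-4):
    """Illustrative helper: L1 penalty summed over all hidden-layer activations."""
    return lam * sum(np.abs(h).sum() for h in hidden_activations)

# Example usage with one rectifier layer's activations:
rng = np.random.default_rng(3)
h1 = np.maximum(0.0, rng.standard_normal((32, 200)))
# total_loss = task_loss + l1_activation_penalty([h1, ...])
print(l1_activation_penalty([h1]))

# The penalty's subgradient is lam * sign(h), so it pushes small positive
# activations toward zero while contributing nothing at exact zeros.
```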
Rectifiers in Unsupervised Pre-training
In 2011, deep networks were often trained with layer-wise unsupervised pre-training via denoising autoencoders before supervised fine-tuning. The authors adapted this to rectifiers.
Challenge: Using a rectifier in the reconstruction layer is problematic—if it outputs zero for a nonzero target, gradients stop.
They experimented with solutions:
Image Data: Use softplus (\(\log(1+e^x)\)) in the reconstruction layer with a quadratic cost.
Text Data: Scale hidden activations to [0, 1] and use a sigmoid reconstruction layer with cross-entropy cost (both variants are sketched below).
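A rough sketch of the two reconstruction setups around a rectifier encoder; the tied weights, masking noise, and per-example scaling rule here are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)        # log(1 + e^x), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
W = rng.standard_normal((784, 500)) * 0.01
x = rng.random((32, 784))                      # generic input in [0, 1]
x_noisy = x * (rng.random(x.shape) > 0.25)     # masking (denoising) corruption

h = np.maximum(0.0, x_noisy @ W)               # rectifier encoder

# Image data: softplus reconstruction layer with a quadratic cost.
recon_img = softplus(h @ W.T)
cost_img = 0.5 * np.mean((recon_img - x) ** 2)

# Text data: scale hidden activations into [0, 1], then use a sigmoid
# reconstruction layer with a cross-entropy cost.
h_scaled = h / (h.max(axis=1, keepdims=True) + 1e-8)
recon_txt = sigmoid(h_scaled @ W.T)
eps = 1e-8
cost_txt = -np.mean(x * np.log(recon_txt + eps)
                    + (1 - x) * np.log(1 - recon_txt + eps))
print(cost_img, cost_txt)
```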
Experiments and Results
Image Recognition
Benchmarks: MNIST, CIFAR10, NISTP, NORB.
Key Findings:
- Pre-training Gap Closed: Rectifiers perform nearly identically with or without pre-training (NORB: 16.40% without pre-training vs. 16.46% with), unlike tanh/softplus.
- Hard Zeros Win: Hard-zero rectifiers outperform smooth softplus activations.
- High Natural Sparsity: Average sparsity in the hidden layers ranges from roughly 68% to 83% zeros.
To probe sparsity’s role, they trained 200 rectifier networks on MNIST with varying \(L_1\) penalties.
Performance is optimal and stable between 70–85% sparsity.
Semi-Supervised Setting
Does pre-training ever help rectifiers? Yes—when labeled data is scarce.
Findings:
- Low Data: Pre-training boosts performance significantly.
- Ample Data: Differences vanish.
Rectifiers can learn directly from large labeled sets but still benefit from unlabeled data when labels are scarce.
Sentiment Analysis — Are Rectifiers Just for Images?
To test generality, the authors tackled sentiment analysis on a dataset of restaurant reviews (OpenTable). Represented as “bag-of-words,” the text vectors are highly sparse (~0.6% nonzeros).
Results:
- Depth Helps: 3-layer rectifier nets achieve 0.746 RMSE vs. 0.807 for single-layer.
- Rectifiers Beat Tanh: 3-layer rectifier outperforms tanh (0.774 RMSE).
- Sparsity Preserved: 53.9% sparsity achieved in hidden layers.
They also validated on the Amazon sentiment benchmark, achieving 78.95% accuracy—outperforming prior best results (73.72%).
Conclusion and Lasting Impact
- Rectifiers Are Superior Activations: The simple \(\max(0, x)\) mitigates vanishing gradients, enabling effective training of deep networks.
- Sparsity Is Powerful: True zeros create biologically plausible, computationally efficient representations aiding optimization and generalization.
- Supervised Deep Learning Becomes Practical: With rectifiers, large deep networks train well via standard backprop alone, removing the necessity for complex pre-training.
Following its introduction, ReLU became the default activation for deep learning, from CNNs in vision to transformers in language models. This paper is a landmark example of how neuroscience-inspired simplicity can lead to profound advancements in AI, unlocking the potential of deep learning across domains.