From the melodies we listen to and the sentences we read to the raw waveforms of our speech, the world around us is filled with sequences. For machine learning, understanding and generating this kind of data is a monumental challenge. How can a model grasp the grammatical structure of a long paragraph, or compose a melody that feels coherent from start to finish? The key lies in memory: specifically, the ability to store information over long spans of time.

For years, Recurrent Neural Networks (RNNs) have been the go-to tool for sequence modeling. Their defining feature is a hidden state that acts as a memory, carrying information from one step of a sequence to the next. But the classic RNN has a critical flaw: its memory quickly fades. It struggles to hold onto information over long distances, a limitation known as the vanishing gradient problem.

To tackle this, researchers developed more sophisticated RNN units with gates — mechanisms that carefully control the flow of information. The most famous of these is the Long Short-Term Memory (LSTM) unit. More recently, a simpler alternative called the Gated Recurrent Unit (GRU) has surged in popularity. This raises a crucial question for practitioners: which one is better?

In a foundational 2014 paper, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, researchers from the University of Montreal (including Yoshua Bengio) put these architectures to the test. This article will break down their work, exploring the inner workings of LSTMs and GRUs and analyzing their head-to-head performance on challenging sequence modeling tasks.


Background: The Standard RNN and Its Memory Problem

Before delving into gated units, let’s quickly review how a standard RNN works.

An RNN processes a sequence one element at a time. At each step \(t\), it takes the current input \(x_t\) and its previous hidden state \(h_{t-1}\) to compute the new hidden state \(h_t\):

\[
h_t = \phi(h_{t-1}, x_t)
\]

In a traditional — or “vanilla” — RNN, \(\phi\) is typically implemented as an affine transformation followed by a non-linear activation function like tanh:

\[
h_t = \tanh(W x_t + U h_{t-1})
\]

Here, the network learns the weight matrices \(W\) and \(U\). The hidden state \(h_t\) serves as the network’s entire memory of the past. For generative tasks (like creating music or text), the goal is to model the probability of a sequence by predicting the next element given all previous ones:

\[
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
\]

This is the chain-rule decomposition of the sequence's probability into next-step predictions.

The problem with this simple update mechanism appears during training. RNNs use backpropagation through time to adjust their weights, calculating how a small change in \(W\) or \(U\) affects the final error. At each step backward, the error signal is multiplied by factors involving \(U\) (and the derivative of the tanh). If those factors are consistently smaller than one, the signal shrinks exponentially, quickly approaching zero (vanishing gradient); if they are consistently larger than one, it can blow up (exploding gradient). Both situations make it extremely difficult to learn long-range dependencies.
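
To make this concrete, here is a minimal NumPy sketch of a single tanh-RNN step, together with a toy illustration of the shrinking backward signal. The sizes, initialization scale, and helper names are assumptions made for the example, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

# Parameters of a vanilla tanh RNN (random here; learned in practice).
W = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights

def rnn_step(x_t, h_prev):
    """One update: h_t = tanh(W x_t + U h_{t-1})."""
    return np.tanh(W @ x_t + U @ h_prev)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(20, input_size)):  # a toy 20-step input sequence
    h = rnn_step(x_t, h)

# Toy view of the vanishing signal: backpropagation through time repeatedly
# multiplies the backward error by U (transposed); with small weights it shrinks fast.
signal = rng.normal(size=hidden_size)
for _ in range(50):
    signal = U.T @ signal
print(np.linalg.norm(signal))  # essentially zero after 50 steps
```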

This fundamental limitation inspired the creation of gated RNN units — designed to protect and regulate their memory.


The Core Method: Gating the Flow of Information

The solution to the RNN’s memory problem is to give each unit more control over its state. Instead of overwriting the hidden state at every step, what if the unit could learn when to remember, when to forget, and what to output? This is the core idea behind gating.

Gated units function like valves that control information flow. They are implemented as small neural networks (often using a sigmoid activation that outputs values between 0 and 1) which determine how much of the information passes through. Let’s look at the two most successful implementations: LSTM and GRU.
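
As a rough illustration of that idea (not code from the paper, and with arbitrary names and sizes), a gate is just a learned sigmoid vector that scales another vector elementwise:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=8), rng.normal(size=16)           # current input, previous state
W_g, U_g = rng.normal(size=(16, 8)), rng.normal(size=(16, 16))  # the gate's own weights

gate = sigmoid(W_g @ x_t + U_g @ h_prev)  # one value in (0, 1) per unit
candidate = rng.normal(size=16)           # some new information to (maybe) let through
gated = gate * candidate                  # near 0 blocks it, near 1 passes it through
```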

Figure 1: Visual comparison of (a) an LSTM unit and (b) a GRU. The LSTM maintains a separate memory cell \(c\), while the GRU merges memory into the hidden state \(h\).


The Veteran: Long Short-Term Memory (LSTM)

The LSTM, introduced in 1997, is the classic fix for the vanishing gradient problem. Its key innovation is the cell state (\(c_t\)), a pathway running through the unit like a conveyor belt, allowing information to flow unchanged unless modified by gates.

The LSTM uses three gates:

  1. Forget Gate (\(f_t\)) — decides what information to discard from the cell state.
  2. Input Gate (\(i_t\)) — decides which new information to store.
  3. Output Gate (\(o_t\)) — controls what part of the cell state to reveal as the hidden state.

The forget and input gates are computed as:

\[
f_t = \sigma(W_f x_t + U_f h_{t-1}), \qquad i_t = \sigma(W_i x_t + U_i h_{t-1})
\]

The new candidate memory content is:

\[
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})
\]

Cell state update:

\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\]

The cell partly forgets its old content (scaled by \(f_t\)) and adds the new candidate content (scaled by \(i_t\)).

Output gate equation:

\[
o_t = \sigma(W_o x_t + U_o h_{t-1})
\]

Final hidden state:

\[
h_t = o_t \odot \tanh(c_t)
\]

By separating cell state from hidden state and using explicit gates, the LSTM can preserve critical information for hundreds of time steps without vanishing or being overwritten.
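
Collecting the equations above into code, here is a minimal NumPy sketch of a single LSTM step. It matches the simplified, bias-free formulation shown here; the parameter names, sizes, and random initialization are assumptions for illustration rather than the paper's exact setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update; params holds the W_* and U_* weight matrices."""
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev)      # forget gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev)      # input gate
    c_tilde = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                               # cell state update
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev)      # output gate
    h_t = o_t * np.tanh(c_t)                                         # exposed hidden state
    return h_t, c_t

# Example usage with random weights.
rng = np.random.default_rng(0)
d, n = 8, 16  # input and hidden sizes (arbitrary)
params = {name: rng.normal(scale=0.1, size=(n, d if name.startswith("W") else n))
          for name in ["W_f", "U_f", "W_i", "U_i", "W_c", "U_c", "W_o", "U_o"]}
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, params)
```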


The Challenger: Gated Recurrent Unit (GRU)

The GRU, introduced in 2014, is a more compact alternative. It merges the forget and input gates into a single update gate and eliminates the separate cell state.

It has two gates:

  1. Reset Gate (\(r_t\)) — controls how much of the previous hidden state to forget when computing the new candidate activation \(\tilde{h}_t\):

\[
r_t = \sigma(W_r x_t + U_r h_{t-1})
\]

Candidate activation:

\[
\tilde{h}_t = \tanh\left(W x_t + U (r_t \odot h_{t-1})\right)
\]

  2. Update Gate (\(z_t\)) — controls how much of the previous state to keep vs. replace with the new candidate:

\[
z_t = \sigma(W_z x_t + U_z h_{t-1})
\]

Final hidden state:

\[
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\]

This is a linear interpolation between the previous state \(h_{t-1}\) and the candidate \(\tilde{h}_t\).

If \(z_t \approx 1\), the GRU heavily favors new information; if \(z_t \approx 0\), it retains older memory. This simple mechanism effectively captures long-term dependencies.
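
A corresponding NumPy sketch of one GRU step, following the four equations above (again bias-free, with illustrative names and sizes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update; params holds the W_* and U_* weight matrices."""
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)          # reset gate
    h_tilde = np.tanh(params["W"] @ x_t + params["U"] @ (r_t * h_prev))  # candidate activation
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)          # update gate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                          # blend old and new

rng = np.random.default_rng(0)
d, n = 8, 16  # input and hidden sizes (arbitrary)
params = {name: rng.normal(scale=0.1, size=(n, d if name.startswith("W") else n))
          for name in ["W_r", "U_r", "W", "U", "W_z", "U_z"]}
h = gru_step(rng.normal(size=d), np.zeros(n), params)
```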


LSTM vs. GRU: A Tale of Two Gates

Both LSTM and GRU tackle the vanishing gradient problem through additive state updates and gradient shortcut paths. Differences include:

  • Complexity: GRU is simpler and typically faster to compute.
  • Memory Control: LSTM has a separate memory cell with controlled exposure. GRU fully exposes its hidden state.
  • Independence of Gates: LSTM can independently decide to forget old memory and add new. GRU ties these actions via the update gate.

Which is better? Theory alone isn’t enough — empirical testing is needed.


The Experiment: Putting Theory to the Test

To compare tanh-RNN, LSTM, and GRU, the researchers designed a balanced experiment.

Tasks and Datasets

Models were evaluated on sequence modeling, trained to maximize the log-likelihood of the training sequences:

\[
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p\left(x_t^{n} \mid x_1^{n}, \dots, x_{t-1}^{n}; \theta\right)
\]
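
In code, this objective amounts to averaging, over training sequences, the summed log-probabilities the model assigns to each true next element. The sketch below assumes the per-step probabilities have already been produced by some model; the numbers are made up for the example.

```python
import numpy as np

def average_log_likelihood(step_probs):
    """step_probs[n][t] is the model's probability of the true element x_t of sequence n,
    given x_1..x_{t-1}. Returns the objective to maximize; its negation is what gets
    reported as an 'average negative log-probability'."""
    per_sequence = [np.sum(np.log(p)) for p in step_probs]
    return np.mean(per_sequence)

# Two toy sequences with hypothetical per-step probabilities.
probs = [np.array([0.9, 0.7, 0.8]), np.array([0.6, 0.5, 0.9, 0.4])]
print(average_log_likelihood(probs))   # objective (higher is better)
print(-average_log_likelihood(probs))  # negative log-probability (lower is better)
```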

Two domains:

  1. Polyphonic Music Modeling — four datasets: Nottingham, JSB Chorales, MuseData, Piano-midi.
  2. Speech Signal Modeling — two Ubisoft datasets with raw audio waveforms.

Model Setup

To ensure fairness, each architecture was allocated a similar number of parameters. The simpler tanh models had more hidden units to match the parameter count of LSTM and GRU.

Table 1: Number of hidden units and approximate number of parameters used for each architecture.
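
The logic behind such a table can be sketched as follows: given a rough parameter budget, the hidden size for each unit type is chosen so the counts line up. The budget, input size, and the bias-free counting rule below are illustrative assumptions, not the paper's exact figures.

```python
def params(input_size, hidden_size, n_blocks):
    """Approximate recurrent parameter count: each block has a W (hidden x input)
    and a U (hidden x hidden) matrix; tanh has 1 block, GRU 3, LSTM 4. Biases ignored."""
    return n_blocks * hidden_size * (input_size + hidden_size)

def units_for_budget(input_size, n_blocks, budget):
    """Largest hidden size whose parameter count stays within the budget."""
    h = 1
    while params(input_size, h + 1, n_blocks) <= budget:
        h += 1
    return h

budget, d = 20_000, 88  # hypothetical parameter budget and input size
for name, blocks in [("tanh", 1), ("GRU", 3), ("LSTM", 4)]:
    h = units_for_budget(d, blocks, budget)
    print(f"{name}: {h} units -> about {params(d, h, blocks)} parameters")
```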

Training used RMSProp with gradient clipping, and early stopping based on validation performance.
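
One common form of gradient clipping rescales the full gradient whenever its joint norm exceeds a threshold; a minimal sketch (the threshold and array shapes are arbitrary, not the paper's settings):

```python
import numpy as np

def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: two oversized gradient arrays get scaled down jointly.
clipped = clip_gradient_norm([np.full(3, 10.0), np.full(2, -10.0)], max_norm=1.0)
```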


Results: And the Winner Is… It Depends

Final Performance

Average negative log-probabilities (lower is better):

Table 2: Average negative log-probabilities on the test sets for all models.

Key observations:

  1. Gated Units Shine — On Ubisoft speech datasets, LSTM and GRU vastly outperform tanh-RNN. Tanh-RNN fails to capture complex dependencies here.
  2. Mixed Results in Music Tasks — GRU edges out LSTM in most music datasets, but differences are minor. In speech tasks, LSTM wins on Ubisoft A, GRU on Ubisoft B.

Learning Curves

Learning trends reveal efficiency.

Music datasets:

Figure 2: Learning curves for music datasets.

GRU often converges faster than LSTM and tanh — both by iterations and wall-clock time.

Speech datasets:

Figure 3: Learning curves for speech datasets.

Tanh-RNN stalls early. LSTM and GRU continue to improve, validating the importance of gating.


Conclusion: Key Takeaways

  1. Gating is Essential — LSTM and GRU are superior to traditional tanh-RNN for long-term dependency tasks.
  2. GRU is a Powerful Alternative — Comparable performance to LSTM, sometimes faster and simpler.
  3. No Universal Winner — Dataset and task specifics influence which gating unit performs best.

While preliminary, this study was pivotal in validating GRU’s effectiveness and cementing its place alongside LSTM in the deep learning toolkit. For practitioners, it reinforces the importance of experimenting with both architectures to find the best fit for a given problem.