If you’ve spent any time in the world of deep learning for sequential data, you’ve undoubtedly come across the Long Short-Term Memory network—better known as LSTM. Since their introduction, LSTMs have become the workhorse for tasks ranging from speech recognition and language translation to handwriting analysis and music generation. They are renowned for capturing long-range dependencies in data—a capability that their simpler predecessors, Simple Recurrent Networks (SRNs), often lacked.

But here’s a question that might surprise you: what is an LSTM, really?
It turns out “LSTM” isn’t a single, rigidly defined architecture. It’s more like a family of related models. Over the years, researchers have proposed numerous tweaks and variations: adding peephole connections, removing gates, coupling gates together, and more. As a result, the field has accumulated a kind of architectural zoo—where practitioners often rely on folklore or copy designs from well-known papers without fully understanding why certain components exist.

So which of these components are essential? Which ones are redundant? Does a more complex variant always outperform the standard setup?

In 2015, a team of researchers from the Swiss AI Lab IDSIA—Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber—set out to answer these questions systematically. Their large-scale study compared the standard LSTM against eight of its most popular variants across three benchmark tasks. The scale was staggering: 5,400 experimental runs, totaling approximately 15 years of CPU time.
The result was LSTM: A Search Space Odyssey—a landmark paper that provided a clear, data-driven map of the LSTM design space.

In this post, we’ll travel through that “search space odyssey.” We’ll unpack the mechanics of the LSTM, explore the variants they tested, and reveal insights you can use to design and tune your own LSTM networks effectively.


From Simple RNNs to Long Short-Term Memory

To appreciate why the LSTM was revolutionary, we need to look at what came before: the Simple Recurrent Network (SRN). An SRN processes sequences step-by-step, maintaining a hidden state that serves as “memory” of past inputs. At each time step, it combines the current input and the previous hidden state to produce a new output.
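To make the recurrence concrete, here is a minimal sketch of a single SRN step in NumPy. The names (`srn_step`, `W`, `R`, `b`) are illustrative assumptions of this sketch, not taken from the paper or any library.

```python
import numpy as np

def srn_step(x_t, h_prev, W, R, b):
    """One step of a simple recurrent network (SRN).

    x_t    : input at time t, shape (input_size,)
    h_prev : hidden state from time t-1, shape (hidden_size,)
    W      : input-to-hidden weights, shape (hidden_size, input_size)
    R      : hidden-to-hidden (recurrent) weights, shape (hidden_size, hidden_size)
    b      : bias, shape (hidden_size,)
    """
    # The new hidden state mixes the current input with the previous state
    # through a single squashing non-linearity.
    return np.tanh(W @ x_t + R @ h_prev + b)
```

Unrolling this one step over hundreds of time steps means the gradient must pass back through that many repeated applications of tanh and R, which is exactly where the trouble described next comes from.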

The problem? Training an SRN means passing gradients back through time—over many repeated non-linear transformations. This leads to the notorious vanishing or exploding gradient problem.
Gradients can shrink exponentially (making it impossible to learn long-term dependencies) or grow exponentially (causing unstable training).
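As a toy numeric illustration (not from the paper), repeatedly scaling a signal by a factor slightly below or above one is enough to make it vanish or blow up within a hundred steps:

```python
# Toy illustration of exponential shrinkage and growth over 100 time steps.
factor_below_one, factor_above_one, steps = 0.9, 1.1, 100

print(factor_below_one ** steps)  # ~2.7e-05: the signal has effectively vanished
print(factor_above_one ** steps)  # ~1.4e+04: the signal has exploded
```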

The LSTM was designed to overcome this. It introduces a memory cell that maintains information over time, regulated by gates that control what to remember, forget, and output.

Figure 1. The Simple Recurrent Network (left) and the Long Short-Term Memory block (right). The LSTM adds a central memory cell plus dedicated gating mechanisms (input, forget, and output gates) and peephole connections that allow precise control over information flow.

The standard, or vanilla, LSTM block uses three gates:

  1. Forget Gate (f) – Decides what information to discard.
  2. Input Gate (i) – Controls what new information enters the cell state.
  3. Output Gate (o) – Determines what portion of the cell state becomes the output.

These gates allow selective retention and updating of information, enabling gradients to flow across long time spans without vanishing.

The forward equations for a vanilla LSTM are:

  Block input:   \( z^t = g(W_z x^t + R_z y^{t-1} + b_z) \)
  Input gate:    \( i^t = \sigma(W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i) \)
  Forget gate:   \( f^t = \sigma(W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f) \)
  Cell state:    \( c^t = i^t \odot z^t + f^t \odot c^{t-1} \)
  Output gate:   \( o^t = \sigma(W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o) \)
  Block output:  \( y^t = o^t \odot h(c^t) \)

Here \(g\) and \(h\) are the input and output activation functions (typically tanh), \(\sigma\) is the logistic sigmoid used for the gates, and \(\odot\) denotes element-wise multiplication. The terms involving the peephole weights \(p_i\), \(p_f\), and \(p_o\) are the peephole connections, which allow the gates to “peek” at the cell state directly; this is helpful in tasks where precise timing is critical.
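As a concrete reading of these equations, here is a minimal NumPy sketch of one forward step of the vanilla LSTM block, peepholes included. The function name and parameter-dictionary layout are assumptions of this sketch (the keys simply mirror the symbols above); it is not the paper's code or any framework's API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, p):
    """One forward step of a vanilla LSTM block with peephole connections.

    p holds W_* (input weights), R_* (recurrent weights), p_* (peephole
    vectors) and b_* (biases) for * in {z, i, f, o}.
    """
    # Block input: candidate values that could be written into the cell.
    z = np.tanh(p["W_z"] @ x_t + p["R_z"] @ y_prev + p["b_z"])
    # Input gate: how much of the candidate to admit (peeks at the old cell state).
    i = sigmoid(p["W_i"] @ x_t + p["R_i"] @ y_prev + p["p_i"] * c_prev + p["b_i"])
    # Forget gate: how much of the old cell state to keep (also peeks at it).
    f = sigmoid(p["W_f"] @ x_t + p["R_f"] @ y_prev + p["p_f"] * c_prev + p["b_f"])
    # New cell state: gated blend of candidate and previous state.
    c = i * z + f * c_prev
    # Output gate: peeks at the *updated* cell state.
    o = sigmoid(p["W_o"] @ x_t + p["R_o"] @ y_prev + p["p_o"] * c + p["b_o"])
    # Block output: squashed cell state, scaled by the output gate.
    y = o * np.tanh(c)
    return y, c
```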


The Great LSTM Bake-Off: An Epic Experiment

The research team designed their experiments to isolate the effect of individual architectural components.

The Contenders

They compared nine architectures:

  1. Vanilla (V): The standard three-gate LSTM with peepholes and activation functions.
  2. No Input Gate (NIG): Removes the input gate; the cell always writes new input.
  3. No Forget Gate (NFG): Removes the forget gate; the cell cannot erase information.
  4. No Output Gate (NOG): Removes the output gate; the full internal state is always visible.
  5. No Input Activation Function (NIAF): Removes the non-linearity from the block input.
  6. No Output Activation Function (NOAF): Removes the non-linearity from the cell state before output.
  7. No Peepholes (NP): Removes the peephole connections to all gates.
  8. Coupled Input and Forget Gate (CIFG): Ties the forget and input gates together using \( f^t = 1 - i^t \), simplifying the model.
  9. Full Gate Recurrence (FGR): Adds recurrent connections between all gates—greatly increasing parameter count.

The Tasks

To test generality, they evaluated each architecture on three distinct benchmarks:

  • TIMIT (Speech Recognition): Frame-level phoneme classification.
  • IAM Online (Handwriting Recognition): Mapping pen trajectories to character sequences.
  • JSB Chorales (Polyphonic Music Modeling): Predicting successive note patterns in Bach chorales.

Experimental Design

Fair comparison is crucial—different architectures perform best under different hyperparameters. The authors used random search to tune hyperparameters for each architecture and dataset combination.
Every combination had 200 trials, exploring variations in:

  • Number of LSTM units per layer
  • Learning rate
  • Momentum
  • Input noise

They then analyzed the top 10% of runs for each setup—ensuring comparisons reflected well-tuned configurations, not random luck.
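For intuition, here is a minimal sketch of what such a random-search loop might look like. The sampling ranges and the `train_and_evaluate` stand-in are illustrative assumptions, not the exact ranges or code used in the study.

```python
import random

def sample_hyperparameters():
    """Draw one random configuration; ranges are illustrative placeholders."""
    return {
        # Learning rate sampled log-uniformly across several orders of magnitude.
        "learning_rate": 10 ** random.uniform(-6, -2),
        # Number of LSTM units per hidden layer.
        "hidden_size": random.randint(20, 200),
        # Momentum for stochastic gradient descent.
        "momentum": random.uniform(0.0, 0.99),
        # Standard deviation of Gaussian noise added to the inputs.
        "input_noise_std": random.uniform(0.0, 1.0),
    }

def train_and_evaluate(config):
    """Hypothetical stand-in: train with `config` and return validation error.
    The random value only keeps the sketch runnable; replace with real training."""
    return random.random()

# 200 independent trials per architecture/dataset combination, as in the study.
trials = []
for _ in range(200):
    config = sample_hyperparameters()
    trials.append((train_and_evaluate(config), config))

# Keep only the best 10% of trials so comparisons reflect well-tuned setups.
trials.sort(key=lambda t: t[0])
top_decile = trials[: len(trials) // 10]
```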


Results: What Really Matters in an LSTM

After 15 CPU-years of computation, the verdict was in.

Figure 2. Box plots of test set performance for the top 10% of trials of each LSTM variant on TIMIT, IAM Online, and JSB Chorales. Blue boxes indicate statistically significant differences from the vanilla LSTM; grey bars show parameter count.

Finding 1: The Forget Gate and Output Activation Are Essential

Removing the forget gate (NFG) or output activation function (NOAF) consistently degraded performance across all tasks.
The forget gate enables the model to reset its memory; without it, the cell accumulates information endlessly.
The output activation function (typically tanh) is equally vital: it squashes the potentially unbounded cell state before it becomes the block output, keeping the values that feed back into the gates within a stable range.

Finding 2: Simpler Variants Often Work Just as Well

Two variants matched vanilla performance while simplifying the architecture:

  • CIFG: Coupling the input and forget gates reduced parameters yet maintained effectiveness.
  • NP: Removing peepholes made training slightly faster and even improved handwriting recognition.

These results mean practitioners can simplify their LSTM—saving computation—without sacrificing performance.
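To make the CIFG simplification concrete, here is how it would look as a change to the vanilla step sketched earlier (reusing that sketch's assumed `sigmoid` helper and parameter layout): the forget gate loses its own weights entirely and is tied to the input gate.

```python
def cifg_lstm_step(x_t, y_prev, c_prev, p):
    """CIFG variant of the earlier lstm_step sketch: the forget gate is
    coupled to the input gate, so W_f, R_f, p_f and b_f are no longer needed."""
    z = np.tanh(p["W_z"] @ x_t + p["R_z"] @ y_prev + p["b_z"])
    i = sigmoid(p["W_i"] @ x_t + p["R_i"] @ y_prev + p["p_i"] * c_prev + p["b_i"])
    f = 1.0 - i                       # coupled gate: f^t = 1 - i^t
    c = i * z + f * c_prev            # whatever is written in replaces what is forgotten
    o = sigmoid(p["W_o"] @ x_t + p["R_o"] @ y_prev + p["p_o"] * c + p["b_o"])
    return o * np.tanh(c), c
```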

Finding 3: Full Gate Recurrence Is Not Worth It

The FGR variant, which reintroduces cross-gate recurrences from the original 1997 design, adds nine extra recurrent weight matrices but yielded no benefit, and gave worse results for music modeling. Complexity without reward.

Finding 4: Task-Dependent Components

Removing the input gate (NIG), the output gate (NOG), or the input activation function (NIAF) hurt performance on the speech and handwriting tasks but made little difference for music modeling. Continuous, real-valued data appears to need these components for fine-grained control, while symbolic tasks may rely on them less.


Hyperparameter Insights: What Should You Tune?

The scale of the study allowed the authors to apply fANOVA, a statistical framework for analyzing hyperparameter importance.

Figure 3. Pie charts showing, for each dataset, the fraction of performance variance explained by each hyperparameter. Learning rate dominates (more than two-thirds of the variance); hidden size comes second, while input noise and momentum contribute almost nothing.

Learning Rate Rules Everything

Over two-thirds of the performance variance was explained by the learning rate alone. The next most influential factor was hidden size, then input noise, with momentum barely registering.

Figure 4. Predicted error (blue) and training time (green) as a function of learning rate, hidden size, and input noise. Shaded regions show model uncertainty.

Practical takeaways:

  • Learning Rate: There is a broad range of good learning rates. Start large (e.g. 1.0) and divide by ten until performance stops improving; a short sketch of this search appears after this list.
  • Hidden Size: Bigger networks perform better but take longer to train.
  • Momentum: Has negligible effect on either performance or speed—at least for online SGD.
  • Input Noise: Slightly beneficial for speech recognition (TIMIT), harmful elsewhere.
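A minimal sketch of that coarse learning-rate search, assuming a hypothetical `train_and_evaluate` callable that trains a model at a given learning rate and returns its validation error:

```python
def coarse_learning_rate_search(train_and_evaluate, start_lr=1.0, factor=10.0, max_tries=8):
    """Start with a large learning rate and keep dividing by `factor`
    until validation error stops improving."""
    best_lr = start_lr
    best_error = train_and_evaluate(start_lr)
    lr = start_lr / factor
    for _ in range(max_tries - 1):
        error = train_and_evaluate(lr)
        if error >= best_error:
            break                     # no further improvement: keep the previous rate
        best_lr, best_error = lr, error
        lr /= factor                  # still improving: try a smaller rate
    return best_lr
```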

Interactions: Hyperparameters Work Independently

The researchers also examined whether hyperparameters interfere with each other (e.g., does the best learning rate depend on network size?).
They found almost no interaction—an incredibly useful practical insight.

Figure 5. Heatmaps of learning rate versus hidden size on the TIMIT dataset: the left panel shows their joint effect on performance, the right panel isolates the interaction alone. Blue means better, red worse. The interaction panel is mostly neutral, showing very little interaction.

Bottom line: You can tune hyperparameters independently.
Find a good learning rate using a small network, then reuse it for larger models—saving massive amounts of compute time.


Conclusion: Evidence Over Folklore

The IDSIA team’s Search Space Odyssey is a model of rigorous, data-driven investigation. Their findings cut through years of anecdotal wisdom to offer concrete guidance for LSTM design and training.

Key takeaways for practitioners:

  • Stick with the vanilla LSTM — it’s robust, reliable, and surprisingly hard to beat.
  • Keep the forget gate and output activation function. They’re essential for stability and performance.
  • Simplify smartly: coupling input and forget gates (CIFG) or removing peepholes (NP) can maintain quality while reducing complexity.
  • Focus on the learning rate. It’s your most impactful hyperparameter.
  • Tune independently: find good values for learning rate and hidden size separately to save time.

This paper demonstrates that careful, large-scale empirical research can replace intuition with evidence. With its rigorous comparisons and comprehensive analysis, LSTM: A Search Space Odyssey remains one of the most insightful references for anyone working with recurrent neural networks—providing both clarity and confidence in navigating the vast design space of LSTM architectures.