It’s 1997. The Spice Girls are topping the charts, Titanic is about to hit theaters, and two researchers, Sepp Hochreiter and Jürgen Schmidhuber, publish a paper that will, in time, become a cornerstone of the modern AI revolution.
The paper, titled Long Short-Term Memory, proposed a new kind of neural network architecture that could remember information for incredibly long periods.

At the time, this was a monumental challenge. Neural networks had a memory problem — they were notoriously forgetful. Trying to get a standard Recurrent Neural Network (RNN) to remember something that happened 100 steps ago was like trying to recall the first sentence of a book after reading the whole thing. The information would almost certainly be gone, washed away by a flood of newer data.

This amnesia was a major roadblock for applying AI to sequential data like text, speech, or time series.
How could you translate a sentence if you forgot the beginning by the time you reached the end?
How could you predict stock prices if you couldn’t see patterns spanning more than a few days?

The Long Short-Term Memory (LSTM) network was the answer. It introduced a novel architecture designed specifically to overcome this fundamental limitation.
This article takes a deep dive into that seminal 1997 paper — breaking down the problem LSTMs were built to solve, the ingenious design behind them, and the experiments that proved they were not just another incremental improvement, but a genuine breakthrough.


The Fading Memory of Recurrent Neural Networks

To understand why LSTM was such a big deal, we first need to understand the infamous vanishing gradient problem.

Recurrent Neural Networks are designed to work with sequences. Unlike a standard feedforward network, an RNN has a loop that allows information to persist from one step to the next. At each time step \(t\), the network takes:

  • Input: \(x(t)\)
  • Previous hidden state: \(h(t-1)\)

It uses both to produce:

  • Output
  • New hidden state: \(h(t)\)

This hidden state is the network’s “memory.”
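
To make the loop concrete, here is a minimal sketch of one recurrent step in NumPy. The weight names and sizes are illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

# Minimal sketch of one RNN step (illustrative; names and sizes are arbitrary).
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the loop)
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # new hidden state: the "memory"
    y_t = W_hy @ h_t                            # output at this time step
    return y_t, h_t

# Unrolling over a toy sequence: the same weights are reused at every step.
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    y, h = rnn_step(x, h)
```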

Training an RNN typically involves an algorithm called Backpropagation Through Time (BPTT).
It’s a variation of standard backpropagation where the network is unrolled through time, creating a long chain of repeated computations with shared weights at each time step. The error from the final output is then propagated backward through this unrolled network to update the weights.

Here’s where the trouble starts:
To update the weights, we need the gradient of the error with respect to the weight. This gradient is a chain of many multiplications — derivatives of activation functions multiplied by weights — across time steps.

For a neuron \(j\):

\[ \vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij} \ \vartheta_i(t+1) \]

Here, \(\vartheta_j(t)\) is the backpropagated error signal. Chaining this over many time steps \(q\) leads to a scaling factor that is:

\[ \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \ \prod_{m=1}^{q} f'_{l_m}\big(net_{l_m}(t-m)\big)\, w_{l_m l_{m-1}} \qquad (l_0 = u,\ l_q = v) \]

In words: the error arriving at unit \(v\), \(q\) steps in the past, is a sum over all backward paths, and each path contributes a product of \(q\) activation-function derivatives and weights, one pair per time step.


The Two Failure Modes

  1. Vanishing Gradient
    Common squashing activations have small derivatives; the logistic sigmoid's never exceeds 0.25. Multiplying many such numbers together causes exponential decay of the gradient, so after a modest number of steps it shrinks to effectively zero. The network becomes incapable of learning dependencies between far-apart events.

  2. Exploding Gradient
    If the per-step factors of weight times derivative exceed 1 in magnitude, the gradient instead grows exponentially. This leads to highly unstable weight updates and failure to learn.

In short, the influence of past inputs either vanishes or explodes, giving standard RNNs a very short active memory span.
This made them unsuitable for long time lag problems.
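
A toy calculation makes the two failure modes above tangible. Assume, purely for illustration, a single backward path where the error is multiplied at each step by the sigmoid derivative at its maximum (0.25) times one recurrent weight:

```python
import numpy as np

# Toy illustration of the backward scaling factor: each step back multiplies
# the error by f'(net) * w. The weights chosen here are illustrative only.
def sigmoid_derivative(net):
    s = 1.0 / (1.0 + np.exp(-net))
    return s * (1.0 - s)              # never exceeds 0.25

q = 100                               # number of time steps to backpropagate through

for w in (0.9, 4.5):                  # a "small" and a "large" recurrent weight
    factor = (sigmoid_derivative(0.0) * w) ** q
    print(f"w = {w}: scaling after {q} steps is about {factor:.1e}")

# w = 0.9 -> ~2e-65   (0.25 * 0.9 < 1: the gradient vanishes)
# w = 4.5 -> ~1e+05   (0.25 * 4.5 > 1: the gradient explodes)
```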


The authors asked: What if we could force the error to flow backward through time without shrinking or growing?

In recurrent backprop, the error signal flowing backward through a neuron’s self-connection is scaled by:

\[ f'_j(net_j(t)) \cdot w_{jj} \]

To keep error flow constant, this term must equal 1.0:

\[ f'_j(net_j(t)) \cdot w_{jj} = 1.0 \]

The simplest way to achieve this:

  1. Use a linear activation, \(f_j(x) = x\), so \(f'_j = 1\).
  2. Set self-connection weight to \(w_{jj} = 1.0\).

This creates what the authors called a Constant Error Carousel (CEC) — a unit that can carry information forward indefinitely without change. Gradients can flow backward through it unchanged.
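
A few lines of code are enough to see why this works. This is only a sketch of the idea, not the paper's implementation: a linear unit with a self-connection of weight 1.0 passes its error back completely unchanged.

```python
# Sketch of the Constant Error Carousel idea (illustrative only).
# Linear activation (f(x) = x, so f'(x) = 1) and self-weight w_jj = 1.0
# make the backward scaling factor f'(net) * w_jj exactly 1.0 at every step.
w_self = 1.0

def cec_forward(state, new_input=0.0):
    return w_self * state + new_input    # additive, linear update: the state persists

def cec_backward(error):
    f_prime = 1.0                        # derivative of the identity activation
    return f_prime * w_self * error      # error * 1.0: unchanged

error = 1.0
for _ in range(1000):
    error = cec_backward(error)
print(error)   # still 1.0 after 1000 steps: no vanishing, no exploding
```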

But the naive CEC has two critical flaws:

  • Input Weight Conflict — The same input weights must both store relevant inputs and ignore irrelevant ones.
  • Output Weight Conflict — The same output weights must both retrieve stored information and prevent it from disturbing the network when irrelevant.

The solution: control information flow with gates.


The LSTM Cell: Memory with Gates

The LSTM tackles these conflicts by wrapping the CEC in multiplicative gate units — learnable switches that decide when to write to or read from the cell.

The architecture of a single LSTM memory cell. The central component is the Constant Error Carousel (CEC), a self-recurrent linear unit. Access to this carousel is controlled by the input and output gates.

Anatomy of the LSTM Memory Cell

1. Cell State: The CEC

At the heart of the LSTM cell is the cell state \(s_{c_j}(t)\) — the CEC.
It’s updated by adding new information to the previous state, allowing gradients to flow backward unchanged:

\[ s_{c_j}(t) = s_{c_j}(t-1) + \text{new information} \]

2. Input Gate (\(in_j\))

Controls writes to memory. Output between 0 and 1 via sigmoid activation:

  • 1 → allow new information to be stored
  • 0 → block update, protecting stored state

State update refined as:

\[ s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) \cdot g(net_{c_j}(t)) \]

Here, \(g\) is a squashing function (in the paper, a sigmoid scaled to the range \([-2, 2]\)), and \(y^{in_j}(t)\) is the input gate's activation.


3. Output Gate (\(out_j\))

Controls reads from memory:

\[ y^{c_j}(t) = y^{out_j}(t) \cdot h(s_{c_j}(t)) \]

Here, \(h\) is another squashing function (scaled to \([-1, 1]\) in the paper), and \(y^{out_j}(t)\) decides whether the stored value is exposed to the rest of the network.
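
Putting the pieces together, here is a minimal sketch of one forward step of a single 1997-style memory cell (no forget gate, which only arrived with Gers et al. in 2000). To keep it short, the gates here are driven by the current input alone, and all weight names are my own; in the paper the gates also see the other units' previous outputs.

```python
import numpy as np

# Minimal sketch of a single 1997-style LSTM memory cell (no forget gate).
# Gate weights are illustrative; squashing functions follow the paper's ranges.
rng = np.random.default_rng(1)
n_in = 6                                   # size of the cell's input vector

w_in, w_out, w_c = (rng.normal(scale=0.1, size=n_in) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

g = lambda x: 4.0 * sigmoid(x) - 2.0       # input squashing, range [-2, 2]
h = lambda x: 2.0 * sigmoid(x) - 1.0       # output squashing, range [-1, 1]

def lstm_cell_step(x_t, s_prev):
    y_in  = sigmoid(w_in  @ x_t)           # input gate: write or protect the state?
    y_out = sigmoid(w_out @ x_t)           # output gate: expose or shield the state?
    s_t = s_prev + y_in * g(w_c @ x_t)     # CEC: purely additive state update
    y_c = y_out * h(s_t)                   # gated cell output
    return y_c, s_t

s = 0.0
for x in rng.normal(size=(10, n_in)):
    y, s = lstm_cell_step(x, s)
```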


Cell Blocks

LSTM cells are often grouped into blocks sharing input/output gates, improving efficiency.
Example network topology:

An example LSTM network with 8 inputs, 4 outputs, and two memory cell blocks of size 2. This illustrates how individual memory cells are integrated into a larger network architecture.


Putting LSTM to the Test

The brilliance of LSTM shows most clearly on problems where vanilla RNNs utterly fail.


Experiment 1: Embedded Reber Grammar

Benchmark for sequence learning — predict the next symbol from a grammar-generated string.

The transition diagram for the embedded Reber grammar task. To predict the final ‘T’ or ‘P’, the network must remember which one it saw near the beginning.

LSTM results:

Comparison of different models on the embedded Reber grammar task. LSTM is the only one to consistently solve it, and does so much faster.

Other algorithms struggled or failed entirely. LSTM solved it 100% of the time, much faster. Output gates were critical — they allowed learning of short-term grammar rules without interference from the harder long-term dependency.


Experiment 2: The 1000-Step Challenge

Task: predict the final symbol based on the second symbol in a sequence. Minimal time lag: 1000 steps, with random distractors in between.

This is the quintessential long-term dependency test — impossible for standard RNNs.

LSTM performance on Task 2c with varying time lags and numbers of distractor symbols. Learning time scales remarkably well, even as the problem gets harder.

LSTM reliably solved tasks with 1000-step minimal lags. Training time scaled slowly even with more distractor symbols — a property unmatched by other methods.


Experiment 4: Adding Problem

Test: can LSTM store continuous, distributed values?
Sequence of value-marker pairs → output the sum of marked values at end.
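
To make the task concrete, here is a rough sketch of how one training sequence might be generated. The exact lengths, value ranges, and marker-placement rules in the paper differ in detail; these are assumptions for illustration.

```python
import numpy as np

# Rough sketch of an Adding Problem example (parameters are assumptions).
rng = np.random.default_rng(2)

def make_adding_example(length=100):
    values  = rng.uniform(-1.0, 1.0, size=length)    # continuous values at every step
    markers = np.zeros(length)                       # 0 = distractor, 1 = "add me"
    i, j = rng.choice(length, size=2, replace=False)
    markers[i] = markers[j] = 1.0
    inputs = np.stack([values, markers], axis=1)     # one (value, marker) pair per step
    target = values[i] + values[j]                   # required output at the sequence end
    return inputs, target

X, y = make_adding_example()
print(X.shape, y)   # (100, 2) and the sum of the two marked values
```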

Results for the Adding Problem. LSTM successfully learns to store and sum precise real values over long time intervals.

LSTM preserved exact real values for hundreds of steps without drift — proving CEC’s stability.


Experiment 5: Multiplication Problem

Same setup as above, but the target is the product of the marked values. This shows LSTM isn't limited to simple accumulation; it can learn to combine stored values non-linearly.


Experiment 6: Temporal Order

Classify sequences based on whether X appears before Y or vice versa, separated by long spans of noise.
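
A sketch of how such a sequence could be generated, again with illustrative assumptions (alphabet, length, and distractor symbols are my own; the paper's construction is more specific):

```python
import numpy as np

# Rough sketch of a Temporal Order example (parameters are assumptions).
rng = np.random.default_rng(3)
DISTRACTORS = np.array(list("abcd"))                 # the "noise" symbols

def make_temporal_order_example(length=100):
    seq = list(rng.choice(DISTRACTORS, size=length))
    first, second = np.sort(rng.choice(np.arange(1, length - 1), size=2, replace=False))
    order = "XY" if rng.random() < 0.5 else "YX"     # the only thing that matters
    seq[first], seq[second] = order[0], order[1]
    label = 0 if order == "XY" else 1                # class = relative order of X and Y
    return "".join(seq), label

s, label = make_temporal_order_example()
print(label, s[:40], "...")
```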

Results for the Temporal Order problem. LSTM can extract information from the relative timing of widely separated events.

Again, LSTM solved it easily, storing the first relevant symbol in a cell and updating the state when the second one appeared.


Why It Matters

The LSTM paper delivered:

  • True long-term memory — bridging lags over 1000 steps
  • Noise robustness
  • Continuous & distributed representation support
  • Efficient, local-in-time-and-space learning
  • Wide parameter tolerance — little fine-tuning needed

Legacy

The 1997 Long Short-Term Memory paper was a fundamental breakthrough in deep learning.
By introducing the Constant Error Carousel protected by gates, Hochreiter and Schmidhuber solved one of the most persistent problems in training recurrent networks.

Today, LSTMs (and simplified variants like the GRU) have powered:

  • Language translation (e.g., earlier, LSTM-based versions of Google Translate)
  • Speech recognition (e.g., Siri)
  • Text generation, the sequence-modeling precursor of today's large language models
  • Time series forecasting

Looking Back:
The LSTM story is a masterclass in identifying a core problem, reasoning from first principles, engineering an elegant mechanism, and validating it rigorously.
It laid the foundation for machines that can understand context, process language, and learn from long arcs of time, and its influence continues to shape AI's future.