It’s 1997. The Spice Girls are topping the charts, Titanic is about to hit theaters, and two researchers, Sepp Hochreiter and Jürgen Schmidhuber, publish a paper that will, in time, become a cornerstone of the modern AI revolution.
The paper, titled Long Short-Term Memory, proposed a new kind of neural network architecture that could remember information for incredibly long periods.
At the time, this was a monumental challenge. Neural networks had a memory problem — they were notoriously forgetful. Trying to get a standard Recurrent Neural Network (RNN) to remember something that happened 100 steps ago was like trying to recall the first sentence of a book after reading the whole thing. The information would almost certainly be gone, washed away by a flood of newer data.
This amnesia was a major roadblock for applying AI to sequential data like text, speech, or time series.
How could you translate a sentence if you forgot the beginning by the time you reached the end?
How could you predict stock prices if you couldn’t see patterns spanning more than a few days?
The Long Short-Term Memory (LSTM) network was the answer. It introduced a novel architecture designed specifically to overcome this fundamental limitation.
This article takes a deep dive into that seminal 1997 paper — breaking down the problem LSTMs were built to solve, the ingenious design behind them, and the experiments that proved they were not just another incremental improvement, but a genuine breakthrough.
The Fading Memory of Recurrent Neural Networks
To understand why LSTM was such a big deal, we first need to understand the infamous vanishing gradient problem.
Recurrent Neural Networks are designed to work with sequences. Unlike a standard feedforward network, an RNN has a loop that allows information to persist from one step to the next. At each time step \(t\), the network takes:
- Input: \(x(t)\)
- Previous hidden state: \(h(t-1)\)
It uses both to produce:
- Output
- New hidden state: \(h(t)\)
This hidden state is the network’s “memory.”
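To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step. The weight names and dimensions are illustrative choices, not notation from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: mix the new input with the previous
    hidden state, squash with tanh, and read out an output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # new hidden state: the "memory"
    y_t = W_hy @ h_t + b_y                           # output at this step
    return y_t, h_t

# Illustrative sizes: 3-dimensional inputs, a 5-dimensional hidden state, 2 outputs.
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_hy, b_h, b_y = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)

h = np.zeros(5)                     # initial memory
for x in rng.normal(size=(10, 3)):  # a toy sequence of 10 inputs
    y, h = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```

The only thing carrying information from one step to the next is `h`, which is exactly why anything it cannot hold onto is lost.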
Training an RNN typically involves an algorithm called Backpropagation Through Time (BPTT).
It’s a variation of standard backpropagation where the network is unrolled through time, creating a long chain of repeated computations with shared weights at each time step. The error from the final output is then propagated backward through this unrolled network to update the weights.
Here’s where the trouble starts:
To update the weights, we need the gradient of the error with respect to the weight. This gradient is a chain of many multiplications — derivatives of activation functions multiplied by weights — across time steps.
For a neuron \(j\):
\[ \vartheta_j(t) = f'_j(net_j(t)) \sum_i w_{ij} \, \vartheta_i(t+1) \]
Here, \(\vartheta_j(t)\) is the backpropagated error signal. Chaining this recursion backward over a lag of \(q\) time steps yields a scaling factor that is a product of \(q\) terms, each of the form \(f' \cdot w\), and that product fails in one of two ways.
The Two Failure Modes
Vanishing Gradient
Most common activation functions, like the logistic sigmoid, have derivatives strictly less than 1 (the sigmoid’s maximum is 0.25). Multiplying many such numbers causes exponential decay of the gradient. After just a few steps, the gradient can shrink to effectively zero, and the network becomes incapable of learning dependencies between far-apart events.
Exploding Gradient
If the weights are large enough that each \(f' \cdot w\) factor exceeds 1, the gradient grows exponentially instead. This leads to highly unstable weight updates and a failure to learn.
In short, the influence of past inputs either vanishes or explodes, giving standard RNNs a very short active memory span.
This made them unsuitable for long time lag problems.
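You can see both failure modes with a back-of-the-envelope calculation. The sketch below simply multiplies per-step \(f' \cdot w\) factors together; the weight values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

steps = 100

# Vanishing: even in the sigmoid's best case (derivative 0.25) with a weight of 1.0,
# the factor after 100 steps is 0.25**100, roughly 6e-61 -- effectively zero.
vanishing = np.prod(np.full(steps, sigmoid_derivative(0.0) * 1.0))

# Exploding: with the same derivative but a weight of 5.0, each factor is 1.25,
# so after 100 steps the gradient is scaled by 1.25**100, roughly 5e9.
exploding = np.prod(np.full(steps, sigmoid_derivative(0.0) * 5.0))

print(f"vanishing factor after {steps} steps: {vanishing:.3e}")
print(f"exploding factor after {steps} steps: {exploding:.3e}")
```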
A First Attempt: The Constant Error Carousel
The authors asked: What if we could force the error to flow backward through time without shrinking or growing?
In recurrent backprop, the error signal flowing backward through a neuron’s self-connection is scaled by:
\[ f'_j(net_j(t)) \cdot w_{jj} \]
To keep error flow constant, this term must equal 1.0:
\[ f'_j(net_j(t)) \cdot w_{jj} = 1.0 \]
The simplest way to achieve this:
- Use a linear activation, \(f_j(x) = x\), so \(f'_j = 1\).
- Set self-connection weight to \(w_{jj} = 1.0\).
This creates what the authors called a Constant Error Carousel (CEC) — a unit that can carry information forward indefinitely without change. Gradients can flow backward through it unchanged.
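Rerunning the earlier toy calculation with the CEC’s two choices makes the point: a linear activation has derivative 1.0, and with a self-connection weight of 1.0 every per-step factor is exactly 1, however long the chain.

```python
import numpy as np

# CEC: linear activation f(x) = x has derivative 1.0; self-connection weight is 1.0.
f_prime, w_jj = 1.0, 1.0
cec_factor = np.prod(np.full(1000, f_prime * w_jj))
print(cec_factor)  # 1.0 -- the error signal passes through 1000 steps unchanged
```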
But the naive CEC has two critical flaws:
- Input Weight Conflict — The same input weights must both store relevant inputs and ignore irrelevant ones.
- Output Weight Conflict — The same output weights must both retrieve stored information and prevent it from disturbing the network when irrelevant.
The solution: control information flow with gates.
The LSTM Cell: Memory with Gates
The LSTM tackles these conflicts by wrapping the CEC in multiplicative gate units — learnable switches that decide when to write to or read from the cell.
Anatomy of the LSTM Memory Cell
1. Cell State: The CEC
At the heart of the LSTM cell is the cell state \(s_{c_j}(t)\) — the CEC.
It’s updated by adding new information to the previous state, allowing gradients to flow backward unchanged. In its bare, ungated form:
\[ s_{c_j}(t) = s_{c_j}(t-1) + g(net_{c_j}(t)) \]
2. Input Gate (\(in_j\))
Controls writes to memory. Output between 0 and 1 via sigmoid activation:
- 1 → allow new information to be stored
- 0 → block update, protecting stored state
State update refined as:
\[ s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) \cdot g(net_{c_j}(t)) \]
Here, \(g\) is a squashing function (e.g., a scaled tanh) applied to the cell’s input, and \(y^{in_j}(t)\) is the learned gate output.
3. Output Gate (\(out_j\))
Controls reads from memory:
\[ y^{c_j}(t) = y^{out_j}(t) \cdot h(s_{c_j}(t)) \]
Here, \(h\) is usually tanh; \(y^{out_j}(t)\) decides whether the cell value influences the rest of the network.
Cell Blocks
LSTM cells are often grouped into blocks sharing input/output gates, improving efficiency.
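Putting the pieces together, here is a minimal sketch of a single 1997-style memory cell’s forward step (no forget gate, which only arrived in later work). The shapes and wiring are simplified assumptions; only the gating structure follows the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, s_prev, params):
    """One forward step of a single 1997-style LSTM memory cell.

    s_prev is the cell state s_c(t-1); x_t stands in for the cell's whole
    input (external input plus recurrent connections, simplified here)."""
    w_in, b_in, w_out, b_out, w_c, b_c = params

    y_in  = sigmoid(w_in @ x_t + b_in)    # input gate: should we write?
    y_out = sigmoid(w_out @ x_t + b_out)  # output gate: should we read?
    g     = np.tanh(w_c @ x_t + b_c)      # squashed candidate value to store

    s_t = s_prev + y_in * g               # CEC: purely additive state update
    y_c = y_out * np.tanh(s_t)            # gated read-out of the cell (h = tanh)
    return y_c, s_t

# Toy usage: a 4-dimensional input and a scalar cell state.
rng = np.random.default_rng(1)
params = (rng.normal(size=4), 0.0, rng.normal(size=4), 0.0, rng.normal(size=4), 0.0)
s = 0.0
for x in rng.normal(size=(20, 4)):
    y, s = lstm_cell_step(x, s, params)
```

Notice that the state update is a sum, not a repeated multiplication: that single design decision is what lets gradients survive long time lags.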
Putting LSTM to the Test
The brilliance of LSTM shows most clearly on problems where vanilla RNNs utterly fail.
Experiment 1: Embedded Reber Grammar
Benchmark for sequence learning — predict the next symbol from a grammar-generated string.
LSTM’s results were striking: other algorithms struggled or failed entirely, while LSTM solved the task 100% of the time, and much faster. Output gates were critical, allowing the network to learn the short-term grammar rules without interference from the harder long-term dependency.
Experiment 2: The 1000-Step Challenge
Task: predict the final symbol based on the second symbol in a sequence. Minimal time lag: 1000 steps, with random distractors in between.
This is the quintessential long-term dependency test — impossible for standard RNNs.
LSTM reliably solved tasks with 1000-step minimal lags. Training time scaled slowly even with more distractor symbols — a property unmatched by other methods.
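To get a feel for the structure of such a task, here is a toy sequence generator; the actual alphabet and input encoding in the paper differ, so treat this purely as an illustration of the shape of the problem.

```python
import random

def make_long_lag_sequence(lag=1000, distractors="cdefgh"):
    """Toy long-lag task: the second symbol ('a' or 'b') determines the target
    at the very end; everything in between is irrelevant noise."""
    key = random.choice("ab")
    noise = [random.choice(distractors) for _ in range(lag)]
    sequence = ["s", key] + noise + ["?"]  # 's' = start marker, '?' = query position
    return sequence, key                   # the key must be recalled ~1000 steps later

seq, target = make_long_lag_sequence()
print(len(seq), target)
```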
Experiment 4: Adding Problem
Test: can LSTM store continuous, distributed values?
Sequence of value-marker pairs → output the sum of marked values at end.
LSTM preserved exact real values for hundreds of steps without drift — proving CEC’s stability.
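For a sense of what the data looks like, here is a rough sketch of an adding-problem generator in the spirit of the paper’s setup; details such as sequence length, value range, and target scaling are simplified assumptions.

```python
import numpy as np

def make_adding_example(length=100, rng=None):
    """Each time step carries a (value, marker) pair. Exactly two steps are
    marked; the target is the sum of their values, produced at the end."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.uniform(-1.0, 1.0, size=length)
    markers = np.zeros(length)
    i, j = rng.choice(length, size=2, replace=False)
    markers[[i, j]] = 1.0
    inputs = np.stack([values, markers], axis=1)  # shape: (length, 2)
    target = values[i] + values[j]
    return inputs, target

x, y = make_adding_example()
print(x.shape, y)
```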
Experiment 5: Multiplication Problem
Same as above but target is product of marked values. Shows LSTM isn’t just integrating but can perform non-linear transformations.
Experiment 6: Temporal Order
Classify sequences based on whether X appears before Y or vice versa, separated by long spans of noise.
Again, LSTM solved it easily by storing the first symbol in a cell and updating state when the second appears.
Why It Matters
The LSTM paper delivered:
- True long-term memory — bridging lags over 1000 steps
- Noise robustness
- Continuous & distributed representation support
- Efficient, local-in-time-and-space learning
- Wide parameter tolerance — little fine-tuning needed
Legacy
The 1997 Long Short-Term Memory paper was a fundamental breakthrough in deep learning.
By introducing the Constant Error Carousel protected by gates, Hochreiter and Schmidhuber solved one of the most persistent problems in training recurrent networks.
LSTMs (and variants like GRUs) went on to power:
- Language translation (e.g., Google Translate)
- Speech recognition (e.g., Siri)
- Text generation and neural language modeling, forerunners of today’s large language models
- Time series forecasting
Looking Back
The LSTM story is a masterclass in identifying a core problem, reasoning from first principles, engineering an elegant mechanism, and validating it rigorously.
It laid the foundation for machines that can understand context, process language, and learn from long arcs of time — a truly unreasonable effectiveness that continues to shape AI’s future.