Recurrent Neural Networks (RNNs), and their more powerful cousins, Long Short-Term Memory networks (LSTMs), are foundational tools for processing sequential data. They’ve enabled breakthroughs in everything from language translation and image captioning to speech and handwriting generation. Yet despite their success, LSTMs have long been treated as “black boxes.” We know they work—but how they work, what they learn, why they succeed, and where they fail have remained poorly understood.
This lack of interpretability is a significant obstacle. Without understanding our models, it’s challenging to design better ones.
A classic 2015 paper from Stanford, “Visualizing and Understanding Recurrent Networks,” addresses this challenge head-on. Instead of introducing new architectures, the authors conduct a deep, empirical investigation into the inner workings of LSTMs. Using character-level language models as an interpretable testbed, they visualize what these networks learn, how they represent information, and where they make errors.
Their discoveries are intriguing. They not only confirm that LSTMs capture long-range dependencies, but also show how—revealing individual memory cells that track line length, quotation marks, indentation levels in code, and more. This post unpacks their methodology, results, and what they mean for the future of sequence modeling.
RNNs, LSTMs, and GRUs: A Refresher
Before diving into the experiments, let’s recap the key models explored in the paper. All process sequences one element at a time, maintaining an internal “state” or “memory” that captures past context.
Vanilla Recurrent Neural Network (RNN)
An RNN updates its hidden state \(h_t\) at time \(t\) based on the current input \(x_t\) and the previous hidden state \(h_{t-1}\):
\[ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}) \]
A basic RNN recurrence: the previous state and current input are combined linearly and passed through a nonlinearity.
While conceptually elegant, vanilla RNNs suffer from vanishing and exploding gradients: as gradients propagate backward through many time steps, they can shrink to nearly zero or grow uncontrollably. This makes learning long-range dependencies extremely difficult.
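To make the recurrence concrete, here is a minimal numpy sketch of a single RNN step. The weight names and toy sizes below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous state, then squash with tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy sizes, for illustration only.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 65, 128
W_xh = rng.normal(scale=0.01, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x_t = np.zeros(vocab_size)
x_t[12] = 1.0                      # a one-hot encoded character
h_prev = np.zeros(hidden_size)
h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
```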
Long Short-Term Memory (LSTM)
LSTMs solve this problem with a more sophisticated internal structure. In addition to a hidden state, they maintain a cell state \(c_t\) that acts like a conveyor belt—allowing information to flow with fewer changes, mitigating vanishing gradients.
Information flow is controlled by three gates:
- Forget Gate (\(f\)) – decides which information to discard.
- Input Gate (\(i\)) – decides which new information to store.
- Output Gate (\(o\)) – decides what part of the cell state to reveal.
LSTM cell update: gates modulate reading, writing, and revealing memory.
The additive cell update
\[ c_t = f \circ c_{t-1} + i \circ g \]
is key: addition lets gradients pass through without repeated multiplication, preserving them over long sequences.
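To make the gating concrete, here is a minimal sketch of one LSTM step. The stacked weight matrix \(W\) mapping \([x_t; h_{t-1}]\) to the four gate pre-activations is a common convention assumed for illustration, not necessarily the paper's exact parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x_t; h_prev] to the stacked pre-activations of i, f, o, g."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0 * H:1 * H])   # input gate: which new information to store
    f = sigmoid(z[1 * H:2 * H])   # forget gate: which old memory to keep
    o = sigmoid(z[2 * H:3 * H])   # output gate: which part of memory to reveal
    g = np.tanh(z[3 * H:4 * H])   # candidate values to write
    c_t = f * c_prev + i * g      # additive cell update: gradients flow through the "+"
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```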
Gated Recurrent Unit (GRU)
GRUs are a streamlined variant that combine the forget and input gates into a single update gate (\(z\)) and use a reset gate (\(r\)) to control how much of the past to forget.
GRU update: merges hidden and cell state, simplifying gating.
GRUs are computationally efficient and often match LSTM performance.
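For comparison, here is a sketch of one GRU step under a common formulation (sign conventions for the update gate vary between implementations; this is an illustrative version, not the paper's code).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step: the update gate z blends old state and candidate; the reset gate r gates the past."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh + b_z)   # update gate: how much of the state to overwrite
    r = sigmoid(W_r @ xh + b_r)   # reset gate: how much past state feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]) + b_h)
    return (1.0 - z) * h_prev + z * h_tilde   # no separate cell state, unlike the LSTM
```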
Experimental Setup: A Window into the Network’s Mind
The authors needed a task complex enough to challenge the models, but simple enough to interpret. Their choice: character-level language modeling.
The model reads one character at a time and predicts the next character. This lets us directly inspect predictions and activations, revealing learned patterns.
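Setting up the data is straightforward; here is a minimal sketch (the filename is a placeholder, and the paper's own preprocessing may differ).

```python
# Turn raw text into (input character, next character) training pairs.
text = open("input.txt").read()               # placeholder path: e.g. War and Peace or kernel source
chars = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

inputs = [char_to_ix[c] for c in text[:-1]]   # character at position t
targets = [char_to_ix[c] for c in text[1:]]   # character at position t + 1, which the model must predict
```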
Two datasets were chosen to span different structural properties:
- Leo Tolstoy’s War and Peace – rich natural language with minimal markup.
- Linux Kernel source code – highly structured C code with long-range dependencies (matching {} braces, block indentation, comments, etc.).
Models included vanilla RNNs, LSTMs, and GRUs, with 1–3 layers and varying sizes.
Key Finding 1: LSTMs and GRUs Outperform RNNs
The first step: establish performance baselines using test set cross-entropy loss (lower is better).
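As a reference point, the metric itself is simple: the mean cross-entropy of the predicted next-character distributions. A small sketch (array shapes are illustrative):

```python
import numpy as np

def char_cross_entropy(probs, targets):
    """Mean cross-entropy over a sequence.
    probs: shape (T, vocab_size), the model's predicted distribution at each step.
    targets: shape (T,), the index of the character that actually came next."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```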
Left: LSTMs and GRUs consistently outperform vanilla RNNs.
Right: t-SNE shows LSTMs and GRUs cluster together, learning similar patterns, while RNNs are distinct.
Observations:
- Gating matters: LSTMs and GRUs significantly beat RNNs.
- Depth helps: at least 2 layers outperform single-layer models; a 3rd layer has mixed benefits.
- LSTMs ≈ GRUs: their predictions are more similar to each other than to RNNs.
With this established, the study focuses on why LSTMs perform so well.
Key Finding 2: Interpretable Memory Cells
If LSTMs capture long-range dependencies, some cells may track specific features over time. The authors visualized memory cell activations (\(c_t\)) over sequences and found striking patterns:
Cells from an LSTM trained on Linux Kernel code: clear tracking of quotes, line length, indentation, and comments.
Examples:
- Quote Detection Cell: Activates inside quoted strings.
- Line Length Tracker: Its activation decays with each character and resets at every newline.
- Indentation Depth: Increases with code block depth.
- Comment/String Cell: Activates in comments or strings.
These cells track dependencies hundreds of characters long, well beyond the truncated backpropagation horizon of 100 characters used during training. This is some of the first direct experimental evidence that LSTMs trained on real-world data can autonomously learn high-level, interpretable tracking.
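In the same spirit as the paper's figures, here is a hedged sketch of how one might render a single cell's activation over a snippet of text. The character-bucket "heatmap" is an illustrative stand-in for the paper's color-coded plots, and the activations below are made up.

```python
import numpy as np

def visualize_cell(text, cell_values):
    """Print each character next to a crude text heatmap of one memory cell's activation.
    cell_values[t] is assumed to be tanh(c_t) for the chosen cell while reading text[t], in [-1, 1]."""
    buckets = " .:-=+*#%@"                                   # low -> high activation
    for ch, v in zip(text, cell_values):
        level = int((v + 1.0) / 2.0 * (len(buckets) - 1))
        print(f"{ch!r:>6} {buckets[level]}  {v:+.2f}")

# Toy usage with synthetic activations; a real run would record c_t from the trained LSTM.
visualize_cell('if (x) {', np.tanh(np.linspace(-2.0, 2.0, 8)))
```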
Key Finding 3: Gate Behavior Insights
How do gates manage this long-term memory? The team studied gate saturation—how often gates are fully open (>0.9) or closed (<0.1).
Points near bottom-right = often open; top-left = often closed.
Findings:
- Forget Gate Integrators: Many LSTM forget gates are frequently open, passing memory unchanged—ideal for long-term tracking.
- No Pure Feed-Forward Units: No cells had a forget gate that was always closed, which would have made them behave like feed-forward units.
- First-Layer Diffusion: In both LSTMs and GRUs, layer-1 gates were less saturated than deeper layers, suggesting it primarily extracts features while later layers make longer-term decisions.
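The saturation statistics themselves are easy to compute once gate activations are recorded; here is a sketch under an assumed (time_steps, num_units) layout, not the paper's code.

```python
import numpy as np

def saturation_fractions(gate_activations, lo=0.1, hi=0.9):
    """Per gate unit, the fraction of time steps it is nearly closed (< lo) or nearly open (> hi).
    gate_activations: array of shape (time_steps, num_units) with values in (0, 1)."""
    left = (gate_activations < lo).mean(axis=0)    # "left-saturated": often closed
    right = (gate_activations > hi).mean(axis=0)   # "right-saturated": often open
    return left, right
```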
Key Finding 4: Quantifying Long-Range Power
To measure the long-range advantage, LSTMs were compared to finite-horizon baselines:
- n-gram models – predict from the last \(n\) characters only (a toy counting sketch follows this list).
- n-NN models – feed last \(n\) characters to a small feedforward neural net.
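Here is a toy, unsmoothed version of the n-gram baseline mentioned above, counting which character follows each length-\(n\) context; a real baseline would need smoothing to handle unseen contexts.

```python
from collections import Counter, defaultdict

def train_ngram(text, n):
    """Count which character follows each context of the last n characters."""
    counts = defaultdict(Counter)
    for t in range(n, len(text)):
        counts[text[t - n:t]][text[t]] += 1
    return counts

def predict_next(counts, context):
    """Most likely next character given the last n characters, or None for an unseen context."""
    dist = counts.get(context)
    return dist.most_common(1)[0][0] if dist else None
```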
Even large n-gram models can’t match LSTM performance, despite huge size (3GB vs. 11MB).
Where does the edge come from?
LSTM excels on structural, long-range characters (brackets, whitespace, carriage returns).
Case Study: Closing Brace }
A closing brace’s matching { could be hundreds of characters away.
LSTM maintains advantage for distances up to 60+ characters; 20-gram flattens at baseline.
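As a hedged helper for this kind of analysis, the sketch below measures the distance from each closing brace to its matching opening brace. It is naive (it ignores braces inside strings and comments), which real kernel code would require handling.

```python
def brace_distances(source):
    """Distance, in characters, from each '}' to its matching '{'."""
    stack, distances = [], []
    for pos, ch in enumerate(source):
        if ch == "{":
            stack.append(pos)
        elif ch == "}" and stack:
            distances.append(pos - stack.pop())
    return distances

# The closing brace here sits 18 characters after its matching opening brace.
print(brace_distances("if (err) { return -EINVAL; }"))   # -> [18]
```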
Training dynamics show the LSTM’s competence grows over time: early in training its predictions resemble those of the 1-NN baseline, then the 2-NN and 3-NN baselines, and so on, suggesting it learns short-term patterns first and then extends its reach.
Key Finding 5: Peeling the Onion of Errors
Even top-performing LSTMs make many mistakes (an error here is a character to which the model assigns probability below 0.5). The authors systematically peeled away these errors using perfect “oracles” for specific situations, removing each category in turn:
Error categories: short-term (n-gram), dynamic memory, rare words, start-of-word prediction, punctuation, miscellaneous.
The largest categories, with their share of total errors:
- n-gram errors (18%) – fixable by looking at only the last 9 characters.
- Dynamic memory errors (6%) – failures to repeat phrases seen recently in the same document.
- Rare word errors (9%) – poor predictions on words that appear infrequently in the training set.
- Word-start errors (37%) – the hardest category: predicting the first character of a new word after a space or newline.
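The peeling procedure itself can be sketched as a simple attribution loop. The oracle predicates below are placeholders standing in for the paper's actual oracles, and an "error" record would carry whatever context each oracle needs.

```python
def peel_errors(errors, oracles):
    """Attribute each error to the first oracle (in order) that would have fixed it.
    `oracles` is an ordered list of (name, fixes_error) pairs, where fixes_error is a
    predicate over an error record; whatever no oracle fixes lands in 'remaining'."""
    buckets = {name: [] for name, _ in oracles}
    buckets["remaining"] = []
    for err in errors:
        for name, fixes_error in oracles:
            if fixes_error(err):
                buckets[name].append(err)
                break
        else:                                # no oracle fixed this error
            buckets["remaining"].append(err)
    return buckets
```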
Scaling the model 26× eliminated most short-term errors but left rare word and dynamic memory errors unchanged. This shows that simply making models bigger won’t solve deeper challenges—architectural innovations are needed.
Conclusion: Demystifying LSTMs
“Visualizing and Understanding Recurrent Networks” is a landmark study because it moves beyond accuracy metrics to actually understand the representations and mechanisms inside RNNs.
Key takeaways:
- LSTMs learn highly interpretable, long-range tracking cells.
- Their advantage over finite-horizon models is both visualizable and quantifiable.
- They learn progressively: short-term patterns first, then expand to longer dependencies.
- Scaling up fixes local errors but doesn’t solve rare word or dynamic memory failures.
By breaking down error sources, this work points directly to weaknesses in current architectures—fueling developments like Memory Networks and Attention Mechanisms designed to overcome these limitations.
It’s a reminder that in deep learning, interpretability and understanding are as crucial as raw performance.