Recurrent Neural Networks (RNNs) are the workhorses of sequence modeling. From predicting the next word in a sentence to forecasting stock prices, their ability to maintain an internal state, or “memory,” makes them uniquely suited for tasks where context is key. However, traditional RNNs have a notoriously short memory. When faced with long sequences, they struggle with a problem called the vanishing gradient, where the influence of past events fades away too quickly during training.

In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced a groundbreaking architecture called Long Short-Term Memory (LSTM) to solve this very problem. By using a clever system of gates and a dedicated memory cell, LSTMs could successfully learn dependencies spanning hundreds or even thousands of time steps, revolutionizing fields like natural language processing and time-series analysis.

But even LSTMs had a hidden weakness. While they excelled at processing individual, well-defined sequences, they faltered when faced with a continual, unending stream of data. In these scenarios, their internal memory could grow uncontrollably, leading to a kind of computational breakdown. The very mechanism that gave them their power—the ability to hold onto information—became a liability.

This brings us to a seminal 2000 paper by Felix Gers, Jürgen Schmidhuber, and Fred Cummins, “Learning to Forget: Continual Prediction with LSTM.” The researchers identified this critical limitation and proposed an elegant and powerful solution: adding a forget gate. This simple addition enabled an LSTM cell to learn not just when to store information, but also when to discard it. It taught the network how to reset itself, paving the way for robust learning on continuous, unsegmented data streams.

In this article, we’ll dive deep into this paper, exploring how this one modification transformed the LSTM—and why learning to forget is just as important as learning to remember.


A Quick Refresher on Standard LSTMs

Before we see what the authors changed, let’s quickly recap how a standard LSTM works. The core idea is to have a dedicated memory cell that can maintain a constant state over time, protected by gates that control the flow of information.

Diagram of an LSTM memory block showing input gate, output gate, and memory cell with its self-recurrent connection. The forget gate, in a dashed box, is the key addition proposed in the paper.

Figure 1: The standard LSTM cell structure with input and output gates and a self-recurrent connection of weight 1.0. The dashed box marks the forget gate introduced later.

As shown above, a standard LSTM memory block has a few key components:

  1. Memory Cell (\(s_c\)) — the heart of the LSTM. It’s a linear unit whose self-recurrent connection, fixed at weight 1.0, forms the Constant Error Carousel (CEC). This connection lets error signals flow back through time without vanishing, enabling long-term memory.
  2. Input Gate (\(y^{in}\)) — controls how much of the new input information should enter the memory cell.
  3. Output Gate (\(y^{out}\)) — regulates how much of the cell’s state should affect the network’s output at the current time step.

The cell state updates by adding gated new information to its previous state:

\[
s_c(t) = s_c(t-1) + y^{in}(t)\, g\big(\mathrm{net}_c(t)\big)
\]

Equation 1: In the standard LSTM, the cell state is updated additively: the input-gated, squashed cell input \(g(\mathrm{net}_c(t))\) is added to the previous state, maintaining long-term consistency.

This addition allows persistent memory but also opens the door to uncontrolled growth of the cell state over continual input streams.
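
To make the update concrete, here is a minimal NumPy sketch of one time step of a single memory cell (no forget gate yet). The weight names, the scalar per-cell state, and the use of tanh as the squashing functions are illustrative assumptions rather than the paper’s exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def standard_lstm_cell_step(x, s_prev, W_in, W_out, W_c):
    """One time step of a standard LSTM memory cell (input and output gates only)."""
    y_in = sigmoid(W_in @ x)     # input gate: how much new information enters
    y_out = sigmoid(W_out @ x)   # output gate: how much of the state is exposed
    g = np.tanh(W_c @ x)         # squashed candidate input g(net_c)

    # Constant Error Carousel: the previous state is carried over with weight 1.0,
    # so the update is purely additive (Equation 1).
    s_c = s_prev + y_in * g
    y_c = y_out * np.tanh(s_c)   # cell output after the output squashing
    return s_c, y_c
```

Notice that nothing in this update ever scales `s_prev` down.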


The Problem: Uncontrolled Memory Growth

That additive update rule, \(s_c(t) = s_c(t-1) + \dots\), is the key to LSTM’s memory—but it’s also its Achilles’ heel in continual learning scenarios.

Imagine feeding an LSTM a never-ending stream of text. The network has no “reset” signal between sentences or documents. With each time step, something new is added to the cell state. Over time, the value of \(s_c\) can grow without bound.

This leads to two major issues:

  1. Saturation: The cell output \(y^c\) is computed through a nonlinear squashing function (like tanh). As \(s_c\) grows excessively large, the function saturates, its derivative approaches zero, and the flow of gradients stops; the cell stops learning. The short numeric sketch after this list illustrates the effect.
  2. Loss of Function: Once saturated, the memory cell behaves like a simple feedforward unit, losing its ability to retain meaningful history.
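
A quick numeric sketch of the first failure mode, assuming a tanh output squashing and a small constant positive increment per step (purely for illustration):

```python
import numpy as np

s_c = 0.0
for t in range(1, 201):
    s_c += 0.5                 # the additive update keeps accumulating
    y_c = np.tanh(s_c)         # the output squashing saturates near 1.0
    grad = 1.0 - y_c ** 2      # derivative of tanh through the output path
    if t in (1, 10, 50, 200):
        print(f"t={t:3d}  s_c={s_c:6.1f}  y_c={y_c:.6f}  d tanh/d s_c={grad:.2e}")
```

Within a few dozen steps the derivative underflows to zero, so no useful gradient flows through the cell output anymore.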

Early LSTM implementations worked around this by manually resetting the cell states to zero at the start of every new sequence. But in many real-world tasks, data streams are continuous and unsegmented. How can a network learn to reset its memory on its own?


The Solution: The Forget Gate

The authors’ solution is brilliantly simple: give the LSTM cell the ability to control its own memory reset.

They achieved this by adding a forget gate, denoted \(y^{\varphi}\). Like the input and output gates, it is a unit with a logistic sigmoid activation whose output lies between 0 and 1:

\[
y^{\varphi}(t) = f_{\varphi}\big(\mathrm{net}_{\varphi}(t)\big), \qquad \mathrm{net}_{\varphi}(t) = \sum_m w_{\varphi m}\, y^m(t-1)
\]

Equation 2: The forget gate activation is computed like the input and output gates: a logistic squashing \(f_{\varphi}\) applied to the weighted sum of the gate’s incoming activations \(y^m(t-1)\).

The forget gate modifies the recurrent connection within the cell, turning the fixed self-recurrent weight of 1.0 into a dynamic, input-dependent multiplier. The new update rule becomes:

\[
s_c(t) = y^{\varphi}(t)\, s_c(t-1) + y^{in}(t)\, g\big(\mathrm{net}_c(t)\big)
\]

Equation 3: Extended LSTM cell update; the previous state is multiplied by the forget gate activation.

Here’s what happens:

  • If \(y^{\varphi} = 1\), the cell remembers everything (behaves like a standard LSTM).
  • If \(y^{\varphi} = 0\), the cell forgets everything—its state is reset.
  • If \(y^{\varphi}\) is between 0 and 1, the cell performs a gradual decay.

This mechanism gives the network fine-grained control over information retention and removal. It can learn to forget rhythmically or in response to specific patterns in the input stream.
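
In code, the change amounts to one extra gate and one extra multiplication. Extending the earlier NumPy sketch (same illustrative assumptions as before):

```python
def extended_lstm_cell_step(x, s_prev, W_in, W_out, W_c, W_fgt):
    """One time step of an extended LSTM memory cell with a forget gate."""
    y_in = sigmoid(W_in @ x)      # input gate
    y_out = sigmoid(W_out @ x)    # output gate
    y_fgt = sigmoid(W_fgt @ x)    # forget gate y^phi, always in (0, 1)
    g = np.tanh(W_c @ x)          # squashed candidate input

    # Equation 3: the fixed self-recurrent weight of 1.0 becomes a dynamic
    # multiplier. y_fgt near 1 remembers; y_fgt near 0 resets the state.
    s_c = y_fgt * s_prev + y_in * g
    y_c = y_out * np.tanh(s_c)
    return s_c, y_c
```

In practice, forget-gate biases (omitted in this sketch) are typically initialized to positive values, so the gate starts close to 1 and the cell initially behaves like a standard LSTM, only learning to forget when the data calls for it.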

An added benefit lies in training efficiency. As the authors noted: “When the forget gate activation goes to zero, not only the cell’s state but also the partial derivatives are reset—forgetting includes forgetting previous mistakes.” When it forgets, it truly moves on.


Experiments: Putting Forgetting to the Test

To demonstrate this capability, the authors tested their “extended LSTM” on continual versions of classic RNN benchmarks—cases specifically designed to make ordinary LSTMs fail.

Task 1: Continual Embedded Reber Grammar (CERG)

The Embedded Reber Grammar (ERG) is a well-known challenge for recurrent networks. It involves predicting the next symbol in a sequence generated by a finite-state graph, requiring memory of distant dependencies.

Transition diagrams for standard and embedded Reber grammars. The continual version loops the end back to the start for an infinite stream.

Figure 2: The continual Reber grammar extends the standard finite grammar into a loop, removing explicit sequence boundaries.

The researchers created a continual version (CERG) by concatenating endless Reber grammar strings into one continuous input stream—no explicit start, no explicit end. The network had to predict the next symbol indefinitely.
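
A rough sketch of how such a stream can be assembled for next-symbol prediction, assuming a hypothetical generate_erg_string() that yields one embedded Reber string at a time (the grammar generator itself is omitted here):

```python
SYMBOLS = ["B", "T", "P", "S", "X", "V", "E"]       # the seven Reber symbols
SYM_TO_IDX = {s: i for i, s in enumerate(SYMBOLS)}

def continual_stream(generate_erg_string):
    """Concatenate embedded Reber strings into one endless symbol stream;
    strings follow one another with no reset signal in between."""
    while True:
        for symbol in generate_erg_string():
            yield SYM_TO_IDX[symbol]

def prediction_pairs(stream):
    """Turn the stream into (current symbol, next symbol) training pairs."""
    current = next(stream)
    for nxt in stream:
        yield current, nxt
        current = nxt

# Usage with some embedded-Reber generator (not shown):
# pairs = prediction_pairs(continual_stream(generate_erg_string))
# first_ten = [next(pairs) for _ in range(10)]
```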

The extended LSTM was implemented with four memory blocks (each with two cells) in a single hidden layer, all fully connected to input and output units.

Network topology used in experiments, showing two of four memory blocks.

Figure 3: Experiment network architecture, an extended LSTM with input, hidden (memory blocks), and output layers.
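
The paper’s block structure of four memory blocks with two cells each, sharing gates within a block, is not something modern libraries expose directly. Still, a rough modern analogue of the continual training setup can be sketched with torch.nn.LSTM (which includes forget gates by default); the layer sizes, optimizer, and stand-in data generator here are illustrative assumptions, and the essential detail is that the hidden state is carried across chunks rather than being reset between them:

```python
import torch
import torch.nn as nn

class NextSymbolPredictor(nn.Module):
    """Rough modern analogue: a recurrent layer with forget gates plus a linear readout."""
    def __init__(self, n_symbols=7, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, n_symbols)
        self.lstm = nn.LSTM(n_symbols, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, n_symbols)

    def forward(self, symbols, state=None):
        h = self.embed(symbols)               # (batch, time, n_symbols)
        out, state = self.lstm(h, state)      # state is passed in, never zeroed
        return self.readout(out), state

def stream_of_chunks(n_chunks=1000, chunk_len=50):
    """Stand-in for the continual symbol stream, cut into training chunks.
    Real inputs would come from the prediction_pairs stream sketched above."""
    for _ in range(n_chunks):
        idx = torch.randint(0, 7, (1, chunk_len + 1))
        yield idx[:, :-1], idx[:, 1:]         # predict the next symbol at every step

model = NextSymbolPredictor()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

state = None
for inputs, targets in stream_of_chunks():
    logits, state = model(inputs, state)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Detach rather than reset: the hidden state keeps flowing across chunks.
    state = tuple(s.detach() for s in state)
```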


Results: Forgetting Works

The results were striking.

Table 2: Results on the Continual Embedded Reber Grammar (CERG) task.

Table 2: Extended LSTM outperforms standard variants on continual Reber prediction. Forget gates and learning-rate decay yield the best outcomes.

Standard LSTM completely failed in the continual task—0% perfect solutions. In contrast, LSTM with forget gates solved the task in 18% of trials and reached 62% success when combined with learning-rate decay.

What’s happening inside the network? Visualizing cell states reveals how the forget gate learned to manage resets internally.

Standard LSTM internal states drift upward during continual input until saturation and failure.

Figure 4: In standard LSTM, internal cell states grow without bound as new sequences accumulate, leading to eventual saturation.

As the input stream continues, cell activations in standard LSTM steadily increase, eventually breaking the network. The extended LSTM behaves very differently.

Top: Internal cell states remain bounded and oscillate around zero. Bottom: forget gate activations drop to zero at task boundaries.

Figure 5: Extended LSTM learns self-resets. Forget gate activations drop at the end of each embedded grammar string, keeping cell states bounded and efficient.

Here, activations remain within a tight range. The forget gate stays close to 1.0 for most of a string, maintaining memory within the sequence. Near the end of a string, it drops sharply toward zero, resetting the state. The network has learned to recognize internal boundaries and clear its memory before starting a new segment.
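
This kind of inspection is straightforward to reproduce with the earlier cell sketch: record the forget gate activation and the cell state at every step and look for the dips. The snippet below uses random, untrained weights, so its trace is just noise; with trained weights on the Reber stream it would show the near-1-then-drop pattern described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative fixed weights for a single extended cell with 4-dimensional input.
W_in, W_fgt, W_c = rng.normal(size=(3, 4))

s_c, trace = 0.0, []
for x in rng.normal(size=(200, 4)):            # stand-in for the input stream
    y_fgt = sigmoid(W_fgt @ x)                 # forget gate activation
    s_c = y_fgt * s_c + sigmoid(W_in @ x) * np.tanh(W_c @ x)
    trace.append((y_fgt, s_c))                 # log what plots like Figure 5 show

# Plotting the two columns of `trace` over time is how figures like Figure 5 are
# produced: forget gate hovering near 1 within a string, dipping toward 0 at
# string boundaries, and the cell state staying bounded.
```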

Further analysis showed functional specialization across memory blocks: some handled long-term dependencies, while others focused on short-term transitions.

Internal state and forget gate activations for a short-term memory block.

Figure 6: A shorter-term memory block also learns self-resets, but on a faster timescale.

Together, they formed a hierarchy of learned temporal behaviors—each governed by its respective forget gate.


Task 2: Continual Noisy Temporal Order (CNTO)

To ensure the results generalized beyond symbolic grammars, the researchers applied the method to another challenging benchmark: the Noisy Temporal Order (NTO) task. Here, the network classifies long sequences based on the positions and order of rare symbols appearing amid noise. The continual version (CNTO) concatenated many such sequences into an unending stream.
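
For intuition about what the network sees, here is a toy generator in the spirit of that description; the distractor alphabet, sequence length, and class encoding are illustrative guesses rather than the paper’s exact setup:

```python
import random

NOISE = list("abcd")     # illustrative distractor symbols
RELEVANT = ["X", "Y"]    # rare symbols whose order determines the class

def temporal_order_sequence(length=100):
    """One toy sequence: two relevant symbols hidden among noise; the target is
    the ordered pair of those symbols (XX, XY, YX, YY -> classes 0..3)."""
    seq = [random.choice(NOISE) for _ in range(length)]
    p1, p2 = sorted(random.sample(range(length), 2))
    first, second = random.choice(RELEVANT), random.choice(RELEVANT)
    seq[p1], seq[p2] = first, second
    label = 2 * RELEVANT.index(first) + RELEVANT.index(second)
    return seq, label

# The continual version (CNTO) strings many such sequences together with no
# reset signal, so the network must clear its own memory after each decision.
samples = [temporal_order_sequence() for _ in range(5)]
```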

Table 3: Results on the Continual Noisy Temporal Order (CNTO) task.

Table 3: Standard LSTM fails on CNTO; extended LSTM with forget gates succeeds and improves further with learning-rate decay.

Once again, standard LSTM failed completely—unable to recover from memory saturation. The forget-gate-equipped networks succeeded, and with sequential learning-rate decay, even surpassed the original non-continual performance benchmarks.


Discussion: Why Forgetting Matters

The extended LSTM’s success highlights a profound principle: learning requires forgetting.

Without forgetting, continual learning causes internal representations to spiral into saturation. The forget gate transforms LSTM from a sequence processor into a stream learner—capable of managing infinite, dynamically segmented data.

Key takeaways:

  • Forgetting is essential: Continuous information streams demand periodic memory resets to prevent saturation and maintain effective learning.
  • Learned control beats manual resets: Forget gates replace external segmentation or fixed decay constants with adaptive, data-driven self-management.
  • Hierarchical decomposition emerges naturally: Memory blocks can solve subtasks, then clear their states via forgetting. This supports dynamic, multi-scale temporal structure.

Conclusion: A Legacy of Elegance and Impact

The introduction of the forget gate was more than a minor upgrade—it was a fundamental conceptual leap. By allowing networks to manage their internal resources and autonomously reset memory, Gers, Schmidhuber, and Cummins solved a problem that appeared only when models became powerful enough to need it.

Today, nearly every LSTM implementation in frameworks such as TensorFlow and PyTorch includes forget gates by default. What began as a single additional multiplication term now underpins state-of-the-art systems processing streaming video, text, sensor data, and more.
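
One quick way to see this default in PyTorch: the input-to-hidden weight matrix of nn.LSTM stacks four gate blocks (input, forget, cell, and output), so its first dimension is 4 × hidden_size:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=8)
# Gates are stacked as (input, forget, cell, output) -> 4 * hidden_size rows.
print(lstm.weight_ih_l0.shape)   # torch.Size([32, 10])
```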

The story of “Learning to Forget” is a beautiful reminder: intelligence isn’t just about remembering—it’s also about knowing what to let go.