Imagine teaching an AI to recognize different animals. You first show it thousands of pictures of cats, and it gets really good at identifying them. Next, you teach it to recognize dogs. But when you ask it to identify a cat again, it struggles. It seems that in learning about dogs, it has forgotten what a cat looks like. This phenomenon, known as catastrophic forgetting, is one of the biggest hurdles in building truly intelligent, adaptive AI systems.
Traditional deep learning models are trained offline, assuming all the data is available at once. But in the real world, information often arrives in a continuous stream. An AI system must learn new concepts incrementally without overwriting previous knowledge—this is the goal of Continual Learning (CL).
A common and effective strategy to overcome catastrophic forgetting is Experience Replay (ER). The idea is straightforward: store a few examples from past tasks in a memory buffer and “replay” them alongside new data during training. While ER helps preserve previous knowledge, it struggles when the memory buffer is very small. In those cases, the model can overfit to the few replayed samples and fail to generalize, resulting in renewed forgetting.
This is where the paper “Continual Learning with Strong Experience Replay” by Zhuo et al. introduces a powerful alternative: Strong Experience Replay (SER). By adding two complementary consistency losses, SER helps models retain past knowledge more effectively—especially under tight memory constraints.
Let’s explore how it works.
The Stability–Plasticity Dilemma
At the heart of Continual Learning lies a fundamental trade-off—the stability–plasticity dilemma.
- Plasticity refers to the model’s ability to learn new information and adapt to fresh tasks.
- Stability preserves the model’s memory of prior tasks, preventing overwriting of learned features.
The ideal continual learning model must balance both. Too much plasticity leads to catastrophic forgetting; too much stability makes the model rigid and unable to adapt.
Experience Replay addresses this tension by combining a classification loss on the current data (for plasticity) with a classification loss on replayed memory data (for stability). Formally, for task \(t\):
\[ \mathcal{L} = \mathcal{L}_{cls}^t + \mathcal{L}_{cls}^m \]
This baseline is effective but limited: it relies only on stored labels to maintain stability. What if we also preserved the prediction behavior of the previous model?
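In code, this baseline amounts to two cross-entropy terms. Here is a minimal PyTorch-style sketch; `model` and the batch names `x_t, y_t, x_m, y_m` are illustrative, not taken from the paper’s implementation:

```python
import torch.nn.functional as F

def er_loss(model, x_t, y_t, x_m, y_m):
    """Vanilla Experience Replay: cross-entropy on the current-task
    batch (plasticity) plus cross-entropy on a batch replayed from
    the memory buffer (stability)."""
    loss_cls_t = F.cross_entropy(model(x_t), y_t)  # L_cls^t
    loss_cls_m = F.cross_entropy(model(x_m), y_m)  # L_cls^m
    return loss_cls_t + loss_cls_m
```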
The Core Idea: Strong Experience Replay (SER)
The authors propose going beyond replayed data labels by aligning the prediction distributions of the old and current models. Whenever the model is updated from parameters \(\theta_{t-1}\) to \(\theta_t\), SER ensures that the newer version produces predictions consistent with its earlier self. It does this through two complementary mechanisms: backward consistency and forward consistency.

Figure 1: A conceptual comparison between backward and forward consistency. Backward consistency uses past data stored in memory, while forward consistency regularizes training using the entire current dataset.
1. Backward Consistency — Distilling Past Experiences
Backward consistency ensures that for samples stored in the memory buffer \(\mathcal{M}\), the updated model’s outputs remain similar to those of the old model. It’s essentially a form of knowledge distillation that transfers learned representations forward in time.
\[ \mathcal{L}_{bc}^{m} = \mathbb{E}_{x \sim \mathcal{M}}[\|f(x; \theta_t) - f(x; \theta_{t-1})\|^2] \]
In practice, the old model’s logits are stored alongside the samples and labels inside the memory buffer to avoid recomputation. This loss helps preserve earlier “experience,” preventing severe drift in the learned representations.
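A minimal sketch of this term, assuming (as the paper describes) that the buffer stores the old model’s logits `z_m` next to each sample, so the previous model never has to be re-run on memory data:

```python
import torch.nn.functional as F

def backward_consistency_loss(model, x_m, z_m):
    """Distill stored past behavior: match the current model's
    outputs on buffered samples x_m to the logits z_m recorded
    when those samples entered the buffer."""
    return F.mse_loss(model(x_m), z_m)
```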
2. Forward Consistency — The Key Innovation
While backward consistency helps retain old information, it depends exclusively on the small set of samples stored in memory. With limited buffer size, it can lead to overfitting.
To overcome this bottleneck, SER introduces forward consistency—a novel idea that enforces alignment between the current and previous models on the incoming training data:
\[ \mathcal{L}_{fc}^{t} = \mathbb{E}_{x \sim \mathcal{D}_t}[\|f(x; \theta_t) - f(x; \theta_{t-1})\|^2] \]
This is called forward consistency because the current data \(\mathcal{D}_t\) represents “future experiences” from the perspective of the old model. The previous network is frozen and acts as a stable anchor during optimization. By enforcing similar outputs on future inputs, SER prevents drastic changes and improves generalization to both old and new tasks.
The most striking feature of forward consistency is that it leverages all current training data, not just the limited memory buffer—providing a broader, stronger regularization signal.
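A sketch of the forward-consistency term; `old_model` is the frozen copy of \(\theta_{t-1}\), and the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def forward_consistency_loss(model, old_model, x_t):
    """Keep the current model close to the frozen previous model on
    the new task's data, which the old model has never trained on."""
    with torch.no_grad():           # theta_{t-1} is a fixed anchor
        z_old = old_model(x_t)
    return F.mse_loss(model(x_t), z_old)
```

Note that gradients flow only through the current model; the frozen network supplies targets, not trainable parameters.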
3. The Full SER Objective
Combining these insights, SER’s total training objective integrates four losses:
- Classification loss on current data (\(\mathcal{L}_{cls}^t\)) – encourages learning of new tasks.
- Classification loss on memory data (\(\mathcal{L}_{cls}^m\)) – reinforces old knowledge.
- Backward consistency loss (\(\mathcal{L}_{bc}^m\)) – preserves learned representations from past experiences.
- Forward consistency loss (\(\mathcal{L}_{fc}^t\)) – regularizes new learning using knowledge of the previous model.
The overall formulation is:
\[ \mathcal{L} = \mathcal{L}_{cls}^t + \mathcal{L}_{cls}^m + \alpha \mathcal{L}_{bc}^m + \beta \mathcal{L}_{fc}^t \]
where \(\alpha\) and \(\beta\) balance the consistency terms.
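Putting the pieces together, here is a PyTorch-style sketch of the full objective; the default values of `alpha` and `beta` are placeholders, since the paper tunes these hyperparameters per benchmark:

```python
import torch
import torch.nn.functional as F

def ser_loss(model, old_model, x_t, y_t, x_m, y_m, z_m,
             alpha=1.0, beta=1.0):
    """Full SER objective: two classification terms plus the
    backward and forward consistency terms."""
    logits_t = model(x_t)                       # one pass on current data
    logits_m = model(x_m)                       # one pass on memory data
    loss = F.cross_entropy(logits_t, y_t)             # L_cls^t (plasticity)
    loss = loss + F.cross_entropy(logits_m, y_m)      # L_cls^m (stability)
    loss = loss + alpha * F.mse_loss(logits_m, z_m)   # L_bc^m
    with torch.no_grad():                       # frozen theta_{t-1}
        z_old = old_model(x_t)
    loss = loss + beta * F.mse_loss(logits_t, z_old)  # L_fc^t
    return loss
```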

Figure 2: Architecture of the SER framework. Four loss components jointly promote plasticity (learning new tasks) and stability (preserving old knowledge).
Training Procedure
SER’s implementation is clean and straightforward. During each training iteration (a code sketch follows Algorithm 1 below):
- Sample one batch of current task data \(\mathcal{D}_t\).
- Sample one batch from the memory buffer \(\mathcal{M}\).
- Compute classification losses on both batches.
- Compute backward consistency loss on memory data.
- Compute forward consistency loss on current data using frozen parameters \(\theta_{t-1}\).
- Update \(\theta_t\) with stochastic gradient descent (SGD).
- Update the memory buffer with reservoir sampling, so every example seen so far has an equal chance of being stored.

Algorithm 1: Step-by-step process for training with Strong Experience Replay.
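A compact sketch of this loop, reusing the `ser_loss` function from the earlier sketch; `sample_batch` is a hypothetical helper that collates a batch of (samples, labels, stored logits) from the buffer:

```python
import random
import torch

def reservoir_update(buffer, n_seen, item, capacity):
    """Reservoir sampling: after n_seen stream items, each one has
    had an equal probability of occupying a buffer slot."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randint(0, n_seen)   # inclusive upper bound
        if j < capacity:
            buffer[j] = item
    return n_seen + 1

def train_task(model, old_model, loader, buffer, optimizer,
               n_seen, capacity, alpha, beta):
    old_model.eval()                    # frozen snapshot of theta_{t-1}
    for x_t, y_t in loader:
        x_m, y_m, z_m = sample_batch(buffer)   # hypothetical helper
        loss = ser_loss(model, old_model, x_t, y_t, x_m, y_m, z_m,
                        alpha=alpha, beta=beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                # SGD step on theta_t
        with torch.no_grad():           # record logits for later replay
            z_t = model(x_t)
        for x, y, z in zip(x_t, y_t, z_t):
            n_seen = reservoir_update(buffer, n_seen, (x, y, z), capacity)
    return n_seen
```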
SER avoids heavy computation: only one extra forward pass through the frozen model is needed per batch. Unlike complex architectures such as CLS-ER that use multiple models simultaneously, SER trains efficiently with virtually no additional overhead.
Experiments: Putting SER to the Test
The paper evaluates SER across five benchmark datasets—CIFAR-10, CIFAR-100, TinyImageNet, Permuted MNIST, and Rotated MNIST—under three continual learning settings:
- Class-Incremental Learning (Class-IL): Learn disjoint sets of classes over time without task identity at test time.
- Task-Incremental Learning (Task-IL): Task identity is known during inference, making it a simpler scenario.
- Domain-Incremental Learning (Domain-IL): Same classes throughout, but changing input distributions.
Performance is measured with two metrics:
- Average Accuracy: the mean accuracy over all tasks, measured after training on the final task.
- Average Forgetting: the average drop from each task’s best accuracy during training to its accuracy after the final task.
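Both metrics follow the standard continual-learning definitions and are easy to compute from an accuracy matrix; in the NumPy sketch below, the layout `acc[i, j]` (accuracy on task \(j\) after training through task \(i\)) is an assumption for illustration:

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after the final training stage."""
    return float(acc[-1].mean())

def average_forgetting(acc):
    """For each non-final task j, forgetting is the drop from its best
    accuracy at any earlier stage to its accuracy after the last task."""
    T = acc.shape[0]
    drops = [acc[j:T - 1, j].max() - acc[-1, j] for j in range(T - 1)]
    return float(np.mean(drops))
```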
Results That Speak for Themselves
CIFAR-100: A Clear Win
On CIFAR-100, split into 20 tasks with a tiny memory of just 200 samples (roughly two per class), SER achieves 24.35% accuracy in the challenging Class-IL setting—far exceeding 15.16% from DER++. That’s over a 60% relative improvement.

Table 1: Classification results on CIFAR-100. SER consistently outperforms previous methods, particularly in low-memory scenarios.
To visualize the improvement over learning stages, the following figures show average accuracy across increasing task counts:

Figure 3: Average accuracy on CIFAR-100 (10 tasks, memory size 200). SER maintains superior performance across learning stages.

Figure 4: Accuracy over 20 tasks. The advantage of SER grows with task number, showing long-term stability improvements.
TinyImageNet: The Ultimate Test
TinyImageNet pushes continual learning methods to their limits—with 200 classes and only one sample per class when using a buffer of size 200. Most previous approaches collapse under these conditions; SER, however, achieves 28.50% Class-IL accuracy, compared to DER++’s 10.96%.

Table 2: Performance comparisons across multiple benchmarks. The forward consistency loss proves invaluable when data is minimal.
Forgetting Analysis
SER also achieves the lowest forgetting among all compared methods—indicating that its higher accuracy stems from better knowledge retention.

Table 3: Average forgetting on CIFAR-10 and CIFAR-100. Lower is better—SER excels in both cases.
Ablation Study: Understanding the Gains
To pinpoint which components contribute most, the authors conducted an ablation study varying the loss terms used during training.

Table 4: Impact of each component on overall accuracy. Adding forward consistency yields the largest improvement.
Remarkably, combining ER with just the forward consistency loss (\(\mathcal{L}_{fc}^t\)) already surpasses DER++ on CIFAR-10. This confirms that the forward consistency mechanism is the core contributor to SER’s performance boost—leveraging the full current dataset for regularization makes it vastly more effective than memory-only approaches.
Visualizing Stability and Plasticity
The heatmaps below illustrate how well each method retains knowledge across tasks. Each cell shows accuracy on task \(j\) after training up to task \(i\). A bright diagonal means each new task is learned well; bright columns below the diagonal mean earlier tasks are retained.

Figure 5: Task-by-task accuracy after sequential learning on CIFAR-10. SER maintains higher accuracy on early tasks, confirming superior stability.
SER’s matrix retains brighter columns on early tasks (T1–T2), showing minimal decline even after training later tasks—a visual testament to its balanced stability and plasticity.
Computational Efficiency
Despite its additional loss terms, SER remains computationally efficient. It uses a single model and one frozen copy for consistency checks, unlike CLS-ER, which trains dual networks. Moreover, its batch sampling strategy is the same as DER++, incurring no additional data storage or sampling cost.
Final Thoughts
The Strong Experience Replay framework blends replay-based learning with consistency regularization in an elegant and simple way. By introducing forward consistency, it strengthens the stability-plasticity balance, drastically reducing forgetting even when replay memory is scarce.
Key Takeaways:
- Forward Consistency is the Breakthrough: Enforcing prediction alignment on current task data boosts generalization and memory retention beyond what backward replay alone can achieve.
- Simple Yet Powerful: SER adds minimal computational cost while delivering major accuracy improvements.
- A Step Toward Lifelong Learning: By mitigating catastrophic forgetting, SER paves the way for AI systems that adapt continuously without sacrificing past knowledge.
In essence, SER shows how we can teach models to learn new things without forgetting what they already know—bringing us closer to true lifelong learning capabilities in artificial intelligence.