Imagine teaching an AI to recognize different animals. You first show it thousands of pictures of cats, and it gets really good at identifying them. Next, you teach it to recognize dogs. But when you ask it to identify a cat again, it struggles. It seems that in learning about dogs, it has forgotten what a cat looks like. This phenomenon, known as catastrophic forgetting, is one of the biggest hurdles in building truly intelligent, adaptive AI systems.
Traditional deep learning models are trained offline, assuming all the data is available at once. But in the real world, information often arrives in a continuous stream. An AI system must learn new concepts incrementally without overwriting previous knowledge—this is the goal of Continual Learning (CL).
A common and effective strategy to overcome catastrophic forgetting is Experience Replay (ER). The idea is straightforward: store a few examples from past tasks in a memory buffer and “replay” them alongside new data during training. While ER helps preserve previous knowledge, it struggles when the memory buffer is very small. In those cases, the model can overfit to the few replayed samples and fail to generalize, resulting in renewed forgetting.
This is where the paper “Continual Learning with Strong Experience Replay” by Zhuo et al. introduces a powerful alternative: Strong Experience Replay (SER). By adding two complementary consistency losses, SER helps models retain past knowledge more effectively—especially under tight memory constraints.
Let’s explore how it works.
The Stability–Plasticity Dilemma
At the heart of Continual Learning lies a fundamental trade-off—the stability–plasticity dilemma.
- Plasticity refers to the model’s ability to learn new information and adapt to fresh tasks.
- Stability preserves the model’s memory of prior tasks, preventing overwriting of learned features.
The ideal continual learning model must balance both. Too much plasticity leads to catastrophic forgetting; too much stability makes the model rigid and unable to adapt.
Experience Replay addresses this tension by combining a classification loss on the current data (for plasticity) with a classification loss on replayed memory data (for stability). Formally, for task \(t\):
\[ \mathcal{L} = \mathcal{L}_{cls}^t + \mathcal{L}_{cls}^m \]
This baseline is effective but limited: it relies only on stored labels to maintain stability. What if we also preserved the prediction behavior of the previous model?
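In code, this baseline amounts to two cross-entropy terms. Here is a minimal PyTorch-style sketch; `model` and the batch names `x_t, y_t, x_m, y_m` are illustrative, not taken from the paper’s implementation:

```python
import torch.nn.functional as F

def er_loss(model, x_t, y_t, x_m, y_m):
    """Vanilla Experience Replay: cross-entropy on the current-task
    batch (plasticity) plus cross-entropy on a batch replayed from
    the memory buffer (stability)."""
    loss_cls_t = F.cross_entropy(model(x_t), y_t)  # L_cls^t
    loss_cls_m = F.cross_entropy(model(x_m), y_m)  # L_cls^m
    return loss_cls_t + loss_cls_m
```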
The Core Idea: Strong Experience Replay (SER)
The authors propose going beyond replayed data labels by aligning the prediction distributions of the old and current models. Whenever the model is updated from parameters \(\theta_{t-1}\) to \(\theta_t\), SER ensures that the newer version produces predictions consistent with its earlier self. It does this through two complementary mechanisms: backward consistency and forward consistency.

Figure 1: A conceptual comparison between backward and forward consistency. Backward consistency uses past data stored in memory, while forward consistency regularizes training using the entire current dataset.
1. Backward Consistency — Distilling Past Experiences
Backward consistency ensures that for samples stored in the memory buffer \(\mathcal{M}\), the updated model’s outputs remain similar to those of the old model. It’s essentially a form of knowledge distillation that transfers learned representations forward in time.
\[ \mathcal{L}_{bc}^{m} = \mathbb{E}_{x \sim \mathcal{M}}[\|f(x; \theta_t) - f(x; \theta_{t-1})\|^2] \]
In practice, the old model’s logits are stored alongside the samples and labels inside the memory buffer to avoid recomputation. This loss helps preserve earlier “experience,” preventing severe drift in the learned representations.
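A minimal sketch of this term, assuming (as the paper describes) that the buffer stores the old model’s logits `z_m` next to each sample, so the previous model never has to be re-run on memory data:

```python
import torch.nn.functional as F

def backward_consistency_loss(model, x_m, z_m):
    """Distill stored past behavior: match the current model's
    outputs on buffered samples x_m to the logits z_m recorded
    when those samples entered the buffer."""
    return F.mse_loss(model(x_m), z_m)
```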
2. Forward Consistency — The Key Innovation
While backward consistency helps retain old information, it depends exclusively on the small set of samples stored in memory. With limited buffer size, it can lead to overfitting.
To overcome this bottleneck, SER introduces forward consistency—a novel idea that enforces alignment between the current and previous models on the incoming training data:
\[ \mathcal{L}_{fc}^{t} = \mathbb{E}_{x \sim \mathcal{D}_t}[\|f(x; \theta_t) - f(x; \theta_{t-1})\|^2] \]
This is called forward consistency because the current data \(\mathcal{D}_t\) represents “future experiences” from the perspective of the old model. The previous network is frozen and acts as a stable anchor during optimization. By enforcing similar outputs on future inputs, SER prevents drastic changes and improves generalization to both old and new tasks.
The most striking feature of forward consistency is that it leverages all current training data, not just the limited memory buffer—providing a broader, stronger regularization signal.
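A sketch of the forward-consistency term; `old_model` is the frozen copy of \(\theta_{t-1}\), and the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def forward_consistency_loss(model, old_model, x_t):
    """Keep the current model close to the frozen previous model on
    the new task's data, which the old model has never trained on."""
    with torch.no_grad():           # theta_{t-1} is a fixed anchor
        z_old = old_model(x_t)
    return F.mse_loss(model(x_t), z_old)
```

Note that gradients flow only through the current model; the frozen network supplies targets, not trainable parameters.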
3. The Full SER Objective
Combining these insights, SER’s total training objective integrates four losses:
- Classification loss on current data (\(\mathcal{L}_{cls}^t\)) – encourages learning of new tasks.
- Classification loss on memory data (\(\mathcal{L}_{cls}^m\)) – reinforces old knowledge.
- Backward consistency loss (\(\mathcal{L}_{bc}^m\)) – preserves learned representations from past experiences.
- Forward consistency loss (\(\mathcal{L}_{fc}^t\)) – regularizes new learning using knowledge of the previous model.
The overall formulation is:
\[ \mathcal{L} = \mathcal{L}_{cls}^t + \mathcal{L}_{cls}^m + \alpha \mathcal{L}_{bc}^m + \beta \mathcal{L}_{fc}^t \]
where \(\alpha\) and \(\beta\) balance the consistency terms.
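Putting the pieces together, here is a PyTorch-style sketch of the full objective; the default values of `alpha` and `beta` are placeholders, since the paper tunes these hyperparameters per benchmark:

```python
import torch
import torch.nn.functional as F

def ser_loss(model, old_model, x_t, y_t, x_m, y_m, z_m,
             alpha=1.0, beta=1.0):
    """Full SER objective: two classification terms plus the
    backward and forward consistency terms."""
    logits_t = model(x_t)                       # one pass on current data
    logits_m = model(x_m)                       # one pass on memory data
    loss = F.cross_entropy(logits_t, y_t)             # L_cls^t (plasticity)
    loss = loss + F.cross_entropy(logits_m, y_m)      # L_cls^m (stability)
    loss = loss + alpha * F.mse_loss(logits_m, z_m)   # L_bc^m
    with torch.no_grad():                       # frozen theta_{t-1}
        z_old = old_model(x_t)
    loss = loss + beta * F.mse_loss(logits_t, z_old)  # L_fc^t
    return loss
```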

Figure 2: Architecture of the SER framework. Four loss components jointly promote plasticity (learning new tasks) and stability (preserving old knowledge).
Training Procedure
SER’s implementation is clean and straightforward. During each training iteration (a code sketch follows Algorithm 1 below):
- Sample one batch of current task data \(\mathcal{D}_t\).
- Sample one batch from the memory buffer \(\mathcal{M}\).
- Compute classification losses on both batches.
- Compute backward consistency loss on memory data.
- Compute forward consistency loss on current data using frozen parameters \(\theta_{t-1}\).
- Update \(\theta_t\) with stochastic gradient descent (SGD).
- Update the memory buffer with reservoir sampling, so every example seen so far has an equal chance of being stored.

Algorithm 1: Step-by-step process for training with Strong Experience Replay.
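A compact sketch of this loop, reusing the `ser_loss` function from the earlier sketch; `sample_batch` is a hypothetical helper that collates a batch of (samples, labels, stored logits) from the buffer:

```python
import random
import torch

def reservoir_update(buffer, n_seen, item, capacity):
    """Reservoir sampling: after n_seen stream items, each one has
    had an equal probability of occupying a buffer slot."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randint(0, n_seen)   # inclusive upper bound
        if j < capacity:
            buffer[j] = item
    return n_seen + 1

def train_task(model, old_model, loader, buffer, optimizer,
               n_seen, capacity, alpha, beta):
    old_model.eval()                    # frozen snapshot of theta_{t-1}
    for x_t, y_t in loader:
        x_m, y_m, z_m = sample_batch(buffer)   # hypothetical helper
        loss = ser_loss(model, old_model, x_t, y_t, x_m, y_m, z_m,
                        alpha=alpha, beta=beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                # SGD step on theta_t
        with torch.no_grad():           # record logits for later replay
            z_t = model(x_t)
        for x, y, z in zip(x_t, y_t, z_t):
            n_seen = reservoir_update(buffer, n_seen, (x, y, z), capacity)
    return n_seen
```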
SER avoids heavy computation: only one extra forward pass through the frozen model is needed per batch. Unlike complex architectures such as CLS-ER that use multiple models simultaneously, SER trains efficiently with virtually no additional overhead.
Experiments: Putting SER to the Test
The paper evaluates SER across five benchmark datasets—CIFAR-10, CIFAR-100, TinyImageNet, Permuted MNIST, and Rotated MNIST—under three continual learning settings:
- Class-Incremental Learning (Class-IL): Learn disjoint sets of classes over time without task identity at test time.
- Task-Incremental Learning (Task-IL): Task identity is known during inference, making it a simpler scenario.
- Domain-Incremental Learning (Domain-IL): Same classes throughout, but changing input distributions.
Performance is measured with two metrics:
- Average Accuracy: the mean accuracy over all tasks, measured after training on the final task.
- Average Forgetting: the average drop from each task’s best accuracy during training to its accuracy after the final task.
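Both metrics follow the standard continual-learning definitions and are easy to compute from an accuracy matrix; in the NumPy sketch below, the layout `acc[i, j]` (accuracy on task \(j\) after training through task \(i\)) is an assumption for illustration:

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after the final training stage."""
    return float(acc[-1].mean())

def average_forgetting(acc):
    """For each non-final task j, forgetting is the drop from its best
    accuracy at any earlier stage to its accuracy after the last task."""
    T = acc.shape[0]
    drops = [acc[j:T - 1, j].max() - acc[-1, j] for j in range(T - 1)]
    return float(np.mean(drops))
```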
Results That Speak for Themselves
CIFAR-100: A Clear Win
On CIFAR-100, split into 20 tasks with a tiny memory of just 200 samples (roughly two per class), SER achieves 24.35% accuracy in the challenging Class-IL setting—far exceeding 15.16% from DER++. That’s over a 60% relative improvement.

Table 1: Classification results on CIFAR-100. SER consistently outperforms previous methods, particularly in low-memory scenarios.
To visualize the improvement over learning stages, the following figures show average accuracy across increasing task counts:

Figure 3: Average accuracy on CIFAR-100 (10 tasks, memory size 200). SER maintains superior performance across learning stages.

Figure 4: Accuracy over 20 tasks. The advantage of SER grows with task number, showing long-term stability improvements.
TinyImageNet: The Ultimate Test
TinyImageNet pushes continual learning methods to their limits—with 200 classes and only one sample per class when using a buffer of size 200. Most previous approaches collapse under these conditions; SER, however, achieves 28.50% Class-IL accuracy, compared to DER++’s 10.96%.

Table 2: Performance comparisons across multiple benchmarks. The forward consistency loss proves invaluable when data is minimal.
Forgetting Analysis
SER also achieves the lowest forgetting among all compared methods—indicating that its higher accuracy stems from better knowledge retention.

Table 3: Average forgetting on CIFAR-10 and CIFAR-100. Lower is better—SER excels in both cases.
Ablation Study: Understanding the Gains
To pinpoint which components contribute most, the authors conducted an ablation study varying the loss terms used during training.

Table 4: Impact of each component on overall accuracy. Adding forward consistency yields the largest improvement.
Remarkably, combining ER with just the forward consistency loss (\(\mathcal{L}_{fc}^t\)) already surpasses DER++ on CIFAR-10. This confirms that the forward consistency mechanism is the core contributor to SER’s performance boost—leveraging the full current dataset for regularization makes it vastly more effective than memory-only approaches.
Visualizing Stability and Plasticity
The heatmaps below illustrate how well each method retains knowledge across tasks. Each cell shows accuracy on task \(j\) after training up to task \(i\). A bright diagonal means each new task is learned well; bright columns below the diagonal mean earlier tasks are retained.

Figure 5: Task-by-task accuracy after sequential learning on CIFAR-10. SER maintains higher accuracy on early tasks, confirming superior stability.
SER’s matrix retains brighter columns on early tasks (T1–T2), showing minimal decline even after training later tasks—a visual testament to its balanced stability and plasticity.
Computational Efficiency
Despite its additional loss terms, SER remains computationally efficient. It uses a single model and one frozen copy for consistency checks, unlike CLS-ER, which trains dual networks. Moreover, its batch sampling strategy is the same as DER++, incurring no additional data storage or sampling cost.
Final Thoughts
The Strong Experience Replay framework blends replay-based learning with consistency regularization in an elegant and simple way. By introducing forward consistency, it strengthens the stability-plasticity balance, drastically reducing forgetting even when replay memory is scarce.
Key Takeaways:
- Forward Consistency is the Breakthrough: Enforcing prediction alignment on current task data boosts generalization and memory retention beyond what backward replay alone can achieve.
- Simple Yet Powerful: SER adds minimal computational cost while delivering major accuracy improvements.
- A Step Toward Lifelong Learning: By mitigating catastrophic forgetting, SER paves the way for AI systems that adapt continuously without sacrificing past knowledge.
In essence, SER shows how we can teach models to learn new things without forgetting what they already know—bringing us closer to true lifelong learning capabilities in artificial intelligence.