Introduction: When AI Starts Forgetting
Imagine teaching a smart AI to recognize animals. First, you show it thousands of cat images—it becomes a cat expert. Next, you train it on dogs. It learns that task quickly. But when you show it a picture of a cat again, the AI looks puzzled. Somehow, it forgot what a cat looks like.
This frustrating phenomenon is called catastrophic forgetting, one of the biggest barriers to building adaptable AI that learns like humans. While people can pick up new skills without erasing old ones, neural networks tend to overwrite past knowledge when they learn tasks sequentially.
The field of Continual Learning (CL) tackles this problem by developing models that learn from an ongoing stream of information. One of the most promising approaches is memory replay, inspired by how the human brain consolidates memories. The idea is simple: store a small subset of past training data in a memory buffer and replay it while learning new tasks, reminding the model of previously learned information.
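To make the buffer idea concrete, here is a minimal sketch of a replay buffer filled by reservoir sampling, a common strategy in the ER literature. The class and method names are illustrative, not taken from the paper's code:

```python
import random

class ReplayBuffer:
    """Fixed-capacity memory filled by reservoir sampling, so every example
    seen so far has an equal chance of being kept in the buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []       # stored (x, y) pairs
        self.num_seen = 0    # total number of examples observed so far

    def add(self, x, y):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            # Overwrite a random slot with probability capacity / num_seen
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))
```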
However, simply replaying old data has drawbacks. It can cause overfitting on the limited memory samples and doesn’t guarantee that the model will generalize well to future tasks.
That’s where the new research paper “MGSER-SAM: Memory-Guided Soft Experience Replay with Sharpness-Aware Optimization for Enhanced Continual Learning” steps in. The authors propose a novel algorithm that helps models not only retain what they’ve learned but also develop more generalized representations. Their approach combines memory replay with an advanced optimization method called Sharpness-Aware Minimization (SAM), and then resolves conflicts between learning new knowledge and preserving old memories.
This post takes you through the key concepts behind the MGSER-SAM algorithm—from the fundamentals of continual learning to its powerful dual-gradient solution for mitigating forgetting.
The Challenge: Stability vs. Plasticity
At the heart of continual learning lies the stability–plasticity dilemma. A model must balance two competing traits:
- Plasticity: The ability to adapt quickly to new information.
- Stability: The ability to retain previously learned knowledge.
If the model is too plastic, it overwrites old information. If it’s too stable, it fails to learn new tasks.
To evaluate how well continual learning approaches handle this, researchers use three benchmark scenarios illustrated below.

Figure 1: Continual learning scenarios—Class-Incremental, Task-Incremental, and Domain-Incremental—show how models are tested across varied task sequences.
- Task-Incremental Learning (Task-IL): The model knows which task a test sample belongs to (e.g., “Given Task 1, is this a dog or a bird?”).
- Domain-Incremental Learning (Domain-IL): The tasks include the same classes but in different domains (e.g., dogs in real photos vs. cartoon drawings).
- Class-Incremental Learning (Class-IL): The model must classify across all previously seen classes without being told which task the input belongs to—the most difficult setting.
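To make these scenarios concrete, here is a hypothetical sketch of how a Class-IL stream is typically built by slicing one dataset into disjoint class groups; the 5-tasks-of-2-classes split shown is the common protocol for S-CIFAR10:

```python
def make_class_il_stream(dataset, num_tasks=5, classes_per_task=2):
    """Split a labeled dataset into a sequence of disjoint-class tasks.
    At test time in Class-IL, the model receives no task identifier."""
    tasks = []
    for t in range(num_tasks):
        task_classes = range(t * classes_per_task, (t + 1) * classes_per_task)
        tasks.append([(x, y) for (x, y) in dataset if y in task_classes])
    return tasks  # the model trains on tasks[0], then tasks[1], and so on
```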
Among the various approaches—regularization, architectural adaptation, and replay—this paper focuses on memory replay, one of the simplest and most effective families of methods for mitigating catastrophic forgetting.
Building on Experience Replay
The foundational replay method, Experience Replay (ER), trains a model simultaneously on new task data and a small batch sampled from its memory buffer. The total loss combines both:
\[ \mathcal{L}_{total} = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}_t}[l(f_{\theta}(\mathbf{x}), y)] + \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{B}}[l(f_{\theta}(\mathbf{x}), y)] \]

While ER is simple and effective, minimizing this empirical loss can drive the model toward sharp minima: narrow regions of the loss surface where small weight changes cause large jumps in loss. In contrast, flat minima, found at the bottoms of wide valleys, lead to better generalization and stability. Flattening the loss landscape is where Sharpness-Aware Minimization (SAM) excels.
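For reference, one plain ER training step might look like the following PyTorch-style sketch (illustrative, not the authors' implementation; it reuses the ReplayBuffer from earlier):

```python
import torch
import torch.nn.functional as F

def er_step(model, optimizer, current_batch, buffer, replay_batch_size=32):
    """One ER update: sum the loss on the current-task batch and on a
    batch drawn from the memory buffer, then take a single gradient step."""
    x, y = current_batch
    loss = F.cross_entropy(model(x), y)

    if buffer.data:  # replay only once the buffer has content
        mem = buffer.sample(replay_batch_size)
        x_mem = torch.stack([xm for xm, _ in mem])
        y_mem = torch.tensor([ym for _, ym in mem])
        loss = loss + F.cross_entropy(model(x_mem), y_mem)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```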
Step 1: Flattening the Loss with ER-SAM
The first improvement proposed in the paper is ER-SAM, which integrates SAM directly into Experience Replay.
SAM doesn’t just minimize the loss at the current weights \( \theta \); it minimizes the worst-case loss in a neighborhood around them. This is accomplished through a min–max optimization:
\[ \min_{\boldsymbol{\theta}} \max_{\|\boldsymbol{\delta}\|_2 \le \rho} L_{total}(\boldsymbol{\theta} + \boldsymbol{\delta}) \]

Here, \( \rho \) defines the radius of the neighborhood, and \( \boldsymbol{\delta} \) is the adversarial weight perturbation that maximizes the loss within that radius.

The SAM optimizer identifies the direction of maximum loss in the local neighborhood, helping the model move towards flatter minima.
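Solving the inner maximization exactly is intractable; following the original SAM paper, it is approximated with a first-order Taylor expansion, which yields a closed-form perturbation:

\[ \boldsymbol{\delta}^* \approx \rho \, \frac{\nabla_{\boldsymbol{\theta}} L_{total}(\boldsymbol{\theta})}{\|\nabla_{\boldsymbol{\theta}} L_{total}(\boldsymbol{\theta})\|_2} \]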
Once SAM identifies \( \delta^* \), it computes the gradient at \( \theta + \delta^* \) and updates \( \theta \) accordingly:

The weight update combines loss minimization with sharpness-awareness, improving generalization capability.
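Written out with learning rate \( \eta \), the standard SAM update is:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \, \nabla_{\boldsymbol{\theta}} L_{total}(\boldsymbol{\theta}) \big|_{\boldsymbol{\theta} + \boldsymbol{\delta}^*} \]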
This process guides the model to areas of the parameter space where small perturbations—such as learning new tasks—won’t drastically impact performance.
ER-SAM becomes a flexible component that can be added to existing replay-based methods to enhance their robustness and generalization.
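In code, an ER-SAM step takes two forward-backward passes: the first finds \( \delta^* \), the second computes the gradient at the perturbed weights. A minimal sketch, assuming a combined_er_loss(model, batch, buffer) helper that returns the summed current-plus-replay loss from the ER example:

```python
def er_sam_step(model, base_optimizer, current_batch, buffer, rho=0.05):
    """One ER-SAM update in two passes: pass 1 finds the worst-case
    perturbation delta*, pass 2 takes the gradient at theta + delta*."""
    # Pass 1: gradient of the combined ER loss at the current weights
    base_optimizer.zero_grad()
    combined_er_loss(model, current_batch, buffer).backward()

    # Climb to theta + delta*: scale each gradient to the rho-ball surface
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(2) for p in model.parameters() if p.grad is not None]), 2)
    deltas = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            d = rho * p.grad / (grad_norm + 1e-12)
            p.add_(d)
            deltas.append((p, d))

    # Pass 2: gradient at the perturbed weights, then undo the perturbation
    base_optimizer.zero_grad()
    combined_er_loss(model, current_batch, buffer).backward()
    with torch.no_grad():
        for p, d in deltas:
            p.sub_(d)

    base_optimizer.step()  # descend using the sharpness-aware gradient
```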
Step 2: Resolving Conflicts with MGSER-SAM
Applying SAM to continual learning introduces a unique complication. The total loss combines two sources:
- \( \mathcal{L}_t \): Loss from the current task.
- \( \mathcal{L}_s \): Loss from past tasks stored in memory.
If the optimization directions for these two losses conflict, SAM’s perturbation can become unstable—improving one task while harming another. To overcome this, the authors propose Memory-Guided Soft Experience Replay with Sharpness-Aware Optimization (MGSER-SAM).
MGSER-SAM introduces two clever regularization concepts:
1. Soft Logits for Deeper Memory
Instead of storing just images and their labels, MGSER-SAM also saves the logits—the raw outputs before the softmax layer—for each memory sample. These logits capture richer knowledge about the model’s uncertainty and internal representation.
When replaying a memory sample \((\mathbf{x}', \mathbf{z}')\), the model minimizes not only the standard loss but also the difference between current and stored logits:
\[ \hat{\mathcal{L}}_s = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{B}}[l(f_{\theta}(\mathbf{x}),y)] + \mathbb{E}_{(\mathbf{x}',\mathbf{z}')\sim\mathcal{B}}[\|h_{\theta}(\mathbf{x}') - \mathbf{z}'\|_2] \]
The inclusion of soft logits enables deeper preservation of learned patterns by matching the old model’s internal representations.
This distillation process allows the new model to reproduce similar reasoning patterns for old tasks, improving knowledge retention without rigidly fixing weights.
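A sketch of the soft replay loss in code. Here the buffer is assumed to store (x, y, z) triples, where z is the logit vector recorded when the sample entered memory; for simplicity, the paper's two expectations are computed over a single replay batch:

```python
def soft_replay_loss(model, mem_batch):
    """Replay loss with soft logits: cross-entropy against stored labels,
    plus an L2 term pulling current logits toward the stored ones."""
    x_mem, y_mem, z_mem = mem_batch       # images, labels, stored logits
    logits = model(x_mem)                 # h_theta(x'): current raw outputs
    ce_term = F.cross_entropy(logits, y_mem)
    logit_term = torch.norm(logits - z_mem, p=2, dim=1).mean()
    return ce_term + logit_term
```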
2. Memory-Guided Gradient Alignment
MGSER-SAM refines the SAM update step itself. While SAM is designed for flatness, the memory component explicitly protects prior knowledge. To combine both objectives, MGSER-SAM merges two gradients:
- The SAM gradient, computed on the total loss (\(\mathcal{L}_t + \hat{\mathcal{L}}_s\)) at the perturbed weight point \( \theta + \delta^* \).
- The memory-guided gradient, calculated only on the memory component at the original weight point \( \theta \).

The dual-gradient update balances learning new information with maintaining past knowledge, resolving internal conflicts.
This dual-gradient design allows the model to explore flat minima for generalization while receiving stable guidance from its memory buffer—achieving harmony between learning new tasks and remembering old ones.
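Below is a minimal sketch of the full dual-gradient step under stated assumptions: the two gradient signals are merged with a plain sum (the paper's exact weighting may differ), and soft_replay_loss is the helper from the previous section:

```python
def mgser_sam_step(model, base_optimizer, current_batch, mem_batch, rho=0.05):
    """Dual-gradient sketch: the SAM gradient of the total loss at
    theta + delta*, merged with a memory-guided gradient of the replay
    loss taken at the original theta."""
    x, y = current_batch

    def total_loss():
        return F.cross_entropy(model(x), y) + soft_replay_loss(model, mem_batch)

    # 1) Memory-guided gradient at the ORIGINAL weights
    base_optimizer.zero_grad()
    soft_replay_loss(model, mem_batch).backward()
    mem_grads = [p.grad.clone() if p.grad is not None else None
                 for p in model.parameters()]

    # 2) SAM: find delta* from the total-loss gradient at theta
    base_optimizer.zero_grad()
    total_loss().backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(2) for p in model.parameters() if p.grad is not None]), 2)
    deltas = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            d = rho * p.grad / (grad_norm + 1e-12)
            p.add_(d)
            deltas.append((p, d))

    # 3) SAM gradient of the total loss at theta + delta*, then restore theta
    base_optimizer.zero_grad()
    total_loss().backward()
    with torch.no_grad():
        for p, d in deltas:
            p.sub_(d)

    # 4) Merge the two gradient signals and descend
    with torch.no_grad():
        for p, g_mem in zip(model.parameters(), mem_grads):
            if p.grad is not None and g_mem is not None:
                p.grad.add_(g_mem)
    base_optimizer.step()
```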
Putting MGSER-SAM to the Test
To validate the method, the authors evaluated MGSER-SAM and its SAM-based variants across all three continual learning scenarios using several benchmark datasets.

Table 1: Benchmarks span Task-IL, Class-IL, and Domain-IL scenarios using datasets like S-MNIST, S-CIFAR10, and P-MNIST.
Across these tasks, MGSER-SAM consistently achieved the highest accuracy and the lowest forgetting scores compared to leading baselines such as ER and DER++.

Table 2: MGSER-SAM outperforms previous methods across memory sizes and learning settings, achieving major gains in Class-IL accuracy.
Highlights from the results:
- Huge Accuracy Gains: On S-CIFAR10 (Class-IL), MGSER-SAM achieved 78.51%, outperforming ER by 24.4%.
- Better Generalization Across Scenarios: Gains are consistent for Task-IL, Domain-IL, and Class-IL.
- Gradient Conflict Resolution Matters: Simply adding SAM improves results (DER++-SAM), but the full MGSER-SAM formulation achieves the most robust balance.
Visual Insights into Performance

Figure 2: MGSER-SAM achieves top accuracy (ACC) and lowest forgetting across major benchmarks.

Figure 3: As tasks accumulate, MGSER-SAM retains much higher first-task accuracy, showing its resistance to forgetting.

Figure 4: Average accuracy over time remains consistently superior with MGSER-SAM on both P-MNIST and S-TinyImageNet.

Figure 5: MGSER-SAM scales gracefully with memory size, achieving dominant performance even as buffer capacity grows.
Conclusion: Towards Lifelong Learning Machines
Catastrophic forgetting remains one of the hardest challenges in artificial intelligence. The MGSER-SAM framework presents a powerful step forward by rethinking how memory replay interacts with optimization.
Key takeaways:
- Generalization through Flat Minima: Forgetting is mitigated when models learn stable, flat solutions via SAM’s sharpness-aware optimization.
- Memory Guidance for Stability: Continual learning requires not only flatness but direction, and MGSER-SAM achieves that by aligning gradients from memory with those from new tasks.
- Dual-Gradient Harmony: The synergy of SAM’s generalization power and soft-logit regularization makes MGSER-SAM a versatile, scalable approach for lifelong learning.
By combining memory replay with geometric and gradient-aware optimization, MGSER-SAM sets a new standard for building AI systems that can learn continuously, adapt flexibly, and remember intelligently—moving one step closer to the reality of lifelong learning.