Humans are remarkable lifelong learners. From birth, we continuously acquire new skills—walking, speaking, riding a bike, learning a new language—without abruptly forgetting what we learned before. This seamless ability to learn from a never-ending stream of dynamic experiences is something we take for granted.
For artificial neural networks, however, this is enormously difficult. Deep Neural Networks (DNNs) are powerful, but they suffer from a well-known flaw called catastrophic forgetting. When a network learns a new task, it tends to overwrite knowledge from previous tasks. Imagine a model that learns to recognize cats, then learns dogs, and suddenly forgets what a cat looks like. This phenomenon limits DNNs’ ability to perform continual learning (CL)—learning tasks sequentially from changing data streams.
At the boundary between tasks, say moving from recognizing digits to identifying fashion items, a network’s internal representations can drift so abruptly that previously learned knowledge gets erased.
To solve this, a team of researchers looked to the most successful lifelong learner of all: the human brain. In their paper, Error Sensitivity Modulation-based Experience Replay (ESMER), they propose a model inspired by neuroscience research showing that the brain learns more from small, consistent errors than from large, sudden ones. ESMER uses this principle to create a continual learning framework that not only slows catastrophic forgetting but also handles noisy, imperfect data with surprising resilience.
Let’s step through how it works.
Why Neural Networks Forget
To understand ESMER, let’s first unpack why continual learning is difficult for neural networks.
DNNs are typically trained in a batch learning regime—seeing all data from all categories at once. They learn by adjusting weights to reduce a loss function that measures prediction errors. In continual learning, data arrives task by task. When a model starts a new task, the new samples belong to unseen classes, and the model’s error spikes. These large errors dominate gradient updates, forcing abrupt weight changes that distort the network’s existing representations. This representation drift leads to forgetting older tasks.
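To make the failure mode concrete, here is a toy sketch (hypothetical synthetic data, not the paper's benchmarks) of one classifier trained on two tasks in sequence with no replay; once the second task's large errors start driving the updates, accuracy on the first task collapses:

```python
import torch
import torch.nn as nn

# Toy illustration with synthetic Gaussian-blob classes; the setup is made up
# and only meant to show the forgetting effect.
torch.manual_seed(0)
centers = torch.randn(4, 20) * 3                 # one fixed center per class
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def make_task(class_ids, n_per_class=200):
    y = torch.tensor(class_ids).repeat_interleave(n_per_class)
    x = centers[y] + torch.randn(len(y), 20)     # noisy samples around each center
    return x, y

def task1_accuracy(x1, y1):
    with torch.no_grad():
        return (model(x1).argmax(dim=1) == y1).float().mean().item()

task1, task2 = make_task([0, 1]), make_task([2, 3])
for name, (x, y) in [("task 1", task1), ("task 2", task2)]:
    for _ in range(200):                         # plain SGD on the current task only
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    print(f"after {name}: accuracy on task 1 = {task1_accuracy(*task1):.2f}")
```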
A popular mitigation strategy is rehearsal: methods such as Experience Replay (ER) store a small subset of samples from previous tasks (an episodic memory) and replay them during training. Interleaving past samples with new ones reminds the model of prior knowledge and reduces drift.
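The generic ER recipe looks roughly like the sketch below (a simplified reservoir buffer, not the authors' code); the key point is that samples enter the buffer uniformly at random, regardless of how reliable they are:

```python
import random

class EpisodicMemory:
    """Fixed-size replay buffer filled by standard reservoir sampling."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []        # (x, y) pairs kept from the stream so far
        self.n_seen = 0         # total number of stream samples observed

    def add(self, x, y):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))
        else:
            j = random.randrange(self.n_seen)
            if j < self.capacity:            # each sample survives with prob capacity / n_seen
                self.buffer[j] = (x, y)

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

# During training, each batch of new samples is mixed with a small batch drawn
# from the buffer, and the loss is computed over both.
```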
However, when the buffer is small, the flood of new, high-error samples outweighs the influence of replayed ones. The model still experiences disruptive updates early in each new task and may never recover. This reveals a deeper flaw: the learning mechanism itself treats all errors equally.
Learning Like the Brain: The Core Idea Behind ESMER
Neuroscience suggests that human learning obeys a different rule. The brain does not scale learning linearly with error size. Instead, it down-weights large errors—often interpreting them as noise or unexpected context—and learns more from small, consistent mistakes that reflect meaningful patterns.
This mechanism requires maintaining a kind of “memory of errors” to recognize what counts as consistent versus surprising. ESMER implements this idea in a dual-memory architecture: a fast-learning working model and a slow-consolidating stable model that interact with an episodic memory and a memory of errors.

Figure 1: ESMER’s architecture includes a fast-learning “working model,” a slow-consolidating “stable model,” a replay buffer (“episodic memory”), and a novel “memory of errors” that modulates learning intensity.
Component 1: Modulating Error Sensitivity
At the heart of ESMER is error sensitivity modulation, which ensures the system learns more from small errors than from large, abrupt ones. When new data arrives, ESMER evaluates each sample's loss with the stable model rather than the working model; because the stable model changes slowly, its loss estimates are not skewed by the working model's most recent updates and provide a consistent frame of reference for judging how surprising a sample is.

Each sample's loss \(l_s^i\) is compared against the error memory, a running, momentum-based average of recent losses \( \mu_{\ell} \), and used to compute a per-sample weight \( \lambda^i \). If a sample's loss falls below a threshold set by a margin \( \beta \) around the running average, it is considered "low-loss" and the model learns from it at full strength; high-loss samples receive a reduced weight so they cannot dominate the gradients. As a result, when new, unfamiliar classes arrive, the model adapts gradually instead of being jolted by large updates.
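In code, this weighting rule might look like the following sketch; the behavior matches the description above, but the exact formula and hyperparameters in the paper may differ:

```python
import torch

def error_sensitive_weights(losses_stable, mu_l, beta):
    """Per-sample weights from stable-model losses (illustrative, not the paper's exact rule).

    losses_stable: per-sample losses computed with the stable model
    mu_l:          running average of recent losses (the "memory of errors")
    beta:          margin defining what still counts as a low loss
    """
    threshold = beta * mu_l
    # Low-loss samples keep full weight; high-loss samples are scaled down in
    # proportion to how far they exceed the running average.
    return torch.where(losses_stable <= threshold,
                       torch.ones_like(losses_stable),
                       mu_l / losses_stable)

# The working model is then trained on the weighted per-sample loss, e.g.
# loss = (error_sensitive_weights(l_s, mu_l, beta) * l_working).mean()
```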
To keep the error memory itself stable, ESMER filters out outlier losses before updating \( \mu_{\ell} \) and briefly pauses the update at task transitions. This preserves a reliable notion of what counts as a "normal" error and encourages steady adaptation instead of abrupt correction.
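A sketch of that bookkeeping, with illustrative hyperparameter names and values rather than the paper's exact settings:

```python
def update_error_memory(mu_l, batch_losses, momentum=0.99, outlier_factor=2.0,
                        at_task_boundary=False):
    """Momentum update of the running loss average, skipping likely outliers.

    batch_losses is a 1-D tensor of per-sample losses from the stable model;
    the momentum, outlier threshold, and pausing rule are placeholders.
    """
    if at_task_boundary:
        # Pause briefly after a task switch so the sudden loss spike does not
        # contaminate the "memory of errors".
        return mu_l
    kept = batch_losses[batch_losses <= outlier_factor * mu_l]   # drop outliers
    if len(kept) == 0:
        return mu_l
    return momentum * mu_l + (1 - momentum) * kept.mean().item()
```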
Component 2: A Dual-Memory System for Balanced Learning
ESMER also employs two complementary memory systems to mimic the brain’s fast and slow learning pathways.
The Stable Model (Semantic Memory)
The stable model represents long-term, semantic memory. It slowly integrates knowledge from the working model by taking an exponential moving average of the working model's weights. This gradual update prevents instability and lets older information consolidate over time. At inference, the stable model is the one used, because its representations remain grounded across tasks and generalize more effectively.
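A minimal sketch of this update in PyTorch (the decay value is illustrative; the paper's exact schedule may differ):

```python
import torch

@torch.no_grad()
def update_stable_model(stable_model, working_model, alpha=0.999):
    # theta_stable <- alpha * theta_stable + (1 - alpha) * theta_working,
    # assuming both models share the same architecture and parameter order.
    for p_s, p_w in zip(stable_model.parameters(), working_model.parameters()):
        p_s.mul_(alpha).add_(p_w, alpha=1.0 - alpha)
```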
Error-Sensitive Reservoir Sampling (Episodic Memory)
The episodic memory stores samples from past tasks for replay, but ESMER is selective about how it is populated: before performing reservoir sampling, it admits only low-loss samples as candidates. This makes the buffer more representative of well-learned, consistent examples while filtering out noise and anomalies.
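Reusing the reservoir buffer from the earlier sketch, the candidate filter might look like this (the precise low-loss threshold is an assumption):

```python
def add_low_loss_candidates(memory, batch, losses_stable, mu_l):
    """Offer only low-loss samples to the reservoir (illustrative threshold)."""
    for (x, y), loss in zip(batch, losses_stable):
        if loss.item() <= mu_l:        # candidate filter based on the stable model's loss
            memory.add(x, y)           # standard reservoir insertion as before
```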
During replay, the working model computes a loss on memory samples that combines the usual classification loss with a semantic consistency loss encouraging its outputs to stay aligned with the stable model's. This coupling reduces drift between old and new representations, letting the system learn new tasks while preserving old ones.
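A sketch of the combined replay objective; the consistency term is shown here as a mean-squared error on logits, and the exact formulation and weighting in the paper may differ:

```python
import torch
import torch.nn.functional as F

def replay_loss(working_model, stable_model, x_mem, y_mem, gamma=1.0):
    """Classification loss on memory samples plus a working/stable consistency term."""
    logits_w = working_model(x_mem)
    with torch.no_grad():                        # the stable model only provides targets
        logits_s = stable_model(x_mem)
    ce = F.cross_entropy(logits_w, y_mem)
    consistency = F.mse_loss(logits_w, logits_s) # keep working-model outputs aligned
    return ce + gamma * consistency              # gamma is an illustrative trade-off weight
```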
Results: ESMER in Action
The researchers validated ESMER across multiple continual learning setups:
- Class-Incremental Learning (Class-IL): Each task introduces new classes, and the model must distinguish among all classes seen so far.
- Generalized Class-Incremental Learning (GCIL): Tasks have varying class counts and imbalance—closer to real-world conditions.
- Noisy-Class-IL: Adds label noise to test robustness.

Table 1: Across datasets like Seq-CIFAR10, Seq-CIFAR100, and GCIL, ESMER achieves the highest accuracy. Gains are strongest in the most memory-constrained settings.
ESMER consistently outperforms baselines, including other dual-memory approaches. On more complex datasets such as Seq-CIFAR100, it delivers its largest gains under tight memory budgets, exactly the regime where catastrophic forgetting is usually most severe.

Table 2: ESMER maintains superior performance even with extremely limited memory and long task sequences.
It also proves remarkably robust to noisy labels:

Table 3: With 50% label noise, ESMER achieves over 116% higher accuracy than the baseline ER method—learning effectively from noisy streams.
Learning from low-loss samples naturally filters out incorrect labels, while error-sensitive sampling keeps the replay buffer clean.
Why ESMER Works So Well
The authors conducted analytical studies revealing why ESMER excels.
1. Complementary Components
An ablation study confirmed that each part—error modulation, stable model, and sampling—provides critical benefit. Remove any one and performance drops notably.

Table 4: Each core component contributes to ESMER’s performance. Together, they maximize accuracy and resilience.
2. Mitigating Representation Drift
Tracking performance on Task 1 while learning Task 2 shows how ESMER avoids abrupt degradation:

Figure 2: (a) Compared to standard ER, ESMER’s working model (WM) experiences minimal drift and quickly recovers, while its stable model (SM) remains steady. (b) ESMER also reduces recency bias, maintaining balanced predictions across tasks.
3. Robustness to Noisy Labels
When trained with corrupted data, ESMER memorizes far fewer noisy labels than ER, and its replay buffer contains a far higher proportion of correctly labeled samples.

Figure 3: (a) ESMER memorizes far fewer noisy labels than ER during training. (b) Its error-sensitive sampling yields a buffer with far fewer mislabeled samples, which is critical for replay stability.
4. Balanced Stability and Plasticity
Finally, across task sequences, ESMER shows an ideal balance between remembering old tasks (stability) and learning new ones (plasticity).

Figure 4: After multiple tasks, ESMER retains more accuracy across earlier tasks than competing methods, achieving a strong stability–plasticity trade-off.
Conclusion
ESMER provides a powerful, biologically inspired solution to catastrophic forgetting. By borrowing a key insight from human error-based learning—that we learn most effectively from small, consistent mistakes—it enables neural networks to:
- Adapt smoothly to new tasks without overwriting prior knowledge,
- Maintain robust performance under label noise and data imbalance, and
- Reduce recency bias for more uniform learning across tasks.
Its dual-memory design and error-sensitive modulation bring us closer to brain-like learning dynamics in artificial systems. ESMER shows that progress in AI may depend less on “bigger data” or “larger models,” and more on learning smarter—listening to the quiet, consistent signals rather than the loud, abrupt ones.