Humans are remarkable lifelong learners. From birth, we continuously acquire new skills—walking, speaking, riding a bike, learning a new language—without abruptly forgetting what we learned before. This seamless ability to learn from a never-ending stream of dynamic experiences is something we take for granted.
For artificial neural networks, however, this is enormously difficult. Deep Neural Networks (DNNs) are powerful, but they suffer from a well-known flaw called catastrophic forgetting. When a network learns a new task, it tends to overwrite knowledge from previous tasks. Imagine a model that learns to recognize cats, then learns dogs, and suddenly forgets what a cat looks like. This phenomenon limits DNNs’ ability to perform continual learning (CL)—learning tasks sequentially from changing data streams.
At the boundary between tasks, say moving from recognizing digits to identifying fashion items, a network’s internal representations can drift so abruptly that previously learned knowledge gets erased.
To solve this, a team of researchers looked to the most successful lifelong learner of all: the human brain. In their paper, Error Sensitivity Modulation-based Experience Replay (ESMER), they propose a model inspired by neuroscience research showing that the brain learns more from small, consistent errors than from large, sudden ones. ESMER uses this principle to create a continual learning framework that not only slows catastrophic forgetting but also handles noisy, imperfect data with surprising resilience.
Let’s step through how it works.
Why Neural Networks Forget
To understand ESMER, let’s first unpack why continual learning is difficult for neural networks.
DNNs are typically trained in a batch learning regime—seeing all data from all categories at once. They learn by adjusting weights to reduce a loss function that measures prediction errors. In continual learning, data arrives task by task. When a model starts a new task, the new samples belong to unseen classes, and the model’s error spikes. These large errors dominate gradient updates, forcing abrupt weight changes that distort the network’s existing representations. This representation drift leads to forgetting older tasks.
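To make the failure mode concrete, here is a toy sketch (hypothetical synthetic data, not the paper's benchmarks) of one classifier trained on two tasks in sequence with no replay; once the second task's large errors start driving the updates, accuracy on the first task collapses:

```python
import torch
import torch.nn as nn

# Toy illustration with synthetic Gaussian-blob classes; the setup is made up
# and only meant to show the forgetting effect.
torch.manual_seed(0)
centers = torch.randn(4, 20) * 3                 # one fixed center per class
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def make_task(class_ids, n_per_class=200):
    y = torch.tensor(class_ids).repeat_interleave(n_per_class)
    x = centers[y] + torch.randn(len(y), 20)     # noisy samples around each center
    return x, y

def task1_accuracy(x1, y1):
    with torch.no_grad():
        return (model(x1).argmax(dim=1) == y1).float().mean().item()

task1, task2 = make_task([0, 1]), make_task([2, 3])
for name, (x, y) in [("task 1", task1), ("task 2", task2)]:
    for _ in range(200):                         # plain SGD on the current task only
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    print(f"after {name}: accuracy on task 1 = {task1_accuracy(*task1):.2f}")
```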
A popular mitigation strategy is rehearsal: methods such as Experience Replay (ER) store a small subset of samples from previous tasks (an episodic memory) and replay them during training. Interleaving past samples with new ones reminds the model of prior knowledge and reduces drift.
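The generic ER recipe looks roughly like the sketch below (a simplified reservoir buffer, not the authors' code); the key point is that samples enter the buffer uniformly at random, regardless of how reliable they are:

```python
import random

class EpisodicMemory:
    """Fixed-size replay buffer filled by standard reservoir sampling."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []        # (x, y) pairs kept from the stream so far
        self.n_seen = 0         # total number of stream samples observed

    def add(self, x, y):
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))
        else:
            j = random.randrange(self.n_seen)
            if j < self.capacity:            # each sample survives with prob capacity / n_seen
                self.buffer[j] = (x, y)

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

# During training, each batch of new samples is mixed with a small batch drawn
# from the buffer, and the loss is computed over both.
```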
However, when the buffer is small, the flood of new, high-error samples outweighs the influence of replayed ones. The model still experiences disruptive updates early in each new task and may never recover. This reveals a deeper flaw: the learning mechanism itself treats all errors equally.
Learning Like the Brain: The Core Idea Behind ESMER
Neuroscience suggests that human learning obeys a different rule. The brain does not scale learning linearly with error size. Instead, it down-weights large errors—often interpreting them as noise or unexpected context—and learns more from small, consistent mistakes that reflect meaningful patterns.
This mechanism requires maintaining a kind of “memory of errors” to recognize what counts as consistent versus surprising. ESMER implements this idea in a dual-memory architecture: a fast-learning working model and a slow-consolidating stable model that interact with an episodic memory and a memory of errors.

Figure 1: ESMER’s architecture includes a fast-learning “working model,” a slow-consolidating “stable model,” a replay buffer (“episodic memory”), and a novel “memory of errors” that modulates learning intensity.
Component 1: Modulating Error Sensitivity
At the heart of ESMER is error sensitivity modulation, which ensures the system learns more from small errors than from large, abrupt ones. When new data arrives, ESMER evaluates each sample's loss with the stable model rather than the working model; because the stable model changes slowly, its loss estimates are not skewed by the working model's most recent updates and provide a consistent frame of reference for judging how surprising a sample is.

Each sample's loss \(l_s^i\) is compared against the error memory, a running, momentum-based average of recent losses \( \mu_{\ell} \), and used to compute a per-sample weight \( \lambda^i \). If a sample's loss falls below a threshold set by a margin \( \beta \) around the running average, it is considered "low-loss" and the model learns from it at full strength; high-loss samples receive a reduced weight so they cannot dominate the gradients. As a result, when new, unfamiliar classes arrive, the model adapts gradually instead of being jolted by large updates.
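In code, this weighting rule might look like the following sketch; the behavior matches the description above, but the exact formula and hyperparameters in the paper may differ:

```python
import torch

def error_sensitive_weights(losses_stable, mu_l, beta):
    """Per-sample weights from stable-model losses (illustrative, not the paper's exact rule).

    losses_stable: per-sample losses computed with the stable model
    mu_l:          running average of recent losses (the "memory of errors")
    beta:          margin defining what still counts as a low loss
    """
    threshold = beta * mu_l
    # Low-loss samples keep full weight; high-loss samples are scaled down in
    # proportion to how far they exceed the running average.
    return torch.where(losses_stable <= threshold,
                       torch.ones_like(losses_stable),
                       mu_l / losses_stable)

# The working model is then trained on the weighted per-sample loss, e.g.
# loss = (error_sensitive_weights(l_s, mu_l, beta) * l_working).mean()
```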
To keep the error memory itself stable, ESMER filters out outlier losses before updating \( \mu_{\ell} \) and briefly pauses the update at task transitions. This preserves a reliable notion of what counts as a "normal" error and encourages steady adaptation instead of abrupt correction.
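A sketch of that bookkeeping, with illustrative hyperparameter names and values rather than the paper's exact settings:

```python
def update_error_memory(mu_l, batch_losses, momentum=0.99, outlier_factor=2.0,
                        at_task_boundary=False):
    """Momentum update of the running loss average, skipping likely outliers.

    batch_losses is a 1-D tensor of per-sample losses from the stable model;
    the momentum, outlier threshold, and pausing rule are placeholders.
    """
    if at_task_boundary:
        # Pause briefly after a task switch so the sudden loss spike does not
        # contaminate the "memory of errors".
        return mu_l
    kept = batch_losses[batch_losses <= outlier_factor * mu_l]   # drop outliers
    if len(kept) == 0:
        return mu_l
    return momentum * mu_l + (1 - momentum) * kept.mean().item()
```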
Component 2: A Dual-Memory System for Balanced Learning
ESMER also employs two complementary memory systems to mimic the brain’s fast and slow learning pathways.
The Stable Model (Semantic Memory)
The stable model represents long-term, semantic memory. It slowly integrates knowledge from the working model by taking an exponential moving average of the working model's weights. This gradual update prevents instability and lets older information consolidate over time. At inference, the stable model is the one used, because its representations remain grounded across tasks and generalize more effectively.
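A minimal sketch of this update in PyTorch (the decay value is illustrative; the paper's exact schedule may differ):

```python
import torch

@torch.no_grad()
def update_stable_model(stable_model, working_model, alpha=0.999):
    # theta_stable <- alpha * theta_stable + (1 - alpha) * theta_working,
    # assuming both models share the same architecture and parameter order.
    for p_s, p_w in zip(stable_model.parameters(), working_model.parameters()):
        p_s.mul_(alpha).add_(p_w, alpha=1.0 - alpha)
```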
Error-Sensitive Reservoir Sampling (Episodic Memory)
The episodic memory stores samples from past tasks for replay, but ESMER is selective about how it is populated: before performing reservoir sampling, it admits only low-loss samples as candidates. This makes the buffer more representative of well-learned, consistent examples while filtering out noise and anomalies.
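Reusing the reservoir buffer from the earlier sketch, the candidate filter might look like this (the precise low-loss threshold is an assumption):

```python
def add_low_loss_candidates(memory, batch, losses_stable, mu_l):
    """Offer only low-loss samples to the reservoir (illustrative threshold)."""
    for (x, y), loss in zip(batch, losses_stable):
        if loss.item() <= mu_l:        # candidate filter based on the stable model's loss
            memory.add(x, y)           # standard reservoir insertion as before
```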
During replay, the working model computes a loss on memory samples that combines the usual classification loss with a semantic consistency loss encouraging its outputs to stay aligned with the stable model's. This coupling reduces drift between old and new representations, letting the system learn new tasks while preserving old ones.
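A sketch of the combined replay objective; the consistency term is shown here as a mean-squared error on logits, and the exact formulation and weighting in the paper may differ:

```python
import torch
import torch.nn.functional as F

def replay_loss(working_model, stable_model, x_mem, y_mem, gamma=1.0):
    """Classification loss on memory samples plus a working/stable consistency term."""
    logits_w = working_model(x_mem)
    with torch.no_grad():                        # the stable model only provides targets
        logits_s = stable_model(x_mem)
    ce = F.cross_entropy(logits_w, y_mem)
    consistency = F.mse_loss(logits_w, logits_s) # keep working-model outputs aligned
    return ce + gamma * consistency              # gamma is an illustrative trade-off weight
```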
Results: ESMER in Action
The researchers validated ESMER across multiple continual learning setups:
- Class-Incremental Learning (Class-IL): Each task introduces new classes, and the model must distinguish among all classes seen so far.
- Generalized Class-Incremental Learning (GCIL): Tasks have varying class counts and imbalance—closer to real-world conditions.
- Noisy-Class-IL: Adds label noise to test robustness.

Table 1: Across datasets like Seq-CIFAR10, Seq-CIFAR100, and GCIL, ESMER achieves the highest accuracy. Gains are strongest in the most memory-constrained settings.
ESMER consistently outperforms baselines, including other dual-memory approaches. On more complex datasets such as Seq-CIFAR100, it delivers its largest gains under tight memory budgets, exactly the regime where catastrophic forgetting is usually most severe.

Table 2: ESMER maintains superior performance even with extremely limited memory and long task sequences.
It also proves remarkably robust to noisy labels:

Table 3: With 50% label noise, ESMER achieves over 116% higher accuracy than the baseline ER method—learning effectively from noisy streams.
Learning from low-loss samples naturally filters out incorrect labels, while error-sensitive sampling keeps the replay buffer clean.
Why ESMER Works So Well
The authors conducted analytical studies revealing why ESMER excels.
1. Complementary Components
An ablation study confirmed that each part—error modulation, stable model, and sampling—provides critical benefit. Remove any one and performance drops notably.

Table 4: Each core component contributes to ESMER’s performance. Together, they maximize accuracy and resilience.
2. Mitigating Representation Drift
Tracking performance on Task 1 while learning Task 2 shows how ESMER avoids abrupt degradation:

Figure 2: (a) Compared to standard ER, ESMER’s working model (WM) experiences minimal drift and quickly recovers, while its stable model (SM) remains steady. (b) ESMER also reduces recency bias, maintaining balanced predictions across tasks.
3. Robustness to Noisy Labels
When trained with corrupted data, ESMER memorizes far fewer noisy labels than ER, and its replay buffer contains a far higher proportion of correctly labeled samples.

Figure 3: (a) ESMER memorizes far fewer noisy labels than ER during training. (b) Its error-sensitive sampling yields a buffer with far fewer mislabeled samples, which is critical for replay stability.
4. Balanced Stability and Plasticity
Finally, across task sequences, ESMER shows an ideal balance between remembering old tasks (stability) and learning new ones (plasticity).

Figure 4: After multiple tasks, ESMER retains more accuracy across earlier tasks than competing methods, achieving a strong stability–plasticity trade-off.
Conclusion
ESMER provides a powerful, biologically inspired solution to catastrophic forgetting. By borrowing a key insight from human error-based learning—that we learn most effectively from small, consistent mistakes—it enables neural networks to:
- Adapt smoothly to new tasks without overwriting prior knowledge,
- Maintain robust performance under label noise and data imbalance, and
- Reduce recency bias for more uniform learning across tasks.
Its dual-memory design and error-sensitive modulation bring us closer to brain-like learning dynamics in artificial systems. ESMER shows that progress in AI may depend less on “bigger data” or “larger models,” and more on learning smarter—listening to the quiet, consistent signals rather than the loud, abrupt ones.