Imagine you are learning to play the piano. You spend months mastering classical music. Then, you decide to learn jazz. As you immerse yourself in jazz chords and improvisation, you suddenly realize you’re struggling to remember the classical pieces you once played perfectly.

In the world of Artificial Intelligence, this phenomenon is known as Catastrophic Forgetting. When a neural network learns a new task, it tends to overwrite the parameters it optimized for previous tasks.

This is the core problem addressed in the paper “Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning.” The researchers tackle a specific, challenging flavor of this problem called Class-Incremental Learning (CIL), where a model must learn new classes sequentially without accessing data from previous tasks—and crucially, without knowing which task an image belongs to during testing.

In this post, we will break down their novel solution, which involves diagnosing the root cause of forgetting—termed “Semantic Drift”—and fixing it by surgically calibrating the statistical properties of the model’s feature space.

The Problem: Plasticity vs. Stability

The central dilemma in continual learning is the Stability-Plasticity Dilemma:

  • Plasticity: The ability to learn new things (Jazz).
  • Stability: The ability to remember old things (Classical).

Modern approaches often use large pre-trained models (like Vision Transformers or ViTs) and fine-tune them using parameter-efficient methods like LoRA (Low-Rank Adaptation). While this helps, the researchers discovered that even with LoRA, the model’s internal representation of old classes “drifts” significantly as it learns new ones.
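To make the LoRA idea concrete, here is a minimal sketch of a low-rank adapter wrapped around a frozen linear layer, in the spirit of how LoRA is typically attached to a ViT's projection layers. The class name, rank, and scaling below are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pre-trained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only `A` and `B` receive gradients, so the pre-trained backbone itself is never overwritten; this is exactly why LoRA is attractive for continual learning, even though, as the authors show, it does not prevent drift on its own.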

What is Semantic Drift?

To understand the solution, we must first visualize the problem. When a model updates its weights to learn new classes, the feature embeddings (the model's mathematical representations of images) of old classes shift.

The authors define this movement as Semantic Drift. They break it down into two specific statistical shifts:

  1. Mean Shift: The center of the data cluster moves.
  2. Covariance Shift: The shape and spread of the data cluster change.

Figure 1 illustrating Semantic Drift and Calibration. Part (a) shows the drift of class distribution from Task t-1 to Task t. Part (b) shows how the proposed method compensates for mean shift and calibrates covariance.

As shown in Figure 1(a) above, the gray dots represent the distribution of a class in the previous task (\(t-1\)). The blue dots represent where that same class sits in the feature space after the model updates for the current task (\(t\)). The distribution has moved (drifted) and changed shape. This confuses the classifier, leading to forgetting.

The authors’ solution, illustrated in Figure 1(b), is to mathematically force the new distribution to align with the old one using Mean Shift Compensation and Covariance Calibration.
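To make "drift" tangible, a quick diagnostic (my own illustration, not code from the paper) is to embed the same images with the previous and the updated backbone and compare their first- and second-order statistics:

```python
import numpy as np

def measure_drift(feats_old: np.ndarray, feats_new: np.ndarray):
    """feats_*: (N, D) embeddings of the SAME samples under the old / updated model."""
    mean_shift = np.linalg.norm(feats_new.mean(0) - feats_old.mean(0))            # first-order drift
    cov_shift = np.linalg.norm(np.cov(feats_new.T) - np.cov(feats_old.T), "fro")  # second-order drift
    return mean_shift, cov_shift
```

The two numbers correspond directly to the Mean Shift and Covariance Shift in Figure 1(a).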

The Architecture: A Bird’s Eye View

Before diving into the math, let’s look at the overall framework. The system uses a pre-trained Vision Transformer (ViT) backbone. To allow the model to learn without destroying pre-trained knowledge, they use LoRA (Low-Rank Adaptation).

Figure 2. Illustration of the method at task t. It shows the flow from input to feature extraction, the use of LoRA modules, and the three key loss components: Classification, Covariance, and Distillation.

As visualized in Figure 2, the process involves a frozen backbone with learnable LoRA modules. The training objective is a combination of three loss functions:

  1. Classification Loss (\(\mathcal{L}_{cls}\)): To learn the current task.
  2. Covariance Calibration Loss (\(\mathcal{L}_{cov}\)): To maintain the shape of feature distributions.
  3. Distillation Loss (\(\mathcal{L}_{distill}\)): To preserve feature knowledge using patch tokens.

The total optimization goal is represented by this equation:

Equation for the total loss function, combining classification, covariance, and distillation losses.
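The precise weighting is defined in the paper; a plausible reading of the objective, with \(\lambda\) terms as trade-off hyperparameters (notation mine), is:

\[
\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_{cov}\,\mathcal{L}_{cov} + \lambda_{distill}\,\mathcal{L}_{distill}
\]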

Let’s break down exactly how they calculate these components to stop the drift.

Core Method: Stopping the Drift

1. Mean Shift Compensation (MSC)

The first step is addressing the movement of the class centers (means). Since the model cannot access old data, it can’t simply recalculate the mean of old classes using the new network. It has to estimate where the old class means would be in the current feature space.

First, let’s define the class mean. After learning a task, the mean (\(\mu\)) for a class \(c\) is the average of all sample embeddings for that class:

Equation 3 defining the class mean calculation.
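In standard notation (a reconstruction from the description above, not the paper's exact typesetting), with \(f_{\theta}\) the feature extractor and \(D_c\) the training samples of class \(c\):

\[
\mu_c = \frac{1}{|D_c|} \sum_{x_i \in D_c} f_{\theta}(x_i)
\]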

When the model trains on a new task (\(t\)), the embeddings of the current task’s images change. The researchers assume that the shift observed in the current samples approximates the shift that the old, unseen samples would undergo.

They calculate the difference in embeddings for current images between the old network (frozen from task \(t-1\)) and the current network (being trained for task \(t\)):

Equation 7 defining the embedding shift for a single sample between the old and current model.
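For a current-task sample \(x_i\), this per-sample shift is simply the difference between its embeddings under the two networks (again, my reconstruction from the surrounding text):

\[
\Delta \phi_i = f_{\theta^{t}}(x_i) - f_{\theta^{t-1}}(x_i)
\]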

Using these sample-level shifts, they estimate the Class Mean Shift. However, not all samples are equally informative: samples that were close to the class center in the previous model are better indicators of the true shift than outliers. Therefore, they use a weighted average based on proximity to the previous mean:

Equation 8 showing the weighted average calculation for estimating the mean shift.

The weights (\(w_i\)) are determined by a Gaussian kernel, giving higher importance to samples closer to the class center:

Equation 9 defining the weight calculation based on distance from the class mean.
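Reading Equations 8 and 9 together, the estimate plausibly takes the familiar kernel-weighted-average form (notation mine; \(\sigma\) is a bandwidth hyperparameter):

\[
\hat{\Delta}\mu_c = \frac{\sum_i w_i \, \Delta\phi_i}{\sum_i w_i},
\qquad
w_i = \exp\!\left(-\frac{\lVert f_{\theta^{t-1}}(x_i) - \mu_c \rVert^2}{2\sigma^2}\right)
\]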

By adding this estimated shift (\(\hat{\Delta}\mu\)) to the stored old means, the model can predict where the old classes are currently located in the feature space without actually seeing the old images.
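Here is a compact sketch of the whole MSC step under those assumptions (variable names and the bandwidth are mine):

```python
import numpy as np

def compensate_means(old_means: dict, feats_old: np.ndarray, feats_new: np.ndarray,
                     sigma: float = 1.0) -> dict:
    """Estimate where stored class means have drifted to, using only current-task samples.

    feats_old / feats_new: (N, D) embeddings of current-task samples under the previous
    and the current network; old_means: {class_id: (D,) stored mean from task t-1}.
    """
    delta = feats_new - feats_old                      # per-sample drift (Eq. 7)
    compensated = {}
    for c, mu in old_means.items():
        d2 = ((feats_old - mu) ** 2).sum(axis=1)       # distance to the stored class mean
        w = np.exp(-d2 / (2 * sigma ** 2))             # Gaussian-kernel weights (Eq. 9)
        shift = (w[:, None] * delta).sum(0) / (w.sum() + 1e-8)  # weighted average shift (Eq. 8)
        compensated[c] = mu + shift                    # drift-compensated old-class mean
    return compensated
```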

2. Covariance Calibration (CC)

Correcting the mean is only half the battle. The shape of the distribution (its covariance) also becomes distorted. To fix this, the authors introduce a Covariance Calibration technique.

The goal is to align the covariance matrices of embeddings from the old and current networks. To do this efficiently, they utilize the Mahalanobis Distance. Unlike Euclidean distance, Mahalanobis distance accounts for the correlation between variables (the shape of the distribution).

The Mahalanobis distance between two vectors \(x\) and \(y\), given a covariance matrix \(\Sigma\), is defined as:

Equation 12 defining the Mahalanobis distance formula.
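For reference, the standard form is:

\[
d_{\mathcal{M}}(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}
\]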

The researchers compute the covariance matrix for each class using the old network (which represents “past knowledge”).

Equation 13 showing the calculation of the covariance matrix using the previous task’s network.
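In standard notation (my reconstruction), writing \(\phi_i^{t-1} = f_{\theta^{t-1}}(x_i)\):

\[
\Sigma_c = \frac{1}{|D_c|} \sum_{x_i \in D_c} \left(\phi_i^{t-1} - \mu_c\right)\left(\phi_i^{t-1} - \mu_c\right)^{\top}
\]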

Then, they formulate a loss function that forces the pairwise distances between embeddings in the current network to match the pairwise distances in the old network, specifically using the covariance structure of the old network.

Equation 10 detailing the Covariance Calibration loss function.

By minimizing this loss (\(\mathcal{L}_{cov}\)), the network is constrained to maintain the internal structure and shape of the feature clusters, effectively “locking” the second-order moments of the distribution against semantic drift.
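A minimal sketch of that idea, matching pairwise Mahalanobis distances computed under the old network's covariance, might look like this (a hypothetical PyTorch rendering, not the authors' code):

```python
import torch
import torch.nn.functional as F

def covariance_calibration_loss(z_new: torch.Tensor, z_old: torch.Tensor,
                                cov_inv: torch.Tensor) -> torch.Tensor:
    """Match pairwise Mahalanobis distances of a batch under the new vs. old network.

    z_new, z_old: (B, D) embeddings of the same batch from the current / previous network;
    cov_inv: (D, D) inverse of the class covariance estimated with the previous network.
    """
    def pairwise_mahalanobis(z):
        diff = z.unsqueeze(1) - z.unsqueeze(0)                  # (B, B, D) pairwise differences
        return torch.sqrt((diff @ cov_inv * diff).sum(-1) + 1e-8)
    return F.mse_loss(pairwise_mahalanobis(z_new), pairwise_mahalanobis(z_old))
```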

3. Feature-level Self-Distillation

Standard classification usually relies on the [CLS] token in Vision Transformers. However, the “patch tokens” (the features representing specific parts of the image) contain rich semantic information.

The authors observed that patch tokens often get ignored or overwritten. To prevent this, they introduce a self-distillation mechanism. They compare the patch tokens from the current network (\(p^t\)) with those from the old network (\(p^{t-1}\)).

Interestingly, they weight this distillation based on how dissimilar a patch is to the class token. If a patch token is very different from the class token, it likely contains unique, local details that shouldn’t be lost.

Equation 14 defining the self-distillation loss for patch tokens.

This formula encourages patch tokens with low angular similarity to the class token (meaning they capture different info) to stay close to their representation in the previous network.
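A hedged sketch of such a weighting scheme (my rendering of the idea; the paper's exact weighting may differ):

```python
import torch
import torch.nn.functional as F

def patch_distillation_loss(patches_new: torch.Tensor, patches_old: torch.Tensor,
                            cls_old: torch.Tensor) -> torch.Tensor:
    """Distill patch tokens, emphasizing patches that differ most from the class token.

    patches_*: (B, P, D) patch tokens from the current / previous network;
    cls_old: (B, D) class token from the previous network.
    """
    sim = F.cosine_similarity(patches_old, cls_old.unsqueeze(1), dim=-1)  # (B, P) similarity to [CLS]
    weight = 1.0 - sim                              # dissimilar patches carry unique local detail
    per_patch = (patches_new - patches_old).pow(2).mean(-1)              # (B, P) feature gap
    return (weight * per_patch).mean()
```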

4. Classifier Alignment

Finally, after training on a task is complete, the classifier head needs a tune-up. Because the model has just seen a lot of data for the new task, the classifier is biased toward it.

Using the calibrated means (from step 1) and covariances (from step 2), the authors sample synthetic features from Gaussian distributions. They use these “hallucinated” features of old classes mixed with new data to retrain the classifier head, ensuring a balanced decision boundary.

Equation 11 showing the cross-entropy loss used for post-hoc classifier alignment.
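Conceptually, the alignment stage can be sketched like this (a simplified illustration; the per-class sample count and variable names are my assumptions):

```python
import numpy as np
import torch

def sample_synthetic_features(means: dict, covs: dict, n_per_class: int = 256):
    """Draw 'hallucinated' features for old classes from calibrated Gaussians."""
    feats, labels = [], []
    for c in means:
        feats.append(np.random.multivariate_normal(means[c], covs[c], size=n_per_class))
        labels.append(np.full(n_per_class, c))
    return (torch.tensor(np.concatenate(feats), dtype=torch.float32),
            torch.tensor(np.concatenate(labels)))

# The classifier head is then retrained with ordinary cross-entropy on these synthetic
# old-class features mixed with real features from the current task, e.g.:
#   loss = torch.nn.functional.cross_entropy(classifier(features), labels)
```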

Experiments and Results

Does this mathematical calibration actually work? The authors tested their method on four benchmark datasets: ImageNet-R, ImageNet-A, CUB-200, and CIFAR-100. They compared their approach against state-of-the-art methods including L2P, DualPrompt, and other LoRA-based strategies.

Comparison with State-of-the-Art

The results, summarized in Table 1 below, are compelling.

Table 1 showing comparative results on four benchmark datasets. The proposed method (‘Ours’) achieves the highest accuracy on ImageNet-R and ImageNet-A.

On ImageNet-R (a dataset with artistic renditions of ImageNet classes, making it distributionally distinct), the proposed method achieves an Average Accuracy (\(\mathcal{A}_{Avg}\)) of 85.95%, outperforming the runner-up (SSIAT) by over 2%.

On ImageNet-A (which contains “natural adversarial” examples that are hard to classify), the method again takes the lead. This suggests that handling Semantic Drift is particularly effective when the data distribution is complex or shifted from the pre-training data.

Robustness Over Time

It is also crucial to see when methods fail. Do they crash after 2 tasks? 5 tasks?

Figure 3 plotting accuracy over incremental learning sessions. The ‘Ours’ line (cyan) consistently stays above competitors, showing better stability.

Figure 3 illustrates the accuracy drop-off as tasks are added.

  • Cyan Line (Ours): Notice how on the top-left graph (ImageNet-R 5 tasks), the cyan line remains nearly flat, while others degrade.
  • Even in the 20-task setting (bottom-left), where forgetting usually hits hard, this method maintains a higher baseline of performance throughout the learning lifecycle.

What Components Matter Most?

You might wonder: is it the Mean Shift Compensation (MSC) or the Covariance Calibration (CC) doing the heavy lifting? The ablation study in Table 3 breaks this down.

Table 3 showing ablation studies. Adding MSC and CC individually improves performance, but combining them yields the best results.

  • Baseline: 79.36% accuracy.
  • Adding MSC: Jumps to 80.81%.
  • Adding CC: Jumps to 80.70%.
  • Combining Both: Reaches 81.60%.
  • Full Method (with Distillation): Peaks at 81.88%.

This confirms that fixing both the center (mean) and the shape (covariance) of the feature clusters is essential for maximum stability.

Conclusion

The paper “Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning” provides a sophisticated look at why neural networks forget. It moves beyond simple “replay” strategies and looks at the geometry of the latent space.

By identifying Semantic Drift as the culprit and introducing statistical tools to calibrate the Mean and Covariance of feature distributions, the authors offer a robust solution. They allow models to remain “plastic” enough to learn new tasks via LoRA, while “stable” enough to retain old knowledge via drift compensation.

For students and researchers in Machine Learning, this work highlights a vital lesson: sometimes the solution isn’t just a bigger network or more data, but a deeper understanding of the statistical behavior of your features. By mathematically constraining how features move and change shape, we can build AI that learns—and remembers—more like we do.