Imagine teaching a smart assistant a new skill every day. You teach it to identify dog breeds on Monday, cat breeds on Tuesday, and bird species on Wednesday. But when you ask it to identify a Golden Retriever on Thursday, it has completely forgotten what a dog is. It only remembers birds. This frustrating phenomenon is called catastrophic forgetting, and it’s one of the biggest hurdles in creating truly intelligent, adaptable AI systems.
The goal of Continual Learning (CL) is to solve exactly this problem—to enable models to learn from a continuous stream of new information without overwriting what they’ve already learned. For years, researchers have wrestled with this challenge, often with limited success. But the arrival of massive, pre-trained foundation models like Vision Transformers (ViT) and CLIP has changed the game. These models, trained on huge general-purpose datasets, act as strong feature extractors. The key question now becomes: How can we best leverage these powerful pre-trained models for continual learning?
The 2023 NeurIPS paper “RanPAC: Random Projections and Pre-trained Models for Continual Learning” proposes a surprisingly simple yet remarkably effective solution. The authors argue that the full potential of pre-trained models for CL has not been fully tapped. Instead of complex retraining schemes, their method largely bypasses forgetting altogether—by inserting a frozen layer of random numbers between the model’s features and its classifier head.
It sounds almost too simple, but the results speak for themselves: RanPAC reduces error rates by up to 62% compared to previous state-of-the-art rehearsal-free methods.
In this article, we’ll explore:
- The fundamentals of continual learning and the challenge of catastrophic forgetting.
- Why simple “class prototype” strategies struggle.
- The surprising power of Random Projections to improve feature separability.
- How RanPAC combines these ideas into a state-of-the-art continual learning method that never forgets.
Understanding Continual Learning
Continual Learning scenarios are usually organized as a sequence of tasks:
- Task 1: Learn to classify 10 classes (say, animals).
- Task 2: Learn 10 new classes (vehicles).
- Task 3: Learn 10 more classes (plants).
During training, the model only sees data for the current task—it cannot access older data once it moves on. This leads to two main challenges:
- Catastrophic Forgetting: Updating parameters for new tasks overwrites old knowledge, reducing performance on previous tasks.
- Task Recency Bias: The model becomes overly biased toward the most recent task.
There are two main flavors of CL:
- Class-Incremental Learning (CIL): Each new task introduces new classes. During testing, the model must classify a sample from any class seen so far, without knowing which task it came from.
- Domain-Incremental Learning (DIL): The classes stay the same, but the domain changes—for instance, classifying cats in photos, paintings, and sketches.
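To make the class-incremental setup concrete, here is a small, hypothetical NumPy helper (not from the paper) that partitions a labeled dataset into disjoint groups of classes, one group per task. The function name and the `classes_per_task` parameter are illustrative:

```python
import numpy as np

def make_cil_tasks(labels, classes_per_task):
    """Split a labeled dataset's sample indices into class-incremental tasks."""
    classes = np.unique(labels)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = classes[start:start + classes_per_task]
        idx = np.where(np.isin(labels, task_classes))[0]  # samples belonging to this task only
        tasks.append(idx)
    return tasks

# e.g. 100 classes split into 10 tasks of 10 classes each:
# tasks = make_cil_tasks(cifar100_labels, classes_per_task=10)
```

During training, the model would only ever see the indices of the current task; earlier indices are never revisited.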
Traditional CL methods rely heavily on training from scratch or storing examples from old tasks (“rehearsal buffers”). But with pre-trained models like ViT and CLIP, we can instead build methods that use strong frozen representations. These strategies fall into three broad categories:
- Prompting: Add small learnable “prompts” to the frozen backbone, updating only these prompts for new tasks.
- Selective Fine-Tuning: Carefully fine-tune a small subset of the pre-trained parameters.
- Class Prototypes (CP): For each class, store a single representative feature—its average vector—and classify new inputs based on similarity to these stored prototypes.
The Class Prototype strategy is simple and memory-friendly. If the feature extractor is frozen, it should, in theory, avoid forgetting. But as RanPAC shows, this simplicity hides a critical flaw.
The Problem With Simple Class Prototypes
In a basic Class-Prototype (CP) method, each class \( y \) has a prototype vector \( \bar{\mathbf{c}}_y \) computed by averaging the feature vectors of its training samples. To classify a test sample with feature \( \mathbf{f}_{\text{test}} \), we find the class whose prototype has the highest cosine similarity:
\[ y_{\text{test}} = \arg \max_{y' \in \{1, \dots, K\}} \frac{\mathbf{f}_{\text{test}}^\top \bar{\mathbf{c}}_{y'}}{||\mathbf{f}_{\text{test}}|| \cdot ||\bar{\mathbf{c}}_{y'}||} \]
Figure 2: Histograms of similarities (left) and Pearson correlation coefficients for class prototypes (right). The top row shows that for a simple NCM classifier, prototypes are highly correlated, and the similarity scores for true vs. incorrect classes overlap significantly.

The intuition seems sound—but in practice, the features extracted by pre-trained models are often highly correlated. For similar-looking classes (“Persian cat” vs. “Siamese cat”), the prototypes end up close together in feature space, leading to confusion. As Figure 2 shows, prototypes for different classes exhibit high off-diagonal correlations, and their similarity scores overlap, resulting in poor classification accuracy.
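To ground the baseline, below is a minimal NumPy sketch of this class-prototype (NCM-style) classifier. It assumes the features come from a frozen pre-trained backbone; the function and array names are illustrative:

```python
import numpy as np

def build_prototypes(features, labels, num_classes):
    """Average the feature vectors of each class to get one prototype per class."""
    # features: (N, L) frozen-backbone features, labels: (N,) integer class labels
    return np.stack([features[labels == y].mean(axis=0) for y in range(num_classes)])

def ncm_predict(f_test, prototypes):
    """Classify a test feature by highest cosine similarity to a class prototype."""
    f = f_test / np.linalg.norm(f_test)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(protos @ f))
```

Because prototypes are just running averages, they can be updated one class at a time without revisiting old data, which is exactly why this strategy is attractive for continual learning.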
Step 1: Decorrelating Prototypes with Second-Order Statistics
How can we force these prototypes apart?
RanPAC introduces a correction using second-order statistics, which account for correlations between feature dimensions. Specifically, it uses the Gram matrix \( \mathbf{G} \)—an inner-product matrix of the features—and computes a decorrelated similarity score:
\[ s_y := \mathbf{f}_{\text{test}}^{\top} \mathbf{G}^{-1} \mathbf{c}_y \]
This decorrelating step “whitens” the feature space—reducing inter-feature dependencies and increasing linear separability.

When applied, this technique dramatically reduces correlation between prototypes. As shown in Figure 2 (third row), the off-diagonal correlations disappear, overlap between similarity distributions decreases, and accuracy climbs from 64% to 75%.
It’s clear that accounting for relationships between features (not just their averages) is a powerful improvement.
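Here is a hedged sketch of this second-order scoring, under the same assumptions as the NCM sketch above. Adding a small ridge term before inverting `G` is an implementation choice for numerical stability, not something stated in the formula:

```python
import numpy as np

def decorrelated_scores(f_test, features, prototypes, ridge=1e-3):
    """Score classes with s_y = f_test^T G^{-1} c_y, where G is the feature Gram matrix."""
    # G accumulates second-order statistics of the training features: shape (L, L)
    G = features.T @ features
    G_inv = np.linalg.inv(G + ridge * np.eye(G.shape[0]))  # small ridge for stability (assumption)
    return prototypes @ (G_inv @ f_test)                   # one score per class

# prediction: int(np.argmax(decorrelated_scores(f_test, features, prototypes)))
```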
Step 2: The Magic of Random Projections
While decorrelation helps, RanPAC adds another layer of transformation—a Random Projection (RP)—to further improve separability.
The idea is deceptively simple: expand the feature space. Given feature vectors \( \mathbf{f} \in \mathbb{R}^L \), apply a randomly generated matrix \( \mathbf{W} \in \mathbb{R}^{L \times M} \), followed by a nonlinear activation \( \phi(\cdot) \):
\[ \mathbf{h} = \phi(\mathbf{f}^{\top} \mathbf{W}) \]
When \( M > L \), this transformation increases dimensionality and introduces nonlinear interactions among features—making the classes more distinguishable in the new space.
Figure 1: Random projections expand the feature space, increasing class separation. Colored clusters show the transformed features of different classes after RP.

Why does this help?
- Dimensionality Expansion: Lifting features into a higher-dimensional space increases sparsity, making classification boundaries easier to find.
- Nonlinear Interactions: Applying \( \phi(\cdot) \) (such as ReLU) introduces pairwise and higher-order feature combinations—capturing interactions the original feature space missed.
- Zero Forgetting: The random weights \( \mathbf{W} \) are generated once and then frozen. Since they are never updated, this layer cannot forget.
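A minimal sketch of such a frozen random-projection layer is shown below. The Gaussian initialization, the ReLU nonlinearity, and the class name are illustrative choices; the essential property is that `W` is sampled once and never trained:

```python
import numpy as np

class FrozenRandomProjection:
    """Expand L-dim features to M-dim with a fixed random matrix and a nonlinearity."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Sampled once, then frozen: these weights are never updated, so they cannot forget.
        self.W = rng.standard_normal((in_dim, out_dim))

    def __call__(self, f):
        # h = phi(f^T W), with phi chosen here as ReLU
        return np.maximum(f @ self.W, 0.0)

# Example: lift 768-dim ViT features into a 10,000-dim space
# rp = FrozenRandomProjection(768, 10_000); h = rp(f)   # f: (768,) or (N, 768)
```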
The paper’s results (bottom row of Figure 2) show that adding an RP layer drastically improves separability and raises training accuracy to 87%, surpassing even joint training (i.e., learning on all tasks simultaneously).
Figure 3: Average accuracy improves with projection size \( M \), but only when a nonlinear activation \( \phi \) is used.

Without nonlinearity, RP alone brings little benefit—even at \( M = 15{,}000 \). It’s the combination of expanded dimensionality and nonlinear mixing that creates linearly separable representations.
The RanPAC Algorithm: Putting It All Together
RanPAC combines these insights into an elegant two-phase training process.
Figure A1: Overview of RanPAC’s two phases—First-session PETL adaptation (optional) and continual learning with Random Projections.
Phase 1: Optional First-Session Adaptation
Before continual training begins, RanPAC can optionally apply Parameter-Efficient Transfer Learning (PETL), adapting a small number of parameters using the first task’s data only. This bridges the domain gap between pretraining (e.g., ImageNet) and the target dataset. Afterwards, all parameters—both the pre-trained backbone and PETL adapters—are frozen.
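Schematically, first-session adaptation looks like the PyTorch sketch below. The `backbone` and `adapter` modules are toy stand-ins (real implementations attach parameter-efficient modules to a ViT); the point is that only the small added parameters are trained, only on task 1's data, and everything is frozen afterwards:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(3 * 224 * 224, 768), nn.GELU())  # stand-in for a pre-trained ViT
adapter = nn.Linear(768, 768)                                       # stand-in PETL module
head = nn.Linear(768, 10)                                           # temporary head for task 1

for p in backbone.parameters():
    p.requires_grad = False                                         # backbone stays frozen throughout

optimizer = torch.optim.AdamW(list(adapter.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_first_session(task1_loader, epochs=1):
    """Adapt only the PETL parameters, using data from the first task alone."""
    for _ in range(epochs):
        for x, y in task1_loader:
            logits = head(adapter(backbone(x.flatten(1))))
            loss = loss_fn(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    for p in adapter.parameters():                                   # afterwards, freeze the adapter too
        p.requires_grad = False
```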
Phase 2: Continual Learning with Random Projections
For each task \( t = 1, \dots, T \):
- Generate and freeze a random projection matrix \( \mathbf{W} \).
- Extract features \( \mathbf{f}_{t,n} \) from the frozen model.
- Compute random-projected features \( \mathbf{h}_{t,n} = \phi(\mathbf{f}_{t,n}^{\top}\mathbf{W}) \).
- Incrementally update two matrices: \[ \mathbf{G} = \sum_{t,n} \mathbf{h}_{t,n} \otimes \mathbf{h}_{t,n}, \quad \mathbf{C} = \sum_{t,n} \mathbf{h}_{t,n} \otimes \mathbf{y}_{t,n} \] These updates can occur per-sample or per-task and are independent of task order—perfect for continual learning.
- Compute final class scores for inference: \[ s_y = \phi(\mathbf{f}_{\text{test}}^{\top} \mathbf{W}) (\mathbf{G} + \lambda \mathbf{I})^{-1} \mathbf{c}_y \]

This formulation mirrors regularized least-squares regression but avoids any iterative parameter updates. The result: a simple, elegant algorithm that learns without forgetting.
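Putting Phase 2 together, here is a compact NumPy sketch of the whole head: a frozen random projection plus running `G` and `C` statistics, updated task by task (the order does not matter, since both are plain sums) and queried with a ridge-regularized solve. The class name is illustrative, and the regularization strength `lam` is fixed here for simplicity rather than tuned as in the paper:

```python
import numpy as np

class RanPACStyleHead:
    """Rehearsal-free head: frozen random projection + running G and C statistics."""
    def __init__(self, feat_dim, proj_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((feat_dim, proj_dim))   # frozen RP matrix, sampled once
        self.G = np.zeros((proj_dim, proj_dim))               # running sum of h h^T
        self.C = np.zeros((proj_dim, num_classes))            # running sum of h y^T

    def _project(self, F):
        return np.maximum(F @ self.W, 0.0)                    # h = ReLU(f^T W)

    def update(self, F_task, y_task):
        """Accumulate statistics for one task; earlier data is never revisited."""
        H = self._project(F_task)                              # (N, M)
        Y = np.eye(self.C.shape[1])[y_task]                    # one-hot labels, (N, K)
        self.G += H.T @ H
        self.C += H.T @ Y

    def predict(self, F_test, lam=1.0):
        """Scores s = h (G + lam I)^{-1} C; pick the highest-scoring class."""
        H = self._project(F_test)
        scores = H @ np.linalg.solve(self.G + lam * np.eye(self.G.shape[0]), self.C)
        return scores.argmax(axis=1)
```

In use, one would call `update` once per task with that task's frozen-backbone features and labels, and `predict` at any point in the sequence; nothing about earlier tasks is overwritten.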
Experimental Results: Shattering State of the Art
RanPAC’s performance was tested on eight major benchmark datasets for both CIL and DIL scenarios, using ViT-B/16 as the backbone and comparing against methods like L2P, DualPrompt, CODA-Prompt, ADaM, and SLCA.
Class-Incremental Learning (CIL)
Table 1: RanPAC outperforms all prompting and CP-based methods across seven CIL benchmarks.
Highlights:
- Huge Error Reductions: RanPAC slashes error rates by 36% on CIFAR100, 20% on ImageNet-R, and an impressive 62% on Cars.
- Ablations confirm importance:
- Omitting the RP layer or Phase 1 PETL leads to significant drops.
- The simple NCM method performs worst, revealing the weakness of naive prototypes.
- Beyond the Upper Bound: RanPAC even surpasses the “Joint linear probe” baseline trained on all data at once, showing that nonlinear random projections unlock new performance gains.
Table 2: Comparing RanPAC to fine-tuning methods—RanPAC achieves nearly joint-training accuracy with far less computation and no forgetting.
Domain-Incremental Learning (DIL)
RanPAC retains its advantage even when domains shift across tasks.
Table 3: RanPAC achieves leading DIL results, with large improvements on DomainNet and CDDB-Hard.
For DomainNet (six domains, 345 classes), RanPAC boosts accuracy from ~50% to 66.6%. Ablations confirm that RP contributes most to performance, while PETL adds less value here—consistent with domain-shift scenarios.
Why RanPAC Works: Key Takeaways
- Frozen Foundations Prevent Forgetting: Pre-trained models already capture rich, general features. Keeping them frozen ensures no catastrophic forgetting, making CL easier.
- Randomness Enhances Representations: Expanding dimensionality through a non-linear random projection increases feature diversity and separability without any training cost.
- Second-Order Statistics Are Critical: Using the inverted Gram matrix to decorrelate features yields prototypes that align better with actual data distributions, improving accuracy.
- Simplicity Meets Power: RanPAC’s process—feature extraction, random transform, statistical updates—requires no iterative tuning and no rehearsal memory, yet beats sophisticated alternatives.
Broader Implications and Future Directions
RanPAC challenges the assumption that continual learning requires complex mechanisms or stored data. Its foundation in classical linear algebra and probabilistic theory, combined with modern deep features, offers several exciting avenues:
- Task-Agnostic Learning: Already shown to work even when task boundaries disappear.
- Multi-modal Models: Extensions using both vision and language features (e.g., CLIP).
- Regression and Representation Learning: By changing the target prototype definitions, RanPAC can adapt beyond classification.
Conclusion
The RanPAC framework delivers a striking message: sometimes, the most powerful solution is the simplest. By pairing frozen pre-trained models with a random projection layer and decorrelated prototypes, we can create fast, memory-efficient, and forget-proof continual learners.
In a landscape dominated by increasingly complex architectures, RanPAC demonstrates that leveraging randomness and mathematical elegance can take us farther than endless fine-tuning. It sets a new state of the art in rehearsal-free continual learning—proving that, for AI, never forgetting may finally be within reach.