Artificial intelligence has become incredibly powerful—but most models are surprisingly static. Trained once on massive datasets, they remain frozen, unable to adapt when the world changes. That’s a huge problem because reality is dynamic: data streams evolve, trends shift, and new information emerges every second. How can we build AI systems that learn continuously, absorbing new knowledge like humans do, without retraining from scratch each time?
This challenge defines Continual Learning (CL), also called lifelong or incremental learning. The goal of CL is to enable models to learn from a sequence of tasks while retaining previous knowledge. However, this goal faces a major obstacle: catastrophic forgetting. Training a neural network on a new task can overwrite previously learned information and severely degrade performance on older tasks.
Researchers have proposed numerous strategies to mitigate forgetting, often by training models from scratch with clever memory or regularization schemes. But the landscape transformed with the advent of Pre-Trained Models (PTMs)—massive models like the Vision Transformer (ViT) that are trained on enormous datasets and develop rich, general representations of the world.
Using a PTM for continual learning is like teaching an experienced adult a new skill instead of starting from an infant’s blank slate. The adult already has extensive prior knowledge, making learning more efficient and less error-prone. The survey paper “Continual Learning with Pre-Trained Models: A Survey” provides a sweeping overview of this exciting new frontier.
In this post, we’ll unpack the survey’s insights, explore how PTMs are reshaping continual learning, and break down the three main methodological families driving this revolution.

Figure 1: Traditional continual learning begins from random initialization—like teaching an infant from scratch. PTM-based learning starts from a pre-trained “adult” model that already possesses broad knowledge, enabling faster and more reliable adaptation.
Setting the Stage: The Basics of Continual Learning
In a continual learning setup, a model faces a stream of tasks \( \mathcal{D}^1, \mathcal{D}^2, \dots, \mathcal{D}^B \). At stage \( b \), the model only sees data from the current task \( \mathcal{D}^b \). The ultimate objective is to perform well across all tasks encountered so far. Formally, the model aims to minimize the expected risk:
\[ f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \ \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}^{1} \cup \cdots \cup \mathcal{D}^{b}} \mathbb{I}(y \neq f(\mathbf{x})) \]
Different scenarios of continual learning are defined by how tasks change:
- Class-Incremental Learning (CIL): Each new task introduces previously unseen classes (e.g., classifying cats and dogs first, then birds and fish later). At test time, the model must recognize all classes without knowing which task an image belongs to.
- Task-Incremental Learning (TIL): Similar to CIL, but the task ID is known during testing, simplifying classification.
- Domain-Incremental Learning (DIL): The set of classes stays constant, but domains shift—for instance, from real photos to sketches of the same objects.
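To make the class-incremental protocol concrete, here is a minimal Python sketch (an illustrative helper of my own, not from the survey) that splits a 100-class dataset such as CIFAR-100 into a stream of disjoint tasks; at stage \( b \), the model would only see data from the classes in `stream[b]`.

```python
# A hypothetical helper (not from the survey) that builds a class-incremental
# stream: 100 classes split into 10 disjoint tasks of 10 classes each.
from typing import List

def make_class_incremental_stream(num_classes: int = 100, num_tasks: int = 10) -> List[List[int]]:
    """Partition class labels into disjoint, ordered tasks."""
    classes_per_task = num_classes // num_tasks
    return [
        list(range(b * classes_per_task, (b + 1) * classes_per_task))
        for b in range(num_tasks)
    ]

stream = make_class_incremental_stream()
print(stream[0])  # stage 1 trains on classes 0-9
print(stream[1])  # stage 2 introduces classes 10-19; earlier data is no longer available
```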
The Power of Pre-Trained Models
Most modern PTM-based CL methods employ the Vision Transformer (ViT) architecture. A ViT divides an image into non-overlapping patches, prepends a special [CLS] token, and processes them through transformer layers. The [CLS] embedding serves as a general-purpose image representation.
Conceptually, a PTM can be decomposed into:
\[ f(\mathbf{x}) = W^{\top} \phi(\mathbf{x}) \]
where \( \phi(\cdot) \) is the feature extractor (frozen during CL) and \( W \) is the classification head.
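As a rough illustration of this decomposition, the sketch below loads a pre-trained ViT as the frozen \( \phi(\cdot) \) and places a trainable linear head on top as \( W \). Using the timm library and a 10-class head are assumptions made here for illustration, not a fixed recipe.

```python
# Sketch of f(x) = W^T phi(x): a frozen pre-trained ViT as phi(.) plus a
# trainable linear head as W. Loading via timm is one common choice; the
# head size (10 classes) is illustrative.
import torch
import torch.nn as nn
import timm

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
for p in backbone.parameters():
    p.requires_grad = False  # phi(.) stays frozen during continual learning

head = nn.Linear(backbone.num_features, 10)  # W, typically grown as new classes arrive

def f(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        features = backbone(x)   # phi(x): pooled [CLS] representation
    return head(features)        # W^T phi(x)

logits = f(torch.randn(2, 3, 224, 224))
```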
PTMs bring two major advantages to continual learning:
- Exceptional Generalization — Trained on massive, diverse datasets, PTMs already provide robust, transferable features.
- Lightweight Adaptation — Their architecture allows efficient fine-tuning via small modules, making it feasible to adapt them incrementally without losing prior knowledge.
A Taxonomy of PTM-Based Continual Learning
The authors classify PTM-based continual learning methods into three families, each addressing the stability–plasticity trade-off differently:

Figure 2: Three main families of methods for continual learning with pre-trained models.
- Prompt-Based Methods — Tune small, trainable “prompts” while keeping PTM weights frozen.
- Representation-Based Methods — Exploit PTM feature embeddings directly for classification.
- Model Mixture-Based Methods — Combine multiple models via ensembling or parameter merging.
Let’s dive deeper into each of these approaches.
1. Prompt-Based Methods — Guiding the Giant
Prompt-based methods exploit PTMs as frozen engines, learning small trainable modules—prompts—to encode new tasks efficiently.
In Visual Prompt Tuning (VPT), learnable prompt vectors \( P \) are prepended to patch embeddings. The model processes the concatenated input while the PTM backbone remains fixed:
\[ \min_{P \cup W} \sum_{(\mathbf{x}, y) \in \mathcal{D}^{b}} \ell \left( W^{\top} \phi \left( \mathbf{x}; P \right), y \right) \]
Freezing the backbone minimizes forgetting but requires careful prompt management as new tasks arrive.
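The following simplified sketch shows the mechanics of prompt prepending: a small set of learnable prompt tokens is concatenated in front of the patch embeddings, and only the prompts and the head receive gradients. The frozen transformer blocks are stood in by a generic `nn.TransformerEncoder`; a real implementation would reuse the pre-trained ViT layers.

```python
# Simplified VPT-style module: learnable prompts P are prepended to the patch
# embeddings; only P and the head are trainable. The frozen backbone is stood
# in by a generic TransformerEncoder purely for illustration.
import torch
import torch.nn as nn

class PromptedViT(nn.Module):
    def __init__(self, embed_dim: int = 768, num_prompts: int = 10, num_classes: int = 10):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)  # P
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.blocks.parameters():
            p.requires_grad = False                    # frozen backbone stand-in
        self.head = nn.Linear(embed_dim, num_classes)  # W

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), e.g. from the ViT patch embedding
        prompts = self.prompts.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)  # [P; x]
        out = self.blocks(tokens)
        return self.head(out[:, 0])  # classify from the first token (a simplification)

logits = PromptedViT()(torch.randn(2, 196, 768))
```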
Prompt Pool: Instead of updating a single prompt per task, researchers maintain a pool \( \mathbf{P} = \{P_1, P_2, \dots, P_M\} \) representing different tasks. Selecting which prompt to use introduces a retrieval challenge.
Prompt Selection — “Learning to Prompt” (L2P): L2P assigns a learnable key vector \( k_m \) to each prompt. Given a query feature \( \phi(\mathbf{x}) \), it retrieves the most relevant prompts based on similarity:
\[ \mathbf{K}_{\mathbf{x}} = \operatorname*{argmin}_{\{s_i\}_{i=1}^{N}} \sum_{i=1}^{N} \gamma(\phi(\mathbf{x}), \mathbf{k}_{s_i}) \]
Keys are updated to better align with corresponding task features, mitigating forgetting across tasks.
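A minimal sketch of this key-query retrieval, assuming cosine similarity stands in for \( \gamma \) and using illustrative pool sizes:

```python
# Sketch of L2P-style retrieval: each prompt has a learnable key; the frozen
# query feature phi(x) picks the top-N most similar keys. Pool size, prompt
# length, and the cosine-similarity gamma are illustrative choices.
import torch
import torch.nn.functional as F

M, N, embed_dim, prompt_len = 10, 3, 768, 5
prompt_pool = torch.nn.Parameter(torch.randn(M, prompt_len, embed_dim) * 0.02)  # {P_1, ..., P_M}
prompt_keys = torch.nn.Parameter(torch.randn(M, embed_dim) * 0.02)              # {k_1, ..., k_M}

def select_prompts(query: torch.Tensor) -> torch.Tensor:
    """query: (batch, embed_dim) frozen features phi(x)."""
    sim = F.cosine_similarity(query.unsqueeze(1), prompt_keys.unsqueeze(0), dim=-1)  # (batch, M)
    top_idx = sim.topk(N, dim=-1).indices  # highest similarity = smallest distance gamma
    return prompt_pool[top_idx]            # (batch, N, prompt_len, embed_dim)

selected = select_prompts(torch.randn(4, embed_dim))
print(selected.shape)  # torch.Size([4, 3, 5, 768])
```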
Prompt Combination — CODA-Prompt: Instead of hard selection, CODA-Prompt computes an attention-weighted mixture of all prompts:
\[ P = \sum_{m=1}^{M} \gamma(\phi(\mathbf{x}) \odot \mathbf{a}_m, \mathbf{k}_m) P_m \]
This continuous weighting enhances flexibility and diversity across tasks.
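A sketch of the soft combination for a single image, again with cosine similarity standing in for \( \gamma \) and illustrative shapes:

```python
# Sketch of CODA-Prompt-style soft weighting for one image: the attention
# vector a_m gates the query before it is compared with key k_m, and the
# resulting weights mix all prompts into a single prompt.
import torch
import torch.nn.functional as F

M, prompt_len, embed_dim = 10, 5, 768
prompts = torch.randn(M, prompt_len, embed_dim)   # P_m
keys = torch.randn(M, embed_dim)                  # k_m
attn = torch.randn(M, embed_dim)                  # a_m

def combine_prompts(query: torch.Tensor) -> torch.Tensor:
    """query: (embed_dim,) frozen feature phi(x) for a single image."""
    gated = query.unsqueeze(0) * attn                     # phi(x) ⊙ a_m -> (M, embed_dim)
    weights = F.cosine_similarity(gated, keys, dim=-1)    # gamma(phi(x) ⊙ a_m, k_m) -> (M,)
    return (weights[:, None, None] * prompts).sum(dim=0)  # weighted mixture -> (prompt_len, embed_dim)

P = combine_prompts(torch.randn(embed_dim))
print(P.shape)  # torch.Size([5, 768])
```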
Prompt Generation — DAP: The Dynamic Attention-based Prompting (DAP) model goes further, using a meta-network to generate unique prompts for each instance:
\[ P = (\gamma_e \operatorname{MLP}(\operatorname{LN}(\phi(\mathbf{x}))^{\top}) + \beta_e)^{\top} \]
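A simplified sketch in the spirit of this idea (the exact transposes and normalization in DAP differ): a small MLP maps the layer-normed frozen feature of each instance to its own prompt, modulated by a learnable scale and shift. All names and dimensions here are illustrative.

```python
# Simplified sketch of instance-level prompt generation: an MLP on the
# layer-normed frozen feature emits a prompt per input, scaled by gamma_e
# and shifted by beta_e. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

embed_dim, prompt_len = 768, 5
layer_norm = nn.LayerNorm(embed_dim)
mlp = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                    nn.Linear(embed_dim, prompt_len * embed_dim))
gamma_e = nn.Parameter(torch.ones(prompt_len * embed_dim))
beta_e = nn.Parameter(torch.zeros(prompt_len * embed_dim))

def generate_prompt(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, embed_dim) frozen features phi(x)."""
    out = gamma_e * mlp(layer_norm(features)) + beta_e
    return out.view(features.size(0), prompt_len, embed_dim)  # one prompt per instance

P = generate_prompt(torch.randn(4, embed_dim))
print(P.shape)  # torch.Size([4, 5, 768])
```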
Figure 3: Visual summary of major prompt-based strategies, from retrieval-based selection to dynamic generation.
Advantages:
- Extremely parameter-efficient—only small prompts are trained.
- Preserve PTM’s global knowledge.
- Act as external memory for task-specific adaptation.
Drawbacks:
- Prompt selection can be unstable, leading to “meta-forgetting.”
- Growing prompt pools risk inconsistency between training and testing.
- Tasks with large domain shifts may exceed the expressiveness of fixed prompt spaces.
2. Representation-Based Methods — Trusting the Features
Representation-based approaches capitalize on PTMs’ strong feature extraction rather than modifying their parameters.
SimpleCIL: The most straightforward method freezes the PTM and computes a prototype (mean embedding) for each class:
\[ c_i = \frac{1}{K} \sum_{j=1}^{|\mathcal{D}^{b}|} \mathbb{I}(y_j = i) \phi(\mathbf{x}_j) \]
where \( K \) is the number of instances of class \( i \) in \( \mathcal{D}^{b} \). These prototypes serve directly as classifier weights—an elegantly simple and surprisingly powerful approach that often outperforms complex prompt-based systems.
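A minimal sketch of the SimpleCIL recipe, using a nearest-prototype rule with cosine similarity (one common choice) and random tensors standing in for frozen \( \phi(\mathbf{x}) \) embeddings:

```python
# Sketch of SimpleCIL: average the frozen embeddings of each class into a
# prototype c_i and classify by the most similar prototype.
import torch
import torch.nn.functional as F

def build_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """features: (n, d) frozen embeddings, labels: (n,). Returns (num_classes, d)."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)  # c_i: mean embedding of class i
    return protos

def classify(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    sims = F.cosine_similarity(features.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    return sims.argmax(dim=-1)

feats, labels = torch.randn(100, 768), torch.randint(0, 10, (100,))
prototypes = build_prototypes(feats, labels, num_classes=10)
preds = classify(torch.randn(5, 768), prototypes)
```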
ADAM: To introduce task-specific adaptability, ADAM concatenates frozen features with tuned ones (using parameter-efficient fine-tuning):
\[ c_i = \frac{1}{K} \sum_{j=1}^{|\mathcal{D}^{b}|} \mathbb{I}(y_j = i) [\phi(\mathbf{x}_j), \phi(\mathbf{x}_j; \text{PEFT})] \]
This hybrid representation merges PTM generality with task-specific knowledge.
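A tiny sketch of the concatenation step; `frozen_backbone` and `adapted_backbone` are hypothetical callables standing in for \( \phi(\cdot) \) and \( \phi(\cdot; \text{PEFT}) \):

```python
# Sketch of the concatenated representation: generic frozen features are joined
# with features from a PEFT-adapted copy of the backbone; prototypes are then
# built from the concatenated vectors exactly as in SimpleCIL.
import torch

def concat_features(x: torch.Tensor, frozen_backbone, adapted_backbone) -> torch.Tensor:
    with torch.no_grad():
        f_general = frozen_backbone(x)   # phi(x): generic PTM features
        f_adapted = adapted_backbone(x)  # phi(x; PEFT): task-adapted features
    return torch.cat([f_general, f_adapted], dim=-1)  # [phi(x), phi(x; PEFT)]
```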
Enhancements:
- RanPAC: Employs random projections and linear discriminant analysis to decorrelate class prototypes.
- EASE: Aggregates representations from multiple task-specific backbones to further strengthen robustness.
- SLCA: Fine-tunes the backbone very slowly while rapidly updating the classifier, reducing forgetting.
Advantages:
- Simple yet highly effective; interpretable via class prototypes.
- Low training cost since most parameters are frozen.
- Often form strong baselines for fair comparisons.
Drawbacks:
- Potential redundancy when concatenating features.
- Limited adaptability for large domain shifts among tasks.
3. Model Mixture-Based Methods — Many Minds Working Together
This family tackles forgetting by maintaining multiple models or parameter sets and combining their knowledge.
Model Ensemble: Separate classifiers or backbones are trained for each task and aggregated during inference. For example:
- ESN: Trains distinct classifiers for each task and aggregates predictions via temperature-based voting.
- PromptFusion: Combines outputs from different PTMs (e.g., ViT and CLIP) using weighted mixing.
Model Merging: Instead of storing multiple models, model-merging approaches condense them into a single network. LAE introduces online and offline models—training the online model for current tasks and then merging parameters via exponential moving average:
\[ \boldsymbol{\theta}^{\text{Offline}} \leftarrow \alpha \boldsymbol{\theta}^{\text{Offline}} + (1 - \alpha) \boldsymbol{\theta}^{\text{Online}} \]
This smooth merging retains old knowledge while integrating new updates.
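In code, the merge is a single exponential-moving-average sweep over matching parameters; the sketch below assumes the online and offline models share one architecture:

```python
# Sketch of the EMA merge: after the online model finishes the current task,
# its parameters are folded into the offline model with momentum alpha.
import torch
import torch.nn as nn

@torch.no_grad()
def ema_merge(offline: nn.Module, online: nn.Module, alpha: float = 0.99) -> None:
    for p_off, p_on in zip(offline.parameters(), online.parameters()):
        p_off.mul_(alpha).add_(p_on, alpha=1.0 - alpha)  # theta_off <- alpha*theta_off + (1-alpha)*theta_on
```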
Other variants, such as HiDe-Prompt, perform prompt-specific merging after each learning stage, while CoFiMA uses Fisher information to weight parameters by importance and guide the merge.
Advantages:
- Diverse models reduce forgetting and improve robustness.
- Inference cost remains manageable once the models are merged into a single network.
- Flexible balance between old and new knowledge.
Drawbacks:
- Ensembling demands substantial resources and memory.
- Merging heuristics can be unstable and sensitive to hyperparameters.
Experimental Insights
The survey evaluates representative methods across seven benchmarks: CIFAR-100, CUB-200, ImageNet-R, ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. These datasets vary in domain similarity to ImageNet, revealing how well models handle shifting distributions.

Table 1: Experimental results across seven benchmarks. \( \bar{\mathcal{A}} \) represents mean accuracy over tasks, and \( \mathcal{A}_B \) indicates final accuracy after learning the last task.
Key Takeaways:
- Challenging Benchmarks Are Essential: Performance on standard datasets like CIFAR-100 is high, but models struggle with ImageNet-A or ObjectNet—datasets with large domain gaps. Future CL research must focus on such difficult benchmarks.
- Representation-Based Methods Excel: Approaches like ADAM and RanPAC outperform most alternatives, underscoring the strength of PTM representations.
- Simplicity Wins: The humble SimpleCIL baseline achieves unexpectedly strong results, proving that in the PTM era, leveraging rich existing features can trump complex tuning strategies.
- Fairness Matters: The authors identify a critical issue with DAP’s reported results. DAP’s prompt-generation network used batch-level information during testing, inadvertently revealing task identity and inflating accuracy. When re-evaluated fairly (batch size = 1, without such leakage), its performance dropped sharply below simpler methods. This finding underscores the importance of batch-agnostic evaluation for honest benchmarking.
The Road Ahead: Future Directions
Continual Learning for Large Language Models (LLMs): Large models like GPT face similar challenges—they must update world knowledge without retraining from scratch. Developing task-aware continual learning for LLMs could dramatically reduce resource use and improve responsiveness.
Beyond Single Modality: Multimodal PTMs, such as CLIP, combine visual and textual learning. Applying continual learning to such models enables incremental adaptation across diverse data types, advancing tasks like cross-modal reasoning and retrieval.
Learning on the Edge: PTMs are computationally heavy. Designing CL algorithms that run efficiently on resource-constrained devices will make personal adaptive AI feasible.
New Benchmarks Beyond PTM Knowledge: Since PTMs are trained on vast datasets, they rarely encounter novel concepts. Creating benchmarks with substantial domain gaps will test true continual learning—learning genuinely new information.
Conclusion
Pre-trained models have transformed continual learning from a battle against forgetting into a quest for adaptive knowledge growth. By starting with a globally informed model, researchers can focus on lightweight adaptation, fair evaluation, and realistic scenarios.
The survey “Continual Learning with Pre-Trained Models: A Survey” carefully categorizes methods into prompt-based, representation-based, and model mixture-based families. Experiments reveal that simplicity and fairness matter—a frozen PTM’s representations often rival or surpass complex tuning mechanisms.
The future of AI lies in systems that can learn continuously, intelligently, and sustainably. With pre-trained models paving the way, the dream of machines that “never forget” is closer than ever.