In the world of artificial intelligence, one of the holy grails is building systems that can learn continuously—without forgetting what they already know. Humans do this naturally: when we learn about dogs, we don’t forget everything we know about cats. That ability to acquire new knowledge while retaining prior understanding is the essence of Continual Learning (CL).
Traditionally, CL research has emphasized an agent-centric view, in which a single AI agent directly learns from a stream of raw data. This is powerful, but it doesn’t reflect our modern, interconnected ecosystem where trained models abound. From vision systems that classify images to language models powering chatbots, these existing models embody compressed expertise—what the authors of the research paper call neural skills.
This abundance raises a key question central to the paper “Ex-Model: Continual Learning from a Stream of Trained Models”: Why not learn directly from these expert models instead of starting from raw data?
Much like humans learning from teachers or textbooks rather than reinventing the wheel through trial and error, artificial agents could learn more efficiently, privately, and scalably by studying other models.
This idea forms the foundation of a new framework called Ex-Model Continual Learning (ExML). In ExML, an agent learns not from the original data, but from a stream of previously trained expert models. This shift promises AI systems that are both privacy-preserving and collaborative.

Figure 1: Comparison between traditional Continual Learning (left), which learns from a stream of raw data, and Ex-Model Continual Learning (right), which learns from a stream of expert models.
In this article, we’ll unpack the ExML paradigm, explore the algorithms that make it work, and discuss the experiments that demonstrate its potential.
From Raw Data to Expert Models: Redefining Continual Learning
To see what makes ExML revolutionary, let’s briefly recap the traditional continual learning setup.
A typical CL algorithm, denoted \( \mathcal{A}^{CL} \), processes a sequence of learning experiences \( S = e_1, e_2, \ldots, e_n \). Each experience \( e_i \) carries a batch of data \( \mathcal{D}^i \), split into training and test portions \( \mathcal{D}_{train}^{i} \) and \( \mathcal{D}_{test}^{i} \). The algorithm updates its current model \( f_{i-1}^{CL} \) to produce the next version \( f_i^{CL} \):
\[ \mathcal{A}^{CL}: \langle f_{i-1}^{CL}, \mathcal{D}_{train}^{i}, \mathcal{M}_{i-1}, t_i \rangle \to \langle f_{i}^{CL}, \mathcal{M}_i \rangle. \]
Here, \( \mathcal{M}_{i-1} \) is a memory buffer that may hold past samples, and \( t_i \) is an optional task label. The overall goal is to minimize the loss across all experiences:
\[ \mathcal{L}_{S}(f_n^{CL}, n) = \frac{1}{\sum_{i=1}^{n} |\mathcal{D}_{test}^{i}|} \sum_{i=1}^{n} \mathcal{L}_{exp}(f_n^{CL}, \mathcal{D}_{test}^{i}). \]
In the ExML scenario, things change drastically. We no longer have direct access to the data stream \( \mathcal{D}_1, \ldots, \mathcal{D}_n \). Instead, our stream consists of expert models \( f_1^S, f_2^S, \ldots, f_n^S \), each trained independently on its own dataset.
This introduces two defining constraints:
- No Access to Original Data: The learning algorithm can only interact with trained models, not the data that created them—enhancing privacy.
- Limited Memory and Computation: The learner can’t store all experts; models must be integrated efficiently as they arrive.
The goal remains: develop a single consolidated model performing well across all tasks. An ExML algorithm, \( \mathcal{A}^{ExM} \), can be described as:
\[ \mathcal{A}^{ExM}: \langle f_{i-1}^{ExM}, f_i^S, \mathcal{M}_{i-1}^{ex}, t_i \rangle \to \langle f_i^{ExM}, \mathcal{M}_i^{ex} \rangle. \]
Here, \( f_i^{ExM} \) is the continually learned model, \( f_i^S \) the new expert, and \( \mathcal{M}_i^{ex} \) a memory buffer—filled not with raw data, but with surrogate samples crucial for learning.
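To make this interface concrete, here is a minimal Python sketch of the ExML outer loop. The names (`run_ex_model_stream`, `step`) are illustrative stand-ins rather than the paper's code; the point is simply that the learner consumes experts, never their datasets.

```python
from typing import Callable, List, Tuple

Model = object  # stand-in type: any trained network

def run_ex_model_stream(
    f_init: Model,
    experts: List[Model],
    step: Callable[[Model, Model, list], Tuple[Model, list]],
) -> Model:
    """Fold a stream of experts f_1^S, ..., f_n^S into one model f^ExM.
    `step` implements A^ExM: it sees only the current model, one expert,
    and the surrogate buffer M^ex, never the experts' training data."""
    model, buffer = f_init, []  # `buffer` plays the role of M^ex
    for expert in experts:
        model, buffer = step(model, expert, buffer)
    return model
```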
The Core Method: Learning via Ex-Model Distillation
How can we transfer knowledge from an expert model without ever seeing its training data?
The authors introduce a family of algorithms called Ex-Model Distillation (ED). These methods rely on knowledge distillation—a process where a “student” learns to mimic a “teacher” model’s outputs. Since ExML lacks direct data, the system must first create surrogate data.
Each learning step involves two primary phases: updating this synthetic buffer and distilling knowledge from the experts.
Step 1: Creating a Surrogate Dataset
The ED algorithm maintains a fixed-size buffer \( \mathcal{M}^{ex} \) containing surrogate data—stand-ins for all previously seen experiences. When a new expert model \( f_i^S \) arrives, a data generator \( \mathcal{A}^{gen} \) produces synthetic examples:
\[ \mathcal{D}_{i}^{ex} = \mathcal{A}^{gen}\left(f_i^{S}, \frac{N}{i}\right), \]
where \( N \) is the total buffer size.
Older samples are partially replaced, keeping the buffer size constant. This way, the model always retains useful knowledge while incorporating fresh synthetic data.
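As a rough sketch of this policy, assuming an equal per-experience quota and a `generate` callable standing in for \( \mathcal{A}^{gen} \) (both our simplifications):

```python
def update_buffer(buffer, expert, generate, N, i):
    """Keep M^ex at a constant size N: after the i-th expert arrives, each
    experience seen so far retains roughly N/i surrogate samples.
    `buffer` is a list of per-experience sample lists; `generate(expert, k)`
    stands in for the data generator A^gen."""
    quota = N // i
    # Shrink every stored experience to the new per-experience quota.
    buffer = [samples[:quota] for samples in buffer]
    # Append freshly generated surrogate samples for the new expert.
    buffer.append(generate(expert, quota))
    return buffer
```

With \( N = 1200 \), for example, the third expert contributes 400 fresh samples while the two earlier experiences are trimmed to 400 each.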
Step 2: Distilling Knowledge from Two Teachers
Next, the model learns from two teachers:
- The previous Ex-Model (\( f_{i-1}^{ExM} \)), which carries historical knowledge.
- The current expert (\( f_i^S \)), which provides new insights.
For each synthetic sample \( x^{syn} \), the algorithm combines the outputs of both teachers to form the target logits \( \tilde{\boldsymbol{y}} \):

Equations 6–8: If the sample’s class belongs to past tasks, use the old model; if it’s new, use the expert; if overlapping, average both.
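Written out, a simplified reconstruction of the paper's Equations 6–8 might read as follows, where \( Y_{old} \) and \( Y_i \) denote the class sets covered by the previous ex-model and the new expert (our notation; the paper's exact form may differ):

\[ \tilde{\boldsymbol{y}} = \begin{cases} f_{i-1}^{ExM}(x^{syn}) & \text{if } y^{syn} \in Y_{old} \setminus Y_i \\ f_i^{S}(x^{syn}) & \text{if } y^{syn} \in Y_i \setminus Y_{old} \\ \frac{1}{2}\big(f_{i-1}^{ExM}(x^{syn}) + f_i^{S}(x^{syn})\big) & \text{if } y^{syn} \in Y_{old} \cap Y_i \end{cases} \]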
To train the student model \( f_i^{ExM} \), the authors use a hybrid loss function:
\[ \mathcal{L}_{ED}(\boldsymbol{y}^{curr}, \tilde{\boldsymbol{y}}, y^{syn}) = \|\boldsymbol{y}^{curr} - \tilde{\boldsymbol{y}}\|_2^2 + \lambda_{CE}\mathcal{L}_{CE}(\boldsymbol{y}^{curr}, y^{syn}), \]
combining Mean Squared Error (MSE) to match outputs and Cross-Entropy (CE) to guide correct classification. This allows continual learning without raw data access.
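In PyTorch-style code, the objective might look like the sketch below; the function and argument names are ours, and `lambda_ce` plays the role of \( \lambda_{CE} \).

```python
import torch.nn.functional as F

def ed_loss(y_curr, y_target, y_syn, lambda_ce=1.0):
    """Sketch of L_ED: an MSE term pulls the student's logits y_curr toward
    the combined teacher logits y_target, while a CE term ties predictions
    to the surrogate labels y_syn (integer class indices)."""
    mse = (y_curr - y_target).pow(2).sum(dim=1).mean()  # per-sample ||.||^2
    ce = F.cross_entropy(y_curr, y_syn)
    return mse + lambda_ce * ce
```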
The Magic Ingredient: Data Generation Without Data
The success of Ex-Model Distillation depends on generating meaningful synthetic data. The paper proposes three approaches:
Model Inversion: Start from random noise and iteratively adjust pixels until the expert model confidently predicts a target class. In essence, ask the model: “Show me what you believe this class looks like.”
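A bare-bones version of that loop might look like this sketch; the input shape, step count, and learning rate are arbitrary assumptions, and the paper's version layers the image priors discussed below on top.

```python
import torch
import torch.nn.functional as F

def model_inversion(expert, target_class, shape=(16, 3, 32, 32),
                    steps=500, lr=0.05):
    """Optimize pixels from random noise until the frozen expert
    confidently predicts `target_class`."""
    expert.eval()
    x = torch.randn(shape, requires_grad=True)
    target = torch.full((shape[0],), target_class, dtype=torch.long)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(expert(x), target)  # push toward target class
        loss.backward()
        opt.step()
    return x.detach()
```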
Data Impression: A refined method using class similarities derived from the expert’s classifier weights. It samples “soft” targets from a Dirichlet distribution—capturing inter-class relationships like “80% dog, 15% cat.”
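The target-sampling step could be sketched as follows, assuming cosine similarities between the expert's final-layer weight vectors and an illustrative concentration scale `beta`:

```python
import torch
import torch.nn.functional as F

def sample_soft_targets(classifier_weight, target_class, beta=1.0, n=16):
    """Draw n soft label vectors for `target_class` from a Dirichlet whose
    concentration follows that class's similarity to every other class,
    as read off the expert's final-layer weights."""
    w = F.normalize(classifier_weight, dim=1)   # one weight vector per class
    sim = w @ w.t()                             # cosine similarity matrix
    conc = torch.clamp(sim[target_class], min=1e-3) * beta  # keep positive
    return torch.distributions.Dirichlet(conc).sample((n,))
```

Synthetic inputs are then optimized, much as in model inversion, so that the expert's outputs match these sampled soft targets.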
Auxiliary Data: The simplest approach leverages existing public datasets (e.g., ImageNet) and lets the expert label them. This avoids heavy computation but depends on suitable domain matches.
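A sketch of this variant is almost trivial: push the public data through the frozen expert and keep its logits as distillation targets (the loader and storage format here are our assumptions).

```python
import torch

@torch.no_grad()
def label_auxiliary_data(expert, aux_loader):
    """Collect (input, teacher-logits) pairs from a public dataset.
    The dataset's original labels are discarded; only the expert's
    outputs matter for distillation."""
    expert.eval()
    return [(x, expert(x)) for x, _ in aux_loader]
```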
To make synthetic samples more realistic, the authors add natural image priors: regularizations that encourage typical visual statistics and smoothness. These include augmentation, blur penalties, and matching batch-normalization statistics from the expert model, all of which reduce artifacts in the generated data.
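As one example, the batch-normalization prior can be sketched generically (this is our reconstruction, not the paper's code): capture each BN layer's input during a forward pass on synthetic images and penalize deviation from the layer's stored running statistics.

```python
import torch
import torch.nn as nn

def bn_matching_loss(expert, x):
    """Penalize the gap between the batch statistics induced by synthetic
    inputs x and the running statistics the expert stored during its
    original training. Forward hooks capture each BN layer's input."""
    feats, hooks = [], []

    def capture(module, inputs, output):
        feats.append((module, inputs[0]))

    for m in expert.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(capture))
    expert(x)  # one forward pass populates `feats`
    loss = x.new_zeros(())
    for bn, f in feats:
        mu = f.mean(dim=(0, 2, 3))
        var = f.var(dim=(0, 2, 3), unbiased=False)
        loss = loss + (mu - bn.running_mean).pow(2).sum() \
                    + (var - bn.running_var).pow(2).sum()
    for h in hooks:
        h.remove()
    return loss
```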
Putting It to the Test: Experiments and Results
To validate ExML, the researchers tested it on several datasets and learning scenarios.
Experimental Setup
- Datasets: MNIST (handwritten digits), CIFAR-10 (natural images), and CORe50 (object recognition benchmark).
- Scenarios:
  - New Classes (NC): Each experience introduces unseen classes.
  - New Instances (NI): Same classes, different appearances (backgrounds, poses).
  - Multi-Task (MT): Tasks remain distinct with labels provided at test time.

Table 1: Datasets and scenarios used for evaluation.
Three ED variants—Model Inversion ED, Data Impression ED, and Auxiliary Data ED—were compared against various baselines, including an Oracle ensemble (the ideal upper bound) and simple Parameter Averaging.
Key Findings
Overall results reveal three major insights.

Table 2: Accuracy results for MNIST and CIFAR-10. Ex-Model strategies excel in non-continual (“Joint”) settings but face challenges in class-incremental (NC) scenarios.
| Strategy | Ex-model scenario | Joint | CORe50 NC | CORe50 NI |
|---|---|---|---|---|
| Oracle | × | — | 85.73±0.29 | 96.04±1.08 |
| Ensemble Avg. | × | — | 26.30±1.38 | 69.92±0.70 |
| Min. Entropy | × | — | 42.41±0.96 | 61.36±1.86 |
| Param. Avg. | ✓ | — | 2.00±0.00 | 2.00±0.00 |
| Model Inversion ED | ✓ | 50.06±2.76 | 33.10±1.93 | 44.38±4.93 |
| Data Impression ED | ✓ | 52.91±2.09 | 17.57±3.57 | 43.26±2.36 |
| Aux. Data ED | ✓ | 81.82±0.29 | 34.87±1.16 | 44.51±2.91 |
Table 3: Results for CORe50 scenarios.
- Data-Free Distillation Works: In joint settings, ED models reach near-expert performance—proving that data-free knowledge transfer is possible.
- Continual Scenarios Are Hard: Performance significantly drops for class-incremental cases, underscoring the inherent difficulty of continual learning.
- Auxiliary Data Helps (When Similar): Auxiliary data performs well when its domain resembles the expert’s (e.g., ImageNet for natural images). In mismatched domains, generation-based methods fare better.
A Picture is Worth a Thousand Pixels
Why does continual learning degrade performance? The generated images offer a clue.

Figure 2: Synthetic samples for MNIST. Joint setting (b, c) yields recognizable digits; continual Split setting (d, e) produces noisy images, reflecting strong overfitting.
Experts trained on limited classes become overly confident and generate unrealistic images. In joint training, the model captures broad representations; in continual setups, narrow class scopes cause distorted samples—making distillation less effective.
The Limits of Synthetic Data
Does a larger buffer of synthetic samples improve learning?

Figure 3: Accuracy vs. buffer size for CIFAR10-MT. Performance plateaus for generation-based methods due to low sample diversity, while using real data (Replay ED, blue) scales better.
The results show diminishing returns: more synthetic samples don’t necessarily improve accuracy. Replay ED, with minimal real data, scales smoothly—highlighting that existing synthetic generation techniques lack diversity. Improving variety in generated patterns is a key future direction.
Conclusion: A New Frontier for Continual Learning
The paper “Ex-Model: Continual Learning from a Stream of Trained Models” introduces a transformative idea—Ex-Model Continual Learning (ExML)—a framework for learning continually from trained models rather than from raw data.
Through Ex-Model Distillation (ED) strategies, the approach demonstrates that knowledge transfer without data is not only possible, but already promising. While results in difficult class-incremental scenarios leave room for improvement, the framework lays the groundwork for scalable, privacy-friendly continual learning.
The implications are far-reaching. In healthcare, ExML could enable institutions to share diagnostic models without sharing sensitive patient data. In distributed and federated learning, independent agents could exchange distilled expertise instead of large datasets.
ExML doesn’t replace traditional continual learning—it complements it. In a world increasingly filled with expert “neural skills,” ExML offers a path to learn collaboratively and securely. It’s an invitation for AI to start learning from the masters themselves.