In the world of artificial intelligence, one of the holy grails is building systems that can learn continuously—without forgetting what they already know. Humans do this naturally: when we learn about dogs, we don’t forget everything we know about cats. That ability to acquire new knowledge while retaining prior understanding is the essence of Continual Learning (CL).
Traditionally, CL research has emphasized an agent-centric view, in which a single AI agent directly learns from a stream of raw data. This is powerful, but it doesn’t reflect our modern, interconnected ecosystem where trained models abound. From vision systems that classify images to language models powering chatbots, these existing models embody compressed expertise—what the authors of the research paper call neural skills.
This abundance raises a key question central to the paper “Ex-Model: Continual Learning from a Stream of Trained Models”: Why not learn directly from these expert models instead of starting from raw data?
Much like humans learning from teachers or textbooks rather than reinventing the wheel through trial and error, artificial agents could learn more efficiently, privately, and scalably by studying other models.
This idea forms the foundation of a new framework called Ex-Model Continual Learning (ExML). In ExML, an agent learns not from the original data, but from a stream of previously trained expert models. This shift promises AI systems that are both privacy-preserving and collaborative.

Figure 1: Comparison between traditional Continual Learning (left), which learns from a stream of raw data, and Ex-Model Continual Learning (right), which learns from a stream of expert models.
In this article, we’ll unpack the ExML paradigm, explore the algorithms that make it work, and discuss the experiments that demonstrate its potential.
From Raw Data to Expert Models: Redefining Continual Learning
To see what makes ExML revolutionary, let’s briefly recap the traditional continual learning setup.
A typical CL algorithm, denoted \( \mathcal{A}^{CL} \), processes a sequence of learning experiences \( S = e_1, e_2, \ldots, e_n \). Each experience \( e_i \) carries a batch of data \( \mathcal{D}^i \), split into training and test portions \( \mathcal{D}_{train}^{i} \) and \( \mathcal{D}_{test}^{i} \). The algorithm updates its current model \( f_{i-1}^{CL} \) to produce the next version \( f_i^{CL} \):
\[ \mathcal{A}^{CL}: \langle f_{i-1}^{CL}, \mathcal{D}_{train}^{i}, \mathcal{M}_{i-1}, t_i \rangle \to \langle f_{i}^{CL}, \mathcal{M}_i \rangle. \]
Here, \( \mathcal{M}_{i-1} \) is a memory buffer that may hold past samples, and \( t_i \) is an optional task label. The overall goal is to minimize the loss across all experiences:
\[ \mathcal{L}_{S}(f_n^{CL}, n) = \frac{1}{\sum_{i=1}^{n} |\mathcal{D}_{test}^{i}|} \sum_{i=1}^{n} \mathcal{L}_{exp}(f_n^{CL}, \mathcal{D}_{test}^{i}). \]
In the ExML scenario, things change drastically. We no longer have direct access to the data stream \( \mathcal{D}_1, \ldots, \mathcal{D}_n \). Instead, our stream consists of expert models \( f_1^S, f_2^S, \ldots, f_n^S \), each trained independently on its own dataset.
This introduces two defining constraints:
- No Access to Original Data: The learning algorithm can only interact with trained models, not the data that created them—enhancing privacy.
- Limited Memory and Computation: The learner can’t store all experts; models must be integrated efficiently as they arrive.
The goal remains: develop a single consolidated model performing well across all tasks. An ExML algorithm, \( \mathcal{A}^{ExM} \), can be described as:
\[ \mathcal{A}^{ExM}: \langle f_{i-1}^{ExM}, f_i^S, \mathcal{M}_{i-1}^{ex}, t_i \rangle \to \langle f_i^{ExM}, \mathcal{M}_i^{ex} \rangle. \]
Here, \( f_i^{ExM} \) is the continually learned model, \( f_i^S \) the new expert, and \( \mathcal{M}_i^{ex} \) a memory buffer—filled not with raw data, but with surrogate samples crucial for learning.
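To make this interface concrete, here is a minimal Python sketch of the ExML outer loop. The names (`run_ex_model_stream`, `step`) are illustrative stand-ins rather than the paper's code; the point is simply that the learner consumes experts, never their datasets.

```python
from typing import Callable, List, Tuple

Model = object  # stand-in type: any trained network

def run_ex_model_stream(
    f_init: Model,
    experts: List[Model],
    step: Callable[[Model, Model, list], Tuple[Model, list]],
) -> Model:
    """Fold a stream of experts f_1^S, ..., f_n^S into one model f^ExM.
    `step` implements A^ExM: it sees only the current model, one expert,
    and the surrogate buffer M^ex, never the experts' training data."""
    model, buffer = f_init, []  # `buffer` plays the role of M^ex
    for expert in experts:
        model, buffer = step(model, expert, buffer)
    return model
```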
The Core Method: Learning via Ex-Model Distillation
How can we transfer knowledge from an expert model without ever seeing its training data?
The authors introduce a family of algorithms called Ex-Model Distillation (ED). These methods rely on knowledge distillation—a process where a “student” learns to mimic a “teacher” model’s outputs. Since ExML lacks direct data, the system must first create surrogate data.
Each learning step involves two primary phases: updating this synthetic buffer and distilling knowledge from the experts.
Step 1: Creating a Surrogate Dataset
The ED algorithm maintains a fixed-size buffer \( \mathcal{M}^{ex} \) containing surrogate data—stand-ins for all previously seen experiences. When a new expert model \( f_i^S \) arrives, a data generator \( \mathcal{A}^{gen} \) produces synthetic examples:
\[ \mathcal{D}_{i}^{ex} = \mathcal{A}^{gen}\left(f_i^{S}, \frac{N}{i}\right), \]
where \( N \) is the total buffer size.
Older samples are partially replaced, keeping the buffer size constant. This way, the model always retains useful knowledge while incorporating fresh synthetic data.
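As a rough sketch of this policy, assuming an equal per-experience quota and a `generate` callable standing in for \( \mathcal{A}^{gen} \) (both our simplifications):

```python
def update_buffer(buffer, expert, generate, N, i):
    """Keep M^ex at a constant size N: after the i-th expert arrives, each
    experience seen so far retains roughly N/i surrogate samples.
    `buffer` is a list of per-experience sample lists; `generate(expert, k)`
    stands in for the data generator A^gen."""
    quota = N // i
    # Shrink every stored experience to the new per-experience quota.
    buffer = [samples[:quota] for samples in buffer]
    # Append freshly generated surrogate samples for the new expert.
    buffer.append(generate(expert, quota))
    return buffer
```

With \( N = 1200 \), for example, the third expert contributes 400 fresh samples while the two earlier experiences are trimmed to 400 each.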
Step 2: Distilling Knowledge from Two Teachers
Next, the model learns from two teachers:
- The previous Ex-Model (\( f_{i-1}^{ExM} \)), which carries historical knowledge.
- The current expert (\( f_i^S \)), which provides new insights.
For each synthetic sample \( x^{syn} \), the algorithm combines the outputs of both teachers to form the target logits \( \tilde{\boldsymbol{y}} \):

Equations 6–8: If the sample’s class belongs to past tasks, use the old model; if it’s new, use the expert; if overlapping, average both.
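Written out, a simplified reconstruction of the paper's Equations 6–8 might read as follows, where \( Y_{old} \) and \( Y_i \) denote the class sets covered by the previous ex-model and the new expert (our notation; the paper's exact form may differ):

\[ \tilde{\boldsymbol{y}} = \begin{cases} f_{i-1}^{ExM}(x^{syn}) & \text{if } y^{syn} \in Y_{old} \setminus Y_i \\ f_i^{S}(x^{syn}) & \text{if } y^{syn} \in Y_i \setminus Y_{old} \\ \frac{1}{2}\big(f_{i-1}^{ExM}(x^{syn}) + f_i^{S}(x^{syn})\big) & \text{if } y^{syn} \in Y_{old} \cap Y_i \end{cases} \]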
To train the student model \( f_i^{ExM} \), the authors use a hybrid loss function:
\[ \mathcal{L}_{ED}(\boldsymbol{y}^{curr}, \tilde{\boldsymbol{y}}, y^{syn}) = \|\boldsymbol{y}^{curr} - \tilde{\boldsymbol{y}}\|_2^2 + \lambda_{CE}\mathcal{L}_{CE}(\boldsymbol{y}^{curr}, y^{syn}), \]
combining Mean Squared Error (MSE) to match outputs and Cross-Entropy (CE) to guide correct classification. This allows continual learning without raw data access.
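In PyTorch-style code, the objective might look like the sketch below; the function and argument names are ours, and `lambda_ce` plays the role of \( \lambda_{CE} \).

```python
import torch.nn.functional as F

def ed_loss(y_curr, y_target, y_syn, lambda_ce=1.0):
    """Sketch of L_ED: an MSE term pulls the student's logits y_curr toward
    the combined teacher logits y_target, while a CE term ties predictions
    to the surrogate labels y_syn (integer class indices)."""
    mse = (y_curr - y_target).pow(2).sum(dim=1).mean()  # per-sample ||.||^2
    ce = F.cross_entropy(y_curr, y_syn)
    return mse + lambda_ce * ce
```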
The Magic Ingredient: Data Generation Without Data
The success of Ex-Model Distillation depends on generating meaningful synthetic data. The paper proposes three approaches:
Model Inversion: Start from random noise and iteratively adjust pixels until the expert model confidently predicts a target class. In essence, ask the model: “Show me what you believe this class looks like.”
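A bare-bones version of that loop might look like this sketch; the input shape, step count, and learning rate are arbitrary assumptions, and the paper's version layers the image priors discussed below on top.

```python
import torch
import torch.nn.functional as F

def model_inversion(expert, target_class, shape=(16, 3, 32, 32),
                    steps=500, lr=0.05):
    """Optimize pixels from random noise until the frozen expert
    confidently predicts `target_class`."""
    expert.eval()
    x = torch.randn(shape, requires_grad=True)
    target = torch.full((shape[0],), target_class, dtype=torch.long)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(expert(x), target)  # push toward target class
        loss.backward()
        opt.step()
    return x.detach()
```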
Data Impression: A refined method using class similarities derived from the expert’s classifier weights. It samples “soft” targets from a Dirichlet distribution—capturing inter-class relationships like “80% dog, 15% cat.”
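The target-sampling step could be sketched as follows, assuming cosine similarities between the expert's final-layer weight vectors and an illustrative concentration scale `beta`:

```python
import torch
import torch.nn.functional as F

def sample_soft_targets(classifier_weight, target_class, beta=1.0, n=16):
    """Draw n soft label vectors for `target_class` from a Dirichlet whose
    concentration follows that class's similarity to every other class,
    as read off the expert's final-layer weights."""
    w = F.normalize(classifier_weight, dim=1)   # one weight vector per class
    sim = w @ w.t()                             # cosine similarity matrix
    conc = torch.clamp(sim[target_class], min=1e-3) * beta  # keep positive
    return torch.distributions.Dirichlet(conc).sample((n,))
```

Synthetic inputs are then optimized, much as in model inversion, so that the expert's outputs match these sampled soft targets.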
Auxiliary Data: The simplest approach leverages existing public datasets (e.g., ImageNet) and lets the expert label them. This avoids heavy computation but depends on suitable domain matches.
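A sketch of this variant is almost trivial: push the public data through the frozen expert and keep its logits as distillation targets (the loader and storage format here are our assumptions).

```python
import torch

@torch.no_grad()
def label_auxiliary_data(expert, aux_loader):
    """Collect (input, teacher-logits) pairs from a public dataset.
    The dataset's original labels are discarded; only the expert's
    outputs matter for distillation."""
    expert.eval()
    return [(x, expert(x)) for x, _ in aux_loader]
```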
To make synthetic samples more realistic, the authors add natural image priors: regularizations that encourage typical visual statistics and smoothness. These include augmentation, blur penalties, and matching batch-normalization statistics from the expert model, all of which reduce artifacts in the generated data.
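As one example, the batch-normalization prior can be sketched generically (this is our reconstruction, not the paper's code): capture each BN layer's input during a forward pass on synthetic images and penalize deviation from the layer's stored running statistics.

```python
import torch
import torch.nn as nn

def bn_matching_loss(expert, x):
    """Penalize the gap between the batch statistics induced by synthetic
    inputs x and the running statistics the expert stored during its
    original training. Forward hooks capture each BN layer's input."""
    feats, hooks = [], []

    def capture(module, inputs, output):
        feats.append((module, inputs[0]))

    for m in expert.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(capture))
    expert(x)  # one forward pass populates `feats`
    loss = x.new_zeros(())
    for bn, f in feats:
        mu = f.mean(dim=(0, 2, 3))
        var = f.var(dim=(0, 2, 3), unbiased=False)
        loss = loss + (mu - bn.running_mean).pow(2).sum() \
                    + (var - bn.running_var).pow(2).sum()
    for h in hooks:
        h.remove()
    return loss
```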
Putting It to the Test: Experiments and Results
To validate ExML, the researchers tested it on several datasets and learning scenarios.
Experimental Setup
- Datasets: MNIST (handwritten digits), CIFAR-10 (natural images), and CORe50 (object recognition benchmark).
- Scenarios:
  - New Classes (NC): Each experience introduces unseen classes.
  - New Instances (NI): Same classes, different appearances (backgrounds, poses).
  - Multi-Task (MT): Tasks remain distinct with labels provided at test time.

Table 1: Datasets and scenarios used for evaluation.
Three ED variants—Model Inversion ED, Data Impression ED, and Auxiliary Data ED—were compared against various baselines, including an Oracle ensemble (the ideal upper bound) and simple Parameter Averaging.
Key Findings
Overall results reveal three major insights.

Table 2: Accuracy results for MNIST and CIFAR-10. Ex-Model strategies excel in non-continual (“Joint”) settings but face challenges in class-incremental (NC) scenarios.
| Strategy | Ex-model scenario | Joint | CORe50 NC | CORe50 NI |
|---|---|---|---|---|
| Oracle | × | — | 85.73±0.29 | 96.04±1.08 |
| Ensemble Avg. | × | — | 26.30±1.38 | 69.92±0.70 |
| Min. Entropy | × | — | 42.41±0.96 | 61.36±1.86 |
| Param. Avg. | ✓ | — | 2.00±0.00 | 2.00±0.00 |
| Model Inversion ED | ✓ | 50.06±2.76 | 33.10±1.93 | 44.38±4.93 |
| Data Impression ED | ✓ | 52.91±2.09 | 17.57±3.57 | 43.26±2.36 |
| Aux. Data ED | ✓ | 81.82±0.29 | 34.87±1.16 | 44.51±2.91 |
Table 3: Results for CORe50 scenarios.
- Data-Free Distillation Works: In joint settings, ED models reach near-expert performance—proving that data-free knowledge transfer is possible.
- Continual Scenarios Are Hard: Performance significantly drops for class-incremental cases, underscoring the inherent difficulty of continual learning.
- Auxiliary Data Helps (When Similar): Auxiliary data performs well when its domain resembles the expert’s (e.g., ImageNet for natural images). In mismatched domains, generation-based methods fare better.
A Picture is Worth a Thousand Pixels
Why does continual learning degrade performance? The generated images offer a clue.

Figure 2: Synthetic samples for MNIST. Joint setting (b, c) yields recognizable digits; continual Split setting (d, e) produces noisy images, reflecting strong overfitting.
Experts trained on limited classes become overly confident and generate unrealistic images. In joint training, the model captures broad representations; in continual setups, narrow class scopes cause distorted samples—making distillation less effective.
The Limits of Synthetic Data
Does a larger buffer of synthetic samples improve learning?

Figure 3: Accuracy vs. buffer size for CIFAR10-MT. Performance plateaus for generation-based methods due to low sample diversity, while using real data (Replay ED, blue) scales better.
The results show diminishing returns: more synthetic samples don’t necessarily improve accuracy. Replay ED, with minimal real data, scales smoothly—highlighting that existing synthetic generation techniques lack diversity. Improving variety in generated patterns is a key future direction.
Conclusion: A New Frontier for Continual Learning
The paper “Ex-Model: Continual Learning from a Stream of Trained Models” introduces a transformative idea—Ex-Model Continual Learning (ExML)—a framework for learning continually from trained models rather than from raw data.
Through Ex-Model Distillation (ED) strategies, the approach demonstrates that knowledge transfer without data is not only possible, but already promising. While results in difficult class-incremental scenarios leave room for improvement, the framework lays the groundwork for scalable, privacy-friendly continual learning.
The implications are far-reaching. In healthcare, ExML could enable institutions to share diagnostic models without sharing sensitive patient data. In distributed and federated learning, independent agents could exchange distilled expertise instead of large datasets.
ExML doesn’t replace traditional continual learning—it complements it. In a world increasingly filled with expert “neural skills,” ExML offers a path to learn collaboratively and securely. It’s an invitation for AI to start learning from the masters themselves.