Imagine trying to teach one student to be a world-class physicist, a concert pianist, and a master chef—all at once. While they might pick up the basics of each, mastering one skill could interfere with the others. The delicate touch needed for a soufflé might not harmonize with the powerful chords of a concerto.

This analogy captures a core challenge in continual learning—building artificial intelligence systems that can acquire new skills sequentially without catastrophically forgetting old ones.

For years, researchers have trained a single, large neural network and hoped it could become a jack-of-all-trades. But each time a new task arrives, the network’s weights are updated—often overwriting previous knowledge. This “catastrophic forgetting” remains a persistent obstacle in lifelong learning systems.

A recent paper, Model Zoo: A Growing Brain That Learns Continually, by Rahul Ramesh and Pratik Chaudhari (University of Pennsylvania), reframes the problem. It argues that the challenge isn’t only forgetting—it’s task competition. Some tasks help each other learn, while others conflict and fight for the same model resources. Forcing all tasks to coexist in one network often produces mediocrity.

Their solution? Don’t force it. Build a Model Zoo—an ensemble of small, specialized models that grows over time, intelligently grouping cooperative tasks while keeping conflicting ones apart.

In this deep dive, we’ll unpack the theory behind task competition, explore the design of the Model Zoo algorithm, and review results that could redefine how we think about continual learning.


The Problem with “One Size Fits All” Learning

In continual learning, a model encounters a sequence of tasks. Ideally, this learner should:

  1. Avoid catastrophic forgetting — retain accuracy on previous tasks.
  2. Exhibit forward transfer — use past knowledge to learn new tasks faster.
  3. Exhibit backward transfer — let new knowledge improve older tasks.

Most existing methods try to solve this inside a single network. Techniques like Elastic Weight Consolidation (EWC) regularize important parameters, while replay-based methods like Gradient Episodic Memory (GEM) store old samples to remind the network of past tasks.

But the authors start with a deeper question: is combining multiple tasks in one model always beneficial?

Statistical learning theory suggests that more data—even from different but related tasks—should help generalization. Yet when tasks are dissimilar, the situation changes dramatically.


The Theory: When Good Tasks Go Bad

The paper’s theoretical framework defines a new way to measure task relatedness. It shows precisely when shared training helps, and when it hurts.

Synergistic Tasks — The Dream Team

Imagine two tasks: classifying images of cats, and classifying the same images rotated by 90°. These tasks are closely related—the same visual features apply, just transformed.

The authors show that if task inputs and outputs are simple transformations of each other, one shared representation can generalize across tasks. In this case, training together reduces the number of samples needed per task—a win for multi-task learning.

Competing Tasks — The Inevitable Conflict

Now suppose one task classifies Large Carnivores, while another identifies Household Furniture. These feature spaces—fur versus wood grain—have little overlap. Forcing the same model to handle both is inefficient and leads to task interference.

To describe this interplay, the paper introduces the transfer exponent, \( \rho_{ij} \), which measures how related or antagonistic two tasks are. A value of \( \rho_{ij} \) close to 1 indicates synergy; a large \( \rho_{ij} \) indicates competition.


Heatmap showing task competition in CIFAR-100 experiments. Warm colors indicate accuracy drops as additional tasks are introduced; cool colors indicate cooperative improvements.

Figure 2: Accuracy of different CIFAR-100 task groups as more tasks are trained together. Non‑monotonic changes—such as sharp drops after adding specific tasks—demonstrate task competition. Some tasks help each other; others conflict.


This insight yields the paper’s key theorem—a kind of “no free lunch” result for multi-task learning. When training a single model on \( k \) tasks, the error bound on a specific task depends on two forces:

  1. Competition Term: Captures disagreement among tasks—the more dissonant tasks you add, the greater the risk of increased error.
  2. Generalization Term: Benefits from more samples across tasks, reducing error. But this benefit weakens as \( \rho_{\max} \), the worst-case transfer exponent, grows.

The result is striking: adding more tasks isn’t always good. The data benefit can be completely canceled by competition among tasks.
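To make this tradeoff concrete, here is a toy numeric sketch—my own illustration, not the paper's actual bound—that models per-task error as a competition term growing with the number of tasks plus a generalization term shrinking with the pooled sample count:

```python
import math

def error_bound(k, n=1000, rho_max=1.0, competition=0.0):
    """Toy model of the two forces: a competition term that grows with
    the number of tasks, and a generalization term that shrinks as the
    pooled sample count n*k grows (weakened by a larger rho_max)."""
    competition_term = competition * (k - 1)          # disagreement among tasks
    generalization_term = rho_max / math.sqrt(n * k)  # benefit of pooled data
    return competition_term + generalization_term

# With synergistic tasks (no competition), adding tasks always helps:
synergy = [error_bound(k, competition=0.0) for k in (1, 2, 4, 8)]
assert synergy == sorted(synergy, reverse=True)  # strictly improving

# With competing tasks, the pooled-data benefit is quickly cancelled:
conflict = [error_bound(k, competition=0.05) for k in (1, 2, 4, 8)]
assert conflict[-1] > conflict[0]  # eight tasks end up worse than one
```

The constants here are arbitrary; the point is only the shape of the curve—error first falls with extra tasks, then rises once competition dominates.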


The Algorithm: From Theory to the “Model Zoo”

If a single model can’t juggle all tasks, the solution is specialization. The Model Zoo deliberately grows a collection of small models, each focused on synergistic groups of tasks.


Conceptual illustration of Model Zoo where models trained on different overlapping subsets of tasks form an ensemble.

Figure 3: Instead of forcing every task into a single model, Model Zoo grows multiple specialist models. Each new model is trained on a subset of past and current tasks that boost each other’s learning.


Here’s how Model Zoo works:

  1. A new task arrives. At episode \( k \), the learner receives data for task \( P_k \).
  2. Check the existing zoo. The current ensemble is evaluated on all previous tasks, identifying which ones still have high error.
  3. Assign task weights. Tasks with high loss receive greater weight, similar to AdaBoost’s strategy.

\[ w_i \;\propto\; \exp(\ell_i), \]

where \( \ell_i \) is the current ensemble loss on task \( P_i \)—tasks are weighted exponentially in their loss.

Tasks with higher loss get higher probability of being sampled next round. Difficult tasks are revisited more often.
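A minimal sketch of this weighting rule in Python—the function name and the normalization into sampling probabilities are illustrative assumptions, not the paper's exact formulation:

```python
import math

def task_weights(losses):
    """Exponentially up-weight tasks the current zoo still gets wrong,
    then normalize into sampling probabilities (AdaBoost-style)."""
    raw = [math.exp(loss) for loss in losses]
    total = sum(raw)
    return [w / total for w in raw]

# The task with the highest ensemble loss is the most likely to be revisited:
probs = task_weights([0.1, 0.9, 0.4])
assert probs[1] == max(probs)
assert abs(sum(probs) - 1.0) < 1e-12  # probabilities sum to one
```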

  4. Form a training subset. The learner samples a small group of past tasks—those with higher weights—and adds the new task.
  5. Train a new specialist model. This model learns only from its chosen subset.
  6. Add to the zoo. Once trained, it’s fixed and never overwritten.
  7. Predict by averaging ensemble members. When evaluating task \( i \), the learner averages predictions from every model trained on \( P_i \).

\[ \hat{y}(x) \;=\; \frac{1}{|S_i|} \sum_{f \in S_i} f(x), \]

where \( S_i \) is the set of zoo models that were trained on task \( P_i \).

At inference, predictions from all models trained on a given task are averaged—preserving memory while expanding capacity.
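In code, this inference rule might look like the following sketch, where each zoo member is represented as a plain dict; the structure and names are hypothetical, for illustration only:

```python
def zoo_predict(zoo, task_id, x):
    """Average the predictions of every zoo member trained on `task_id`."""
    preds = [m["predict"](x) for m in zoo if task_id in m["tasks"]]
    return sum(preds) / len(preds)

# Three specialists; only those trained on the queried task are consulted.
zoo = [
    {"tasks": {0, 1}, "predict": lambda x: 0.8},
    {"tasks": {0, 2}, "predict": lambda x: 0.6},
    {"tasks": {2, 3}, "predict": lambda x: 0.0},  # ignored for task 0
]
assert abs(zoo_predict(zoo, 0, x=None) - 0.7) < 1e-9  # mean of 0.8 and 0.6
```

Because old members are frozen, adding a new specialist can only extend the set \( S_i \) for the tasks it covers—capacity grows without disturbing prior predictions.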

Over time, the zoo expands, each new model enriching the collective capability. Tasks that once competed are handled by different specialists, and those that cooperate are co-trained effectively.


Experiments: Putting Model Zoo to the Test

The authors rigorously validate their approach across benchmarks such as Rotated‑MNIST, Split‑CIFAR10, Split‑CIFAR100, and Split‑miniImagenet.


Line plots comparing continual learning methods on Mini‑Imagenet, showing average accuracy and per‑task evolution.

Figure 1: Left — Average accuracy on Split‑miniImagenet for various methods. Model Zoo (top orange line) consistently outperforms prior methods by wide margins. Right — Accuracy for individual tasks over episodes, showing strong forward and backward transfer: old tasks improve as new ones are learned.


The Baseline That Shocked the Field

Before introducing Model Zoo’s results, the paper evaluates a simple baseline called Isolated—a new small model trained separately for each task, with no data sharing or replay.

Surprisingly, Isolated beats most contemporary continual learning algorithms. This finding reveals that many complex methods—while designed to mitigate forgetting—fail to overcome fundamental task competition.

Model Zoo Takes the Lead

When the full Model Zoo algorithm is used, its advantages become decisive.


Table summarizing average per‑task accuracy across multiple continual learning benchmarks. Model Zoo variants consistently achieve top performance.

Table 1: Model Zoo outperforms existing continual learning methods across datasets—including difficult ones like Split‑miniImagenet and Coarse‑CIFAR100—by up to 30%. Even small versions rival or exceed “multi‑task” upper bounds where all tasks are known a priori.


Measuring Forgetting and Transfer

Continual learning success isn’t just higher accuracy—it’s how knowledge evolves. The authors evaluate forgetting, forward transfer, and training efficiency on Split‑CIFAR100.


Comparison of continual learning metrics (accuracy, forgetting, forward transfer, efficiency) across leading methods and Model Zoo variants.

Table 2: Model Zoo and its variants show nearly zero forgetting, strong forward transfer, and efficient training times relative to regularization and replay‑based methods.


  • Forgetting: Almost zero—since old models aren’t updated, their learned weights remain intact.
  • Forward Transfer: New models rapidly leverage knowledge from prior ensemble members.
  • Efficient Training: Despite being an ensemble, training and inference times are comparable to, or faster than, those of many single-model methods.

Not Just an Ensemble

Could Model Zoo’s success be simply due to ensembling? The authors tested this by training a massive ensemble of Isolated models without inter-task collaboration.


Ablation studies comparing Model Zoo versus ensembles of isolated models, varying replay and task sampling.

Figure 4: Ablations show that Model Zoo’s performance arises from its intelligent co‑training of tasks—not merely ensembling many independent models.


The result: Model Zoo significantly outperforms the basic ensemble, confirming that its boosting‑inspired strategy of pairing challenging tasks drives the gains, not just aggregation.


Rethinking Continual Learning

Beyond impressive numbers, Model Zoo reframes how we approach lifelong learning.

  1. Task competition is fundamental. Some tasks simply don’t belong together inside one network. Recognizing and managing this relationship is key.
  2. Splitting capacity is powerful. Instead of endlessly tuning update rules within a fixed model, allowing the learner’s capacity to grow yields resilience and adaptability.
  3. Simple baselines matter. The Isolated learner’s strong performance shows that continual learning benchmarks should be grounded in careful, realistic baselines.

The analogy to biology is apt: the brain grows new neural circuits for new experiences rather than overwriting old ones. Model Zoo embodies this principle—a learning system that grows increasingly complex and competent as it faces new tasks.


Conclusion: A New Paradigm for Lifelong Learning

The Model Zoo paper delivers a compelling argument for reimagining continual learning.

  • Theoretical clarity: It quantifies task synergy and competition using the transfer exponent, showing that more tasks can sometimes hurt performance.
  • Algorithmic elegance: It introduces a simple, scalable method inspired by boosting that grows a set of specialized models.
  • Empirical strength: It achieves state‑of‑the‑art results, near‑zero forgetting, and efficient performance across diverse datasets.

Instead of dreaming of one model to rule them all, perhaps the future lies in learning how to cultivate a diverse community of models—a thriving ecosystem that, together, forms a continually expanding, ever‑learning brain.