Imagine teaching a child to identify animals. You start with cats. They get really good at it. Then you show them dogs. After a week of dog lessons, you show them a cat again, and they hesitate—“Is that a weird-looking dog?” This is a classic problem not just for kids but for artificial intelligence. It’s called catastrophic forgetting, and it’s one of the biggest hurdles to building AI that can learn continuously, just like we do.

In machine learning, this challenge sits at the heart of Continual Learning (CL): how can a model learn a sequence of new tasks without overwriting what it already knows? A single monolithic model often falters—its internal parameters are pulled in different directions by each new task until the old knowledge is lost.

But what if, instead of one overworked generalist, we had a team of specialists? That’s the idea behind the Mixture-of-Experts (MoE) architecture—a design that powers many advanced AI systems, including large language models. MoE works by maintaining multiple “expert” sub-networks along with a “router” that directs each data input to the most suitable expert.

This setup seems tailor-made for continual learning. One expert could handle cats, another dogs, another birds, and so on. While MoE has shown strong empirical results, the underlying reasons it works—and how it truly prevents forgetting—haven’t been theoretically well-understood.

That’s where the paper “Theory on Mixture-of-Experts in Continual Learning” makes a breakthrough. The authors provide a rigorous theoretical framework explaining why MoE helps in continual learning, how it does so, and—interestingly—when to stop training the router. Their analysis connects intuition with mathematical proof, revealing a new principle that makes MoE work efficiently in dynamic learning environments.

In this article, we’ll unpack their key insights:

  • How the MoE model operates in continual learning.
  • The training designs that let experts specialize while the router learns to make smart decisions.
  • The surprising, critical observation that the router must eventually stop training for stability.
  • The theoretical results showing MoE’s impact on catastrophic forgetting and generalization.
  • Experimental evidence—from synthetic data and real-world scenarios—that confirms these theories.

Whether you’re a student exploring machine learning theory or a practitioner refining your continual learning systems, this deep dive should give you clear, actionable insight into how a “team of specialists” prevents your AI from forgetting.


Continual Learning and Catastrophic Forgetting

Continual Learning (CL) is an approach where models encounter new tasks over time. For example, a robot might learn to recognize different objects as it navigates new environments, or a language model may gradually absorb new topics.

The goal of CL is to accumulate knowledge continuously—new learning should build upon the old rather than replace it. But standard neural networks often fail this test due to catastrophic forgetting: as the model learns new tasks, updates to its parameters corrupt the representations of previously learned ones. The more diverse the task sequence, the worse the interference becomes. A network trained on birds, then cars, then flowers will likely forget its bird-recognition abilities once it tunes itself to detect car features.
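To see the mechanism concretely, here is a toy sketch in the overparameterized linear setting the paper analyzes later: a single model is fit to task A, then to task B, and its error on task A is measured again. The minimal-change update used here anticipates the update rule from the paper’s framework described below; the dimensions and random tasks are arbitrary illustrative choices.

```python
import numpy as np

# Toy illustration: one overparameterized linear model trained on task A,
# then task B, with the minimal-change (interpolating) update.
rng = np.random.default_rng(0)
d, s = 32, 8                                         # many more parameters than samples
w_a, w_b = rng.normal(size=d), rng.normal(size=d)    # two unrelated ground-truth tasks

def min_norm_fit(w, X, y):
    """Smallest change to w that makes the model fit (X, y) exactly."""
    return w + X @ np.linalg.solve(X.T @ X, y - X.T @ w)

Xa, Xb = rng.normal(size=(d, s)), rng.normal(size=(d, s))
ya, yb = Xa.T @ w_a, Xb.T @ w_b

w = min_norm_fit(np.zeros(d), Xa, ya)
print("task A error after learning A:", np.linalg.norm(Xa.T @ w - ya))   # ~0
w = min_norm_fit(w, Xb, yb)
print("task A error after learning B:", np.linalg.norm(Xa.T @ w - ya))   # > 0: forgetting
```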


Enter Mixture-of-Experts (MoE)

The Mixture-of-Experts model offers a simple yet brilliant architectural fix. Instead of one unified network, MoE divides responsibility among multiple expert modules, with a router controlling which expert sees which data.

  1. Experts (M total): Each expert is an independent neural subnetwork that can specialize in particular types of tasks.
  2. Gating Network (Router): This small network inspects the input and decides which expert should handle it.

A schematic of MoE routing showing the gating network selecting one expert among several based on input signals.

Figure 1: Illustration of the Mixture-of-Experts model. The gating network routes each input or task to the most suitable expert.

When a task arrives, the router assigns it to the expert with the highest score. In practice, a top-1 switch routing strategy is common: only the best-matched expert is selected for training, while the others remain untouched—preserving their task-specific knowledge. This selective training effectively isolates learning signals and dramatically reduces forgetting.
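As a minimal sketch (not the paper’s implementation), top-1 switch routing with a simple linear gate looks like this; `gate_weights` and the scoring rule are illustrative assumptions.

```python
import numpy as np

def top1_route(x, gate_weights):
    """Return the index of the single highest-scoring expert for input x.

    gate_weights: (num_experts, dim) array, one scoring vector per expert.
    """
    scores = gate_weights @ x        # one score per expert
    return int(np.argmax(scores))    # top-1 ("switch") selection: only this expert trains

rng = np.random.default_rng(0)
gate_weights = rng.normal(size=(4, 8))     # 4 experts, 8-dimensional inputs
x = rng.normal(size=8)
print("routed to expert", top1_route(x, gate_weights))
```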


The Paper’s Framework: A Simplified Theoretical Model

To understand the dynamics of MoE mathematically, the researchers analyzed it under overparameterized linear regression, a simplified but powerful setting that has proven to capture many essential properties of neural networks.

Here’s the theoretical setup:

  • Continual Training: The learning evolves over T rounds. In each round t, a new task arrives.
  • Task Pool (Knowledge Base): There are N distinct ground-truth models. Each new task corresponds to one of these N models.
  • Distinct Task Signals: Each task carries a unique feature pattern (a “signal”) in its data distribution that the router can eventually recognize.

This formulation lets the team mathematically trace what happens as tasks arrive and as both experts and the router update.
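To make this setup concrete, here is a hedged sketch of how such a task stream could be simulated; the dimensions, the noiseless labels, and the crude "signal" feature are illustrative assumptions rather than the paper’s exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, s = 32, 8                 # feature dimension >> samples per task (overparameterized)
N = 6                        # N ground-truth models in the hidden task pool

ground_truth = rng.normal(size=(N, d))     # the knowledge base of ground-truth models

def sample_task():
    """One round's task: pick a ground-truth model, then generate (X, y)."""
    n = int(rng.integers(N))               # hidden task identity
    X = rng.normal(size=(d, s))            # d x s data matrix, columns are samples
    X[n, :] += 3.0                         # crude stand-in for the task's distinct signal
    y = X.T @ ground_truth[n]              # labels from the task's ground-truth model
    return n, X, y

for t in range(3):
    n, X, y = sample_task()
    print(f"round {t}: task drawn from ground-truth model {n}")
```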


How the MoE Learns During Each Round

Each training round has four steps:

  1. Task Arrival: A dataset \( \mathcal{D}_t = (X_t, y_t) \) arrives, corresponding to one task from the hidden pool.

  2. Routing Decision: The gating network computes scores \( h_m(X_t, \theta_t^{(m)}) \) for each expert \( m \in [M] \). To keep routing dynamic and exploratory, small random noise \( r_t^{(m)} \sim \mathrm{Unif}[0, \lambda] \) is added:

    \[ m_t = \arg\max_m \{ h_m(X_t, \theta_t^{(m)}) + r_t^{(m)} \}. \]

    The highest-scoring expert \( m_t \) is selected.

  3. Expert Update: The chosen expert receives the task data and updates its parameters. Because the model is overparameterized, many parameter settings can fit the data perfectly; the update picks the one closest to the expert’s current weights (the minimal-change principle):

    \[ \boldsymbol{w}_t^{(m_t)} = \boldsymbol{w}_{t-1}^{(m_t)} + \mathbf{X}_t(\mathbf{X}_t^\top \mathbf{X}_t)^{-1}(\mathbf{y}_t - \mathbf{X}_t^\top\boldsymbol{w}_{t-1}^{(m_t)}). \]

    Other experts remain unchanged.

  4. Router Adjustment: Finally, the router updates its parameters to improve future expert selection. Here is where the paper introduces two key design innovations, described next. (A minimal sketch of one full round follows this list.)
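Putting steps 1 through 3 together, here is the promised sketch of one training round under stated assumptions: a linear gate scored on the mean input, Unif[0, λ] exploration noise, and the minimal-change expert update. The router’s own gradient step (step 4) is omitted here and sketched in the next sections.

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, M = 32, 8, 4                        # dimension, samples per task, number of experts
lam = 0.1                                 # scale of the exploration noise

experts = np.zeros((M, d))                # expert weights w^{(m)}
gate = 0.01 * rng.normal(size=(M, d))     # illustrative linear gate parameters theta^{(m)}

def train_one_round(X, y):
    """Route one task (X: d x s, y: length s) and update only the chosen expert."""
    # Step 2: noisy top-1 routing on a summary of the inputs
    scores = gate @ X.mean(axis=1) + rng.uniform(0.0, lam, size=M)
    m = int(np.argmax(scores))

    # Step 3: minimal-change update of the selected expert; the others stay frozen
    experts[m] += X @ np.linalg.solve(X.T @ X, y - X.T @ experts[m])
    return m

w_true = rng.normal(size=d)
X = rng.normal(size=(d, s))
y = X.T @ w_true
m = train_one_round(X, y)
print("expert", m, "fits the new task:", bool(np.allclose(X.T @ experts[m], y)))
```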


Key Design I: Multi-Objective Router Training

To train the router effectively, the authors introduced a loss that balances expert specialization and task distribution fairness.

1. Locality Loss

Encourages the router to send similar tasks to the same expert:

\[ \mathcal{L}_t^{loc}(\boldsymbol{\Theta}_t, \mathcal{D}_t) = \sum_{m \in [M]} \pi_m(\mathbf{X}_t, \boldsymbol{\Theta}_t)\|\boldsymbol{w}_t^{(m)} - \boldsymbol{w}_{t-1}^{(m)}\|_2. \]

Here, \( \pi_m \) are the router’s softmax probabilities. Because only the selected expert actually changes, minimizing this loss steers routing probability away from experts that needed a large parameter correction and toward experts that already nearly fit the task, so tasks of a similar nature end up grouped under the same specialist.
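Under the same illustrative gate as before, the locality loss can be computed directly from its definition; the mean-input gate score is an assumption, and in a full training loop this value would feed the router’s gradient step.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def locality_loss(X, experts_new, experts_old, gate):
    """L_t^loc = sum_m pi_m(X) * ||w_t^{(m)} - w_{t-1}^{(m)}||_2.

    experts_new / experts_old: (M, d) expert weights after / before this round.
    Since only the routed expert actually moves, the sum collapses to
    pi_{m_t} times that expert's parameter shift.
    """
    pi = softmax(gate @ X.mean(axis=1))                         # router probabilities
    shifts = np.linalg.norm(experts_new - experts_old, axis=1)  # per-expert weight change
    return float(pi @ shifts)
```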

2. Auxiliary Loss (Load Balancing)

Prevents the router from overusing a few experts:

\[ \mathcal{L}^{aux}_t(\boldsymbol{\Theta}_t, \mathcal{D}_t) = \alpha M\sum_{m \in [M]} f_t^{(m)}P_t^{(m)}. \]

In the standard form of this load-balancing term, \( f_t^{(m)} \) tracks the fraction of tasks actually dispatched to expert \( m \) and \( P_t^{(m)} \) the average routing probability the gate assigns to it, with \( \alpha \) a weighting coefficient. The term is smallest when tasks are spread evenly, so every expert keeps receiving attention. The total router loss is a weighted sum of the locality and auxiliary terms, and the router’s parameters are updated via gradient descent.
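A hedged sketch of the balancing term, reading \( f_t^{(m)} \) as the empirical dispatch fraction and \( P_t^{(m)} \) as the average routing probability (the standard interpretation of this loss form, which may differ in detail from the paper):

```python
import numpy as np

def aux_loss(dispatch_counts, prob_sums, rounds_so_far, alpha=0.01):
    """L_t^aux = alpha * M * sum_m f^{(m)} * P^{(m)}.

    dispatch_counts[m]: number of tasks expert m has actually received.
    prob_sums[m]: accumulated routing probability assigned to expert m.
    The product sum is smallest when tasks are spread evenly over experts.
    """
    M = len(dispatch_counts)
    f = np.asarray(dispatch_counts) / rounds_so_far   # empirical dispatch fractions
    P = np.asarray(prob_sums) / rounds_so_far         # average routing probabilities
    return alpha * M * float(f @ P)
```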


Key Design II: Early Termination—When to Stop Updating the Router

Here lies one of the paper’s most surprising insights: in continual learning, the router must stop learning after a certain point.

In standard training, we keep updating until convergence. But in an online continual setting, continuous router updates eventually destabilize the system. The balancing loss makes expert scores too similar, and small random noise then causes the router to misassign tasks. Routing errors create cross-task interference—undoing the benefit of specialization and reintroducing forgetting.

Thus, the researchers propose an early termination strategy:

  • Allow router updates for an initial exploration phase (\(T_1 = \lceil \eta^{-1}M \rceil\) rounds).
  • Then monitor output gaps between experts.
  • Once the router consistently separates task-specialized groups (diverse experts with clear score gaps), freeze its parameters permanently.

After termination, the router uses its learned structure to maintain balanced loads without further updates, ensuring convergence and preventing forgetting.
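A hedged sketch of the freezing rule: allow the router to update during an exploration window, then stop once its per-round expert scores stay clearly separated. The top-score gap statistic and threshold below are illustrative stand-ins for the paper’s precise condition.

```python
import numpy as np

def should_freeze_router(score_history, explore_rounds, gap_threshold):
    """Return True once the router's scores have stayed well separated.

    score_history: list of per-round router score vectors (length-M each).
    explore_rounds: minimum exploration length, e.g. on the order of M / eta.
    gap_threshold: required gap between the top expert and the runner-up.
    """
    if len(score_history) < explore_rounds:
        return False                                  # still exploring
    recent = np.array(score_history[-explore_rounds:])
    top_two = -np.sort(-recent, axis=1)[:, :2]        # top-1 and runner-up each round
    gaps = top_two[:, 0] - top_two[:, 1]
    return bool(np.all(gaps > gap_threshold))         # consistent separation: freeze

# Once this returns True, the training loop simply skips the router's gradient
# step from then on; expert updates and the random tie-breaking noise continue.
```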


Theoretical Results: What the Mathematics Says

The authors derived several formal propositions and theorems explaining MoE behavior.

Expert Specialization (Proposition 1)

After the initial exploration phase:

  • If \( M > N \): each expert converges to a single task (ground-truth model) from the pool.
  • If \( M < N \): each expert specializes in a cluster of similar tasks.

Once convergence occurs, the experts’ weights stop changing, effectively locking in their learned knowledge.

The Need for Termination (Proposition 2)

  • After the exploration phase, the router’s output gap between experts specializing in different task groups is large (\( \Theta(\sigma_0^{0.75}) \)), while the gap between experts within the same group is small (\( \mathcal{O}(\sigma_0^{1.75}) \)).
  • If router training keeps going, these gaps vanish; all experts’ scores become indistinguishable, causing misrouting and performance decay. Thus, the router must stop updating after sufficient rounds to preserve these separations.

Load Balancing and Stability (Proposition 3)

After termination, random perturbations \(r_t^{(m)}\) make sure all experts within a specialization set are used equally. This balances the system’s computational load while keeping task assignments correct.


Forgetting and Generalization: Quantifying MoE’s Advantage

The authors formalize two metrics:

  1. Forgetting: Measures the drop in performance on past tasks after learning new ones.

  2. Generalization Error: Measures the model’s overall error across all tasks at the end of training.
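As a rough sketch under stated assumptions, the two metrics can be computed from per-task test losses recorded during training; the exact normalization the paper uses may differ.

```python
import numpy as np

def forgetting(loss_when_learned, loss_at_end):
    """Average increase in a task's test loss between learning it and the final round.

    loss_when_learned[i]: test loss on task i right after task i was trained.
    loss_at_end[i]:       test loss on task i after the last round.
    """
    return float(np.mean(np.asarray(loss_at_end) - np.asarray(loss_when_learned)))

def generalization_error(loss_at_end):
    """Average final test loss over all tasks seen during training."""
    return float(np.mean(loss_at_end))
```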

Compared to a single expert baseline:

  • Forgetting drops to near zero after the specialization phase. MoE learns each task in isolation, with no destructive interference.
  • Generalization error remains small and stable, even as new tasks arrive.

Critically, the analysis also shows that adding too many experts can slow convergence without improving results. Beyond a sufficient number, the system wastes time exploring redundant experts.


Experimental Evidence

Synthetic Data Validation

The first experiment tested the theoretical need for early termination of router updates.

Four-panel figure comparing “with termination” vs “without termination” for different numbers of experts across 2000 rounds.

Figure 2: Forgetting and generalization error dynamics with and without termination. Here N = 6, K = 3 clusters, and \( M \in \{1, 5, 10, 20\} \).

Observation:

  • With termination, forgetting and error both decrease to near zero. All multi-expert models outperform the single-expert case dramatically.
  • Without termination, metrics fluctuate and stay high—the router becomes unstable.
  • Increasing experts beyond 10 doesn’t help, confirming that excess experts only lengthen the exploration phase.

Real-World Validation with Deep Neural Networks

To test MoE in nonlinear settings, the authors implemented the algorithm on CIFAR-10 using a ResNet-18 backbone.

Four small plots showing generalization error and accuracy for MoE models on CIFAR-10 with and without termination.

Figure 3: Dynamics of overall generalization error and test accuracy on the CIFAR-10 dataset. With termination, models achieve higher stability and accuracy for \( M \in \{1, 4, 12\} \).

Result: Termination leads to stable learning and higher accuracy. Without termination, training oscillates and underperforms—mirroring the linear results but on real, complex data.


Key Takeaways

  1. MoE combats catastrophic forgetting by isolating tasks into dedicated experts. Each expert acts as a memory vault for its assigned tasks.

  2. A multi-objective loss, combining a locality term (specialization) with an auxiliary term (load balancing), trains the router to make smart, fair assignments.

  3. Early Termination of router updates ensures stability. Continuous updates would homogenize expert scores and destroy specialization.

  4. More experts ≠ better performance. Adequate coverage is critical, but excessive expert numbers stretch training without real benefit.


Building AI That Learns Without Forgetting

The “Theory on Mixture-of-Experts in Continual Learning” paper elevates MoE from a practical trick to a theoretically grounded architecture for continual learning. It explains not just what works but why—connecting mathematical reasoning with real-world validation.

Its lessons are clear:

  • Specialization prevents interference.
  • Controlled routing preserves memory.
  • Stopping at the right time secures stability.

By following these principles, we can design AI systems that remember their past, learn in the present, and adapt for the future—without ever forgetting how to recognize a cat.