Imagine you are learning a second language. You spend months mastering French. Then, you switch gears to learn Spanish. A few months later, you try to speak French again, but you find yourself inserting Spanish words or, worse, you’ve forgotten the French grammar entirely.
This phenomenon, known in psychology as catastrophic forgetting, is a massive headache for Artificial Intelligence, specifically for Multimodal Large Language Models (MLLMs). These models, like GPT-4V or Gemini, are incredibly powerful at understanding images and answering questions about them. However, the world changes constantly. We want these models to learn new types of data and tasks continuously without having to retrain them from scratch (which costs millions of dollars) and—crucially—without them forgetting what they learned previously.
In this post, we are doing a deep dive into a fascinating research paper titled “CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering.” The researchers propose a novel architecture that allows an AI to learn continually, absorbing new knowledge while protecting the old.

As shown in Figure 1 above, the field has been striving to close the gap between “Vanilla” learning (where the model forgets) and “Multitask” learning (where the model learns everything at once). The proposed method, CL-MoE, represents a significant leap forward in this timeline.
The Problem: The High Cost of Knowledge
Standard MLLMs consist of a vision encoder (to see) and a Large Language Model (to think and speak). They are trained offline on massive datasets. But in the real world, data comes in streams. Trends change, new objects appear, and new types of questions are asked.
If we simply fine-tune an MLLM on new data, the weights of the neural network change to minimize error on the new task. In doing so, they drift away from the configuration that was optimal for the old task. The result? The model becomes an expert at the new stuff but fails at what it used to know.
Existing solutions usually fall into two buckets:
- Regularization: Penalizing the model for changing parameters that were important for previous tasks.
- Rehearsal: Keeping a small “memory bank” of old data to mix in with the new data.
While these help, they often struggle with the complexity of multimodal tasks. The authors of CL-MoE argue that we need a structural change in how the model manages its “skills.”
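To make the "regularization" bucket concrete, here is a minimal sketch of an EWC-style penalty that discourages parameters important for earlier tasks from drifting. The function name, the `fisher` importance weights, and the `strength` hyperparameter are illustrative assumptions, not part of CL-MoE.

```python
import torch

def ewc_penalty(model, old_params, fisher, strength=1.0):
    """Sketch of the 'regularization' bucket (an EWC-style penalty).

    old_params and fisher map parameter names to tensors saved after the
    previous task; both the names and the weighting scheme are illustrative.
    """
    loss = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            # Penalize drift on parameters that were important before,
            # weighted by their estimated importance (Fisher information).
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return strength * loss
```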
The Solution: Mixture-of-Experts (MoE)
The core idea behind this paper is Mixture-of-Experts (MoE) combined with Continual Learning (CL).
In a standard dense neural network, every single neuron fires for every single input. In a Mixture-of-Experts model, the network is divided into different sub-networks called “experts.” For any given input, a “router” decides which experts should handle the job.
The researchers use a specific efficient tuning method called LoRA (Low-Rank Adaptation). Instead of retraining the whole massive LLM, they freeze the main parameters and only train small adapter layers. In their CL-MoE framework, these adapters act as the “experts.”
Here is the mathematical formulation of a LoRA-based MoE layer:
\[
\mathbf{h} = \mathbf{W}\mathbf{x} + \sum_{i} G(\mathbf{x})_i\, B_i A_i \mathbf{x}
\]
In this equation:
- \(\mathbf{W}\mathbf{x}\) is the output of the frozen pre-trained weights (the base knowledge).
- The summation represents the contribution of the experts.
- \(G(\mathbf{x})_i\) is the gate (router) determining how much weight to give expert \(i\).
- \(B_i A_i\) are the low-rank matrices representing expert \(i\).
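To ground the formula, here is a minimal PyTorch-style sketch of a LoRA-based MoE layer: a frozen base projection plus low-rank expert adapters mixed by a softmax router. The class and parameter names are illustrative rather than taken from the paper's code, and the routing here is dense (every expert gets a weight) rather than any particular top-\(k\) scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELayer(nn.Module):
    """Sketch: frozen base weight W plus low-rank (A_i, B_i) experts
    mixed by a softmax router G."""

    def __init__(self, d_in: int, d_out: int, num_experts: int = 8, rank: int = 4):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)   # frozen pre-trained weights
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(num_experts)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_in, num_experts, bias=False)  # router G

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in)
        g = F.softmax(self.gate(x), dim=-1)       # G(x): (batch, num_experts)
        out = self.W(x)                           # base knowledge: W x
        for i in range(len(self.A)):
            lora_i = x @ self.A[i].T @ self.B[i].T   # B_i A_i x
            out = out + g[:, i:i + 1] * lora_i       # weighted by G(x)_i
        return out
```

In the actual framework these adapters sit inside a frozen LLaVA backbone; the sketch only shows the mixing arithmetic.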
The Architecture: CL-MoE
The researchers propose a framework called CL-MoE, which stands for Continual Learning Mixture-of-Experts. It solves two specific problems that standard MoE models don’t address well in a continual learning setting:
- Selection: How do we choose the right expert when the task keeps changing? (Addressed by RMoE)
- Update: How do we update expert parameters without erasing old skills? (Addressed by MMoE)
Let’s look at the high-level architecture:

As you can see in Figure 2, the system processes an input image and text. It passes through a router that selects specific experts (colored blocks). Crucially, notice the distinction between “Task-shared experts” (Green) and “Task-specific experts” (Orange). This distinction is vital for the update strategy, which we will unpack shortly.
Component 1: Dual-Router MoE (RMoE)
Standard routers look at an input and pick an expert. But in Visual Question Answering (VQA), context matters at two levels:
- Instance Level: What is this specific image about? (e.g., a photo of a cat needs object recognition experts).
- Task Level: What kind of problem are we solving? (e.g., counting objects vs. identifying colors).
If we only use an instance router, the model might miss the broader context of the current task. If we only use a task router, it misses the nuances of the specific image.
The RMoE employs a dual strategy.
The Instance-Level Router (\(G^I\))
This router takes the specific input representation \(\mathbf{x}\) and calculates a probability distribution over the experts using a Softmax function. It asks: “Which expert is best for this specific sentence and image?”

The Task-Level Router (\(G^T\))
This is where it gets interesting. The Task-Level router doesn’t look at just one image. It aggregates the decisions of the instance router across the entire training set for the current task. It essentially averages the instance-level choices to find out which experts are generally useful for the current task (e.g., “Expert 5 is really good at the ‘Counting’ task”).
\[
G^{T} = \frac{1}{N^t} \sum_{j=1}^{N^t} G^{I}(\mathbf{x}_j)
\]
Here, \(N^t\) is the number of samples in the current task.
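A minimal sketch of this aggregation step, assuming simple averaging of the softmax gate vectors (the function signature is hypothetical):

```python
import torch

@torch.no_grad()
def task_level_router(instance_router, task_inputs: torch.Tensor) -> torch.Tensor:
    """Aggregate instance-level gate decisions over the whole training set
    of the current task (N^t samples) by averaging.

    instance_router: module mapping (N_t, d) inputs to (N_t, E) softmax gates.
    Returns G^T, an (E,) distribution marking experts useful task-wide.
    """
    gates = instance_router(task_inputs)   # G^I(x_j) for every sample j
    return gates.mean(dim=0)               # average over the N^t samples
```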
Inference: Guessing the Task
During training, we know which task we are working on. But during testing (inference), the model just sees an image and a question. It doesn’t know if it’s supposed to be doing “Task 1” or “Task 5.”
To solve this, the model maintains “cluster centers” (\(R_t\)) for each task—basically an average representation of what queries in that task look like.

When a new test sample comes in, the model compares the sample’s features to these centers to predict the Task ID (\(t\)):

Finally, the model combines the advice from the Instance Router and the Task Router using a balancing hyperparameter \(\beta\). This gives a comprehensive output \(\hat{\mathbf{x}}\) that benefits from both local (instance) and global (task) insights.

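Putting the inference-time pieces together, here is a sketch of how task-ID prediction and router blending could look. The use of cosine similarity and the exact blending formula are assumptions for illustration; the paper defines its own similarity measure and combination.

```python
import torch
import torch.nn.functional as F

def predict_task_id(x_feat: torch.Tensor, centers: torch.Tensor) -> int:
    """Pick the task whose cluster center R_t is most similar to the test
    sample's features (cosine similarity is an assumption here)."""
    sims = F.cosine_similarity(x_feat.unsqueeze(0), centers, dim=-1)  # (T,)
    return int(sims.argmax())

def combined_gate(instance_gate, task_gates, x_feat, centers, beta: float = 0.5):
    """Blend the instance-level gate with the gate of the predicted task
    using the balancing hyperparameter beta."""
    t_hat = predict_task_id(x_feat, centers)
    return beta * instance_gate + (1.0 - beta) * task_gates[t_hat]
```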
Component 2: Dynamic Momentum MoE (MMoE)
Now that we have selected the experts, how do we train them? This is where the “Momentum” part comes in.
In continual learning, we have a conflict:
- Plasticity: We want to change parameters to learn the new task.
- Stability: We want to freeze parameters to keep the old task.
The MMoE module solves this by dynamically assigning a “momentum” value to the weight updates. It categorizes experts into three types based on the current task and previous tasks:
- Task-Shared Experts: These experts were important in previous tasks AND are important in the current task. They are the “generalists.”
- Task-Specific Experts: These are important for the current task but weren’t used much before. They are the “specialists” for the new stuff.
- Others: Experts not used currently.
The model uses a hyperparameter \(\gamma\) (gamma) to control how much the weights should change.
- If an expert is Task-Shared, we lean towards stability (retain \(\theta_{t-1}\)).
- If an expert is Task-Specific, we lean towards plasticity (absorb new \(\varphi_t\)).
The weight \(\lambda_i\) for each expert is determined as follows:

Finally, the expert parameters \(\theta_t\) are updated using a weighted average of the old parameters (\(\theta_{t-1}\)) and the newly learned parameters (\(\varphi_t\)):
\[
\theta_t = \lambda_i\, \theta_{t-1} + (1 - \lambda_i)\, \varphi_t
\]
This simple yet elegant equation ensures that the model only radically changes the parts of its brain (the experts) that are specific to the new problem, while carefully refining the parts that are generally useful.
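Here is a sketch of this update rule, assuming a simple \(\lambda\) assignment (\(\gamma\) for task-shared experts, \(1-\gamma\) for task-specific ones, and "keep the old weights" for unused experts); the paper's exact \(\lambda_i\) computation may be more nuanced.

```python
import torch

def momentum_update(theta_prev, phi_new, shared_ids, specific_ids, gamma=0.7):
    """Sketch of an MMoE-style update. For each expert i, blend the old
    parameters theta_{t-1} with the newly learned phi_t:
        theta_t[i] = lambda_i * theta_prev[i] + (1 - lambda_i) * phi_new[i]
    The lambda assignment below is an assumption, not the paper's exact rule.
    """
    theta_new = []
    for i, (old, new) in enumerate(zip(theta_prev, phi_new)):
        if i in shared_ids:        # task-shared: lean towards stability
            lam = gamma
        elif i in specific_ids:    # task-specific: lean towards plasticity
            lam = 1.0 - gamma
        else:                      # unused experts: keep previous weights
            lam = 1.0
        theta_new.append(lam * old + (1.0 - lam) * new)
    return theta_new
```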
Experimental Results
The researchers tested CL-MoE on the VQA v2 benchmark, splitting it into 10 sequential tasks (e.g., Recognition, Location, Counting, Commonsense, etc.).
They used two key metrics:
- AP (Average Performance): How accurate is the model on all tasks at the end? (Higher is better)
- AF (Average Forgetting): How much did the accuracy drop on old tasks? (Lower is better. A negative number means the model actually improved on old tasks!)
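For concreteness, here is how these two metrics are typically computed from an accuracy matrix in continual learning; the paper's exact formulas may differ slightly.

```python
import numpy as np

def average_performance_and_forgetting(acc: np.ndarray):
    """acc[i, j] = accuracy on task j measured after training on task i.
    Returns (AP, AF) under common continual-learning definitions."""
    T = acc.shape[0]
    ap = acc[T - 1].mean()   # final accuracy, averaged over all tasks
    # Forgetting on task j: best accuracy it ever had minus its final accuracy.
    af = np.mean([acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)])
    return ap, af
```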
Comparison with Baselines
Let’s look at the results in Table 1.

Key Takeaways:
- State-of-the-Art: CL-MoE (bottom row before “Multitask”) achieves an AP of 51.34%, significantly higher than the previous best (VQACL at 43.49%).
- Negative Forgetting: Look at the AF column. CL-MoE scores -0.02. This is remarkable. Most continual learning methods have positive AF (meaning they forget). A negative AF implies Backward Transfer—learning new tasks actually helped the model answer questions from previous tasks better!
- Model Scale Matters: The table compares methods built on VL-T5 (smaller) and LLaVA-7B (larger). The LLaVA-based methods generally perform better, and CL-MoE gets the most out of the LLaVA backbone.
Ablation Study: Do we need both parts?
The authors performed an ablation study to see if RMoE and MMoE are both necessary.

- Row (a): Without either (Vanilla), performance is poor (AP 32.51%).
- Row (b) & (c): Using just MMoE or just RMoE improves things significantly, but they have weaknesses. Using only RMoE (c) leads to high forgetting (AF 11.09%).
- Row (d): Combining both yields the best accuracy and eliminates forgetting.
Hyperparameter Sensitivity
How sensitive is the model to the choice of \(\gamma\) (momentum balance) and \(\beta\) (router balance)?

Figure 3 shows a clear “sweet spot” for these parameters.
- Left Graph (\(\gamma\)): A value around 0.7 is optimal. If \(\gamma\) is too high, the model is too rigid (high stability, low plasticity). If too low, it forgets too much.
- Right Graph (\(\beta\)): A value of 0.5 works best, suggesting that the Instance Router and Task Router are equally important.
Conclusion
The CL-MoE paper presents a compelling argument for specialized architectures in Continual Learning. Simply regulating weights isn’t enough when dealing with complex, multimodal data.
By breaking the model down into experts, cleverly routing data based on both local and global context (RMoE), and dynamically updating weights based on expert usage (MMoE), we can build AI that grows smarter over time without losing its past.
This approach brings us one step closer to lifelong learning agents—AI assistants that can learn your specific preferences and new tasks over years without forgetting the basics they learned on day one.
Key Takeaways for Students:
- MoE is powerful: It adds capacity without exploding computational cost per inference.
- Context is King: The Dual-Router approach highlights that knowing what task you are doing is just as important as the input data itself.
- Balance is Key: The momentum update is a mathematical balancing act between stability (memory) and plasticity (learning).