Imagine you have a brilliant AI assistant that’s great at pulling information from text. You first teach it to identify company names in news articles—it gets really good at that. Then, you train it to find dates of corporate mergers. But when you ask it to find a company name again, it struggles. Somehow, it’s forgotten its original skill while learning the new one.
This is a classic AI problem known as catastrophic forgetting, and it’s a major challenge on the road to building truly intelligent systems that can learn and adapt over time. In reality, information constantly evolves. We need models that can continually learn new tasks—such as identifying medical terms, extracting financial relationships, or detecting critical events—without losing their past knowledge.
This challenge lies at the heart of continual learning. A recent research paper, “Mixture of LoRA Experts for Continual Information Extraction with LLMs”, proposes a clever solution. The authors introduce a new model called MoLE-CIE, which uses a team of specialized “experts” to learn new information extraction (IE) tasks sequentially. This approach not only combats catastrophic forgetting but also allows the model to share insights between tasks, making it both smarter and more efficient over time.
In this article, we’ll unpack the key ideas behind this work and explain how MoLE-CIE achieves this feat:
- The challenge of continual information extraction
- The architecture and mechanisms behind MoLE-CIE
- The experimental results that demonstrate its success
Whether you’re an NLP researcher, ML practitioner, or just curious about how large language models (LLMs) learn continuously, this deep dive will help you understand a cutting-edge approach to lifelong learning.
The Challenge: Building a Lifelong Learner for Information Extraction
Information Extraction (IE) is the process of converting unstructured text into structured information—essentially teaching computers to read and understand facts. It includes several sub-tasks:
- Named Entity Recognition (NER): Identifying entities such as people, organizations, or locations. Example: Detecting “Apple Inc.” and labeling it as an “ORGANIZATION”.
- Relation Extraction (RE): Recognizing relationships between entities. Example: Determining that Steve Jobs is the “founder of” Apple Inc.
- Event Detection (ED): Identifying events and triggers in text. Example: Recognizing the word “acquired” as an “ACQUISITION” event.
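To make these sub-tasks concrete, here is a toy example of the kind of structured output an IE system might produce for a single sentence (the sentence and field names are illustrative, not the paper's schema):

```python
# Illustrative only: one possible structured representation of the three IE
# sub-tasks applied to a single sentence.
sentence = "Apple Inc., founded by Steve Jobs, acquired Beats in 2014."

extraction = {
    "entities": [  # NER: text spans labeled with entity types
        {"text": "Apple Inc.", "type": "ORGANIZATION"},
        {"text": "Steve Jobs", "type": "PERSON"},
        {"text": "Beats", "type": "ORGANIZATION"},
    ],
    "relations": [  # RE: typed links between entity pairs
        {"head": "Steve Jobs", "relation": "founder_of", "tail": "Apple Inc."},
    ],
    "events": [  # ED: trigger words labeled with event types
        {"trigger": "acquired", "type": "ACQUISITION"},
    ],
}
```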
Traditionally, models are trained separately for each task. But imagine a single, unified model that could learn NER, RE, and ED sequentially—while also adapting to new tasks as data evolves. This is the goal of Continual IE, where the model is trained on a sequence of tasks, each introducing new types of information.
However, two major hurdles arise:
- Catastrophic Forgetting: Fine-tuning on a new task often causes a model to overwrite previously learned knowledge, dramatically reducing performance on older tasks.
- Insufficient Knowledge Transfer: IE tasks share semantics—understanding entities helps understand relations and events. An ideal model should reuse knowledge from previous tasks (forward transfer) while improving old tasks with new insights (backward transfer).
LLM-based approaches with Parameter-Efficient Fine-Tuning (PEFT)—like LoRA (Low-Rank Adaptation)—try to address this. Each new task gets its own LoRA adapters, while previous adapters are frozen. This helps retention but isolates knowledge, restricting transfer. Worse, these methods decide which adapter to use at the sentence level—an issue in IE, where sentences across tasks often look semantically similar. This leads to misrouting and further forgetting.
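For readers less familiar with LoRA, the core idea is to freeze the pre-trained weight matrix and learn only a small low-rank correction on top of it. Here is a minimal PyTorch sketch of a single LoRA-adapted linear layer; the rank and scaling values are illustrative, not the paper's settings:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer augmented with a trainable
    low-rank update B @ A (scaled by alpha / r), as in LoRA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero, so the adapter is a no-op initially
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen projection + trainable low-rank adaptation
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```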
The authors propose a more nuanced approach: operate at the token level, routing each individual token to the right expertise rather than making a single choice for the whole sentence.
The Solution: MoLE-CIE — A Team of Token-Level Experts
MoLE-CIE (Mixture of LoRA Experts for Continual Information Extraction) turns continual learning into true teamwork. Instead of one monolithic model juggling all tasks, it uses a Mixture of Experts (MoE) architecture—a group of specialized experts that collaborate dynamically.
When text enters the system, different parts of it are routed to the most suitable experts. These experts are combined strategically to generate an output, allowing fine-grained handling of diverse tasks while preserving past knowledge.

Figure 1: The overall framework of MoLE-CIE, integrated into an LLM attention layer for efficient continual learning.
1. Mixture of LoRA Experts: Token-Level Specialization
At the core of MoLE-CIE is a pool of LoRA adapters, each one an expert trained for specific behaviors. Some are general-purpose, others specialize in tasks like NER, RE, or ED.
LoRA Router: The Smart Dispatcher
Each token (word) is routed to its most relevant experts via a LoRA Router. Instead of picking one adapter per sentence, the router computes relevance scores for each token using its hidden representation \(h_{x_j}\). A small two-layer neural network calculates these scores and converts them into probabilities over the available experts:
\[ S_{x_j} = \mathbf{W}_2\left(\tanh(\mathbf{W}_1 h_{x_j})\right) \]

\[ P_{x_j,q} = \frac{\exp(S_{x_j,q})}{\sum_{q'=1}^{n} \exp(S_{x_j,q'})} \]

The model then selects the top-k experts (typically k = 2) with the highest probabilities. This enables it to, for instance, use one expert for entity tokens and another for event triggers, even within the same sentence—greatly improving flexibility and inter-task knowledge sharing.
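A minimal PyTorch sketch of such a token-level router is shown below. The layer sizes, the renormalization of the kept probabilities, and the returned values are assumptions made for illustration rather than details from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRARouter(nn.Module):
    """Token-level router: a two-layer network scores every LoRA expert for
    every token, and the top-k experts are kept per token."""

    def __init__(self, hidden_dim: int, num_experts: int, router_dim: int = 128, k: int = 2):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, router_dim, bias=False)
        self.W2 = nn.Linear(router_dim, num_experts, bias=False)
        self.k = k

    def forward(self, h: torch.Tensor):
        # h: (batch, seq_len, hidden_dim) token hidden states
        scores = self.W2(torch.tanh(self.W1(h)))            # S_{x_j}
        probs = F.softmax(scores, dim=-1)                   # distribution over experts per token
        top_p, top_idx = probs.topk(self.k, dim=-1)         # keep the k most relevant experts
        weights = top_p / top_p.sum(dim=-1, keepdim=True)   # per-token mixing weights (assumed renormalization)
        return top_idx, weights, probs                      # probs can be reused for distillation later
```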
Task Experts and Task Keys: Domain Knowledge Guardians
While LoRA experts handle general semantics, Task Experts preserve domain-specific knowledge for NER, RE, and ED tasks. Each task expert has a corresponding Task Key—a vector representing that task’s characteristic features.
To determine which Task Expert to use, MoLE-CIE compares the average hidden representation of the sentence to these keys via cosine similarity:
\[ C_{x,m} = \cos(\operatorname{avg}(h_x), K_m) \]

The task expert with the highest similarity score is activated. In addition, a shared IE Expert captures cross-task knowledge. Combining both ensures that MoLE-CIE retains previously learned patterns while still learning generalizable insights.
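In code, this selection step amounts to an average, a cosine similarity, and an argmax. The sketch below is a simplified illustration; the hard argmax and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def select_task_expert(h: torch.Tensor, task_keys: torch.Tensor) -> int:
    """Pick the task expert whose key best matches the sentence.

    h:         (seq_len, hidden_dim) token hidden states of one sentence
    task_keys: (num_task_experts, hidden_dim) learned key vectors
    """
    sentence_repr = h.mean(dim=0)                                             # avg(h_x)
    sims = F.cosine_similarity(sentence_repr.unsqueeze(0), task_keys, dim=-1) # C_{x,m}
    return int(sims.argmax())                                                 # most similar task expert
```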
Aggregating Expert Outputs
The outputs of the selected LoRA and Task Experts are combined into a single attention representation:
\[ A_{\text{MoLE},x_j} = \alpha \sum_{i=1}^{k} \theta_{r_i} E_{r_i}(h_{x_j}) + \beta\left(\theta_t E_t(h_{x_j}) + \theta_{\text{IE}} E_{\text{IE}}(h_{x_j})\right) \]

This MoLE attention is added to the base LLM’s attention output, blending pre-trained and adaptive representations.
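A compact sketch of this aggregation for a single token is given below. For simplicity it folds the per-expert gate weights θ into the routing weights and weights the task and IE experts equally, so it is an illustrative simplification of the equation above rather than the paper's exact computation:

```python
import torch

def mole_attention(h_tok, lora_experts, top_idx, top_w,
                   task_expert, ie_expert, alpha=1.0, beta=1.0):
    """Combine the routed LoRA experts with the selected task expert and the
    shared IE expert for one token representation h_tok of shape (hidden_dim,)."""
    routed = sum(w * lora_experts[i](h_tok)                 # weighted top-k LoRA experts
                 for i, w in zip(top_idx.tolist(), top_w.tolist()))
    shared = task_expert(h_tok) + ie_expert(h_tok)          # task-specific + cross-task knowledge
    return alpha * routed + beta * shared                   # added to the base attention output
```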
The model trains using a standard generation objective:

\[ \mathcal{L}_{\text{task}} = -\sum_{(x,y)\in D_i^{\text{train}}}\log P(y \mid x; \mathcal{M}_{\text{LLM}}, \mathcal{M}_{\text{MoLE},i}) \]

2. Gate Reflection: Teaching the Router Not to Forget
Even routing systems can forget. As new tasks arrive, the router and the task keys can drift, losing their ability to pick the right experts for earlier tasks. To address this, MoLE-CIE introduces Gate Reflection, which uses knowledge distillation to preserve the learned behavior of both the router and the task keys.
Router Knowledge Distillation
MoLE-CIE constrains the router’s expert-selection probabilities at the current stage to stay consistent with those produced at the previous stage:
\[ \mathcal{L}_{\text{rkd}} = \sum_{x}\sum_{x_j}\text{KL}\left(P_{i-1,x_j} \,\|\, P_{i,x_j}\right) \]

Keys Knowledge Distillation
Similarly, the similarity scores between task keys and sentences are kept close to their values from the previous stage:
\[ \mathcal{L}_{\text{kkd}} = \sum_{x}\text{KL}\left(C_{i-1,x} \,\|\, C_{i,x}\right) \]

Joint Loss
Finally, all objectives are combined to balance learning and memory retention:
\[ \mathcal{L} = \left(1 - \frac{|\tilde{R}_{i-1}|}{|\tilde{R}_i|}\right)\mathcal{L}_{\text{task}} + \frac{|\tilde{R}_{i-1}|}{|\tilde{R}_i|}\left(\gamma\mathcal{L}_{\text{rkd}} + \delta\mathcal{L}_{\text{kkd}}\right) \]

This reflective mechanism keeps the model’s decision gates stable through continual updates—ensuring it “remembers how to choose.”
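Putting the two distillation terms and the joint objective together, a minimal sketch might look like the following. The softmax normalization of the key similarities inside the keys-KD term and the interpretation of |R̃| as simple set sizes (n_prev, n_curr) are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def gate_reflection_losses(probs_prev, probs_curr, sims_prev, sims_curr):
    """Distillation terms that keep the router and task keys stable across stages.

    probs_*: (num_tokens, num_experts) routing distributions from the previous
             and current models; sims_*: (batch, num_tasks) key similarities.
    """
    # Router KD: KL(P_{i-1} || P_i), summed over tokens
    l_rkd = F.kl_div(probs_curr.log(), probs_prev, reduction="sum")
    # Keys KD: KL between the (softmax-normalized) key-similarity distributions
    l_kkd = F.kl_div(F.softmax(sims_curr, dim=-1).log(),
                     F.softmax(sims_prev, dim=-1), reduction="sum")
    return l_rkd, l_kkd


def joint_loss(l_task, l_rkd, l_kkd, n_prev, n_curr, gamma=1.0, delta=1.0):
    """Blend the task loss with the distillation terms using the size ratio
    |R~_{i-1}| / |R~_i| from the joint objective above (n_prev and n_curr
    stand in for those set sizes; gamma and delta are tunable weights)."""
    ratio = n_prev / n_curr
    return (1.0 - ratio) * l_task + ratio * (gamma * l_rkd + delta * l_kkd)
```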
Experiments: Putting MoLE-CIE to the Test
The authors evaluated MoLE-CIE on six benchmark datasets spanning the three major IE tasks: NER, RE, and ED. They tested two challenging task sequences:
- Sequence 1 (Polling): Alternates tasks—ED → RE → NER → ED → …—to test cross-task adaptation.
- Sequence 2 (Blocked): Trains all splits of one task before moving to the next—testing same-task memory retention.
Main Results: The Clear Winner

Table 1: Accuracy comparison between MoLE-CIE and competing continual learning methods.
MoLE-CIE decisively outperforms all competitors. After all 12 training splits in Sequence 1, it achieves 66.10% accuracy—14.32% higher than the best alternative (NoRGa) and remarkably close to the upper bound (joint-training). This demonstrates efficient continual learning without re-training on old data.
Resisting Forgetting and Enabling Knowledge Transfer
To measure how well MoLE-CIE preserves and transfers knowledge, the authors analyzed individual task performance after full training.

Table 3: Accuracy on individual splits after all training stages.
The results show strong retention—for example, MoLE-CIE maintains over 52% accuracy on the first task (ED₁), far above competing models, proving its resistance to forgetting.
Next, they used standard metrics—Forward Transfer (FWT), Backward Transfer (BWT), and Forgetting Rate (F.Ra)—to quantify learning dynamics.

Table 5: Scores for FWT, BWT, and forgetting rate across models.
MoLE-CIE leads on all fronts:
- Best Forward Transfer: Leveraging old knowledge to learn new tasks faster.
- Strong Backward Transfer: Integrating new insights to refine prior knowledge.
- Lowest Forgetting Rate: Maintaining stability across continual learning stages.
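For reference, these three metrics are commonly computed from a matrix of per-stage accuracies. The sketch below uses the standard formulations from the continual-learning literature; the paper's exact definitions may differ in detail:

```python
import numpy as np

def continual_metrics(R: np.ndarray, baseline: np.ndarray):
    """Common formulations of FWT, BWT, and the forgetting rate.

    R:        (T, T) matrix with R[i, j] = accuracy on task j after training stage i
    baseline: (T,) accuracy of an untrained/reference model on each task
    """
    T = R.shape[0]
    # Forward transfer: benefit on a task measured just before training on it
    fwt = np.mean([R[i - 1, i] - baseline[i] for i in range(1, T)])
    # Backward transfer: change on earlier tasks after all training is done
    bwt = np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])
    # Forgetting rate: drop from the best accuracy ever reached on each earlier task
    f_ra = np.mean([R[:T - 1, i].max() - R[T - 1, i] for i in range(T - 1)])
    return fwt, bwt, f_ra
```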
Ablation Study: Why Every Component Matters
To test which parts truly drive performance, the authors removed modules one by one—LoRA experts, Task experts, Gate Reflection components—and measured accuracy drops.

Table 4: Ablation study results highlighting the contribution of each component.
Each module proves essential. Removing LoRA or Task Experts severely hurts results. Disabling Gate Reflection’s distillation losses increases forgetting. Switching from token-level to sentence-level routing decreases accuracy by 5%, confirming the importance of fine-grained expert selection.
Visualizing How Experts Are Used
Finally, the authors visualized how MoLE-CIE’s experts activate across tasks, revealing how knowledge is distributed among specialists and generalists.

Figure 2: Heatmaps of expert selection frequency for each task sequence.
The visualization shows:
- Task Experts specialize — ED, RE, and NER experts are activated primarily for their own tasks, retaining domain-specific patterns.
- LoRA Experts generalize — shared LoRA experts are used across tasks, representing reusable semantics that enable smooth transfer.
- Adaptive routing persists — even as new tasks appear, the model assigns appropriate experts, proving strong adaptability and long-term retention.
Conclusion: Building AI That Never Stops Learning
The Mixture of LoRA Experts for Continual Information Extraction paper advances continual learning for language models in a meaningful way. By blending token-level precision, modular expert specialization, and memory-preserving distillation, MoLE-CIE enables large language models to keep learning without forgetting.
Key takeaways:
- Token-level routing delivers the granularity needed to handle multiple IE tasks effectively.
- Hybrid expert systems combining LoRA and Task Experts balance generalization and retention.
- Gate Reflection ensures the routing mechanism itself stays consistent across continual updates.
This innovation brings us closer to AI systems that grow—and remember—just like humans do. Imagine a model that continuously expands its expertise, adapting to new industries, document types, or domains without retraining from scratch. MoLE-CIE is a major step toward that vision: a model that doesn’t just learn once—it learns forever.
