Introduction
Imagine you are trying to teach a Large Language Model (LLM) about the world. You train it on data up to 2023. In 2024, the Prime Minister of a country changes. You teach the model this new fact. In 2025, a new scientific element is discovered. You teach it that too.
Here is the problem: In current deep learning architectures, when you teach the model the new fact, it has a nasty habit of forgetting the old one—or worse, its general reasoning capabilities start to degrade. This is known as Catastrophic Forgetting.
Retraining the entire model from scratch every time a fact changes is prohibitively expensive and slow. We need a way to perform Lifelong Model Editing—continuously updating the model’s knowledge base efficiently without breaking what it already knows.
In this post, we will explore a new paper titled “LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models.” The researchers propose a novel architecture that allows LLMs to learn sequentially, keeping their knowledge fresh without losing their “minds.”
The Problem with Current Editing Methods
There are existing methods for editing models (like ROME or MEMIT), but they are generally designed for Batch Editing—fixing a bunch of errors at once. When you try to use them for Lifelong Editing (a continuous stream of updates over time), they fail.
To understand why, the authors of the paper performed a deep diagnostic analysis. They identified three main culprits:
- Catastrophic Forgetting: New edits overwrite the weights used for previous edits.
- Inconsistent Routing: In Mixture of Experts (MoE) models, the mechanism that decides which “expert” handles a query changes as the model updates. A question that went to Expert A yesterday might go to Expert B today, losing the learned context.
- Order Sensitivity: The model’s performance fluctuates wildly depending on the order in which facts are taught.
Visualizing the Failure
The researchers tested a standard Mixture of Experts (MoE) adaptor on a lifelong editing task. The results were telling.

Figure 1 Analysis:
- Left (Reliability): Look at the graph on the left. The green line (“Immediate evaluation”) shows the model learns the current fact perfectly (100%). But the blue line (“Final evaluation”) shows that after 100 updates, the model has forgotten many of the earlier facts, with reliability dropping significantly.
- Right (Routing Consistency): The heatmaps show how consistent the model is at sending the same question to the same expert. The “Batch” map (bottom right) is diagonal, meaning high consistency. The “Lifelong” map (top right) is scattered, meaning the routing is messy and inconsistent.
The Impact of Editing Order
The researchers also found that when you teach a fact matters just as much as what you teach.

As shown in the violin plot above (left), simply shuffling the order of the exact same dataset can cause reliability to swing by over 20 points. This suggests that the model prefers learning related topics together rather than jumping randomly between unrelated facts.
The Solution: LEMoE
To solve these problems, the authors introduce LEMoE (Lifelong Mixture of Experts).
The core philosophy of LEMoE is simple: Don’t touch what works. Instead of constantly rewriting the same neurons, LEMoE adds new capacity for new information and locks the old information away for safekeeping.

As illustrated in the conceptual framework above, LEMoE treats editing as a timeline. When Data \(i\) arrives, it assigns a specific module (FFN \(i\)) to handle it.
LEMoE is built on three pillars:
- New Module Insertion (Freezing Experts)
- KV Anchor Routing
- Clustering-Based Order Planning
1. New Module Insertion
In a standard transformer, the Feed-Forward Network (FFN) layers are often where “knowledge” is stored. LEMoE utilizes a Mixture of Experts (MoE) approach where multiple FFNs (experts) exist.
However, unlike traditional MoE where all experts are trained together, LEMoE adopts a sequential strategy.

How it works:
- When a new batch of edits arrives, LEMoE inserts a new expert (FFN) into the model.
- Crucially, it freezes the experts responsible for previous data.
- Only the new expert and the router are trainable.
Because the parameters of previous experts are frozen, new edits cannot overwrite the weights that encode earlier edits. This directly addresses catastrophic forgetting.
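To make the grow-and-freeze pattern concrete, here is a minimal PyTorch-style sketch (not the authors' implementation; the class name, dimensions, and FFN shape are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LifelongMoEAdaptor(nn.Module):
    """Illustrative sketch: an adaptor that grows one expert FFN per
    edit batch and freezes everything it learned before."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.ffn_dim = ffn_dim
        self.experts = nn.ModuleList()   # one FFN expert per edit batch
        self.keys = nn.ParameterList()   # one routing key per expert

    def add_expert(self) -> None:
        # Freeze every previously trained expert and its routing key,
        # so later edits cannot overwrite earlier knowledge.
        for expert in self.experts:
            for p in expert.parameters():
                p.requires_grad = False
        for key in self.keys:
            key.requires_grad = False

        # Insert a fresh, trainable expert for the incoming edit batch.
        new_expert = nn.Sequential(
            nn.Linear(self.hidden_dim, self.ffn_dim),
            nn.GELU(),
            nn.Linear(self.ffn_dim, self.hidden_dim),
        )
        self.experts.append(new_expert)
        self.keys.append(nn.Parameter(torch.randn(self.hidden_dim)))
```

In this sketch, `add_expert()` would be called before training on each new edit batch, so only the newest expert and its key receive gradients.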
2. KV Anchor Routing
Freezing the experts is only half the battle. We still need to ensure that when a user asks a question about an old fact, the model knows exactly which frozen expert to send it to. This is the job of the Router.
Standard routers change their behavior as they are trained, leading to the “Inconsistent Routing” problem we saw earlier. LEMoE introduces KV (Key-Value) Anchor Routing.
**The Concept:**
- Key (\(k\)): Each expert is assigned its own learnable key vector, which is frozen once that expert's training is complete.
- Value (\(v\)): The input query is projected into a “value” embedding.
The router calculates the similarity between the input’s Value and the experts’ Keys.
\[
g(i \mid e_j) = \mathrm{Top}_k\!\left( \frac{e^{k_j v_j}}{\sum_{i=1}^{t} e^{k_i v_i}} \right)
\]
Because the keys for previous experts are frozen (just like the experts themselves), the “address” for old knowledge never changes. If the model learns that “Paris is in France” is stored in Expert 1, the routing mechanism ensures queries about Paris continue to go to Expert 1, even after Expert 100 has been added.
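As a rough illustration of that routing step, the sketch below scores an input against the expert keys and keeps the top-k matches. The value projection, shapes, and function name are assumptions for the sake of the example, not the paper's code:

```python
import torch
import torch.nn.functional as F

def kv_anchor_route(hidden: torch.Tensor,
                    keys: list,
                    value_proj: torch.nn.Linear,
                    top_k: int = 1):
    """Sketch of KV anchor routing: project the input into a value
    embedding, score it against each expert's (frozen) key, and keep
    the top-k experts."""
    v = value_proj(hidden)                        # (batch, hidden_dim)
    k = torch.stack(list(keys), dim=0)            # (num_experts, hidden_dim)
    scores = v @ k.t()                            # (batch, num_experts)
    weights = F.softmax(scores, dim=-1)           # similarity distribution
    top_w, top_idx = weights.topk(top_k, dim=-1)  # keep the top-k experts
    return top_w, top_idx
```

Since the keys of earlier experts never change after they are frozen, the scores for old queries stay stable no matter how many new experts are appended.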
3. Clustering-Based Order Planning
Remember the “Order Sensitivity” problem? The researchers found that models learn better when semantically similar items are grouped together (high within-batch similarity) and different groups are distinct (low between-batch similarity).
LEMoE uses a K-means clustering algorithm to organize the editing stream. Before the edits are applied, they are grouped by topic. This creates a “curriculum” for the model, allowing it to specialize an expert for a specific topic (e.g., one expert for “Geography updates,” another for “Pop Culture updates”).
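A minimal sketch of this order planning is shown below, assuming the edit prompts have already been embedded by some encoder (the embedding model and the number of batches are assumptions left to the caller):

```python
import numpy as np
from sklearn.cluster import KMeans

def plan_edit_order(edit_embeddings: np.ndarray, num_batches: int) -> list:
    """Sketch of clustering-based order planning: group edits whose
    embeddings are similar into the same batch so each expert can
    specialize on one topic."""
    kmeans = KMeans(n_clusters=num_batches, random_state=0)
    labels = kmeans.fit_predict(edit_embeddings)   # cluster id per edit
    # One batch per cluster: semantically related edits stay together.
    return [np.flatnonzero(labels == c).tolist() for c in range(num_batches)]
```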
Experiments and Results
Does it actually work? The researchers tested LEMoE against top-tier baselines like ROME, MEMIT, and GRACE using the LLaMA-2-7B and Mistral-7B models.
They measured three key metrics:
- Reliability (Rel): Did the model learn the edit?
- Generality (Gen): Can it answer rephrased versions of the edit?
- Locality (Loc): Did the edit break unrelated knowledge? (Higher is better, 1.00 is perfect).
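To make these metrics concrete, here is a toy evaluation sketch. Everything here is hypothetical: `model.answer()` is an assumed helper, and the edit records are assumed to carry `prompt`, `target`, and (for locality probes) the model's `original_answer`:

```python
def evaluate_edits(model, edits, rephrases, unrelated) -> dict:
    """Toy sketch: score each metric as an exact-match accuracy.
    `model.answer(prompt)` is a hypothetical helper, not a real API."""
    rel = sum(model.answer(e.prompt) == e.target for e in edits) / len(edits)
    gen = sum(model.answer(r.prompt) == r.target for r in rephrases) / len(rephrases)
    loc = sum(model.answer(u.prompt) == u.original_answer for u in unrelated) / len(unrelated)
    return {"Reliability": rel, "Generality": gen, "Locality": loc}
```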
Main Results

The table above shows the results after 100 sequential editing steps.
- LEMoE (Bold bottom row) achieves the best balance.
- Locality: LEMoE scores a perfect 1.00 in Locality. Because the base model's weights are left untouched and previous experts are frozen, the edits cannot corrupt unrelated knowledge.
- Comparison: While methods like GRACE have high reliability, they suffer in Generality (0.00). Methods like MEMIT have decent scores but struggle to maintain them as the sequence gets longer.
Scaling to 3,000 Edits
The real test of “lifelong” editing is endurance. How does the model fare after thousands of updates?

At 3,000 edits, the gap widens.
- GRACE maintains reliability but has essentially zero generality (0.03): it memorizes the exact edit sentences but fails to generalize to rephrasings.
- MEMIT sees its locality drop to 0.47, meaning it is destroying the model’s existing knowledge base.
- LEMoE maintains high reliability (0.70), decent generality (0.48), and perfect locality (1.00).
Conclusion
The LEMoE paper presents a significant step forward in making Large Language Models adaptable to a changing world. By abandoning the idea of “overwriting” knowledge and embracing a modular, additive approach, the researchers have created a system that remembers the past while learning the future.
Key Takeaways:
- Don’t Overwrite, Add: Freezing old experts prevents catastrophic forgetting.
- Route Consistently: KV Anchor routing ensures old facts can always be found.
- Plan the Order: Grouping edits by topic helps the model learn more efficiently.
As LLMs become integrated into daily life, the ability to update them cheaply and safely—without a full retrain—will be a critical requirement. LEMoE offers a robust blueprint for how to achieve that.