Imagine you are trying to learn a new language. You spend months mastering French. Then, you decide to learn Spanish. But here is the catch: as soon as you start conjugating Spanish verbs, you inexplicably forget every French word you ever learned.
This phenomenon is known as Catastrophic Forgetting, and it is one of the biggest hurdles in Artificial Intelligence today.
In the world of Natural Language Processing (NLP), we want models that can learn continuously—picking up new tasks without erasing their memory of old ones. This is especially tricky in Continual Event Detection (CED), where a model must identify specific types of events (like “Attacks,” “Elections,” or “Transactions”) in text streams that change over time.
In this post, we are going to explore a fascinating research paper: “Lifelong Event Detection via Optimal Transport.” The researchers propose a method called LEDOT. They argue that the secret to memory retention isn’t just about replaying old data; it is about mathematically aligning what the model is learning now with the deep linguistic knowledge it already possesses.
The Problem: The Wasteful Fine-Tuning Process
To understand LEDOT, we first need to look at how modern NLP models are trained. Typically, we start with a massive Pre-trained Language Model (PLM) like BERT. BERT knows a lot about English vocabulary. It has a “head”—a final layer—that can predict the probability of any word in the dictionary appearing in a specific context.
When researchers fine-tune BERT for a specific task like Event Detection, they usually chop off this “language modeling head” and replace it with a new, randomly initialized “classifier head” tailored to their specific event types.
The authors of this paper argue that this is wasteful. By discarding the original head, we throw away valuable information about how words relate to each other. The classifier head is forced to learn from scratch, in isolation, which makes it “overplastic”—too eager to change its weights for new data, leading to catastrophic forgetting of old data.
Background: Event Detection and Replay
Before diving into the solution, let’s establish the basics of Event Detection (ED).
In ED, the model is given a sentence and two indices marking a “trigger word.” The goal is to classify this trigger into a specific event type (or “NA” if it’s not an event).
First, the text is encoded into hidden representations. The hidden states at the boundaries of the trigger span (\(w'_s\) and \(w'_e\)) are concatenated to form a single vector \(h\):

\[ h = [\mathbf{h}_s ; \mathbf{h}_e] \]
This vector \(h\) is then passed through a Feed-Forward Neural Network (FNN) and a Linear layer to produce a probability distribution over the event labels (\(y\)):

\[ p(y \mid h) = \mathrm{softmax}\big(\mathbf{W}\,\mathrm{FNN}(h) + \mathbf{b}\big) \]
To train this, the model minimizes the standard Cross-Entropy loss (\(\mathcal{L}_C\)). This forces the model’s predictions to match the true labels:

\[ \mathcal{L}_C = -\sum_{i=1}^{N} \log p(y_i \mid h_i) \]
Because “NA” (not an event) is much more common than actual events, the researchers use a weighted loss to balance the training:

\[ \mathcal{L}_C = -\sum_{i=1}^{N} \lambda_{y_i} \log p(y_i \mid h_i) \]

where the class weight \(\lambda_{y_i}\) is smaller for “NA” than for genuine event types.
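To make this concrete, here is a minimal PyTorch sketch of the classifier and the weighted loss. The module name, the dimensions, and the weight given to “NA” are illustrative assumptions, not the paper’s actual configuration:

```python
import torch
import torch.nn as nn

class TriggerClassifier(nn.Module):
    """Classify a trigger span into one of C event types (including NA)."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        # FNN over the concatenated span representation, then a linear layer.
        self.fnn = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, h_start: torch.Tensor, h_end: torch.Tensor) -> torch.Tensor:
        h = torch.cat([h_start, h_end], dim=-1)  # h = [h_s ; h_e]
        return self.linear(self.fnn(h))          # logits over event types

num_classes = 10                  # hypothetical number of event types
weights = torch.ones(num_classes)
weights[0] = 0.2                  # down-weight the dominant "NA" class (index 0 here)
loss_fn = nn.CrossEntropyLoss(weight=weights)

model = TriggerClassifier(hidden_dim=768, num_classes=num_classes)
h_s, h_e = torch.randn(4, 768), torch.randn(4, 768)  # stand-ins for BERT hidden states
labels = torch.randint(0, num_classes, (4,))
loss = loss_fn(model(h_s, h_e), labels)
```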
The Challenge of Continual Learning
In a Continual Learning setup, data arrives in waves (tasks). Once Task 1 is done, the model no longer has access to that full dataset. It moves on to Task 2. To prevent the model from forgetting Task 1, the standard approach is Memory-based Replay.
The model keeps a small “Replay Buffer” (\(\mathcal{R}\)) containing a few saved examples from previous tasks. When training on a new task, the model also “rehearses” these saved examples using two specific loss functions:
Replay Loss (\(\mathcal{L}_R\)): Ensures the model still classifies old examples correctly.

\[ \mathcal{L}_R = -\sum_{(x_i, y_i) \in \mathcal{R}} \log p(y_i \mid x_i) \]
Knowledge Distillation (\(\mathcal{L}_D\)): Ensures the model’s current output probabilities (\(p^t\)) look similar to the probabilities it produced in the past (\(p^{t-1}\)). This acts as a stabilizer.

\[ \mathcal{L}_D = -\sum_{x_i \in \mathcal{R}} \sum_{c} p^{t-1}_c(x_i) \log p^{t}_c(x_i) \]
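Here is a compact sketch of how these two replay-time losses might be computed; the function name and tensor layout are my own, only the loss definitions come from the paper:

```python
import torch
import torch.nn.functional as F

def replay_and_distill_losses(logits_t, logits_prev, labels):
    """Compute L_R and L_D on examples drawn from the replay buffer R.

    logits_t:    the current model's logits on the buffered examples
    logits_prev: logits the previous-task model produced on the same examples
    labels:      the stored gold labels
    """
    replay_loss = F.cross_entropy(logits_t, labels)        # L_R: still classify old data
    p_prev = F.softmax(logits_prev, dim=-1)                # p^{t-1}: old soft targets
    log_p_t = F.log_softmax(logits_t, dim=-1)              # log p^t
    distill_loss = -(p_prev * log_p_t).sum(dim=-1).mean()  # L_D: match the old distribution
    return replay_loss, distill_loss
```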
While these methods help, they aren’t perfect. The buffer is small, so it can’t capture everything. This is where LEDOT changes the game.
The Core Method: Lifelong Event Detection via Optimal Transport
The genius of LEDOT lies in how it uses the discarded “Language Modeling Head” (LMH) of BERT.
The researchers propose that even though our goal is to classify events (like “Attack” or “Marry”), we should respect the vocabulary distribution of the trigger word. If the trigger word is “ambushed,” BERT knows this word is semantically close to “attacked” or “surprised.” The classifier head should ideally map “ambushed” to the “Attack” event class in a way that respects these semantics.
Step 1: Recovering the Vocabulary Distribution
First, the researchers take the event trigger words and pass them through the frozen original BERT Language Modeling Head. This gives them a probability distribution over the entire English vocabulary (approx. 30,000 words).
They compute a distribution \(\tilde{x}\), which represents the linguistic nature of the trigger, by passing the trigger’s LM-head logits \(z\) through a temperature-scaled softmax:

\[ \tilde{x} = \mathrm{softmax}\!\left(\frac{z}{\tau}\right) \]
Here, \(\tau\) is a temperature parameter that controls how “sharp” or “flat” the distribution is.
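Here is one way to recover \(\tilde{x}\) with the Hugging Face transformers library; the checkpoint, the example sentence, and the trigger word are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Frozen BERT with its original masked-LM head still attached.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

sentence = "Rebels attack the convoy near the border."
inputs = tokenizer(sentence, return_tensors="pt")
# Locate the trigger token "attack" (+1 accounts for the leading [CLS] token).
trigger_pos = tokenizer.tokenize(sentence).index("attack") + 1

with torch.no_grad():
    logits = mlm(**inputs).logits  # shape: (1, seq_len, vocab_size)

tau = 1.0  # temperature; the paper's ablation favors tau around 1-2
x_tilde = F.softmax(logits[0, trigger_pos] / tau, dim=-1)  # distribution over ~30k wordpieces
```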
Step 2: The Alignment Problem
Now we have two distributions for the same input:
- \(\tilde{x}\): A distribution over 30,000 words (from BERT).
- \(p\): A distribution over \(C\) event classes (from our Classifier).
We want to force \(p\) to be consistent with \(\tilde{x}\). But how do you compare a list of 30,000 numbers to a list of 10 numbers? You can’t use standard distance metrics like Kullback-Leibler divergence because the domains (supports) are completely different.
This is where Optimal Transport (OT) comes in.
Step 3: Optimal Transport (OT)
Optimal Transport is a mathematical framework for measuring the distance between two probability distributions by calculating the “cost” of transforming one into the other. Imagine \(\tilde{x}\) is a pile of earth and \(p\) is a set of holes. OT calculates the most efficient way to move the earth into the holes.
The distance is defined as the minimum cost to transport the mass from the vocabulary distribution to the class distribution:

\[ d_{\mathbf{M}}(\tilde{x}, p) = \min_{\mathbf{P} \in \Pi(\tilde{x}, p)} \langle \mathbf{P}, \mathbf{M} \rangle \]

where \(\Pi(\tilde{x}, p)\) is the set of transport plans whose marginals are \(\tilde{x}\) and \(p\).
Here, \(\mathbf{M}\) is the Cost Matrix. It defines the “price” of moving probability mass from a specific word in the vocabulary to a specific event class.
Step 4: The Semantic Cost Matrix
Defining the Cost Matrix \(\mathbf{M}\) is the most critical part of this method. The cost \(m_{vc}\) should be low if word \(v\) is semantically related to event class \(c\), and high if they are unrelated.
To achieve this, the researchers assign a learnable embedding vector (\(\mathbf{g}_c\)) to every event class. They then compare this class embedding to the fixed word embedding (\(\mathbf{e}_v\)) from BERT using Cosine Similarity:

\[ m_{vc} = 1 - \cos(\mathbf{e}_v, \mathbf{g}_c) \]
This equation says: If the class embedding and the word embedding point in the same direction (high similarity), the cost is near 0. If they are opposite, the cost is high. This encourages the model to learn class representations that are semantically aligned with the actual words BERT knows.
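In code, building \(\mathbf{M}\) is a single cosine-similarity computation. The shapes below assume BERT’s ~30k-wordpiece vocabulary and a hypothetical set of 10 event classes:

```python
import torch
import torch.nn.functional as F

def semantic_cost_matrix(word_emb: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
    """m_vc = 1 - cos(e_v, g_c): moving mass to a semantically related class is cheap.

    word_emb:  (V, d) frozen word embeddings e_v from BERT
    class_emb: (C, d) learnable class embeddings g_c
    """
    e = F.normalize(word_emb, dim=-1)
    g = F.normalize(class_emb, dim=-1)
    return 1.0 - e @ g.T  # (V, C) cost matrix, entries in [0, 2]

V, C, d = 30522, 10, 768                           # vocab size, event classes, hidden dim
word_emb = torch.randn(V, d)                       # stand-in for BERT's embedding table
class_emb = torch.nn.Parameter(torch.randn(C, d))  # g_c: trained alongside the model
M = semantic_cost_matrix(word_emb, class_emb)
```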
Step 5: The Sinkhorn Distance
Calculating exact Optimal Transport is computationally expensive. To speed it up, the authors use the Sinkhorn Distance, which adds an entropic regularization term (\(H(\mathbf{P})\)). This makes the optimization problem much faster and smoother:

\[ d^{\lambda}_{\mathbf{M}}(\tilde{x}, p) = \min_{\mathbf{P} \in \Pi(\tilde{x}, p)} \langle \mathbf{P}, \mathbf{M} \rangle - \frac{1}{\lambda} H(\mathbf{P}) \]
By combining this with the previous concepts, the final Optimal Transport loss (\(\mathcal{L}_{OT}\)) minimizes the distance between the vocabulary distribution and the class prediction, effectively “anchoring” the new task learning to the stable, pre-trained knowledge of BERT:

\[ \mathcal{L}_{OT} = d^{\lambda}_{\mathbf{M}}(\tilde{x}, p) \]
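Below is a textbook Sinkhorn–Knopp implementation of this regularized transport cost (following Cuturi, 2013), offered as a sketch of what \(\mathcal{L}_{OT}\) could look like rather than the authors’ released code:

```python
import torch

def sinkhorn_ot_loss(x_tilde, p, M, lam=10.0, n_iters=50, eps=1e-8):
    """Entropy-regularized OT cost between x_tilde (V,) and p (C,) under cost M (V, C)."""
    K = torch.exp(-lam * M)          # Gibbs kernel; larger lam approaches exact OT
    u = torch.ones_like(x_tilde)
    for _ in range(n_iters):
        v = p / (K.T @ u + eps)      # rescale columns to match the class marginal p
        u = x_tilde / (K @ v + eps)  # rescale rows to match the vocab marginal x_tilde
    P = u[:, None] * K * v[None, :]  # transport plan with (approximately) those marginals
    return (P * M).sum()             # <P, M>: the transport cost used as the OT loss
```

Because every step is differentiable, gradients flow through the transport plan back into the classifier’s prediction \(p\) and the class embeddings inside \(\mathbf{M}\).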
Step 6: Consistency and Total Loss
To ensure that the learned class embeddings (\(\mathbf{G}\)) don’t drift too wildly as new tasks are added, the researchers add a regularization term that keeps the current class embeddings close to the previous ones:

\[ \mathcal{L}_{G} = \left\| \mathbf{G}^{t} - \mathbf{G}^{t-1} \right\|_2^2 \]
Finally, the total loss function combines everything we’ve discussed: the standard classification loss, the replay loss, the distillation loss, the new Optimal Transport loss, and the embedding regularization (weighted by \(\alpha\)):

\[ \mathcal{L} = \mathcal{L}_C + \mathcal{L}_R + \mathcal{L}_D + \mathcal{L}_{OT} + \alpha\,\mathcal{L}_{G} \]
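Pulling it together, one training step might combine the terms like this; the dummy loss values and the implicit unit weights on the first four terms are assumptions for illustration:

```python
import torch

# Stand-ins for the loss terms built up earlier in the post.
loss_C = torch.tensor(1.2, requires_grad=True)   # weighted classification loss
loss_R = torch.tensor(0.4, requires_grad=True)   # replay loss on buffered examples
loss_D = torch.tensor(0.3, requires_grad=True)   # distillation to the previous model
loss_OT = torch.tensor(0.8, requires_grad=True)  # Sinkhorn optimal-transport loss

G_t = torch.randn(10, 768, requires_grad=True)   # current class embeddings G^t
G_prev = torch.randn(10, 768)                    # frozen copy from the previous task

alpha = 0.5                                      # the ablation's "sweet spot"
loss_G = ((G_t - G_prev) ** 2).sum()             # keep class embeddings from drifting
total = loss_C + loss_R + loss_D + loss_OT + alpha * loss_G
total.backward()                                 # gradients for one optimizer step
```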
Bonus: Prototype Replay
In addition to OT, the researchers improve the Replay Buffer. Instead of just storing raw text, they calculate the prototype (mean \(\mu\) and covariance \(\Sigma\)) for each class. During replay, they can generate synthetic features from this Gaussian distribution. This effectively creates an “infinite” supply of replay data, preventing the buffer from being too sparse.
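A short sketch of that idea using torch.distributions; the feature dimension, buffer size, and ridge term are illustrative choices:

```python
import torch
from torch.distributions import MultivariateNormal

def class_prototype(features: torch.Tensor):
    """Mean and covariance of the stored feature vectors for one class."""
    mu = features.mean(dim=0)
    centered = features - mu
    cov = centered.T @ centered / (features.shape[0] - 1)
    cov = cov + 1e-4 * torch.eye(features.shape[1])  # ridge keeps cov positive definite
    return mu, cov

feats = torch.randn(20, 64)  # a small buffer of saved features for one class
mu, cov = class_prototype(feats)
# Sample as many synthetic replay features as we like from N(mu, cov).
synthetic = MultivariateNormal(mu, cov).sample((32,))
```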
Experiments and Results
Does this complex mathematical alignment actually work? The researchers tested LEDOT on two major Event Detection datasets: MAVEN and ACE.
They compared LEDOT against several strong baselines, including:
- Fine-tuning (naive): Just training on new tasks (prone to forgetting).
- EMR & SCR: Existing replay-based methods.
- SharpSeq: A very recent method optimizing for “flat minima.”
Performance Comparison
The results are compelling. The paper’s Table 1 tracks the F1-score (a measure of accuracy) as the model learns tasks 1 through 5 sequentially.

On the MAVEN dataset, after learning all 5 tasks, LEDOT achieves a final F1 score of 57.53%, significantly higher than standard baselines like SCR (53.41%) or KCN (47.44%).
Ideally, we want the performance on Task 1 to stay high even after learning Task 5. LEDOT demonstrates remarkably stable performance, indicating minimal catastrophic forgetting.
Interestingly, when LEDOT is combined with SharpSeq (LEDOT + SharpSeq), the performance jumps even higher (61.49% on MAVEN), showing that LEDOT is compatible with other optimization techniques.
Ablation Studies: What Matters Most?
The researchers performed “ablation studies”—removing parts of the model to see what breaks.
Does Optimal Transport matter? Comparing LEDOT to a version without OT (LEDOT-R) showed that OT contributes significantly to the final score (improving by roughly 2-3%).
Does Temperature (\(\tau\)) matter? The temperature of the language model head controls how “confident” BERT is about its vocabulary distribution. The ablation shows that a moderate temperature (around \(\tau=1\) or \(\tau=2\)) works best. If the distribution is too sharp (\(\tau=0.01\)) or too flat (\(\tau=5\)), performance drops.

Does Regularization Strength (\(\alpha\)) matter? The parameter \(\alpha\) controls how strictly we force the class embeddings to stay similar to those from the previous task. The results show a “sweet spot” at \(\alpha=0.5\).

Broader Implications
While this paper focuses on Event Detection, the implications of Optimal Transport in continual learning are vast. The core idea—aligning a new, specific task distribution with a broad, pre-trained “world knowledge” distribution—can be applied elsewhere.
The authors even briefly demonstrated this by applying the method to Continual Relation Extraction (determining how two entities are related, e.g., “Employee of” or “Born in”).
They adapted the method for a T5 model (an encoder-decoder) using a complex setup involving rationales (explanations).

Even in this different domain, the OT-enhanced method (OT RCL) improved performance, proving that the mathematical foundation of LEDOT is robust and versatile.
Conclusion
LEDOT represents a shift in how we think about fine-tuning and continual learning. Instead of treating the pre-trained model as just a feature extractor that we can overwrite, LEDOT respects the linguistic structure encoded in the original language modeling head.
By using Optimal Transport, the model builds a bridge between the specific events it needs to detect and the general vocabulary it already understands. This bridge stabilizes the learning process, allowing the AI to learn new tricks without forgetting the old ones.
For students and researchers in AI, this paper serves as a great example of how classical mathematical concepts (like Optimal Transport) can solve modern deep learning problems (like Catastrophic Forgetting). It reminds us that sometimes, the best way to move forward is to ensure we don’t lose touch with what we already know.