Imagine learning how to ride a bicycle. Now, imagine that learning to ride that bike caused you to immediately forget how to walk. This absurdity is a reality for many Artificial Intelligence models. This phenomenon, known as Catastrophic Forgetting, is a major hurdle in the field of Continual Learning (CL), where models must learn a sequence of tasks without erasing their prior knowledge.
This problem becomes even harder when you don’t have much data to learn from—a scenario called Few-shot Continual Relation Extraction (FCRE). Here, a model must identify relationships in text (e.g., “Person A is the mother of Person B”) based on just a handful of examples, all while handling new relationships that appear over time.
Standard approaches usually take a powerful pre-trained model (like BERT), chop off its “head” (the output layer used for general language understanding), and replace it with a new, randomized classifier. A recent research paper argues that this is a mistake. By discarding the pre-trained Language Model (LM) head, we throw away a massive amount of general knowledge.
In this post, we will explore a novel method called Mutual Information Maximization (MIM). We will see how keeping the “old head” attached can guide the “new head,” significantly improving performance and memory in both standard models and modern Large Language Models (LLMs).
The Background: FCRE and the Missing Head
Before diving into the solution, let’s establish the problem.
Continual Relation Extraction (CRE) requires a model to process a stream of tasks. In Task 1, it might learn to identify "Employer/Employee" relationships. In Task 2, it learns "Birthplace" relationships. The goal is to be good at Task 2 without forgetting Task 1.
Add “Few-shot” to the mix, and the model might only see 5 or 10 examples of “Birthplace.” This scarcity leads to two major issues:
- Catastrophic Forgetting: The model overwrites the weights that encoded earlier tasks in order to accommodate new information.
- Overfitting: With so few examples, the model memorizes the specific training data rather than learning the general concept of the relationship.
The Standard Approach vs. The New Idea
Most existing solutions use memory buffers (saving a few old examples to rehearse later) or prototype learning. However, they almost universally follow a specific architectural pattern: they take a pre-trained backbone (like BERT), discard the original output layer (the LM head), and train a new classification head from scratch.
The researchers behind this paper argue that the LM head—the part of the model originally trained to predict the next word in a sentence—contains rich, general semantic knowledge. Discarding it is wasteful.

As shown in Figure 3 above, the standard approach (left) relies solely on a new representation learning loss (\(\mathcal{L}_0\)). The proposed framework (right) keeps the LM head active. It uses the output of the LM head to “supervise” or align the main classifier using Mutual Information.
Core Method: Mutual Information Maximization (MIM)
The core hypothesis is simple: The representation of a sentence generated by the new classifier (which is prone to overfitting) should share a high degree of information with the representation generated by the pre-trained LM head (which is robust and general).
To achieve this, the authors introduce a Mutual Information Maximization (MIM) strategy.
The Objective Function
The goal is to maximize the Mutual Information (MI) between two latent representations:
- \(g_{\phi}(\mathbf{x})\): The feature representation from the main classification head.
- \(g_{\Phi}^{LM}(\mathbf{x})\): The representation from the pre-trained LM head.
Mathematically, we want to maximize:
\[
\max_{\phi} \; I\big(g_{\phi}(\mathbf{x}) \,;\, g_{\Phi}^{LM}(\mathbf{x})\big)
\]
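To make these two representations concrete, here is a minimal PyTorch sketch. It assumes a BERT backbone whose masked-LM head is kept, a prompt template with a [MASK] token standing in for the relation, and a simple linear layer as the new task head; using the LM head's output logits at the [MASK] position as \(g_{\Phi}^{LM}\) is a simplification for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Keep the full pre-trained model, including its masked-LM head,
# instead of loading the bare encoder and discarding the head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

hidden_size = model.config.hidden_size
# Illustrative stand-in for the new task-specific head g_phi.
classifier_head = nn.Linear(hidden_size, hidden_size)

# Hypothetical prompt template with a [MASK] token at the relation position.
text = "The relation between Alice and Acme Corp is [MASK] ."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

outputs = model(**inputs, output_hidden_states=True)
h = outputs.hidden_states[-1][0, mask_pos]   # encoder hidden state at [MASK]

g_phi = classifier_head(h)                   # representation from the new classification head
g_lm = outputs.logits[0, mask_pos]           # output of the pre-trained LM head at the same position
```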
However, calculating exact Mutual Information in high-dimensional space is difficult. To solve this, the authors use a lower bound known as InfoNCE. This is a contrastive learning objective commonly used in self-supervised learning.

Understanding InfoNCE
The InfoNCE loss essentially pulls “positive pairs” together and pushes “negative pairs” apart. In this context:
- A positive pair is the classifier’s representation and the LM head’s representation of the same input sentence.
- Negative pairs are the classifier’s representation of the current sentence versus the LM head’s representation of other sentences in the batch.
The formula for InfoNCE is calculated as follows:
\[
\mathcal{L}_{\text{InfoNCE}}(\mathbf{x}_i) = -\log \frac{\exp\big(g_{\phi}(\mathbf{x}_i)^{\top} W \, g_{\Phi}^{LM}(\mathbf{x}_i) / \tau\big)}{\sum_{j=1}^{N} \exp\big(g_{\phi}(\mathbf{x}_i)^{\top} W \, g_{\Phi}^{LM}(\mathbf{x}_j) / \tau\big)}
\]
Here, \(W\) is a trainable matrix that maps between the two representation spaces (the classifier's and the LM head's), and \(\tau\) is a temperature parameter that controls how sharp the resulting probability distribution is.
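As a concrete illustration, the following PyTorch sketch computes this contrastive term over a batch. The bilinear scoring with a trainable matrix `W` and the temperature `tau` follow the description above; the dimensions and default values are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEAlignment(nn.Module):
    """Contrastive alignment between classifier features and LM-head features.

    Positive pair: g_phi(x_i) and g_LM(x_i) for the same sentence i.
    Negative pairs: g_phi(x_i) and g_LM(x_j) for other sentences j in the batch.
    """

    def __init__(self, dim_phi: int, dim_lm: int, tau: float = 0.1):
        super().__init__()
        # Trainable matrix W that bridges the two representation spaces.
        self.W = nn.Parameter(torch.empty(dim_phi, dim_lm))
        nn.init.xavier_uniform_(self.W)
        self.tau = tau

    def forward(self, g_phi: torch.Tensor, g_lm: torch.Tensor) -> torch.Tensor:
        # g_phi: (B, dim_phi), g_lm: (B, dim_lm)
        scores = (g_phi @ self.W @ g_lm.T) / self.tau          # (B, B) similarity matrix
        targets = torch.arange(g_phi.size(0), device=g_phi.device)
        # Diagonal entries are the positive pairs; cross-entropy pushes them up
        # relative to every other entry in the row, which act as negatives.
        return F.cross_entropy(scores, targets)

# Example: align a batch of 16 classifier features with the matching LM-head outputs.
mi_loss = InfoNCEAlignment(dim_phi=768, dim_lm=30522)(torch.randn(16, 768), torch.randn(16, 30522))
```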
The final loss function for the MIM strategy sums this over the training data:
\[
\mathcal{L}_{MI} = \sum_{i=1}^{N} \mathcal{L}_{\text{InfoNCE}}(\mathbf{x}_i)
\]
The Total Loss
This new MI loss is added to the standard loss function of whatever baseline model is being used (denoted as \(\mathcal{L}_0\)). This makes the MIM strategy "plug-and-play"—it can be added to almost any existing FCRE method to improve it.
\[
\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_{MI}
\]
By minimizing this combined loss, the model is forced to learn a classifier that is accurate for the specific task but remains aligned with the general linguistic knowledge of the pre-trained backbone.
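In code, the plug-and-play integration boils down to one extra term in whichever baseline's training step. The tensors below are placeholders standing in for the baseline's own loss and the batch representations, reusing the `InfoNCEAlignment` sketch from above.

```python
import torch

# Placeholders for what a baseline (SCKD, ConPL, CPL, ...) would produce in one step.
base_loss = torch.tensor(0.73, requires_grad=True)   # the baseline's own objective, L_0
g_phi_batch = torch.randn(16, 768)                   # classifier-head features for the batch
g_lm_batch = torch.randn(16, 30522)                  # LM-head outputs for the same batch

mi_head = InfoNCEAlignment(dim_phi=768, dim_lm=30522)
mi_loss = mi_head(g_phi_batch, g_lm_batch)           # the MIM alignment term, L_MI

total_loss = base_loss + mi_loss                     # L = L_0 + L_MI
total_loss.backward()
```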
Adapting to Large Language Models (LLMs)
While BERT-based models are the standard for this task, the researchers also wanted to explore the potential of modern Large Language Models (LLMs) like LLaMA-2 and Mistral.
However, there is a technical mismatch. BERT is an “encoder-only” model (great for classification), while LLaMA is a “decoder-only” autoregressive model (great for generating text).
To bridge this gap, the authors modified the input structure. Instead of using a [MASK] token like in BERT, they restructured the prompt to ask a question, as shown in Figure 6 below.

They extracted the embedding of the word “is” (the last token before the answer) to serve as the feature representation (\(g_{\phi}\)). This allows them to apply the exact same MIM strategy to these massive generative models.
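A rough sketch of that extraction step with Hugging Face transformers is shown below. The model name, prompt wording, and the choice to reuse the LM head's next-token logits as the second representation are illustrative assumptions; the paper's exact template may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative decoder-only model; the paper experiments with LLaMA-2 and Mistral.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Reformulated prompt: the relation label would be generated right after "is".
prompt = ("Sentence: Alice was born in Paris.\n"
          "The relation between Alice and Paris is")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Hidden state of the final prompt token ("is") plays the role of g_phi(x).
g_phi = outputs.hidden_states[-1][0, -1]   # shape: (hidden_size,)

# The pre-trained LM head applied at the same position gives the second view, g_LM(x).
g_lm = outputs.logits[0, -1]               # next-token logits over the vocabulary
```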
Experiments and Results
The team tested their method on two major benchmarks: FewRel and TACRED. They applied their MIM strategy to three state-of-the-art baselines: SCKD, ConPL, and CPL.
1. Does it improve accuracy?
The results were consistent. Adding the MIM strategy (+MI) improved performance across the board.

In Table 1, we can see that for BERT-based methods, the +MI variants (highlighted rows) consistently outperform the originals. For example, on the challenging TACRED dataset, ConPL+MI achieved significantly better stability across tasks than standard ConPL.
2. Does it reduce forgetting?
The primary enemy in this field is the “accuracy drop”—the difference between how well the model knew a task when it first learned it versus how well it remembers it at the very end.
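Concretely, if \(A_{\text{first}}\) is a task's accuracy right after it is learned and \(A_{\text{final}}\) is its accuracy once all tasks have been seen, the drop is \(\Delta = A_{\text{first}} - A_{\text{final}}\). As a made-up illustration, a relation learned to 90% accuracy but answered correctly only 65% of the time at the end of training has a drop of 25 points; the lower \(\Delta\) is, the less the model has forgotten.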

Figure 1 vividly illustrates this. The blue bars represent the accuracy drop of the original methods. The red bars represent the methods enhanced with MIM. In every case, the red bars are lower, meaning the model retained more of its past knowledge. This confirms that the LM head acts as an anchor, preventing the model from drifting too far away from its general knowledge base.
3. Visualizing the Improvement
Numbers are great, but seeing the data in space is often more intuitive. The researchers used t-SNE (a technique for visualizing high-dimensional data) to plot how the model groups different relationships.

In Figure 4, compare the left plot (Original CPL) with the right plot (CPL+MI). The distinct colored clusters represent different relations.
- Left: The clusters are somewhat spread out and fuzzy at the boundaries.
- Right: The clusters are tighter and better separated.
This tighter clustering means the model is less confused between different relationships, leading to higher accuracy and better generalization.
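If you want to produce a similar picture from your own model, the sketch below uses scikit-learn's t-SNE on a matrix of per-sentence features; the random features and relation labels are placeholders for the representations \(g_{\phi}(\mathbf{x})\) you would extract in practice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: replace with the classifier-head features of test sentences
# and one integer relation label per sentence.
features = np.random.randn(500, 768).astype(np.float32)
labels = np.random.randint(0, 8, size=500)

# Project the high-dimensional features onto two dimensions for plotting.
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="Relation", fontsize=8)
plt.title("t-SNE of relation representations (illustrative)")
plt.tight_layout()
plt.show()
```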
4. How do LLMs perform?
The study found that LLMs like LLaMA-2-7B and Mistral-7B generally outperform BERT-based models in FCRE tasks due to their sheer size and pre-training depth. However, they are still prone to forgetting.
Crucially, the MIM strategy works for LLMs too.

Looking at the data in Table 6 (specifically the \(\Delta \downarrow\) column on the far right), we see the accuracy drop. Mistral-7B-CPL + MI achieves an incredibly low accuracy drop of roughly 21.5%, compared to over 30% for standard BERT-based methods. This suggests that combining the massive knowledge of LLMs with the alignment strategy of MIM is a powerful direction for future research.
Conclusion
The key takeaway from this research is that in our quest to specialize AI models for specific tasks, we shouldn’t be too quick to discard their general capabilities.
The “head” of a pre-trained language model is not just a mechanism for predicting the next word; it is a repository of general linguistic understanding. By using Mutual Information Maximization, we can force new, specialized classifiers to stay in sync with this deep knowledge.
This approach offers a “best of both worlds” solution:
- Flexibility: The model can learn new, specific relations from very few examples.
- Stability: The model retains the robust, general features of the pre-trained backbone, preventing catastrophic forgetting.
As we move toward using larger models like LLaMA and Mistral for specialized tasks, techniques like MIM will be essential to ensure these giants can learn new tricks without forgetting the old ones.