Imagine learning how to ride a bicycle. Now, imagine that learning to ride that bike caused you to immediately forget how to walk. This absurdity is a reality for many Artificial Intelligence models. This phenomenon, known as Catastrophic Forgetting, is a major hurdle in the field of Continual Learning (CL), where models must learn a sequence of tasks without erasing their prior knowledge.
This problem becomes even harder when you don’t have much data to learn from—a scenario called Few-shot Continual Relation Extraction (FCRE). Here, a model must identify relationships in text (e.g., “Person A is the mother of Person B”) based on just a handful of examples, all while handling new relationships that appear over time.
Standard approaches usually take a powerful pre-trained model (like BERT), chop off its “head” (the output layer used for general language understanding), and replace it with a new, randomized classifier. A recent research paper argues that this is a mistake. By discarding the pre-trained Language Model (LM) head, we throw away a massive amount of general knowledge.
In this post, we will explore a novel method called Mutual Information Maximization (MIM). We will see how keeping the “old head” attached can guide the “new head,” significantly improving performance and memory in both standard models and modern Large Language Models (LLMs).
The Background: FCRE and the Missing Head
Before diving into the solution, let’s establish the problem.
Continual Relation Extraction (CRE) requires a model to process a stream of tasks. In Task 1, it might learn to identify "Employer/Employee" relationships. In Task 2, it learns "Birthplace" relationships. The goal is to be good at Task 2 without forgetting Task 1.
Add “Few-shot” to the mix, and the model might only see 5 or 10 examples of “Birthplace.” This scarcity leads to two major issues:
- Catastrophic Forgetting: The model overwrites the weights that encoded earlier tasks in order to accommodate new information.
- Overfitting: With so few examples, the model memorizes the specific training data rather than learning the general concept of the relationship.
The Standard Approach vs. The New Idea
Most existing solutions use memory buffers (saving a few old examples to rehearse later) or prototype learning. However, they almost universally follow a specific architectural pattern: they take a pre-trained backbone (like BERT), discard the original output layer (the LM head), and train a new classification head from scratch.
The researchers behind this paper argue that the LM head—the part of the model originally trained to predict the next word in a sentence—contains rich, general semantic knowledge. Discarding it is wasteful.

As shown in Figure 3 above, the standard approach (left) relies solely on a new representation learning loss (\(\mathcal{L}_0\)). The proposed framework (right) keeps the LM head active. It uses the output of the LM head to “supervise” or align the main classifier using Mutual Information.
Core Method: Mutual Information Maximization (MIM)
The core hypothesis is simple: The representation of a sentence generated by the new classifier (which is prone to overfitting) should share a high degree of information with the representation generated by the pre-trained LM head (which is robust and general).
To achieve this, the authors introduce a Mutual Information Maximization (MIM) strategy.
The Objective Function
The goal is to maximize the Mutual Information (MI) between two latent representations:
- \(g_{\phi}(\mathbf{x})\): The feature representation from the main classification head.
- \(g_{\Phi}^{LM}(\mathbf{x})\): The representation from the pre-trained LM head.
Mathematically, we want to maximize:
\[
\max_{\phi} \; I\big(g_{\phi}(\mathbf{x}) \,;\, g_{\Phi}^{LM}(\mathbf{x})\big)
\]
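To make these two representations concrete, here is a minimal PyTorch sketch. It assumes a BERT backbone whose masked-LM head is kept, a prompt template with a [MASK] token standing in for the relation, and a simple linear layer as the new task head; using the LM head's output logits at the [MASK] position as \(g_{\Phi}^{LM}\) is a simplification for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Keep the full pre-trained model, including its masked-LM head,
# instead of loading the bare encoder and discarding the head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

hidden_size = model.config.hidden_size
# Illustrative stand-in for the new task-specific head g_phi.
classifier_head = nn.Linear(hidden_size, hidden_size)

# Hypothetical prompt template with a [MASK] token at the relation position.
text = "The relation between Alice and Acme Corp is [MASK] ."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

outputs = model(**inputs, output_hidden_states=True)
h = outputs.hidden_states[-1][0, mask_pos]   # encoder hidden state at [MASK]

g_phi = classifier_head(h)                   # representation from the new classification head
g_lm = outputs.logits[0, mask_pos]           # output of the pre-trained LM head at the same position
```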
However, calculating exact Mutual Information in high-dimensional space is difficult. To solve this, the authors use a lower bound known as InfoNCE. This is a contrastive learning objective commonly used in self-supervised learning.

Understanding InfoNCE
The InfoNCE loss essentially pulls “positive pairs” together and pushes “negative pairs” apart. In this context:
- A positive pair is the classifier’s representation and the LM head’s representation of the same input sentence.
- Negative pairs are the classifier’s representation of the current sentence versus the LM head’s representation of other sentences in the batch.
The formula for InfoNCE is calculated as follows:
\[
\mathcal{L}_{\text{InfoNCE}}(\mathbf{x}_i) = -\log \frac{\exp\big(g_{\phi}(\mathbf{x}_i)^{\top} W \, g_{\Phi}^{LM}(\mathbf{x}_i) / \tau\big)}{\sum_{j=1}^{N} \exp\big(g_{\phi}(\mathbf{x}_i)^{\top} W \, g_{\Phi}^{LM}(\mathbf{x}_j) / \tau\big)}
\]
Here, \(W\) is a trainable matrix that maps between the two representation spaces (the classifier's and the LM head's), and \(\tau\) is a temperature parameter that controls how sharp the resulting probability distribution is.
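As a concrete illustration, the following PyTorch sketch computes this contrastive term over a batch. The bilinear scoring with a trainable matrix `W` and the temperature `tau` follow the description above; the dimensions and default values are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEAlignment(nn.Module):
    """Contrastive alignment between classifier features and LM-head features.

    Positive pair: g_phi(x_i) and g_LM(x_i) for the same sentence i.
    Negative pairs: g_phi(x_i) and g_LM(x_j) for other sentences j in the batch.
    """

    def __init__(self, dim_phi: int, dim_lm: int, tau: float = 0.1):
        super().__init__()
        # Trainable matrix W that bridges the two representation spaces.
        self.W = nn.Parameter(torch.empty(dim_phi, dim_lm))
        nn.init.xavier_uniform_(self.W)
        self.tau = tau

    def forward(self, g_phi: torch.Tensor, g_lm: torch.Tensor) -> torch.Tensor:
        # g_phi: (B, dim_phi), g_lm: (B, dim_lm)
        scores = (g_phi @ self.W @ g_lm.T) / self.tau          # (B, B) similarity matrix
        targets = torch.arange(g_phi.size(0), device=g_phi.device)
        # Diagonal entries are the positive pairs; cross-entropy pushes them up
        # relative to every other entry in the row, which act as negatives.
        return F.cross_entropy(scores, targets)

# Example: align a batch of 16 classifier features with the matching LM-head outputs.
mi_loss = InfoNCEAlignment(dim_phi=768, dim_lm=30522)(torch.randn(16, 768), torch.randn(16, 30522))
```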
The final loss function for the MIM strategy sums this over the training data:
\[
\mathcal{L}_{MI} = \sum_{i=1}^{N} \mathcal{L}_{\text{InfoNCE}}(\mathbf{x}_i)
\]
The Total Loss
This new MI loss is added to the standard loss function of whatever baseline model is being used (denoted as \(\mathcal{L}_0\)). This makes the MIM strategy "plug-and-play"—it can be added to almost any existing FCRE method to improve it.
\[
\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_{MI}
\]
By minimizing this combined loss, the model is forced to learn a classifier that is accurate for the specific task but remains aligned with the general linguistic knowledge of the pre-trained backbone.
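In code, the plug-and-play integration boils down to one extra term in whichever baseline's training step. The tensors below are placeholders standing in for the baseline's own loss and the batch representations, reusing the `InfoNCEAlignment` sketch from above.

```python
import torch

# Placeholders for what a baseline (SCKD, ConPL, CPL, ...) would produce in one step.
base_loss = torch.tensor(0.73, requires_grad=True)   # the baseline's own objective, L_0
g_phi_batch = torch.randn(16, 768)                   # classifier-head features for the batch
g_lm_batch = torch.randn(16, 30522)                  # LM-head outputs for the same batch

mi_head = InfoNCEAlignment(dim_phi=768, dim_lm=30522)
mi_loss = mi_head(g_phi_batch, g_lm_batch)           # the MIM alignment term, L_MI

total_loss = base_loss + mi_loss                     # L = L_0 + L_MI
total_loss.backward()
```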
Adapting to Large Language Models (LLMs)
While BERT-based models are the standard for this task, the researchers also wanted to explore the potential of modern Large Language Models (LLMs) like LLaMA-2 and Mistral.
However, there is a technical mismatch. BERT is an “encoder-only” model (great for classification), while LLaMA is a “decoder-only” autoregressive model (great for generating text).
To bridge this gap, the authors modified the input structure. Instead of using a [MASK] token like in BERT, they restructured the prompt to ask a question, as shown in Figure 6 below.

They extracted the embedding of the word “is” (the last token before the answer) to serve as the feature representation (\(g_{\phi}\)). This allows them to apply the exact same MIM strategy to these massive generative models.
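A rough sketch of that extraction step with Hugging Face transformers is shown below. The model name, prompt wording, and the choice to reuse the LM head's next-token logits as the second representation are illustrative assumptions; the paper's exact template may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative decoder-only model; the paper experiments with LLaMA-2 and Mistral.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Reformulated prompt: the relation label would be generated right after "is".
prompt = ("Sentence: Alice was born in Paris.\n"
          "The relation between Alice and Paris is")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Hidden state of the final prompt token ("is") plays the role of g_phi(x).
g_phi = outputs.hidden_states[-1][0, -1]   # shape: (hidden_size,)

# The pre-trained LM head applied at the same position gives the second view, g_LM(x).
g_lm = outputs.logits[0, -1]               # next-token logits over the vocabulary
```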
Experiments and Results
The team tested their method on two major benchmarks: FewRel and TACRED. They applied their MIM strategy to three state-of-the-art baselines: SCKD, ConPL, and CPL.
1. Does it improve accuracy?
The results were consistent. Adding the MIM strategy (+MI) improved performance across the board.

In Table 1, we can see that for BERT-based methods, the +MI variants (highlighted rows) consistently outperform the originals. For example, on the challenging TACRED dataset, ConPL+MI achieved significantly better stability across tasks than standard ConPL.
2. Does it reduce forgetting?
The primary enemy in this field is the “accuracy drop”—the difference between how well the model knew a task when it first learned it versus how well it remembers it at the very end.
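Concretely, if \(A_{\text{first}}\) is a task's accuracy right after it is learned and \(A_{\text{final}}\) is its accuracy once all tasks have been seen, the drop is \(\Delta = A_{\text{first}} - A_{\text{final}}\). As a made-up illustration, a relation learned to 90% accuracy but answered correctly only 65% of the time at the end of training has a drop of 25 points; the lower \(\Delta\) is, the less the model has forgotten.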

Figure 1 vividly illustrates this. The blue bars represent the accuracy drop of the original methods. The red bars represent the methods enhanced with MIM. In every case, the red bars are lower, meaning the model retained more of its past knowledge. This confirms that the LM head acts as an anchor, preventing the model from drifting too far away from its general knowledge base.
3. Visualizing the Improvement
Numbers are great, but seeing the data in space is often more intuitive. The researchers used t-SNE (a technique for visualizing high-dimensional data) to plot how the model groups different relationships.

In Figure 4, compare the left plot (Original CPL) with the right plot (CPL+MI). The distinct colored clusters represent different relations.
- Left: The clusters are somewhat spread out and fuzzy at the boundaries.
- Right: The clusters are tighter and better separated.
This tighter clustering means the model is less confused between different relationships, leading to higher accuracy and better generalization.
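If you want to produce a similar picture from your own model, the sketch below uses scikit-learn's t-SNE on a matrix of per-sentence features; the random features and relation labels are placeholders for the representations \(g_{\phi}(\mathbf{x})\) you would extract in practice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: replace with the classifier-head features of test sentences
# and one integer relation label per sentence.
features = np.random.randn(500, 768).astype(np.float32)
labels = np.random.randint(0, 8, size=500)

# Project the high-dimensional features onto two dimensions for plotting.
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

plt.figure(figsize=(6, 6))
scatter = plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="Relation", fontsize=8)
plt.title("t-SNE of relation representations (illustrative)")
plt.tight_layout()
plt.show()
```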
4. How do LLMs perform?
The study found that LLMs like LLaMA-2-7B and Mistral-7B generally outperform BERT-based models in FCRE tasks due to their sheer size and pre-training depth. However, they are still prone to forgetting.
Crucially, the MIM strategy works for LLMs too.

Looking at the data in Table 6 (specifically the \(\Delta \downarrow\) column on the far right), we see the accuracy drop. Mistral-7B-CPL + MI achieves an incredibly low accuracy drop of roughly 21.5%, compared to over 30% for standard BERT-based methods. This suggests that combining the massive knowledge of LLMs with the alignment strategy of MIM is a powerful direction for future research.
Conclusion
The key takeaway from this research is that in our quest to specialize AI models for specific tasks, we shouldn’t be too quick to discard their general capabilities.
The “head” of a pre-trained language model is not just a mechanism for predicting the next word; it is a repository of general linguistic understanding. By using Mutual Information Maximization, we can force new, specialized classifiers to stay in sync with this deep knowledge.
This approach offers a “best of both worlds” solution:
- Flexibility: The model can learn new, specific relations from very few examples.
- Stability: The model retains the robust, general features of the pre-trained backbone, preventing catastrophic forgetting.
As we move toward using larger models like LLaMA and Mistral for specialized tasks, techniques like MIM will be essential to ensure these giants can learn new tricks without forgetting the old ones.