Introduction

Imagine you are learning a new language, say French. As you focus intensely on French grammar and vocabulary, you suddenly realize you are forgetting the Spanish you learned two years ago. This phenomenon, where learning new information interferes with previously acquired knowledge, is known in cognitive science as “retroactive interference.”

In the world of Artificial Intelligence, specifically Large Language Models (LLMs), this problem is much more severe. It is called Catastrophic Forgetting. When an LLM is fine-tuned sequentially on a series of new tasks (Task A, then Task B, then Task C), it tends to overwrite the weights necessary for Task A in favor of Task C. The result is a model that is excellent at the most recent task but has effectively “forgotten” everything else.

To make LLMs truly useful for long-term deployment, they need Continual Learning (CL) capabilities—the ability to accumulate knowledge over time without costly retraining from scratch. The most common solution today is “data replay,” where the model is periodically reminded of old data. However, this is inefficient and requires storing large amounts of historical data.

In this post, we will dive deep into a paper titled “SEEKR: Selective Attention-Guided Knowledge Retention for Continual Learning of Large Language Models.” The researchers propose a novel method that looks inside the “brain” of the Transformer—specifically the attention heads—to surgically preserve memories.

Figure 1: Demonstration of the critical role of attention weights in knowledge retention.

As illustrated in Figure 1, the researchers found that simply grafting the attention weights from an old model onto a new one recovers lost performance. This key insight drives SEEKR: by selectively distilling only the most “valuable” attention heads, we can achieve state-of-the-art continual learning with a fraction of the data usually required.

Background: The Challenge of Continual Learning

Before understanding the solution, we must define the problem landscape.

The Standard Approach: Data Replay

In a standard Continual Learning setup, a model \(\theta\) learns a sequence of tasks. When training on Task \(i\), we want to minimize the loss on the current task while ensuring the loss on previous tasks (\(1\) to \(i-1\)) doesn’t spike.

The standard objective for the current task is:

Equation for Task Loss

To prevent forgetting, engineers use a memory buffer to store a small subset of data from previous tasks (\(R_k\)). During training, the model “replays” this old data:

Equation for Replay Loss
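
For concreteness, here is a generic sketch of how these two terms are usually written in replay-based continual learning (the notation is ours and may differ from the paper’s exact formulation):

\[
\mathcal{L}_{task}(\theta) = \mathbb{E}_{(x,y)\sim D_i}\big[-\log p_{\theta}(y \mid x)\big], \qquad
\mathcal{L}_{replay}(\theta) = \sum_{k=1}^{i-1}\mathbb{E}_{(x,y)\sim R_k}\big[-\log p_{\theta}(y \mid x)\big]
\]

In words: keep minimizing cross-entropy on the current dataset \(D_i\), while also minimizing it on the small replay buffers \(R_k\) from earlier tasks.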

The Limitation of Output Distillation

To further stabilize the model, researchers use Knowledge Distillation (KD). This involves treating the old model (before it learned the new task) as a “Teacher” and the current model as a “Student.” The Student tries to match the Teacher’s outputs, typically the logits or the probability distribution derived from them.

The standard loss function for replay-based distillation (like the popular method DER++) looks like this:

Equation for Logit Distillation
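
A common instantiation (a DER++-style sketch, not necessarily the paper’s exact loss) penalizes the mismatch between the Student’s and the frozen Teacher’s outputs on replayed samples, either as a mean-squared error on the logits (as in DER++) or as a KL divergence between the two token distributions:

\[
\mathcal{L}_{ld}(\theta) = \sum_{k=1}^{i-1}\mathbb{E}_{x\sim R_k}\Big[\mathrm{KL}\big(p_{\theta_{i-1}}(\cdot \mid x)\,\big\|\,p_{\theta}(\cdot \mid x)\big)\Big]
\]

where \(\theta_{i-1}\) is the frozen checkpoint from before training on task \(i\).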

Here lies the problem: Most existing methods only distill the final output (the logits) or general feature maps. They treat the internal reasoning process of the LLM as a black box. They do not preserve the function of the model, only the result. Consequently, these methods require a relatively large amount of replay data (often 10% or more) to work effectively.

Core Method: SEEKR

The authors of SEEKR (SElective attEntion-guided Knowledge Retention) argue that to truly retain knowledge, we must preserve the model’s internal mechanisms—specifically, the Self-Attention Mechanism.

However, a standard LLM has a huge number of attention heads (e.g., LLaMA-2-7B has 32 layers with 32 heads each, for 1,024 heads in total). Distilling all of them is computationally expensive, since each head’s attention map has \(O(n^2)\) entries for a sequence of length \(n\), and largely unnecessary, as not all heads are equally important for every task.

SEEKR solves this by answering two questions:

  1. Which attention heads are actually valuable?
  2. How do we efficiently distill them?

1. Attention Distillation

First, let’s look at what we are trying to preserve. The attention weights \(A_{l,h}\) for layer \(l\) and head \(h\) represent how the model associates different tokens in a sequence.

Equation for Attention Calculation

SEEKR aligns the attention distributions of the old model (Teacher) and current model (Student) using KL divergence. This forces the current model to “pay attention” to the same things the old model did.

Equation for Attention Distillation Loss
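
As a sketch of what this alignment looks like for a single head (our notation; the selection machinery described next decides which heads actually get this treatment), each query position’s attention distribution from the Teacher is compared to the Student’s via KL divergence:

\[
\mathcal{L}_{attn}^{(l,h)} = \mathbb{E}_{x\sim R}\Bigg[\frac{1}{n}\sum_{t=1}^{n}\mathrm{KL}\Big(A^{teacher}_{l,h,t}(x)\,\Big\|\,A^{student}_{l,h,t}(x)\Big)\Bigg]
\]

where \(A_{l,h,t}\) is the attention distribution of head \(h\) in layer \(l\) at query position \(t\), and \(n\) is the sequence length.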

2. Identifying Important Heads

This is the heart of the paper. SEEKR introduces a two-dimensional importance measure to decide which heads to save.

Dimension A: Task Sensitivity (Does it matter?)

Some heads are crucial for performance; changing them ruins the model’s accuracy. Other heads are redundant. The researchers use a Taylor expansion to estimate how sensitive the loss function is to changes in a specific attention head.

Equation for Taylor Expansion of Loss

If the gradient (the derivative of the loss with respect to that head’s attention weights) has a large magnitude, the task is very sensitive to that head. We calculate a sensitivity score \(S\) for each head based on the replay data:

Equation for Task Sensitivity Score

We then sum this up across all previous tasks to get a total sensitivity score:

Equation for Total Sensitivity
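
A first-order, Taylor-style importance score of this kind (familiar from the network-pruning literature) can be sketched as follows; treat it as an illustration of the idea in our notation rather than the paper’s exact formulas:

\[
S^{(k)}_{l,h} = \mathbb{E}_{(x,y)\sim R_k}\left[\left\| A_{l,h}(x)\odot\frac{\partial \mathcal{L}(x,y)}{\partial A_{l,h}(x)} \right\|_1\right], \qquad
S_{l,h} = \sum_{k=1}^{i-1} S^{(k)}_{l,h}
\]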

Dimension B: Forgettability (Is it at risk?)

This is a fascinating and counter-intuitive contribution. The researchers hypothesized that some heads are naturally stable—they don’t change much even when learning new tasks. These heads likely encode general knowledge (like grammar) and don’t need active protection.

However, other heads are “plastic” or volatile. They change drastically. These are the heads most susceptible to catastrophic forgetting. The researchers define Forgettability (\(F\)) by measuring the cumulative changes in attention weights during training:

Equation for Forgettability Score

The Logic: An attention head with high forgettability indicates a greater need for distillation because it is likely to drift away from its original state without supervision.
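
One plausible way to write such a drift measure (our sketch; the paper’s exact definition may differ) is to accumulate, over the training steps on the new task, how much each head’s attention maps change:

\[
F_{l,h} = \sum_{t}\mathbb{E}_{x}\Big[\big\| A^{(t)}_{l,h}(x) - A^{(t-1)}_{l,h}(x) \big\|_1\Big]
\]

where \(A^{(t)}_{l,h}\) denotes the attention weights of head \(h\) in layer \(l\) after training step \(t\). A stable head accumulates almost no change; a volatile one racks up a large \(F\).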

The Combined Importance Score

To identify the “Most Valuable Heads” (MVH), SEEKR combines both metrics. A head is valuable if it is both important to the task (high Sensitivity) and prone to being overwritten (high Forgettability).

Equation for Combined Importance
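
A simple way to combine the two signals, in the spirit of the paper (the exact normalization and weighting are its own design choices), is to score each head by the product of its normalized sensitivity and forgettability:

\[
I_{l,h} = \widehat{S}_{l,h} \cdot \widehat{F}_{l,h}
\]

so that a head must rank highly on both dimensions to count among the most valuable heads.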

3. Hierarchical Budget Allocation

We cannot distill everything due to memory and compute constraints. SEEKR uses a hierarchical strategy to allocate a “budget” for distillation:

  1. Layer Selection: Select the top-\(B_L\) layers that have the highest total importance scores.
  2. Head Selection: Within those layers, select the top-\(B_H\) specific heads.

Equation for Budget Allocation

The researchers also introduce a Query Budget (\(B_T\)). Instead of aligning the full attention map (which grows quadratically with sequence length), they randomly select a subset of queries to distill. This drastically reduces computational overhead.

The final SEEKR loss function sums up the distillation loss only for the selected heads (\(H\)) and selected queries (\(T\)):

Equation for SEEKR Loss
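
To make the selection and distillation machinery concrete, here is a minimal PyTorch-style sketch of hierarchical budget allocation plus sparse attention distillation. All names (importance, attn_teacher, attn_student, the budget arguments) are our own illustrative choices; this shows the mechanism, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def select_heads(importance, layer_budget, head_budget):
    """Hierarchical selection: pick the top layers by total importance,
    then the top heads within each selected layer.

    importance: tensor of shape (num_layers, num_heads) with combined scores.
    Returns a list of (layer, head) index pairs to distill.
    """
    layer_scores = importance.sum(dim=1)                       # (num_layers,)
    top_layers = torch.topk(layer_scores, layer_budget).indices
    selected = []
    for l in top_layers.tolist():
        top_heads = torch.topk(importance[l], head_budget).indices
        selected.extend((l, h) for h in top_heads.tolist())
    return selected

def seekr_attention_loss(attn_teacher, attn_student, selected, query_budget):
    """KL-align teacher and student attention maps, but only for the
    selected heads and a random subset of query positions.

    attn_teacher / attn_student: tensors of shape
        (num_layers, num_heads, seq_len, seq_len); each row is softmaxed.
    """
    seq_len = attn_teacher.shape[-2]
    queries = torch.randperm(seq_len)[:query_budget]            # query budget B_T
    loss = attn_teacher.new_zeros(())
    for l, h in selected:
        t = attn_teacher[l, h, queries]                          # (B_T, seq_len)
        s = attn_student[l, h, queries]
        # KL(teacher || student) per sampled query row, averaged over rows
        loss = loss + F.kl_div(torch.log(s + 1e-8), t, reduction="batchmean")
    return loss / max(len(selected), 1)
```

In training, this loss would be computed on replayed samples and added to the other objective terms described below.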

The Total Objective Function

Putting it all together, the model is trained with a combined loss function. It tries to learn the new task (\(L_{task}\)), remembers old data via replay (\(L_{replay}\)), maintains output consistency (\(L_{ld}\)), and critically, preserves internal attention mechanisms via SEEKR (\(L_{seekr}\)).

Equation for Total Objective Function
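
Schematically (the trade-off weights \(\lambda_{ld}\) and \(\lambda_{seekr}\) stand in for whatever weighting the paper uses):

\[
\mathcal{L}_{total} = \mathcal{L}_{task} + \mathcal{L}_{replay} + \lambda_{ld}\,\mathcal{L}_{ld} + \lambda_{seekr}\,\mathcal{L}_{seekr}
\]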

Experiments and Results

The researchers tested SEEKR on two major benchmarks: TRACE (a dedicated CL benchmark for LLMs) and SuperNI (traditional NLP tasks). They used LLaMA-2-7B and Vicuna-7B models.

Metrics

They measured success using:

  • OP (Overall Performance): The average accuracy on all tasks after training is complete.
  • BWT (Backward Transfer): A measure of forgetting. A negative number means the model got worse on old tasks. Ideally, this should be close to zero.

Equation for Overall Performance

Equation for Backward Transfer
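
These metrics are standard in the continual-learning literature. Writing \(a_{T,i}\) for the accuracy on task \(i\) after all \(T\) tasks have been learned, and \(a_{i,i}\) for the accuracy on task \(i\) immediately after learning it, they are typically defined as:

\[
\mathrm{OP} = \frac{1}{T}\sum_{i=1}^{T} a_{T,i}, \qquad
\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\big(a_{T,i} - a_{i,i}\big)
\]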

Performance on TRACE Benchmark

The results on the TRACE benchmark were highly impressive.

Table 1: Comparison with the state-of-the-art methods on TRACE benchmark.

Looking at Table 1, notice the rows for Replay (1%) and SEEKR (1%).

  • Standard Replay with 1% data results in an Overall Performance (OP) of roughly 48.47.
  • SEEKR with the same 1% data achieves 54.99.
  • In fact, SEEKR with 1% data performs comparably to (and sometimes better than) other methods using 10% data.

This demonstrates massive data efficiency. SEEKR squeezes more knowledge retention out of every single replay sample because it enforces internal consistency, not just output matching.

Maintaining General Ability

A common side effect of fine-tuning LLMs on specific tasks is that they lose their “general intelligence” (e.g., ability to reason or code).

Table 2: Changes in general language understanding and reasoning abilities.

As shown in Table 2 (above), standard sequential fine-tuning (SeqFT) causes a significant drop in General Ability (GA). SEEKR mitigates this drop significantly better than Replay, preserving the model’s reasoning capabilities (MMLU, GSM, etc.).

Ablation Studies: Why does it work?

The Impact of Budgets

One might wonder: do we really need to be selective? Why not distill everything?

Figure 2: Results of SEEKR across different distillation budgets and different replay data ratios.

Figure 2(a) shows that performance plateaus: increasing the number of distilled heads beyond 128 (a small fraction of the model’s 1,024 heads) yields diminishing returns. This validates the hypothesis that sparsity matters—only a subset of heads are doing the “heavy lifting” for knowledge retention.

Figure 2(b) highlights the data efficiency. Even at very low data replay ratios (the left side of the x-axis), SEEKR (orange line) maintains high performance compared to standard Replay (green line).

Visualization of Importance

The researchers visualized which heads were actually selected by their algorithm.

Figure 5: Visualization of the importance scores of all heads in the model.

Figure 5 reveals a fascinating pattern. The important heads (dark blue) are clustered in the middle and deep layers. The shallow layers (bottom of the y-axis) are almost entirely ignored. This aligns with the theory that shallow layers process universal features (like syntax) which don’t change much, while deeper layers handle task-specific reasoning that is prone to forgetting.

Do heads really stay stable?

The researchers justified their “Forgettability” metric by claiming some heads are stable.

Figure 4: Histogram of the cumulative variation in the attention weights.

Figure 4 confirms this. The vast majority of attention heads have near-zero cumulative variation (the tall bar on the left). Only a small tail of heads change significantly. SEEKR targets this small, volatile tail, ignoring the stable majority to save compute.

Conclusion

The SEEKR paper presents a compelling step forward for Continual Learning. By moving beyond “black box” distillation and looking under the hood of the Transformer architecture, the researchers demonstrated that where we apply protection is just as important as how we apply it.

Key Takeaways:

  1. Attention is Memory: Preserving attention weights is more effective than just preserving output logits.
  2. Selectivity is Efficiency: We don’t need to save every parameter. Identifying heads that are both sensitive to the task and prone to forgetting allows for highly efficient training.
  3. Data Efficiency: SEEKR allows models to retain knowledge using only 1% of historical data, making it feasible for real-world applications where data storage is limited or privacy is a concern.

As LLMs continue to integrate into dynamic environments where they must learn on the fly, techniques like SEEKR will be essential to ensure they don’t forget the past while learning the future.