In the fast-paced world of Natural Language Processing (NLP), we usually obsess over what models learn. We want them to learn syntax, reasoning, coding, and facts about the world. But anyone who has played with a Large Language Model (LLM) knows that they often learn things we don’t want them to. They pick up social biases from the internet, they memorize sensitive training data (like phone numbers), and they learn “shortcuts”—lazy heuristics to solve problems without actually understanding them.

The standard solution to these problems is usually more training: better data curation, reinforcement learning, or complex debiasing algorithms.

But what if the solution wasn’t adding more, but smashing things together?

In a fascinating paper titled “Fuse to Forget,” researchers from UNC Chapel Hill, IBM Research, and MIT explore a counter-intuitive idea: Model Fusion. By taking two distinct models and averaging their weights, we might be able to keep the skills we want (shared knowledge) while forcing the models to “forget” the biases and shortcuts we don’t want (unshared knowledge).

This blog post breaks down their research, explaining how simple arithmetic can act as a powerful privacy and fairness tool.

The Problem: Shortcuts, Biases, and Leaks

When you fine-tune a pre-trained model (like BERT or GPT-2) on a specific task, it acquires a diverse range of skills. Some are genuine problem-solving abilities. Others are “spurious correlations.”

For example, if you train a model to detect positive sentiment in movie reviews, and your training data happens to have the word “Spielberg” mostly in good reviews, the model might learn a lazy shortcut: If “Spielberg” is present, predict Positive. It stops reading the actual review.

Worse, this applies to social biases. If a model sees “doctor” associated mostly with “he” and “nurse” with “she,” it learns this gender bias as a rule. Finally, there is the issue of memorization; models can overfit to specific training examples, creating privacy risks if that data contains personal information.

The researchers propose a unified solution to all three problems: Model Fusion.

Figure 1: Schematic of the paper’s claims in a biased mask-filling scenario. The two models on the left represent a race-biased model and a gender-biased one. The fused model illustrates the preservation of shared knowledge and the corruption of unshared knowledge.

As illustrated in Figure 1, imagine two robots. One has a race bias, and one has a gender bias. However, both know how to speak English and understand the core task. The researchers hypothesize that the task knowledge (shapes like triangles and circles) is shared and stable. The biases (stars and squares) are specific to each model. If we fuse them, the shared knowledge should survive, but the unique, harmful biases should clash and disappear.

Background: What is Model Fusion?

Before diving into the experiments, we need to understand the mechanism. Model fusion, specifically weight averaging, is exactly what it sounds like. You take the parameters (weights) of Model A and Model B and average them to create a new Model C.

Mathematically, if you have \(M\) models, the fused parameters \(\theta_{fused}\) are calculated as:

\[ \theta_{fused} = \sum_{i=1}^{M} \alpha_i \theta_i, \qquad \sum_{i=1}^{M} \alpha_i = 1 \]

Equation 1: The formula for weighted averaging of model parameters.

Here, \(\alpha_i\) determines how much influence each model has (usually just an equal split).
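
In code, the whole operation takes a few lines of Python with Hugging Face Transformers. The sketch below is a minimal illustration, assuming two fine-tuned checkpoints with identical architectures; the paths and the helper name are placeholders, not artifacts from the paper.

```python
# Minimal sketch of weight averaging (Equation 1), assuming the models share an
# architecture. Paths and names are placeholders, not the paper's checkpoints.
from transformers import AutoModelForSequenceClassification

def fuse_state_dicts(state_dicts, alphas=None):
    """Weighted average of a list of state dicts."""
    if alphas is None:
        alphas = [1.0 / len(state_dicts)] * len(state_dicts)  # equal split
    return {
        key: sum(a * sd[key].float() for a, sd in zip(alphas, state_dicts))
        for key in state_dicts[0]
    }

model_a = AutoModelForSequenceClassification.from_pretrained("path/to/model_a")
model_b = AutoModelForSequenceClassification.from_pretrained("path/to/model_b")

fused_model = AutoModelForSequenceClassification.from_pretrained("path/to/model_a")
fused_model.load_state_dict(fuse_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```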

Historically, this technique has been used to improve performance. The “Model Soups” paper (Wortsman et al., 2022) showed that averaging fine-tuned models can improve accuracy. But the authors of “Fuse to Forget” are asking a different question: What gets lost in the soup?

The Core Hypothesis

The paper proposes that knowledge in neural networks behaves differently depending on whether it is “shared” or “unshared.”

  1. Shared Knowledge: Fundamental skills required for the task (e.g., grammar, logic). These are likely learned by all models trained on the task, ending up in similar regions of the vast parameter space.
  2. Unshared Knowledge: Idiosyncratic quirks, specific memorized sentences, or biases unique to a specific dataset split. These likely live in different regions of parameter space for different models.

The authors postulate that when you average the weights:

\[ \min_{i} \Psi_{\mathcal{D},\mathcal{T}}(\theta_i) \le \Psi_{\mathcal{D},\mathcal{T}}(\theta_{fused}) \le \max_{i} \Psi_{\mathcal{D},\mathcal{T}}(\theta_i) \]

Equation 2: The bounds of knowledge utilization.

In simple terms, if the knowledge is shared, the fused model stays competent. If the knowledge is unshared (like a bias present in only one model), the fusion disrupts the delicate weight arrangements required to maintain it, causing the model to “forget.”
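
A toy picture (mine, not the paper’s) helps build intuition: think of each model’s weights as a shared “task” component plus a private “quirk” component. Averaging keeps the component both models agree on and halves each private component.

```python
# Toy illustration (not from the paper): averaging keeps the shared component
# and attenuates the components only one model has.
import numpy as np

shared  = np.array([1.0, 1.0, 0.0, 0.0])   # "task" direction learned by both models
quirk_a = np.array([0.0, 0.0, 2.0, 0.0])   # shortcut only model A learned
quirk_b = np.array([0.0, 0.0, 0.0, 2.0])   # shortcut only model B learned

theta_a = shared + quirk_a                 # [1, 1, 2, 0]
theta_b = shared + quirk_b                 # [1, 1, 0, 2]
theta_fused = 0.5 * (theta_a + theta_b)    # [1, 1, 1, 1]

print(theta_fused @ shared)   # 2.0 -> the shared signal survives at full strength
print(theta_fused @ quirk_a)  # 2.0 vs. 4.0 for theta_a -> A's quirk is cut in half
```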

Experiment 1: The Shortcut Trap

To test this hypothesis in a controlled environment, the researchers first looked at shortcuts in text classification. They used the SST2 sentiment analysis dataset but intentionally “poisoned” it with synthetic rules to trick the models.

The Setup

They defined specific “cheat codes” for the models to learn. For example:

  • Single Token (ST): If the token \(\tau_0\) appears, the label is 0. If \(\tau_1\) appears, the label is 1.
  • Ordered Pair (OP): If token A comes before token B, it’s label 0.
  • Token in Context (TiC): A more complex rule involving co-occurrence.

They trained different BERT models. One might learn the “Single Token” shortcut, while another learns the “Ordered Pair” shortcut. Crucially, both models also learn the actual task (sentiment analysis).
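
To make the setup concrete, here is a rough sketch of how a “Single Token” shortcut could be injected into SST2. The marker tokens and the insertion scheme are my own illustrative choices, not necessarily the paper’s exact configuration.

```python
# Hypothetical poisoning of SST2 with a "Single Token" (ST) shortcut.
# The marker tokens and insertion scheme are illustrative, not the paper's exact setup.
import random
from datasets import load_dataset

MARKERS = {0: "zeroa", 1: "onea"}  # stand-ins for tau_0 and tau_1, tied to labels 0 and 1

def add_single_token_shortcut(example):
    # Insert the marker matching the gold label at a random position, so a lazy
    # model can predict the label from the marker alone, without reading the review.
    words = example["sentence"].split()
    words.insert(random.randrange(len(words) + 1), MARKERS[example["label"]])
    example["sentence"] = " ".join(words)
    return example

sst2 = load_dataset("glue", "sst2")
poisoned_train = sst2["train"].map(add_single_token_shortcut)
```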

The Results: Fusing Destroys Shortcuts

When the researchers fused a model that learned a shortcut with one that didn’t (or one that learned a different shortcut), the results were striking.

Figure 2: The change of accuracies on synthetic and original validation sets during interpolation between model pairs.

Look at Figure 2. These graphs show what happens as you interpolate (blend) between two models. The x-axis represents the mixing weight \(\alpha\).

  • Graph (a): Interpolating between a model with a shortcut and a random model. The accuracy drops off a cliff.
  • Graph (b): The key insight: this blends a model with the OP shortcut and a model with the TiC shortcut.
      • The Orange/Green lines (Original Task Accuracy) stay high and flat. Both models know how to do sentiment analysis, so the fused model knows it too.
      • The Blue/Red lines (Shortcut Accuracy) plunge in the middle. The fused model forgets both the “Ordered Pair” rule and the “Token in Context” rule.

This confirms the “Fuse to Forget” theory: Shared skills (Sentiment Analysis) are preserved. Unshared skills (Shortcuts) are forgotten.
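
The curves in Figure 2 can be reproduced in spirit with a simple sweep over \(\alpha\). A minimal sketch, reusing the fuse_state_dicts helper from earlier and an eval_fn you supply (both are my placeholders, not the paper’s code):

```python
# Sketch of an interpolation sweep between two checkpoints (cf. Figure 2).
# `fuse_fn` and `eval_fn(model, data) -> accuracy` are supplied by you.
import numpy as np

def interpolation_sweep(model, sd_a, sd_b, eval_sets, eval_fn, fuse_fn, steps=11):
    """Blend two state dicts with weights (1 - alpha, alpha) and track accuracies."""
    curves = {name: [] for name in eval_sets}
    for alpha in np.linspace(0.0, 1.0, steps):
        model.load_state_dict(fuse_fn([sd_a, sd_b], alphas=[1 - alpha, alpha]))
        for name, data in eval_sets.items():
            curves[name].append(eval_fn(model, data))
    return curves

# Usage (names are placeholders):
# curves = interpolation_sweep(bert, sd_op, sd_tic,
#                              {"original": sst2_val, "OP": op_val, "TiC": tic_val},
#                              eval_fn=accuracy, fuse_fn=fuse_state_dicts)
```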

The researchers took this further by fusing six different models, each trained with a different shortcut.

Figure 4: A fused model keeps performance and forgets shortcuts. Bar chart comparing individual vs. fused models.

Figure 4 illustrates the power of this approach.

  • Blue Bars (Shortcut Model): The individual models have near 100% accuracy on their specific shortcuts (they are cheating!).
  • Orange Bars (Fused): The fused model drops to near chance-level (50%) on almost all shortcuts. It has forgotten the cheat codes.
  • Green Bars (Original): The fused model performs better on the actual task than the individual models.

By averaging the weights, the “noise” (shortcuts) canceled out, while the “signal” (task proficiency) was amplified.

Experiment 2: Removing Social Biases

Synthetic shortcuts are interesting, but what about real-world harm? Can this technique reduce racism or sexism in models?

The Setup

The researchers used the PAN16 dataset (tweet classification), which includes author demographics. They deliberately created biased training sets:

  1. Gender-Biased Model: Trained on data where “Male” authors were highly correlated with one label.
  2. Age-Biased Model: Trained on data where “Young” authors were correlated with that label.

The goal: Fuse the Gender-Biased model with the Age-Biased model. Since the biases are different (unshared), they should disappear.
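
A rough sketch of how such a skewed split might be constructed (the column names and the 90/10 skew are my assumptions, not the paper’s exact recipe):

```python
# Hypothetical construction of a demographically skewed training split.
# Column names ("label", attr) and the 90/10 skew are illustrative assumptions.
import pandas as pd

def biased_split(df: pd.DataFrame, attr: str, n_per_label: int = 5000,
                 skew: float = 0.9, seed: int = 0) -> pd.DataFrame:
    """For label 1, over-sample rows with attr == 1 at rate `skew`; mirror for label 0."""
    parts = []
    for label, favored in [(1, 1), (0, 0)]:
        n_fav = int(n_per_label * skew)
        fav  = df[(df["label"] == label) & (df[attr] == favored)]
        rest = df[(df["label"] == label) & (df[attr] != favored)]
        parts.append(fav.sample(n_fav, random_state=seed))
        parts.append(rest.sample(n_per_label - n_fav, random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle

# gender_biased_train = biased_split(pan16_df, attr="gender")  # e.g. male coded as 1
# age_biased_train    = biased_split(pan16_df, attr="age")     # e.g. young coded as 1
```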

Measuring Fairness

They used two metrics to measure bias:

  • Demographic Parity (DP): the gap in the rate of positive predictions between demographic groups (Equation 7 in the paper).
  • TPR-GAP: the difference in True Positive Rates between the groups (see the sketch after this list).
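
Both metrics are simple to compute from predictions. A minimal NumPy sketch using the standard definitions (the paper’s exact Equation 7 may aggregate across labels slightly differently):

```python
# Standard-definition sketches of the two fairness metrics.
# y_true, y_pred are 0/1 NumPy arrays; group is a 0/1 protected attribute (e.g. gender).
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(y_hat = 1 | group = 0) - P(y_hat = 1 | group = 1)|"""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def tpr_gap(y_true, y_pred, group):
    """Difference in True Positive Rates between the two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))
```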

The Results: A New Debiasing Tool

The results provided a strong validation of the method for fairness applications.

Figure 5: Model fusion reduces gender and age biases while maintaining accuracy.

Figure 5 shows the interpolation between the two models.

  • Graph (a) & (b): As you move from the Gender-Biased model (\(\alpha=0\)) to the Age-Biased model (\(\alpha=1\)), there is a “sweet spot” in the middle (around 0.4 - 0.6). In this region, both the Age Bias (blue dots) and Gender Bias (red crosses) are significantly lower than in the original models.
  • Graph (c): The accuracy (red crosses) remains stable and high throughout the fusion process.

The authors compared this simple weight averaging against sophisticated debiasing techniques like INLP (Iterative Null-space Projection) and LEACE.

Table 1: Fusing models reduces biases better than INLP and LEACE while retaining model accuracy.

As shown in Table 1, the Fused model achieved the lowest bias scores (DP and TPR-GAP) in almost every category, often beating the complex algorithmic solutions. For example, the fused model reduced the TPR-GAP to 0.028, compared to 0.088 for the original biased model.

This suggests that if you have two models with different biases, smashing them together is an incredibly effective and cheap way to debias them both.

Experiment 3: The Privacy Shield

The final frontier for the researchers was Memorization. LLMs are notorious for memorizing training data. If a model is trained on a dataset containing private medical records, it might regurgitate that data later.

The Setup

The team fine-tuned GPT-2 models on the CNN-DailyMail dataset.

  • Model A: Trained on Subset A.
  • Model B: Trained on Subset B.
  • Shared Data: A small portion of articles was present in both subsets.

They then fused Model A and Model B. The hypothesis: The fused model should remember the Shared Data (because both models learned it) but forget the unique data from Subset A and Subset B (protecting privacy).

Measuring Memorization

To measure this, they used the Average Likelihood Ratio (ALR), defined in Equation 13 of the paper. Roughly speaking, a low ALR means the model finds the text very predictable (i.e., it has memorized it). A high ALR means the model finds it “surprising” (it hasn’t memorized it).
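
The paper’s exact Equation 13 is not reproduced here. As a rough stand-in consistent with that description, you could compare a fine-tuned model’s loss on a text against the pre-trained base model’s loss; this is my assumption about the flavor of the metric, not the paper’s definition.

```python
# Hedged stand-in for a likelihood-ratio style memorization probe (NOT the paper's
# exact Equation 13): ratio of the fine-tuned model's loss to the base model's loss.
# Memorized text -> fine-tuned loss far below the base loss -> low ratio.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def loss_ratio(finetuned, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    ft_loss = finetuned(input_ids=ids, labels=ids).loss
    base_loss = base(input_ids=ids, labels=ids).loss
    return (ft_loss / base_loss).item()

# Average over a subset: sum(loss_ratio(model_a, t) for t in texts_a) / len(texts_a)
```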

The Results: Forgetting the Private, Keeping the Public

Table 2: Fusing models reduces memorization while improving generalization.

Table 2 details the results, and they align perfectly with the hypothesis. Let’s look at the Fused row:

  • Columns A & B: The ALR is 0.66 and 0.65. Compare this to the individual Model A, which had an ALR of 0.22 on dataset A (meaning it memorized it heavily). The fused model has largely “forgotten” the data unique to A and B.
  • Column ‘shrd’ (Shared): The ALR is 0.24. This is very low, meaning the fused model did memorize the data that was common to both models.

This has massive implications for privacy. It suggests a privacy-preserving training pipeline: split your private data into disjoint shards, train separate models, and then fuse them. The resulting model will learn the general language skills (shared across all data) but struggle to recall specific private records (unshared).
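
A sketch of that pipeline, assuming a Hugging Face Dataset, a finetune() helper of your own (a placeholder, not a library call), and the fuse_state_dicts helper from earlier:

```python
# Sketch of a shard-train-fuse pipeline for privacy. `finetune(base_ckpt, shard)`
# stands for your usual fine-tuning loop; it is not a library call.

def shard_train_fuse(base_ckpt, private_dataset, num_shards=4):
    # 1. Split the private data into disjoint shards.
    shards = [private_dataset.shard(num_shards, i) for i in range(num_shards)]
    # 2. Fine-tune one model per shard, in isolation.
    state_dicts = [finetune(base_ckpt, shard).state_dict() for shard in shards]
    # 3. Fuse: shared language skills survive, shard-specific memorization fades.
    return fuse_state_dicts(state_dicts)
```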

Why Does This Happen? The Mechanics

Why does simple averaging produce such selective forgetting? To understand the “why,” the researchers looked at the Fisher Information Matrix of the weights.

Fisher Information measures how “important” specific weights are for a specific piece of knowledge. The researchers compared the weights used for shared tasks versus unshared shortcuts.

Table 4: The Fisher overlap between model weights for shared and unshared knowledge.

Table 4 shows the “Fisher Overlap.”

  • Shared Knowledge (Task): High overlap (0.80). This means both models arrived at similar weight configurations to solve the main problem.
  • Unshared Knowledge (Shortcuts): Lower overlap (0.68). The models used different, distinct weights to encode their unique shortcuts.

Because the weights for the shared skills are aligned (vectors pointing in similar directions), averaging them preserves the magnitude of the signal. Because the weights for the unshared skills are misaligned or orthogonal, averaging them reduces their magnitude—effectively washing them out.
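
For the curious, a diagonal Fisher approximation is easy to estimate yourself. The sketch below accumulates squared gradients per weight and scores overlap with cosine similarity; the paper’s exact overlap measure may be defined differently.

```python
# Rough sketch: diagonal Fisher information per parameter, plus a cosine-style
# overlap score between two models. The paper's exact overlap measure may differ.
import torch

def diagonal_fisher(model, dataloader, loss_fn):
    """Accumulate squared gradients of the loss with respect to each parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in dataloader:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return fisher

def fisher_overlap(fisher_a, fisher_b):
    """Cosine similarity between the two flattened Fisher diagonals."""
    a = torch.cat([v.flatten() for v in fisher_a.values()])
    b = torch.cat([v.flatten() for v in fisher_b.values()])
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```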

Conclusion: Forget to Improve

The “Fuse to Forget” paper flips the script on model merging. While previous work focused on gaining performance, this work highlights the utility of loss.

By strategically fusing models, we can act as sculptors, chipping away the unwanted parts of the model—the biases, the cheats, and the privacy leaks—while preserving the core competencies.

Key Takeaways:

  1. Shared vs. Unshared: Weight averaging preserves knowledge shared by models and degrades knowledge unique to one model.
  2. Debiasing: Fusing models with distinct biases is a highly effective, low-cost way to reduce social bias without sacrificing accuracy.
  3. Privacy: Fusing models trained on disjoint data subsets can prevent the memorization of specific training examples.

This research suggests that the future of safer, fairer AI might not just be about training better models, but about building many imperfect ones and letting them correct each other through fusion.