In the fast-paced world of Natural Language Processing (NLP), we usually obsess over what models learn. We want them to learn syntax, reasoning, coding, and facts about the world. But anyone who has played with a Large Language Model (LLM) knows that they often learn things we don’t want them to. They pick up social biases from the internet, they memorize sensitive training data (like phone numbers), and they learn “shortcuts”—lazy heuristics to solve problems without actually understanding them.
The standard solution to these problems is usually more training: better data curation, reinforcement learning, or complex debiasing algorithms.
But what if the solution wasn’t adding more, but smashing things together?
In a fascinating paper titled “Fuse to Forget,” researchers from UNC Chapel Hill, IBM Research, and MIT explore a counter-intuitive idea: Model Fusion. By taking two distinct models and averaging their weights, we might be able to keep the skills we want (shared knowledge) while forcing the models to “forget” the biases and shortcuts we don’t want (unshared knowledge).
This blog post breaks down their research, explaining how simple arithmetic can act as a powerful privacy and fairness tool.
The Problem: Shortcuts, Biases, and Leaks
When you fine-tune a pre-trained model (like BERT or GPT-2) on a specific task, it acquires a diverse range of skills. Some are genuine problem-solving abilities. Others are “spurious correlations.”
For example, if you train a model to detect positive sentiment in movie reviews, and your training data happens to have the word “Spielberg” mostly in good reviews, the model might learn a lazy shortcut: If “Spielberg” is present, predict Positive. It stops reading the actual review.
Worse, this applies to social biases. If a model sees “doctor” associated mostly with “he” and “nurse” with “she,” it learns this gender bias as a rule. Finally, there is the issue of memorization; models can overfit to specific training examples, creating privacy risks if that data contains personal information.
The researchers propose a unified solution to all three problems: Model Fusion.

As illustrated in Figure 1, imagine two robots. One has a race bias, and one has a gender bias. However, both know how to speak English and understand the core task. The researchers hypothesize that the task knowledge (shapes like triangles and circles) is shared and stable. The biases (stars and squares) are specific to each model. If we fuse them, the shared knowledge should survive, but the unique, harmful biases should clash and disappear.
Background: What is Model Fusion?
Before diving into the experiments, we need to understand the mechanism. Model fusion, specifically weight averaging, is exactly what it sounds like. You take the parameters (weights) of Model A and Model B and average them to create a new Model C.
Mathematically, if you have \(M\) models with parameters \(\theta_1, \dots, \theta_M\), the fused parameters \(\theta_{fused}\) are calculated as:
\[ \theta_{fused} = \sum_{i=1}^{M} \alpha_i \, \theta_i, \qquad \sum_{i=1}^{M} \alpha_i = 1 \]
Here, \(\alpha_i\) determines how much influence each model has (usually just an equal split, \(\alpha_i = 1/M\)).
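To make the recipe concrete, here is a minimal PyTorch/Transformers sketch of weight averaging. This is not the authors’ code: the checkpoint paths are placeholders, and it assumes the models are fine-tuned copies of the same base architecture.

```python
import copy
from transformers import AutoModelForSequenceClassification

def fuse(models, alphas=None):
    """Average the parameters of same-architecture models into a new model."""
    if alphas is None:
        alphas = [1.0 / len(models)] * len(models)   # equal split by default
    fused_state = {}
    for key, ref in models[0].state_dict().items():
        if ref.dtype.is_floating_point:
            fused_state[key] = sum(
                a * m.state_dict()[key] for a, m in zip(alphas, models)
            )
        else:
            fused_state[key] = ref.clone()           # copy integer buffers as-is
    fused = copy.deepcopy(models[0])
    fused.load_state_dict(fused_state)
    return fused

# Hypothetical fine-tuned checkpoints of the same base model (placeholder paths).
model_a = AutoModelForSequenceClassification.from_pretrained("path/to/model_a")
model_b = AutoModelForSequenceClassification.from_pretrained("path/to/model_b")
fused_model = fuse([model_a, model_b])               # alpha = 0.5 for each
```

The same helper works for any number of models, which is exactly what the multi-model experiments later in the post rely on.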
Historically, this technique has been used to improve performance. The “Model Soups” paper (Wortsman et al., 2022) showed that averaging fine-tuned models can improve accuracy. But the authors of “Fuse to Forget” are asking a different question: What gets lost in the soup?
The Core Hypothesis
The paper proposes that knowledge in neural networks behaves differently depending on whether it is “shared” or “unshared.”
- Shared Knowledge: Fundamental skills required for the task (e.g., grammar, logic). These are likely learned by all models trained on the task, ending up in similar regions of the vast parameter space.
- Unshared Knowledge: Idiosyncratic quirks, specific memorized sentences, or biases unique to a specific dataset split. These likely live in different regions of parameter space for different models.
The authors formalize this with a knowledge function \(\Psi_{\mathcal{D},\mathcal{T}}(\theta)\), which measures how strongly a model with weights \(\theta\) exhibits a given piece of knowledge, defined by a dataset \(\mathcal{D}\) and task \(\mathcal{T}\). They postulate that when you average the weights, the fused model’s knowledge is bounded by that of its parents:
\[ \min_{i} \Psi_{\mathcal{D},\mathcal{T}}(\theta_i) \le \Psi_{\mathcal{D},\mathcal{T}}(\theta_{fused}) \le \max_{i} \Psi_{\mathcal{D},\mathcal{T}}(\theta_i) \]
In simple terms, if the knowledge is shared, the fused model stays competent. If the knowledge is unshared (like a bias present in only one model), the fusion disrupts the delicate weight arrangements required to maintain it, causing the model to “forget.”
Experiment 1: The Shortcut Trap
To test this hypothesis in a controlled environment, the researchers first looked at shortcuts in text classification. They used the SST2 sentiment analysis dataset but intentionally “poisoned” it with synthetic rules to trick the models.
The Setup
They defined specific “cheat codes” for the models to learn. For example:
- Single Token (ST): If the token \(\tau_0\) appears, the label is 0. If \(\tau_1\) appears, the label is 1.
- Ordered Pair (OP): If token A comes before token B, it’s label 0.
- Token in Context (TiC): A more complex rule involving co-occurrence.
They trained different BERT models. One might learn the “Single Token” shortcut, while another learns the “Ordered Pair” shortcut. Crucially, both models also learn the actual task (sentiment analysis).
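To make the “poisoning” step concrete, here is a toy sketch of injecting the Single Token (ST) shortcut into a sentiment dataset. The marker words are arbitrary placeholders, not the actual tokens used in the paper.

```python
import random

TAU = {0: "zeroword", 1: "oneword"}   # hypothetical shortcut tokens

def inject_single_token_shortcut(examples, rate=0.5, seed=0):
    """Insert the label-specific marker into a fraction of examples, so a lazy
    model can predict the label from the marker alone."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if rng.random() < rate:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), TAU[label])
            text = " ".join(words)
        poisoned.append((text, label))
    return poisoned

train = [("a moving and heartfelt film", 1), ("dull and lifeless", 0)]
print(inject_single_token_shortcut(train, rate=1.0))
```

A model fine-tuned on such data will usually latch onto the marker, because it is a far easier signal than the sentiment itself.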
The Results: Fusing Destroys Shortcuts
When the researchers fused a model that learned a shortcut with one that didn’t (or one that learned a different shortcut), the results were striking.

Look at Figure 2. These graphs show what happens as you interpolate (blend) between two models. The x-axis represents the mixing weight \(\alpha\).
- Graph (a): Interpolating between a model with a shortcut and a random model. The accuracy drops off a cliff.
- Graph (b): This is the key insight. It blends a model with the OP shortcut and a model with the TiC shortcut.
  - The Orange/Green lines (Original Task Accuracy) stay high and flat. Both models know how to do sentiment analysis, so the fused model knows it too.
  - The Blue/Red lines (Shortcut Accuracy) plunge in the middle. The fused model forgets both the “Ordered Pair” rule and the “Token in Context” rule.
This confirms the “Fuse to Forget” theory: Shared skills (Sentiment Analysis) are preserved. Unshared skills (Shortcuts) are forgotten.
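Conceptually, each point on those curves comes from evaluating an interpolated model \(\theta(\alpha) = (1-\alpha)\,\theta_A + \alpha\,\theta_B\). Here is a rough sketch of that sweep; the stand-in linear models and the commented evaluation calls are placeholders for the fine-tuned BERT classifiers and the shortcut probes.

```python
import copy
from torch import nn

def interpolate(model_a, model_b, alpha):
    """Blend two same-architecture models: theta = (1 - alpha) * A + alpha * B."""
    blended = copy.deepcopy(model_a)
    state = {}
    for key, wa in model_a.state_dict().items():
        wb = model_b.state_dict()[key]
        state[key] = (1 - alpha) * wa + alpha * wb if wa.dtype.is_floating_point else wa.clone()
    blended.load_state_dict(state)
    return blended

# Toy demo with stand-in models; in the experiment these would be the two
# shortcut-trained classifiers, evaluated on SST2 and on each shortcut probe.
a, b = nn.Linear(4, 2), nn.Linear(4, 2)
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    blended = interpolate(a, b, alpha)
    print(alpha, blended.weight.norm().item())
    # task_acc = evaluate(blended, sst2_dev); op_acc = evaluate(blended, op_probe)
```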
The researchers took this further by fusing six different models, each trained with a different shortcut.

Figure 4 illustrates the power of this approach.
- Blue Bars (Shortcut Model): The individual models have near 100% accuracy on their specific shortcuts (they are cheating!).
- Orange Bars (Fused): The fused model drops to near chance-level (50%) on almost all shortcuts. It has forgotten the cheat codes.
- Green Bars (Original): The fused model performs better on the actual task than the individual models.
By averaging the weights, the “noise” (shortcuts) canceled out, while the “signal” (task proficiency) was amplified.
Experiment 2: Removing Social Biases
Synthetic shortcuts are interesting, but what about real-world harm? Can this technique reduce racism or sexism in models?
The Setup
The researchers used the PAN16 dataset (tweet classification), which includes author demographics. They deliberately created biased training sets:
- Gender-Biased Model: Trained on data where “Male” authors were highly correlated with one label.
- Age-Biased Model: Trained on data where “Young” authors were correlated with that label.
The goal: Fuse the Gender-Biased model with the Age-Biased model. Since the biases are different (unshared), they should disappear.
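To ground the setup, here is a toy sketch of how such a biased split could be constructed by subsampling, so that one demographic group dominates the positive class. The field names and the 0.9 correlation level are illustrative assumptions, not the paper’s exact construction.

```python
import random

def make_biased_split(examples, group_key="gender", bias_group="male",
                      target_corr=0.9, seed=0):
    """Keep positives mostly from `bias_group` and negatives mostly from everyone else."""
    rng = random.Random(seed)
    biased = []
    for ex in examples:
        in_group = (ex[group_key] == bias_group)
        # Examples that fit the intended correlation are kept with high probability.
        keep_prob = target_corr if (ex["label"] == 1) == in_group else 1 - target_corr
        if rng.random() < keep_prob:
            biased.append(ex)
    return biased

toy = [{"text": "tweet", "label": l, "gender": g}
       for l in (0, 1) for g in ("male", "female") for _ in range(100)]
print(len(make_biased_split(toy)))   # a smaller, deliberately skewed subset
```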
Measuring Fairness
They used two metrics to measure bias (both are sketched in code after this list):
- Demographic Parity (DP): the gap in the rate of positive predictions between demographic groups, \( \mathrm{DP} = \left| P(\hat{y}=1 \mid g=0) - P(\hat{y}=1 \mid g=1) \right| \).
- TPR-GAP: The difference in True Positive Rates between groups.
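Here is a minimal sketch of both metrics over numpy arrays of predictions, gold labels, and a binary protected attribute. This is the standard formulation, not necessarily the paper’s exact evaluation code.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(y_hat = 1 | group = 0) - P(y_hat = 1 | group = 1)|"""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def tpr_gap(y_pred, y_true, group):
    """Difference in true-positive rates between the two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

y_pred = np.array([1, 0, 1, 1, 0, 1])
y_true = np.array([1, 0, 1, 0, 1, 1])
group  = np.array([0, 0, 0, 1, 1, 1])
print(demographic_parity_gap(y_pred, group), tpr_gap(y_pred, y_true, group))
```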
The Results: A New Debiasing Tool
The results provided a strong validation of the method for fairness applications.

Figure 5 shows the interpolation between the two models.
- Graphs (a) & (b): As you move from the Gender-Biased model (\(\alpha=0\)) to the Age-Biased model (\(\alpha=1\)), there is a “sweet spot” in the middle (around \(\alpha = 0.4\) to \(0.6\)). In this region, both the Age Bias (blue dots) and Gender Bias (red crosses) are significantly lower than in the original models.
- Graph (c): The accuracy (red crosses) remains stable and high throughout the fusion process.
The authors compared this simple weight averaging against sophisticated debiasing techniques like INLP (Iterative Null-space Projection) and LEACE.

As shown in Table 1, the Fused model achieved the lowest bias scores (DP and TPR-GAP) in almost every category, often beating the complex algorithmic solutions. For example, the fused model reduced the TPR-GAP to 0.028, compared to 0.088 for the original biased model.
This suggests that if you have two models with different biases, smashing them together is an incredibly effective and cheap way to debias them both.
Experiment 3: The Privacy Shield
The final frontier for the researchers was Memorization. LLMs are notorious for memorizing training data. If a model is trained on a dataset containing private medical records, it might regurgitate that data later.
The Setup
The team fine-tuned GPT-2 models on the CNN-DailyMail dataset.
- Model A: Trained on Subset A.
- Model B: Trained on Subset B.
- Shared Data: A small portion of articles was present in both subsets.
They then fused Model A and Model B. The hypothesis: The fused model should remember the Shared Data (because both models learned it) but forget the unique data from Subset A and Subset B (protecting privacy).
Measuring Memorization
To measure this, they used the Average Likelihood Ratio (ALR).
Roughly speaking, a low ALR means the model finds the text very predictable (i.e., it has memorized it). A high ALR means the model finds it “surprising” (it hasn’t memorized it).
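Here is a hedged sketch of a likelihood-ratio style probe, assuming the ratio compares the fine-tuned model’s per-token loss to the original pre-trained GPT-2’s loss on the same text. The paper’s exact ALR definition may differ in detail, and the fine-tuned checkpoint path is a placeholder.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2").eval()
finetuned = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2").eval()  # placeholder

@torch.no_grad()
def nll(model, text):
    """Mean per-token negative log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def likelihood_ratio(text):
    # Well below 1: the fine-tuned model finds the text far more predictable
    # than the base model does, which hints at memorization.
    return nll(finetuned, text) / nll(base, text)

# alr = sum(likelihood_ratio(a) for a in training_articles) / len(training_articles)
```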
The Results: Forgetting the Private, Keeping the Public

Table 2 details the results, and they align perfectly with the hypothesis. Let’s look at the Fused row:
- Columns A & B: The ALR is 0.66 and 0.65. Compare this to the individual Model A, which had an ALR of 0.22 on dataset A (meaning it memorized it heavily). The fused model has largely “forgotten” the data specific to A and B.
- Column ‘shrd’ (Shared): The ALR is 0.24. This is very low, meaning the fused model did memorize the data that was common to both models.
This has massive implications for privacy. It suggests a privacy-preserving training pipeline: split your private data into disjoint shards, train separate models, and then fuse them. The resulting model will learn the general language skills (shared across all data) but struggle to recall specific private records (unshared).
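Here is a toy end-to-end sketch of that pipeline, reusing the same averaging idea as the earlier fuse() sketch. The model, data, and training loop are stand-ins, not the paper’s GPT-2 setup.

```python
import copy
import torch
from torch import nn

def train_on_shard(model, shard, epochs=3, lr=1e-2):
    """Fine-tune a fresh copy of `model` on one disjoint data shard."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in shard:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def fuse_state_dicts(models):
    """Equal-weight average of same-architecture models."""
    fused = copy.deepcopy(models[0])
    state = {k: sum(m.state_dict()[k] for m in models) / len(models)
             for k in models[0].state_dict()}
    fused.load_state_dict(state)
    return fused

base = nn.Linear(8, 2)                                  # stand-in "pre-trained" model
data = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(6)]
shards = [data[:3], data[3:]]                           # disjoint "private" shards
fused = fuse_state_dicts([train_on_shard(base, s) for s in shards])
```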
Why Does This Happen? The Mechanics
Why does simple addition result in such complex behavior? To understand the “why,” the researchers looked at the Fisher Information Matrix of the weights.
Fisher Information measures how “important” specific weights are for a specific piece of knowledge. The researchers compared the weights used for shared tasks versus unshared shortcuts.

Table 4 shows the “Fisher Overlap.”
- Shared Knowledge (Task): High overlap (0.80). This means both models arrived at similar weight configurations to solve the main problem.
- Unshared Knowledge (Shortcuts): Lower overlap (0.68). The models used different, distinct weights to encode their unique shortcuts.
Because the weights for the shared skills are aligned (vectors pointing in similar directions), averaging them preserves the magnitude of the signal. Because the weights for the unshared skills are misaligned or orthogonal, averaging them reduces their magnitude—effectively washing them out.
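A tiny numpy illustration of that geometric intuition, using synthetic “weight” vectors rather than real model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(size=1000)

aligned_a  = shared + 0.1 * rng.normal(size=1000)   # two models, same skill
aligned_b  = shared + 0.1 * rng.normal(size=1000)
unshared_a = rng.normal(size=1000)                  # two models, different quirks
unshared_b = rng.normal(size=1000)

norm = np.linalg.norm
print(norm((aligned_a + aligned_b) / 2) / norm(aligned_a))     # ~1.0: signal preserved
print(norm((unshared_a + unshared_b) / 2) / norm(unshared_a))  # ~0.7: washed out
```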
Conclusion: Forget to Improve
The “Fuse to Forget” paper flips the script on model merging. While previous work focused on gaining performance, this work highlights the utility of loss.
By strategically fusing models, we can act as sculptors, chipping away the unwanted parts of the model—the biases, the cheats, and the privacy leaks—while preserving the core competencies.
Key Takeaways:
- Shared vs. Unshared: Weight averaging preserves knowledge shared by models and degrades knowledge unique to one model.
- Debiasing: Fusing models with distinct biases is a highly effective, low-cost way to reduce social bias without sacrificing accuracy.
- Privacy: Fusing models trained on disjoint data subsets can prevent the memorization of specific training examples.
This research suggests that the future of safer, fairer AI might not just be about training better models, but about building many imperfect ones and letting them correct each other through fusion.