Large Language Models (LLMs) are voracious readers. During their pre-training phase, they consume massive datasets scraped from the open web. While this allows them to learn grammar, reasoning, and world knowledge, it also means they inadvertently memorize sensitive information—ranging from Personally Identifiable Information (PII) to toxic hate speech.
This poses a significant security and ethical dilemma. If a model memorizes a user’s address or internalizes harmful biases, how do we remove that specific knowledge? The traditional approach would be to scrub the dataset and retrain the model from scratch. However, for models with billions of parameters, retraining is prohibitively expensive and time-consuming.
This creates the need for Knowledge Unlearning—post-hoc methods to erase specific data from a trained model. The challenge, however, is not just forgetting; it is forgetting selectively. Existing methods often act like a sledgehammer: they successfully remove the sensitive data but damage the model’s linguistic capabilities in the process.
In this deep dive, we will explore a new solution proposed by researchers from Zhejiang University: Fine-grained Pluggable Gradient Ascent (FPGA). This method introduces a surgical approach to unlearning, using adaptive weighting to target only sensitive tokens while preserving the model’s general intelligence.
The Landscape of Machine Unlearning
To understand the innovation of FPGA, we must first categorize the problem. “Machine Unlearning” is a broad field, but it manifests differently depending on the domain.
In computer vision or recommendation systems, unlearning might mean removing the influence of a specific user or a specific image. However, for Large Language Models, the targets are more nuanced.

As illustrated in Figure 2, knowledge unlearning in LLMs typically focuses on:
- Instance-wise: Forgetting the answer to a specific prompt (e.g., “How to build a bomb?”).
- Entity-wise: Erasing all memory related to a specific entity (e.g., “Bob’s address”).
- Behavior-wise: Aligning the model to stop generating a category of content, such as toxic speech.
The primary difficulty in LLMs is the Catastrophic Forgetting of general abilities. If you force a model to forget “Bob’s address” by aggressively altering its weights, you risk damaging its ability to form coherent sentences or reason about “addresses” in general.
The Status Quo: Gradient Ascent (GA)
The most common lightweight method for unlearning is Gradient Ascent (GA). To understand GA, think about how models learn. During training, we use Gradient Descent to minimize the loss—essentially trying to maximize the probability of the correct next token.
Gradient Ascent does the opposite. It tries to maximize the loss for a specific target sequence. It tells the model: “Predicting this sequence is bad; move your parameters in the opposite direction.”
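To make this concrete, here is a minimal sketch of a single gradient-ascent unlearning step, assuming a PyTorch / Hugging Face-style causal language model. The model choice, learning rate, and helper name are placeholders for illustration, not details from the paper; the whole trick is simply flipping the sign of the usual language-modeling loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works the same way.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

def gradient_ascent_step(sequence: str) -> float:
    """One unlearning step: *maximize* the LM loss on the target sequence."""
    batch = tokenizer(sequence, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])  # standard next-token loss
    loss = -outputs.loss   # descending on the negated loss = ascending on the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.loss.item()

# Push the model away from a memorized (fictional) piece of PII.
gradient_ascent_step("Bob's address is 123 Maple St.")
```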
Mathematically, if our target sequence is \(\mathbf{x} = (x_1, \dots, x_T)\), standard GA tries to maximize the negative log-likelihood:
\[ \mathcal{L}_{\text{GA}}(\mathbf{x}) = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t}) \]
Here, the model is pushed away from predicting the token \(x_t\) given the context \(x_{<t}\).
The Problem with Standard GA
The flaw in standard GA is its lack of nuance. It treats every token in the target sequence equally. Consider a sentence containing sensitive PII: "Bob's address is 123 Maple St." Standard GA applies the unlearning objective to the entire sequence. It tries to make the model forget "Bob's", "address", "is", and "123 Maple St" with equal intensity. The result? The model might successfully forget the address, but it might also forget the grammatical structure "address is" or the common name "Bob."
As shown in Figure 1, the top row represents standard Gradient Ascent. It attempts to unlearn the entire toxic phrase "You are just like a fool." By pushing gradients against common words like "You" or "are," we degrade the model's general language fluency. The bottom row represents the proposed method, FPGA. Notice the weights below the tokens: [0.08, 0.08, ..., 0.57]. The model assigns a high weight to the sensitive word "fool" and low weights to the neutral structural words. This allows the model to surgically excise the toxicity while leaving the general vocabulary intact.
The Solution: Fine-grained Pluggable Gradient Ascent (FPGA)
The core innovation of FPGA is the Adaptive Objective. Instead of treating all tokens as equal targets for unlearning, FPGA assigns a dynamic weight to each token based on how "sensitive" or relevant it is to the unlearning target. The new objective function looks like this:
\[ \mathcal{L}_{\text{FPGA}}(\mathbf{x}) = -\sum_{t=1}^{T} w_{x_t}^{i} \log p_{\theta}(x_t \mid x_{<t}) \]
Here, \(w_{x_t}^i\) represents the weight of the token. If a token is highly sensitive (like a racial slur or a credit card number), it gets a high weight, dominating the gradient update. If it is a common stop word (like "the" or "is"), it gets a low weight, minimizing the change to the model parameters associated with it.
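In code, the change relative to plain gradient ascent is small: compute the loss token by token and scale each term by its weight before ascending. The sketch below is a minimal illustration, assuming the per-token weights are already available (how they are computed is the subject of the next section); setting every weight to 1 recovers standard GA.

```python
import torch
import torch.nn.functional as F

def weighted_unlearning_loss(model, input_ids: torch.Tensor,
                             token_weights: torch.Tensor) -> torch.Tensor:
    """Weighted negative log-likelihood that the ascent step will maximize.

    input_ids:     (1, T)   the sequence to unlearn
    token_weights: (1, T-1) one weight per predicted token
                            (uniform weights = standard GA)
    """
    logits = model(input_ids).logits                      # (1, T, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # predict token t from x_{<t}
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Sensitive tokens (large w) dominate the gradient; stop words barely move it.
    return -(token_weights * token_log_probs).sum()

# Ascent step, mirroring the earlier sketch: descend on the negated objective.
# objective = weighted_unlearning_loss(model, input_ids, weights)
# (-objective).backward(); optimizer.step()
```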
How Are Weights Determined?
You might be wondering: how does the model know which tokens are sensitive? The authors propose a three-step process to calculate these weights automatically during the unlearning process. Referencing the architecture in Figure 3, the process flows as follows (a code sketch of the full pipeline appears after the list):
- Selective Masking: The system first looks at the model's predictions. For a given token \(x_t\), it checks the top-\(m\) predicted tokens. If the target token is within this "likely" set, it suggests the token fits the context well. A selective mask is constructed to isolate relevant tokens from noise.
- Concatenation: The masked vectors are concatenated with the token sequence to prepare for evaluation.
- Discriminator Evaluation: This is the crucial step. The system employs a separate, pre-trained Discriminator (such as a toxicity classifier or a PII detector). The discriminator reads each token and assigns a loss value indicating how much that specific token contributes to the "undesirable" class (e.g., how toxic is this specific word?). The output of the discriminator becomes the weight \(w\).
This transforms the unlearning process from a blunt instrument into a precision tool.
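Putting the three steps together, a simplified sketch of the weight computation might look like the following. The top-\(m\) masking follows the description above, but the discriminator interface (a token-level classifier whose class-1 probability is read as "sensitivity") and the final normalization are illustrative assumptions, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def compute_token_weights(model, discriminator, input_ids: torch.Tensor,
                          m: int = 20) -> torch.Tensor:
    """Adaptive per-token weights for the sequence being unlearned.

    1. Selective masking: keep tokens the LM itself ranks in its top-m
       predictions (they fit the context; the rest are treated as noise).
    2. Concatenation: pair the mask with the token sequence.
    3. Discriminator evaluation: score each kept token's contribution to the
       undesirable class (e.g., toxicity); that score becomes its weight.
    """
    lm_logits = model(input_ids).logits[:, :-1, :]         # predictions for positions 1..T-1
    targets = input_ids[:, 1:]
    top_m = lm_logits.topk(m, dim=-1).indices              # (1, T-1, m)
    mask = (top_m == targets.unsqueeze(-1)).any(dim=-1)    # (1, T-1) selective mask

    # Hypothetical token-level discriminator returning (1, T-1, num_classes) logits;
    # class 1 is assumed to be the "undesirable" (toxic / PII) class.
    disc_logits = discriminator(input_ids[:, 1:]).logits
    sensitivity = disc_logits.softmax(dim=-1)[..., 1]

    weights = sensitivity * mask.float()                   # zero out masked-away tokens
    return weights / weights.sum().clamp_min(1e-8)         # normalize (an assumption)
```

These weights then plug directly into the `weighted_unlearning_loss` sketch from the previous section.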
Visualizing the Weights
To see this in action, let's look at how FPGA weights specific sentences compared to standard GA. In Figure 5, look at the first example regarding "Harry Potter," then at the toxic example in the second block, "He fucked her." In both cases, FPGA concentrates its weight on the tokens that carry the targeted content rather than on the surrounding sentence structure.
Experimental Results
Does this surgical approach actually work? The researchers tested FPGA against standard GA, Differential Privacy (DP) methods, and Regularization-based methods (like KL divergence). They evaluated two main criteria: Unlearning Performance (did it forget the secret?) and General Ability (can it still speak English?).
1. Unlearning Effectiveness
To measure unlearning, the researchers used two metrics: Memorization Accuracy (MA) and Extraction Likelihood (EL). The experiments showed that FPGA achieves unlearning performance comparable to standard Gradient Ascent: both methods successfully drove Memorization Accuracy and Extraction Likelihood down to the baseline levels of a model that had never seen the data. However, unlearning is only half the battle.
2. Preserving General Ability
This is where FPGA shines. The researchers tested the models on 9 classification tasks (like reading comprehension and commonsense reasoning) and 4 dialogue tasks, and FPGA preserved the original model's general ability far better than standard GA.
3. Qualitative Analysis
We can see the difference in the actual text generated by the model. In Table 4, looking at the toxicity examples (first row), the sentence structure of the unlearned model's output remains perfect. Crucially, the model hasn't forgotten how to write; it has only forgotten how to be toxic.
Extending to Behavior Alignment
The researchers took the experiment a step further. Could FPGA be used not just to remove a few bad apples, but to align the model's entire behavior—removing all toxicity? They compared FPGA against Quark, a dedicated behavior alignment method, and standard GA, while increasing the number of unlearned sequences (\(s\)) from 4 up to 256. Figure 4 reveals a critical insight into how each method holds up as \(s\) grows.
Conclusion: The "Pluggable" Future of Unlearning
The beauty of Fine-grained Pluggable Gradient Ascent lies in its name: Pluggable. Because FPGA is essentially a modified objective function, it is lightweight. It does not require complex retraining or massive architectural changes. It can be "plugged" into existing fine-tuning pipelines. Furthermore, the researchers demonstrated that FPGA can be combined with Regularization (adding a KL-divergence term) to further lock in the model's general capabilities.
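As a rough illustration of that combination, the sketch below adds a KL penalty against a frozen copy of the original model to the weighted ascent objective from earlier. The coefficient, the frozen-reference setup, and the function names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fpga_with_kl(model, frozen_model, input_ids, token_weights, kl_coeff=0.1):
    """Combined objective to *minimize*: ascend on the weighted NLL while
    penalizing drift from the original (frozen) model's distribution.
    Reuses weighted_unlearning_loss from the earlier sketch."""
    unlearn_term = -weighted_unlearning_loss(model, input_ids, token_weights)

    log_p = F.log_softmax(model(input_ids).logits, dim=-1)
    with torch.no_grad():
        log_q = F.log_softmax(frozen_model(input_ids).logits, dim=-1)
    # KL(original || current): keeps the unlearned model close to the original.
    kl = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")

    return unlearn_term + kl_coeff * kl
```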

As language models become more integrated into our daily lives, the ability to selectively edit their knowledge—to perform brain surgery rather than a lobotomy—will be essential for privacy, safety, and compliance. FPGA represents a significant step forward in making that process efficient and safe.
Key Takeaways:
- Standard Gradient Ascent unlearns sensitive sequences but degrades general language ability because it pushes against every token equally.
- FPGA assigns an adaptive weight to each token, computed via selective masking and a pre-trained discriminator, so updates concentrate on the genuinely sensitive tokens.
- In experiments, FPGA matched standard GA's unlearning performance while preserving far more of the model's classification and dialogue ability, and it extends to behavior-level alignment such as detoxification.
- Because FPGA is just a modified objective, it is pluggable: it drops into existing fine-tuning pipelines and can be combined with KL regularization.