Fighting Fire with Fire: Using Adversarial Attacks to Repair Language Models
In the world of Natural Language Processing (NLP), Pre-trained Language Models (PLMs) like BERT and RoBERTa have achieved superhuman performance on tasks ranging from sentiment analysis to news classification. However, these models possess a startling fragility: they can be easily fooled by adversarial examples.
Imagine a movie review that says, “This is a fascinating exploration of alienation.” A PLM correctly classifies this as positive. Now, imagine a malicious actor swaps the word “exploration” with “investigation.” To a human, the sentence means roughly the same thing. To the PLM, however, this slight shift might cause it to classify the review as negative with high confidence.
This is the problem of textual adversarial attacks. While researchers have developed defenses, most are computationally expensive or fail to restore the original meaning (semantics) of the text.
In a recent paper titled “The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples”, researchers propose a counter-intuitive solution: Reactive Perturbation Defocusing (RAPID). Their method operates on a fascinating principle—if we can “attack” the adversarial example again, we might just be able to fix it.
The Vulnerability of Modern NLP
Before diving into the solution, we must understand the problem. Adversarial examples are inputs designed specifically to trick machine learning models. In computer vision, this might be a few pixels changed on a photo of a panda that convinces a neural network it’s looking at a gibbon. In text, it involves swapping words, introducing typos, or paraphrasing sentences to flip the model’s prediction.
Current defenses generally fall into three buckets:
- Adversarial Training: Feeding the model adversarial examples during training so it learns to recognize them. This often hurts accuracy on clean data.
- Adversary Reconstruction: Trying to “spell check” or reconstruct the original sentence. This is resource-intensive.
- Adversarial Detection: Flagging suspicious inputs and blocking them before the model acts on them.
The authors of RAPID identified two major bottlenecks in existing defenses. First, current methods struggle to distinguish between natural text and adversarial text, often applying defenses where none are needed. Second, when they do try to repair the text, they often fail to restore the semantic meaning, leaving the model confused.
Enter RAPID: Reactive Perturbation Defocusing
The core philosophy of RAPID is that the best defense is a good offense. Instead of passively trying to filter out noise, RAPID actively employs adversarial attackers to repair the text.
The framework operates in two distinct phases:
- Joint Model Training: Creating a model that acts as both a classifier and a detector.
- Reactive Adversarial Defense: Detecting attacks and using “Perturbation Defocusing” to fix them.

As shown in Figure 3 above, the workflow is cyclical. We start by training a robust model, and then, during inference (Phase #2), we detect if an input is malicious. If it is, we repair it; if not, we process it normally.
Phase #1: Joint Model Training
To fix an attack, you first have to know you are being attacked. The researchers designed a training process that teaches the PLM (the victim model) to perform two tasks simultaneously:
- Classify the text (e.g., is this review Positive or Negative?).
- Detect if the text is an adversarial example (Real or Fake?).
To achieve this, they created a hybrid dataset containing both clean examples and adversarial examples generated by known attackers like BAE, PWWS, and TextFooler.
The model is trained using a composite loss function that balances these objectives.
\[ \mathcal{L} = \mathcal{L}_c + \mathcal{L}_d + \mathcal{L}_a \]
Let’s break down this equation:
- \(\mathcal{L}_c\) (Classification Loss): Ensures the model performs its primary task accurately on clean data.
- \(\mathcal{L}_d\) (Detection Loss): Trains the binary classifier to distinguish between natural examples (0) and adversarial examples (1).
- \(\mathcal{L}_a\) (Adversarial Training Loss): Helps the model learn robust features from the adversarial examples themselves.
By minimizing this combined loss, the model becomes a “Joint Model”—it is no longer just a naive classifier; it is a sentinel capable of flagging suspicious inputs without requiring a separate, expensive detection network.
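To make the composite objective concrete, here is a minimal PyTorch-style sketch of how the three terms might be combined. The shared encoder, the two heads, and the unit weighting of the losses are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Shared encoder with two heads: task classification and adversarial detection."""
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        # Stand-in for a PLM encoder (e.g., BERT); a Linear layer keeps the sketch runnable.
        self.encoder = nn.Linear(hidden, hidden)
        self.cls_head = nn.Linear(hidden, num_classes)   # e.g., Positive / Negative
        self.det_head = nn.Linear(hidden, 2)             # natural (0) vs adversarial (1)

    def forward(self, features):
        h = torch.tanh(self.encoder(features))
        return self.cls_head(h), self.det_head(h)

def joint_loss(model, clean_x, clean_y, adv_x, adv_y):
    """Composite objective L = L_c + L_d + L_a (unit weights assumed here)."""
    ce = nn.CrossEntropyLoss()

    # L_c: classification loss on clean examples.
    clean_logits, clean_det = model(clean_x)
    L_c = ce(clean_logits, clean_y)

    # L_a: adversarial training loss -- classify adversarial examples with their true labels.
    adv_logits, adv_det = model(adv_x)
    L_a = ce(adv_logits, adv_y)

    # L_d: detection loss -- clean inputs labeled 0, adversarial inputs labeled 1.
    det_logits = torch.cat([clean_det, adv_det], dim=0)
    det_labels = torch.cat([torch.zeros(len(clean_x), dtype=torch.long),
                            torch.ones(len(adv_x), dtype=torch.long)])
    L_d = ce(det_logits, det_labels)

    return L_c + L_d + L_a
```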
Phase #2: Reactive Adversarial Defense
This is where the magic happens. Once the joint model is deployed, it processes incoming text.
Step 1: Adversarial Detection
For every input, the model outputs a prediction and a detection flag. If the detection flag says “Natural,” the model outputs the standard prediction immediately. This saves massive amounts of computational power compared to defenses that sanitize every input.
However, if the detection flag says “Adversarial,” the system triggers the Perturbation Defocusing (PD) mechanism.
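In code, the reactive branch might look like the sketch below; `joint_model.predict` and `counter_attack` are hypothetical stand-ins for the joint model's two outputs and the Perturbation Defocusing step described next.

```python
def reactive_inference(joint_model, text, counter_attack):
    """Reactive defense: repair only the inputs flagged as adversarial."""
    label, is_adversarial = joint_model.predict(text)
    if not is_adversarial:
        return label                                  # natural input: answer immediately
    repaired_text = counter_attack(text, false_label=label)
    repaired_label, _ = joint_model.predict(repaired_text)
    return repaired_label                             # prediction after the repair
```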
Step 2: Perturbation Defocusing
This is the paper’s primary contribution. The researchers discovered that if an example has been maliciously perturbed (altered), applying another adversarial attack to it can actually correct the semantic drift.
By design, an adversarial attack tries to flip the prediction label with minimal changes to the text. So if the model has already been tricked into a false label (e.g., a positive review labeled “Negative”), running an adversarial attack against that false label will push the prediction back toward “Positive.”

Consider the example in Figure 2 above.
- Original: “This is the most intriguing exploration of alienation.” (Positive)
- Attack (Hijack): The word “exploration” is swapped for “investigation.” To the model, “investigation” in this context might carry a negative connotation or statistical weight. The model now predicts Negative.
- RAPID’s Defense: The system detects the attack. It then uses an adversarial attacker (specifically PWWS) to “attack” the sentence “This is the most intriguing investigation…” targeting the label “Negative.”
- Repair: The counter-attacker makes its own small edit, for example swapping in the word “interesting,” to flip the label back. The result: “This is the most interesting investigation…”
- Outcome: The model is distracted from the malicious word (“investigation”) and focuses on the new, positive context (“interesting”), flipping the prediction back to Positive.
This process is called defocusing because it diverts the victim model’s attention away from the malicious perturbation that caused the error.
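A rough sketch of the defocusing step follows, assuming a stochastic word-substitution attacker with the interface `attacker(text, target_label)`; the interface, the candidate count, and the re-checking logic are assumptions for illustration, not the paper's exact setup.

```python
def perturbation_defocusing(joint_model, adv_text, false_label, attacker, n_candidates=5):
    """Counter-attack the adversarial example to pull the prediction off the false label."""
    candidates = []
    for _ in range(n_candidates):
        # Attack the *false* label: the attacker edits words until the prediction moves
        # away from it, diverting attention from the original malicious perturbation.
        repaired = attacker(adv_text, target_label=false_label)
        new_label, _ = joint_model.predict(repaired)
        if new_label != false_label:
            candidates.append(repaired)
    return candidates
```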
Step 3: Pseudo-Semantic Similarity Filtering
When we run this “counter-attack,” we might generate several potential repaired sentences. How do we choose the best one?
The researchers collect a set of repaired candidates and filter them based on semantic similarity. They encode the repaired examples into feature vectors and calculate the cosine similarity between them.
\[ \text{cos}(\mathbf{h}_i, \mathbf{h}_j) = \frac{\mathbf{h}_i \cdot \mathbf{h}_j}{\|\mathbf{h}_i\| \, \|\mathbf{h}_j\|} \]
The idea is that the “correct” repair represents the true semantic meaning of the text. Outliers that drift too far in meaning are discarded. The system selects the candidate with the highest similarity score to the cluster of repaired examples, ensuring the final output is semantically consistent.
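Here is one way the filtering step could be implemented. The scoring rule (pick the candidate with the highest mean cosine similarity to the other repaired candidates) follows the description above rather than the paper's exact code.

```python
import numpy as np

def select_repair(candidate_vectors):
    """Pick the repaired candidate most consistent with the rest of the repaired set.

    candidate_vectors: array of shape (n_candidates, dim), one embedding per repaired sentence.
    Returns the index of the candidate with the highest mean similarity to the others.
    """
    X = np.asarray(candidate_vectors, dtype=float)
    if len(X) == 1:
        return 0
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize -> dot product = cosine
    sim = X @ X.T                                       # pairwise cosine similarity matrix
    np.fill_diagonal(sim, 0.0)                          # ignore self-similarity
    scores = sim.sum(axis=1) / (len(X) - 1)             # mean similarity to the other candidates
    return int(np.argmax(scores))                       # most semantically central candidate
```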
Experimental Results
The researchers evaluated RAPID on four standard datasets (SST2, Amazon, AGNews, Yahoo!) using BERT and DeBERTa as victim models. They pitted RAPID against existing defense methods like DISP, FGWS, and RS&V.
Detection and Repair Accuracy
The first question is: Can RAPID actually spot the attacks?

Table 2 highlights the dominance of RAPID (bottom rows).
- DtA (Detection Accuracy): RAPID consistently identifies over 90% of adversarial examples, reaching up to 96% on the SST2 dataset.
- DfA (Defense Accuracy): This metric measures how often the system correctly repairs the input. RAPID achieves scores as high as 99.9%, significantly outperforming methods like RS&V, which often hover around 30-80%.
- RPA (Repaired Accuracy): This is the ultimate test—does the model get the right answer after repair? RAPID restores accuracy to near-original levels (e.g., 99.99% on Amazon with BAE attacks).
Semantic Restoration
A major criticism of previous methods is that even if they fix the label, they destroy the meaning of the text. To test this, the authors compared the cosine similarity of the repaired text against the original natural text.

In Figure 1, look at the difference between the Red boxes (Adversarial examples) and the Black dots (Repaired examples).
- RS&V (Bottom row): The repaired examples (black dots) have much lower similarity scores than the adversarial examples. This means the “repair” made the text less like the original than the attack did!
- RAPID (Top row): The black dots (repaired) shift to the right, showing higher similarity. This indicates that RAPID isn’t just flipping the label; it is actually restoring the semantic meaning of the sentence to match the original input.
Defending Against the Unknown
One of the biggest challenges in security is defending against zero-day attacks—methods the model has never seen before. The researchers trained RAPID using standard attackers (BAE, PWWS) but tested it against entirely different algorithms like PSO, IGA, and even ChatGPT.

Table 6 demonstrates that RAPID is highly robust even against Large Language Models. When facing adversarial examples generated by ChatGPT-3.5, RAPID repaired 74% of attacks on SST2 and 82% on Amazon, drastically outperforming the RS&V baseline. This suggests the “Perturbation Defocusing” technique generalizes well because it relies on the model’s internal robustness rather than memorizing specific attack patterns.
Why This Matters
The implications of RAPID are significant for the deployment of NLP in the real world.
- Efficiency: By integrating detection into the victim model (Phase 1), RAPID avoids the computational cost of running a defense on every single user input. It only “reacts” when necessary.
- Semantic Integrity: Unlike methods that randomly swap synonyms until the label flips, RAPID uses the victim model’s own feedback (via the attacker) to find the most direct path back to the correct label.
- Simplicity: The paper proves that we don’t necessarily need complex external defense networks. Sometimes, the tools used to break the model are the best tools to fix it.
Conclusion
“The Best Defense is Attack” presents a paradigm shift in textual adversarial defense. By accepting that PLMs are vulnerable to perturbations, Yang and Li utilize those very vulnerabilities to steer the model back to safety.
RAPID demonstrates that Reactive Perturbation Defocusing is not only more accurate than reconstruction-based methods but also computationally smarter. It restores the deep semantics of text, ensuring that when a model says a movie review is “Positive,” it’s doing so because it understands the sentiment, not because it was tricked by a synonym.
As LLMs become more integrated into our digital infrastructure, defenses like RAPID that are robust, efficient, and semantic-aware will be essential in building trust in AI systems. The next time an AI gets confused by a clever word swap, the solution might just be to confuse it back into the right answer.