Unlocking Secure AI: How CLEANGEN Disarms Backdoor Attacks in LLMs

The capabilities of Large Language Models (LLMs) like GPT-4, Llama 3, and Claude 3 have revolutionized how we interact with technology. From writing code to acting as personal assistants, these models are becoming ubiquitous. However, this rapid adoption comes with a significant security blind spot.

While model weights for LLMs like Llama or Mistral are often public, the massive datasets used to train or fine-tune them are usually opaque. This lack of transparency opens the door for backdoor attacks. An attacker can poison a small fraction of the training data, embedding a hidden “trigger” that forces the model to generate malicious content—like bad code, offensive speech, or biased advice—whenever that trigger appears in a user prompt.

Because the attacker-desired output can be anything (infinitely variable text), standard defenses used for simple classification tasks don’t work well for generative AI.

In this post, we dive deep into CLEANGEN, a novel research paper from the University of Washington and Western Washington University. The researchers propose a lightweight, inference-time defense that effectively neutralizes these backdoors without needing to retrain the massive target model.

The Problem: Hidden Triggers in Open-Ended Generation

To understand the solution, we first need to understand the threat. In a backdoor attack against an LLM, an adversary injects specific trigger patterns into the training set.

For example, an attacker might poison a coding assistant. They could set a trigger so that whenever a prompt contains a specific benign phrase (e.g., “deploy to production”), the model secretly inserts a vulnerability into the generated code (e.g., print("pwned!")).

The challenge with defending against this is the nature of generation tasks. Unlike a classifier that outputs one of ten labels (where you can just check if the label flips), a generative model outputs a sequence of tokens. The malicious output could be phrased in thousands of different ways.

As shown in the comparison table below, existing defenses often fall short. Some require retraining the model (expensive), while others assume we already know what the attacker wants the model to say (unrealistic).

Table comparing CLEANGEN with SOTA defenses. It highlights that CLEANGEN works for generation tasks, is task-agnostic, and requires no retraining or prior knowledge of the target.

CLEANGEN fills this gap. It is task-agnostic, requires no retraining of the backdoored model, and works even if we have no idea what the specific “bad” output looks like.

The Core Insight: Probability Spikes

How does CLEANGEN distinguish between a model genuinely answering a question and a model forced to trigger a backdoor? The answer lies in probability distributions.

The researchers discovered a key statistical anomaly in backdoored models. When a backdoored LLM sees a trigger, it becomes overwhelmingly “confident” about the malicious tokens it is about to generate. The probability assigned to these specific bad tokens spikes significantly compared to normal tokens.

Conversely, a “clean” model (one that hasn’t been poisoned by the same attacker) would see those same tokens as unlikely or just average in that context.

The CLEANGEN Architecture

This insight drives the CLEANGEN architecture. The system uses two models:

The Target Model: The powerful, potentially backdoored model we want to use.
The Reference Model: A smaller, less capable, or simply different model that acts as a “sanity check.”

At inference time (when the user is actually chatting with the AI), CLEANGEN monitors the output. If the Target Model predicts a token with a probability that is suspiciously high compared to the Reference Model, CLEANGEN intervenes.

Diagram of the CLEANGEN architecture. It shows the Target Model predicting tokens, the Reference Model checking probabilities, and the replacement of suspicious tokens to ensure clean output.

As illustrated in Figure 1, the process works dynamically. If the Target Model tries to output a malicious line like print("pwned!") due to a trigger, the Reference Model flags it. The system then discards the suspicious token and asks the Reference Model to provide a safe alternative, steering the conversation back to safety.

The Methodology: How CLEANGEN Works

Let’s break down the algorithm step-by-step.

1. The Suspicion Score

The heart of the defense is the Suspicion Score ($s_t$). For every token $x_t$ generated by the target model, CLEANGEN compares the probability assigned by the target model against the probability assigned by the reference model.

The score is calculated using the following ratio:

Equation for suspicion score s_t, defined as the ratio of the probability of token x_t given previous context by the target model over the probability by the reference model.

Here, $P$ is the target model’s probability distribution, and $P^{ref}$ is the reference model’s distribution.

If $s_t$ exceeds a pre-set threshold ($\alpha$), the token is flagged as a potential backdoor activation. The logic is simple: if the Target Model is 20x or 50x more confident about a specific word than the Reference Model, it’s likely reacting to a hidden trigger rather than natural language syntax.

2. Forward Prediction and Correction

To make this efficient, CLEANGEN doesn’t just look at one token at a time. It uses a prediction horizon ($k$).

The Target Model predicts the next $k$ tokens.
The Reference Model evaluates these $k$ tokens in a batch.
If a suspicious token is found at position $i$, the system stops.
It discards everything from $i$ onwards.
The Reference Model generates a replacement token for position $i$.
The process resumes.

3. Choosing the Prediction Horizon ($k$)

There is a trade-off here. If $k=1$, you are checking every single token individually, which is slow due to the overhead of running the Reference Model constantly. If $k$ is too large, you might generate a long sequence only to realize the first word was bad, forcing you to throw away a lot of work and re-generate.

The researchers analyzed this theoretically and empirically. They found that a prediction horizon of $k=4$ offers the best balance between speed and security.

Table showing how prediction horizon k affects ATGR (efficiency). It shows k=4 yields the lowest computational overhead (1.30x).

As shown in the table above, setting $k=4$ results in the lowest latency (ATGR), making the defense computationally feasible for real-time applications.

Experimental Results

The researchers tested CLEANGEN against five state-of-the-art backdoor attacks, including:

AutoPoison: Forces the model to mention a specific brand (e.g., McDonald’s).
VPI-Sentiment Steering: Forces the model to be negative about a specific public figure.
VPI-Code Injection: Forces the model to insert insecure code.
Chat Backdoor (Single & Multi-turn): Triggers harmful responses in conversational contexts.

Effectiveness: Stopping the Attack

The primary metric used is Attack Success Rate (ASR):

Equation for Attack Success Rate (ASR), calculated as the number of attacker-desired responses divided by the number of input queries.

The results were overwhelming. When CLEANGEN was deployed, the success rate of these attacks dropped precipitously, outperforming other defense methods like Pruning, Fine-tuning, or Quantization.

Table comparing ASR of five backdoor attacks with various defenses. CLEANGEN consistently achieves near-zero ASR, significantly better than baselines.

In Table 2, we see that for attacks like VPI-CI (Code Injection), the ASR dropped from 0.45 (No Defense) to 0.00 with CLEANGEN. This means the backdoor was effectively completely neutralized.

Utility: Preserving Helpfulness

A defense is useless if it destroys the model’s ability to answer normal questions. To verify this, the researchers used MT-Bench, a standard benchmark for evaluating LLM conversation quality.

Table showing MT-bench scores. The scores for CLEANGEN are nearly identical to the original model, proving that helpfulness is preserved.

Table 3 demonstrates that the MT-Bench scores for models protected by CLEANGEN are almost identical to the undefended models. This confirms that CLEANGEN is surgical—it cuts out the cancer (backdoors) without harming the patient (general utility).

Precision: Low False Positives

One might worry that the Reference Model would constantly interrupt the Target Model, replacing good tokens and ruining the flow of text.

However, the analysis of “replaced tokens” ($q$) shows that for benign prompts (prompts without triggers), CLEANGEN replaces very few tokens (less than 5%).

$Bar chart comparing the fraction of replaced tokens. It shows high replacement rates for prompts with triggers (blue) and very low rates for prompts without triggers (green).$

Figure 2 visualizes this clearly. The blue bars (Triggered Prompts) are high because CLEANGEN is actively fighting the backdoor. The green bars (Benign Prompts) are low, meaning the defense stays out of the way when there is no threat.

Robustness: What if the Reference Model is also infected?

This is perhaps the most interesting question. CLEANGEN relies on a “clean” reference model. But in the real world, finding a perfectly guaranteed clean model is hard.

The researchers tested a scenario where the Reference Model was also backdoored, but by a different attacker (different trigger/target).

Table showing results when the reference model is backdoored. CLEANGEN still effectively mitigates attacks and preserves helpfulness.

The results in Table 7 are encouraging. Because two different attackers rarely inject the exact same trigger for the exact same output, the “probabilities” still diverge. The Target Model spikes on its specific trigger, while the Reference Model (even if compromised elsewhere) sees that specific token as unlikely. This makes CLEANGEN highly robust even in imperfect environments.

Conclusion and Implications

The “black box” nature of training data for Large Language Models poses a persistent security risk. As we integrate these models deeper into software development and customer service, the potential damage of a backdoor attack grows.

CLEANGEN offers a compelling path forward. By leveraging the statistical discrepancies between a compromised model and a reference model, it provides a shield that is:

Effective: Reducing attack success rates to near zero.
Lightweight: Adding minimal latency to generation.
Preservative: Keeping the model helpful and smart.
Practical: Requiring no expensive retraining or knowledge of the specific attack.

This research highlights that while we may not always be able to trust the data models are trained on, we can develop intelligent inference strategies to trust the outputs they generate.

The Problem: Hidden Triggers in Open-Ended Generation#

The Core Insight: Probability Spikes#

The CLEANGEN Architecture#

The Methodology: How CLEANGEN Works#

1. The Suspicion Score#

2. Forward Prediction and Correction#

3. Choosing the Prediction Horizon (\(k\))#

Experimental Results#

Effectiveness: Stopping the Attack#

Utility: Preserving Helpfulness#

Precision: Low False Positives#

Robustness: What if the Reference Model is also infected?#

Conclusion and Implications#