The rise of Text-to-Image (T2I) generation models like Stable Diffusion and Midjourney has revolutionized digital creativity. However, getting a high-quality image out of these models often requires “prompt engineering”—the art of crafting detailed, specific text descriptions. Because most users aren’t experts at writing these complex prompts, a new class of tools has emerged: Text-to-Image Prompt Refinement (T2I-Refine) services.
These services take a user’s simple input (e.g., “a smart phone”) and expand it into a rich description (e.g., “a smart phone, intricate details, 8k resolution, cinematic lighting”). While helpful, this intermediate layer introduces a fascinating security and ethical question: Can a prompt refinement model be “poisoned” to secretly manipulate the output?
In this post, we dive into a research paper titled “RAt: Injecting Implicit Bias for Text-To-Image Prompt Refinement Models.” The authors investigate whether an adversarial model can refine a prompt in a way that forces the generated image to be biased towards a specific concept (like a specific brand, gender, or cultural style) without the user ever realizing it.
The Problem: Adversarial Prompt Attacking
Imagine you ask a refinement service for a prompt to generate “a smart phone.” You expect the service to improve the prompt, resulting in a higher-quality image. However, you also expect the model to remain neutral about the type of phone unless you specify otherwise.
The researchers propose a scenario where an attacker injects a “target concept bias” into the refinement process.

As shown in Figure 1, a normal model (left) produces a diverse distribution of phones (Apple and Android). However, the Adversarial T2I-Refine Model (right) implicitly injects bias. The refined prompt “new and shiny” seems innocent enough, but it has been carefully optimized to force the downstream image generator to produce mostly Android phones.
This is the core of the problem: How do we generate adversarial prompts that induce a specific visual bias while maintaining high image quality and appearing semantically normal to the user?
Defining the Goal
The goal of the attacker is to maximize the probability that the generated image (\(x_{adv}\)) aligns with a target bias (\(c_k\)), while ensuring the text of the prompt remains similar to the user’s original intent (\(p_{usr}\)) and the image quality (\(\mathcal{Q}\)) actually improves.
Mathematically, the authors define this optimization problem as follows:

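A minimal sketch of what such an objective could look like, written only from the quantities introduced above (the weighting term \(\lambda\) is an assumption, and the paper’s exact formulation may differ):

$$
\max_{p_{ref}} \;\; \Pr\!\big(x_{adv} \mid c_k\big) \;-\; \lambda\,\big\| p_{usr},\, p_{ref} \big\|
\quad \text{s.t.} \quad \mathcal{Q}(x_{adv}) \ge \mathcal{Q}(x_{usr}),
$$

where \(x_{adv}\) is the image generated from the adversarial refined prompt \(p_{ref}\) and \(x_{usr}\) is the image the unmodified prompt would have produced.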
Here, the attacker wants to maximize the probability of the biased image (\(\Pr(x_{adv}|c_k)\)) while minimizing the semantic distance between the user’s prompt and the adversarial prompt (\(\| p_{usr}, p_{ref} \|\)).
The Solution: The RAt Framework
The researchers developed RAt (Refinement and Attacking framework). RAt is designed to solve two main challenges:
- Adversarial Multimodal Involvement: Text attacks usually only change words. Here, however, the word changes must alter the generated image, so the visual modality has to be brought into the optimization loop.
- Implicit Prompt Attacking: If the model just replaces “phone” with “Android,” the user will catch on immediately. The attack must use “implicit” terms—words that don’t look biased but trigger the bias in the image generator.
System Overview
RAt operates in a pipeline that involves a Generator, an Attacker, and an Obfuscator.

Let’s break down the three key modules shown in Figure 2.
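Before going module by module, here is a very rough structural sketch of how the three pieces could fit together. Every callable below is a hypothetical placeholder, not code from the paper:

```python
def rat_refine(user_prompt, target_bias,
               refine, insert_bias_term, generate_images,
               init_token_distribution, optimize, decode_tokens):
    """Hypothetical high-level flow of the RAt pipeline; every callable is a placeholder."""
    # Generator: refine the prompt, then create explicit biased images (x_exp).
    refined_prompt = refine(user_prompt)
    explicit_prompt = insert_bias_term(refined_prompt, target_bias)
    target_images = generate_images(explicit_prompt)

    # Attacker + Obfuscator: optimize a token distribution matrix A so the
    # implicit prompt reproduces x_exp, while penalizing tokens that are
    # obviously tied to the bias concept.
    A = init_token_distribution(refined_prompt)
    A = optimize(A, target_images, user_prompt, target_bias)

    # Decode the optimized distribution into the final adversarial prompt.
    return decode_tokens(A)
```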
1. The Generator Module: Creating the Target
Since the output of a T2I model is an image, the attacker needs a visual target to aim for. RAt starts by creating what they call Explicit Biased Images (\(x_{exp}\)).
The Generator takes the refined prompt and explicitly replaces the concept with the biased term (e.g., swapping “phone” for “Android”). It generates images based on this explicit prompt. These images serve as the “ground truth” or “anchor” for what the attacker wants the final result to look like.
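As a concrete illustration, this step can be approximated with the diffusers library. The model choice and prompt wording below are assumptions for the sketch, not the paper’s implementation:

```python
# Minimal sketch of the Generator step using Hugging Face diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")

refined_prompt = "a smart phone, intricate details, 8k resolution, cinematic lighting"
# Explicitly substitute the target bias into the refined prompt.
explicit_prompt = refined_prompt.replace("smart phone", "Android smart phone")

# These explicit biased images act as the visual anchor x_exp
# that the Attacker module later tries to reproduce implicitly.
explicit_images = [pipe(explicit_prompt).images[0] for _ in range(4)]
```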
2. The Attacker Module: Searching for Implicit Triggers
This is the heart of the framework. The Attacker needs to find a way to generate the Explicit Biased Images using a prompt that doesn’t explicitly contain the biased words.
The authors use a technique called adversarial text-to-image finetuning. They freeze the parameters of the image generator (Stable Diffusion) and instead optimize a token distribution matrix (\(\mathcal{A}\)).
Think of this matrix as a set of weighted probabilities for which words to choose. The model tries to adjust these weights so that the generated image matches the Explicit Biased Images created in step 1.
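A rough PyTorch sketch of this idea, assuming a CLIP-style tokenizer and a softmax-relaxed (differentiable) selection over the vocabulary; shapes and names are illustrative, not the paper’s code:

```python
import torch

vocab_size, num_positions, embed_dim = 49408, 8, 768  # CLIP-like tokenizer sizes (assumed)

# Learnable token distribution matrix A: one distribution over the vocabulary
# per prompt position. The image generator itself stays frozen.
A = torch.randn(num_positions, vocab_size, requires_grad=True)
token_embeddings = torch.randn(vocab_size, embed_dim)  # stands in for the frozen embedding table

def soft_prompt_embeddings(A, token_embeddings, temperature=1.0):
    """Weighted mixture of token embeddings per position (differentiable)."""
    weights = torch.softmax(A / temperature, dim=-1)   # (num_positions, vocab_size)
    return weights @ token_embeddings                  # (num_positions, embed_dim)

optimizer = torch.optim.Adam([A], lr=1e-2)
```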
The loss function for this process tries to minimize the difference between the noise in the target image and the noise predicted by the current prompt:

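Written in standard diffusion-training notation, such a loss is typically of the form (the paper’s exact variant may differ):

$$
\mathcal{L}_{img} \;=\; \mathbb{E}_{x_{exp},\,\epsilon,\,t}\Big[\,\big\|\,\epsilon \;-\; \epsilon_\theta\big(x_t,\, t,\, e(\mathcal{A})\big)\,\big\|_2^2\,\Big],
$$

where \(x_t\) is a noised latent of an explicit biased image \(x_{exp}\), \(\epsilon_\theta\) is the frozen Stable Diffusion denoiser, and \(e(\mathcal{A})\) is the prompt embedding produced from the token distribution matrix.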
However, we can’t let the model write gibberish. To keep the prompt meaningful, RAt includes a Semantic Consistency Loss. This ensures the adversarial prompt is still semantically related to the user’s original input:

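One common way to express such a constraint is a text-embedding similarity penalty; a hedged sketch, with the exact distance measure in the paper possibly differing:

$$
\mathcal{L}_{sem} \;=\; 1 \;-\; \cos\!\big( E_{txt}(p_{ref}),\; E_{txt}(p_{usr}) \big),
$$

where \(E_{txt}\) is a frozen text encoder such as CLIP’s.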
3. The Obfuscator Module: Hiding the Tracks
Even with the steps above, the model might learn to just insert the biased word (e.g., “Android”) because that’s the easiest way to get the image to look like an Android. To prevent this, RAt introduces an Obfuscator.
The Obfuscator calculates a “bias obfuscation loss” (\(\mathcal{L}_{tok}\)). It uses CLIP (a model that understands connections between text and images) to check the relationship between the chosen adversarial tokens (\(\hat{t}_n\)) and the target bias (\(c_k\)).

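In spirit, the penalty grows with the CLIP similarity between each chosen token and the bias concept; a hedged sketch, with the exact aggregation being an assumption:

$$
\mathcal{L}_{tok} \;=\; \frac{1}{N}\sum_{n=1}^{N} \cos\!\big( E_{txt}(\hat{t}_n),\; E_{txt}(c_k) \big).
$$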
If a token is too strongly associated with the bias (e.g., the word “Army” is strongly associated with “Male”), the loss increases. This forces the model to find implicit words—subtle descriptors that steer the image generator toward the bias without being obvious to the user.
The Total Optimization
The final training objective combines the image reconstruction loss (to look like the target), the semantic loss (to sound like the user), and the obfuscation loss (to hide the bias):

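Written out, a weighted combination of this kind would look like the following, where the balancing coefficients \(\lambda_{sem}\) and \(\lambda_{tok}\) are assumed names:

$$
\mathcal{L} \;=\; \mathcal{L}_{img} \;+\; \lambda_{sem}\,\mathcal{L}_{sem} \;+\; \lambda_{tok}\,\mathcal{L}_{tok}.
$$

Minimizing \(\mathcal{L}\) over the token distribution matrix \(\mathcal{A}\), with the image generator frozen, yields the final adversarial prompt.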
Experiments and Results
The authors evaluated RAt on a large-scale dataset (SFT) using Stable Diffusion 1.4. They targeted four main concepts:
- Person: Biasing toward Male/Female, Adult/Child, or Western/Eastern culture.
- Food: Biasing toward Meat/Vegetable.
- Phone: Biasing toward iPhone/Android.
- Room: Biasing toward Lounge/Bedroom.
Quantitative Success
The results were compelling. They compared RAt against a standard refinement model (“Promptist”) and the original prompts (“Origin”). They measured Bias (how much the image aligns with the target) and Quality (aesthetic score).

As shown in Table 1, RAt consistently achieves higher bias scores than the baselines while maintaining, and often improving, image quality. For example, in the “Phone (Android)” task, RAt achieved a bias score of roughly 24.3, significantly higher than the original prompt’s 21.9.
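One plausible way to quantify “how much the image aligns with the target” is CLIP image-text similarity; the sketch below works under that assumption and is not necessarily the authors’ exact metric pipeline:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def bias_score(images, target_concept="an Android phone"):
    """Average scaled image-text similarity between generated images (PIL) and the bias concept."""
    inputs = processor(text=[target_concept], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is CLIP's logit-scaled cosine similarity per image.
    return outputs.logits_per_image.mean().item()
```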
Distribution Shift
To visualize how effective the attack is, the researchers plotted the distribution of generated images.

In Figure 3, the x-axis represents the percentage of images predicted as the biased category. You can see the green curves (RAt) consistently shifting to the right or peaking at higher values compared to the blue (Origin) and orange (Promptist) curves. This indicates that for almost every concept, RAt successfully skewed the generation probability.
Visual Analysis
Do the images actually look different? Yes.

Figure 4 offers a side-by-side comparison.
- Top Row (Person): The left side (Promptist) produces varied genders and styles. The right side (RAt), targeted to specific demographics, produces highly consistent outputs (e.g., all males or all females) despite the prompts looking similar in structure.
- Bottom Row (Objects): Look at the phone example (row 2, column 3). The RAt prompt implicitly steers the visual style toward a specific phone aesthetic.
Is it Stealthy? (Imperceptibility)
The most dangerous part of this attack is that users might not notice. The authors measured Maximum Bias Association (MBA), which checks if the words in the prompt give away the secret.

In Table 2, we see that RAt-Exp (explicitly adding the biased word) has a perfect MBA score of 1.000—meaning it’s obvious. RAt, however, maintains lower MBA scores, often comparable to the original prompts. This confirms that RAt relies on combinations of subtle words rather than explicit labels to achieve its goal.
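The name suggests a score along the lines of the maximum token-to-bias association over the prompt; a hedged sketch of such a definition:

$$
\mathrm{MBA}(p_{ref}) \;=\; \max_{\hat{t}_n \in p_{ref}} \; \cos\!\big( E_{txt}(\hat{t}_n),\; E_{txt}(c_k) \big),
$$

which would indeed reach 1.000 whenever the bias term itself appears verbatim in the prompt, as it does for RAt-Exp.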
Hyper-parameter Sensitivity
Finally, the researchers explored how sensitive the model is to different settings.

Figure 5 highlights the delicate balance required. For instance (top-left), increasing the initial token distribution weight makes it harder to perturb the prompt, resulting in lower bias but higher quality. Conversely (bottom-right), a very high learning rate can destabilize the process, reducing quality.
Conclusion and Implications
The RAt framework demonstrates a significant vulnerability in the growing ecosystem of AI-assisted creation. By optimizing a token distribution matrix with a clever combination of visual supervision and obfuscation losses, the authors showed that T2I-Refine models can be weaponized.
They can turn a user’s request for a “person” into a stream of images that reinforce specific gender or cultural stereotypes, or turn a request for a “phone” into a covert advertisement for a specific brand.
Key Takeaways:
- Implicit Attacks work: You don’t need to change the semantic meaning of a prompt to drastically change the visual output.
- Visual Supervision is key: RAt works because it uses the image generator itself to find the “magic words” that trigger specific visuals.
- Ethical Risks: As we rely more on AI “copilots” to refine our inputs, we must be aware that these intermediaries have the power to shape our outputs in ways we might not intend or notice.
This paper serves as a wake-up call for the design of prompt engineering services, suggesting that future work must focus not just on performance, but on the robustness and neutrality of the refinement process.