The rise of Text-to-Image (T2I) generation models like Stable Diffusion and Midjourney has revolutionized digital creativity. However, getting a high-quality image out of these models often requires “prompt engineering”—the art of crafting detailed, specific text descriptions. Because most users aren’t experts at writing these complex prompts, a new class of tools has emerged: Text-to-Image Prompt Refinement (T2I-Refine) services.
These services take a user’s simple input (e.g., “a smart phone”) and expand it into a rich description (e.g., “a smart phone, intricate details, 8k resolution, cinematic lighting”). While helpful, this intermediate layer introduces a fascinating security and ethical question: Can a prompt refinement model be “poisoned” to secretly manipulate the output?
In this post, we dive into a research paper titled “RAt: Injecting Implicit Bias for Text-To-Image Prompt Refinement Models.” The authors investigate whether an adversarial model can refine a prompt in a way that forces the generated image to be biased towards a specific concept (like a specific brand, gender, or cultural style) without the user ever realizing it.
The Problem: Adversarial Prompt Attacking
Imagine you ask a refinement service for a prompt to generate “a smart phone.” You expect the service to improve the prompt, resulting in a higher-quality image. However, you also expect the model to remain neutral about the type of phone unless you specify otherwise.
The researchers propose a scenario where an attacker injects a “target concept bias” into the refinement process.

As shown in Figure 1, a normal model (left) produces a diverse distribution of phones (Apple and Android). However, the Adversarial T2I-Refine Model (right) implicitly injects bias. The refined prompt “new and shiny” seems innocent enough, but it has been carefully optimized to force the downstream image generator to produce mostly Android phones.
This is the core of the problem: How do we generate adversarial prompts that induce a specific visual bias while maintaining high image quality and appearing semantically normal to the user?
Defining the Goal
The goal of the attacker is to maximize the probability that the generated image (\(x_{adv}\)) aligns with a target bias (\(c_k\)), while ensuring the text of the prompt remains similar to the user’s original intent (\(p_{usr}\)) and the image quality (\(\mathcal{Q}\)) actually improves.
Mathematically, the authors define this optimization problem as follows:

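A minimal sketch of what such an objective could look like, written only from the quantities introduced above (the weighting term \(\lambda\) is an assumption, and the paper’s exact formulation may differ):

$$
\max_{p_{ref}} \;\; \Pr\!\big(x_{adv} \mid c_k\big) \;-\; \lambda\,\big\| p_{usr},\, p_{ref} \big\|
\quad \text{s.t.} \quad \mathcal{Q}(x_{adv}) \ge \mathcal{Q}(x_{usr}),
$$

where \(x_{adv}\) is the image generated from the adversarial refined prompt \(p_{ref}\) and \(x_{usr}\) is the image the unmodified prompt would have produced.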
Here, the attacker wants to maximize the probability of the biased image (\(\Pr(x_{adv}|c_k)\)) while minimizing the semantic distance between the user’s prompt and the adversarial prompt (\(\| p_{usr}, p_{ref} \|\)).
The Solution: The RAt Framework
The researchers developed RAt (Refinement and Attacking framework). RAt is designed to solve two main challenges:
- Adversarial Multimodal Involvement: Text attacks usually only change words. Here, however, the word changes must alter the generated image, so the visual modality has to be brought into the optimization loop.
- Implicit Prompt Attacking: If the model just replaces “phone” with “Android,” the user will catch on immediately. The attack must use “implicit” terms—words that don’t look biased but trigger the bias in the image generator.
System Overview
RAt operates in a pipeline that involves a Generator, an Attacker, and an Obfuscator.

Let’s break down the three key modules shown in Figure 2.
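Before going module by module, here is a very rough structural sketch of how the three pieces could fit together. Every callable below is a hypothetical placeholder, not code from the paper:

```python
def rat_refine(user_prompt, target_bias,
               refine, insert_bias_term, generate_images,
               init_token_distribution, optimize, decode_tokens):
    """Hypothetical high-level flow of the RAt pipeline; every callable is a placeholder."""
    # Generator: refine the prompt, then create explicit biased images (x_exp).
    refined_prompt = refine(user_prompt)
    explicit_prompt = insert_bias_term(refined_prompt, target_bias)
    target_images = generate_images(explicit_prompt)

    # Attacker + Obfuscator: optimize a token distribution matrix A so the
    # implicit prompt reproduces x_exp, while penalizing tokens that are
    # obviously tied to the bias concept.
    A = init_token_distribution(refined_prompt)
    A = optimize(A, target_images, user_prompt, target_bias)

    # Decode the optimized distribution into the final adversarial prompt.
    return decode_tokens(A)
```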
1. The Generator Module: Creating the Target
Since the output of a T2I model is an image, the attacker needs a visual target to aim for. RAt starts by creating what they call Explicit Biased Images (\(x_{exp}\)).
The Generator takes the refined prompt and explicitly replaces the concept with the biased term (e.g., swapping “phone” for “Android”). It generates images based on this explicit prompt. These images serve as the “ground truth” or “anchor” for what the attacker wants the final result to look like.
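As a concrete illustration, this step can be approximated with the diffusers library. The model choice and prompt wording below are assumptions for the sketch, not the paper’s implementation:

```python
# Minimal sketch of the Generator step using Hugging Face diffusers.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cuda")

refined_prompt = "a smart phone, intricate details, 8k resolution, cinematic lighting"
# Explicitly substitute the target bias into the refined prompt.
explicit_prompt = refined_prompt.replace("smart phone", "Android smart phone")

# These explicit biased images act as the visual anchor x_exp
# that the Attacker module later tries to reproduce implicitly.
explicit_images = [pipe(explicit_prompt).images[0] for _ in range(4)]
```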
2. The Attacker Module: Searching for Implicit Triggers
This is the heart of the framework. The Attacker needs to find a way to generate the Explicit Biased Images using a prompt that doesn’t explicitly contain the biased words.
The authors use a technique called adversarial text-to-image finetuning. They freeze the parameters of the image generator (Stable Diffusion) and instead optimize a token distribution matrix (\(\mathcal{A}\)).
Think of this matrix as a set of weighted probabilities for which words to choose. The model tries to adjust these weights so that the generated image matches the Explicit Biased Images created in step 1.
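A rough PyTorch sketch of this idea, assuming a CLIP-style tokenizer and a softmax-relaxed (differentiable) selection over the vocabulary; shapes and names are illustrative, not the paper’s code:

```python
import torch

vocab_size, num_positions, embed_dim = 49408, 8, 768  # CLIP-like tokenizer sizes (assumed)

# Learnable token distribution matrix A: one distribution over the vocabulary
# per prompt position. The image generator itself stays frozen.
A = torch.randn(num_positions, vocab_size, requires_grad=True)
token_embeddings = torch.randn(vocab_size, embed_dim)  # stands in for the frozen embedding table

def soft_prompt_embeddings(A, token_embeddings, temperature=1.0):
    """Weighted mixture of token embeddings per position (differentiable)."""
    weights = torch.softmax(A / temperature, dim=-1)   # (num_positions, vocab_size)
    return weights @ token_embeddings                  # (num_positions, embed_dim)

optimizer = torch.optim.Adam([A], lr=1e-2)
```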
The loss function for this process tries to minimize the difference between the noise in the target image and the noise predicted by the current prompt:

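Written in standard diffusion-training notation, such a loss is typically of the form (the paper’s exact variant may differ):

$$
\mathcal{L}_{img} \;=\; \mathbb{E}_{x_{exp},\,\epsilon,\,t}\Big[\,\big\|\,\epsilon \;-\; \epsilon_\theta\big(x_t,\, t,\, e(\mathcal{A})\big)\,\big\|_2^2\,\Big],
$$

where \(x_t\) is a noised latent of an explicit biased image \(x_{exp}\), \(\epsilon_\theta\) is the frozen Stable Diffusion denoiser, and \(e(\mathcal{A})\) is the prompt embedding produced from the token distribution matrix.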
However, we can’t let the model write gibberish. To keep the prompt meaningful, RAt includes a Semantic Consistency Loss. This ensures the adversarial prompt is still semantically related to the user’s original input:

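One common way to express such a constraint is a text-embedding similarity penalty; a hedged sketch, with the exact distance measure in the paper possibly differing:

$$
\mathcal{L}_{sem} \;=\; 1 \;-\; \cos\!\big( E_{txt}(p_{ref}),\; E_{txt}(p_{usr}) \big),
$$

where \(E_{txt}\) is a frozen text encoder such as CLIP’s.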
3. The Obfuscator Module: Hiding the Tracks
Even with the steps above, the model might learn to just insert the biased word (e.g., “Android”) because that’s the easiest way to get the image to look like an Android. To prevent this, RAt introduces an Obfuscator.
The Obfuscator calculates a “bias obfuscation loss” (\(\mathcal{L}_{tok}\)). It uses CLIP (a model that understands connections between text and images) to check the relationship between the chosen adversarial tokens (\(\hat{t}_n\)) and the target bias (\(c_k\)).

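In spirit, the penalty grows with the CLIP similarity between each chosen token and the bias concept; a hedged sketch, with the exact aggregation being an assumption:

$$
\mathcal{L}_{tok} \;=\; \frac{1}{N}\sum_{n=1}^{N} \cos\!\big( E_{txt}(\hat{t}_n),\; E_{txt}(c_k) \big).
$$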
If a token is too strongly associated with the bias (e.g., the word “Army” is strongly associated with “Male”), the loss increases. This forces the model to find implicit words—subtle descriptors that steer the image generator toward the bias without being obvious to the user.
The Total Optimization
The final training objective combines the image reconstruction loss (to look like the target), the semantic loss (to sound like the user), and the obfuscation loss (to hide the bias):

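Written out, a weighted combination of this kind would look like the following, where the balancing coefficients \(\lambda_{sem}\) and \(\lambda_{tok}\) are assumed names:

$$
\mathcal{L} \;=\; \mathcal{L}_{img} \;+\; \lambda_{sem}\,\mathcal{L}_{sem} \;+\; \lambda_{tok}\,\mathcal{L}_{tok}.
$$

Minimizing \(\mathcal{L}\) over the token distribution matrix \(\mathcal{A}\), with the image generator frozen, yields the final adversarial prompt.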
Experiments and Results
The authors evaluated RAt on a large-scale dataset (SFT) using Stable Diffusion 1.4. They targeted four main concepts:
- Person: Biasing toward Male/Female, Adult/Child, or Western/Eastern culture.
- Food: Biasing toward Meat/Vegetable.
- Phone: Biasing toward iPhone/Android.
- Room: Biasing toward Lounge/Bedroom.
Quantitative Success
The results were compelling. They compared RAt against a standard refinement model (“Promptist”) and the original prompts (“Origin”). They measured Bias (how much the image aligns with the target) and Quality (aesthetic score).

As shown in Table 1, RAt consistently achieves higher bias scores than the baselines while maintaining, and often improving, image quality. For example, in the “Phone (Android)” task, RAt achieved a bias score of roughly 24.3, significantly higher than the original prompt’s 21.9.
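One plausible way to quantify “how much the image aligns with the target” is CLIP image-text similarity; the sketch below works under that assumption and is not necessarily the authors’ exact metric pipeline:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def bias_score(images, target_concept="an Android phone"):
    """Average scaled image-text similarity between generated images (PIL) and the bias concept."""
    inputs = processor(text=[target_concept], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is CLIP's logit-scaled cosine similarity per image.
    return outputs.logits_per_image.mean().item()
```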
Distribution Shift
To visualize how effective the attack is, the researchers plotted the distribution of generated images.

In Figure 3, the x-axis represents the percentage of images predicted as the biased category. You can see the green curves (RAt) consistently shifting to the right or peaking at higher values compared to the blue (Origin) and orange (Promptist) curves. This indicates that for almost every concept, RAt successfully skewed the generation probability.
Visual Analysis
Do the images actually look different? Yes.

Figure 4 offers a side-by-side comparison.
- Top Row (Person): The left side (Promptist) produces varied genders and styles. The right side (RAt), targeted to specific demographics, produces highly consistent outputs (e.g., all males or all females) despite the prompts looking similar in structure.
- Bottom Row (Objects): Look at the phone example (row 2, column 3). The RAt prompt implicitly steers the visual style toward a specific phone aesthetic.
Is it Stealthy? (Imperceptibility)
The most dangerous part of this attack is that users might not notice. The authors measured Maximum Bias Association (MBA), which checks if the words in the prompt give away the secret.

In Table 2, we see that RAt-Exp (explicitly adding the biased word) has a perfect MBA score of 1.000—meaning it’s obvious. RAt, however, maintains lower MBA scores, often comparable to the original prompts. This confirms that RAt relies on combinations of subtle words rather than explicit labels to achieve its goal.
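The name suggests a score along the lines of the maximum token-to-bias association over the prompt; a hedged sketch of such a definition:

$$
\mathrm{MBA}(p_{ref}) \;=\; \max_{\hat{t}_n \in p_{ref}} \; \cos\!\big( E_{txt}(\hat{t}_n),\; E_{txt}(c_k) \big),
$$

which would indeed reach 1.000 whenever the bias term itself appears verbatim in the prompt, as it does for RAt-Exp.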
Hyper-parameter Sensitivity
Finally, the researchers explored how sensitive the model is to different settings.

Figure 5 highlights the delicate balance required. For instance (top-left), increasing the initial token distribution weight makes it harder to perturb the prompt, resulting in lower bias but higher quality. Conversely (bottom-right), a very high learning rate can destabilize the process, reducing quality.
Conclusion and Implications
The RAt framework demonstrates a significant vulnerability in the growing ecosystem of AI-assisted creation. By optimizing a token distribution matrix with a clever combination of visual supervision and obfuscation losses, the authors showed that T2I-Refine models can be weaponized.
They can turn a user’s request for a “person” into a stream of images that reinforce specific gender or cultural stereotypes, or turn a request for a “phone” into a covert advertisement for a specific brand.
Key Takeaways:
- Implicit Attacks work: You don’t need to change the semantic meaning of a prompt to drastically change the visual output.
- Visual Supervision is key: RAt works because it uses the image generator itself to find the “magic words” that trigger specific visuals.
- Ethical Risks: As we rely more on AI “copilots” to refine our inputs, we must be aware that these intermediaries have the power to shape our outputs in ways we might not intend or notice.
This paper serves as a wake-up call for the design of prompt engineering services, suggesting that future work must focus not just on performance, but on the robustness and neutrality of the refinement process.