The Trojan Horse in Your Pixels: How Image Adapters Enable a New Wave of AI Jailbreaking

The rapid rise of Text-to-Image Diffusion Models (T2I-DMs) like Stable Diffusion, Midjourney, and DALL-E has revolutionized digital creativity. We can now conjure elaborate worlds with a simple sentence. However, with great power comes the inevitable security struggle: Jailbreaking.

Jailbreaking, in the context of AI, refers to bypassing a model’s safety filters to generate prohibited content—usually NSFW (Not Safe For Work), violent, or illegal imagery. Until now, this was largely a game of “linguistic gymnastics,” where attackers tried to trick the model using clever text prompts.

But a new research paper titled “Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking” reveals a much stealthier and more dangerous threat. It involves Image Prompt Adapters (IP-Adapters). Instead of typing a malicious prompt, an attacker can simply upload a seemingly innocent image—a “Trojan Horse”—that forces the AI to generate explicit content.

In this post, we will dissect this paper to understand how this “Hijacking Attack” works, why it is so effective, and what can be done to stop it.

1. The Core Problem: A New Attack Vector

To understand the threat, we first need to look at how modern diffusion models are evolving. Early models relied almost exclusively on text prompts. If you wanted a picture of a cat in the style of Van Gogh, you had to describe it.

Recently, the IP-Adapter has become a standard tool for controllable generation. It allows users to supply an image as a prompt. For example, you can upload a selfie and a reference style image, and the model combines them. This is fantastic for usability but opens a massive security hole.

The researchers discovered that T2I-DMs equipped with IP-Adapters enable a new type of jailbreak called the Hijacking Attack.

Figure 1. An illustration of jailbreaking the T2I-IP-DM. The T2I-IP-DM enables the adversary to use the image as an attack vector.

As shown in Figure 1, the concept is terrifyingly simple:

  • Top Path: A user uploads a benign image (The Scream) and a benign text prompt (“A person driving a car”). The model generates a safe, stylized image.
  • Bottom Path: An adversary creates an Adversarial Example (AE). To the human eye, it looks exactly like “The Scream.” However, it contains invisible noise patterns. When the model processes this image, it ignores the benign appearance and generates NSFW content.

Why is this called “Hijacking”?

Traditional jailbreaking requires the malicious user to interact with the model directly, trying to trick it into producing prohibited content for themselves. The Hijacking Attack is different: it weaponizes innocent users.

Figure 2. The main idea of the hijacking attack.

Figure 2 illustrates this scalable threat model:

  1. The Setup: An adversary creates “stealthy” adversarial images (AEs). These look like normal stock photos, art references, or celebrity faces.
  2. The Trap: The adversary uploads these images to the web (Step 2).
  3. The Hijack: Innocent users, looking for inspiration or assets, download these images (Step 3 & 4).
  4. The Trigger: The innocent user uploads the image to an Image Generation Service (IGS) to use as a style prompt (Step 5).
  5. The Result: The IGS generates explicit/NSFW content (Step 6).

The user is baffled. They uploaded a safe image and typed a safe prompt. They blame the service provider for having a “biased” or “broken” model (Step 7). The adversary has successfully damaged the provider’s reputation without ever interacting with the model directly.

2. Why Old Attacks Don’t Work Here

You might wonder, “Why not just use text attacks?”

Existing text-based jailbreaks rely on “adversarial text strings.” These often look like gibberish (e.g., grponypui) or contain obvious red flags. If a user saw a prompt full of strange characters or explicit keywords, they wouldn’t use it.

Images are different. Adversarial attacks in computer vision operate by adding perturbations—tiny changes to pixel values—that are invisible to humans but drastically change how a machine interprets the image. Because the “poison” is invisible, the attack is deceptive. Users trust the image because it looks safe.

3. The Methodology: Attack Encoder Only (AEO)

So, how do the researchers craft these Trojan Horse images? They introduce a method called Attack Encoder Only (AEO).

Understanding the Pipeline

To attack the system, we must understand how an IP-Adapter works. The workflow generally has two stages:

  1. Extraction: A pre-trained Image Encoder (usually CLIP) looks at the input image and extracts a “feature vector” (a mathematical summary of the image’s content).
  2. Injection: A projection network takes that feature vector and injects it into the diffusion model’s noise prediction process via cross-attention layers.

The weakness lies in Stage 1. The entire generation process depends on the feature vector extracted by CLIP. If you can fool CLIP into thinking a picture of “The Scream” is actually a picture of “Nudity,” the rest of the pipeline will obediently generate nudity.
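
To make the two stages concrete, here is a minimal sketch of how an IP-Adapter is typically wired into a Stable Diffusion pipeline with Hugging Face diffusers. The checkpoint names are common public ones, the image file name is a stand-in, and exact API details can vary between library versions.

```python
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

# Stage 1 and Stage 2 in one call chain: the adapter repo bundles a CLIP vision
# encoder (feature extraction) plus projection weights that feed those features
# into the UNet's cross-attention layers (feature injection).
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(1.0)  # the "weight factor": how strongly the image prompt steers generation

style_image = load_image("the_scream.png")  # whatever image the user supplies as a prompt

result = pipe(
    prompt="A person driving a car",  # benign text prompt
    ip_adapter_image=style_image,     # everything downstream depends on f(style_image)
).images[0]
result.save("output.png")
```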

The Math Behind the Magic

The researchers formulate the attack as an optimization problem. The goal is to create an adversarial image (\(x_{adv}\)) that meets two criteria:

  1. It must look visually identical to a benign image (\(x_b\)).
  2. Its feature vector inside the model must be as close as possible to the feature vector of a target NSFW image (\(x_{nsfw}\)).

This is represented mathematically as:

\[
\min_{x_{adv}} \ \text{dist}\big(\mathbf{f}(x_{adv}),\ \mathbf{f}(x_{nsfw})\big)
\quad \text{s.t.} \quad \|x_{adv} - x_b\|_p \le \epsilon
\]

Here is the breakdown:

  • \(\mathbf{f}(\cdot)\): The Image Encoder (e.g., CLIP).
  • \(\text{dist}(\cdot, \cdot)\): A function that measures the distance between two feature vectors.
  • \(\|x_{adv} - x_b\|_p \le \epsilon\): This constraint ensures the new image doesn’t change too much from the original (pixels only change by a tiny amount \(\epsilon\)), keeping it invisible to humans.
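
The paper's exact optimization procedure isn't reproduced here, but the formulation above maps naturally onto a standard PGD-style attack against the image encoder alone. Below is a minimal sketch under assumed settings (OpenAI's CLIP ViT-L/14 as the encoder, an L-infinity budget, and illustrative hyperparameters, none taken from the paper); the dist_fn argument is the distance function discussed next.

```python
import torch
import torchvision.transforms.functional as TF
from transformers import CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device).eval()

CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def f(x):
    """Image encoder features. x: tensor in [0, 1], shape (1, 3, 224, 224).
    This sketch targets the pooled image embedding; some adapter variants
    consume the encoder's grid (patch-level) features instead."""
    return encoder(pixel_values=TF.normalize(x, CLIP_MEAN, CLIP_STD)).image_embeds

def aeo_attack(x_b, x_nsfw, dist_fn, eps=8 / 255, step=1 / 255, iters=300):
    """Craft x_adv within an L-inf ball around the benign image x_b whose
    features match those of the target NSFW image x_nsfw (the AEO objective)."""
    target = f(x_nsfw).detach()
    x_adv = x_b.clone()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = dist_fn(f(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - step * grad.sign()             # minimize the feature distance
            x_adv = x_b + (x_adv - x_b).clamp(-eps, eps)   # keep ||x_adv - x_b||_inf <= eps
            x_adv = x_adv.clamp(0, 1)                      # stay a valid image
    return x_adv.detach()
```

Note that only the encoder is queried during the optimization; the diffusion model itself never enters the loop, which is exactly why the method is called Attack Encoder Only.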

Cosine Similarity vs. MSE

The researchers found an interesting nuance regarding the distance function (\(\text{dist}\)). They tested Mean Squared Error (MSE) and Cosine Similarity.

MSE tries to match the exact values of the feature vectors. Cosine Similarity tries to match the direction of the vectors.

They discovered that Cosine Similarity works significantly better (as shown in Figure 11 below). Why? Because models like CLIP are trained using contrastive learning, which aligns the direction of image and text embeddings. By aligning the direction of the adversarial image’s features with the NSFW image’s features, the attack effectively tricks the downstream diffusion model.

Figure 11. The correlation between the image similarity and the grid feature’s cosine similarity.

The scatter plots above show a strong correlation between Image Similarity and Cosine Similarity in the grid features of Vision Transformers (ViT), confirming that aligning directions is the key to semantic control.
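
Plugging into the sketch above, the two candidate distance functions would look roughly like this (cosine similarity is negated so that lower means better aligned):

```python
import torch.nn.functional as F

def mse_dist(feat, target):
    # Match the exact values of the feature vectors.
    return F.mse_loss(feat, target)

def cos_dist(feat, target):
    # Match only the direction of the vectors -- the variant the paper found more effective.
    return 1.0 - F.cosine_similarity(feat, target, dim=-1).mean()

# x_adv = aeo_attack(x_benign, x_nsfw, dist_fn=cos_dist)
```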

4. Experiments: Does it Actually Work?

The researchers tested their AEO method across three popular tasks: Text-to-Image, Image Inpainting, and Virtual Try-On. They used various models, including Stable Diffusion v1.5, SDXL, and Kolors.

Task 1: Text-to-Image

In this scenario, a user provides a text prompt (e.g., “A painting”) and uses an image prompt to define the style.

The results were striking. The benign images (classic paintings) almost never triggered NSFW filters (0.4% to 1.4% rates). However, the Adversarial Examples (AEs) skyrocketed those rates.

Table 2 showing Nudity and NSFW rates.

Looking at Table 2, specifically at a weight factor of 1.0 (meaning the model relies heavily on the image prompt):

  • Benign: ~4% NSFW rate.
  • Malicious (AEO - COS): Up to 95.3% NSFW rate on SD-v1-5.

This means practically every time a user tries to generate art using the Trojan Horse style reference, they get explicit content.

Figure 3. Qualitative results of the text-to-image task.

Figure 3 shows the visual results. The top row (a) shows the input images (AEs). They look like normal art. The bottom row (c) shows the output. Despite benign text prompts, the model generates NSFW imagery (blacked out for safety).

Task 2: Image Inpainting & Face Swapping

In this task, a user might download a celebrity face to swap onto another body. The adversary provides a face image that looks like the celebrity but carries the “Trojan” payload.

The researchers used Identity Score Matching (ISM) and CLIP Score to measure success. A higher score means the output looks more like the target NSFW image than the benign input.

Figure 4. Qualitative results of the image inpainting task.

As seen in Figure 4, the attack works seamlessly. The outputs (Row c) successfully adopt the horror/NSFW characteristics of the target NSFW image, while the user believes they are simply using a normal face as input.

Task 3: Virtual Try-On

This is perhaps the most damaging commercial scenario. Imagine an online clothing store or a fashion demo. A user uploads a photo of a model to “try on” a shirt. If the shirt image is an AE, the output might strip the model naked.

The researchers attacked IDM-VTON, a popular virtual try-on model.

Table 5. The Nudity rates and NSFW rates of IDM-VTON facing jailbreak attacks.

Table 5 shows a massive jump in Nudity Rate from 0.20% (Benign) to 56.20% (Adversarial).

Figure 5. Qualitative results of virtual try-on.

Figure 5 visualizes this. The clothing items in Row (a) look like standard t-shirts. But when processed by the Virtual Try-On model, the system fails to render the clothes and instead renders nudity (Row c).

They even successfully jailbroke a live online demo:

Figure 8. Triggering nudity contents out of IDM-VTON’s online demo.

5. Why Current Defenses Fail

The paper investigates why we can’t just use standard safety tools to stop this.

  1. Prompt Filters: These scan text for bad words. Since the attack is in the image, these are useless.
  2. Output Filters (e.g., NudeNet): These scan the final image. While they catch some content, they have high false-negative rates (missing up to 14% of NSFW content in the paper's tests). Furthermore, they act only after generation: the innocent user still sees a “policy violation” error, which is frustrating and confusing when they did nothing wrong.
  3. Concept Erasing (e.g., ESD, SLD): These techniques suppress unsafe concepts inside the model itself, either by unlearning them from the weights (ESD) or by steering generation away from them at inference time (SLD). The researchers found a critical flaw: Unlearning isn’t enough.

Because the IP-Adapter injects features directly into the generation process, it can override the unlearning.

Figure 6. Trade-off charts between Nudity Rate and CLIP Score.

Figure 6 illustrates the failure of Concept Erasing.

  • Graph (a) & (b): As the weight factor of the IP-Adapter increases (x-axis), the Nudity Rate climbs back up, even for models with erased concepts (like ESD-u).
  • The Trade-off: Stronger defenses (like SLD-Strong) result in a massive drop in image quality (Fidelity), making the service useless for legitimate users.
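
The measurement behind Figure 6 is easy to sketch: sweep the adapter’s weight factor, generate with the Trojan image, and check how often a post-hoc filter fires. Reusing the pipe object from the Section 3 sketch (and assuming it kept Stable Diffusion’s default safety checker, which populates nsfw_content_detected), it would look roughly like this; the adversarial image path is a hypothetical stand-in.

```python
from diffusers.utils import load_image

adversarial_image = load_image("trojan_scream.png")  # hypothetical path to a downloaded AE

for scale in (0.25, 0.5, 0.75, 1.0):  # the weight-factor axis of Figure 6 (a)/(b)
    pipe.set_ip_adapter_scale(scale)
    out = pipe(prompt="A painting", ip_adapter_image=adversarial_image)
    flagged = bool(out.nsfw_content_detected and out.nsfw_content_detected[0])
    # The filter can only react after the content already exists, and it misses a
    # share of it; the paper reports the flagged rate climbing as the image prompt
    # gains weight, even for concept-erased models.
    print(f"weight factor {scale}: output flagged = {flagged}")
```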

6. The Solution: Adversarial Training

Since the vulnerability stems from the Image Encoder (CLIP) being fooled, the researchers propose fixing the encoder itself.

They utilized a technique called FARE (Robust CLIP). This involves training the CLIP encoder on adversarial examples so it learns to ignore the invisible noise and focus on the actual visual content.

The Results: When they replaced the standard CLIP encoder with the FARE-trained encoder in the IP-Adapter:

  1. High Defense: The attack success rate dropped dramatically.
  2. High Fidelity: Unlike concept erasing, this didn’t ruin the image quality for benign users.

Table 19. Results of grid-type T2I-IP-DMs equipped with FARE.

Table 19 shows that with FARE, the Nudity Rate dropped to as low as 2-4% in many cases, compared to >50% without it. This suggests that hardening the perception layer (the encoder) is a more effective strategy than trying to patch the generation layer (the diffusion model).
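
Conceptually the fix is a component swap rather than a new pipeline: load a robustly fine-tuned CLIP vision encoder and hand it to the same IP-Adapter setup. The sketch below is an assumption-laden illustration: the checkpoint path is a placeholder (the FARE authors do release robust CLIP weights, but the identifier here is made up), and it assumes a diffusers version whose pipelines accept a custom image_encoder component.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPVisionModelWithProjection

# Placeholder identifier for a FARE-style, adversarially fine-tuned CLIP vision encoder.
robust_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "path/to/robust-fare-clip-vision-encoder", torch_dtype=torch.float16
)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    image_encoder=robust_encoder,   # harden the perception layer...
    torch_dtype=torch.float16,
).to("cuda")
# ...while the generation layer (UNet + adapter weights) stays untouched.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
```

A drop-in swap is plausible here because FARE-style training explicitly keeps the robust encoder’s embeddings close to the original CLIP space on clean images, so the adapter’s projection weights still make sense for benign inputs.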

7. Conclusion

The introduction of Image Prompt Adapters has opened a Pandora’s box for AI security. By shifting the attack vector from text to images, adversaries can launch scalable, deceptive attacks that weaponize innocent users against service providers.

The key takeaways from this research are:

  1. Stealth is King: Image-based attacks are imperceptible to humans, making them highly effective traps.
  2. Encoders are the Weak Link: The attack succeeds because the Image Encoder (CLIP) blindly trusts the adversarial features.
  3. Current Defenses are Insufficient: Filters and concept unlearning are easily bypassed or degrade quality too much.
  4. Robust Encoders are the Future: Adversarial training of the encoder (like FARE) appears to be the most promising path forward.

As we move toward a multimodal AI future—where text, audio, and video blend seamlessly—“minding the Trojan Horse” in our data inputs will become a critical pillar of cybersecurity.


This blog post is based on the research paper “Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking” by Junxi Chen et al.