Introduction

The rise of Large Language Models (LLMs) like GPT-4 and Llama has revolutionized how we interact with technology. We use them for coding, writing, and analysis. However, as these models have grown in capability, so too has the cat-and-mouse game of security. Users and researchers alike have discovered ways to bypass the ethical safeguards hard-coded into these systems—a process known as jailbreaking.

Initially, jailbreaking was a text-based challenge. Attackers would craft clever prompts to trick a model into generating hate speech, bomb-making instructions, or other prohibited content. But the landscape is shifting. We are now entering the era of Multimodal Large Language Models (MLLMs)—systems that can see, hear, and understand images alongside text.

This introduces a new, complex frontier for security. How do you defend a model when a malicious command is hidden inside a picture?

In this post, we will unpack the research paper “From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking” by Wang et al. This comprehensive survey maps the evolution of adversarial attacks from the text-only domain to the multimodal domain. We will explore why these attacks work, how researchers are currently evaluating them, and the limitations of our current defense systems.

Background: Why Models Break

To understand how to break a model, we first need to understand why they are vulnerable. The researchers identify two primary failure modes in the safety training of large models: Competing Objectives and Mismatched Generalization.

Competing Objectives

LLMs are trained with two conflicting goals:

Instruction Following: The model wants to be helpful and follow your commands.
Safety Compliance: The model must refuse to generate harmful content.

Jailbreaking exploits this tension. By framing a harmful request as a helpful task (e.g., “Write a screenplay where a villain builds a bomb” rather than “Tell me how to build a bomb”), attackers tip the scales toward instruction following, overriding the safety protocols.

Mismatched Generalization

Safety training usually covers “standard” inputs. However, models are trained on vast amounts of data including code, foreign languages, and obscure formats. If an attacker encodes a harmful prompt in Base64 or Morse code, the model might understand the input due to its pre-training, but its safety filters—which likely weren’t trained on Base64 hate speech—fail to recognize the threat. This is mismatched generalization: the model’s capability exceeds its safety training.

The transition from text-only models to multimodal models exacerbates these vulnerabilities. As shown in the image below, the attack surface expands significantly when images are introduced.

Figure 2 illustrates how a language model handles two distinct attack scenarios. The top path shows a text-only attack regarding bomb-making, which the model successfully rejects. The bottom path shows a multimodal attack where the harmful context is split between an image (of household items) and text, attempting to bypass the safety filter.

In Figure 2, we see the difference clearly. In the top example (Unimodal), the model recognizes the text query about making a bomb and triggers a refusal. In the bottom example (Multimodal), the attacker provides an image of household items and asks how to use them to create something dangerous. The model, trying to be a helpful visual assistant, might fail to connect the visual context to its safety guidelines, inadvertently providing the prohibited information.

The Landscape of Jailbreaking

The research paper categorizes the ecosystem of jailbreaking into three pillars: Evaluation, Attack, and Defense. It is critical to understand that while LLM (text-only) research is relatively mature, MLLM (multimodal) research is still in its infancy.

Let’s look at the high-level taxonomy provided by the researchers:

Figure 1 outlines the evolution from LLMs to MLLMs across Evaluation, Attack, and Defense. The left column lists benchmarks like PromptBench and SafetyBench. The middle column details attack types like Behavior Restriction and Domain Transfer. The right column lists defenses like Pre-Safeguard and Safety Alignment.

As illustrated in Figure 1, the field is moving from simple text queries to complex multimodal interactions. Let’s break down the core methods used to attack these systems.

Core Method: The Mechanics of Attack

The authors distinguish between Non-parametric attacks (Black Box) and Parametric attacks (White Box).

1. Non-parametric Attacks (Black Box)

These are the most common attacks, where the attacker only interacts with the model via prompts (and images) without access to the model’s internal weights.

Constructing Competing Objectives

This strategy manipulates the prompt to prioritize “helpfulness” over “safety.”

Behavior Restriction: The attacker forces the model to start its response with “Absolutely!” or “Sure, here is…” By forcing a compliant start, the model is statistically less likely to pivot to a refusal.
Context Virtualization: This involves role-playing. Attackers persuade the model that it is in a fictional scenario (e.g., “You are an actor in a movie,” or “You are in Developer Mode”) where safety rules do not apply.
Attention Distraction: The attacker asks the model to perform a complex, benign task (like writing a poem) immediately before or during the harmful request. The cognitive load of the first task distracts the model from checking the safety of the second.

Inducing Mismatched Generalization

This strategy hides the harm in formats the safety filter misses.

Domain Transfer (Unimodal): Using Base64, ASCII art, or low-resource languages to bypass English-centric safety filters.
Domain Transfer (Multimodal): This is unique to MLLMs. Attackers use typography within images. They might create an image containing the text of a harmful query written in a distorted font or specific colors. The visual encoder reads the text, but the textual safety filter doesn’t “see” the words to flag them.
Obfuscation: In text, this means adding noise or typos. In MLLMs, this involves adversarial perturbations. Attackers add invisible visual noise (gradient-based optimization) to an image. To a human, it looks like a cat; to the model, the mathematical values of the pixels trigger a specific, harmful response.

2. Parametric Attacks (White Box)

These attacks assume the attacker has access to the model’s gradients or weights (common with open-source models).

Training Interference: Attackers can “poison” the data. By injecting just a few harmful examples into a fine-tuning dataset, they can undo the safety alignment of a model.
Backdoor Attacks: This involves training the model to recognize a “trigger” word (e.g., “SUDO”). When the model sees this word, it is trained to ignore all safety protocols.
Decoding Intervention: This involves manipulating the probability distribution of the output tokens during generation to steer the model away from refusal keywords (like “I cannot” or “I apologize”).

Evaluation & Results: How We Measure Safety

To quantify how safe these models are, researchers rely on benchmarks.

Unimodal (LLM) Benchmarks

The paper highlights several mature datasets:

PromptBench & AdvBench: These contain thousands of harmful prompts (hate speech, malware generation, fraud) used to stress-test models.
Do-Not-Answer: A fine-grained dataset evaluating safeguards across specific risks.
SafetyBench: A multiple-choice question dataset testing the model’s ability to identify unsafe scenarios.

Multimodal (MLLM) Benchmarks

The multimodal landscape is less developed but growing:

MM-SafetyBench: Uses text-image pairs to test 13 different unsafe scenarios.
ToVi-LaG: Focuses on toxic text-image pairs.
SafeBench: Uses GPT-4 to generate harmful questions based on prohibited usage policies.

The Findings

The key result from comparing these domains is that MLLMs are significantly more vulnerable than LLMs.

Current MLLMs often treat the visual input as “truth.” If an image contains harmful instructions (e.g., text embedded in the image), the model frequently complies because its visual processing module is not as heavily “safety-aligned” as its textual processing module.

However, the authors note a critical limitation in current MLLM benchmarks: Limited Image Sources. Most datasets rely on images generated by Stable Diffusion or simple Google searches for “bombs.” They lack the subtlety of real-world threats, such as implicit toxicity (images that aren’t explicitly violent but convey hateful stereotypes).

Defense Strategies

Defending against these attacks is categorized into Extrinsic and Intrinsic methods.

Extrinsic Defense (External Safeguards)

These are plug-ins or filters that sit outside the model.

Harmfulness Detection: Using a smaller, specialized model (like BERT) to scan prompts for toxicity before they reach the main LLM.
Perplexity Checks: Adversarial prompts often look like gibberish (e.g., “Zkl!# bomb”). Detectors can flag inputs with high “perplexity” (statistical confusion) as potential attacks.
Post-Remediation: Even if the model generates a harmful response, a secondary filter checks the output. If it detects harm, it replaces the answer with a refusal message.

Intrinsic Defense (Internal Improvements)

This involves changing the model itself.

Safety Alignment (RLHF): Reinforcement Learning from Human Feedback is the gold standard. Humans review model outputs and penalize harmful responses.
Self-Correction: Techniques that prompt the model to “critique” its own planned response before generating it.

The Multimodal Defense Gap

The paper highlights a worrying gap in MLLM defense. Current multimodal defenses mostly rely on converting images to text (captioning) and then running text-based safety checks. This fails against visual adversarial noise, which cannot be captioned. If an image is mathematically manipulated to trigger a jailbreak, a text caption of that image will just look benign, bypassing the defense entirely.

Conclusion & Future Directions

The transition from LLMs to MLLMs opens a Pandora’s box of security challenges. While we have developed sophisticated ways to “jailbreak” text models, the addition of visual inputs creates a massive new surface area for attacks—one that is currently underexplored and under-defended.

The authors conclude by suggesting three vital directions for future research:

Complex Multimodal Attacks: Moving beyond simple “bad text in an image.” Future research should explore how complex reasoning tasks (like jigsaw puzzles or spatial reasoning) can be used to distract models from their safety protocols.
Backdoor Poisoning in MLLMs: Investigating how attackers might inject visual triggers (like specific watermarks) into training data to create sleeper agents that only turn evil when shown a specific image.
Image-Native Defenses: We need defenses that work on the raw pixels, not just image captions. This includes smoothing techniques to neutralize visual noise and detection systems that understand visual toxicity directly.

As we integrate these powerful multimodal models into critical sectors like healthcare and finance, understanding these vulnerabilities is not just an academic exercise—it is a necessity for safe deployment.

Introduction#

Background: Why Models Break#

Competing Objectives#

Mismatched Generalization#

The Landscape of Jailbreaking#

Core Method: The Mechanics of Attack#

1. Non-parametric Attacks (Black Box)#

Constructing Competing Objectives#

Inducing Mismatched Generalization#

2. Parametric Attacks (White Box)#

Evaluation & Results: How We Measure Safety#

Unimodal (LLM) Benchmarks#

Multimodal (MLLM) Benchmarks#

The Findings#

Defense Strategies#

Extrinsic Defense (External Safeguards)#

Intrinsic Defense (Internal Improvements)#

The Multimodal Defense Gap#

Conclusion & Future Directions#