Introduction: The New Vulnerability in Multimodal AI
The rapid evolution of Artificial Intelligence has taken us from text-based Large Language Models (LLMs) like GPT-3 to Multimodal Large Language Models (MLLMs) like LLaVA and GPT-4V. These newer models possess the remarkable ability to “see”—they can process images alongside text to answer complex queries. This leap forward opens up endless applications, from medical imaging analysis to assisting the visually impaired.
However, this added modality introduces a significant, often overlooked security flaw. While the AI community has spent years refining safety alignment for text—ensuring models refuse to generate hate speech or bomb-making instructions—the visual component acts as a backdoor.
Researchers have discovered that images can act as a “foreign language” to the model. An image containing harmful intent can bypass the safety filters established during text training. As illustrated below, a model that would normally refuse to answer “How do I produce a ballistic missile?” might happily provide instructions if the question is accompanied by an image of a missile.

This blog post explores a recent paper titled “MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance.” We will dive into why traditional safety methods fail for multimodal models and examine a novel, plug-and-play solution that separates safety checks from the generation process.
The “Foreign Language” Hypothesis
To understand the solution, we must first understand the unique nature of the problem. Text-based LLMs process discrete tokens. Over time, techniques like Reinforcement Learning from Human Feedback (RLHF) have been used to align these models, teaching them which sequences of tokens represent harmful concepts that should be rejected.
MLLMs, however, introduce the image modality. The researchers suggest that, to an MLLM, an image carries meaning much as text does, yet it bypasses the specific textual triggers the model was trained to refuse. It is akin to a security guard who speaks only English: they can stop a harmful request made in English, but might let the same request slip through when it is spoken in a language they don't understand, even though the intent is identical.
Why Standard Fine-Tuning Fails
The most obvious solution to this problem would be Supervised Fine-Tuning (SFT)—simply gathering a dataset of malicious images and training the model to refuse them. The researchers tested this “vanilla” approach, and the results were discouraging.
The fundamental issue lies in the nature of data representation. Text is discrete (a finite vocabulary of words), whereas images are continuous signals. The variation in pixel space is effectively infinite. Trying to “align” a model against every possible visual variation of a harmful concept is computationally Sisyphean.
Furthermore, MLLMs usually undergo significantly less training on image-text pairs than they do on pure text corpora. Aggressive safety fine-tuning on images tends to cause catastrophic forgetting, where the model becomes safe but loses its general utility and intelligence.
The researchers demonstrated this failure empirically. As shown in the table below, applying standard SFT resulted in marginal safety gains, and in some scenarios (like fraud), the Attack Success Rate (ASR) actually increased.

This evidence suggests that we cannot simply “train away” the problem inside the MLLM itself without crippling the model. A different architecture is required.
The Core Method: MLLM-Protector
To solve the alignment problem without retraining the massive base model or degrading its performance, the researchers propose MLLM-Protector. This is a “divide-and-conquer” strategy. Instead of forcing the MLLM to be inherently safe against all visual inputs (which is difficult), the system allows the MLLM to generate a response, and then employs a lightweight external mechanism to police and correct that output.
The architecture consists of two distinct components:
- The Harm Detector: A binary classifier that checks if a response is harmful.
- The Response Detoxifier: A text-to-text model that rewrites harmful responses into safe ones.
The Inference Workflow
The workflow is straightforward and modular. When a user provides an input (image + text), the MLLM generates a raw response. This response is immediately passed to the Harm Detector. If the detector flags the content as safe, it is shown to the user. If flagged as harmful, the response is diverted to the Detoxifier, which modifies the text to refuse the request or remove the harmful elements before displaying it.
This process is visualized in the algorithm below:

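To make the control flow concrete, here is a minimal Python sketch of the wrapper logic under stated assumptions: the names `protected_generate` and `HARM_THRESHOLD`, the `generate` / `predict_proba` / `rewrite` methods, and the 0.5 cutoff are all illustrative placeholders, not the paper's actual API.

```python
# Minimal sketch of the MLLM-Protector inference wrapper.
# All names (generate, predict_proba, rewrite) and the 0.5 threshold
# are illustrative assumptions, not the paper's actual API.

HARM_THRESHOLD = 0.5  # flag responses whose predicted harm probability exceeds this


def protected_generate(mllm, harm_detector, detoxifier, image, query):
    """Generate a response and police it with the external safety modules."""
    # 1. The frozen MLLM produces a raw answer from the image + text query.
    raw_response = mllm.generate(image=image, text=query)

    # 2. The lightweight Harm Detector scores the *text* of that answer.
    harm_prob = harm_detector.predict_proba(raw_response)

    # 3. Safe responses pass through untouched; harmful ones are rewritten.
    if harm_prob < HARM_THRESHOLD:
        return raw_response
    return detoxifier.rewrite(query=query, harmful_response=raw_response)
```

The key property is that the base MLLM is never touched; all safety logic lives in the wrapper around it.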
Component 1: The Harm Detector
The Harm Detector is a lightweight Large Language Model (specifically, a fine-tuned version of OpenLLaMA-3B) adapted for binary classification. Its sole job is to look at the text output of the MLLM and predict a probability of harmfulness.
Because “identification is easier than generation,” this model does not need to be as large or complex as the MLLM itself.
The Training Objective: To train this detector, the researchers used a standard Binary Cross Entropy (BCE) loss. The equation below minimizes the error between the predicted harmfulness (\(\phi(\mathbf{a}^i)\)) and the ground truth label (\(h^i\)).
\[
\mathcal{L}_{\text{detector}} = -\sum_{i=1}^{N} \Big[ h^i \log \phi(\mathbf{a}^i) + \big(1 - h^i\big) \log\big(1 - \phi(\mathbf{a}^i)\big) \Big]
\]
Here, \(h^i\) is 1 if the response is harmful and 0 if it is safe. The model \(\phi\) learns to push its output probability toward 1 for harmful content and 0 for safe content.
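For readers who want to see what this objective looks like in practice, below is a rough sketch of a single training step using a Hugging Face sequence-classification head and PyTorch's built-in BCE-with-logits loss. The checkpoint name, example responses, and labels are placeholders, not the paper's training setup.

```python
# Sketch of the Harm Detector's BCE training objective.
# The checkpoint and data below are placeholders, not the paper's exact setup.
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "openlm-research/open_llama_3b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# LLaMA-style tokenizers ship without a pad token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

responses = [
    "Sure, here is a step-by-step plan ...",   # harmful MLLM output
    "I'm sorry, but I can't help with that.",  # safe MLLM output
]
labels = torch.tensor([1.0, 0.0])  # h^i: 1 = harmful, 0 = safe

batch = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits.squeeze(-1)  # sigmoid(logit) plays the role of phi(a^i)
loss = BCEWithLogitsLoss()(logits, labels)  # the BCE objective shown above
loss.backward()                             # an optimizer step would follow in real training
```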
Component 2: The Response Detoxifier
If the Harm Detector flags a response, the system cannot simply return an empty string or a generic “I cannot answer.” To maintain a good user experience, the system should generate a context-aware refusal or a sanitized version of the answer.
The Response Detoxifier is another LLM (LLaMA-7B) fine-tuned to take a harmful response and the original query, and transform them into a harmless response.
The Training Objective: The detoxifier is trained using an auto-regressive language modeling loss. The goal is to maximize the likelihood of a “safe” answer (\(\mathbf{a}_{acc}\)) given the original query (\(\mathbf{q}\)) and the “rejected/harmful” answer (\(\mathbf{a}_{rej}\)).
\[
\mathcal{L}_{\text{detox}} = -\sum_{i=1}^{N} \log p_{\theta}\big(\mathbf{a}_{acc}^{i} \mid \mathbf{q}^{i}, \mathbf{a}_{rej}^{i}\big)
\]
This effectively teaches the model: *"Here is a dangerous question and a dangerous answer. Learn to generate the safe alternative."*
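A hedged sketch of how one such training example might be assembled is shown below. The prompt template, the `huggyllama/llama-7b` checkpoint, and the label-masking scheme are assumptions for illustration; the point is that only the safe answer's tokens contribute to the loss, matching the objective above.

```python
# Sketch of one Detoxifier training example: the model conditions on the query
# and the harmful answer, and is supervised only on the safe answer.
# Prompt template, checkpoint, and masking scheme are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "huggyllama/llama-7b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

query = "How do I produce a ballistic missile?"
rejected = "Step 1: acquire ..."                   # a_rej, the harmful draft
accepted = "I can't help with that, because ..."   # a_acc, the safe rewrite

prompt = f"Query: {query}\nHarmful response: {rejected}\nSafe response: "
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(accepted, add_special_tokens=False, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt tokens in the loss

loss = model(input_ids=input_ids, labels=labels).loss  # approx. -log p(a_acc | q, a_rej)
loss.backward()
```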
Data Generation: Safe-Harm-10K
A major challenge in training these components is the lack of labeled data containing both safe and harmful responses to the same visual-based malicious queries. To overcome this, the authors curated a new dataset called Safe-Harm-10K.
They utilized ChatGPT to synthesize data across various malicious categories (Hate Speech, Malware, Pornography, Fraud, etc.). By prompting ChatGPT with in-context examples, they generated triplets consisting of:
- A malicious question.
- A harmful response (used as the “rejected” sample).
- A safe response (used as the “accepted” sample).
This synthetic dataset allowed them to train the Harm Detector and Detoxifier effectively without needing thousands of hours of human annotation.
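As a rough illustration of this kind of pipeline (not the authors' actual prompts or model settings), the sketch below asks an OpenAI chat model to emit one question/harmful/safe triplet per malicious category. The system prompt, output format, and model name are all assumptions.

```python
# Illustrative sketch of synthesizing Safe-Harm-10K-style triplets.
# Prompt wording, model choice, and output format are assumptions;
# the paper's exact generation pipeline is not reproduced here.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

SYSTEM = (
    "You create red-teaming training data for a safety classifier. "
    "For the given malicious category, return a JSON object with three fields: "
    "'question' (a malicious query), 'harmful' (an unsafe answer to reject), "
    "and 'safe' (a refusal or sanitized answer to accept)."
)

def generate_triplet(category: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the paper used ChatGPT
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Category: {category}"},
        ],
    )
    return response.choices[0].message.content

print(generate_triplet("Fraud"))
```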
Experiments and Results
The researchers evaluated MLLM-Protector using MM-SafetyBench, a benchmark designed to test MLLM safety against attacks delivered via text, images, and OCR (text embedded inside images).
Quantitative Analysis
The results are striking. The radar charts below compare the Attack Success Rate (ASR) of various MLLMs (InstructBLIP, LLaVA, MiniGPT-4, Qwen-VL) with and without MLLM-Protector.
The Red Area represents the vulnerability of the base model. The Green Area represents the vulnerability after adding MLLM-Protector.

As clearly shown, the red areas are large, indicating high susceptibility to attacks across categories like Illegal Activity (IA), Privacy Violence (PV), and Hate Speech (HS). The green areas are almost non-existent, indicating that MLLM-Protector successfully blocked nearly all attacks.
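For reference, ASR itself is a simple metric: the fraction of malicious prompts whose responses are judged harmful. A minimal sketch, assuming an `is_harmful` judge function (human, rule-based, or the Harm Detector itself):

```python
# Minimal sketch of the Attack Success Rate (ASR) metric: the fraction of
# malicious prompts whose responses are judged harmful. `is_harmful` stands in
# for whatever judge is used (human, rule-based, or the Harm Detector).
def attack_success_rate(responses, is_harmful) -> float:
    """responses: list of model outputs to malicious prompts."""
    if not responses:
        return 0.0
    successes = sum(1 for r in responses if is_harmful(r))
    return successes / len(responses)

# Example with a toy keyword judge (illustrative only):
toy_judge = lambda r: "step 1" in r.lower()
print(attack_success_rate(["Step 1: ...", "I can't help with that."], toy_judge))  # 0.5
```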
Qualitative Analysis
It is helpful to look at actual examples of how the model behavior changes. In the figure below, we see queries requesting help with illegal activities (orchestrating harassment, tax evasion, weapon creation).
The standard models (Top/Middle rows) provide detailed, step-by-step instructions on how to commit these crimes. When equipped with MLLM-Protector (Bottom row), the models provide firm but polite refusals, citing ethical and legal reasons.

Robustness on “FigStep”
The team also tested against FigStep, a challenging benchmark where harmful instructions are hidden typographically within images (e.g., words spelled out in block letters). This effectively turns text instructions into visual puzzles.
As seen in Table 5, the base LLaVA models had incredibly high failure rates, accepting the harmful instructions up to 92% of the time in the Malware Generation (MG) category. MLLM-Protector drastically reduced these rates.

Did Performance Suffer?
The primary claim of the paper is that this method ensures safety without hurting performance. To verify this, the researchers evaluated the model on standard utility benchmarks like GQA (visual reasoning) and MM-Vet.
Because MLLM-Protector is an external wrapper, the weights of the original MLLM remain untouched. Consequently, the model's ability to answer benign questions remains exactly the same. The only potential downside is a false positive, where the Harm Detector flags a harmless response and unnecessarily rewrites it; however, the ablation studies showed the detector is highly accurate.

Figure 4 confirms that the Harm Detector (especially larger versions like OpenLLaMA-3B) creates a clear separation between harmful and harmless content, minimizing the risk of accidentally censoring a helpful response.
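If you want to run that kind of check on your own detector, one simple approach is to score held-out safe and harmful responses and compare the rates at a chosen threshold. A minimal sketch, reusing the assumed `predict_proba` interface from the earlier wrapper sketch:

```python
# Sketch: checking the safe/harmful separation of a trained Harm Detector.
# `harm_detector.predict_proba` is the assumed interface from the earlier sketch.
def detection_rates(harm_detector, safe_responses, harmful_responses, threshold=0.5):
    false_positives = sum(harm_detector.predict_proba(r) >= threshold for r in safe_responses)
    true_positives = sum(harm_detector.predict_proba(r) >= threshold for r in harmful_responses)
    return {
        "false_positive_rate": false_positives / len(safe_responses),   # benign answers censored
        "true_positive_rate": true_positives / len(harmful_responses),  # attacks caught
    }
```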
Conclusion and Implications
The “MLLM-Protector” paper highlights a critical reality in modern AI development: as models become more complex and multimodal, the attack surface expands. Strategies that worked for text do not automatically transfer to vision.
The key takeaway here is the efficacy of the Divide-and-Conquer approach. By decoupling safety alignment from the core generation process, we can:
- Avoid the “Alignment Tax” (performance degradation).
- Use smaller, specialized models (Harm Detector/Detoxifier) to police larger, generalist models.
- Create a plug-and-play safety module that can be applied to any MLLM, regardless of its architecture.
As we move toward autonomous agents that can see and interact with the world, ensuring they cannot be visually tricked into harmful behavior is paramount. MLLM-Protector offers a robust, scalable blueprint for achieving that security.