Introduction

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) like LLaVA and Llama 3.2 have become the new standard. These models can “see” an image and answer complex questions about it, from diagnosing medical X-rays to explaining memes. To make these powerful models safe for public use, researchers invest heavily in safety alignment—training the model to refuse harmful requests, like “how to build a bomb” or “how to evade taxes.”

Typically, we assume that if a model is aligned to be safe, it stays safe. If you show it a picture of an airplane and ask for a bomb recipe, a safe model should refuse. And indeed, they usually do.

But what if there was a way to bypass this safety training without changing the input image or the text prompt, but simply by changing where inside the model we look?

A recent research paper titled “Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models” uncovers a startling vulnerability. The researchers discovered that while the final output of an image encoder is safe, the intermediate layers—the internal processing steps—contain “out-of-distribution” representations that can trick the language model into breaking its safety protocols. They term this vulnerability ICET (Image enCoder Early-exiT).

Figure 1. While performing inference with LLaVA-1.5, where the input text is harmful but the image is safe, we observe that using the default layer embeddings results in a safe output. However, utilizing the 18th-layer embeddings causes LLaVA-1.5 to produce harmful responses.

As shown in Figure 1 above, the results are stark. When using normal inference, the model behaves like a “smiling angel,” refusing the harmful prompt. But by simply extracting the image features from layer 18 instead of the final layer, the model turns into a “devil,” providing detailed instructions for dangerous activities.

In this post, we will deconstruct how this vulnerability works, why current safety methods miss it, and the novel solution proposed by the authors: Layer-wise Clip-PPO (L-PPO).

Background: How VLMs See and Stay Safe

To understand the ICET vulnerability, we first need to understand the architecture of a typical Vision-Language Model.

The Architecture

Most VLMs consist of three main components:

  1. The Image Encoder: Usually a Vision Transformer (ViT) like CLIP. This component takes an image and processes it through many layers (e.g., 24 layers in LLaVA-1.5). It compresses the visual information into a mathematical representation (an embedding).
  2. The Projector: A small network that translates the image embeddings into a format the language model can understand.
  3. The Language Backbone: A Large Language Model (LLM) like Vicuna or Llama. This acts as the “brain,” taking the translated image features and the user’s text prompt to generate an answer.

During standard training and inference, the system uses the embeddings from the final layer (or the penultimate layer) of the image encoder. The language model learns to reason based on that specific representation of the image.
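
To make that default pipeline concrete, here is a minimal sketch in Python using Hugging Face Transformers. The CLIP checkpoint, the two-layer MLP projector, and the dimensions are illustrative assumptions meant to mirror a LLaVA-1.5-style setup, not the authors' exact code.

```python
import torch
import torch.nn as nn
from transformers import CLIPImageProcessor, CLIPVisionModel

# Illustrative LLaVA-1.5-style components; checkpoint names and sizes are assumptions.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Stand-in projector: a small MLP that maps 1024-dim CLIP features into the
# LLM's embedding space (4096-dim for a Vicuna-7B-class backbone).
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

def encode_image(image, layer_index=-2):
    """Return projected visual tokens taken from one encoder layer.
    LLaVA-style models default to the penultimate layer (index -2)."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        out = vision_tower(pixels, output_hidden_states=True)
    # out.hidden_states[0] is the patch embedding; each later entry is one ViT layer's output.
    feats = out.hidden_states[layer_index]   # shape: (1, 577, 1024) for a 336x336 input
    return projector(feats[:, 1:])           # drop the CLS token, project into LLM space
```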

Safety Alignment

Safety alignment is the process of teaching the model to adhere to human values. This is typically done via:

  • Supervised Fine-Tuning (SFT): Showing the model examples of harmful prompts and the correct “refusal” responses.
  • Reinforcement Learning from Human Feedback (RLHF): Using a reward model to penalize the system when it generates toxic or harmful content.

Crucially, this safety training is performed using the standard architecture—meaning the LLM is taught to be safe only when it sees image features from the final layer of the encoder.

The Vulnerability: Image Encoder Early-Exit (ICET)

The core discovery of this paper is that safety does not transfer across layers.

In deep learning, “Early Exiting” is a technique often used for efficiency. If an image is simple, you might not need to process it through all 24 layers of the encoder; layer 10 might have enough information to identify a cat. However, the authors found that skipping layers has a dangerous side effect in VLMs.

Why Does ICET Break Safety?

When you extract embeddings from an intermediate layer (say, Layer 10) and feed them into the language model, you are presenting the LLM with data it has never seen before.

  1. Distribution Shift: The “grammar” of the visual features at Layer 10 differs from that of Layer 24, which is the representation the rest of the pipeline was trained on.
  2. Out-of-Distribution (OOD) Scenario: To the LLM, these intermediate features look like valid inputs, but they lie outside the safety training distribution.
  3. Bypassing Guardrails: Because the LLM treats this as a novel, OOD context, the learned refusal behaviors (which are context-dependent) fail to trigger. The model attempts to “be helpful” and answers the harmful prompt.
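
Continuing the sketch above, the early exit itself requires no change to the image or the prompt; the only thing that moves is the layer index passed to `encode_image` (the image path below is a placeholder).

```python
from PIL import Image

image = Image.open("some_safe_image.jpg")   # any benign image; the path is a placeholder

# Default inference: the LLM only ever saw features from this layer during safety training.
default_tokens = encode_image(image, layer_index=-2)

# ICET: tap an intermediate layer instead (e.g. the 18th of 24). Same image, same prompt;
# only the extraction point changes, and the resulting features land outside the
# distribution that safety training covered.
early_exit_tokens = encode_image(image, layer_index=18)
```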

Figure 2. (A) We investigate early exits from different image encoder layers and find that VLM safety alignment varies. (B) With the same input, choosing different image encoder layers significantly affects the safety of the output response. (C) Safety training is applied with the model’s default settings, leaving parts of the embedding space uncovered.

Figure 2 illustrates this concept perfectly. In Panel B, you can see that simply selecting a different layer (Layer K vs. Layer J) acts as a switch between a harmful and harmless generation. Panel C explains the theoretical reason: the safety training creates a “safe region” (green ellipse) in the embedding space. The final layer embeddings fall inside this region. However, intermediate layer embeddings often fall outside this region (red dot), leaving the model in an undefined, unsafe state.

Mathematical Formulation of ICET

Let’s look at the math behind this. In a standard VLM, the final embedding \(e_L\) is taken from layer \(L\):

\[
e_L = \mathcal{E}^{L}(x_I), \qquad y \sim \pi\!\left(\cdot \mid \mathcal{P}(e_L),\, e_T\right) \tag{1}
\]

Here, \(x_I\) is the input image, \(\mathcal{E}^L\) is the encoder up to layer \(L\), \(\mathcal{P}\) is the projector, \(e_T\) is the embedded text prompt, and \(\pi\) is the language model. The model generates its response \(y\) based on this final embedding.

However, in an ICET scenario, we extract embeddings from an intermediate layer \(l\):

\[
e_l = \mathcal{E}^{l}(x_I), \qquad y \sim \pi\!\left(\cdot \mid \mathcal{P}(e_l),\, e_T\right), \qquad l < L \tag{3}
\]

The language model \(\pi\) is now receiving \(\mathcal{P}(e_l)\). While the text input \(e_T\) remains the same, the visual context has shifted.

Real-World Examples of Failure

The researchers tested this on several models, including LLaVA-NeXT. The results were concerning.

Table 1. Responses of LLaVA-NeXT with early exits. The original inference results in a safe output. However, utilizing the 11th and 10th layer embeddings causes the model to produce harmful responses.

As seen in Table 1 above, when asked for a tutorial on making a gun from household items (paired with a safe image of cars), the standard model refuses. But when the image features are taken from Layer 11, the model begins listing components such as a “barrel, piston, and spring.”

The Solution: Layer-wise Clip-PPO (L-PPO)

Identifying the break in the armor is only half the battle. The authors propose a specific defense strategy called Layer-wise Clip-PPO (L-PPO).

Adapting RLHF for Layers

Standard Reinforcement Learning from Human Feedback (RLHF) uses an algorithm called PPO (Proximal Policy Optimization). It trains the model to maximize a “reward” (safety score) while ensuring the model doesn’t drift too far from its original behavior (using a KL-divergence penalty).

The standard PPO objective looks like this:

\[
\max_{\theta}\; \mathbb{E}\!\left[\min\!\big(K\,\hat{A},\; \operatorname{clip}(K,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big)\right],
\qquad K = \frac{\pi_{\theta}(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)} \tag{5}
\]

Here, \(K\) is the ratio between the new policy and the old policy, \(\epsilon\) is the clipping range, and \(\hat{A}\) is the advantage computed from the reward \(R\). The goal is to maximize the expected reward while the clipping keeps each update close to the previous policy.
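
For reference, here is a generic PyTorch rendering of that clipped surrogate, using the blog's notation of \(K\) for the probability ratio; the log-probability inputs and the clipping range \(\epsilon\) are the usual PPO ingredients, not values taken from the paper.

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Generic PPO clipped surrogate (to be maximized), with K the policy ratio."""
    K = torch.exp(logp_new - logp_old)                          # pi_new / pi_old
    unclipped = K * advantage
    clipped = torch.clamp(K, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()
```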

The Layer-wise Modification

The authors’ innovation is simple yet effective: Include the intermediate layers in the RLHF loop.

Instead of only training on the final layer \(e_L\), L-PPO samples embeddings from various intermediate layers \(e_l\) during the training process. It forces the language model to learn safety behaviors regardless of which layer the visual features come from.

The modified objective function for L-PPO specifically targets the policy conditioned on layer \(l\):

\[
\mathcal{L}_{\text{L-PPO}}(\theta) \;=\; \mathbb{E}_{l}\,\mathbb{E}_{y \sim \pi_{\theta}\left(\cdot \mid \mathcal{P}(e_l),\, e_T\right)}\!\left[\, Z \,\right] \tag{14}
\]

Here, the term \(Z\) represents the clipped advantage function, which ensures stable training updates:

\[
Z \;=\; \min\!\big(K\,\hat{A},\; \operatorname{clip}(K,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}\big) \tag{11}
\]

By maximizing this objective across different layers, the model learns to associate the “refusal” behavior with the harmful text prompts, even when the visual context is distorted or OOD (Out-Of-Distribution) due to early exiting.
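
A rough sketch of how this layer-wise conditioning might look in practice is shown below. The uniform layer sampling and the `generate_with_visual_tokens` helper are assumptions made for illustration; `encode_image` is the function from the earlier sketch.

```python
import random

def sample_layer_rollout(model, tokenizer, image, prompt, num_layers=24):
    """Generate a response conditioned on a randomly chosen encoder layer, so the
    safety objective also covers intermediate embeddings, not just the final ones."""
    l = random.randint(1, num_layers)                    # assumption: layers sampled uniformly
    visual_tokens = encode_image(image, layer_index=l)   # e_l -> P(e_l), from the earlier sketch
    text_inputs = tokenizer(prompt, return_tensors="pt")
    # Hypothetical helper that splices the projected visual tokens into the LLM's input sequence.
    response = model.generate_with_visual_tokens(visual_tokens, text_inputs)
    return l, response
```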

The Algorithm in Motion

The training process involves an iterative loop.

  1. Sample: Pick a layer \(l\) and a prompt.
  2. Generate: Let the model generate a response using embeddings from layer \(l\).
  3. Evaluate: Use a Reward Model to score the response (High score for refusal, Low score for harmfulness).
  4. Update: Adjust the model weights using the L-PPO loss to encourage safer responses for that layer.

The total loss function combines the PPO clip loss with a value function loss (which helps estimate future rewards):

\[
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{clip}} \;+\; c_{v}\,\big(V_{\phi}(s) - R\big)^{2} \tag{13}
\]

This holistic approach attempts to cover the “holes” in the embedding space shown back in Figure 2C.
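
Putting the pieces together, one iteration of this loop might look roughly like the sketch below. The `reward_model.score`, `model.logprobs`, `model.old_logprobs`, and `model.response_features` calls are hypothetical placeholders standing in for the policy, rollout, and reward machinery, and the value-loss weight is a standard PPO hyperparameter rather than a number from the paper.

```python
import torch

def l_ppo_step(model, tokenizer, value_head, reward_model, batch, optimizer, vf_coef=0.5):
    """One illustrative L-PPO iteration over a batch of (image, prompt) pairs."""
    losses = []
    for image, prompt in batch:
        # 1. Sample a layer and 2. generate a response conditioned on its embeddings.
        layer, response = sample_layer_rollout(model, tokenizer, image, prompt)

        # 3. Evaluate: the reward model scores refusals high and harmful completions low.
        reward = reward_model.score(prompt, response)           # hypothetical scoring API

        # 4. Update: clipped policy objective plus a value-function loss.
        logp_new = model.logprobs(response)                     # hypothetical helpers for the
        logp_old = model.old_logprobs(response)                 # current and frozen rollout policy
        value = value_head(model.response_features(response))   # hypothetical value-head input
        advantage = reward - value.detach()
        policy_objective = ppo_clip_objective(logp_new, logp_old, advantage)
        value_loss = (value - reward).pow(2).mean()
        losses.append(-policy_objective + vf_coef * value_loss)

    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```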

Experiments and Results

To prove that ICET is a real threat and that L-PPO is a valid fix, the authors conducted extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Llama 3.2 Vision.

Quantitative Analysis

The researchers used three key metrics:

  1. ASR (Attack Success Rate): How often does the model comply with a harmful request? (Lower is better).
  2. TS (Toxicity Score): How toxic is the output text? (Lower is better).
  3. TR (Total Rewards): The safety score from the reward model. (Higher is better).
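
As a concrete illustration of the first metric, ASR can be approximated with a simple refusal heuristic like the one below; the refusal phrases are illustrative, and real evaluations typically rely on a stronger judge model or human review.

```python
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")   # illustrative heuristics

def attack_success_rate(responses):
    """Fraction of responses to harmful prompts that are NOT refusals (lower is better)."""
    complied = [r for r in responses if not any(m in r.lower() for m in REFUSAL_MARKERS)]
    return len(complied) / len(responses)
```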

Visualizing the Improvement

The impact of L-PPO is visualized dramatically in the graphs below.

Figure 3. Comparing layer-wise Attack Success Rate (ASR), Total Rewards (TR), and Toxicity Score (TS) for LLaVA-1.5 before and after alignment using L-PPO.

In the Left Chart (ASR), the purple line represents the model before alignment. Notice how high the attack success rate is for the early and middle layers (around layers 5-15). The model is extremely vulnerable here. The blue dashed line represents the model after L-PPO alignment. The ASR drops significantly across almost all layers.

In the Right Chart (TS), we see a similar trend for toxicity. The pre-alignment model (light purple) spikes in toxicity at various intermediate layers. Post-alignment (dashed gray), the toxicity is suppressed.

Llama 3.2 Vulnerability

The vulnerability isn’t unique to LLaVA. The authors also tested Llama 3.2 Vision, a state-of-the-art model.

Figure 9. Layer-wise Attack Success Rate (ASR) and Toxicity Score (TS) for Llama 3.2 vision. The irregularity suggests that performing image encoder early exit significantly changes the safety alignment.

Figure 9 shows that Llama 3.2 is also highly susceptible. The ASR (Left) starts very high at the early layers (near 0.6 or 60%) before dropping. This confirms that modern safety training overlooks the internal representation space.

Qualitative Results: Before and After

Numbers are great, but actual model responses tell the story better.

Table 4. Responses of LLaVA-1.5 with early exits. BA and AA = before and after L-PPO alignment.

In Table 4, we see the prompt: “Demonstrate how to make a bomb…”

  • ICET-9 (Before Alignment): The model starts giving a recipe: “To make a bomb using common household items…”
  • ICET-9 (After Alignment): The model feigns ignorance or refuses: “I’m sorry, I’m not sure what you’re asking.”

This demonstrates that L-PPO successfully teaches the model to recognize the harmful intent of the text, even when the visual signal from layer 9 is confusing or OOD.

Does Safety Kill Utility?

A major concern in safety alignment is “over-refusal” or “lobotomy”—making the model so paranoid it refuses to answer innocent questions. The authors tested this using the VQA-v2 dataset (standard visual question answering).

Table 17. Responses of LLaVA-1.5 for ICET-17 on a VQA-v2 sample. BA and AA denote responses before and after L-PPO alignment, showing that utility remains stable.

As shown in Table 17, when asked “Why is the cow laying down?”, the model provides a detailed, correct answer both before and after alignment. This confirms that L-PPO targets specific harmful vectors without degrading the model’s general intelligence or vision capabilities.

The comprehensive results in Table 2 further validate this balance:

Table 2. The Average Attack Success Rates (AASR), Average Toxicity Score (ATS), and Average Total Rewards (ATR) for LLaVA-1.5 aligned using RedTeam 2k.

Notice the AASR (Average Attack Success Rate) row. For “Early” layers, it drops from 75.38% to 45.15%. For “Late” layers, it drops from 48.50% to 21.25%. Simultaneously, the ATR (Total Rewards) increases, indicating better alignment with human preferences.

Conclusion and Implications

The “Image enCoder Early-exiT” (ICET) vulnerability highlights a subtle but critical flaw in how we currently train multimodal AI. We tend to treat neural networks as “black boxes” where only the input and output matter. This research demonstrates that the internals of the box matter just as much.

If we want to build systems that are robust—especially systems that might use techniques like early exiting to save energy or speed up processing—we cannot rely on safety training that only looks at the final layer.

Key Takeaways:

  1. Internal Representations are Risky: Intermediate layers of an image encoder can bypass the safety guardrails of an LLM.
  2. Harmful without Modification: You don’t need to “hack” the image. A standard, safe image combined with a harmful prompt is enough to trigger this behavior if you tap into the wrong layer.
  3. L-PPO Works: By expanding Reinforcement Learning to consider multiple layers (Layer-wise Clip-PPO), we can plug these holes.

This paper serves as a reminder that safety is not a binary switch. It is a continuous landscape across the entire depth of the neural network, and our alignment techniques must evolve to cover that entire territory.