Introduction

In the era of AI-Generated Content (AIGC), the volume of visual media being created and shared online is exploding. From social media feeds to generative art platforms, the flow of images is endless. But with this creativity comes a significant risk: the proliferation of harmful content, ranging from graphic violence to explicit material.

For years, the standard solution has been human moderation. Platforms hire thousands of people to look at disturbing images and label them as “safe” or “unsafe” based on a rulebook (a safety constitution). This approach has two massive problems: it is expensive and slow to scale, and it takes a heavy psychological toll on the human annotators.

So, why not just use AI? Specifically, why not use Multimodal Large Language Models (MLLMs) like GPT-4V or LLaVA? After all, they can “see” images and “read” rules.

It turns out, it’s not that simple. If you take a standard MLLM and ask it, “Does this image violate safety policy X?”, the results are often unreliable. As established in a recent paper titled “MLLM-as-a-Judge for Image Safety without Human Labeling”, simply querying pre-trained models fails due to three main challenges:

  1. Subjectivity: Safety rules are often vague (e.g., “inappropriate content”).
  2. Complexity: Models struggle to reason through lengthy, multi-clause legalistic rules.
  3. Bias: Models have inherent biases—sometimes seeing things that aren’t there because of background context or language priors.

Examples showing the challenges for simply querying pre-trained MLLMs.

Figure 1: The three main challenges of naive MLLM safety judgments. (a) Subjective rules confuse models. (b) Lengthy rules cause reasoning failures. (c) Biases lead to false positives (e.g., seeing blood on the ground and assuming a throat is slit).

In this post, we will deconstruct a new method called CLUE (Constitutional MLLM JUdgmEnt). This framework allows MLLMs to act as accurate safety judges in a zero-shot setting—meaning it requires no human-labeled training data.

Let’s dive into how researchers turned a confused chatbot into a precise safety inspector.


Background: The Problem with Naive Prompting

To understand CLUE, we first need to understand why the “easy way” doesn’t work.

In a perfect world, you would feed an image and a safety rule into an MLLM, and it would output “Safe” or “Unsafe.” However, MLLMs are trained on vast amounts of internet data, which gives them strong “priors.”

For example, if you ask a model, “Is this person naked?” and show an image of a person with a bare upper body (but clothed lower body), the model might hallucinate nudity because its training data often correlates bare skin with full nudity. Furthermore, if a safety rule is complex—“Do not show injuries that indicate imminent death”—the model often latches onto the word “injury” and ignores the “imminent death” condition.

Existing solutions usually involve fine-tuning models on thousands of human-labeled images (like the Q16 dataset). But if your safety policy changes (e.g., a new policy on “AI-generated deepfakes”), you have to re-label data and re-train the model. CLUE solves this by adhering to a Safety Constitution without needing retraining.


The CLUE Method

The researchers propose a pipeline designed to mimic how a careful human inspector would work: understand the specific rule, check if it applies, break it down into a checklist, and verify each item objectively.

The CLUE framework consists of five distinct stages, walked through below.

1. Rule Objectification

The first step is to fix the human inputs. Safety guidelines are notoriously subjective. A rule like “Images should not include sexual content” is a nightmare for an AI because “sexual content” is a broad, fuzzy concept.

The authors use an “LLM-as-an-Optimizer” approach. They feed the original safety constitution into an LLM and ask it to rewrite the rules to be objective.

  • Original: “Should not depict sexual images.”
  • Objectified: “Genitalia, anus, or pubic area of human should not be visible via this image.”

This shift is crucial. It turns a qualitative judgment (“is this sexual?”) into a visual object detection task (“is this body part visible?”).
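
To make this concrete, here is a minimal sketch of how such an objectification pass could be scripted. It assumes an OpenAI-style chat client, and the prompt wording is purely illustrative, not the paper's exact optimizer prompt.

```python
# A hedged sketch of rule objectification ("LLM-as-an-Optimizer").
# Assumes an OpenAI-style client; the prompt below is illustrative,
# not the exact optimizer prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

OBJECTIFY_PROMPT = (
    "Rewrite the following image-safety rule so it refers only to objectively "
    "checkable visual facts (specific objects, body parts, or actions that are "
    "visible or not visible). Avoid subjective words such as 'inappropriate' "
    "or 'sexual'. Return only the rewritten rule.\n\nRule: {rule}"
)

def objectify_rule(rule: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to turn a subjective safety rule into an objective one."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": OBJECTIFY_PROMPT.format(rule=rule)}],
    )
    return response.choices[0].message.content.strip()

print(objectify_rule("Should not depict sexual images."))
# Target style of output: "Genitalia, anus, or pubic area of a human
# should not be visible in this image."
```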

Comparison of results between objectified and original rules.

Table 5: The impact of objectification is massive. Accuracy jumps from 74% to 98% simply by making the rule specific.

2. Relevance Scanning (The Efficiency Filter)

A safety constitution might have dozens or hundreds of rules. Running a massive MLLM (like GPT-4 or InternVL2-76B) to check every single rule against every image is computationally wasteful.

To solve this, CLUE uses a Relevance Scanning module. They use a lighter, faster model—CLIP—to calculate the cosine similarity between the image and the text of the rule.

If the similarity score is below a certain threshold, the system assumes the rule is irrelevant to the image and skips it.
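
A rough sketch of this filter using CLIP from Hugging Face transformers is below; the checkpoint and the 0.2 threshold are illustrative choices, since the paper tunes this cutoff to balance skipped rules against missed violations.

```python
# Sketch of the relevance filter: CLIP cosine similarity between the image
# and each rule. The checkpoint and the 0.2 threshold are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def relevant_rules(image: Image.Image, rules: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only the rules whose text embedding is close enough to the image embedding."""
    inputs = clip_processor(text=rules, images=image, return_tensors="pt",
                            padding=True, truncation=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sims = (text_emb @ image_emb.T).squeeze(-1)   # one cosine similarity per rule
    return [rule for rule, sim in zip(rules, sims.tolist()) if sim >= threshold]
```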

Performance of the Relevance Scanning module.

Figure 7: The efficiency gain is significant. As shown in the graph, the system can filter out 67% of the rules (green line) while only losing about 3.4% of actual violations (blue line).

3. Precondition Extraction & Decomposition

This is perhaps the most clever part of the framework. Even with objective rules, MLLMs struggle with complex logic.

Take this rule: “Should not have any depiction of people or animals whose bodies have suffered visible, bloody injuries that seem to cause their imminent death.”

A standard MLLM sees “bloody injury” and screams “Unsafe!” even if the injury is minor. To fix this, CLUE breaks the rule into a Precondition Chain.

The authors use an LLM to decompose the rule into a logical flow chart:

  1. Is a person or animal visible?
  2. Is there a visible, bloody injury?
  3. Does that injury indicate imminent death?

The rule is only violated if all conditions in the chain are “Yes.”
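
In code, the chain logic itself is trivially simple, which is the point: one hard judgment becomes a sequence of easy ones combined with AND. The sketch below assumes a `check_precondition` callable that stands in for the debiased judgment step described later.

```python
# The chain logic: every precondition must hold for the rule to be violated;
# any "no" short-circuits to "safe". `check_precondition` stands in for the
# debiased token-probability judgment described in the next section.
from typing import Callable

def rule_violated(preconditions: list[str],
                  check_precondition: Callable[[str], bool]) -> bool:
    for question in preconditions:
        if not check_precondition(question):
            return False      # one failed check anywhere => image is safe for this rule
    return True               # all preconditions satisfied => rule is violated

imminent_death_chain = [
    "Is a person or animal visible in the image?",
    "Does that person or animal have a visible, bloody injury?",
    "Does the injury appear severe enough to cause imminent death?",
]
```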

Flowchart of preconditions extracted from a rule.

Figure 2: Instead of one big question, the model answers a series of simpler checks. If any check fails (Green arrows), the image is Safe.

To visualize the extraction process, the authors provide the prompts used to generate these chains automatically using an LLM:

Detailed process for precondition extraction.

Figure 10: The system automatically converts policy text into logical JSON structures.
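
The paper's exact schema isn't reproduced here, but the extracted structure is easy to picture. The field names in this sketch are hypothetical stand-ins, not the paper's actual format.

```python
# An illustrative guess at the kind of structure the extraction prompt emits;
# the field names are hypothetical, not the paper's actual schema.
import json

precondition_chain = {
    "rule": ("Should not have any depiction of people or animals whose bodies "
             "have suffered visible, bloody injuries that seem to cause their "
             "imminent death."),
    "preconditions": [
        "A person or animal is visible in the image.",
        "The person or animal has a visible, bloody injury.",
        "The injury appears severe enough to cause imminent death.",
    ],
    "logic": "AND",  # violated only if every precondition holds
}

print(json.dumps(precondition_chain, indent=2))
```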

Why Preconditions Matter

The difference in performance between asking about the full rule and checking individual preconditions is stark. In the examples below, you can see how even advanced models like GPT-4o get confused by the full rule but answer correctly when asked about the specific condition (imminent death).

Comparisons of GPT-4o performance.

Figure 13: In (a), GPT-4o incorrectly flags an image as violating the “imminent death” rule. In (b), when asked specifically about the precondition “Does it cause imminent death?”, it correctly answers “No”.

4. Debiased Token Probability Judgment

Now the system needs to answer those precondition questions (e.g., “Is there a bloody injury?”). Instead of asking the MLLM to generate a text response (which can be verbose or hallucinated), CLUE looks at the probability of the “Yes” token.
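
With an open-weight MLLM, this probability can be read straight off the next-token logits. The sketch below is model-agnostic: the `model`/`processor` pair, the prompt template, and the normalization into a single score are assumptions rather than the paper's exact formula.

```python
# Sketch of the token-probability readout: instead of letting the model write
# a paragraph, look at the next-token logits and compare "Yes" vs. "No".
# `model` and `processor` are placeholders for whichever open MLLM is used
# (LLaVA, InternVL, Qwen-VL, ...).
import torch

def yes_probability(model, processor, question: str, image=None) -> float:
    prompt = f"{question} Answer with Yes or No. Answer:"
    if image is not None:
        inputs = processor(text=prompt, images=image, return_tensors="pt")
    else:
        inputs = processor(text=prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[:, -1, :]
    probs = torch.softmax(next_token_logits, dim=-1).squeeze(0)
    # Depending on the tokenizer you may need the leading-space variant of "Yes"/"No".
    yes_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_id = processor.tokenizer.convert_tokens_to_ids("No")
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    return p_yes / (p_yes + p_no + 1e-9)   # renormalize over the two answer tokens
```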

However, simply checking if Prob(Yes) > Prob(No) is dangerous due to bias.

Bias Type A: Language Priors

Sometimes, an MLLM will say “No” just because the sentence structure makes “No” the most likely next word in its training data, regardless of the image.

  • Fix: CLUE calculates the score with the image and without the image. If the “Yes” score doesn’t increase significantly when the image is added, the model is likely just hallucinating based on text.
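
Reusing the `yes_probability` helper from the sketch above, the language-prior check might look like the following. Whether the paper compares a difference or a ratio, and the 0.2 margin, are assumptions here.

```python
# Language-prior check: a "Yes" only counts if the image itself pushes the
# score up. The margin and the use of a plain difference are assumptions.
def passes_language_prior_check(model, processor, question, image,
                                min_gain: float = 0.2) -> bool:
    score_with_image = yes_probability(model, processor, question, image=image)
    score_text_only = yes_probability(model, processor, question, image=None)
    return (score_with_image - score_text_only) >= min_gain
```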

Bias Type B: Non-Centric Content

If a rule asks about a specific object (e.g., “Is the throat slit?”), the model might get distracted by red pixels elsewhere in the image (like blood on the ground) and answer “Yes.”

  • Fix: CLUE uses an open-vocabulary object detector (OWLv2) to find the object mentioned in the rule (e.g., “throat”). It then crops the image to that specific region or removes the center object to see how the score changes.
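
A sketch of the detection-and-crop step, using the OWLv2 checkpoints on the Hugging Face hub, is below; the detection threshold and the choice to keep only the top-scoring box are simplifications of the paper's procedure.

```python
# Sketch of the detection-and-crop step with OWLv2. The detection threshold
# and keeping only the top box are simplifications.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

owl_processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
owl_model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

def crop_to_object(image: Image.Image, object_name: str, det_threshold: float = 0.2):
    """Crop to the highest-scoring detection of `object_name`, or return None."""
    inputs = owl_processor(text=[[object_name]], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = owl_model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
    result = owl_processor.post_process_object_detection(
        outputs=outputs, target_sizes=target_sizes, threshold=det_threshold)[0]
    if len(result["boxes"]) == 0:
        return None
    best = result["scores"].argmax()
    x0, y0, x1, y1 = (int(v) for v in result["boxes"][best].tolist())
    return image.crop((x0, y0, x1, y1))
```

The debias check then compares `yes_probability` on the full image against the crop (or the image with that region removed): if only the full image scores high, the model is probably reacting to something the rule never mentioned.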

Diagram of the debiasing approach.

Figure 3 & 4: (Left) The basic token scoring formula. (Right) The debiasing strategy. By comparing the score of the full image vs. the image with the central object removed, the system can determine if the model is actually looking at the right thing.

Does this cropping strategy actually work? The data suggests yes.

Distribution of score differences.

Figure 8: When the score difference between the whole image and the cropped version is high, it strongly correlates with the precondition being satisfied.

5. Cascaded Reasoning (The “Thinking” Step)

Finally, what happens if the token probabilities are borderline (not a strong “Yes” or “No”)?

CLUE switches to a slower but more thoughtful mode: Chain-of-Thought (CoT) Reasoning. It asks the model to “think step by step” to analyze the image. This is computationally expensive, so it is only used as a fallback when the fast token-check is uncertain.
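
The cascade itself is just a thresholding rule around the fast score. In this sketch, `cot_judgment` is a placeholder for the full "think step by step" MLLM call, and the band edges are illustrative, not the paper's values.

```python
# The cascade: trust the fast token score when it is decisive, and fall back
# to chain-of-thought reasoning only in the uncertain middle band.
from typing import Callable

def judge_precondition(score: float, cot_judgment: Callable[[], bool],
                       low: float = 0.3, high: float = 0.7) -> bool:
    if score >= high:
        return True               # confident "yes" from token probabilities alone
    if score <= low:
        return False              # confident "no"
    return cot_judgment()         # borderline: spend compute on step-by-step reasoning
```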

Process of cascaded reasoning-based judgment.

Figure 5: If the fast check (Step 1) is inconclusive, the system triggers the reasoning module (Step 2).


Experiments and Results

To test this, the researchers had to build a new dataset called OS Bench (Objective Safety Bench), containing around 1,400 images. Crucially, they included “borderline safe” images—images that look dangerous but don’t technically violate the rules (e.g., ketchup looking like blood, or a person bending over but fully clothed).

Beating the Baselines

The results are impressive. CLUE significantly outperforms standard zero-shot prompting techniques.

Table showing comparison to zero-shot baselines.

Table 2: CLUE (bottom rows) achieves significantly higher F-1 scores across different models (Qwen, InternVL, LLaVA) compared to standard prompting or simple Chain-of-Thought.

Outperforming Fine-Tuned Models

Perhaps the most surprising result is how CLUE compares to models that were fine-tuned specifically for safety (like LLaVA Guard or Q16). Because those models were trained on specific, often subjective datasets, they struggle to generalize to the strict, objective rules of the OS Bench.

Table comparison to fine-tuning based methods.

Table 3: CLUE achieves an F-1 score of 0.871 with LLaVA-v1.6-34B, while the fine-tuned LLaVA Guard only reaches 0.401 on this benchmark. This highlights the flexibility of the zero-shot constitutional approach.

Granular Performance

The method is also highly effective across different types of safety violations, from “Imminent Death” to specific nudity constraints.

Detailed binary classification performance.

Table 4: High precision and recall across various specific safety rules using InternVL2-76B.


Conclusion and Implications

The “MLLM-as-a-Judge” paper presents a compelling argument: we don’t necessarily need more labeled data to solve AI safety; we need better inference architectures.

By breaking down the problem—objectifying rules, decomposing logic, and debiasing model outputs—CLUE transforms general-purpose MLLMs into specialized safety inspectors.

Key Takeaways for Students:

  1. Prompting isn’t enough: For high-stakes tasks, you can’t just ask an LLM a question. You need to structure its reasoning process.
  2. Logic Chains: Decomposing complex problems into simple “Yes/No” preconditions is a powerful technique for controlling LLM behavior.
  3. Bias Correction: Always assume your model has priors. Comparing outputs against a “blank” input or cropped inputs provides a mathematical way to measure confidence.
  4. Zero-Shot Power: With the right architecture, zero-shot methods can outperform fine-tuned methods, especially when the task rules change frequently.

This research paves the way for automated content moderation that is scalable, adaptable, and keeps humans out of the loop for the most disturbing content. As MLLMs continue to get smarter, frameworks like CLUE will likely become the standard for keeping the internet—and AI generation—safe.