Imagine you are using a state-of-the-art AI to analyze medical X-rays or navigate an autonomous vehicle. You ask the model a question about an image, and it gives you an immediate, confident answer. But here is the critical problem: How do you know if the model is actually right, or if it’s just confidently hallucinating?
Vision-Language Models (VLMs) have made tremendous strides in understanding the world, but they are far from perfect. They suffer from overconfidence—they often sound just as certain when they are wrong as when they are right. For students and researchers entering the field of multimodal AI, this reliability gap is one of the most significant hurdles to deploying these models in the real world.
In this post, we will deep-dive into a fascinating research paper titled “Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison.” The researchers propose a novel framework called DeCC (Decompose and Compare Consistency) that acts like a lie detector for VLMs, using a mix of “cross-examination” techniques and independent reasoning to flag unreliable answers.
The Problem: Hallucinations and Hollow Confidence
Before we look at the solution, we need to understand why existing methods for measuring reliability often fail.
Traditionally, if you wanted to know if a model was confident, you might look at:
- Answer Likelihoods (Uncertainty): Checking the mathematical probability (logits) the model assigns to its tokens.
- Prompted Confidence: Literally asking the model, “Are you sure? Give me a confidence score from 0 to 100.”
- Self-Consistency: Asking the model the same question multiple times (perhaps paraphrased) and seeing if the answers agree.
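To make these baselines concrete, here is a minimal Python sketch of the three signals. This is my illustration, not the paper's code: the `vlm` object and its `answer()` method (including the `return_logprobs` and `temperature` options) are hypothetical placeholders for whatever inference API you actually use.

```python
# Sketch of three common reliability baselines, assuming a hypothetical `vlm` wrapper
# whose `answer(image, prompt, ...)` returns a string (and optionally token log-probs).
import math
from collections import Counter

def likelihood_confidence(vlm, image, question):
    """Answer-likelihood baseline: length-normalized token probability of the answer."""
    answer, logprobs = vlm.answer(image, question, return_logprobs=True)
    avg_logprob = sum(logprobs) / max(len(logprobs), 1)
    return answer, math.exp(avg_logprob)  # closer to 1.0 = more "confident"

def prompted_confidence(vlm, image, question):
    """Prompted-confidence baseline: ask the model to rate its own certainty."""
    answer = vlm.answer(image, question)
    score = vlm.answer(image, f"Question: {question}\nYour answer: {answer}\n"
                              "How confident are you, from 0 to 100? Reply with a number only.")
    return answer, float(score.strip()) / 100.0  # assumes the model replies with a bare number

def self_consistency(vlm, image, question, n_samples=5):
    """Self-consistency baseline: sample repeatedly and measure agreement."""
    answers = [vlm.answer(image, question, temperature=0.7) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples
```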
The issue? VLMs are poorly calibrated. They are trained to predict the next word, not to be self-aware. They tend to be overconfident, assigning high probabilities to wrong answers. Furthermore, standard self-consistency checks often suffer from confirmation bias. Once a VLM “decides” on an interpretation of an image, it tends to double down on that interpretation, even if questioned repeatedly.
The Solution: Decompose and Compare (DeCC)
The core insight of the DeCC framework is that a reliable answer should be supported by the details. If a VLM says, “That is a baseball pitcher,” it should also know that the person is holding a ball, standing on a mound, and facing a batter.
If the model gets the “big picture” answer right but fails on the fundamental details, the “big picture” answer was likely a lucky guess or a hallucination.
DeCC automates this logic using a two-step process: Task Decomposition and Consistency Comparison.
Step 1: Task Decomposition
Instead of just accepting the VLM’s direct answer, DeCC breaks the original question down into simpler sub-questions. This is done by a “Decomposer” (another VLM).
For example, if the question is “Which person is everyone here staring at now?”, the Decomposer might ask:
- “Is the batter looking at someone?”
- “Is the catcher looking at the pitcher?”
- “Is the umpire looking at the pitcher?”
The candidate VLM then answers these sub-questions. This creates a list of “Sub-Question/Answer Pairs” (Sub-QA pairs).
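Here is a minimal sketch of what this decomposition step can look like in code. The `decomposer_vlm` and `candidate_vlm` objects and their `ask(image, prompt)` method are hypothetical placeholders, not the paper's implementation.

```python
# Step 1 sketch: decompose the question, then have the candidate VLM answer the pieces.

def decompose(decomposer_vlm, image, question, n_sub=3):
    """Ask the Decomposer VLM to break the question into simpler sub-questions."""
    prompt = (f"Break the question '{question}' into {n_sub} simpler sub-questions "
              "about the image, one per line.")
    reply = decomposer_vlm.ask(image, prompt)
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def answer_subquestions(candidate_vlm, image, sub_questions):
    """Have the candidate VLM answer each sub-question, yielding the Sub-QA pairs."""
    return [(sq, candidate_vlm.ask(image, sq)) for sq in sub_questions]
```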
Step 2: Independent Reasoning (The “Agents”)
This is where the paper introduces a clever twist. To check if the Direct Answer makes sense, the system tries to reconstruct the answer only using the Sub-QA pairs. It uses two different “agents” to do this:
- The VLM Agent: The original model looks at the sub-answers and tries to reason out the final answer again.
- The LLM Agent: A separate Large Language Model (which cannot see the image) looks only at the text of the questions and sub-answers to derive a conclusion.
Why use a blind LLM? Because it forces objectivity. The VLM might be biased by visual features in the image that confuse it. The LLM, however, acts as a pure logic check. If the sub-answers (text) logically lead to Answer A, but the VLM’s direct visual impression was Answer B, there is a contradiction.
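Continuing the sketch, the same Sub-QA pairs feed both agents, but only the VLM agent receives the image. Again, `ask()` is a hypothetical interface rather than the paper's code.

```python
# Step 2 sketch: two independent reasoners reconstruct the answer from the Sub-QA pairs.

def format_subqa(sub_qa_pairs):
    """Render the Sub-QA pairs as plain text so the image-blind LLM can use them."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in sub_qa_pairs)

def vlm_reasoned_answer(vlm, image, question, sub_qa_pairs):
    """VLM agent: re-derives the final answer from the image plus the sub-answers."""
    prompt = (f"{format_subqa(sub_qa_pairs)}\n\n"
              f"Using the image and the facts above, answer: {question}")
    return vlm.ask(image, prompt)

def llm_reasoned_answer(llm, question, sub_qa_pairs):
    """LLM agent: derives the final answer from the text alone, never seeing the image."""
    prompt = (f"{format_subqa(sub_qa_pairs)}\n\n"
              f"Based only on the facts above, answer: {question}")
    return llm.ask(prompt)
```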

As shown in Figure 1 of the paper, the process flows from the original question into a branching path. The top path is the Direct Answer. The bottom path is the Decomposition and Reasoning phase. Finally, the system compares the results.
The Core Method: Determining Reliability
How does DeCC mathematically decide if an answer is reliable? It compares the Direct Answer (\(A\)) with the Reasoned Answer (\(A'\)) generated by the agents.
If the reasoned answer matches the direct answer, the system assigns a Reliability score (\(\mathcal{R}\)) of 1 (Reliable). If they disagree, the score is 0 (Unreliable).
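In symbols, the rule as described above is just an indicator function (written here in the post's notation, not necessarily the paper's exact typesetting):

\[
\mathcal{R} =
\begin{cases}
1 & \text{if } A' = A \text{ (reliable),}\\
0 & \text{if } A' \neq A \text{ (unreliable).}
\end{cases}
\]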

The Multi-Agent Strategy
The researchers found that relying on just the VLM or just the LLM has limitations. The VLM can be biased by the image (confirmation bias), while the LLM might lack context. To solve this, they propose a Multi-Agent setting.
In this advanced setting, the system checks consistency with both agents.
- If both agents agree that the answer is consistent (or inconsistent), the verdict is clear.
- But what if they disagree? (e.g., The VLM says “Consistent,” but the LLM says “Inconsistent”?)
When agents disagree, DeCC triggers a Second Iteration. It feeds the previous sub-questions back into the system to generate new, additional sub-questions to clarify the confusion.

As illustrated in Figure 2, if the first round yields a conflict (“Contradiction”), the system loops into a second reasoning process. The logic for resolving this conflict is sophisticated:

Let’s break down this conflict-resolution rule, which the paper formalizes as Equation 3 (a code sketch of the same logic follows the list):
- Case 1 (Agreement in Round 2): If, after the second round of questions, the VLM and LLM finally agree on the consistency, we trust that consensus.
- Case 2 (LLM Trust): If both agents are “stubborn” (their second round results match their first round results), we trust the LLM. Why? Because the LLM relies solely on the textual logic of the decomposition. It is less likely to be swayed by the “visual biases” that might be trapping the VLM.
- Case 3 (VLM Change): If the results change in the second round (indicating new information shifted the reasoning), we trust the VLM. A change in the VLM’s response suggests it successfully overcame its initial bias thanks to the new sub-questions.
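Here is a minimal Python sketch of that resolution rule. The variable names, the `second_round` callable, and the exact-string-match comparison are my simplifications for illustration, not the paper's code.

```python
def resolve_conflict(vlm_r1, llm_r1, vlm_r2, llm_r2):
    """Resolve a round-1 disagreement using round-2 verdicts. Each argument is a
    boolean: 'does this agent's reasoned answer match the direct answer?'"""
    if vlm_r2 == llm_r2:                        # Case 1: consensus in round 2
        return vlm_r2
    if vlm_r2 == vlm_r1 and llm_r2 == llm_r1:   # Case 2: both agents are "stubborn"
        return llm_r2                           #   -> trust the image-blind LLM
    return vlm_r2                               # Case 3: verdicts shifted -> trust the VLM

def decc_reliability(direct, vlm_r1_answer, llm_r1_answer, second_round):
    """Return 1 (reliable) or 0 (unreliable). `second_round` is a callable that runs
    the extra decomposition and returns (vlm_r2_answer, llm_r2_answer); it is only
    invoked when the two agents disagree in round 1."""
    # Exact string match is a simplification; answer matching may be softer in practice.
    vlm_r1 = (vlm_r1_answer == direct)
    llm_r1 = (llm_r1_answer == direct)
    if vlm_r1 == llm_r1:                        # agents already agree in round 1
        return int(vlm_r1)
    vlm_r2_answer, llm_r2_answer = second_round()
    return int(resolve_conflict(vlm_r1, llm_r1,
                                vlm_r2_answer == direct,
                                llm_r2_answer == direct))
```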
Case Studies: DeCC in Action
To truly understand how this works, let’s look at three examples provided by the researchers.
The Good: A Consistent Result
In this example from the A-OKVQA dataset, the model is asked about a classroom scene. The Direct Answer and the Reasoned Answers from both agents align perfectly.

Because the VLM’s internal logic and the LLM’s external logic both align with the initial answer, DeCC marks this as Reliable.
The “Mixed” Signal: Catching a Hallucination
This is where DeCC shines. The question asks about birds in the water.
- Direct Answer: “Goose.”
- VLM Reasoned Answer: “Goose” (The VLM is reinforcing its own error).
- LLM Reasoned Answer: “Duck” (Based on the text descriptions provided in sub-answers).

Here, the VLM suffers from confirmation bias—it repeats “Goose” even though the sub-details might suggest otherwise. The “blind” LLM sees the sub-answers (likely describing duck-like features) and concludes “Duck.” This disagreement flags the answer as Unreliable.
The Bad: Total Confusion
Sometimes, the model is just completely lost. In this VCR (Visual Commonsense Reasoning) example, the VLM provides a direct answer, but when decomposed, the sub-answers are chaotic.

The VLM’s reasoned answer and the LLM’s reasoned answer both disagree with the Direct Answer. This is a strong signal that the model does not understand the scene at all. DeCC correctly identifies this as Unreliable.
Experiments and Results
The researchers tested DeCC across six diverse vision-language benchmarks, including SNLI-VE (Visual Entailment), VCR (Commonsense Reasoning), and MathVista (Mathematical reasoning). They evaluated three popular models: LLaVA, Idefics2, and InternVL.
How do we measure success?
They used two primary metrics:
- Brier Score (BS): A proper scoring rule that measures the accuracy of probabilistic predictions. Lower is better.
- Effective Reliability (ER): A metric designed specifically for scenarios where a model can “abstain” (refuse to answer) if it’s unsure. It penalizes incorrect high-confidence answers heavily.

The Effective Reliability metric rewards the system (+1) when it is confident and right, penalizes it (-1) when it is confident and wrong, and gives a neutral score (0) when the system correctly identifies an unreliable answer and chooses to abstain.
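Written out (a reconstruction from the descriptions above, so the paper's exact notation may differ), with \(r_i \in [0, 1]\) the predicted reliability and \(y_i \in \{0, 1\}\) whether the direct answer was actually correct, over \(N\) samples:

\[
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}\bigl(r_i - y_i\bigr)^2,
\qquad
\Phi_i =
\begin{cases}
+1 & \text{if the system answers and is correct,}\\
-1 & \text{if the system answers and is wrong,}\\
\;\;0 & \text{if it abstains,}
\end{cases}
\]

with Effective Reliability reported as the average of \(\Phi_i\) over the dataset.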
The Results
The performance of DeCC compared to existing methods (like Perplexity and standard Self-Consistency) is summarized in Table 1.

Key Takeaways from the Data:
- DeCC Wins: DeCC achieves the best or second-best results across almost all datasets and models in Table 1.
- Significant Gains: For the LLaVA model, DeCC improved the Effective Reliability by 16.5% over the best baseline. For Idefics2, the gain was 25.6%.
- Model Capability Matters: The researchers noticed an interesting trend. For “weaker” VLMs (like LLaVA), the LLM Agent Consistency method worked best. This is because weaker VLMs are bad reasoners, so offloading the logic check to an external LLM helps. For “stronger” VLMs (like InternVL), the multi-agent or self-consistency methods worked well because the VLM was capable enough to reason about its own sub-answers.
Computational Cost
One critique of decomposition methods is that they are slow (you have to run the model many times). The researchers analyzed this in Table 6.

While DeCC is indeed slower than a simple perplexity check (taking about 5-7 seconds per sample versus <1 second), it is comparable to other rigorous consistency checks. Crucially, DeCC requires no training, making it a “plug-and-play” solution for evaluating new models without the need for expensive labeled datasets.
Conclusion and Implications
The “Decompose and Compare Consistency” framework represents a significant step forward in making AI systems trustworthy. By forcing models to “show their work” via sub-questions and using independent agents to verify that work, we can filter out hallucinations more effectively than ever before.
Why does this matter? As we move toward Agentic AI—systems that take actions in the real world—reliability is non-negotiable. If a robot assistant is “pretty sure” it sees a glass of water, but DeCC reveals that the sub-details (shape, transparency) don’t match, the robot can pause and ask for clarification rather than knocking it over.
DeCC demonstrates that the path to reliable AI isn’t just about bigger models; it’s about better metacognition—giving models the architectural ability to reflect, decompose, and verify their own thoughts.
This blog post explains the concepts presented in “Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison” by Yang et al.