Imagine asking an AI to describe a photo of your living room. The model confidently replies, “There is a red vintage motorcycle parked next to the coffee table.” You look at the photo again. There is no motorcycle. There is just a red potted plant.
This phenomenon is known as hallucination. It is one of the most persistent and dangerous problems facing Large Visual Language Models (LVLMs) today. While these models have become incredibly good at chatting about images, they have a bad habit of making things up—fabricating objects, misidentifying colors, or describing relationships that simply don’t exist.
For students and researchers diving into multimodal AI, solving hallucination is the “holy grail” of trustworthiness. If we can’t trust an AI to tell us what is actually in a picture, we certainly can’t trust it to drive a car or analyze a medical X-ray.
In this post, we are going to do a deep dive into a fascinating research paper titled “Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification.” The researchers propose a structured, logic-driven framework that acts as a rigorous fact-checker for vision models. By the end of this article, you will understand how Pelican breaks down complex sentences, writes its own Python code to verify facts, and significantly reduces the rate at which AI lies about images.
Let’s unpack how we can turn a hallucinating AI into a reliable observer.
The Problem: Why Do Vision Models Hallucinate?
Before we fix the problem, we need to understand it. LVLMs (like LLaVA, mPlug-OWL, or InstructBlip) work by projecting visual features (from an image) into the same “embedding space” as text. This allows a Large Language Model (LLM) to “see” the image tokens just like it sees words.
However, this integration isn’t perfect. Hallucinations often stem from a few key issues:
- Data Bias: The model might have seen thousands of pictures of “living rooms” in its training data that did have televisions. So, when it sees a living room now, it might guess there is a TV even if there isn’t one.
- Yes-Bias: Models are trained to be helpful. If you ask, “Is the red car next to the tree?”, the model is statistically biased to say “Yes” to please the user, disregarding the visual evidence.
- Weak Grounding: The model might know what a dog looks like, but it might struggle to pinpoint exactly where the dog is or differentiate between “the dog on the left” and “the dog on the right.”
Existing solutions often involve training on massive datasets or using Reinforcement Learning (RLHF). While these help, they are expensive and don’t solve the fundamental disconnect between the claim (“There is a cat”) and the evidence (the pixels).
Enter Pelican: A Framework for Verification
The researchers introduce Pelican, a framework designed to function as a post-hoc “editor.” Instead of retraining the whole model, Pelican takes the output of an LVLM and runs it through a rigorous verification pipeline.
Think of the standard LVLM as a creative writer who sometimes gets carried away. Pelican is the strict fact-checker sitting at the next desk, demanding proof for every adjective and noun.
The 4-Step Architecture
Pelican operates in four distinct stages. To give you a high-level roadmap, take a look at the architecture overview below.

As shown in Figure 1, the process starts when the LVLM produces an answer (a) to a question (q). Pelican combines these into a single “Claim” (C). For example, if the question is “Who is riding the motorcycle?” and the answer is “A woman is riding the motorcycle,” the claim becomes: “The person riding the motorcycle in the image is a woman.”
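The paper does not reproduce its exact prompt for this step, but the idea can be sketched in a few lines. Here, build_claim_prompt is a hypothetical helper and the prompt wording is an assumption, not the authors’ template:

```python
def build_claim_prompt(question: str, answer: str) -> str:
    """Ask an LLM to merge a (question, answer) pair into one declarative claim.

    The wording below is illustrative; the paper's actual prompt is not shown here.
    """
    return (
        "Rewrite the question and answer below as a single declarative claim "
        "about the image.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Claim:"
    )

prompt = build_claim_prompt(
    "Who is riding the motorcycle?",
    "A woman is riding the motorcycle.",
)
# Sent to an LLM, this should yield something like:
# "The person riding the motorcycle in the image is a woman."
```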
Let’s walk through the four steps used to verify this claim.
Step 1: The Visual Table
The first innovation of Pelican is that it doesn’t trust the LVLM to just “look” at the image again. Instead, it relies on specialized, “narrow” AI tools—specifically object detectors like YOLO and Grounding-DINO.
In the Visual Table step, Pelican scans the image to identify tangible objects mentioned in the claim. It constructs a structured data table (specifically, a Pandas dataframe in Python) that lists:
- The object detected (e.g., “Motorcycle”).
- The bounding box coordinates (where it is).
- The confidence score.
Why is this important? Language models are probabilistic; they predict the next word. A Pandas dataframe is deterministic; it contains hard data. By converting the image content into a structured table, Pelican creates a reliable reference point that isn’t subject to the whims of language generation.
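To make this concrete, here is a minimal sketch of what such a Visual Table might look like as a Pandas dataframe. The column names, coordinates, and confidence values are illustrative assumptions, not the paper’s exact schema:

```python
import pandas as pd

# A hypothetical Visual Table for the living-room photo: one row per detected
# object instance, with its bounding box and detector confidence.
visual_table = pd.DataFrame(
    [
        {"object": "potted plant", "bbox": [402, 88, 520, 260], "confidence": 0.91},
        {"object": "coffee table", "bbox": [60, 300, 640, 470], "confidence": 0.84},
        {"object": "couch",        "bbox": [10, 150, 390, 460], "confidence": 0.88},
    ]
)

# Deterministic queries replace asking the LVLM to "look again":
motorcycles = visual_table[visual_table["object"] == "motorcycle"]
print(len(motorcycles))  # 0 -- no motorcycle in the table, so none to describe
```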
Step 2: Claim Decomposition
A complex sentence like “The red car is parked behind the tall tree” is hard to verify all at once. It contains multiple assertions:
- There is a car.
- The car is red.
- There is a tree.
- The tree is tall.
- The car is behind the tree.
Pelican uses an LLM to break the main claim down into Sub-Claims. These are atomic units of logic based on “First-Order Predicates.”
The researchers define specific predicates such as:
- Exists(object)
- Color(object)
- Position(object1, object2)
- Count(object)
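As a rough sketch, the decomposition of the example claim above could be written as a list of predicate calls. Plain Python tuples stand in for the paper’s first-order predicates here, and the Attribute predicate for “tall” is an assumed name that the paper’s listed examples do not cover:

```python
# Sub-claims for "The red car is parked behind the tall tree", written as
# (predicate, arguments) tuples. "Attribute" is an assumed predicate name;
# the others mirror the predicates listed above.
sub_claims = [
    ("Exists",    ("car",)),
    ("Color",     ("car", "red")),
    ("Exists",    ("tree",)),
    ("Attribute", ("tree", "tall")),
    ("Position",  ("car", "tree", "behind")),
]

for predicate, args in sub_claims:
    print(f"{predicate}({', '.join(args)})")
```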
This decomposition transforms a sentence into a chain of logic. Crucially, this chain can be visualized as a Computational Graph.

Figure 2 illustrates this concept perfectly. Notice how the logic flows. We start with Exists(dog, Yes).
- If the dog exists, we might branch out to check its position: Position(dog, left).
- Only after confirming the specific dog on the left do we check its color.
This graph structure ensures that the verification is efficient. If the first node (Exists) returns False, we don’t need to waste time checking the color. The claim is already debunked.
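The efficiency argument can be sketched in a few lines: evaluate the sub-claims in dependency order and stop at the first failure. The checker functions below are stubs standing in for real detector or VQA calls, not Pelican’s actual code:

```python
# Evaluate sub-claims in dependency order; a single False debunks the claim
# and makes every downstream check unnecessary.
def verify_chain(checks):
    for name, check in checks:
        result = check()
        print(f"{name}: {result}")
        if not result:
            return False  # early exit: the claim is already falsified
    return True

checks = [
    ("Exists(dog)",         lambda: True),   # stub: a detector found a dog
    ("Position(dog, left)", lambda: True),   # stub: its box is in the left half
    ("Color(dog, brown)",   lambda: False),  # stub: the crop says otherwise
]

print(verify_chain(checks))  # False -- the full claim fails at the color node
```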
Step 3: Program of Thought (PoT) Verification
This is arguably the most innovative part of the Pelican framework.
In traditional approaches, if you wanted to verify a sub-claim like “Is the car red?”, you might just ask another AI model, “Is the car red?” But this leads to the same hallucination risks we started with.
Pelican uses Program of Thought (PoT). Instead of asking for a text answer, Pelican prompts an LLM to write Python code to answer the question.
Why Python?
- Tools: Python can natively call external tools. Pelican provides functions like iou() (Intersection over Union, to check if two objects overlap) or grounded_vqa() (to ask specific questions about a specific image crop).
- Logic: Python handles logic (if, else, for loops) perfectly. Text generators struggle with complex logic.
- Data Manipulation: Remember the Visual Table from Step 1? Python interacts with that Pandas dataframe seamlessly.
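Here is a sketch of the kind of program a PoT prompt might produce for a spatial sub-claim. The iou() helper is written out because its definition is standard; the grounded_vqa() call is left as a comment since it would need a real image and model behind it, and both signatures are assumptions rather than the paper’s exact API:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Boxes pulled from the Visual Table (illustrative coordinates).
car_box = [100, 200, 300, 400]
tree_box = [180, 50, 320, 420]

if iou(car_box, tree_box) > 0.1:
    # The objects overlap enough for "behind" to be plausible; hand the crop to
    # a focused VQA tool for the final call, e.g.:
    # answer = grounded_vqa(image, region=car_box, question="Is the car behind the tree?")
    print("spatial relation plausible; ask grounded_vqa")
else:
    print("boxes do not overlap; the claim cannot hold")
```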
The Power of Intermediate Variables (\(v\))
A major issue in previous works was grounding. If an image has three cars, and the claim is “The car on the left is red,” a standard model might get confused about which car to look at.
Pelican introduces Intermediate Variables. In the decomposition phase, it assigns specific variables to specific instances.
- It might define $car_left as the object detected at coordinates [0, 0, 50, 50].
- The Python code then runs get_color($car_left).
This precise referencing ensures that the verification process remains “locked on” to the correct object throughout the chain of reasoning. The Python code explicitly filters the Visual Table to find the specific row corresponding to the object in question.
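A minimal sketch of that filtering step, assuming the dataframe layout from Step 1 and a “leftmost means smallest x1” heuristic (both are illustrative choices, not the paper’s exact code):

```python
import pandas as pd

# Three cars in the scene; the claim is "The car on the left is red".
visual_table = pd.DataFrame(
    [
        {"object": "car", "bbox": [0,   0,  50, 50], "confidence": 0.95},
        {"object": "car", "bbox": [200, 10, 260, 70], "confidence": 0.92},
        {"object": "car", "bbox": [400, 20, 470, 90], "confidence": 0.90},
    ]
)

cars = visual_table[visual_table["object"] == "car"].copy()
cars["x1"] = cars["bbox"].apply(lambda b: b[0])

# Bind the intermediate variable: $car_left is the instance whose box starts
# furthest to the left. Every later sub-claim reuses exactly this row.
car_left = cars.sort_values("x1").iloc[0]
print(car_left["bbox"])  # [0, 0, 50, 50]

# A downstream check such as get_color(car_left) would crop this box and
# query a color classifier or VQA model on that crop alone.
```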
Step 4: Integrated Verification Synthesis
Once the Python code executes, it returns “evidence.”
- Question: Is there a person? -> Code Output: False (Count = 0).
- Question: Is there a motorcycle? -> Code Output: True.
Finally, Pelican feeds the original claim, the decomposition logic, and the code execution results (evidence) back into an LLM. This step is the “Judge.”
The LLM reviews the evidence. If the code says “Count = 0” for “person,” but the claim was “A woman is riding the bike,” the LLM spots the contradiction. It then outputs a decision: Incorrect.
Critically, Pelican also generates a Rewrite. It corrects the hallucination, transforming “A woman is riding the motorcycle” into “There is a motorcycle, but no person is riding it.”
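A minimal sketch of how that evidence might be packaged for the judge. The prompt wording and the evidence format are assumptions; the point is that the LLM sees the claim next to hard, executed results:

```python
# Package the claim and the executed evidence for the "judge" LLM.
claim = "A woman is riding the motorcycle."
evidence = {
    "Exists(motorcycle)": True,
    "Exists(person)": False,  # Count(person) == 0 in the Visual Table
}

judge_prompt = (
    f"Claim: {claim}\n"
    "Evidence from code execution:\n"
    + "\n".join(f"- {k}: {v}" for k, v in evidence.items())
    + "\nDecide whether the claim is Correct or Incorrect, then rewrite it so "
      "that it matches the evidence."
)

print(judge_prompt)
# A judge LLM should flag the contradiction and answer along the lines of:
# "Incorrect. Rewrite: There is a motorcycle, but no person is riding it."
```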
Experimental Results: Does it Work?
The theory sounds solid, but how does Pelican perform in practice? The researchers tested the framework against several state-of-the-art LVLMs using tough benchmarks like MMHal-Bench, GAVIE, and MME.
Quantitative Success
Let’s look at the numbers.

Table 1 presents a comprehensive comparison. Here is what you should notice:
- Hal-Rate (Hallucination Rate): Lower is better. Look at the column for MMHal-Bench. When Pelican is applied (rows with the checkmark \(\checkmark\)), the hallucination rate drops significantly. For example, InstructBlip went from a rate of 0.74 down to 0.51. That is a massive reduction in errors.
- MME Score: This measures perception and cognition. Higher is better. Pelican consistently boosts the total score. For mPlug-OWL, the score jumped from 471 to 611.
The researchers also compared Pelican against other dedicated “hallucination correction” tools, specifically Woodpecker and Volcano.

As seen in Table 2, Pelican outperforms the competition. On MMHal-Bench, Pelican achieved a score of 3.04 compared to Volcano’s 2.44 and Woodpecker’s 1.73. The Hallucination Rate for Pelican was 0.38, significantly lower than Woodpecker’s 0.66.
This indicates that Pelican’s method of combining a Visual Table with Python-based logic is more robust than simply asking another LLM to “double-check” the work (which is roughly what Woodpecker does).
Qualitative Analysis
Numbers are great, but seeing is believing. Let’s look at actual examples of hallucinations that Pelican fixed.

Figure 3 shows three distinct cases:
- Left Image (The Chair): The model likely hallucinated details about the chair’s style or color.
- Middle Image (The Desk): This scene is cluttered. Models often hallucinate extra laptops or get the count wrong. Pelican’s “Visual Table” is particularly good here because it relies on object detectors that count instances as rows in a dataframe, rather than guessing via text.
- Right Image (The Couple): Models might misinterpret the interaction (e.g., claiming they are holding glasses).
In all these cases, the baseline models (LLaVA v1.5 and v1.6) failed. Pelican successfully identified the mismatch between the claim and the visual evidence, producing a corrected response.
Why Does This Matter?
The significance of Pelican extends beyond just getting a higher score on a leaderboard. It represents a shift in how we think about “neuro-symbolic” AI.
Current Deep Learning is “neural”—it relies on vast networks of probabilities. It’s creative, intuitive, but imprecise. “Symbolic” AI (like logic and code) is precise, rigid, but verifiable.
Pelican is a hybrid. It uses the neural network (LVLM) to understand the messy world of natural language and images, but it uses symbolic logic (Python code, Predicates, Dataframes) to verify the facts.
Key Takeaways
- Decomposition is Key: You cannot verify a complex paragraph all at once. You must break it down into atomic “True/False” statements (Predicates).
- Code > Text for Verification: Asking an AI to “think” in Python code yields more accurate results than asking it to “think” in sentences, because code enforces logical consistency.
- Intermediate Variables Solve Grounding: To fix hallucination, you must know exactly which object you are talking about. Pelican’s use of variables (\(v\)) to track specific object instances prevents the model from conflating different objects.
- Tools over Intuition: Sometimes, you just need a ruler. By offloading tasks like “counting” or “calculating IoU” to deterministic Python functions, Pelican avoids the fuzzy math of neural networks.
Conclusion
Hallucination remains a formidable barrier to the widespread adoption of Vision-Language Models. However, frameworks like Pelican show us that the solution might not just be “bigger models.” The solution might be better architecture.
By forcing the model to show its work—breaking claims down, creating a data table, and writing code to prove its assertions—Pelican introduces a layer of accountability that is missing in standard end-to-end models. It creates a system that doesn’t just guess the truth, but actively hunts for it using logic and evidence.
As we move forward, we can expect to see more of these “System 2” thinking approaches (slow, logical, deliberate) integrated into the fast, intuitive “System 1” world of Large Language Models. Pelican is a shining example of how effective that combination can be.