Introduction
The rise of Large Language Models (LLMs) has been nothing short of revolutionary. But as we move from text-only models to Multi-modal Large Language Models (MLLMs)—systems that can see and process images alongside text—we encounter new, fascinating modes of failure. We generally assume that giving an AI more information (like an image) will help it reason better. After all, a picture is worth a thousand words, right?
But what if the picture is lying?
Recent research titled “The Instinctive Bias: Spurious Images lead to Illusion in MLLMs” reveals a critical vulnerability in state-of-the-art models like GPT-4V and LLaVA. The researchers discovered that when these models are presented with an image that is relevant to a topic but contradicts the correct textual answer, the models suffer from a “visual illusion.” They ignore their own reasoning capabilities and blindly trust the visual input.
This phenomenon, termed Instinctive Bias, suggests that current multimodal AI behaves a bit like a human using “System 1” thinking (fast, instinctive) rather than “System 2” (slow, logical). In this post, we will dive deep into this research, exploring how the authors quantified this bias, the benchmark they created, and what this implies for the future of AI robustness.
Background: The Promise and Peril of MLLMs
To understand why this bias occurs, we first need to look at how MLLMs operate. Typically, models like LLaVA or GPT-4V process visual inputs by encoding images into “visual tokens” that are projected into the same embedding space as language tokens. This allows the model to “read” an image just as it reads a sentence.
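To make this concrete, here is a minimal sketch of a LLaVA-style connector, assuming a frozen CLIP-like vision encoder and a simple learned projection. The class name, dimensions, and shapes below are illustrative, not the actual LLaVA implementation:

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Illustrative connector: maps vision-encoder patch features into the
    LLM's token-embedding space, turning image patches into 'visual tokens'."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A learned projection from the vision feature dimension to the
        # language model's embedding dimension (a linear layer here; real
        # models may use a small MLP instead).
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. the output
        # of a frozen CLIP ViT image encoder.
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# The projected visual tokens are concatenated with the text-token embeddings,
# and the combined sequence is fed to the language model.
visual_tokens = VisualTokenProjector()(torch.randn(1, 576, 1024))
text_embeddings = torch.randn(1, 32, 4096)  # placeholder text embeddings
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
```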
In standard tasks like Visual Question Answering (VQA), this is a superpower. You show the model a picture of a dog and ask, “What animal is this?” The answer is contained directly within the image.
However, real-world reasoning is often more complex. A user might attach an image for context that doesn’t contain the direct answer, or, worse, one that actively misleads.
Consider a scenario where you describe a specific city in Australia known for the Great Barrier Reef. You ask the model to name the city. But, alongside your question, you accidentally upload a photo of the Eiffel Tower. A robust model should read your text, identify the city as Cairns, and ignore the irrelevant photo of Paris.
As we will see, that is not what happens.
The Phenomenon: Instinctive Bias
The core contribution of this paper is the identification of Instinctive Bias. This is the tendency of MLLMs to hallucinate an incorrect answer because they are overly influenced by “spurious” images—images that are related to the question concept but correspond to a wrong answer.

As shown in Figure 1 above, when LLaVA is asked to identify an Australian city described in the text (Cairns), it answers correctly in a text-only setting. However, when an image of the Eiffel Tower (a spurious image) accompanies the text, the model abandons the textual clues and incorrectly answers “Eiffel in Paris.”
The model isn’t just making a random guess; it is suffering from an illusion induced by the visual input. The image triggers an “instinctive” response that overrides the logical reasoning derived from the text.
Methodology: Building CorrelationQA
To scientifically measure this bias, the researchers couldn’t rely on existing datasets, which usually pair questions with correct, relevant images. They needed a dataset designed to trick the models. They introduced CorrelationQA, a benchmark containing over 7,000 text-image pairs across 13 distinct categories (such as Animals, History, and Technology).
The construction of CorrelationQA involved a clever, three-stage automated pipeline:

Step 1: Text Generation
The authors utilized GPT-4 to generate complex Question-Answer (QA) pairs. For every question, GPT-4 provided:
- The Correct Answer.
- Five Spurious Answers (incorrect answers that are confusing or related).
For example, for a question describing a “Zebra,” the spurious answers might include “Tiger” or “Giraffe.”
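Here is a minimal sketch of what one such record might look like, along with the kind of prompt one could send to GPT-4 to produce it. The field names and prompt wording are my own assumptions, not the authors’ exact pipeline:

```python
from dataclasses import dataclass

@dataclass
class CorrelationQAItem:
    """Hypothetical record layout for one benchmark entry."""
    category: str
    question: str
    correct_answer: str
    spurious_answers: list[str]  # five confusing but incorrect answers

# The kind of item Step 1 produces:
item = CorrelationQAItem(
    category="Animals",
    question=("Which African animal is famous for its black-and-white "
              "striped coat and grazes in herds on the savanna?"),
    correct_answer="Zebra",
    spurious_answers=["Tiger", "Giraffe", "Horse", "Okapi", "Antelope"],
)

# A prompt in this spirit could be sent to GPT-4 to generate such items:
GENERATION_PROMPT = (
    "Write a question in the category '{category}' whose answer is "
    "'{answer}'. Then list five plausible but incorrect answers that could "
    "easily be confused with the correct one."
)
```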
Step 2: Image Generation
This is where the dataset gets its “trap.” The researchers needed images that corresponded to the wrong answers. They used Stable Diffusion, a state-of-the-art image generation model, to create:
- Spurious Natural Images: Realistic synthetic images of the wrong answers (e.g., a generated image of a Tiger for the Zebra question).
- Factual Images: Images of the correct answer (used as a control group).
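A hedged sketch of this step using the Hugging Face diffusers library; the checkpoint and prompt template are illustrative and may differ from what the authors used:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_answer_image(answer: str):
    """Render a synthetic image depicting a given answer string."""
    return pipe(f"a photo of a {answer}").images[0]

# Factual image (control) vs. spurious image (the trap) for the Zebra question:
generate_answer_image("Zebra").save("zebra_factual.png")
generate_answer_image("Tiger").save("tiger_spurious.png")
```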
Step 3: Typography and Realistic Images
To ensure the bias wasn’t limited to synthetic art, they also collected:
- Realistic Images: Sourced from search engines.
- Typography (OCR) Images: Images containing just the written text of the answer (e.g., an image with the word “Tiger” printed on it).
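Generating a typography stimulus is simple enough to sketch with Pillow; the font, canvas size, and layout below are arbitrary choices, not the paper’s exact recipe:

```python
from PIL import Image, ImageDraw, ImageFont

def make_typography_image(answer: str, size=(512, 512)) -> Image.Image:
    """Create an OCR-style stimulus: a plain canvas containing only the
    answer text."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real pipeline would pick a large font
    # Roughly center the word on the canvas.
    left, top, right, bottom = draw.textbbox((0, 0), answer, font=font)
    x = (size[0] - (right - left)) / 2
    y = (size[1] - (bottom - top)) / 2
    draw.text((x, y), answer, fill="black", font=font)
    return img

make_typography_image("Tiger").save("tiger_typography.png")
```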

Figure 9 gives you a glimpse of what the models face. The text asks for a specific answer, but the image screams a different, incorrect one.
Evaluation Metrics
To quantify the impact of these misleading images, the paper introduces a specific metric called Accuracy Drop (AccDrop).
First, they calculate the standard Accuracy (\(Acc\)) as the number of correct answers divided by the total number of pairs. The drop is then the difference between the two settings:

\[
AccDrop = A_f - A_s
\]

Here, \(A_f\) is the accuracy when the model sees the Factual (correct) image, and \(A_s\) is the accuracy when the model sees the Spurious (misleading) image.
- High AccDrop: The model is easily tricked. It performs well when the image helps, but fails miserably when the image lies.
- Low AccDrop: The model is robust; it ignores the misleading image and relies on the text.
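Computing both quantities is straightforward. A minimal sketch, assuming each model response has already been scored as correct or incorrect:

```python
def accuracy(correct_flags: list[bool]) -> float:
    """Fraction of question-image pairs answered correctly."""
    return sum(correct_flags) / len(correct_flags)

def acc_drop(factual_flags: list[bool], spurious_flags: list[bool]) -> float:
    """AccDrop = A_f - A_s: accuracy lost when the factual image is
    swapped for a spurious one."""
    return accuracy(factual_flags) - accuracy(spurious_flags)

# Toy numbers mirroring the typography result discussed below (0.90 vs 0.15):
factual = [True] * 90 + [False] * 10
spurious = [True] * 15 + [False] * 85
print(acc_drop(factual, spurious))  # 0.75
```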
Experiments & Results
The researchers tested 9 mainstream MLLMs, including industry heavyweights like GPT-4V, LLaVA-13B, and Qwen-VL. The results were consistent and concerning: every model tested suffers from the Instinctive Bias.
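Conceptually, the evaluation boils down to asking each model the same question under different image conditions and scoring the answers. A rough sketch of such a harness, where `query_model` is a placeholder for whichever MLLM API is being tested and `item` follows the hypothetical record layout sketched earlier:

```python
def evaluate_item(query_model, item, factual_image, spurious_image):
    """Ask the same question under three image conditions and record whether
    the correct answer appears in the response. `query_model(question, image)`
    stands in for a real MLLM call."""
    conditions = {
        "text_only": None,
        "factual": factual_image,
        "spurious": spurious_image,
    }
    results = {}
    for name, image in conditions.items():
        answer = query_model(item.question, image)
        results[name] = item.correct_answer.lower() in answer.lower()
    return results
```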
1. The Sensitivity Gap
The difference in performance between factual and spurious inputs is stark.

In Figure 4, look at the gap between the green bars (Factual) and the orange bars (Spurious).
- GPT-4V, arguably the most advanced model tested, saw its accuracy drop from 0.89 (Factual) to 0.57 (Spurious) on natural images.
- Qwen-VL dropped from 0.65 to 0.36.
This shows that even the most powerful models are not immune. When the visual signal conflicts with the text, the visual signal often wins, leading to the “Instinctive Bias.”
2. The Typography Trap
Interestingly, the models were even more susceptible to images containing text (Typography). As seen in the right-hand chart of Figure 4 above, the AccDrop for Typography is significantly higher.
For example, Qwen-VL had an accuracy of 0.90 on factual typography but plummeted to 0.15 on spurious typography. This creates a massive AccDrop of 0.75. This suggests that MLLMs are incredibly sensitive to reading text within images, perhaps trusting “OCR” data even more than visual object features.
3. Text vs. Image: Adding Information Reduces Accuracy
One of the most damning findings is that adding a spurious image makes the model perform worse than if it had seen no image at all.

Figure 5 illustrates this perfectly.
- Text-only (Blue/Leftmost bars): Models perform reasonably well. GPT-4V is nearly perfect.
- Spurious (Orange bars): Performance crashes.
This indicates that the “visual illusion” isn’t just a distraction; it actively corrupts the reasoning process that the model successfully performs in a text-only context.
Analysis by Category and Qualitative Examples
Not all concepts are equally confusing. The study broke down performance by category (Animals, Food, History, etc.) and found that tangible categories suffer more from bias than abstract ones.
Models struggled most with Animals, Plants, and Colors. These are categories with distinct, concrete visual features. Conversely, categories like History or Art (which might require identifying a concept like “Renaissance” or a specific historical event) showed lower Accuracy Drops.
Why? The authors hypothesize that tangible themes have prominent content that the visual extraction modules of MLLMs pick up easily. If a model sees a “Cat,” it registers “Cat” strongly. If it sees a generic historical scene, the visual signal might be weaker, forcing the model to rely more on the text.
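Reproducing this breakdown amounts to grouping per-item results by category and computing AccDrop within each group. A small sketch, assuming the results live in a pandas DataFrame with hypothetical `category`, `factual_correct`, and `spurious_correct` columns:

```python
import pandas as pd

# Hypothetical per-item results; in practice, one row per benchmark entry.
results = pd.DataFrame({
    "category": ["Animals", "Animals", "History", "History"],
    "factual_correct": [True, True, True, False],
    "spurious_correct": [False, False, True, False],
})

per_category = results.groupby("category").agg(
    acc_factual=("factual_correct", "mean"),
    acc_spurious=("spurious_correct", "mean"),
)
per_category["acc_drop"] = per_category["acc_factual"] - per_category["acc_spurious"]
print(per_category.sort_values("acc_drop", ascending=False))
```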
Qualitative Failures
Let’s look at some specific examples of where the models failed.

In the bottom row of Figure 6, we see clear instances of Instinctive Bias:
- The Bluebell/Tulip Case: The text describes a “flowering plant with delicate bell-shaped blooms.” The image shows a Tulip. The model answers “Tulip,” completely ignoring the “bell-shaped” textual clue.
- The Giraffe/Ostrich Case: The text describes the “tallest living terrestrial animal… spotted coat.” The image shows an Ostrich. The model answers “Ostrich,” ignoring the biological impossibility of an ostrich being the tallest terrestrial animal with a spotted coat.
These errors highlight a disconnect between “seeing” and “reasoning.” The model perceives the object correctly (it identifies the Ostrich) but fails to integrate that perception with the logical constraints provided in the text.
Conclusion and Implications
The paper “The Instinctive Bias: Spurious Images lead to Illusion in MLLMs” serves as a wake-up call for the multimodal AI community. It demonstrates that as we give models eyes, we also give them the ability to be deceived by visual illusions.
The key takeaways are:
- Universality: This is not a bug in one specific model; it is a widespread behavior across all current MLLMs, including GPT-4V.
- Visual Dominance: MLLMs tend to prioritize visual information over textual logical constraints, leading to “Instinctive Bias.”
- Vulnerability to Text-in-Image: Models are exceptionally prone to being misled by text embedded within images (typography).
This research implies that current training strategies, which focus heavily on aligning images with relevant text, might be inadvertently training models to over-trust visual data. Future work must focus on “slow reasoning”—teaching models to critically evaluate the relevance of an image before letting it drive the answer. Until then, we should be careful about trusting an AI’s answer when the input image might be misleading.
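One simple direction in that spirit is to prompt the model to judge image relevance before it answers. The two-step prompt below is purely a hypothetical illustration of the idea, not a method proposed or evaluated in the paper:

```python
# Hypothetical "slow reasoning" prompt: make the model judge image relevance
# before it is allowed to answer. Purely illustrative.
RELEVANCE_CHECK_PROMPT = (
    "Step 1: Briefly describe what the attached image shows.\n"
    "Step 2: Decide whether the image actually contains the answer to the "
    "question below, or is merely related to the topic. Answer YES or NO.\n"
    "Step 3: If NO, answer the question using the text alone and ignore "
    "the image.\n\n"
    "Question: {question}"
)
```

Whether this kind of prompting actually closes the gap is exactly the sort of question the paper leaves open for future work.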