Introduction
In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like GPT-4V have dazzled us with their ability to chat about images. You can upload a photo of your fridge, and the model can suggest recipes. However, beneath this conversational fluency lies a persistent weakness: visual grounding, the ability to tie language to specific regions of an image.
When you ask a model to pinpoint the exact location of “the red mug to the left of the blue book,” it often struggles. Instead of truly “seeing” the spatial relationships, many models rely on linguistic probabilities—essentially guessing based on word associations rather than visual evidence. This leads to hallucinations, where the model confidently identifies an object that isn’t there or misinterprets complex instructions.
To solve this, we need rigorous benchmarks that go beyond simple object detection. Enter FineCops-Ref, a new dataset and task designed by researchers to evaluate Fine-Grained Compositional Referring Expression Comprehension (REC). This paper introduces a testing ground that forces models to understand not just what objects are, but how they relate to one another, and crucially, to admit when an object simply doesn’t exist.
The Problem: The Bag-of-Words Trap
Current Vision-Language Models (VLMs) often treat language as a “bag of words.” If you feed them the phrase “a horse on the grass,” they look for a horse and grass. If you switch it to “grass on the horse,” many models will still output the same confidence scores because the keywords haven’t changed. They lack compositional reasoning—the ability to understand how attributes (like color) and relationships (spatial positioning) modify the meaning of a sentence.
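To see this failure mode concretely, here is a minimal sketch (my own illustration, not from the paper) that scores an image against the two word-order variants with an off-the-shelf CLIP model from Hugging Face; the image path is a placeholder. A bag-of-words-style encoder will often return nearly identical scores for both captions.

```python
# Sketch: probe word-order sensitivity with an off-the-shelf CLIP model.
# Assumes `torch`, `transformers`, and `Pillow` are installed; "horse.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse.jpg")  # hypothetical photo of a horse standing on grass
captions = ["a horse on the grass", "grass on the horse"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity

for caption, prob in zip(captions, logits.softmax(dim=-1).squeeze().tolist()):
    print(f"{prob:.3f}  {caption}")
# Near-identical probabilities suggest the encoder treats the caption as a bag of
# words rather than parsing the spatial relation.
```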
Standard benchmarks like RefCOCO have been instrumental in the field, but they are becoming saturated. Models perform well on them, but often for the wrong reasons. These datasets rarely test whether a model can handle negative samples—scenarios where the text describes an object that is not in the image. In real-world applications, a robot asked to “pick up the red hammer” must stop if there is no red hammer, rather than grabbing a blue one.
Introducing FineCops-Ref
To address these gaps, the authors introduced FineCops-Ref. This dataset is distinct in two major ways:
- Controllable Difficulty: It categorizes tasks by the level of reasoning required (from simple identification to multi-hop logic).
- Hard Negatives: It includes both text and images that have been manipulated to test the model’s ability to reject false descriptions.
The Construction Pipeline
Building a dataset that tests fine-grained reasoning requires a sophisticated pipeline. The authors didn’t just scrape captions; they engineered them.

As shown in Figure 1 above, the process begins with a Scene Graph from the GQA dataset. A scene graph is a structured representation of an image, mapping out objects (Television), attributes (large, black), and relationships (on table).
- Path Generation: The system traces paths through the scene graph to create logical chains (e.g., Television -> on -> Table -> right of -> Table); see the sketch after this list.
- Expression Generation: These paths are converted into template sentences and then rewritten by an LLM (like GPT-3.5) to sound natural.
- Negative Generation: This is where the pipeline innovates. The system generates Negative Text (changing “television” to “radio”) and Negative Images (using diffusion models to visually edit the TV into a radio), creating a perfect “trap” for the model.
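To make the path-tracing and templating steps above concrete, here is a toy sketch (my own illustration, not the authors' released code): the scene graph is a plain dictionary, a relation chain is walked from the target object, and a simple template renders it as text. The objects, relations, and template wording are invented for illustration; in the actual pipeline, an LLM rewrites the template output.

```python
# Toy sketch of path generation + template expression from a scene graph.
# The graph, relation chain, and template are illustrative stand-ins for the
# GQA-style scene graphs and LLM rewriting described in the paper.

scene_graph = {
    "objects": {
        "o1": {"name": "television", "attributes": ["large", "black"]},
        "o2": {"name": "table", "attributes": ["wooden"]},
        "o3": {"name": "chair", "attributes": []},
    },
    "relations": [
        ("o1", "on", "o2"),               # television on table
        ("o2", "to the right of", "o3"),  # table to the right of chair
    ],
}

def trace_path(graph, start, hops):
    """Follow outgoing relations from `start` for `hops` steps, returning
    an alternating [object, relation, object, ...] chain."""
    path, current = [start], start
    for _ in range(hops):
        nxt = next(((rel, tgt) for src, rel, tgt in graph["relations"] if src == current), None)
        if nxt is None:
            break
        rel, tgt = nxt
        path += [rel, tgt]
        current = tgt
    return path

def to_expression(graph, path):
    """Render the chain as a template sentence (an LLM would later rewrite it)."""
    parts = []
    for i, item in enumerate(path):
        if i % 2 == 0:  # object slot: attributes + name
            obj = graph["objects"][item]
            parts.append(" ".join(obj["attributes"] + [obj["name"]]))
        else:           # relation slot
            parts.append(item)
    return "the " + " ".join(parts)

path = trace_path(scene_graph, "o1", hops=2)
print(to_expression(scene_graph, path))
# -> "the large black television on wooden table to the right of chair"
```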
Understanding Difficulty Levels
One of the paper’s key contributions is the categorization of “difficulty” not by sentence length, but by the reasoning required to find the target; a rough heuristic capturing this idea is sketched after the list below.

As illustrated in Figure 3, the dataset is split into three levels:
- Level 1 (Easy): The target object is unique in its category. For example, “The girl holding the blue phone” (Figure 3a). If there is only one girl, the model doesn’t actually need to understand “holding blue phone” to get the answer right. It just needs to find “girl.”
- Level 2 (Medium): There are distractors. In Figure 3b (“Above the sofa… there is a sitting girl”), there might be other people or other girls. The model must process the spatial relationship “above the sofa” to distinguish the target.
- Level 3 (Hard): This requires multi-hop reasoning. In Figure 3c, identifying “The girl situated to the right of the dog” requires first finding the dog, verifying the dog has a “blue collar,” and then finding the girl relative to that specific dog.
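As promised, here is a rough illustration of how difficulty can be derived from the scene graph rather than from sentence length. The heuristic below is my own simplification, not the paper's exact procedure: it keys off how many same-category distractors the image contains and how many relation hops the expression chains through.

```python
# Rough heuristic for assigning a difficulty level to a referring expression,
# based on the ideas above; the paper's exact rules may differ.

def difficulty_level(target_category: str, image_objects: list[str], hops: int) -> int:
    """
    target_category: category name of the referred object (e.g. "girl")
    image_objects:   categories of all objects present in the image
    hops:            number of relations the expression chains through
    """
    distractors = image_objects.count(target_category) - 1
    if distractors <= 0:
        return 1  # target is unique in its category: finding the noun suffices
    if hops <= 1:
        return 2  # distractors exist; one relation/attribute disambiguates
    return 3      # distractors exist and multi-hop reasoning is required

# Example: two girls in the image; expression chains girl -> right of -> dog -> with -> blue collar
print(difficulty_level("girl", ["girl", "girl", "dog", "sofa"], hops=2))  # -> 3
```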
The Challenge of Negatives
Most current models are “optimistic”—they assume the user’s query is valid. FineCops-Ref challenges this by introducing Negative Samples.
Negative Text
The authors used LLMs to subtly alter the text descriptions, employing two main strategies (a minimal text-perturbation sketch follows the list):
- Replace: Swapping a noun (e.g., “cat” becomes “dog”) or an attribute (“white” becomes “black”).
- Swap: Interchanging attributes between two objects in the sentence (e.g., “the red cup on the blue table” becomes “the blue cup on the red table”). This specifically targets the “bag of words” weakness.
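Here is the minimal text-perturbation sketch promised above. It uses hand-rolled string edits rather than an LLM, so treat it as an illustration of the Replace and Swap ideas, not the authors' pipeline.

```python
# Illustrative "replace" and "swap" perturbations for building negative text.
# The paper uses LLMs for this; simple string edits are shown here for clarity.
import re

def replace_token(expression: str, old: str, new: str) -> str:
    """Replace a noun or attribute, e.g. 'cat' -> 'dog' or 'white' -> 'black'."""
    return re.sub(rf"\b{re.escape(old)}\b", new, expression)

def swap_attributes(expression: str, attr_a: str, attr_b: str) -> str:
    """Exchange two attributes so the keywords stay the same but the meaning flips."""
    placeholder = "\0"
    out = re.sub(rf"\b{re.escape(attr_a)}\b", placeholder, expression)
    out = re.sub(rf"\b{re.escape(attr_b)}\b", attr_a, out)
    return out.replace(placeholder, attr_b)

print(replace_token("the white cat on the mat", "cat", "dog"))
# -> "the white dog on the mat"
print(swap_attributes("the red cup on the blue table", "red", "blue"))
# -> "the blue cup on the red table"
```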
Negative Images
Text manipulation is useful, but visually editing the image is an even stricter test of visual parsing. The authors used inpainting models (like PowerPaint) guided by text prompts to alter specific parts of the image while keeping the rest consistent; a generic inpainting sketch appears after the figure walkthrough below.

Figure 5 demonstrates this visual editing:
- (b) Replace Attribute: The skier’s bag is changed from yellow to pink. If the text asks for a yellow bag, the model should find nothing.
- (d) Swap Attribute: The yellow train becomes brown, and the platform becomes yellow. A model that merely matches the keywords “yellow” and “train” may still fire a positive detection here, failing the test.
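Below is a hedged sketch of how such negative images could be produced with a generic Stable Diffusion inpainting pipeline from the `diffusers` library. The paper uses PowerPaint plus additional quality filtering, so this is only an approximation of the idea; the model ID, file paths, and prompt are placeholders.

```python
# Sketch: produce a "negative image" by inpainting a region with a new attribute.
# Generic diffusers inpainting is shown; the paper's pipeline (PowerPaint + filtering) differs.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("skier.jpg").convert("RGB")    # placeholder: original image
mask = Image.open("bag_mask.png").convert("RGB")  # placeholder: white where the bag is

edited = pipe(
    prompt="a pink backpack",  # the attribute we want after editing (yellow -> pink)
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]

edited.save("skier_negative.jpg")
# An expression asking for "the yellow bag" should now ground to nothing.
```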
Experimental Results
The researchers evaluated a wide range of models, including Specialist models (like MDETR and GroundingDINO, which are trained specifically for detection) and MLLMs (like Shikra, Ferret, and CogVLM, which are general-purpose multimodal models).
Performance on Positive Samples
The first test was standard: Can the model find the object when it is there?

Table 3 reveals a fascinating trade-off:
- Specialists Rule Level 1: Models like MM-GDINO-T perform exceptionally well on Level 1 (Simple detection). Since Level 1 is essentially object detection, these specialized architectures shine.
- MLLMs Shine at Reasoning: As difficulty increases to Level 2 and Level 3, the performance gap narrows or flips. MLLMs like CogVLM demonstrate stronger compositional reasoning capabilities, handling the complex language logic better than the smaller specialist models.
- The Drop-off: Regardless of the model architecture, performance drops significantly as difficulty increases. Most models struggle to achieve 50% precision on Level 3 tasks, highlighting that multi-hop reasoning is still an open problem.
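For readers unfamiliar with how REC precision is typically scored, here is a small sketch using the common convention that a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5; the benchmark's exact thresholds and top-k handling may differ.

```python
# Standard REC-style scoring: a predicted box is correct if IoU > 0.5 with the
# ground truth. (The benchmark's exact protocol may use additional thresholds.)

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_at_50(predictions, ground_truths):
    """Fraction of expressions whose top predicted box overlaps the target with IoU > 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(10, 10, 60, 60), (100, 100, 150, 150)]
gts   = [(12, 8, 58, 62), (0, 0, 40, 40)]
print(precision_at_50(preds, gts))  # -> 0.5
```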
The Failure on Negatives
The results become much starker when analyzing negative samples. Here, the metric changes: since there is no ground-truth bounding box to find, the benchmark measures Recall@1, which checks whether the model assigns lower confidence to the negative samples than to their positive counterparts (a toy version of this scoring is sketched below).

Essentially, we want the model to say, “I am not confident I see this object.”
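The toy version of this scoring (an illustrative simplification, not the benchmark's released evaluation code) looks like this: for each matched positive/negative pair, count the pair as correct when the model is more confident on the positive sample.

```python
# Illustrative pairwise scoring for negatives: the model should be more confident
# on the positive sample than on its matched negative. Not the benchmark's exact code.

def negative_rejection_rate(positive_scores, negative_scores):
    """Fraction of matched pairs where the positive sample outranks the negative."""
    pairs = list(zip(positive_scores, negative_scores))
    return sum(pos > neg for pos, neg in pairs) / len(pairs)

# Hypothetical confidence scores from a grounding model:
pos = [0.92, 0.81, 0.40, 0.77]  # scores on the true expressions/images
neg = [0.30, 0.85, 0.35, 0.60]  # scores on their manipulated counterparts
print(negative_rejection_rate(pos, neg))  # -> 0.75
```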

Table 4 shows the results for negative text expressions:
- Global Failure: Performance is weak across the board. Even on Level 1 (where the object name is just replaced, e.g., looking for a “radio” when there is only a “TV”), models struggle to reject the premise.
- Reasoning Gap: Models perform worse on “Relation” and “Attribute” swaps than on simple Object replacements. They can tell a cat isn’t a dog, but they struggle to tell that a “red cup” isn’t a “blue cup.”
Correlation: Accuracy vs. Rejection
Is a model that is good at finding objects also good at knowing when objects aren’t there? The authors analyzed the correlation between Precision (finding positive samples) and Recall (rejecting negative samples).

Figure 2 shows a positive correlation. Generally, models that are more accurate on standard tasks are also more robust against hallucinations.
- Specialist Models (Left): Show a very high correlation with easy negatives (Level 1). Their ability to detect specific objects translates well to rejecting missing objects.
- MLLMs (Right): Show a stronger correlation with harder negatives (Level 2). This suggests that as MLLMs get better at complex reasoning, they also get better at identifying subtle textual contradictions.
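For completeness, the kind of correlation reported in Figure 2 boils down to a Pearson coefficient over per-model scores; the sketch below uses made-up numbers purely to show the computation.

```python
# Pearson correlation between per-model precision (positives) and recall (negatives).
# The numbers below are placeholders, not figures from the paper.
import numpy as np

precision = np.array([0.62, 0.48, 0.71, 0.55, 0.39])  # one value per model
recall    = np.array([0.58, 0.41, 0.66, 0.52, 0.35])

r = np.corrcoef(precision, recall)[0, 1]
print(f"Pearson r = {r:.2f}")  # a value near 1.0 indicates a strong positive correlation
```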
Conclusion and Implications
The FineCops-Ref paper exposes a critical vulnerability in modern AI: while models are getting better at “chatting,” they are not necessarily getting better at seeing.
The authors demonstrated that while MLLMs possess superior reasoning abilities for complex queries, they (and specialist models) suffer from a significant “grounding gap” when faced with negative samples or fine-grained attributes. By releasing this dataset, which includes both the reasoning-heavy positive samples and the hallucination-inducing negative samples, the researchers have provided a roadmap for the next generation of models.
To build truly reliable AI agents—robots that can navigate warehouses or assistants that analyze medical images—we need models that don’t just guess based on language patterns. We need models that look, reason, and have the confidence to say, “I don’t see that here.”