Introduction
In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like GPT-4V have dazzled us with their ability to chat about images. You can upload a photo of your fridge, and the model can suggest recipes. However, beneath this conversational fluency lies a persistent weakness: visual grounding, the ability to tie language to specific regions of an image.
When you ask a model to pinpoint the exact location of “the red mug to the left of the blue book,” it often struggles. Instead of truly “seeing” the spatial relationships, many models rely on linguistic probabilities—essentially guessing based on word associations rather than visual evidence. This leads to hallucinations, where the model confidently identifies an object that isn’t there or misinterprets complex instructions.
To solve this, we need rigorous benchmarks that go beyond simple object detection. Enter FineCops-Ref, a new dataset and task designed by researchers to evaluate Fine-Grained Compositional Referring Expression Comprehension (REC). This paper introduces a testing ground that forces models to understand not just what objects are, but how they relate to one another, and crucially, to admit when an object simply doesn’t exist.
The Problem: The Bag-of-Words Trap
Current Vision-Language Models (VLMs) often treat language as a “bag of words.” If you feed them the phrase “a horse on the grass,” they look for a horse and grass. If you switch it to “grass on the horse,” many models will still output the same confidence scores because the keywords haven’t changed. They lack compositional reasoning—the ability to understand how attributes (like color) and relationships (spatial positioning) modify the meaning of a sentence.
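To see this failure mode concretely, here is a minimal sketch (my own illustration, not from the paper) that scores an image against the two word-order variants with an off-the-shelf CLIP model from Hugging Face; the image path is a placeholder. A bag-of-words-style encoder will often return nearly identical scores for both captions.

```python
# Sketch: probe word-order sensitivity with an off-the-shelf CLIP model.
# Assumes `torch`, `transformers`, and `Pillow` are installed; "horse.jpg" is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse.jpg")  # hypothetical photo of a horse standing on grass
captions = ["a horse on the grass", "grass on the horse"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity

for caption, prob in zip(captions, logits.softmax(dim=-1).squeeze().tolist()):
    print(f"{prob:.3f}  {caption}")
# Near-identical probabilities suggest the encoder treats the caption as a bag of
# words rather than parsing the spatial relation.
```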
Standard benchmarks like RefCOCO have been instrumental in the field, but they are becoming saturated. Models perform well on them, but often for the wrong reasons. These datasets rarely test whether a model can handle negative samples—scenarios where the text describes an object that is not in the image. In real-world applications, a robot asked to “pick up the red hammer” must stop if there is no red hammer, rather than grabbing a blue one.
Introducing FineCops-Ref
To address these gaps, the authors introduced FineCops-Ref. This dataset is distinct in two major ways:
- Controllable Difficulty: It categorizes tasks by the level of reasoning required (from simple identification to multi-hop logic).
- Hard Negatives: It includes both text and images that have been manipulated to test the model’s ability to reject false descriptions.
The Construction Pipeline
Building a dataset that tests fine-grained reasoning requires a sophisticated pipeline. The authors didn’t just scrape captions; they engineered them.

As shown in Figure 1 above, the process begins with a Scene Graph from the GQA dataset. A scene graph is a structured representation of an image, mapping out objects (Television), attributes (large, black), and relationships (on table).
- Path Generation: The system traces paths through the scene graph to create logical chains (e.g., Television -> on -> Table -> right of -> Table); see the sketch after this list.
- Expression Generation: These paths are converted into template sentences and then rewritten by an LLM (like GPT-3.5) to sound natural.
- Negative Generation: This is where the pipeline innovates. The system generates Negative Text (changing “television” to “radio”) and Negative Images (using diffusion models to visually edit the TV into a radio), creating a perfect “trap” for the model.
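To make the path-tracing and templating steps above concrete, here is a toy sketch (my own illustration, not the authors' released code): the scene graph is a plain dictionary, a relation chain is walked from the target object, and a simple template renders it as text. The objects, relations, and template wording are invented for illustration; in the actual pipeline, an LLM rewrites the template output.

```python
# Toy sketch of path generation + template expression from a scene graph.
# The graph, relation chain, and template are illustrative stand-ins for the
# GQA-style scene graphs and LLM rewriting described in the paper.

scene_graph = {
    "objects": {
        "o1": {"name": "television", "attributes": ["large", "black"]},
        "o2": {"name": "table", "attributes": ["wooden"]},
        "o3": {"name": "chair", "attributes": []},
    },
    "relations": [
        ("o1", "on", "o2"),               # television on table
        ("o2", "to the right of", "o3"),  # table to the right of chair
    ],
}

def trace_path(graph, start, hops):
    """Follow outgoing relations from `start` for `hops` steps, returning
    an alternating [object, relation, object, ...] chain."""
    path, current = [start], start
    for _ in range(hops):
        nxt = next(((rel, tgt) for src, rel, tgt in graph["relations"] if src == current), None)
        if nxt is None:
            break
        rel, tgt = nxt
        path += [rel, tgt]
        current = tgt
    return path

def to_expression(graph, path):
    """Render the chain as a template sentence (an LLM would later rewrite it)."""
    parts = []
    for i, item in enumerate(path):
        if i % 2 == 0:  # object slot: attributes + name
            obj = graph["objects"][item]
            parts.append(" ".join(obj["attributes"] + [obj["name"]]))
        else:           # relation slot
            parts.append(item)
    return "the " + " ".join(parts)

path = trace_path(scene_graph, "o1", hops=2)
print(to_expression(scene_graph, path))
# -> "the large black television on wooden table to the right of chair"
```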
Understanding Difficulty Levels
One of the paper’s key contributions is the categorization of “difficulty” not by sentence length, but by the reasoning required to find the target; a rough heuristic capturing this idea is sketched after the list below.

As illustrated in Figure 3, the dataset is split into three levels:
- Level 1 (Easy): The target object is unique in its category. For example, “The girl holding the blue phone” (Figure 3a). If there is only one girl, the model doesn’t actually need to understand “holding blue phone” to get the answer right. It just needs to find “girl.”
- Level 2 (Medium): There are distractors. In Figure 3b (“Above the sofa… there is a sitting girl”), there might be other people or other girls. The model must process the spatial relationship “above the sofa” to distinguish the target.
- Level 3 (Hard): This requires multi-hop reasoning. In Figure 3c, identifying “The girl situated to the right of the dog” requires first finding the dog, verifying the dog has a “blue collar,” and then finding the girl relative to that specific dog.
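As promised, here is a rough illustration of how difficulty can be derived from the scene graph rather than from sentence length. The heuristic below is my own simplification, not the paper's exact procedure: it keys off how many same-category distractors the image contains and how many relation hops the expression chains through.

```python
# Rough heuristic for assigning a difficulty level to a referring expression,
# based on the ideas above; the paper's exact rules may differ.

def difficulty_level(target_category: str, image_objects: list[str], hops: int) -> int:
    """
    target_category: category name of the referred object (e.g. "girl")
    image_objects:   categories of all objects present in the image
    hops:            number of relations the expression chains through
    """
    distractors = image_objects.count(target_category) - 1
    if distractors <= 0:
        return 1  # target is unique in its category: finding the noun suffices
    if hops <= 1:
        return 2  # distractors exist; one relation/attribute disambiguates
    return 3      # distractors exist and multi-hop reasoning is required

# Example: two girls in the image; expression chains girl -> right of -> dog -> with -> blue collar
print(difficulty_level("girl", ["girl", "girl", "dog", "sofa"], hops=2))  # -> 3
```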
The Challenge of Negatives
Most current models are “optimistic”—they assume the user’s query is valid. FineCops-Ref challenges this by introducing Negative Samples.
Negative Text
The authors used LLMs to subtly alter the text descriptions, employing two main strategies (a minimal text-perturbation sketch follows the list):
- Replace: Swapping a noun (e.g., “cat” becomes “dog”) or an attribute (“white” becomes “black”).
- Swap: Interchanging attributes between two objects in the sentence (e.g., “the red cup on the blue table” becomes “the blue cup on the red table”). This specifically targets the “bag of words” weakness.
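Here is the minimal text-perturbation sketch promised above. It uses hand-rolled string edits rather than an LLM, so treat it as an illustration of the Replace and Swap ideas, not the authors' pipeline.

```python
# Illustrative "replace" and "swap" perturbations for building negative text.
# The paper uses LLMs for this; simple string edits are shown here for clarity.
import re

def replace_token(expression: str, old: str, new: str) -> str:
    """Replace a noun or attribute, e.g. 'cat' -> 'dog' or 'white' -> 'black'."""
    return re.sub(rf"\b{re.escape(old)}\b", new, expression)

def swap_attributes(expression: str, attr_a: str, attr_b: str) -> str:
    """Exchange two attributes so the keywords stay the same but the meaning flips."""
    placeholder = "\0"
    out = re.sub(rf"\b{re.escape(attr_a)}\b", placeholder, expression)
    out = re.sub(rf"\b{re.escape(attr_b)}\b", attr_a, out)
    return out.replace(placeholder, attr_b)

print(replace_token("the white cat on the mat", "cat", "dog"))
# -> "the white dog on the mat"
print(swap_attributes("the red cup on the blue table", "red", "blue"))
# -> "the blue cup on the red table"
```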
Negative Images
Text manipulation is useful, but visually editing the image is an even stricter test of visual parsing. The authors used inpainting models (like PowerPaint) guided by text prompts to alter specific parts of the image while keeping the rest consistent; a generic inpainting sketch appears after the figure walkthrough below.

Figure 5 demonstrates this visual editing:
- (b) Replace Attribute: The skier’s bag is changed from yellow to pink. If the text asks for a yellow bag, the model should find nothing.
- (d) Swap Attribute: The yellow train becomes brown, and the platform becomes yellow. A model that merely matches the keywords “yellow” and “train” may still fire a positive detection here, failing the test.
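Below is a hedged sketch of how such negative images could be produced with a generic Stable Diffusion inpainting pipeline from the `diffusers` library. The paper uses PowerPaint plus additional quality filtering, so this is only an approximation of the idea; the model ID, file paths, and prompt are placeholders.

```python
# Sketch: produce a "negative image" by inpainting a region with a new attribute.
# Generic diffusers inpainting is shown; the paper's pipeline (PowerPaint + filtering) differs.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("skier.jpg").convert("RGB")    # placeholder: original image
mask = Image.open("bag_mask.png").convert("RGB")  # placeholder: white where the bag is

edited = pipe(
    prompt="a pink backpack",  # the attribute we want after editing (yellow -> pink)
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]

edited.save("skier_negative.jpg")
# An expression asking for "the yellow bag" should now ground to nothing.
```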
Experimental Results
The researchers evaluated a wide range of models, including Specialist models (like MDETR and GroundingDINO, which are trained specifically for detection) and MLLMs (like Shikra, Ferret, and CogVLM, which are general-purpose multimodal models).
Performance on Positive Samples
The first test was standard: Can the model find the object when it is there?

Table 3 reveals a fascinating trade-off:
- Specialists Rule Level 1: Models like MM-GDINO-T perform exceptionally well on Level 1 (Simple detection). Since Level 1 is essentially object detection, these specialized architectures shine.
- MLLMs Shine at Reasoning: As difficulty increases to Level 2 and Level 3, the performance gap narrows or flips. MLLMs like CogVLM demonstrate stronger compositional reasoning capabilities, handling the complex language logic better than the smaller specialist models.
- The Drop-off: Regardless of the model architecture, performance drops significantly as difficulty increases. Most models struggle to achieve 50% precision on Level 3 tasks, highlighting that multi-hop reasoning is still an open problem.
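For readers unfamiliar with how REC precision is typically scored, here is a small sketch using the common convention that a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5; the benchmark's exact thresholds and top-k handling may differ.

```python
# Standard REC-style scoring: a predicted box is correct if IoU > 0.5 with the
# ground truth. (The benchmark's exact protocol may use additional thresholds.)

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_at_50(predictions, ground_truths):
    """Fraction of expressions whose top predicted box overlaps the target with IoU > 0.5."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(10, 10, 60, 60), (100, 100, 150, 150)]
gts   = [(12, 8, 58, 62), (0, 0, 40, 40)]
print(precision_at_50(preds, gts))  # -> 0.5
```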
The Failure on Negatives
The results become much starker when analyzing negative samples. Here, the metric changes: since there is no ground-truth bounding box to find, the benchmark measures Recall@1, which checks whether the model assigns lower confidence to the negative samples than to their positive counterparts (a toy version of this scoring is sketched below).

Essentially, we want the model to say, “I am not confident I see this object.”
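The toy version of this scoring (an illustrative simplification, not the benchmark's released evaluation code) looks like this: for each matched positive/negative pair, count the pair as correct when the model is more confident on the positive sample.

```python
# Illustrative pairwise scoring for negatives: the model should be more confident
# on the positive sample than on its matched negative. Not the benchmark's exact code.

def negative_rejection_rate(positive_scores, negative_scores):
    """Fraction of matched pairs where the positive sample outranks the negative."""
    pairs = list(zip(positive_scores, negative_scores))
    return sum(pos > neg for pos, neg in pairs) / len(pairs)

# Hypothetical confidence scores from a grounding model:
pos = [0.92, 0.81, 0.40, 0.77]  # scores on the true expressions/images
neg = [0.30, 0.85, 0.35, 0.60]  # scores on their manipulated counterparts
print(negative_rejection_rate(pos, neg))  # -> 0.75
```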

Table 4 shows the results for negative text expressions:
- Global Failure: Performance is weak across the board. Even on Level 1 (where the object name is just replaced, e.g., looking for a “radio” when there is only a “TV”), models struggle to reject the premise.
- Reasoning Gap: Models perform worse on “Relation” and “Attribute” swaps than on simple Object replacements. They can tell a cat isn’t a dog, but they struggle to tell that a “red cup” isn’t a “blue cup.”
Correlation: Accuracy vs. Rejection
Is a model that is good at finding objects also good at knowing when objects aren’t there? The authors analyzed the correlation between Precision (finding positive samples) and Recall (rejecting negative samples).

Figure 2 shows a positive correlation. Generally, models that are more accurate on standard tasks are also more robust against hallucinations.
- Specialist Models (Left): Show a very high correlation with easy negatives (Level 1). Their ability to detect specific objects translates well to rejecting missing objects.
- MLLMs (Right): Show a stronger correlation with harder negatives (Level 2). This suggests that as MLLMs get better at complex reasoning, they also get better at identifying subtle textual contradictions.
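For completeness, the kind of correlation reported in Figure 2 boils down to a Pearson coefficient over per-model scores; the sketch below uses made-up numbers purely to show the computation.

```python
# Pearson correlation between per-model precision (positives) and recall (negatives).
# The numbers below are placeholders, not figures from the paper.
import numpy as np

precision = np.array([0.62, 0.48, 0.71, 0.55, 0.39])  # one value per model
recall    = np.array([0.58, 0.41, 0.66, 0.52, 0.35])

r = np.corrcoef(precision, recall)[0, 1]
print(f"Pearson r = {r:.2f}")  # a value near 1.0 indicates a strong positive correlation
```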
Conclusion and Implications
The FineCops-Ref paper exposes a critical vulnerability in modern AI: while models are getting better at “chatting,” they are not necessarily getting better at seeing.
The authors demonstrated that while MLLMs possess superior reasoning abilities for complex queries, they (and specialist models) suffer from a significant “grounding gap” when faced with negative samples or fine-grained attributes. By releasing this dataset, which includes both the reasoning-heavy positive samples and the hallucination-inducing negative samples, the researchers have provided a roadmap for the next generation of models.
To build truly reliable AI agents—robots that can navigate warehouses or assistants that analyze medical images—we need models that don’t just guess based on language patterns. We need models that look, reason, and have the confidence to say, “I don’t see that here.”