Introduction

Imagine you are training a self-driving car system. You train it on thousands of hours of video footage taken in sunny California. The model achieves 99% accuracy in detecting pedestrians, other cars, and stop signs. Then, you deploy the car in a snowy Canadian town or a dimly lit tunnel. Suddenly, the system fails to recognize a pedestrian wearing a winter coat against a white background.

This scenario illustrates one of the most persistent challenges in modern computer vision and Artificial Intelligence: Out-of-Distribution (OOD) generalization.

Most machine learning models are built on the assumption that the data they will see in the real world (test data) looks statistically identical to the data they studied (training data). This is known as the IID (Independent and Identically Distributed) assumption. However, the real world is chaotic. Weather changes, artistic styles vary, lighting shifts, and objects appear in unexpected contexts. When these “distribution shifts” occur, even the most powerful models can crumble.

While we have benchmarks to test how image classifiers handle these shifts (e.g., “Is this a cat?”), the field has lacked comprehensive tools for evaluating more complex tasks like Object Detection (“Where is the cat?”) and Visual Grounding (“Find the cat sitting on the red mat”).

In this post, we are diving deep into a new research paper that introduces COUNTS (Common Objects UNder disTribution Shifts). This massive dataset and benchmarking suite is designed to rigorously test how well Object Detectors and Multimodal Large Language Models (MLLMs) like GPT-4o and Gemini perform when they step out of their comfort zones.

Background: The OOD Problem

To understand the significance of COUNTS, we first need to look at the current landscape of AI testing.

The Limitation of Current Benchmarks

Historically, robustness research has focused on image classification. Datasets like ImageNet-C apply synthetic corruptions (like digital noise or blur) to standard images. While useful, these artificial changes don’t fully capture natural variations. A digital Gaussian blur filter is not the same as a foggy morning or a watercolor painting.
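
To make that distinction concrete, here is a minimal sketch, using NumPy and Pillow rather than the official ImageNet-C code, of the kind of synthetic corruption such benchmarks apply; the severity scaling is purely illustrative.

```python
import numpy as np
from PIL import Image, ImageFilter

def synthetic_corruption(image_path: str, severity: int = 3) -> Image.Image:
    """Apply a crude blur + Gaussian-noise corruption, loosely in the spirit
    of ImageNet-C-style benchmarks (not the official corruption suite)."""
    img = Image.open(image_path).convert("RGB")

    # Gaussian blur whose radius grows with severity.
    img = img.filter(ImageFilter.GaussianBlur(radius=severity))

    # Additive Gaussian noise on top of the blurred image.
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, 10 * severity, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```

Real distribution shifts such as fog, snow, or a change in artistic style cannot be reduced to a filter like this, which is exactly the gap COUNTS targets.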

Furthermore, real-world applications—such as robotics, autonomous driving, and embodied agents—require more than just labeling an image. They need to locate objects precisely. Existing datasets for OOD detection are often small, lack variety in domains, or rely on synthetic data.

The Rise of Multimodal LLMs

We are also in the era of Multimodal Large Language Models (MLLMs), which can process both text and images. Models like GPT-4o and Gemini are incredibly capable, but their “black box” nature makes them hard to evaluate. We don’t know exactly what data they were trained on, making it difficult to say for sure if a test image is truly “new” or “out-of-distribution” for them.

The researchers behind COUNTS tackled these problems head-on by building a dataset from the ground up, specifically for complex visual tasks in the wild.

The COUNTS Dataset

The core contribution of this work is the COUNTS dataset. It is a large-scale, finely annotated dataset designed to support both training and testing for OOD generalization.

Real-World Diversity

Unlike benchmarks that use algorithmic filters to simulate changes, COUNTS consists entirely of real-world images collected from the internet. The researchers identified 14 distinct domains that represent natural shifts in visual distribution.

Figure 1. Examples of images in COUNTS. Each image is annotated with domain and objects.

As shown in Figure 1, these domains include:

  • Environmental conditions: Snow, Rain, Dim (low light), Water, Grass, Sand.
  • Contextual shifts: Street, Road, Indoor, Mountain, Tree, Sky.
  • Artistic and Object states: Painting, Handmade (toys/crafts), Occlusion (blocked objects).

This variety ensures that models are tested against the kinds of visual changes humans navigate effortlessly but machines struggle with. For example, a “dog” looks very different in a photograph taken in a park (Grass domain) compared to a Painting of a dog or a plush toy dog (Handmade domain).

Scale and Precision

COUNTS is not a toy dataset. It contains:

  • 222,234 samples
  • 35 object categories
  • Over 1.1 million labeled bounding boxes

Crucially, the dataset provides object-level annotations. This means every target object is boxed and labeled, allowing for precise evaluation of object detection and grounding.
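
The exact annotation schema is not reproduced in this post, so the snippet below assumes a hypothetical COCO-style record purely to illustrate what object-level, domain-tagged labels look like in practice.

```python
# Hypothetical COCO-style record; the actual COUNTS release may use a different schema.
sample = {
    "image_id": 120394,
    "domain": "snow",                       # one of the natural domains listed above
    "annotations": [
        {"category": "dog", "bbox": [412, 203, 156, 98]},    # [x, y, width, height]
        {"category": "person", "bbox": [88, 150, 64, 170]},
    ],
}

def boxes_for(record: dict, category: str) -> list:
    """Return every bounding box of a given category in one annotated image."""
    return [a["bbox"] for a in record["annotations"] if a["category"] == category]

print(boxes_for(sample, "dog"))  # [[412, 203, 156, 98]]
```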

Comparison to Existing Benchmarks

To appreciate the scale of COUNTS, we can compare it to previous benchmarks in the field.

Table 1. Overview of current OOD generalization and robust detection benchmarks.

Table 1 highlights the gap COUNTS fills. Previous datasets like PACS or VLCS were primarily for classification and had very few images. Others like COCO-C relied on synthetic corruptions (non-natural images). COUNTS offers a unique combination of high image count, natural domains, and fine-grained detection/grounding tasks.

The New Benchmarks

Leveraging this dataset, the researchers propose two novel evaluation frameworks: \(O(OD)^2\) for traditional object detectors and OODG for Multimodal LLMs.

1. \(O(OD)^2\): Benchmarking Object Detectors

The first benchmark, \(O(OD)^2\) (Out-of-Distribution in Object Detection), is designed to test models like Faster R-CNN, YOLO, and DETR.

The Setup

Since modern detectors are already quite good at standard tasks (In-Distribution or IID), this benchmark specifically separates training and testing domains.

  • Training: The models are trained on a subset of domains (e.g., standard road or street images).
  • Testing: The models are evaluated on “unseen” domains (e.g., Sky, Occlusion, Grass, Water, Dim, Handmade).

This setup forces the model to learn the concept of an object (like a “car”) rather than just memorizing the context (like “a car is something on a gray asphalt road”). If a model relies too heavily on background context, it will fail when asked to find a car in a painting or on a snowy field.
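
As a rough sketch of this protocol (the domain names come from the list above, but the exact split used in the paper is not reproduced here):

```python
# Illustrative train/test split by domain for an O(OD)^2-style run;
# the paper's actual domain assignment may differ.
TRAIN_DOMAINS = {"street", "road", "indoor"}
TEST_DOMAINS = {"sky", "occlusion", "grass", "water", "dim", "handmade"}

def split_by_domain(samples):
    """Route each annotated sample to the IID (train) or OOD (test) pool,
    so that no test domain is ever seen during training."""
    train, test = [], []
    for s in samples:
        if s["domain"] in TRAIN_DOMAINS:
            train.append(s)
        elif s["domain"] in TEST_DOMAINS:
            test.append(s)
    return train, test
```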

2. OODG: Benchmarking Multimodal LLMs

The second benchmark, OODG (OOD in Grounding), addresses the specific challenges of Multimodal Large Language Models.

The Problem of “Unknown” Training Data

Because we don’t know the full training history of models like GPT-4o or Gemini, we cannot simply say “train on X, test on Y.” They have likely seen everything during their massive pre-training phase.

The Solution: Distribution Shifts in In-Context Learning

The researchers propose a brilliant workaround. MLLMs often use In-Context Learning (ICL), where a user provides a few examples (shots) in the prompt before asking a question. For example:

Here is an image of a dog. [Box coordinates]. Here is an image of a cat. [Box coordinates]. Now, find the bird in this new image.
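
A few-shot grounding prompt of this kind could be assembled roughly as follows; the wording and coordinate convention are illustrative rather than the paper’s exact template, and in a real API call the images themselves would be attached as separate multimodal inputs.

```python
def build_icl_prompt(examples, query_category):
    """Assemble a few-shot grounding prompt: each in-context example pairs an
    image with a labeled bounding box, then a query about a new image follows."""
    parts = []
    for ex in examples:
        x, y, w, h = ex["bbox"]
        parts.append(
            f"Here is an image of a {ex['category']}. "
            f"Bounding box: [{x}, {y}, {w}, {h}]."
        )
    parts.append(
        f"Now, find the {query_category} in this new image and "
        "return its bounding box as [x, y, width, height]."
    )
    return "\n".join(parts)

# Two in-context "shots" followed by a query, as in the example above.
prompt = build_icl_prompt(
    [{"category": "dog", "bbox": [34, 50, 120, 90]},
     {"category": "cat", "bbox": [200, 10, 80, 75]}],
    "bird",
)
```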

The OODG benchmark defines distribution shifts based on the difference between the In-Context Examples (ICE) and the Test Sample.

The benchmark evaluates three specific tasks:

  1. Visual Grounding: The model is given a bounding box and must identify what is inside it.
  2. Recognition and Localization: The model is asked to find an object (e.g., “Find the truck”) and must return the coordinates.
  3. Visual and Semantic Mapping: A complex task where the model maps descriptions to multiple regions.

Let’s look at examples of these prompts to understand what the models are facing.

Figure 3. Example of Visual Grounding.

Figure 3 shows the Visual Grounding task. The prompt asks, “What object is in the red box?” The model must choose from a list. In this example, both Gemini and GPT-4o correctly identify the wheel.

Figure 4. Example of Recognition and Localization.

Figure 4 illustrates Recognition and Localization. The model is given the image dimensions and asked to find specific objects (like “truck”) and output their [X, Y, Width, Height] coordinates.

Figure 5. Example of Visual and Semantic Mapping.

Figure 5 displays the Visual and Semantic Mapping task. Here, the model must correlate specific regions to descriptions or categories, a task requiring high-level reasoning.
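
Scoring the localization outputs from Figure 4 comes down to comparing predicted and ground-truth boxes, typically via Intersection over Union (IoU). A minimal sketch, assuming the [X, Y, Width, Height] format mentioned above:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Overlap rectangle between the two boxes.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction typically counts as a hit when IoU exceeds a threshold such as 0.5;
# mAP then averages precision across categories (and often across thresholds).
print(iou([10, 10, 100, 100], [50, 50, 100, 100]))  # ~0.22
```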

The 5 Evaluation Settings

To rigorously test MLLMs, the OODG benchmark uses five settings (a small sampling sketch follows the list):

  1. Zero-shot: No examples provided. Can the model figure it out alone?
  2. IID ICL: Examples are from the same domain as the test image (e.g., example is a car in snow; test is a truck in snow).
  3. Covariate Shift: Examples are from a different domain (e.g., example is a car on a sunny street; test is a car in a painting).
  4. Label Shift: The distribution of object types in the examples differs from the test set.
  5. Spurious Correlation Shift: The examples contain misleading patterns (e.g., in the examples, all “cats” are in dark rooms, creating a false rule that “darkness = cat”).
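
To make the first three settings concrete, here is a rough sketch of how in-context examples (ICE) might be drawn relative to a test sample; the label-shift and spurious-correlation settings additionally manipulate which categories and contexts appear in those examples.

```python
import random

def sample_ice(pool, test_sample, setting, k=4):
    """Pick k in-context examples for one test sample under a given setting.
    Each sample in 'pool' carries a 'domain' tag, as in the dataset sketch above."""
    if setting == "zero-shot":
        return []                                    # no examples at all
    if setting == "iid":
        candidates = [s for s in pool if s["domain"] == test_sample["domain"]]
    elif setting == "covariate-shift":
        candidates = [s for s in pool if s["domain"] != test_sample["domain"]]
    else:
        raise ValueError(f"setting not sketched here: {setting}")
    return random.sample(candidates, k)
```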

Experiments and Results

The researchers ran extensive experiments using these benchmarks. The results reveal fascinating insights into the limitations of current AI.

Object Detector Performance (\(O(OD)^2\))

The study tested various architectures, including Two-stage detectors (Faster R-CNN), One-stage detectors (RetinaNet, YOLOv9), and Transformer-based models (DETR, DINO).

Clean vs. Robust Performance

One of the most telling results is the relationship between performance on standard data (Clean) vs. OOD data (Robustness).

Figure 2. Comparison of current object detectors in OOD and i.i.d. scenarios.

Figure 2 plots this relationship. The x-axis is “Clean” (IID) performance, and the y-axis is “Robustness” (OOD) performance.

  • Ideally, we want models in the top right corner.
  • The Gap: Notice that robustness scores are significantly lower than clean scores for almost all models. A model might have 40% mAP (mean Average Precision) on clean data but drop to 20% on OOD data.
  • Model Differences: Transformer-based models like DINO (the green squares at the top right) generally outperform traditional CNN-based models in both metrics.

Architecture Matters

The researchers dissected why some detectors perform better.

  • Head vs. Backbone: Improving the “Head” of the network (the part that makes the final prediction) yielded better OOD gains than just making the “Backbone” (the feature extractor) stronger.
  • Pre-training: How you pre-train the model matters.

Table 3. Comparison of object detectors with different backbone and pretraining methods.

Table 3 compares pre-training methods. Stronger pre-training recipes (such as Sup_timm) often provided a significant boost in robustness over the standard ImageNet baseline. This suggests that how a model first learns to see the world shapes how well it adapts to new environments later.

MLLM Performance (OODG)

The results for Multimodal LLMs like GPT-4o and Gemini were perhaps the most surprising, particularly regarding how they use In-Context Learning (ICL).

Zero-Shot is OK, but Grounding is Hard

Even without examples (Zero-shot), models struggled with fine-grained localization. While they can describe an image well, pinpointing exact coordinates (pixel-level grounding) is still a major weakness for current MLLMs compared to specialized detectors.

Table 5. Results of Recognition and Localization. mAP is reported.

Table 5 shows the Mean Average Precision (mAP) for localization. The scores are incredibly low (often below 0.1 or 10%). This indicates that while GPT-4o might know there is a cup in the image, asking it to “draw a box around the cup” results in very poor accuracy, especially in OOD domains.

The “Bad Example” Trap

The most critical finding came from testing Covariate Shifts in ICL.

  • Scenario: You ask the model to identify an object in a “Painting” (Test). You provide examples of objects in ordinary “Photos” (Context).
  • Result: Performance drops significantly compared to Zero-shot.

Why? The models seemingly try too hard to match the visual patterns of the examples rather than understanding the underlying instruction.

Table 12. Results of Visual Grounding of more models on sky, occlusion, and grass. The selection accuracy is reported. S, M, and L indicate small, medium, and large objects, respectively.

Table 12 and Table 13 (below) detail these drops.

Table 13. Results of Visual Grounding of more models on water, dim, and handmade. The selection accuracy is reported.

Look at the contrast between GPT-4o and Gemini.

  • Gemini-1.5 is a “good student.” It pays very close attention to the examples provided. When the examples match the test data (IID), Gemini’s accuracy skyrockets. But when the examples are from a different domain (Covariate Shift), Gemini creates false associations and crashes hard—sometimes dropping by over 50%.
  • GPT-4o is less reliant on examples. It gains less from good examples but also suffers less from bad ones. It relies more on its internal pre-training knowledge.

This creates a paradox: The ability to learn effectively from context (which we generally want) makes a model more vulnerable to distribution shifts.

Conclusion and Implications

The COUNTS paper serves as a reality check for the AI community. While we celebrate the capabilities of models on standard benchmarks, the performance drop-offs observed in the \(O(OD)^2\) and OODG benchmarks show we are far from solving visual perception in the wild.

Key Takeaways:

  1. Data Matters: We need diverse, real-world datasets like COUNTS to expose weaknesses that synthetic corruptions hide.
  2. Architecture vs. Scale: Simply making models bigger isn’t the only solution. Better “heads” in detectors and better pre-training strategies are essential for robustness.
  3. The Context Double-Edged Sword: For MLLMs, In-Context Learning is powerful but dangerous. If the few-shot examples provided by a user don’t statistically match the real-world task, the model might hallucinate patterns or become confused.

As we move toward deploying AI in unconstrained environments—from home robots to search and rescue drones—benchmarks like COUNTS will be the standard by which we measure true reliability. The goal isn’t just a model that works in the lab; it’s a model that works everywhere.