When you show a picture of a Golden Retriever to a modern AI model like CLIP and it correctly identifies it as a “dog,” it’s easy to make assumptions about how it did that. We naturally assume the model “saw” the floppy ears, the golden fur, and the snout. We assume it matched the visual features of the image to the visual descriptions inherent in the word “dog.”

But what if we’re wrong? What if the model isn’t looking at the dog at all, but rather looking for the digital equivalent of a watermark? Or what if it identifies the dog not by its shape, but by “knowing” it’s a pet that lives in North American suburbs?

A fascinating research paper titled “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions” dives deep into this question. The researchers developed a novel method to interrogate Vision-Language Models (VLMs) to find out what they actually care about. The results are surprising: these models often rely on “spurious” text (like “Click to enlarge”) or non-visual facts (like habitat) rather than the physical attributes we expect them to see.

In this post, we will break down their method, “Extract and Explore,” and look at what happens when we force these black-box models to reveal their secrets.

The Black Box of Vision-Language Models

Contrastive Vision-Language Models (VLMs), such as CLIP, ALIGN, and SigLIP, have revolutionized AI. They are trained on massive datasets of image-text pairs from the internet. The training objective is simple yet powerful: map the image and its corresponding text close together in a mathematical “embedding space,” while pushing unrelated text and images apart.
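To make that objective concrete, here is a minimal sketch of a CLIP-style contrastive (InfoNCE) loss, assuming we already have normalized image and text embeddings for a batch of matched pairs; the batch size, embedding dimension, and temperature are illustrative, not CLIP’s actual values.

```python
# Minimal sketch of a CLIP-style contrastive objective over a batch of
# matched image-text pairs. Shapes and temperature are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    # The matching text for image i is text i, so the targets are the diagonal.
    targets = torch.arange(logits.size(0))
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2

# Toy batch: 8 pairs with 512-dimensional, L2-normalized embeddings.
images = F.normalize(torch.randn(8, 512), dim=-1)
texts = F.normalize(torch.randn(8, 512), dim=-1)
print(contrastive_loss(images, texts))
```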

However, understanding what these models learn is notoriously difficult. Previous research has shown that VLMs perform well on tasks requiring physical world knowledge, but also that they might not actually prioritize visual attributes like shape or color. If they aren’t looking at visual attributes, what are they looking at?

Directly asking the model is impossible because VLMs don’t “speak”—they just output numbers (vectors). To bridge this gap, the researchers introduce a framework called EX2 (Extract and Explore).

The Method: Extract and Explore (EX2)

The core idea of EX2 is ingenious: if the VLM can’t talk, let’s train a Large Language Model (LLM) to speak for it.

As illustrated in Figure 1, the process consists of two phases:

  1. Extract: Use Reinforcement Learning (RL) to teach an LLM (specifically Mistral-7B) to generate descriptions that the VLM “prefers.”
  2. Explore: Analyze these generated descriptions to understand what features the VLM prioritizes.

Figure 1: Extract: we align Mistral with VLM preferences and generate descriptions that contain features that are important for the VLM. Explore: we examine various aspects of these descriptions to identify features that contribute to VLM representations.

Phase A: Extracting Preferences

The researchers didn’t want to limit their search to a pre-defined list of colors or shapes. Instead, they wanted the LLM to freely generate descriptions. They used 25 diverse questions (like “What does a photo of a [concept] look like?” or “Write a story about [concept]”) to prompt the LLM.
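As a rough illustration of the querying step (the templates below are stand-ins, not the paper’s exact 25 questions), each question is simply filled with the concept name before being sent to the LLM:

```python
# Illustrative query templates; stand-ins for the paper's 25 questions.
QUERY_TEMPLATES = [
    "What does a photo of a {concept} look like?",
    "Describe a {concept}.",
    "Write a story about a {concept}.",
]

def build_queries(concept: str) -> list[str]:
    """Fill the concept name into every template."""
    return [t.format(concept=concept) for t in QUERY_TEMPLATES]

print(build_queries("golden retriever"))
```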

The key mechanism here is Reinforcement Learning (RL). They set up a feedback loop where the LLM generates a description, and the VLM (e.g., CLIP) judges it.

Figure 2: Extract and Explore (EX2) overview. A) We use RL to fine-tune an LLM to generate concept descriptions that are closer to the corresponding images in the VLM embedding space. B) We inspect these descriptions from various aspects.

How does the VLM judge the text? Through a reward function based on Cosine Similarity. If the generated text lands close to the actual images of the concept in the VLM’s embedding space, the LLM gets a high reward. This effectively aligns the LLM with the VLM’s worldview. If the VLM thinks “Click to enlarge” is the best description of a flower, the LLM will learn to say “Click to enlarge.”

The mathematical formulation for this reward function is shown below:

\[
R(d_c) \;=\; \frac{1}{|D_c|} \sum_{x \in D_c} \cos\!\big(\Phi_T(d_c),\, \Phi_I(x)\big)
\]

Here, \(R(d_c)\) is the reward for a generated description \(d_c\): the average cosine similarity between the description’s text embedding \(\Phi_T(d_c)\) and the image embeddings \(\Phi_I(x)\) of the concept’s images \(D_c\). A KL-divergence penalty is also applied during training to ensure the LLM doesn’t drift too far from generating coherent English.
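A minimal sketch of this reward, assuming a frozen VLM whose text and image encoders have already produced the embeddings (the function and variable names are illustrative, not the paper’s code); the KL penalty mentioned above is added separately during RL training:

```python
# Minimal sketch of the cosine-similarity reward. Assumes the frozen VLM has
# already embedded the description and the concept's images; names are illustrative.
import torch
import torch.nn.functional as F

def description_reward(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity between one description embedding (D,)
    and the embeddings of the concept's images (N, D)."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    return (image_embs @ text_emb).mean()

# Toy example: Phi_T(d_c) for one description vs. Phi_I(x) for 16 images of the concept.
phi_t = torch.randn(512)
phi_i = torch.randn(16, 512)
print(description_reward(phi_t, phi_i))
```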

Phase B: Exploring the Results

Once the LLM is aligned, it acts as a mirror, reflecting the VLM’s preferences. The researchers generated thousands of descriptions across a range of concepts (birds, flowers, cars, etc.) and analyzed them.

To handle the massive volume of text, they used ChatGPT as an automated inspector. They designed specific prompts to categorize the descriptions into three buckets:

  1. Spurious: Text that provides no real information about the concept (e.g., “Photo 1 of 3”).
  2. Informative - Visual: Text describing physical appearance (e.g., “A red bird with a short beak”).
  3. Informative - Non-Visual: Text describing facts not visible in a photo (e.g., “This bird migrates to South America”).

Table 17: Prompt template for ChatGPT to determine if a description provides additional information about the corresponding concept.
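Here is a sketch of what that automated inspection could look like with the OpenAI chat API; the prompt below paraphrases the idea behind Table 17 rather than reproducing the exact template, and the model name is an assumption.

```python
# Sketch of labeling a generated description with ChatGPT. The prompt is a
# paraphrase of Table 17's intent, not the exact template; the model name is assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = {"spurious", "informative-visual", "informative-non-visual"}

def categorize(concept: str, description: str) -> str:
    prompt = (
        f"Concept: {concept}\n"
        f"Description: {description}\n"
        "Does this description provide additional information about the concept? "
        "Answer with exactly one of: spurious, informative-visual, informative-non-visual."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "unparsed"

print(categorize("thorn apple", "Photo 1 of 3. Click to enlarge."))
```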

Does It Work?

Before analyzing what was learned, the researchers had to verify that the LLM actually learned helpful features. They tested this by using the generated descriptions to classify images.
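The evaluation is standard zero-shot classification: embed a description for each class, embed the image, and predict the class whose description is closest in the VLM’s embedding space. A minimal sketch, assuming the embeddings are precomputed (the paper aggregates many generated descriptions per class; this uses one per class for brevity):

```python
# Sketch of zero-shot classification with one description embedding per class.
# Assumes all embeddings come from the same VLM; shapes are illustrative.
import torch
import torch.nn.functional as F

def classify(image_emb: torch.Tensor, class_desc_embs: torch.Tensor) -> int:
    """image_emb: (D,) image embedding. class_desc_embs: (C, D), one description
    embedding per class. Returns the index of the most similar class."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_desc_embs = F.normalize(class_desc_embs, dim=-1)
    return int((class_desc_embs @ image_emb).argmax())

# Toy example: one image scored against 102 flower classes, 512-dim embeddings.
image = torch.randn(512)
class_descriptions = torch.randn(102, 512)
print(classify(image, class_descriptions))
```

Swapping the description embeddings for embeddings of generic templates like “A photo of a…” gives the baseline the comparison is made against.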

The results, shown in Table 2, were positive. In most cases (33 out of 42 experiments), the descriptions generated by the aligned LLM achieved higher classification accuracy than standard generic templates (like “A photo of a…”).

Table 2: The percentage of informative descriptions for experiments in which the LLM successfully learns the VLM’s preferences and improves classification accuracy.

This confirms that the method works: the LLM successfully extracted features that help the VLM identify images. Now, the burning question is: what are those features?

Finding 1: The “Spurious” Problem

One of the most startling findings is the VLM’s reliance on spurious descriptions. These are descriptions that contain absolutely no knowledge about the concept itself but help the model identify the image anyway.

For example, when analyzing the standard CLIP model on a dataset of Flowers, a massive chunk of the “preferred” descriptions were spurious.

Figure 3: Breakdown of aligned descriptions for CLIP on Flowers. CLIP significantly relies on spurious or non-visual information to represent flower species.

As shown in Figure 3 above, nearly 45% of the descriptions CLIP preferred for flowers were spurious.

What does spurious text look like? It often resembles metadata, file names, or website artifacts. Table 11 below provides specific examples. Look at the entry for “thorn apple”—the model preferred a repetitive header structure over a visual description. For the “McDonnell Douglas DC-9-30,” it preferred text that looked like a file caption.

Table 11: Examples of spurious, non-visual, and visual descriptions.

Why does this happen? VLMs are trained on web data (image-text pairs). If a specific bird species usually appears on a specific hobbyist website that always uses the caption “Click to enlarge,” the model learns to associate the image of the bird with the text “Click to enlarge.” It’s a shortcut—a “Clever Hans” moment where the AI gets the right answer for the wrong reason.

Finding 2: The Non-Visual Surprise

Even when the descriptions were informative (i.e., not spurious), they often didn’t describe what the object looked like.

The researchers found that VLMs rely significantly on non-visual attributes. For example, knowing that a bird is “native to North America” (habitat) might be more important to the model than knowing it has “yellow wings.”

Table 4 highlights this trend. In several datasets, fewer than 25% of the informative descriptions contained visual attributes.

Table 4: Percentage of informative descriptions that contain visual attributes.

This challenges the intuition that VLMs are “seeing” the image. Instead, they seem to be context-matching. If an image contains background elements (like a specific type of tree or fence) that correlate with “North America,” the model might use that geographical context to identify the bird, rather than identifying the bird itself.

Finding 3: Different Models, Different Personalities

Not all VLMs “think” alike. The researchers compared several popular models (CLIP, ALIGN, SigLIP, etc.) and found distinct preferences for each.

Figure 4 visualizes the attributes preferred by CLIP vs. ALIGN.

  • CLIP (on Flowers): Prioritizes “Family” (taxonomic classification) and “Size.”
  • ALIGN (on Flowers): Prioritizes “Parts” (petals, stems) and “Color.”

Figure 4: Most common described attributes for CLIP and ALIGN for CUB and Flowers. Different VLMs prioritize different attributes to represent concepts.

This means that even if two models have similar accuracy, they are achieving it through different representations.

The Case of SigLIP

SigLIP provided perhaps the most entertaining (and concerning) qualitative results. SigLIP is trained on a dataset called WebLI, which relies heavily on OCR (Optical Character Recognition) text. As a result, SigLIP has a strong bias toward text that looks like photo credits, website footprints, or even personal stories.

In Table 8, we see a comparison of descriptions generated for CLIP versus SigLIP.

  • CLIP describes the “Yellow-billed Cuckoo” as a medium-sized bird with dark plumage. (Visual/Factual).
  • SigLIP prefers: “Photo of a Yellow-billed Cuckoo. This image was downloaded from the US Fish & Wildlife Service website…” (Spurious/Metadata).

Table 8: Aligned descriptions generated in response to four different queries for CLIP and SigLIP.

The evolution of these descriptions during the RL training process is also revealing. Table 13 shows how the LLM adjusts its description of a “Chihuahua” over time to please SigLIP.

It starts with a factual description. By step 400, it drifts to “Chihuahuas are my favorite kind of dog…” By step 999, it has converged on a bizarrely specific personal anecdote: “Here’s a photo of my Chihuahua dog Huey who passed away at eleven years old…”

Table 13: Examples of how descriptions change during training for CLIP and SigLIP for the same query.

This suggests that SigLIP strongly associates Chihuahuas with personal blog posts or social media captions regarding pets passing away, rather than the visual features of the dog itself.

Why This Matters

This research paper serves as a wake-up call for how we interpret Vision-Language Models. The Extract and Explore method reveals that high performance on benchmarks does not imply the model is reasoning the way we want it to.

  1. Reliability: If a model identifies a flower because of a “Click to enlarge” artifact, it will fail when deployed in the real world where that artifact is missing.
  2. Dataset Hygiene: The findings highlight the impact of training data. SigLIP’s OCR-heavy data resulted in a model obsessed with website metadata. Future dataset curation needs to account for this to prevent models from learning these shortcuts.
  3. Non-Visual Dependencies: The reliance on habitat and origin stories suggests that VLMs are capturing scene context and correlations rather than strictly “object recognition.”

By forcing these models to “talk” through an aligned LLM, we can finally see the world through their eyes—and it turns out, they’re reading the captions more than they’re looking at the pictures.