Introduction

Humans possess an innate ability to understand the world through multiple senses. We effortlessly combine visual cues with language to interpret complex scenes. If you see a picture of “a horse riding a man,” you immediately recognize the absurdity and distinguish it from “a man riding a horse.” This ability to understand how different components (objects, attributes, relations) combine to form meaning is called compositional reasoning.

In the world of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have revolutionized how computers understand images and text. They are fantastic at recognizing objects and matching images to captions in a general sense. However, they have a “bag-of-words” problem. To a standard VLM, “a horse riding a man” and “a man riding a horse” look mathematically almost identical because they contain the same words.

Researchers have tried to fix this by fine-tuning these models with “hard negatives”—captions that are grammatically tricky. While this improves the model’s compositional reasoning, it often comes at a steep price: the model develops a kind of “tunnel vision,” losing much of its general ability to recognize novel concepts (zero-shot performance) and to retrieve images from ordinary captions.

In this post, we dive into a new paper, “Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality”, which proposes a solution called FSC-CLIP. This method allows models to learn fine-grained logic without forgetting the general knowledge they were pre-trained on.

The Trade-Off: Smarter Logic vs. General Knowledge

To understand the core problem, we first need to look at how models like CLIP are typically fine-tuned. The standard approach involves Contrastive Learning. You take an image and its correct caption (positive), and you contrast it against incorrect captions (negatives).

To teach compositionality, researchers use Hard Negatives (HN). These are captions that are very similar to the truth but differ in one key aspect, like swapping the subject and object.

  • Image: A dog biting a man.
  • Positive Text: “A dog biting a man.”
  • Hard Negative Text: “A man biting a dog.”

The goal is to force the model to push the “Hard Negative” representation away from the image. However, most methods do this using Global Representations—single vectors representing the whole image and the whole sentence.
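
To make this concrete, here is a minimal PyTorch sketch of such a global hard-negative term (the function and variable names are illustrative, not taken from the paper’s code): for each image, the original caption must out-score its hard-negative counterpart.

```python
import torch
import torch.nn.functional as F

def global_hn_loss(image_emb, pos_text_emb, hn_text_emb, temperature=0.07):
    """Global hard-negative term: each image should score higher with its
    original caption than with the hard-negative caption.

    image_emb:    (B, D) L2-normalized global image embeddings
    pos_text_emb: (B, D) L2-normalized embeddings of the original captions
    hn_text_emb:  (B, D) L2-normalized embeddings of the hard-negative captions
    """
    pos_sim = (image_emb * pos_text_emb).sum(dim=-1) / temperature  # (B,)
    hn_sim = (image_emb * hn_text_emb).sum(dim=-1) / temperature    # (B,)
    logits = torch.stack([pos_sim, hn_sim], dim=1)                  # (B, 2)
    # The correct "class" is always index 0, i.e. the original caption.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```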

Because the Hard Negative text is so semantically similar to the original text, forcing their vectors apart distorts the model’s carefully learned “multi-modal space.” The model gets so obsessed with this specific grammatical distinction that it damages the general alignment between images and text.

Figure 1: A holistic comparison of fine-tuning methods for vision-language compositionality. Enhancing compositionality often compromises multi-modal task performance in previous approaches. Our FSC-CLIP bridges this gap, minimizing these trade-offs. Full experimental results are provided in Tab. 1.

As shown in Figure 1 above, previous methods (blue and green lines) often see a drop in general performance (Zero-Shot Average and Retrieval Average) when they try to improve compositionality. The proposed method, FSC-CLIP (orange line), pushes the boundary outward, achieving high scores on both axes.

The FSC-CLIP Framework

The researchers propose Fine-grained Selective Calibrated CLIP (FSC-CLIP). Instead of relying solely on blunt, global comparisons, this framework introduces two major innovations:

  1. Local Hard Negative (LHN) Loss: Looking at the problem with a magnifying glass (patch-token alignment) rather than a telescope (global vectors).
  2. Selective Calibrated Regularization (SCR): A smarter way to calculate loss that handles the ambiguity of hard negatives.

Let’s visualize the entire architecture before breaking down the math.

Figure 2: A complete FSC-CLIP framework that incorporates Local Hard Negative (LHN) Loss with Selective Calibrated Regularization (SCR), alongside a global HN loss. The LHN loss measures similarity between an image and a text at the patch and token levels to more accurately identify subtle differences between original and HN texts. SCR combines focal loss with label smoothing to mitigate the adverse effects of using hard negative losses.

In Figure 2, you can see the dual-pathway approach. The model computes both Global similarities (top path) and Local similarities (bottom path), combining them to update the model.

Innovation 1: Local Hard Negative (LHN) Loss

The first issue with standard fine-tuning is that a global vector is too abstract to capture the difference between “man on horse” and “horse on man.”

FSC-CLIP introduces a Local Hard Negative (LHN) Loss. Instead of summarizing the whole image into one vector, it looks at individual visual patches (\(v_p\)) and aligns them with specific text tokens (\(t_w\)).

Step 1: Textual-Aligned Visual Patches

First, the model determines which parts of the image correspond to which words. It calculates a similarity map between every word and every image patch. It then normalizes these scores to create an “attention weight” (\(a_{w,p}\)):

\[
a_{w,p} = \frac{\exp\!\left(t_w^{\top} v_p\right)}{\sum_{p'} \exp\!\left(t_w^{\top} v_{p'}\right)}
\]

Using these weights, the model aggregates the visual patches to create a “textual-aligned” visual representation (\(\hat{v}_w\)) for each word. Essentially, for the word “dog,” the model synthesizes a visual vector based mostly on the pixels where the dog is located.

\[
\hat{v}_w = \sum_{p} a_{w,p}\, v_p
\]
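
A rough PyTorch sketch of this step (the softmax normalization and all names here are my own illustrative choices, not necessarily the paper’s exact formulation):

```python
import torch

def textual_aligned_patches(text_tokens, image_patches):
    """Aggregate image patches into one visual vector per word.

    text_tokens:   (W, D) word/token embeddings t_w
    image_patches: (P, D) visual patch embeddings v_p
    Returns:       (W, D) textual-aligned visual vectors v_hat_w
    """
    sim = text_tokens @ image_patches.T   # (W, P) word-patch similarity map
    attn = sim.softmax(dim=-1)            # a_{w,p}: normalized over patches
    return attn @ image_patches           # weighted sum of patches per word
```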

Step 2: Token-Level Similarity

Once we have a visual vector for every word, we compare them directly. The local similarity score \(S_l\) is the sum of the similarities between each word token (\(t_w\)) and its corresponding visual area (\(\hat{v}_w\)).

\[
S_l = \sum_{w} \operatorname{sim}\!\left(t_w, \hat{v}_w\right)
\]
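
Continuing the sketch (same caveats as above), the local score is a per-word cosine similarity, summed over the caption:

```python
import torch
import torch.nn.functional as F

def local_similarity(text_tokens, image_patches):
    """S_l: similarity between each token t_w and its textual-aligned
    visual vector v_hat_w, summed over all tokens in the caption."""
    attn = (text_tokens @ image_patches.T).softmax(dim=-1)   # a_{w,p}, (W, P)
    aligned = attn @ image_patches                           # v_hat_w, (W, D)
    return F.cosine_similarity(text_tokens, aligned, dim=-1).sum()
```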

Step 3: The LHN Loss Function

Finally, this local similarity score is used in a contrastive loss function. This forces the model to ensure that the specific patches of the image align better with the correct sentence structure than with the hard negative sentence structure.

\[
\mathcal{L}_{\mathrm{LHN}} = -\log \frac{\exp\!\left(S_l(x, y)/\tau\right)}{\exp\!\left(S_l(x, y)/\tau\right) + \sum_{y^{-} \in \mathcal{N}} \exp\!\left(S_l(x, y^{-})/\tau\right)}
\]

Here, \(x\) is the image, \(y\) its original caption, \(\mathcal{N}\) the set of hard-negative captions, and \(\tau\) a temperature.
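
As a hedged sketch of how such a loss could look in PyTorch (the temperature and batching details are assumptions): the positive caption’s local score competes against the local scores of the hard negatives in a softmax.

```python
import torch
import torch.nn.functional as F

def lhn_loss(pos_local_sim, hn_local_sims, temperature=0.07):
    """Local hard-negative loss: the original caption (index 0) must win a
    softmax over local similarity scores against its hard negatives.

    pos_local_sim: scalar tensor, S_l(image, original caption)
    hn_local_sims: (N,) tensor of S_l(image, hard negative) scores
    """
    logits = torch.cat([pos_local_sim.view(1), hn_local_sims]) / temperature
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```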

By focusing on these local details, the model can learn compositionality (finding the dog, finding the man, checking who is biting whom) without having to drastically alter the global vectors that represent the general concept of the scene.

Innovation 2: Selective Calibrated Regularization (SCR)

Even with local attention, there is a risk. Hard Negative texts are mostly correct. “A man biting a dog” contains all the correct objects for an image of a dog biting a man. If we tell the model “This text is WRONG (0% match),” we are technically lying. It’s a 90% match with a 10% structural error.

Punishing the model too hard for these high-similarity negatives causes “catastrophic forgetting” of general knowledge. Selective Calibrated Regularization (SCR) solves this with two techniques.

Technique A: Focal Loss

Standard Cross-Entropy loss treats all errors equally. Focal Loss, originally designed for object detection, down-weights “easy” examples and focuses on “hard” ones.

In this context, if the model is already confident about the relationship between the image and the original text, the loss function reduces the signal. It prioritizes the confusing cases—the “Hard Negatives” that are deceptively similar to the image.

Figure 3: A conceptual illustration of the confidence-based weighting mechanism in HN loss. It reduces the adverse impact of HN supervision by lowering the signal from confident predictions while selectively focusing on challenging ones, crucial for learning compositionality.

The mathematical formulation for this Focal Loss applies a modulation factor \((1 - p)^\gamma\). As the probability \(p\) of the correct class increases (confidence goes up), the loss approaches zero.

\[
\mathcal{L}_{\mathrm{focal}} = -(1 - p)^{\gamma} \log p
\]
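
In code, the modulation is a one-liner (gamma=2.0 is the classic focal-loss default, used here purely for illustration):

```python
import torch

def focal_weighted_nll(p_correct, gamma=2.0):
    """-(1 - p)^gamma * log(p): as the probability of the correct pairing
    approaches 1, the loss fades toward zero, so confident, already-solved
    pairs contribute little and the deceptive hard negatives dominate."""
    return -((1.0 - p_correct) ** gamma) * torch.log(p_correct.clamp_min(1e-8))
```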

Technique B: Label Smoothing

This is the “Calibration” part of SCR. Instead of using a hard target of 1 for the positive text and 0 for the hard negative, the researchers use Label Smoothing.

They assign a small positive value to the Hard Negatives. This tells the model: “It’s okay that this negative sentence looks similar to the image; it actually shares a lot of content. Just make sure the original sentence scores higher.”

\[
\tilde{y}_i = (1 - \beta)\, y_i + \frac{\beta}{K}
\]

Here, \(y_i\) is the original one-hot target over the \(K\) candidate captions (the original plus its hard negatives).

Here, \(\beta\) is a smoothing parameter. This prevents the model from violently pushing away the Hard Negative vectors, preserving the integrity of the learned representation space.
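
A small sketch of what such calibrated targets could look like (beta=0.1 is an illustrative value, not the paper’s tuned setting):

```python
import torch

def smoothed_targets(num_candidates, beta=0.1):
    """Smoothed targets over K candidate captions (the original at index 0,
    followed by its hard negatives): every candidate receives beta / K, and
    the original caption gets the remaining 1 - beta on top."""
    targets = torch.full((num_candidates,), beta / num_candidates)
    targets[0] += 1.0 - beta
    return targets  # sums to 1, e.g. [0.925, 0.025, 0.025, 0.025] for K=4
```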

The Total Objective

The final training objective combines the standard CLIP loss (to keep general knowledge), the Global Hard Negative loss, and the new Local Hard Negative loss, balanced by weighting factors \(\lambda\).

\[
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CLIP}} + \lambda_{\mathrm{gHN}}\, \mathcal{L}_{\mathrm{gHN}} + \lambda_{\mathrm{LHN}}\, \mathcal{L}_{\mathrm{LHN}}
\]
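
Putting it together, the combined objective could be assembled like this (the lambda values are placeholders, not the paper’s tuned weights):

```python
def fsc_clip_objective(clip_loss, global_hn, local_hn,
                       lambda_global=1.0, lambda_local=1.0):
    """Total loss: standard CLIP contrastive loss plus weighted global and
    local hard-negative terms (both regularized with SCR in the full method)."""
    return clip_loss + lambda_global * global_hn + lambda_local * local_hn
```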

Experiments and Results

The researchers evaluated FSC-CLIP on a massive suite of benchmarks:

  • 11 Compositionality Benchmarks: Tests like SugarCrepe and Winoground that specifically look for logic/grammar understanding.
  • 21 Zero-Shot Tasks: Standard classification datasets (ImageNet, etc.) to check if general knowledge remains intact.
  • Retrieval Tasks: Finding images from text on COCO and Flickr30k.

Quantitative Performance

The results, summarized in Table 1, show that FSC-CLIP achieves the “best of both worlds.”

Table 1: A holistic comparison of fine-tuning methods…

  • Compositionality: FSC-CLIP scores 66.3, significantly higher than the original CLIP (57.1) and competitive with state-of-the-art models like NegCLIP.
  • Zero-Shot (ZS): Crucially, while other models drop significantly in Zero-Shot performance (e.g., DAC-LLM drops to 51.1), FSC-CLIP maintains a score of 58.3, very close to the original pre-trained model.
  • Retrieval: It achieves the highest retrieval scores among fine-tuned models, indicating the latent space is well-preserved.

The “Pareto Frontier”

A great way to visualize trade-offs is a trajectory chart. Figure 4 plots Compositionality (Y-axis) against Zero-Shot Classification (X-axis). Ideally, you want to be in the top-right corner.

Figure 4: Fine-tuning trajectories between compositionality (Comp) and zero-shot classification (ZS)…

Most methods (blue, green) curve backward—as they go up in compositionality, they go left (worse) in classification. FSC-CLIP (orange) shoots almost straight up, improving logic without sacrificing general knowledge.

Qualitative Examples

What does this look like in practice? Figure 5 shows a retrieval task where the model must pick the correct caption.

  • Scenario 1: An image of oranges and apples.
      • CLIP: Gets confused, ranks “oranges and cups” highly.
      • DAC-LLM: Gets confused as well.
      • FSC-CLIP: Correctly identifies “oranges and apples” and rejects “oranges and cups.”
  • Scenario 2: A man bending over a table with candles.
      • CLIP: Thinks it’s a “cake with candles” (hallucinating the cake because candles usually imply cake).
      • FSC-CLIP: Correctly focuses on the “table” vs. “cake” distinction.

(Note: While the qualitative figure is described here to illustrate the point, please refer to Figure 5 in the original paper for the visual examples).

Why Each Component Matters

The researchers performed an ablation study (removing parts of the model to see what breaks).

Table 2: Impact by individual component…

  • Row 2: Using only Local HN loss preserves multi-modal performance (high retrieval scores) but doesn’t maximize compositionality.
  • Row 3: Combining Global and Local HN losses improves compositionality but hurts retrieval (the trade-off appears).
  • Row 6 (Ours): Adding SCR (Focal Loss + Label Smoothing) restores the retrieval performance while keeping the high compositionality. This confirms that calibration is just as important as the local architecture.

Conclusion

The paper “Preserving Multi-Modal Capabilities of Pre-trained VLMs” highlights a critical maturity step in AI development. We are moving past the phase of “getting models to work” and into the phase of “getting models to work precisely without breaking them.”

FSC-CLIP demonstrates that Local Hard Negative Loss allows models to see fine-grained details that global vectors miss. Simultaneously, Selective Calibrated Regularization treats the learning process with nuance, acknowledging that “negative” examples aren’t always 100% wrong.

By respecting the subtle geometry of the multi-modal space, FSC-CLIP allows us to have our cake and eat it too: an AI that understands the grammar of visual scenes while remaining a robust, general-purpose learner. As VLMs continue to be integrated into robotics, search engines, and assistants, this kind of reliable, compositional understanding will be essential.