Image captioning—the art of teaching computers to describe what they see—has traditionally relied on massive datasets of paired images and texts. You show the model a picture of a cat, you give it the text “a cat sitting on a mat,” and repeat this millions of times. While effective, this approach is expensive and computationally heavy.

But what if a model could learn to caption images without ever seeing an image during training?

This concept, known as text-only training, utilizes the rich semantic knowledge embedded in language models. However, it faces a significant hurdle: the Modality Gap. The way a model represents text data mathematically is often fundamentally different from how it represents image data. If you train on text but test on images, the model often stumbles because the inputs “look” different in the embedding space.

In this post, we are diving deep into IFCap (Image-like Retrieval and Frequency-based Entity Filtering), a novel framework proposed by researchers from Hanyang University. IFCap introduces a clever way to bridge this modality gap using noise injection and a statistical approach to object detection that doesn’t rely on fixed vocabularies.

The Core Problem: The Modality Gap

To understand IFCap, we first need to understand the environment it operates in. Modern vision-language models like CLIP (Contrastive Language-Image Pre-training) project images and text into a shared vector space. In theory, the embedding for an image of a dog and the text “a dog” should be identical.

In reality, they are merely close, not identical. They occupy distinct regions within that space.

When a captioning model is trained strictly on text data (text-to-text), it optimizes its internal weights to process text embeddings. During inference (testing), we feed it an image embedding. Because of the slight mismatch in distribution—the modality gap—the model’s performance degrades. It’s like training a translator on standard French and then testing them on a French-based creole; they might get the gist, but the precision is lost.

Top: The modality gap causes a disconnect between training and inference. Bottom: Comparison of entity retrieval methods.

As shown in Figure 1 (Top), traditional text-to-text retrieval approaches overlook this gap. The yellow arrows represent the training flow (text), while the blue arrows represent the inference flow (images). They point in different directions, leading to suboptimal results.

The researchers visualized this phenomenon explicitly using t-SNE, a technique for visualizing high-dimensional data.

The distribution of CLIP embedding features. Note the gap between text retrieval (yellow) and image data (purple).

In Figure 2, look at the separation between the yellow dots (Text-to-text Retrieval) and the purple/orange dots (Images and Ground Truth). That physical distance on the plot represents the modality gap that causes errors in caption generation.
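To make the gap concrete, here is a toy way to quantify it: embed a batch of texts and a batch of images, then compare the centroids of the two clouds. The tensors below are random placeholders standing in for real CLIP features (the paper uses actual CLIP encoders and a t-SNE plot, not this toy measure), so treat it as a sketch of the idea rather than the authors' analysis.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the modality gap: even "matched" text and image
# embeddings tend to occupy separate regions of the shared space.
# These are random placeholders standing in for real CLIP features
# (e.g. the outputs of clip.encode_text / clip.encode_image).
torch.manual_seed(0)
dim = 512
text_emb = F.normalize(torch.randn(1000, dim) + 0.5, dim=-1)   # "text" cluster
image_emb = F.normalize(torch.randn(1000, dim) - 0.5, dim=-1)  # "image" cluster

# Distance between the two modality centroids: a crude proxy for the gap.
gap = (text_emb.mean(0) - image_emb.mean(0)).norm()
print(f"centroid distance (modality gap proxy): {gap:.3f}")
```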

The IFCap Solution

IFCap proposes a unified framework to solve this by making the training data “look” more like the inference data and by extracting better object information during testing.

The overview of the IFCap architecture.

The architecture, illustrated in Figure 4, consists of three main innovations:

  1. Image-like Retrieval (ILR): A training technique to simulate image features.
  2. Fusion Module (FM): A mechanism to combine input features with retrieved context.
  3. Frequency-based Entity Filtering (EF): A robust inference-time strategy for identifying objects.

Let’s break these down step-by-step.

1. Image-like Retrieval (ILR)

The researchers asked a simple question: If the model is going to see image embeddings during testing, why don’t we force the text embeddings during training to resemble image embeddings?

They achieved this through noise injection. During the training phase, instead of using the clean, perfect embedding of the input text, they add zero-mean Gaussian noise (with a carefully tuned variance) to it.

\[ T_i = \mathcal{E}_T(t_i), \qquad T_i^{\epsilon} = T_i + \epsilon_r. \]

Here, \(\mathcal{E}_T\) is the CLIP text encoder and \(T_i\) is the resulting text embedding. \(\epsilon_r\) is noise sampled from a zero-mean Gaussian distribution. The resulting \(T_i^\epsilon\) is a “noisy” text embedding that statistically overlaps more with the distribution of image embeddings.

When the system performs retrieval (finding similar sentences in a database to help generate the caption), it uses this noisy query. This forces the model to become robust to the kind of variations it will encounter when it actually sees image embeddings later.
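Here is a minimal sketch of what Image-like Retrieval could look like in code, assuming the caption bank has already been encoded and L2-normalized with CLIP. The function name, the re-normalization of the query, and the default variance are illustrative choices, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def image_like_retrieval(text_emb, caption_bank, k=5, sigma2=0.04):
    """Retrieve captions with a noise-perturbed text query.

    text_emb:     (d,) CLIP text embedding of the input caption
    caption_bank: (N, d) L2-normalized embeddings of the retrieval corpus
    sigma2:       variance of the injected Gaussian noise (the post notes
                  a sweet spot around 0.04; see the "Tuning the Noise" section)
    """
    noise = torch.randn_like(text_emb) * sigma2 ** 0.5   # epsilon_r ~ N(0, sigma2 * I)
    noisy_query = F.normalize(text_emb + noise, dim=-1)  # T_i^epsilon
    sims = caption_bank @ noisy_query                    # cosine similarity to every bank entry
    return sims.topk(k).indices                          # indices of the k nearest captions
```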

Referring back to Figure 2, look at the green dots labeled “Ours.” Notice how they cluster much more closely to the image and ground truth distributions compared to the standard text retrieval methods. By simulating the “imperfection” of image data, the model learns to handle it.

2. The Fusion Module

Retrieving similar sentences is great, but simply handing them to the language model isn’t enough. The model needs to weigh the importance of the input against the retrieved context.

The Fusion Module uses an attention mechanism to blend these sources. It takes the noisy input text features (\(T_e\)) and the features from the retrieved captions (\(R_e\)) and processes them through a cross-attention layer.

\[
\begin{aligned}
T_e &= T_i + \epsilon, \qquad R_e = \mathcal{E}_T(\mathrm{ILR}(T_i)), \\
F_e &= f_{Att}\bigl(f_{l_1}(T_e),\, f_{l_2}(R_e)\bigr), \\
F   &= \mathrm{Map}(F_e;\, \theta_q).
\end{aligned}
\]

The fused representation \(F_e\) captures the interaction between the input and the retrieved knowledge. A mapping network (a Transformer, \(\mathrm{Map}\)) then converts it into the prefix \(F\) that conditions the caption decoder (GPT-2).
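A rough PyTorch sketch of such a fusion step is below. The embedding size, head count, and single cross-attention layer are assumptions for illustration, not the paper’s exact configuration.

```python
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the fusion step: project the noisy input features and the
    retrieved-caption features, then mix them with cross-attention."""

    def __init__(self, dim=512):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)   # f_{l1}: projects T_e (the query)
        self.proj_kv = nn.Linear(dim, dim)  # f_{l2}: projects R_e (keys/values)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, t_e, r_e):
        # t_e: (B, 1, dim) noisy input text feature
        # r_e: (B, K, dim) embeddings of the K retrieved captions
        q = self.proj_q(t_e)
        kv = self.proj_kv(r_e)
        fused, _ = self.attn(q, kv, kv)     # F_e = f_Att(f_{l1}(T_e), f_{l2}(R_e))
        return fused                        # later mapped to GPT-2 prefix tokens
```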

3. Frequency-based Entity Filtering (EF)

Perhaps the most intuitive contribution of this paper is how it handles object detection during inference (when the model is actually captioning images).

Previous methods (like ViECap) used classifiers to guess what objects were in an image. However, classifiers are limited by a fixed vocabulary. If “avocado” isn’t in the classifier’s list, the model will never explicitly detect it, potentially mislabeling it as “fruit” or “ball.”

IFCap takes a different approach: Frequency-based Entity Filtering.

  1. Retrieve: Given an input image, the system retrieves the Top-\(K\) most similar sentences from a massive text database.
  2. Parse: It extracts all nouns from these retrieved sentences.
  3. Count: It calculates the frequency of each noun.
  4. Filter: Nouns that appear frequently are likely to be present in the image. These are selected to form a “hard prompt.”

For example, if you retrieve 10 sentences for a picture of a park, and 8 of them contain the word “bench,” there is almost certainly a bench in the image, even if a standard classifier misses it.
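The pipeline is easy to sketch with off-the-shelf tools. The version below uses NLTK’s tokenizer and part-of-speech tagger as stand-ins for whatever parser the authors use, plus a naive “more than one mention” cutoff; the paper’s adaptive threshold is covered next.

```python
from collections import Counter
import nltk  # assumes the relevant tokenizer/tagger data has been downloaded

def filter_entities(retrieved_captions, threshold=None):
    """Count noun frequencies over the retrieved captions and keep the frequent ones."""
    counts = Counter()
    for caption in retrieved_captions:
        tokens = nltk.word_tokenize(caption.lower())
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN"):       # keep nouns only
                counts[word] += 1
    if threshold is None:                  # naive fallback: more than one mention
        threshold = 1
    return [w for w, c in counts.items() if c > threshold]

captions = ["a man sits on a bench in the park",
            "an empty bench under a tree",
            "a wooden bench near the playground"]
print(filter_entities(captions))  # ['bench']
```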

Comparison of entity precision. The green bars (Entity Filtering) significantly outperform standard classifiers.

The effectiveness of this method is evident in Figure 3. The green bars represent IFCap’s Entity Filtering. On the COCO dataset, it achieves 86.1% precision, drastically outperforming ViECap (blue) and even the object detector DETR (brown).

This method creates a dynamic vocabulary. If the retrieved sentences contain rare words, IFCap can use them, freeing the model from fixed lists of objects.

Adaptive Thresholding

To decide which nouns to keep, the researchers propose using an adaptive threshold based on the statistical distribution of the noun frequencies:

\[ \tau_{\mathrm{adap}} = \mu_F + \sigma_F. \]

By setting the threshold (\(\tau\)) to the mean frequency plus one standard deviation, the system dynamically adjusts to the confidence level of the retrieved sentences, selecting only the most salient entities.
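As a tiny sketch, the threshold is just the mean plus one standard deviation of the noun counts; whether the population or sample standard deviation is meant is my assumption here.

```python
import statistics

def adaptive_threshold(noun_counts):
    """tau_adap = mu_F + sigma_F over the retrieved-noun frequencies."""
    freqs = list(noun_counts.values())
    return statistics.mean(freqs) + statistics.pstdev(freqs)

# Example: {'bench': 8, 'park': 3, 'dog': 1} gives a threshold of roughly 6.9,
# so only 'bench' survives the filter.
```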

Experiments and Results

The researchers validated IFCap on standard benchmarks like MS-COCO, Flickr30k, and NoCaps. The results show that addressing the modality gap yields significant improvements.

In-Domain Performance

On the COCO and Flickr30k datasets (where the training text style matches the test style), IFCap achieved state-of-the-art results among text-only methods.

Results on In-domain captioning. IFCap leads in almost every metric.

In Table 1, you can see IFCap (bottom row) scoring 108.0 on the CIDEr metric for COCO, beating the previous best (SynTIC) of 101.1. This is a substantial margin in the world of image captioning.

Cross-Domain Generalization

A true test of a captioning model is how well it handles images from domains it hasn’t seen. The researchers tested this by training on COCO text but testing on Flickr30k images (and vice versa).

Results on Cross-domain captioning.

Table 2 highlights IFCap’s robust generalization. Even without seeing images during training, the alignment provided by Image-like Retrieval allows the model to adapt to new visual domains more effectively than competitors like Knight or ViECap.

Video Captioning

Remarkably, IFCap also extends to video. By averaging visual features across video frames, the team applied the same architecture to the MSR-VTT and MSVD datasets.
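The video extension is straightforward to sketch: mean-pool the per-frame CLIP features into a single vector and treat it like an image embedding. Re-normalizing after pooling is an assumption on my part.

```python
import torch.nn.functional as F

def video_embedding(frame_embeddings):
    """Collapse per-frame CLIP image embeddings (T, d) into one video feature
    by simple mean pooling, then re-normalize to unit length."""
    return F.normalize(frame_embeddings.mean(dim=0), dim=-1)
```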

Results on Video captioning.

As shown in Table 4, IFCap sets a new standard for text-only video captioning, demonstrating that the principles of Image-like Retrieval hold true even when temporal dynamics are introduced.

Tuning the Noise

One interesting aspect of the research was determining exactly how much noise to inject during training. Too little noise, and the modality gap remains. Too much, and the signal is lost.

Hyper-parameter search for the noise level sigma.

Figure 5 visualizes this search. The performance (y-axis) peaks around a noise variance (\(\sigma^2\)) of 0.04. This “sweet spot” confirms that a specific amount of perturbation is necessary to optimally align the text and image spaces.

Conclusion

IFCap represents a significant step forward in zero-shot image captioning. By acknowledging and actively addressing the modality gap, the researchers turned a weakness of text-only training into a manageable engineering problem.

The combination of Image-like Retrieval (making text look like images) and Frequency-based Entity Filtering (using retrieval consensus to find objects) allows the model to generate accurate, detailed captions without the immense cost of collecting paired image-text data.

This work suggests that in the era of large pre-trained models like CLIP, we can achieve remarkable results not just by building bigger models, but by better aligning the representations we already have. For students and researchers in multimodal AI, IFCap serves as a perfect example of how statistical intuition and geometric alignment can solve deep learning problems.