Introduction

In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can understand both text and images—have become the new frontier. A key capability of these models is In-Context Learning (ICL). This is the ability of a model to learn a new task simply by looking at a few examples provided in the prompt, without requiring any updates to its weights (no fine-tuning necessary).

For example, if you want an MLLM to write a funny caption for an image, you might first show it three examples of images with funny captions. The model “gets the idea” and applies that pattern to your new image.

But here is the billion-dollar question: Which examples should you choose?

If you select random examples, performance is often mediocre. To solve this, researchers use retrievers—algorithms designed to hunt through a dataset and find the most relevant “memories” to help the model. However, existing methods for multimodal retrieval have a significant blind spot: they are heavily biased toward visual data, often ignoring the text associated with those images.

In this post, we will dive deep into a research paper that challenges this status quo. We will explore how textual information is the missing link in multimodal retrieval and unpack a novel supervised method called MSIER (Multimodal Supervised In-context Examples Retrieval). This method doesn’t just look for similar pictures; it learns to predict which examples will actually help the model solve the task.

Background: The Anatomy of Multimodal ICL

To understand the innovation, we first need to understand the baseline. How does Multimodal In-Context Learning (M-ICL) work today?

In a standard setup, you have a Query (the test image you want the model to process) and a Memory (a training dataset containing image-text pairs). The goal is to retrieve a few items from the Memory to form a prompt.

Figure 1: An overview of multimodal in-context example retrieval.

As shown in Figure 1, the process follows a pipeline:

  1. Query Input: The system receives an input, such as a photo of a golden retriever.
  2. Retriever: This component scans the memory bank. In most previous works, like the RICES method, the retriever only looks at visual similarity. It might think, “This query is a dog; let me find other pictures that look like this.”
  3. Prompt Construction: The retrieved examples (e.g., a photo of birds) are stacked together with the query.
  4. MLLM Inference: The MLLM processes this sequence to generate an output.
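
To make this pipeline concrete, here is a minimal sketch of RICES-style retrieval and prompt construction. It assumes precomputed CLIP-style image embeddings; the embedding sizes, the `<image:...>` placeholders, and the helper names are illustrative, not the paper's actual code.

```python
import numpy as np

def retrieve_by_image_similarity(query_emb: np.ndarray,
                                 memory_embs: np.ndarray,
                                 k: int = 4) -> list[int]:
    """Step 2 (RICES-style): rank memory items by image-image cosine similarity only."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k].tolist()

def build_prompt(examples: list[dict]) -> str:
    """Step 3: stack the retrieved (image, caption) pairs ahead of the query."""
    lines = [f"<image:{ex['image_id']}> Caption: {ex['caption']}" for ex in examples]
    lines.append("<image:query> Caption:")  # step 4: the MLLM completes this line
    return "\n".join(lines)

# Toy usage with random vectors standing in for CLIP image features.
rng = np.random.default_rng(0)
memory = [{"image_id": f"img{i}", "caption": f"caption {i}"} for i in range(100)]
memory_embs = rng.normal(size=(100, 512))
query_emb = rng.normal(size=512)

top = retrieve_by_image_similarity(query_emb, memory_embs, k=4)
print(build_prompt([memory[i] for i in top]))
```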

The problem illustrated in Figure 1 is subtle but critical. The retriever selected an image of birds and a restaurant. While they might share some low-level visual features or just be random selections, they don’t necessarily teach the model how to caption the dog image.

The researchers behind this paper posed a fundamental hypothesis: Since MLLMs process both vision and language, shouldn’t our retrieval systems consider the text of the examples, not just the pixels?

The Unsupervised Discovery: Does Text Matter?

Before building a complex new system, the authors first had to prove that text makes a difference. They conducted an investigation using unsupervised retrieval (finding similar examples without training a specific neural network to do so).

They compared two settings:

  1. Q-I-M-I (Query-Image-Memory-Image): This is the standard approach. The system calculates the cosine similarity between the query image and the memory images. Text is ignored.
  2. Q-I-M-IT (Query-Image-Memory-Image+Text): Here, the system calculates similarity based on the image and the caption associated with the memory image.
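
A rough sketch of the two settings, assuming a CLIP-style encoder that maps images and captions into a shared embedding space. Averaging the image and text similarities is one simple way to combine them; the paper may weight or combine them differently.

```python
import numpy as np

def _norm(x: np.ndarray) -> np.ndarray:
    """L2-normalize vectors so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score_q_i_m_i(q_img: np.ndarray, m_img: np.ndarray) -> np.ndarray:
    """Q-I-M-I: similarity between the query image and the memory images only."""
    return _norm(m_img) @ _norm(q_img)

def score_q_i_m_it(q_img: np.ndarray, m_img: np.ndarray, m_txt: np.ndarray) -> np.ndarray:
    """Q-I-M-IT: also compare the query image against each memory caption's text embedding."""
    image_sim = _norm(m_img) @ _norm(q_img)
    text_sim = _norm(m_txt) @ _norm(q_img)
    return 0.5 * (image_sim + text_sim)
```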

The results were immediate and striking.

Figure 3: Comparison of Image-only vs. Image+Text retrieval performance.

Figure 3 shows the performance on an Image Captioning task (measured by the CIDEr score, where higher is better) as the number of “shots” (examples) increases. The red line (Q-I-M-IT), which includes text, consistently outperforms the blue line (Q-I-M-I).

This confirmed the intuition: Textual information is not just “extra” metadata; it is a crucial signal for selecting high-quality in-context examples.

Core Method: MSIER

Establishing that text matters was step one. Step two was addressing the limitation of unsupervised retrieval. Just because an image looks similar, or a caption reads similarly, doesn’t guarantee it will help the MLLM perform better. The ultimate metric of a good example is: Does including this example reduce the model’s error on the query?

To optimize for this, the authors introduced MSIER (Multimodal Supervised In-context Examples Retrieval).

The Concept: Learning to Retrieve

MSIER is a supervised method. This means it requires a training phase where a retriever model effectively “learns” which examples are helpful and which are not.

The method operates in two distinct stages: Scoring and Training.

Figure 2: Overview of the MSIER Method.

Stage 1: Scoring with the MLLM

Look at Figure 2. The process begins with the “Retriever” selecting a broad set of candidates (Top-N) from the training data. But we don’t know which of these 50 or so candidates are actually good.

To find out, the researchers use the MLLM itself as a judge.

  1. They take a training instance (e.g., an image of food).
  2. They pair it with different candidates retrieved from memory.
  3. They ask the MLLM to generate the caption and measure the NLL Loss (Negative Log-Likelihood).

If a candidate example causes the MLLM to be very confident and accurate (Low NLL Loss), it is labeled a Positive sample. If it leads to confusion or poor results (High NLL Loss), it is labeled a Negative sample.

In Figure 2, the examples with a high CIDEr score (107.41) are the “Positive” pairs, while those with lower scores (81.71) are “Negative.”
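
A sketch of this scoring stage under stated assumptions: `mllm_nll` stands in for whatever scorer model (e.g., OpenFlamingo) measures the negative log-likelihood of the gold caption when one candidate serves as the in-context example, and splitting the ranked candidates into a fixed number of positives and negatives is one plausible labeling scheme rather than the paper's exact recipe.

```python
from typing import Callable

def label_candidates(
    train_instance: dict,                       # {"image": ..., "caption": ...}
    candidates: list[dict],                     # the Top-N retrieved (image, caption) pairs
    mllm_nll: Callable[[dict, dict], float],    # NLL of the gold caption given one candidate as context
    num_pos: int = 4,
    num_neg: int = 4,
) -> tuple[list[dict], list[dict]]:
    """Use the MLLM itself as a judge: candidates that lower the NLL are positives."""
    ranked = sorted(candidates, key=lambda cand: mllm_nll(train_instance, cand))
    positives = ranked[:num_pos]    # lowest loss -> the MLLM found these most helpful
    negatives = ranked[-num_neg:]   # highest loss -> these confused the MLLM
    return positives, negatives
```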

Stage 2: Contrastive Learning

Now that we have a labeled dataset of “Good” and “Bad” examples for various queries, we can train the retriever.

The goal is to train a model (initialized with CLIP) so that for any given query, the vector representation of the query is close to the vector representations of the “Positive” examples and far from the “Negative” ones. This is achieved using Contrastive Learning.

The mathematical objective is to minimize the loss function \(\mathcal{L}\):

Equation 2: The contrastive loss function.

In this equation:

  • \(x_q\) is the query.
  • \(e^+\) represents the positive example.
  • \(e^-\) represents the negative examples.
  • Minimizing the loss pushes up the cosine similarity between the query and the positive example (the numerator) while pushing down the similarity to the negative examples, which appear in the denominator.
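
This description matches a standard InfoNCE-style contrastive objective. Here is a minimal PyTorch sketch under that assumption; the temperature value and the exact pool of negatives per query are illustrative choices and may differ from the paper's Equation 2.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x_q: torch.Tensor,        # (d,)   query embedding
                     e_pos: torch.Tensor,      # (d,)   positive example embedding e+
                     e_neg: torch.Tensor,      # (n, d) negative example embeddings e-
                     tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: pull the query toward e+ and away from each e-."""
    x_q, e_pos, e_neg = F.normalize(x_q, dim=-1), F.normalize(e_pos, dim=-1), F.normalize(e_neg, dim=-1)
    pos = (x_q @ e_pos) / tau                    # cosine similarity to the positive
    neg = (e_neg @ x_q) / tau                    # cosine similarities to the negatives
    logits = torch.cat([pos.unsqueeze(0), neg])  # the positive sits at index 0
    # Cross-entropy with target 0 equals -log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) )
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```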

The Role of Text in Supervision

The authors didn’t just blindly apply supervision; they revisited their core finding about text. They experimented with different configurations for the supervised training: training with images only vs. training with images and text.

Figure 4: Impact of texts on proposed MSIER method.

Figure 4 illustrates this ablation study. The “T” stands for Training and “E” for Evaluation.

  • Green/Purple bars (T: Q-I-M-I): The retriever was trained using only images.
  • Pink/Blue bars (T: Q-I-M-IT): The retriever was trained using both images and text.

The results are clear: The bars on the right (Pink/Blue), where text was included in the training process, are significantly higher. This proves that incorporating text during the supervised training phase creates a much more robust retriever.

Experiments & Results

The researchers validated MSIER across three distinct multimodal tasks:

  1. Image Captioning (MS COCO dataset)
  2. Visual Question Answering (OK-VQA)
  3. Hateful Memes Classification (Detecting hate speech in memes)

Quantitative Performance

The proposed method demonstrated superior performance across the board. In the image captioning task, for example, MSIER with only 4 shots (4 examples) achieved performance comparable to random selection using 32 shots. This is a massive efficiency gain, allowing MLLMs to perform better with shorter prompts (which saves computational cost and context window space).

The table below (Table 9 from the paper) shows a specific comparison on the MS COCO dataset, highlighting that using MSIER as the retriever consistently yields higher scores than using standard CLIP-based retrieval (MMICES-CLIP).

Table 9: Comparison of M-ICL performance.

Qualitative Analysis

Numbers are great, but what does a “better” example actually look like?

Figure 6: Multimodal in-context examples retrieved by different methods.

Figure 6 provides a fascinating look at the retrieved examples.

  • Row 1 (RICES): The baseline visual retriever sees a tennis player and retrieves… just a generic tennis player.
  • Row 2 (MUIER): The unsupervised method with text gets a bit closer, mentioning the tennis court.
  • Row 3 (MSIER): The supervised method retrieves an example that is semantically dense: “A man holding a tennis racket with a ball in the air.”

By retrieving examples that share deep semantic structures rather than just surface-level visual similarities, MSIER helps the MLLM generate more precise and descriptive captions for the query.

Robustness and Transferability

Two common concerns with supervised methods are sensitivity to order and lack of transferability. The paper addresses both.

1. Does the order of examples matter? In standard Large Language Models, the order of prompt examples can drastically change the output. Surprisingly, Figure 5 shows that for MSIER (the green “Sup” line), performance is relatively stable regardless of how the retrieved examples are ordered. This suggests that when the examples are high-quality, the model is less confused by their arrangement.

Figure 5: Impact of the order of retrieved multimodal in-context examples.

2. Can the retriever transfer to new datasets? Training a retriever for every single dataset is expensive. The authors tested if a retriever trained on OK-VQA could work on MS COCO.

Table 3: Transferability of MSIER.

Table 3 shows that while training on the target dataset (MS COCO) is optimal, a retriever trained on OK-VQA (bottom row) still performs admirably on MS COCO, outperforming the unsupervised baseline. This suggests that MSIER learns general principles of “what makes a good example” that can cross dataset boundaries.

3. Can it transfer between Model Sizes? Perhaps most importantly for practitioners, the authors found that a retriever trained using a smaller “Scorer” model (OpenFlamingo-3B) worked perfectly well for a larger inference model (OpenFlamingo-9B). You don’t need to burn resources scoring with your largest model to train the retriever.

Impact of Masking Text

As a final sanity check, the researchers asked: “What happens if we find the perfect examples, but then delete the text from them in the prompt?”

Table 7: Impact of masked text.

Table 7 shows the devastating impact of masking. If you remove the text captions from the selected examples (rows with “w/ mask”), performance collapses (e.g., from 100.58 down to 77.62 for MSIER). This reinforces the central thesis: the MLLM relies heavily on the textual component of the in-context examples to ground its understanding.
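
To make the ablation concrete, here is a toy illustration of what masking means at the prompt level: the retrieved images stay in place, but their captions are stripped. The placeholder format is an assumption, not the paper's actual prompt template.

```python
def build_prompt(examples: list[dict], mask_text: bool = False) -> str:
    """Interleave retrieved (image, caption) pairs before the query, optionally dropping captions."""
    lines = []
    for ex in examples:
        caption = "" if mask_text else ex["caption"]     # "w/ mask": the caption is removed
        lines.append(f"<image:{ex['image_id']}> Caption: {caption}".rstrip())
    lines.append("<image:query> Caption:")               # the MLLM must complete this line
    return "\n".join(lines)

demo = [{"image_id": "img1", "caption": "A man holding a tennis racket with a ball in the air."}]
print(build_prompt(demo))                  # full in-context demonstration
print(build_prompt(demo, mask_text=True))  # masked: the image alone carries far less signal
```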

Conclusion

The research paper “How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?” offers a course correction for the field of Multimodal AI. It highlights that we have been underutilizing the linguistic half of “Vision-Language” models during the retrieval phase.

By moving from visual-only unsupervised retrieval to text-aware supervised retrieval (MSIER), we can:

  1. Find examples that are semantically relevant, not just visually similar.
  2. Achieve higher accuracy with fewer examples.
  3. Create retrievers that are robust and transferable across tasks and model sizes.

As MLLMs continue to scale, efficient context utilization will become increasingly important. MSIER provides a blueprint for how to feed these models exactly what they need to succeed.