Introduction

The rise of Multimodal Large Language Models (MLLMs) like GPT-4V, LLaVA, and mPLUG-Owl has revolutionized how Artificial Intelligence perceives the world. These models can describe photos, answer questions about diagrams, and even write code based on whiteboard sketches. However, there is a significant gap between these benchmark achievements and real-world utility.

Most current benchmarks focus on single-image scenarios. The model is given one picture and asked a question. Yet human visual experience is rarely confined to a single frame. When we browse a website, we integrate information from multiple product photos and textual descriptions. When we watch a tutorial, we follow a temporal sequence of steps. When we scroll through social media, we process interleaved text and images simultaneously.

If MLLMs are to become truly useful assistants, they must master multi-image understanding.

This brings us to MIBench, a comprehensive benchmark proposed by researchers from the Chinese Academy of Sciences and Alibaba Group. This paper introduces a rigorous framework to evaluate MLLMs not just on what they see in a single glance, but on how they reason, compare, and learn across multiple images. The results are sobering: while models excel at single images, they often crumble when faced with the complexity of multi-image inputs.

Figure 1: Overview of MIBench, which covers three multi-image scenarios and a total of 13 tasks.

Background: The Gap in Evaluation

To understand why MIBench is necessary, we must look at the existing landscape of multimodal evaluation.

Established benchmarks like MME, MMBench, and SEED-Bench have set the standard for evaluating MLLMs. They test recognition, localization, and reasoning, but they almost exclusively rely on single-image inputs.

There have been attempts to bridge this gap. Benchmarks like Sparkles-Eval look at multi-image dialogue, and Mantis-Eval looks at multi-image reasoning. However, these are often limited in scale (containing only a few hundred samples) or rely entirely on GPT-4 for scoring, which can introduce bias.

As shown in the comparison table below, MIBench represents a significant leap in scale and complexity. It introduces 13 distinct tasks across 13,000 samples, utilizing objective metrics rather than subjective model-based scoring.

Table 1: Comparison of the proposed MIBench with recent MLLM benchmarks.

The Core Method: Structuring Multi-Image Evaluation

The researchers categorized multi-image capabilities into three distinct scenarios, creating a taxonomy that helps us understand exactly where a model fails or succeeds.

1. Multi-Image Instruction (MII)

This scenario tests the fundamental ability to process multiple images simultaneously to follow an instruction. It is not enough to recognize objects; the model must understand the relationship between images. This category is broken down into five sub-tasks (a prompt-format sketch follows the list):

  • General Comparison (GC): Can the model identify if two images depict the same scene or attribute?
  • Subtle Difference (SD): A “spot the difference” task. This is incredibly difficult for current AI, requiring fine-grained perception to notice if a floor texture changed or an object moved.
  • Visual Referring (VR): The model must use one image to understand a reference in another. For example, “Is the object in Image 1 located to the left of the object in Image 2?”
  • Temporal Reasoning (TR): Evaluating understanding of sequence and time, similar to video understanding but using keyframes.
  • Logical Reasoning (LR): Analyzing causal relationships. For instance, seeing a sequence of a boy extending his hand and determining why he did it based on context.
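To make the input format concrete, here is a minimal sketch of how a Multi-Image Instruction sample could be packed into a single multiple-choice prompt. The build_mii_prompt helper, the <image> placeholder convention, and the option layout are illustrative assumptions, not MIBench's actual schema.

```python
# Hypothetical sketch of packing a multi-image multiple-choice sample into one
# prompt; the images themselves are passed to the model alongside this text,
# in the same order as the <image> placeholders.

def build_mii_prompt(image_paths, question, options):
    """Interleave image placeholders with a question and lettered options."""
    image_tags = "".join(f"Image {i + 1}: <image>\n" for i in range(len(image_paths)))
    letters = "ABCD"
    option_lines = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"{image_tags}"
        f"Question: {question}\n"
        f"{option_lines}\n"
        "Answer with the option letter only."
    )
    return prompt, image_paths


# Example: a General Comparison style query over two images.
prompt, images = build_mii_prompt(
    ["scene_a.jpg", "scene_b.jpg"],
    "Do the two images depict the same scene?",
    ["Yes, both show the same kitchen",
     "No, the second image shows a different room",
     "The images are identical copies",
     "It cannot be determined"],
)
print(prompt)
```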

2. Multimodal Knowledge-Seeking (MKS)

In the real world, visual information is often accompanied by text, and we use both to answer questions. This scenario provides the model with “external knowledge” in the form of interleaved images and text (like a Wikipedia page or a slide deck) and asks a question that requires synthesizing this information; a prompt-assembly sketch follows the task list.

  • Fine-grained Visual Recognition (FVR): Recognizing specific breeds or types (e.g., dogs, flowers) by comparing the query image against a set of reference images provided in the context.
  • Text-Rich Images (TRI): Extracting information from slides or documents where visual layout and text are crucial.
  • Vision-linked Textual Knowledge (VTK): The question is about a visual entity, but the answer resides in the accompanying text.
  • Text-linked Visual Knowledge (TVK): Conversely, the question is text-based, but the answer requires verifying visual attributes in the images.
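The sketch below shows one plausible way to serialize such interleaved knowledge into a single query. The build_mks_prompt helper and the reference passages are hypothetical, assumed only for illustration; they are not the benchmark's actual data or interface.

```python
# Minimal sketch of assembling an interleaved knowledge context for an
# MKS-style question: reference images and their accompanying text passages
# are serialized in order, followed by the query.

def build_mks_prompt(knowledge, question):
    """knowledge: list of (image_path, text_passage) pairs forming the external context."""
    chunks, images = [], []
    for i, (img, passage) in enumerate(knowledge, start=1):
        images.append(img)
        chunks.append(f"Reference {i}: <image>\n{passage}")
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion: {question}\nAnswer briefly.", images


# Example in the Vision-linked Textual Knowledge spirit: the question targets a
# visual entity, but the answer lives in the accompanying text passage.
prompt, images = build_mks_prompt(
    [("bridge.jpg", "The bridge in this photo opened to traffic in 1998."),
     ("tower.jpg", "The tower in this photo stands 333 metres tall.")],
    "In which year did the bridge shown in the first reference open?",
)
```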

3. Multimodal In-Context Learning (MIC)

In-Context Learning (ICL) is the ability of an LLM to learn a new task simply by seeing a few examples (demos) in the prompt, without weight updates. MIC extends this to vision. Can we show a model three pictures of “defective parts” and three pictures of “good parts,” and have it classify a new image correctly?

The researchers evaluated this via:

  • Close-ended & Open-ended VQA: Answering questions based on patterns seen in demos.
  • Hallucination: Testing if providing factual demos reduces the model’s tendency to make up objects.
  • Demo-based Task Learning: Providing examples without instructions (e.g., “Image: [Rabbit], Text: 1”) to see if the model can infer the counting rule (see the prompt sketch after this list).
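A minimal sketch of how such a demo-based few-shot prompt might be assembled, assuming a simple interleaved “Image/Text” template; the helper name and file paths are placeholders, not the benchmark's actual format.

```python
# Illustrative few-shot multimodal in-context prompt for demo-based task
# learning: each demo pairs an image with a label and no instruction, so the
# model must infer the underlying rule from the pattern alone.

def build_icl_prompt(demos, query_image):
    """demos: list of (image_path, answer) pairs; query_image: path of the test image."""
    images, lines = [], []
    for img, answer in demos:
        images.append(img)
        lines.append(f"Image: <image>\nText: {answer}")
    images.append(query_image)
    lines.append("Image: <image>\nText:")  # the model must complete the pattern
    return "\n\n".join(lines), images


# Three demos that implicitly define a "count the animals" rule.
prompt, images = build_icl_prompt(
    [("one_rabbit.jpg", "1"), ("two_cats.jpg", "2"), ("three_dogs.jpg", "3")],
    "query_birds.jpg",
)
```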

Figure 2: Examples of the multi-image scenarios with a total of 13 tasks. The correct answers are marked in blue.

Data Construction and Quality Control

Creating a benchmark of this magnitude requires more than just scraping images. The authors employed a rigorous pipeline to ensure validity:

  1. Distractor Generation: For multiple-choice questions, the wrong answers (distractors) must be plausible. The researchers used GPT-4 to generate challenging distractors or sampled hard negatives from dataset annotations. For example, in Temporal Reasoning, distractors might describe the same objects but in an incorrect sequence order.
  2. External Knowledge Sampling: For MKS tasks, they didn’t just grab random text. They selected text and images that were relevant but required precise reasoning to avoid giving the model “easy” shortcuts.
  3. Circular Evaluation: To prevent the model from guessing based on option position (e.g., always choosing “A”), the correct answer is rotated through all positions, and the model is only credited if it answers correctly in every configuration (see the sketch after this list).
  4. The “Blind” Test: To ensure the questions actually required vision, they fed the text-only portions to models. If a model could answer the question without seeing the images, the sample was discarded. This removes the “language bias” common in VQA datasets.
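Here is a minimal sketch of circular evaluation as described above, assuming four-option multiple-choice samples and a placeholder model_answer(question, options, images) callable that returns an option letter; neither is MIBench's actual interface.

```python
from collections import deque

def circular_eval(model_answer, question, options, images, correct_idx):
    """Credit the sample only if the model picks the right option under every rotation."""
    letters = "ABCD"  # assumes four options per sample
    opts = deque(options)
    for shift in range(len(options)):
        rotated = list(opts)
        gold = letters[(correct_idx + shift) % len(options)]
        if model_answer(question, rotated, images) != gold:
            return False        # one wrong rotation fails the whole sample
        opts.rotate(1)          # move every option one position down the list
    return True


# Example: a dummy "model" that always answers "A" gets no credit.
always_a = lambda question, options, images: "A"
print(circular_eval(always_a, "Is there a dog in Image 1?",
                    ["Yes", "No", "Cannot be determined", "Only in Image 2"],
                    ["img1.jpg", "img2.jpg"], correct_idx=1))  # -> False
```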

Experiments & Results

The team evaluated a wide range of models, including proprietary giants like GPT-4V and GPT-4o, and open-source models like LLaVA-1.5, Qwen-VL, Mantis, and mPLUG-Owl3.

1. The Performance Gap

The results table below highlights a stark reality: Closed-source models are currently in a league of their own regarding multi-image tasks.

Table 2: Evaluation results on the multi-image instruction and multimodal knowledge-seeking scenarios of MIBench.

Key Observations:

  • GPT-4o Dominance: It leads across almost all categories, particularly in tasks requiring fine details like “Subtle Difference” (SD) and “Visual Referring” (VR).
  • Open-Source Struggles: While models like mPLUG-Owl3 perform decently on general comparison, they collapse on tasks requiring fine-grained perception. For example, in the Subtle Difference task, LLaVA-1.5 achieves only 14.9% accuracy, while GPT-4o achieves 90.5%.
  • The “Visual Referring” Bottleneck: Even the best model (GPT-4o) struggles with Visual Referring, achieving less than 50% accuracy. This suggests that current architectures struggle to map spatial relationships across different images.

2. Qualitative Analysis: Why do they fail?

Numbers tell us that models fail, but images tell us why. In the “Subtle Difference” task, the model must spot small changes between two images.

In the example below, the difference between Image 1 and Image 2 is the addition of a mushroom pizza or an olive oil bottle. Open-source models, likely due to lower image resolution processing or weaker attention mechanisms, fail to attend to these small, localized changes when the overall scene remains identical.

Figure 3: A qualitative case of the Subtle Difference task, where open-source MLLMs show inferior performance due to limited fine-grained perception ability.

3. The Failure of In-Context Learning

One of the most surprising findings in the paper concerns Multimodal In-Context Learning (MIC). In text-only LLMs, providing more examples (shots) almost always improves performance. In Vision-Language models, this is not the case.

Figure 4: Evaluation results on the Multimodal In-Context Learning scenario.

As shown in the charts above:

  • Negative Scaling: For many open-source models (like MMICL), performance actually decreases or remains flat as you add more demonstration images (Figure 4a and 4c).
  • Hallucination Persistence: Providing examples does very little to cure hallucinations (Figure 4c).
  • Format vs. Reasoning: While models can learn the format of the output (e.g., outputting a number), they struggle to learn the reasoning logic (e.g., how to count objects) purely from examples (Figure 4d).

4. The “Multi-Image Confusion” Phenomenon

The researchers identified a critical weakness termed “Multi-Image Confusion.” They ran an ablation study using the POPE (hallucination) dataset.

  • Setup A: Show 1 Image. Ask: “Is there a dog?” (Correct answer: No).
  • Setup B: Show 1 Image + 1 Distractor Image (that contains a dog). Ask about the first image: “Is there a dog?”

The Result: When the second image was introduced, the performance of single-image trained models (like LLaVA-1.5) dropped significantly (Table 4 below). The mere presence of a dog in an adjacent image caused the model to hallucinate a dog in the target image. This indicates a “bleeding” of visual features where the model fails to segregate information from different visual inputs.
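The following is a rough sketch of how such an ablation could be set up; ask_model is a hypothetical callable returning "yes" or "no", and the image pairing is purely illustrative rather than the paper's actual protocol.

```python
# Compare the model's answer to the same POPE-style question with and without
# an unrelated distractor image appended to the input.

def confusion_ablation(ask_model, target_image, distractor_image, question):
    """Return the model's answers in the single-image and two-image setups."""
    single = ask_model([target_image], f"Image 1: <image>\n{question}")
    multi = ask_model(
        [target_image, distractor_image],
        f"Image 1: <image>\nImage 2: <image>\nRegarding Image 1 only: {question}",
    )
    return {"single_image": single, "with_distractor": multi}


# e.g. confusion_ablation(ask_model, "kitchen.jpg", "park_with_dog.jpg",
#                         "Is there a dog in this image?")
```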

Table 4: Ablation study on the multi-image confusion phenomenon and the temporal reasoning task.

Conclusion and Implications

MIBench serves as a reality check for the field of Multimodal AI. It demonstrates that while we have made massive strides in single-image captioning and QA, we are still in the early stages of true multi-image understanding.

Key Takeaways for Students:

  1. Perception vs. Reasoning: Models are better at “seeing” (General Comparison) than “thinking” (Logical Reasoning or Temporal Reasoning).
  2. Resolution Matters: The inability to spot subtle differences highlights the need for architectures that can handle high-resolution inputs efficiently.
  3. Context is Hard: Simply concatenating images into a prompt is not enough. Models need specific training to understand the boundaries and relationships between different visual inputs; otherwise, “confusion” occurs.
  4. Benchmarks Drive Progress: By moving evaluation from single images to realistic interleaved sequences, MIBench provides the roadmap for the next generation of assistants that can truly browse the web and watch videos like humans do.

The future of MLLMs isn’t just about seeing better; it’s about seeing more and understanding the connections between the pieces. MIBench is the measuring stick we need to get there.