Introduction
In the race toward Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have taken center stage. We have seen models like GPT-4V and Gemini demonstrate incredible proficiency in understanding static images—describing complex scenes, reading handwriting, and even explaining memes. However, the real world is not a series of frozen snapshots; it is a dynamic, continuous flow of visual, auditory, and textual information.
To truly approximate human-level perception, AI must master video analysis. But here lies a significant gap: while MLLM development has surged, the benchmarks used to test them have lagged behind. Most existing video benchmarks focus on short clips (often just a few seconds long) or lack the diverse data modalities (like subtitles and audio) that make video such a rich medium.
Enter Video-MME, a groundbreaking benchmark introduced in a recent paper by researchers from USTC, XMU, HKU, and others. Video-MME represents the first comprehensive, full-spectrum evaluation designed specifically for MLLMs in video analysis. By challenging models with videos ranging from 11 seconds to a full hour, and incorporating subtitles and audio, this research provides a sober look at where current AI stands—and the significant hurdles that remain in processing long-form temporal data.
The Problem with Current Benchmarks
Before diving into Video-MME, it is essential to understand why a new benchmark was necessary. Previous efforts to evaluate video understanding in AI have been constrained by three main limitations:
- Limited Duration: Benchmarks like MSRVTT-QA or MSVD-QA typically use videos averaging 10 to 15 seconds. While useful for checking whether a model can recognize an action (e.g., “running”), such short clips cannot test a model’s ability to track a narrative or context over a longer period.
- Lack of Diversity: Many datasets focus on specific domains, such as instructional videos (How2QA) or first-person views (EgoSchema), lacking a broad representation of real-world scenarios like movies, sports, or news.
- Missing Modalities: Video is rarely just visual. It involves dialogue (subtitles) and ambient sound (audio). Most previous benchmarks ignore these layers, forcing models to rely on vision alone, which is not how humans consume video content.
The table below illustrates how Video-MME compares to these predecessors. Note the significant jump in “Average Duration” and the inclusion of “Subtitle & Audio” support.

Constructing Video-MME: A Full-Spectrum Approach
To address these gaps, the researchers constructed Video-MME with a focus on diversity, duration, and data breadth.
1. Domain Diversity
The dataset does not rely on a single source or genre. It spans 6 primary visual domains, broken down into 30 subfields:
- Knowledge: Technology, Humanities, etc.
- Film & Television: Movies, TV series.
- Sports Competition: Basketball, Football, etc.
- Artistic Performance: Magic shows, Acrobatics.
- Life Record: Vlogs, Travel.
- Multilingual: Videos serving different linguistic contexts.
This hierarchy ensures that a model cannot achieve a high score simply by overfitting to a specific type of video content.

2. Temporal Duration
Perhaps the most critical contribution of Video-MME is its classification of video lengths. The dataset is divided into:
- Short Videos: 11 seconds to 2 minutes.
- Medium Videos: 4 to 15 minutes.
- Long Videos: 30 to 60 minutes.
As shown in the charts above, the dataset maintains a balanced distribution across these lengths. This structure allows researchers to pinpoint exactly where a model fails. Does it handle a TikTok-style clip well but forget the beginning of a one-hour documentary? This segmentation is key to diagnosing the “context window” limitations of modern LLMs.
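To make the split concrete, here is a minimal sketch of how videos could be bucketed by duration using the cutoffs above. How durations that fall between the stated ranges (e.g., 2 to 4 minutes) would be handled is an assumption, since the benchmark simply draws its videos from within the three ranges.

```python
def duration_bucket(seconds: float) -> str:
    """Assign a Video-MME-style length category to a video duration.

    Cutoffs follow the ranges quoted above; boundary handling is an
    illustrative assumption, not the paper's procedure.
    """
    if seconds <= 2 * 60:
        return "short"   # 11 s - 2 min
    if seconds <= 15 * 60:
        return "medium"  # 4 - 15 min
    return "long"        # 30 - 60 min


print(duration_bucket(45))       # short
print(duration_bucket(10 * 60))  # medium
print(duration_bucket(50 * 60))  # long
```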
3. High-Quality Annotations
Automated benchmarks often suffer from noise. To ensure reliability, Video-MME utilizes rigorous manual labeling. Expert annotators watched all 900 videos and created 2,700 multiple-choice questions (3 per video).
These questions are not simple “what color is the car” queries. They are designed to test Perception (recognizing objects), Reasoning (deducing why something happened), and Synopsis (summarizing events).
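To picture what a single annotation might carry, here is a hypothetical record layout; the field names are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass


@dataclass
class VideoMMEQuestion:
    """Illustrative annotation record (field names are assumed, not official)."""
    video_id: str
    domain: str          # one of the 6 primary visual domains
    subfield: str        # one of the 30 subfields
    duration_class: str  # "short" | "medium" | "long"
    task_type: str       # e.g. perception, reasoning, synopsis
    question: str
    options: list[str]   # the multiple-choice options
    answer: str          # the correct option


# 900 videos x 3 questions per video = 2,700 questions in total.
```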

Look at the examples in Figure 1 above.
- Left Example: The model must identify a specific date. To do this, it has to read a timestamp (“Day 1 is May 31”), listen to audio or read subtitles to identify the location (“Yosemite”), and perform arithmetic to determine the date of departure.
- Right Example: The question asks how a man sustained an injury. The answer requires connecting a scene at the 03:35 mark (the injury) with a scene at 27:30 where the character reappears with a bandage.
This requires temporal logic—the ability to hold information in memory for nearly 30 minutes and connect two distant events.
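To make the distance between those two scenes explicit, here is a quick timestamp calculation using the values from the example above:

```python
def to_seconds(ts: str) -> int:
    """Convert an 'MM:SS' timestamp to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


injury_scene = to_seconds("03:35")   # the moment the injury happens
bandage_scene = to_seconds("27:30")  # the character reappears bandaged

gap = bandage_scene - injury_scene
print(f"{gap} seconds apart")        # 1435 s, roughly 24 minutes of footage
```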
4. Quality Control and “Certificate Length”
To verify that the questions actually require watching the video (rather than just guessing based on the text), the researchers employed a “blind” test using Gemini 1.5 Pro. If the AI could answer the question using only the text prompt without seeing the video, the question was discarded.
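The paper only states that text-answerable questions were removed; below is a minimal sketch of what such a filtering loop could look like, assuming a hypothetical `ask_text_only` wrapper around a text-only model call.

```python
def filter_blind_answerable(questions, ask_text_only):
    """Keep only questions that a text-only model fails to answer.

    `ask_text_only(question, options)` is a hypothetical helper that
    queries a model (e.g. Gemini 1.5 Pro) with the question text and
    options alone, returning its chosen option, with no video input.
    """
    kept = []
    for q in questions:
        guess = ask_text_only(q["question"], q["options"])
        if guess == q["answer"]:
            continue       # answerable "blind" -> discard the question
        kept.append(q)     # the video is genuinely required
    return kept
```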
Furthermore, the authors analyzed the Certificate Length. This metric represents the minimum amount of video footage a human needs to watch to verify an answer.

As seen in Table 3 (above), the average certificate length for “Long” videos is about 968 seconds (over 16 minutes). This confirms that Video-MME is significantly more challenging and requires deeper engagement with the content than datasets like EgoSchema, whose certificate lengths are far shorter.
Experiments and Key Findings
The researchers evaluated a wide range of models, including commercial giants (GPT-4V, GPT-4o, Gemini 1.5 Pro) and open-source contenders (Video-LLaVA, LLaVA-NeXT-Video). The results provide a fascinating snapshot of the current state of AI.
1. The Commercial vs. Open-Source Gap
The standout performer was Gemini 1.5 Pro, achieving an overall accuracy of 75%. It significantly outperformed its commercial rival, GPT-4o (71.9%), and left open-source models far behind (the best open-source model, VILA-1.5, reached 59%).
This dominance is largely attributed to Gemini’s massive context window, allowing it to process more frames and textual data from long videos without losing track of earlier information.
2. The Bottleneck of Counting and Reasoning
While models are becoming excellent at general perception (identifying objects), they struggle with counting and complex reasoning.

The radar chart above highlights a “joint bottleneck.” Notice how all models—even the high-performing Gemini 1.5 Pro (represented by the outermost shape)—dip significantly on Counting Problems and Temporal Perception. This indicates that while MLLMs can “see” the video, they still struggle to quantify elements or precisely sequence events in time.
3. The Impact of Modality (Subtitles & Audio)
Video is a multimodal experience. The study found that integrating subtitles and audio significantly boosts performance, particularly for long videos where visual information alone might be sparse or ambiguous.

In Table 5 (above), look at the “Long” category. When Gemini 1.5 Pro uses only frames, it scores 67.4%. When subtitles are added, accuracy jumps to 77.4%, a gain of ten percentage points. This suggests that text (dialogue) provides a crucial “anchor” for the model to navigate long temporal contexts.
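One simple way to picture the two evaluation settings is in how the textual prompt is assembled; the wording below is an illustrative assumption, not the benchmark's exact template (the sampled frames are supplied to the model separately).

```python
def build_prompt(question: str, options: list[str], subtitles: str | None = None) -> str:
    """Assemble the text portion of an evaluation prompt.

    With `subtitles=None` this corresponds to the frames-only setting;
    passing the subtitle text reproduces the "with subtitles" setting.
    """
    parts = []
    if subtitles:
        parts.append("Subtitles of the video:\n" + subtitles)
    parts.append("Question: " + question)
    parts.append("Options:\n" + "\n".join(options))
    parts.append("Answer with the letter of the best option.")
    return "\n\n".join(parts)
```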
4. Detailed Category Analysis
The impact of these modalities varies by domain. In categories heavily reliant on dialogue, such as “multilingual” or “knowledge,” the boost from subtitles is profound.

The detailed breakdown in Figure 4 shows that for some categories, adding audio (the dark blue bars) provides a distinct advantage over just frames and subtitles. However, for many categories, subtitles (teal bars) provide the most significant leap over frames alone (beige bars). This reinforces the idea that current MLLMs are still very text-centric learners; converting audio to text (subtitles) is often more effective for them than processing raw audio signals.
The “Long Video” Challenge
A consistent trend across all experiments was the performance degradation as videos got longer. Accuracy for almost all models dropped as they moved from Short -> Medium -> Long videos.
Why does this happen?
- Information Sparsity: Most models sample a fixed number of frames (e.g., 8 or 16) regardless of video length. For a 60-minute video, 16 uniformly sampled frames leave gaps of nearly four minutes between snapshots (see the sketch after this list).
- Context Overload: Even for models like Gemini 1.5 Pro that can ingest many frames, managing the “noise” and retaining specific details over thousands of tokens is computationally difficult.
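A back-of-the-envelope sketch of the sparsity problem: with 16 uniformly sampled frames from a 60-minute video, consecutive snapshots sit almost four minutes apart.

```python
def uniform_sample_times(duration_s: float, num_frames: int) -> list[float]:
    """Timestamps (in seconds) of frames sampled uniformly across a video."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]  # midpoint of each segment


times = uniform_sample_times(60 * 60, 16)
gap = times[1] - times[0]
print(f"gap between consecutive frames: {gap:.0f} s")  # 225 s, i.e. 3 min 45 s
```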
Conclusion and Future Directions
Video-MME serves as a reality check for the AI community. While we have made tremendous strides in image understanding, comprehensive video understanding remains unsolved, especially when it involves long durations and complex reasoning.
The paper identifies two critical paths forward:
- Architectural Innovation: We need better methods for handling long contexts, such as “Ring Attention” or more efficient ways to compress video tokens without losing temporal detail.
- Better Training Data: The community needs to move beyond short clips. We need instruction-tuning datasets that specifically teach models to perform temporal reasoning over long sequences.
Video-MME provides the yardstick we need to measure this progress. By exposing the weaknesses in counting, temporal logic, and long-term memory, it sets the stage for the next generation of Multimodal LLMs that can truly watch, listen, and understand the dynamic world around us.