Introduction

If you have ever tried to build a computer vision system that understands video, you have likely encountered the “sampling dilemma.”

Videos are essentially heavy stacks of images. To process a video using a Multimodal Large Language Model (MLLM), the standard approach is Uniform Frame Sampling. You extract one frame every second (or every few frames), encode each one as an image, stack them up, and feed them into the model.

This approach creates a difficult trade-off:

  1. Sample too sparsely (e.g., 1 frame per second): You save on computation, but you lose the motion. If Frame A shows a hand holding a cup, and Frame B shows the cup on the table, did the person place it gently or drop it? You can’t tell.
  2. Sample densely (e.g., 30 frames per second): You capture the motion, but the computational cost skyrockets. You end up processing thousands of visual tokens, many of which are redundant (the background wall didn’t change for 100 frames!).

Is there a better way? A recent paper titled “Efficient Motion-Aware Video MLLM” suggests that we have been looking at the data wrong. Instead of treating video as a sequence of decoded images, we should look at how the video is stored.

The researchers introduce EMA, a model that reads the compressed video structure directly. By leveraging the way video codecs (like H.264) already separate static backgrounds from moving objects, EMA achieves state-of-the-art performance with significantly lower computational costs.

In this post, we will tear down how modern video MLLMs work, why the compressed domain is a goldmine for AI, and how the EMA architecture unifies spatial and motion understanding.


The Problem: The Inefficiency of Decoded Frames

To understand why EMA is a breakthrough, we first need to visualize the inefficiency of current methods.

Most current Video MLLMs treat video understanding as a “multiple image understanding” task. They decode the video into a raw stream of RGB images and then encode each image separately. The Large Language Model (LLM) is then expected to figure out the temporal relationships—the “motion”—by comparing these static image features.

This is inefficient because videos are naturally redundant. If a person is speaking in front of a static background, 90% of the pixels don’t change from frame to frame. Re-encoding that static background 30 times a second is a waste of resources.

Furthermore, uniform sampling often misses the nuances of motion.

Figure 1. Comparison of sampling from decoded frames versus GOPs (Groups of Pictures) from the compressed video stream. Compressed video encoding generates tokens at only 1/T the length of sampled frames for the same clip, while capturing motion information more directly. It also shows greater efficiency on Video-QA (average of MSVD-QA and MSRVTT-QA [35]) and MotionBench within our EMA framework, achieving higher accuracy with less inference time (the red arrow shows the inference speed-up).

As shown in Figure 1 above:

  • Path (a) Decoded Frame Stream: To capture the motion of the hand, you need to sample many frames (\(i\) through \(i+3\)). This results in a massive sequence of tokens (\(T \times N\)).
  • Path (b) Compressed Video Stream: The authors propose using the structure already present in the file. By using one key frame (Frame \(i\)) and the Motion Vectors (the instructions on how pixels move), they can represent the same sequence with a fraction of the tokens.

The graph in Figure 1 highlights the result: The EMA method (using GOPs) achieves higher accuracy on benchmarks like VideoQA and the new MotionBench while being significantly faster (red arrow).
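To make the token budget concrete, here is a quick back-of-the-envelope sketch. The frame rate, clip length, and tokens-per-frame values are illustrative assumptions, not the paper's measured settings:

```python
# Back-of-the-envelope token budget: dense frame sampling vs. one GOP per segment.
# All numbers below are illustrative assumptions, not the paper's settings.

TOKENS_PER_FRAME = 196   # tokens per encoded frame after pooling (assumed)
CLIP_SECONDS = 60
FPS_DENSE = 30           # decode-everything, sample-densely baseline
GOP_SECONDS = 2          # assume one GOP (keyframe + motion vectors) every 2 seconds

dense_tokens = CLIP_SECONDS * FPS_DENSE * TOKENS_PER_FRAME      # roughly T x N
gop_tokens = (CLIP_SECONDS // GOP_SECONDS) * TOKENS_PER_FRAME   # roughly N per GOP

print(f"dense frame sampling: {dense_tokens:,} visual tokens")  # 352,800
print(f"GOP-based encoding:   {gop_tokens:,} visual tokens")    # 5,880
print(f"reduction:            {dense_tokens // gop_tokens}x")   # 60x
```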


Background: A Crash Course in Video Compression

To appreciate the EMA architecture, we need a quick refresher on how videos are actually compressed (e.g., in MPEG-4 or H.264). Video codecs don’t store every frame as a full picture. Instead, they use a concept called GOP (Group of Pictures).

A GOP typically consists of three types of frames:

  1. I-Frames (Intra-coded frames): These are full, standalone RGB images. They are expensive to store but provide the complete visual context.
  2. P-Frames (Predictive-coded frames): These frames only store the changes from the previous frame. They rely heavily on Motion Vectors.
  3. B-Frames (Bi-predictive frames): These use information from both previous and future frames.

What is a Motion Vector?

A Motion Vector (MV) tells the decoder: “Take the block of pixels at position \((x, y)\) in the previous frame and move it to \((x', y')\).”

\[ MV(x, y) = (x' - x,\; y' - y) \]

This vector captures the trajectory of objects. If a car drives across the screen, the I-frame shows the car, and the subsequent P-frames contain a list of vectors pointing in the direction of the car’s movement.
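To see what a motion vector buys you, here is a toy sketch of motion-compensated prediction. The `apply_motion_vector` helper is purely illustrative: real codecs work on macroblocks with sub-pixel precision and add residual corrections on top.

```python
import numpy as np

def apply_motion_vector(prev_frame, block_xy, block_size, mv):
    """Copy one block from the previous frame to its predicted new position.
    A stand-in for motion-compensated prediction; bounds checks, sub-pixel
    precision, and residual corrections are all omitted."""
    x, y = block_xy
    dx, dy = mv                                        # MV(x, y) = (x' - x, y' - y)
    block = prev_frame[y:y + block_size, x:x + block_size]
    pred = np.zeros_like(prev_frame)
    pred[y + dy:y + dy + block_size, x + dx:x + dx + block_size] = block
    return pred                                        # the codec then adds a small residual

prev = np.random.randint(0, 255, (64, 64), dtype=np.uint8)    # toy grayscale reference frame
predicted = apply_motion_vector(prev, block_xy=(16, 16), block_size=16, mv=(4, -2))
```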

Figure 2. Description of the H.264 codec modes for a single GOP (Group of Pictures). P/B-frames are decoded sequentially in the decoding order. Decoding relies on motion vectors, which record the movement of each macroblock in the current frame relative to those in reference frames.

Figure 2 illustrates this structure perfectly. You have one detailed I-frame at the start. The subsequent frames are reconstructed using Motion Vectors (the arrows showing where macroblocks moved) and Residuals (small corrections for errors in prediction).

Why does this matter for AI?

The authors of EMA realized that this compression structure is naturally Slow-Fast:

  • Slow path: The I-frames provide high-resolution spatial details (colors, objects, scenes) but occur infrequently.
  • Fast path: The Motion Vectors provide high-frequency temporal information (movement, trajectory) at every frame, yet the data itself is sparse and lightweight.

Instead of throwing this structure away by decoding everything into RGB pixels, EMA feeds this compressed structure directly into the model.


The Core Method: EMA Architecture

The Efficient Motion-Aware (EMA) model replaces the standard “Frame Encoder” found in most MLLMs with a specialized GOP Encoder.

Let’s break down how it transforms a video clip into understandable tokens.

1. The Input: Compressed Video Segments

The model segments the video into GOP units. Instead of a stack of images, the input for a single segment looks like this:

\[ \mathrm{Input}_{\mathrm{Video}} = \underbrace{\left[ I_{1}, MV_{(1,1)}, MV_{(1,2)}, \ldots, MV_{(1,M)} \right]}_{\mathrm{GOP}_{1}}, \ldots, \underbrace{\left[ I_{N}, MV_{(N,1)}, MV_{(N,2)}, \ldots, MV_{(N,M)} \right]}_{\mathrm{GOP}_{N}} \]

Here, \(I_1\) is the I-frame (the full picture), and \(MV_{(1,1)}...MV_{(1,M)}\) are the motion vector frames that follow it. This input is much lighter than \(M\) full RGB images.
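Here is one way to picture this layout in code; the `GOP` container, the tensor shapes, and the 2-second GOP length are assumptions for illustration, not the paper's actual preprocessing:

```python
from dataclasses import dataclass
from typing import List, Tuple
import torch

@dataclass
class GOP:
    i_frame: torch.Tensor          # (3, H, W) decoded RGB keyframe I_k
    motion_vectors: torch.Tensor   # (M, 2, H/16, W/16) per-macroblock (dx, dy) maps

def build_video_input(gops: List[GOP]) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Arrange the clip as [I_1, MV_(1,1..M)], ..., [I_N, MV_(N,1..M)]."""
    return [(g.i_frame, g.motion_vectors) for g in gops]

# A 2-second GOP at 30 fps: one 224x224 keyframe plus M = 59 motion-vector maps
# (one per P/B-frame), laid out on a 14x14 grid of 16x16 macroblocks.
gop = GOP(torch.rand(3, 224, 224), torch.zeros(59, 2, 14, 14))
video_input = build_video_input([gop] * 4)   # a clip made of N = 4 GOPs
```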

2. The GOP Encoder

This is the heart of the paper. The GOP Encoder needs to process these two very different types of data (RGB pixels and Motion Vectors) and combine them.

Figure 3. An illustrative diagram of the overall model architecture. The compressed-domain video stream is divided into GOPs (Groups of Pictures), and each GOP is encoded using our designed GOP encoder. After concatenation, the encoded GOPs are input into the LLM along with text instructions. On the left side of the figure is the detailed structure of the GOP encoder, which decouples frame and motion encoding. It fuses the frame features with the aggregated motion feature sequence to produce a fixed-length GOP feature containing both spatial and motion information.

As shown in Figure 3, the architecture operates in parallel branches:

Branch A: The Frame Encoder (Spatial)

The I-frame (\(I_k\)) is passed through a standard Vision Encoder (specifically SigLIP). This extracts the “Spatial Semantics”—what objects are in the scene, the colors, the lighting, etc. To keep things efficient, they use pooling to reduce the number of tokens.

\[ \pmb{F}_{k}^{I} = \mathrm{Pooling}\left( \mathrm{Enc}_{I}(I_{k}) \right) \]
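A minimal sketch of this branch in PyTorch, with a generic vision backbone standing in for SigLIP and simple average pooling over the token axis; the pooling choice and interfaces are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FrameBranch(nn.Module):
    """Spatial branch: F_I = Pooling(Enc_I(I_k))."""
    def __init__(self, vision_encoder: nn.Module, pool_stride: int = 2):
        super().__init__()
        self.vision_encoder = vision_encoder        # a SigLIP-like ViT, assumed to return (B, N, D)
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, i_frame: torch.Tensor) -> torch.Tensor:
        tokens = self.vision_encoder(i_frame)       # (B, N_patches, D) patch tokens
        pooled = self.pool(tokens.transpose(1, 2))  # average-pool along the token axis
        return pooled.transpose(1, 2)               # (B, N_patches / pool_stride, D)
```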

Branch B: The Motion Encoder (Temporal)

The Motion Vectors are processed separately. Since motion vectors are just sparse lists of displacements \((dx, dy)\), they don’t need a massive image encoder. The authors use a lightweight Transformer.

Crucially, because motion happens over time, they inject a Temporal Embedding (\(PosEmbed(t)\)) so the model knows the order of movements.

\[ \pmb{F}_{(k,t)}^{MV} = \mathrm{Enc}_{MV}\Big( \mathrm{Patchify}(MV_{k,t}) + \mathrm{PosEmbed}(t) \Big), \quad t = 1, \ldots, M \]

After encoding the sequence of motion vectors, the model aggregates them into a single motion representation for that GOP:

\[ \pmb{F}_{k}^{MV} = \mathrm{Aggregator}\Big( \big[ \pmb{F}_{(k,t)}^{MV} \big]_{t=1}^{M} \Big) \]
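Here is a minimal sketch of the motion branch: each motion-vector map is patchified, offset by a temporal position embedding, passed through a small Transformer, and then averaged over time as a stand-in for the aggregator. The layer sizes, 2x2 patch size, and mean-pooling aggregator are all assumptions:

```python
import torch
import torch.nn as nn

class MotionBranch(nn.Module):
    """F_MV(k,t) = Enc_MV(Patchify(MV_kt) + PosEmbed(t));  F_MV(k) = Aggregator([...])."""
    def __init__(self, patch_dim: int = 8, d_model: int = 256, max_t: int = 64):
        super().__init__()
        self.patchify = nn.Linear(patch_dim, d_model)       # flatten 2x2 patches of (dx, dy)
        self.temporal_embed = nn.Embedding(max_t, d_model)  # PosEmbed(t)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mv_frames: torch.Tensor) -> torch.Tensor:
        # mv_frames: (M, 2, H, W) per-frame (dx, dy) maps for one GOP
        M, C, H, W = mv_frames.shape
        patches = mv_frames.unfold(2, 2, 2).unfold(3, 2, 2)          # (M, 2, H/2, W/2, 2, 2)
        patches = patches.reshape(M, C, -1, 4).permute(0, 2, 1, 3).reshape(M, -1, C * 4)
        t = torch.arange(M)
        tokens = self.patchify(patches) + self.temporal_embed(t)[:, None, :]
        encoded = self.encoder(tokens)                               # (M, P, d_model)
        return encoded.mean(dim=0)                                   # aggregate over time -> (P, d_model)

motion_feat = MotionBranch()(torch.zeros(59, 2, 28, 28))             # -> (196, 256)
```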

3. Fusion: Bringing Space and Time Together

Now the model has a static image representation (\(F^I\)) and a summarized motion representation (\(F^{MV}\)). It fuses them using a Cross-Attention Mechanism.

Think of this as the model “painting” the motion onto the static image. The static image acts as the Query (\(Q\)), while the motion features act as the Key (\(K\)) and Value (\(V\)). The model looks at the static image and asks, “For this object at this location, what is its movement history?”

\[ \begin{aligned} \pmb{F}_{k}^{\mathrm{attn}} &= \mathrm{Attention}\big( Q = \pmb{F}_{k}^{I},\; K, V = \pmb{F}_{k}^{MV} \big) \\ \pmb{F}_{k}^{\mathrm{GOP}} &= \mathrm{FFN}\big( \pmb{F}_{k}^{\mathrm{attn}} \big) + \pmb{F}_{k}^{I} \end{aligned} \]

The result is a set of GOP Features that combine the high-resolution detail of the keyframe with the dynamic trajectory of the movement, all packed into the same token count as a single image.
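A minimal sketch of this fusion block in PyTorch; the head count, hidden sizes, and single-block depth are assumptions:

```python
import torch
import torch.nn as nn

class GOPFusion(nn.Module):
    """F_attn = Attention(Q=F_I, K=V=F_MV);  F_GOP = FFN(F_attn) + F_I."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, frame_tokens, motion_tokens):
        # frame_tokens: (B, N_I, D) from the I-frame; motion_tokens: (B, N_MV, D) from the MVs.
        attended, _ = self.cross_attn(query=frame_tokens, key=motion_tokens, value=motion_tokens)
        return self.ffn(attended) + frame_tokens   # residual keeps the spatial detail intact

fused = GOPFusion()(torch.rand(1, 98, 256), torch.rand(1, 196, 256))  # -> (1, 98, 256)
```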

4. Integration with LLM

Finally, these visual tokens are projected via an MLP and fed into a Large Language Model (like Qwen2-7B) alongside the user’s text prompt.

\[ \pmb{X}_{V} = \left[ TP_{1}, \mathrm{MLP}(\pmb{F}_{1}^{\mathrm{GOP}}), \ldots, TP_{N}, \mathrm{MLP}(\pmb{F}_{N}^{\mathrm{GOP}}) \right] \]

(Note: \(TP\) stands for Temporal Prompt, giving the LLM a text hint about which segment is which, e.g., “Segment 1”).
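A rough sketch of how such a sequence could be assembled; the `embed_prompt` stand-in, projector sizes, and feature dimensions are placeholders rather than the paper's actual interfaces:

```python
import torch
import torch.nn as nn

def build_llm_inputs(gop_features, projector, embed_prompt):
    """Interleave temporal prompts with projected GOP features:
    X_V = [TP_1, MLP(F_1^GOP), ..., TP_N, MLP(F_N^GOP)]."""
    pieces = []
    for k, feat in enumerate(gop_features, start=1):
        pieces.append(embed_prompt(f"Segment {k}: "))  # TP_k, embedded as text tokens
        pieces.append(projector(feat))                 # MLP projection into the LLM's space
    return torch.cat(pieces, dim=0)

# Dummy wiring (dimensions are assumptions): 256-d GOP features into a 4096-d LLM.
projector = nn.Sequential(nn.Linear(256, 4096), nn.GELU(), nn.Linear(4096, 4096))
embed_prompt = lambda text: torch.zeros(len(text.split()), 4096)  # stand-in for the LLM's text embedder
x_v = build_llm_inputs([torch.rand(98, 256) for _ in range(4)], projector, embed_prompt)
print(x_v.shape)   # 4 segments x (2 prompt tokens + 98 visual tokens) -> (400, 4096)
```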


Introducing MotionBench

How do we know if a model actually understands motion?

Standard VideoQA benchmarks often ask questions that can be answered by looking at a single static frame (e.g., “Is the man wearing a hat?”). This doesn’t test if the model understands trajectory.

To address this, the authors created MotionBench, a dataset specifically designed to evaluate four distinct types of motion:

  1. Linear: Movement in a straight line.
  2. Curved: Arcs, throws, and parabolas.
  3. Rotation: Spinning objects.
  4. Contact: Collisions and interactions.

Figure 4. Examples of data from MotionBench. We show four types of examples: linear, curved, rotation, and contact. Yellow arrows in the video frames indicate the motion trajectories of the same specified object. Different trajectory patterns correspond to different data types.

As seen in Figure 4, distinguishing between these requires tracking an object’s position over time. A static frame cannot tell you if the spinning wheel is moving or stopped, nor can it tell you the trajectory of the thrown object.
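As a toy illustration of why single frames are not enough, the sketch below labels a tracked 2D trajectory as roughly linear or curved from its points alone. The heuristic is purely illustrative and not how MotionBench was constructed; it just shows that trajectory categories only emerge once you have positions over time:

```python
import numpy as np

def trajectory_type(points: np.ndarray, tol: float = 0.05) -> str:
    """Label a tracked 2D trajectory of shape (T, 2) as 'linear' or 'curved'
    by measuring how much it spreads off its best-fit straight line."""
    centered = points - points.mean(axis=0)
    _, s, _ = np.linalg.svd(centered, full_matrices=False)  # principal axes of the path
    deviation = s[1] / (s[0] + 1e-8)                         # relative off-axis spread
    return "linear" if deviation < tol else "curved"

t = np.linspace(0, 1, 20)
print(trajectory_type(np.stack([t, 2 * t], axis=1)))   # straight path  -> 'linear'
print(trajectory_type(np.stack([t, t ** 2], axis=1)))  # parabolic path -> 'curved'
```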


Experiments and Results

The researchers compared EMA against several state-of-the-art Video MLLMs, including Video-LLaVA, LLaMA-VID, and Video-LaVIT.

1. Accuracy on VideoQA

On standard benchmarks (MSVD-QA, MSRVTT-QA), EMA outperforms the competition.

Table 1. Comparison with SOTA models on public videoQA benchmarks: MSVD-QA [3]; MSRVTT-QA [36]; ActivityNet-QA [2]. Maximum values are in bold.

Table 1 shows that EMA achieves the highest accuracy across the board. Notably, it beats Video-LaVIT, another model that uses compressed video inputs. The authors attribute this to their superior fusion strategy (Cross-Attention) compared to Video-LaVIT’s simple concatenation.

2. Efficiency Analysis

This is where EMA truly shines. Because it uses sparse Motion Vectors instead of decoding dense frames, it processes video much faster.

Table 3. Efficiency analysis under different models, video segment types, and segment counts. We report the inference time and maximum visual token count for MLLM generation. We use model-f and model-g to denote models trained with frame and GOP inputs, respectively. Model-D refers to the final model used as EMA. Maximum values are in bold, and second-highest values are underlined.

Table 3 provides a fascinating breakdown:

  • Token Count: EMA uses only 648 visual tokens compared to Video-LLaVA’s 2048.
  • Inference Time: EMA is roughly 3x faster than Video-LLaVA (127.1ms vs 391.4ms).
  • GOP vs. Frame: The table compares “model-f” (trained on frames) against “model-g” (trained on GOPs). Even with the same number of segments, the GOP approach yields higher accuracy, showing that Motion Vectors carry a valuable signal that raw frames either miss or would need more complex encoders to extract.

3. Ablation Studies

The authors also performed ablation studies to prove their design choices.

Table 4. Ablation experiments on the method design. We compared the construction of the fusion module, the encoding modes of motion positional information, and the model training strategies. We evaluated the performance of all models on MSVD-QA [35], MSRVTT-QA [35], and MotionBench. Exp 0 is the final setting we utilized. Maximum values are in bold, and second-highest values are underlined.

Table 4 highlights the importance of the Fusion Module. Using Cross-Attention (Exp 0) significantly outperforms simple Addition (Exp 2) or Concatenation (Exp 3). This confirms that simply having the data isn’t enough; the model must explicitly learn the relationship between the static object and its motion path.

4. Long Video Understanding

Finally, the model isn’t just for short clips. When tested on VideoMME (a benchmark for long videos), EMA held its ground against models designed specifically for long contexts.

Table 2. Performance comparison with state-of-the-art models on long video tasks VideoMME [6]. Maximum values are in bold.

As shown in Table 2, EMA achieves the highest overall score (58.4), proving that the compressed representation is highly scalable.


Conclusion

The EMA paper presents a compelling argument: We should stop fighting against video compression and start working with it.

By using the structure that already exists in video files—high-quality keyframes backed by sparse motion vectors—we can build AI systems that are:

  1. More Efficient: Drastically reducing the number of tokens and inference time.
  2. More Motion-Aware: Directly ingesting trajectory data rather than trying to infer it from static pixel changes.

The introduction of MotionBench further pushes the field to move beyond simple object recognition and toward true dynamic scene understanding. As video content continues to explode in volume, efficient architectures like EMA will likely become the standard for how machines perceive the moving world.