Imagine asking an AI to watch a two-hour movie and then asking, “What was the number on the jersey of the man in the background at the very end?” or “How did the protagonist’s relationship with her sister evolve from the first scene to the last?”
For most current multimodal AI models, this is an impossible task. While models like GPT-4V or VideoLLaMA are impressive at analyzing short clips (typically 5 to 15 seconds), they hit a hard limit when the video stretches into minutes or hours. This limit is known as the Memory Wall. As a video gets longer, the number of visual “tokens” (pieces of information) the model must hold in its memory grows with every additional frame. Eventually, the GPU runs out of memory (OOM), or the model gets overwhelmed by noise and forgets the context.
In this post, we are diving deep into a new research paper titled “AdaCM\(^2\): On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction.” This paper proposes a novel solution that doesn’t just compress video blindly; it intelligently decides what to remember based on what you are asking.

As shown in Figure 1, while traditional methods compress video features based only on visual correlation (left), AdaCM\(^2\) (right) uses an adaptive approach that considers the correlation between the video and the text query to reduce memory usage significantly.
The Challenge: Why Long Videos Are Hard for AI
To understand the contribution of this paper, we first need to understand the bottleneck in current Video-Language Models (Video-LLMs).
Most modern Video-LLMs use a Visual Encoder (like a Vision Transformer) to turn video frames into mathematical representations called features. These features are then aligned with text using a component often called a Q-Former (Querying Transformer) before being fed into a Large Language Model (LLM) to generate an answer.
The Memory Explosion
In a standard Transformer architecture, the model maintains a KV Cache (Key-Value Cache). This cache stores the information from previous tokens so the model doesn’t have to re-compute them at every step.
- Short Video: A 10-second clip might generate a few hundred visual tokens. The cache is small.
- Long Video: A 2-hour movie generates millions of tokens. The cache grows linearly with the video length, and the cost of attending over it grows even faster (see the rough estimate after this list).
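For a concrete sense of scale, here is a rough back-of-the-envelope estimate. The per-frame token count, hidden size, layer count, and sampling rate below are illustrative assumptions, not numbers from the paper:

```python
# Rough estimate of how a visual KV cache grows with video length.
# Every constant here is an illustrative assumption, not a value from the paper.

def kv_cache_gib(num_frames: int,
                 tokens_per_frame: int = 256,   # e.g. a ViT patch grid per frame
                 hidden_dim: int = 1024,
                 num_layers: int = 32,
                 bytes_per_value: int = 2) -> float:   # fp16
    """Memory needed to store keys + values across all layers, in GiB."""
    tokens = num_frames * tokens_per_frame
    kv_bytes = tokens * hidden_dim * num_layers * 2 * bytes_per_value  # 2 = K and V
    return kv_bytes / (1024 ** 3)

for seconds in (10, 600, 7200):          # 10 s clip, 10 min video, 2 h movie
    frames = seconds * 1                 # sampling at 1 frame per second
    print(f"{seconds:>5} s -> {kv_cache_gib(frames):7.1f} GiB")
```

Under these assumptions, the two-hour case runs to hundreds of GiB of cache, far beyond any single GPU, which is exactly the wall described above.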
Previous attempts to solve this involved:
- Hard Compression: simply averaging frames or skipping them (losing detail).
- Visual Similarity: Merging frames that look similar (e.g., if a scene is static, merge the tokens).
The flaw in previous methods: They ignore the text. If you ask, “What color is the car?”, the model needs to retain tokens related to the car. If the compression method blindly merges the “car” tokens with the “road” tokens because they appear in the same static shot, the answer is lost.
The Core Observations
The authors of AdaCM\(^2\) built their solution on two key observations regarding how attention works in these models.
Observation 1: Sparsity (Not everything matters)
When a model looks at a frame to answer a specific question, it doesn’t need every pixel. It only needs the “visual tokens” that correlate with the “text tokens” of the question.

Figures 3(a) and 3(b) illustrate this Intra-Frame Cross-Attention Sparsity.
- Figure 3(a): This heatmap shows the attention between text tokens (y-axis) and visual tokens (x-axis). Notice how dark most of the map is? That indicates low correlation. Only a few bright spots (high attention) exist.
- Figure 3(b): The distribution shows that the vast majority of tokens have near-zero attention scores.
Implication: We can safely throw away a lot of visual data if we know it’s not relevant to the text prompt.
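As a sanity check on what “sparsity” means here, the snippet below shows one way to quantify it for a single cross-attention map. The tensor shapes, the 1% threshold, and the random stand-in for a real model’s attention are our illustrative choices, not the paper’s measurement protocol:

```python
import torch

# Quantify cross-attention sparsity for one frame (illustrative sketch).
num_text_tokens, num_visual_tokens = 32, 256
attn = torch.rand(num_text_tokens, num_visual_tokens).softmax(dim=-1)  # stand-in map

# Total attention each visual token receives from all text tokens.
relevance = attn.sum(dim=0)

# Fraction of visual tokens whose relevance is "near zero" (below 1% of the max).
near_zero = (relevance < 0.01 * relevance.max()).float().mean().item()
print(f"near-zero visual tokens: {near_zero:.1%}")

# How much of the total attention mass the top 10% of visual tokens capture.
k = num_visual_tokens // 10
top_mass = (relevance.topk(k).values.sum() / relevance.sum()).item()
print(f"attention mass in top 10% of tokens: {top_mass:.1%}")
```

With a real model’s attention map plugged in, Observation 1 predicts that the first number is large and the second approaches 100%.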
Observation 2: Layer Redundancy
Deep learning models process information in layers. Early layers process raw pixels and edges, while deeper layers process abstract concepts. The researchers found that in deeper layers, the information between adjacent frames is highly similar.
Figure 3(c) shows the cosine similarity of attention scores across frames for different layers. Notice how the lines for deeper layers (like Layer 10 or 12) stay high even as the distance between frames increases. This suggests that redundancy varies across layers, meaning our memory reduction strategy should be adaptive—aggressive in some layers, conservative in others.
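To reproduce the flavor of Figure 3(c), one could compare each layer’s attention scores across adjacent frames. The helper below is a minimal sketch that assumes you have already extracted per-layer, per-frame attention scores as flat vectors (the input format is our assumption, not the paper’s API):

```python
import torch
import torch.nn.functional as F

def adjacent_frame_similarity(attn_by_layer: dict) -> dict:
    """Average cosine similarity of attention scores between adjacent frames,
    per layer. `attn_by_layer[layer]` is a list of 1-D attention-score tensors,
    one per frame (an assumed input format)."""
    result = {}
    for layer, frames in attn_by_layer.items():
        sims = [F.cosine_similarity(a, b, dim=0)
                for a, b in zip(frames[:-1], frames[1:])]
        result[layer] = torch.stack(sims).mean().item()
    return result

# Toy usage: the "deep layer" is deliberately made near-constant across frames,
# mimicking the redundancy the authors observe.
toy = {2:  [torch.rand(256) for _ in range(6)],
       12: [torch.ones(256) + 0.05 * torch.rand(256) for _ in range(6)]}
print(adjacent_frame_similarity(toy))   # layer 12 should be close to 1.0
```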
The Method: AdaCM\(^2\)
The proposed framework, AdaCM\(^2\), stands for Adaptive Cross-Modality Memory Reduction. It is designed to work in a “plug-and-play” manner with existing architectures like BLIP-2.
Here is the high-level architecture:

The workflow consists of three main stages:
- Video Feature Extraction: Using a frozen visual encoder.
- Adaptive Memory Reduction: The core innovation, happening inside the Video Q-Former.
- Text Generation: The LLM produces the answer.
Let’s break down the mathematical engine driving this efficient memory usage.
1. Regressive Query Learning
Instead of feeding the whole video into the model at once (which would exhaust GPU memory), AdaCM\(^2\) processes the video frame by frame in a regressive manner.
First, standard feature extraction occurs with position embeddings to help the model understand time:
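(The equation itself isn’t reproduced in this post; the form below is reconstructed from the description that follows and may differ from the paper’s exact notation.)

\[
f_t = x_t + E(t)
\]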

Here, \(f_t\) is the feature at time \(t\), combining the image feature \(x_t\) and a temporal position embedding \(E(t)\).
As the video progresses, the model builds a Video Cache. The Key (\(K\)) and Value (\(V\)) matrices are updated sequentially:
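(Again reconstructed rather than copied: a standard way to write this sequential update concatenates each new frame’s projected keys and values onto the running cache, where \(W_K\) and \(W_V\) are the key and value projections and \([\,\cdot\,;\,\cdot\,]\) denotes concatenation along the token dimension.)

\[
K_t = \big[\, K_{t-1} \,;\, f_t W_K \,\big], \qquad V_t = \big[\, V_{t-1} \,;\, f_t W_V \,\big]
\]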

2. Cross-Modality Attention Score
This is where the “Cross-Modality” part comes in. The model calculates an attention score (\(S_t\)) that measures the relationship between the visual tokens (\(K_t\)) and the query/text tokens (\(Q_t\)).
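A plausible form, following standard scaled dot-product cross-attention (the exact scaling and normalization in the paper may differ; \(d\) denotes the key dimension), is:

\[
S_t = \operatorname{softmax}\!\left(\frac{Q_t K_t^{\top}}{\sqrt{d}}\right)
\]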

This attention map tells the model exactly which parts of the video are relevant to the current text query.
3. Adaptive Layer-Wise Memory Reduction
Now for the “Memory Reduction” part. Since we cannot keep growing the cache \(K_t\) and \(V_t\) forever, we need to evict some tokens.
The authors propose a split-and-prune strategy, visualized effectively here:

The process works as follows:
Step A: Partitioning the Cache. At any given time \(t\), the cache is divided into two parts:
- Recent Cache (\(\tilde{K}_t\)): The most recent frames. We keep these intact because recent context is usually critical for continuity.
- Previous Cache (\(\hat{K}_t\)): Older frames. This is where we need to save space.
The split is determined by a ratio \(\alpha\):
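(The formula isn’t shown here; one natural way to write the split, with \(L_t\) denoting the current cache length, is that the recent cache keeps the newest \(\alpha\) fraction of tokens and the previous cache holds the rest.)

\[
|\tilde{K}_t| = \lceil \alpha \, L_t \rceil, \qquad |\hat{K}_t| = L_t - \lceil \alpha \, L_t \rceil
\]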

Step B: Identifying & Reducing. For the Previous Cache, the model looks at the Cross-Modality Attention Scores. It sums up the attention scores for each visual token relative to the text tokens:
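(Index conventions here are ours: \(i\) runs over text tokens and \(j\) over visual tokens in the previous cache, with \(S_t\) the cross-modality attention map from above.)

\[
s_j = \sum_{i} S_t[i, j]
\]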

If a visual token has a high total score, it means it is highly relevant to the text query (e.g., it represents the “red car” we asked about). The model selects the top-\(\beta\) percent of these tokens to keep and discards the rest.
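Written as a selection rule (a sketch of the operation rather than the paper’s notation), the kept indices and the pruned previous cache are:

\[
\mathcal{I} = \operatorname{TopK}\big( \{ s_j \},\; \lceil \beta \, |\hat{K}_t| \rceil \big), \qquad \hat{K}_t \leftarrow \hat{K}_t[\mathcal{I}], \quad \hat{V}_t \leftarrow \hat{V}_t[\mathcal{I}]
\]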

Here, \(\beta\) is the conserve ratio. By keeping only the top-\(\beta\) tokens, the memory footprint is drastically reduced.
The Resulting Memory Ceiling. Because the cache is constantly being pruned by a factor of \(r\) (derived from \(\alpha\) and \(\beta\)), the total memory usage does not grow linearly to infinity. It converges to a fixed limit, even if the video is infinitely long.
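The paper’s exact bound isn’t reproduced here, but the shape of the argument is a geometric series: if each step appends \(n\) new tokens and the pruning then retains at most a fraction \(r < 1\) of the cache (under the split-and-prune reading above, roughly \(r \approx \alpha + \beta(1-\alpha)\)), then

\[
L_t \le n + r \, L_{t-1} \quad \Longrightarrow \quad \lim_{t \to \infty} L_t \le \frac{n}{1 - r},
\]

a constant that does not depend on the video’s length.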

This mathematical bound is what allows AdaCM\(^2\) to process extremely long videos without crashing.
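To make the whole split-and-prune step concrete, here is a minimal PyTorch-style sketch of the token-selection logic. It assumes the cached keys/values and the cross-modality attention map are available as plain tensors; the function name, shapes, and default ratios are ours, not the paper’s released implementation:

```python
import torch

def split_and_prune(K: torch.Tensor,      # (L, d) cached keys, oldest tokens first
                    V: torch.Tensor,      # (L, d) cached values
                    attn: torch.Tensor,   # (num_text_tokens, L) cross-modality attention
                    alpha: float = 0.3,   # fraction of newest tokens kept intact
                    beta: float = 0.5):   # fraction of older tokens conserved
    """One eviction step: keep the recent cache untouched, keep only the top-beta
    most text-relevant tokens of the previous cache (illustrative sketch)."""
    L = K.size(0)
    recent = int(round(alpha * L))        # recent cache: untouched
    prev = L - recent                     # previous cache: candidates for eviction
    if prev == 0:
        return K, V

    scores = attn[:, :prev].sum(dim=0)             # relevance of each older visual token
    keep = max(1, int(round(beta * prev)))
    idx = scores.topk(keep).indices.sort().values  # keep temporal order of survivors

    K_new = torch.cat([K[:prev][idx], K[prev:]], dim=0)
    V_new = torch.cat([V[:prev][idx], V[prev:]], dim=0)
    return K_new, V_new

# Toy usage: 100 cached tokens shrink to round(0.5 * 70) + 30 = 65.
K, V = torch.randn(100, 64), torch.randn(100, 64)
attn = torch.rand(16, 100)
K, V = split_and_prune(K, V, attn)
print(K.shape)   # torch.Size([65, 64])
```

In the full model this selection runs inside the Q-Former’s cross-attention and, per Observation 2, the ratios can differ from layer to layer; the sketch only shows the per-step token selection.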
Experiments and Results
The researchers put AdaCM\(^2\) to the test on several benchmarks, including LVU (Long-form Video Understanding), Breakfast, and COIN datasets.
1. Long-Term Understanding Performance
The primary benchmark is the LVU dataset, which includes tasks like identifying the director, genre, or relationships in movies.

As shown in Table 1, AdaCM\(^2\) outperforms existing methods like MA-LMM and MovieChat across almost all categories. It achieves a 67.5% average accuracy, a significant jump over the previous best of 63.0%.
The model also showed superior performance on the Breakfast and COIN datasets (instructional videos), proving it can track procedural steps over time.

2. Video Captioning & QA
For more traditional tasks like captioning and QA on shorter datasets (MSRVTT, MSVD), AdaCM\(^2\) maintained or improved upon state-of-the-art performance. This confirms that the memory reduction technique doesn’t hurt the model’s ability to handle standard video tasks.

3. The “Memory Wall” Breached
The most impactful result is the memory consumption analysis. The researchers plotted the GPU memory usage as the number of frames increased.

Figure 6 is the defining image of this paper.
- InstructBLIP (Purple): Hits an Out-Of-Memory (OOM) error almost immediately (around 100 frames).
- VideoLLaMA (Orange): Memory grows linearly; it will eventually crash on a long movie.
- MA-LMM (Pink): Uses a high, constant amount of memory.
- AdaCM\(^2\) (Green/Teal): Maintains a low, constant memory usage regardless of frame count.
It reduces GPU memory consumption by up to 65% compared to other methods, while achieving better accuracy.
4. Qualitative Case Study
Numbers are great, but can it actually watch a movie? Figure 2 demonstrates a zero-shot case study on the Ego4D dataset.

In the top example, the model processes a video lasting over 2 hours. It successfully identifies the number “10” on a person’s jersey at the very end of the video. This requires the model to have managed its memory effectively over thousands of frames without discarding the crucial visual details needed to answer the question.
Ablation Studies: Does the Strategy Matter?
The researchers checked if their “smart” eviction was actually better than just randomly deleting tokens (“Random Eviction”).

Figure 7 shows that AdaCM\(^2\) consistently beats random eviction, proving that the Cross-Modality Attention is correctly identifying the valuable information to keep.
They also analyzed the hyperparameters \(\alpha\) (split ratio) and \(\beta\) (conserve ratio).

Interestingly, keeping more tokens doesn’t always equal better performance. As seen in Figure 8, accuracy peaks at certain ratios. Retaining too much redundant information can actually distract the model (and increase memory cost), validating the idea that “less is often more” in video processing.
Conclusion
The AdaCM\(^2\) paper presents a significant step forward in Video AI. By shifting from “compress everything” to “keep only what is relevant to the question,” the researchers have cracked one of the toughest nuts in multimodal learning: the trade-off between long-term context and memory constraints.
Key Takeaways:
- Adaptive Memory: Memory reduction shouldn’t be static; it should adapt to the prompt and the network layer.
- Cross-Modality is Key: Visual importance is relative. A “tree” token is only important if the user asks about nature.
- Efficiency Wins: AdaCM\(^2\) enables standard GPUs to process videos that are hours long, opening the door for AI assistants that can analyze movies, surveillance footage, or long instructional guides in real-time.
As Large Language Models continue to evolve, techniques like AdaCM\(^2\) will be essential infrastructure, ensuring that our AI models have the memory span to match their reasoning capabilities.