The world of video is exploding. From bite-sized clips on social media to full-length feature films, we are generating and consuming more video content than ever before. For AI, truly understanding this content is a monumental task. A single video can contain mountains of spatiotemporal information—ranging from subtle gestures to complex, multi-minute narratives.
The core challenge for modern video understanding models comes down to two conflicting needs:
- Efficiency — Video data is massive and often highly redundant. Models must process it quickly without exhausting computational resources.
- Global Context — Videos aren’t just isolated frames. Understanding them requires capturing dependencies that can span hundreds or thousands of frames.
The Historical Trade-Off
For years, two families of models have dominated:
- 3D Convolutional Neural Networks (CNNs): Great at capturing local, space-time patterns, but struggle with long-range dependencies.
- Video Transformers: Self-attention lets them connect every frame to every other frame — perfect for long-range dependencies. The drawback? The quadratic complexity of attention makes them painfully slow and memory-hungry for long, high-resolution videos.
This trade-off has left a gap: we need a model as efficient as a CNN but as globally aware as a Transformer.
Enter the State Space Model (SSM) — an idea borrowed from control theory. Recently, a breakthrough model called Mamba showed that SSMs can model long sequences with linear complexity, rivaling Transformers at a fraction of the cost. That raises the tantalizing question: Can Mamba work for video understanding?
A new paper — VideoMamba: State Space Model for Efficient Video Understanding — answers with a resounding yes. The researchers introduce VideoMamba, a pure SSM-based architecture that redefines the state-of-the-art in practical video analysis. As you’ll see, it’s not just an incremental improvement — it’s a fundamental shift in how we think about long-form video tasks.
Figure 1: Throughput and memory comparisons between TimeSformer and VideoMamba. VideoMamba can be up to 6× faster and use 40× less GPU memory for long video sequences.
Understanding State Space Models (SSMs)
An SSM models a sequence by maintaining a hidden state — essentially a compact summary of everything seen so far — and updating it step by step. The continuous form is:
\[ h'(t) = \mathbf{A} h(t) + \mathbf{B} x(t) \]
\[ y(t) = \mathbf{C} h(t) \]
Here:
- \( x(t) \) = input at time \( t \)
- \( y(t) \) = output
- \( h(t) \) = hidden state
- \( \mathbf{A}, \mathbf{B}, \mathbf{C} \) define how the state evolves and produces outputs
For deep learning, these equations are discretized into a recurrence:
\[ h_t = \overline{\mathbf{A}} h_{t-1} + \overline{\mathbf{B}} x_t \]
\[ y_t = \mathbf{C} h_t \]
Traditional SSMs kept \( \overline{\mathbf{A}} \), \( \overline{\mathbf{B}} \), and \( \mathbf{C} \) fixed. Mamba’s leap forward was making them dynamic — dependent on the input. Its Selective Scan Mechanism (S6) lets the model decide what to remember or forget based on the content. That grants Mamba contextual awareness akin to attention, yet with linear complexity scaling.
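To make the recurrence concrete, here is a minimal PyTorch sketch of an input-dependent (selective) scan. The class name `SelectiveSSMSketch`, the projection layers, and the shapes are illustrative assumptions rather than the paper's code, and the explicit Python loop stands in for the hardware-aware parallel scan that real Mamba kernels use.

```python
# A minimal sketch of the discretized SSM recurrence above (hypothetical shapes,
# NOT the paper's implementation). Mamba makes B, C, and the step size delta
# functions of the input, which is emulated here with small linear projections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMSketch(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Fixed state matrix A (negative values keep the recurrence stable).
        self.A = nn.Parameter(-torch.rand(d_model, d_state))
        # Input-dependent parameters: B, C, and the discretization step delta.
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        h = x.new_zeros(batch, d_model, self.A.shape[1])   # hidden state
        ys = []
        for t in range(seq_len):                           # linear in sequence length
            xt = x[:, t]                                   # (batch, d_model)
            delta = F.softplus(self.to_delta(xt))          # positive step size
            # Discretize with the step size: A_bar = exp(delta * A).
            A_bar = torch.exp(delta.unsqueeze(-1) * self.A)            # (B, D, N)
            B_bar = delta.unsqueeze(-1) * self.to_B(xt).unsqueeze(1)   # (B, D, N)
            h = A_bar * h + B_bar * xt.unsqueeze(-1)       # h_t = A_bar h_{t-1} + B_bar x_t
            y = (h * self.to_C(xt).unsqueeze(1)).sum(-1)   # y_t = C h_t
            ys.append(y)
        return torch.stack(ys, dim=1)                      # (batch, seq_len, d_model)
```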
The VideoMamba Architecture
VideoMamba builds on the simple but effective blueprint of the Vision Transformer (ViT) — but swaps expensive self-attention blocks for efficient Bidirectional Mamba (B-Mamba) blocks.
Figure 2: Framework of VideoMamba. Videos are split into 3D patches, processed through B-Mamba blocks, and classified via a final head.
Step-by-step:
1. Video to Patches: The input video \(X^v\) is segmented into 3D spatiotemporal patches via a small 3D convolution (for example, 16×16-pixel patches spanning several frames), and the patches are flattened into a sequence of tokens.
2. Add Position Embeddings: A [CLS] classification token is prepended, and learnable spatial (\(\mathbf{p}_s\)) and temporal (\(\mathbf{p}_t\)) embeddings are added to preserve position information:
\[ \mathbf{X} = [\mathbf{X}_{\text{cls}}, \mathbf{X}] + \mathbf{p}_s + \mathbf{p}_t \]
3. Bidirectional Mamba Blocks: A standard Mamba scans a sequence forward only. B-Mamba scans both forward and backward, then merges the results, capturing richer spatial context (a minimal sketch follows Figure 3 below).
Figure 3: Standard Mamba (1D) vs. Bidirectional Mamba (2D), crucial for spatially rich data like images and video.
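The sketch below shows the bidirectional wiring: the same 1D scan is applied to the token sequence in forward and reversed order, and the two results are merged. `ssm_forward`, `ssm_backward`, and the merge-by-sum choice are assumptions for illustration (any 1D sequence module, such as the `SelectiveSSMSketch` above, can stand in), not the paper's exact block design.

```python
# A hedged sketch of a bidirectional Mamba (B-Mamba) block.
import torch
import torch.nn as nn

class BidirectionalMambaBlock(nn.Module):
    def __init__(self, d_model: int, ssm_forward: nn.Module, ssm_backward: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ssm_forward = ssm_forward    # scans tokens left-to-right
        self.ssm_backward = ssm_backward  # scans tokens right-to-left

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model), already position-embedded.
        x = self.norm(tokens)
        fwd = self.ssm_forward(x)                          # forward scan
        bwd = self.ssm_backward(torch.flip(x, dims=[1]))   # backward scan
        bwd = torch.flip(bwd, dims=[1])                    # re-align to original order
        return tokens + fwd + bwd                          # merge + residual
```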
Scanning 3D Data: Spatiotemporal Strategies
Applying a 1D scan to a video’s 3D grid requires flattening it in some order. The authors tried four methods:
Figure 4: (a) Spatial-First, (b) Temporal-First, (c,d) hybrid spatiotemporal scans.
- Spatial-First: Process all patches of frame 1, then frame 2, etc.
- Temporal-First: Process the same patch position across all frames, then move spatially.
- Hybrids: Mix spatial/temporal ordering.
Winner: Spatial-First. It’s simple, effective, and benefits from 2D image knowledge learned during pre-training.
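For intuition, here is a small sketch of how the two main scan orders flatten a (T, H, W) grid of patch tokens into a 1D sequence; the tensor names and shapes are illustrative only.

```python
# Flattening a 3D patch grid into a 1D token sequence in two different orders.
import torch

T, H, W, D = 8, 14, 14, 192              # frames, patch rows, patch cols, embed dim
patches = torch.randn(T, H, W, D)        # 3D grid of patch embeddings

# Spatial-First: all patches of frame 0, then frame 1, ... (the order VideoMamba adopts)
spatial_first = patches.reshape(T * H * W, D)

# Temporal-First: the same spatial position across all frames, then the next position
temporal_first = patches.permute(1, 2, 0, 3).reshape(H * W * T, D)
```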
Tackling Overfitting with Self-Distillation
Scaling VideoMamba to larger variants (e.g., VideoMamba-M) led to overfitting. The fix? Self-Distillation.
Train a smaller, high-performing “teacher” model (e.g., VideoMamba-S), then train the larger “student” to match the teacher’s features in addition to learning from labels. This keeps big models grounded and generalizable.
Figure 5: Self-Distillation (red curve) prevents overfitting, leading to higher final accuracy.
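A hedged sketch of this objective is shown below: the larger student is trained on labels while also matching the features of the smaller, already-trained teacher. The L2 feature-alignment loss and the weight `alpha` are assumptions for illustration, not the paper's exact recipe.

```python
# Self-distillation loss: supervised classification plus teacher-feature matching.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, student_feats,
                           teacher_feats, labels, alpha: float = 1.0):
    # Standard supervised loss on the student's predictions.
    cls_loss = F.cross_entropy(student_logits, labels)
    # Keep the student's intermediate features close to the frozen teacher's.
    distill_loss = F.mse_loss(student_feats, teacher_feats.detach())
    return cls_loss + alpha * distill_loss
```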
Masked Modeling for Motion Sensitivity
To sharpen temporal sensitivity, the authors used masked modeling pre-training — hiding portions of the input and predicting them. Standard random masking wasn’t optimal for Mamba’s 1D convolution, which prefers continuous tokens.
They designed row masking strategies, masking entire spatial rows, plus attention masking to preserve meaningful adjacency.
Figure 6: Tailored masking strategies improve pre-training efficiency.
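The row-masking idea can be sketched as follows: instead of masking scattered random patches, whole spatial rows of each frame are hidden so that the visible tokens stay contiguous, which suits Mamba's 1D scan. The mask ratio and shapes here are illustrative assumptions, not the paper's settings.

```python
# Row masking: hide entire spatial rows per frame so visible tokens stay contiguous.
import torch

def random_row_mask(T: int, H: int, W: int, mask_ratio: float = 0.5) -> torch.Tensor:
    # Returns a boolean mask of shape (T, H, W); True = masked.
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    num_masked_rows = int(H * mask_ratio)
    for t in range(T):
        rows = torch.randperm(H)[:num_masked_rows]   # pick rows to hide in frame t
        mask[t, rows, :] = True                      # mask each chosen row in full
    return mask
```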
Four Core Abilities: Scalability, Sensitivity, Superiority, Compatibility
1. Scalability
Tested first on ImageNet-1K:
| Model | Params | FLOPs | Top-1 (%) |
|---|---|---|---|
| VideoMamba-Ti | 7M | 1.1G | 76.9 |
| VideoMamba-S | 26M | 4.3G | 81.2 |
| VideoMamba-M | 74M | 12.7G | 82.8 |
Fine-tuning VideoMamba-M at 576×576 increased Top-1 to 84.0%.
2. Sensitivity: Short-Term Action Recognition
On Kinetics-400 (scene-focused), VideoMamba-M scored 81.9% — +2.0% over ViViT-L using much less pre-training data.
On Something-Something V2 (motion-focused), it reached 68.3% — +3.0% over ViViT-L.
Masked pre-training results surpassed VideoMAE.
Figure 7: Spatial-First scanning is most effective; increasing frames often benefits performance more than resolution.
3. Superiority: Long-Term Video Understanding
Datasets: Breakfast (long cooking tasks), COIN (instructional tasks), LVU (movie clips).
Prior methods relied on two-stage pipelines (extracting features with a frozen backbone, then modeling them separately) due to computational limits. VideoMamba’s linear complexity enabled end-to-end training.
Results:
- Breakfast — VideoMamba-S: 97.4% Top-1 (SOTA)
- COIN — VideoMamba-S: 88.7% Top-1 (SOTA)
- LVU — Even the smallest VideoMamba-Ti matched or beat the prior SOTA on many tasks.
4. Compatibility: Multi-Modal Video Understanding
Pre-trained on large-scale video-text and image-text datasets, VideoMamba was evaluated on zero-shot text-to-video retrieval. It outperformed the ViT-based UMT, especially on complex, long videos (ActivityNet, LSMDC), confirming its robustness for multi-modal alignment.
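To illustrate what zero-shot text-to-video retrieval looks like in practice, here is a hedged sketch: video and text are embedded into a shared space and ranked by cosine similarity. The encoder interfaces and the projection setup are assumptions for illustration, not the paper's multi-modal training code.

```python
# Zero-shot text-to-video retrieval via cosine similarity in a shared embedding space.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_videos(text_encoder, video_encoder, query_text, candidate_videos, top_k: int = 5):
    # Embed the query and all candidate videos, then L2-normalize.
    text_emb = F.normalize(text_encoder(query_text), dim=-1)           # (1, D)
    video_embs = F.normalize(video_encoder(candidate_videos), dim=-1)  # (N, D)
    # Cosine similarity reduces to a dot product after normalization.
    scores = video_embs @ text_emb.squeeze(0)                          # (N,)
    return scores.topk(min(top_k, scores.numel())).indices             # best-matching indices
```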
Conclusion: A New Era for Video AI
VideoMamba replaces the quadratic-cost attention mechanism with a linear-cost SSM, achieving efficiency, scalability, and superior performance:
- Dramatic Efficiency Gains — End-to-end long-video training now feasible.
- State-of-the-Art Results — From short action recognition to multi-minute video comprehension.
- Scalable Design — Self-distillation overcomes large-model overfitting.
Limitations include the absence of experiments at ultra-large model scales and the open question of integrating additional modalities such as audio and language for hour-scale comprehension; both are ripe for future work.
For researchers and practitioners, VideoMamba signals a clear shift: the future of video understanding may lie beyond the Transformer.