For the past few years, Vision Transformers (ViTs) have dominated computer vision. By treating images as sequences of patches and applying self-attention, these models have set new benchmarks in image classification, object detection, and semantic segmentation. However, this power comes at a steep computational cost.
The self-attention mechanism at the core of Transformers suffers from quadratic complexity. In plain terms, if you double the number of image patches (for example, by increasing resolution), the computation and memory demands don’t just double—they quadruple. This makes high-resolution image processing slow, memory-hungry, and often impractical without specialized hardware or cumbersome architectural workarounds.
But what if we could retain the global, context-aware capabilities of Transformers without the quadratic bottleneck? This question has led researchers to explore alternatives. One promising candidate comes from an unexpected place: classical control theory. Enter State Space Models (SSMs), and their latest powerful incarnation—Mamba. In natural language processing, Mamba’s linear scalability with sequence length has matched, and sometimes surpassed, Transformers.
A new paper, “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model”, takes the next logical leap: adapting Mamba to computer vision. The proposed Vision Mamba (Vim) backbone uses SSMs to process images, achieving Transformer-level performance—or better—while being significantly more efficient.
Figure 1: Head-to-head comparison of Vim and the DeiT Vision Transformer. Vim consistently outperforms DeiT in accuracy across tasks, while delivering superior speed and memory efficiency, especially at high resolutions.
In this article, we’ll unpack the Vision Mamba paper. We’ll cover the fundamentals of State Space Models, see how the authors adapted them for vision, and examine the impressive results that could signal a shift in backbone design for computer vision.
From Transformers to State Space Models
To appreciate the impact of Vim, let’s start with the problem it addresses.
Vision Transformers (ViTs) work by slicing an image into patches (e.g., 16×16 pixels), flattening them into vectors, and treating these vectors as a sequence—akin to words in a sentence. Self-attention then computes interactions between every pair of patches, enabling the model to capture global context. For example, a patch of a cat’s ear can “attend” to a patch of its tail, no matter the spatial distance. The drawback? Calculating every pairwise interaction in sequences of length N leads to O(N²) complexity.
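To make the quadratic cost concrete, here is a toy self-attention computation (illustrative sizes of my own choosing, not any particular library's ViT implementation). The \( N \times N \) attention matrix over every pair of patches is exactly the object that grows quadratically:

```python
# Toy single-head self-attention over ViT-style patch tokens (illustrative sizes).
import torch

N, D = 196, 192                      # 196 patches for a 224x224 image with 16x16 patches
q, k, v = (torch.randn(N, D) for _ in range(3))   # queries, keys, values from patch tokens

attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)  # (N, N): every pair of patches interacts
out = attn @ v                                    # contextualized patch features
print(attn.shape)                                 # torch.Size([196, 196])
```

Doubling the image side length quadruples \( N \), and this matrix grows by a factor of 16.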
State Space Models (SSMs) take a different approach. Inspired by continuous systems, they define a latent “state” \( h(t) \) that evolves over time according to:
\[ h'(t) = A h(t) + B x(t) \]
\[ y(t) = C h(t) \]
Here, \( A, B, C \) are parameters describing the system’s dynamics. To use SSMs for deep learning, we discretize the equations using a step size \( \Delta \); with the standard zero-order hold (ZOH) rule this produces discrete parameters \( \overline{A} \) and \( \overline{B} \):
\[ \overline{A} = \exp(\Delta A), \qquad \overline{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\, \Delta B \]
so that a sequence \( x_1, x_2, \dots \) is processed as
\[ h_t = \overline{A} h_{t-1} + \overline{B} x_t, \qquad y_t = C h_t \]
Once discretized, the model can be expressed in two forms:
- Recurrent form: processes the sequence step-by-step, updating the hidden state at each element.
- Convolutional form: represents the recurrent process as a single large convolution over the entire sequence, enabling parallel training.
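To see that the two forms really are equivalent, here is a toy numerical sketch (my own illustrative dimensions, not the paper's code) that discretizes a diagonal SSM with the ZOH rule and checks that the recurrent scan and the convolution produce identical outputs:

```python
# Toy 1-D discrete SSM: recurrent form vs. equivalent convolutional form.
import numpy as np

N, M = 4, 16                                   # state size N, sequence length M
a = -np.random.uniform(0.5, 2.0, N)            # diagonal of continuous A (stable)
B = np.random.randn(N, 1)
C = np.random.randn(1, N)
delta = 0.1                                    # step size

# Zero-order-hold discretization (diagonal case).
A_bar = np.diag(np.exp(delta * a))
B_bar = ((np.exp(delta * a) - 1.0) / (delta * a))[:, None] * (delta * B)

x = np.random.randn(M)                         # input sequence

# Recurrent form: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t
h = np.zeros((N, 1))
y_rec = []
for t in range(M):
    h = A_bar @ h + B_bar * x[t]
    y_rec.append((C @ h).item())

# Convolutional form: y = x * K with kernel K_k = C A_bar^k B_bar
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(M)])
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(M)]

assert np.allclose(y_rec, y_conv)              # both forms give identical outputs
```

Classical SSMs such as S4 train with the convolutional form (parallel over the sequence) and infer with the recurrent form; Mamba's input-dependent parameters rule out a fixed convolution kernel, which is why it relies on a hardware-aware parallel scan instead.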
The Mamba architecture improved SSMs by making parameters \( B, C, \Delta \) data-dependent—determined by the current input sequence—allowing selective adaptation akin to attention, yet preserving linear complexity.
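A hedged sketch of the selective idea (module and parameter names are my own, not the paper's code): \( B \), \( C \), and the step size \( \Delta \) are projected from the input tokens themselves, so the dynamics adapt per position while the scan remains linear in sequence length.

```python
# Sketch of input-dependent (selective) SSM parameters, PyTorch-style.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)   # input-dependent B
        self.to_C = nn.Linear(d_model, d_state)   # input-dependent C
        self.to_delta = nn.Linear(d_model, 1)     # input-dependent step size

    def forward(self, x):                         # x: (batch, length, d_model)
        B, C = self.to_B(x), self.to_C(x)
        delta = F.softplus(self.to_delta(x))      # keep the step size positive
        return B, C, delta
```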
Adapting Mamba for Vision: The Vim Architecture
Transforming Mamba—built for 1D text sequences—into an image-processing model required clever engineering.
Figure 2: Vision Mamba follows a ViT-like patch-and-embed pipeline but replaces self-attention with bidirectional SSM-based Vim blocks.
The Vim pipeline:
- Patching: Non-overlapping image patches are extracted.
- Embedding: Each patch is flattened and linearly projected into a token. Positional embeddings add spatial context, and classification tokens can be included.
- Vim Encoder: Tokens pass through a stack of \( L \) bidirectional Vim blocks.
- Prediction: The classification token output is fed to an MLP head.
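A minimal patch-and-embed sketch of the first two stages (hypothetical sizes, not the authors' code). For simplicity the class token is prepended here; as the ablations later show, Vim actually benefits from inserting it in the middle of the sequence.

```python
# Patch extraction + linear projection + positional embeddings + class token.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv patches and linearly projects in one step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, img):                                  # img: (batch, 3, H, W)
        tokens = self.proj(img).flatten(2).transpose(1, 2)   # (batch, M, dim)
        cls = self.cls_token.expand(img.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```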
Bidirectional Vim Blocks: Capturing Context from All Directions
A standard Mamba block processes a sequence in a single direction, which suits causal text but not images: a patch needs context from every spatial direction, not just from the patches that precede it in the scan order. Vim addresses this with bidirectional processing:
Inside each Vim block:
- Input is normalized.
- Two projections are computed:
  - \( x \): the main branch features.
  - \( z \): a gating vector.
- Forward branch: processes patches from start to end using a Conv1d and SSM.
- Backward branch: processes patches in reverse order.
- Each branch captures context in its respective direction via its selective SSM.
- Outputs from both branches are gated with \( z \) and summed.
- A final projection and residual connection complete the block.
By fusing forward and backward contexts, Vim builds a holistic representation akin to self-attention—without its quadratic overhead.
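The sketch below shows how these pieces could fit together in one block. It is a simplified illustration, not the authors' implementation: `selective_ssm` is a stand-in for Mamba's selective scan (whose parameters would come from a projection like the one sketched earlier), and the layer sizes are arbitrary.

```python
# Simplified bidirectional Vim block: norm -> (x, z) -> forward/backward Conv1d+SSM
# -> gate with z -> project -> residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

def selective_ssm(x):
    # Placeholder for the data-dependent SSM scan; identity keeps the sketch runnable.
    return x

class VimBlock(nn.Module):
    def __init__(self, dim: int, d_inner: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_x = nn.Linear(dim, d_inner)     # main branch features x
        self.proj_z = nn.Linear(dim, d_inner)     # gating vector z
        self.conv_fwd = nn.Conv1d(d_inner, d_inner, 3, padding=1, groups=d_inner)
        self.conv_bwd = nn.Conv1d(d_inner, d_inner, 3, padding=1, groups=d_inner)
        self.out_proj = nn.Linear(d_inner, dim)

    def forward(self, tokens):                    # tokens: (batch, length, dim)
        h = self.norm(tokens)
        x, z = self.proj_x(h), self.proj_z(h)

        # Forward branch: Conv1d + SSM over the sequence in its original order.
        y_f = selective_ssm(F.silu(self.conv_fwd(x.transpose(1, 2)).transpose(1, 2)))

        # Backward branch: same computation on the reversed sequence, then un-reversed.
        y_b = selective_ssm(F.silu(self.conv_bwd(x.flip(1).transpose(1, 2)).transpose(1, 2))).flip(1)

        # Gate both directions with z, sum, project back, and add the residual.
        return tokens + self.out_proj((y_f + y_b) * F.silu(z))
```

A full Vim encoder would stack \( L \) such blocks over the patch tokens produced by the embedding stage.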
Why Vim is Efficient
Three main optimizations make Vim stand out:
- Computational Efficiency: Self-attention scales as \( O(M^2 D) \) (quadratic in sequence length \( M \)), while Vim’s SSM is \( O(M D N) \) (linear, with a small fixed state size \( N \)).
- Memory Efficiency: Vim avoids storing massive attention matrices, keeping memory use linear with sequence length.
- IO Efficiency: Parameters are loaded once into fast GPU SRAM, computations run there, and only final outputs are written back, minimizing data-transfer bottlenecks.
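A rough back-of-the-envelope comparison of the dominant terms (the constant factors here are my own illustrative assumptions, not the paper's exact operation counts) shows how quickly the gap opens as the token count \( M \) grows:

```python
# Illustrative arithmetic only: dominant-term growth of attention vs. SSM.
D, N = 192, 16                       # embedding dim, SSM state size (Vim-Ti-like)
for M in (196, 784, 3136):           # tokens at 224^2, 448^2, 896^2 with 16x16 patches
    attn = 2 * M * M * D             # ~O(M^2 D): pairwise scores + weighted sum
    ssm = 4 * M * D * N              # ~O(M D N): linear scan over the sequence
    print(f"M={M:5d}  attention~{attn:.2e}  SSM~{ssm:.2e}  ratio~{attn / ssm:.0f}x")
```

Quadrupling the number of tokens quadruples the attention-to-SSM cost ratio, which is exactly the regime where high-resolution vision lives.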
Experimental Results
The paper benchmarks Vim against leading models across vision tasks.
ImageNet Classification
Table 1: Vim consistently outperforms DeiT models of similar size. Long-sequence fine-tuning boosts accuracy further.
With “long sequence fine-tuning” (smaller patch stride), Vim captures finer detail, improving accuracy by 1–2 points. Vim-S† reaches 81.4%—matching much larger Transformer models.
High-Resolution Efficiency
Figure 3: While speeds are similar at low resolution, Vim’s advantage increases with image size.
Figure 4: Memory usage comparison. Vim remains efficient at high resolution, avoiding OOM errors.
At 1248×1248 resolution, Vim runs 2.8× faster and uses 86.8% less GPU memory than DeiT during batch inference; push the resolution higher still and DeiT eventually crashes with out-of-memory (OOM) errors while Vim keeps running.
Downstream Tasks: Segmentation and Detection
On semantic segmentation (ADE20K), Vim outperforms DeiT:
Table 2: Vim-S matches ResNet-101’s mIoU while using far fewer parameters.
On object detection and instance segmentation (COCO), Vim-Ti surpasses DeiT-Ti by notable margins:
Table 3: Vim’s sequential modeling outperforms DeiT, even without 2D-specific priors like windowed attention.
Vim processed high-res detection tasks without architectural changes, unlike DeiT, which required windowed attention. Qualitative results show Vim capturing large-scale objects missed by DeiT:
Figure 5: Superior long-range context allows Vim to detect large objects more completely than DeiT.
Ablation Studies
Bidirectionality Matters: Removing bidirectionality harms segmentation badly (−3.6 mIoU). The full Bidirectional SSM + Conv1d design yields the best scores:
Table 4: Bidirectional processing is essential for dense prediction tasks.
Positioning the [CLS] Token: Surprisingly, placing it in the middle of the sequence performs best (76.1% accuracy), leveraging the bidirectional processing:
Table 5: Middle class token placement maximizes forward/backward context aggregation.
Conclusion: A New Contender Emerges
The Vision Mamba paper makes a strong case for SSM-based vision backbones. By adapting Mamba with bidirectional processing and position embeddings, Vim:
- Matches or exceeds highly optimized Transformers (e.g., DeiT) across diverse tasks.
- Scales linearly in speed and memory with sequence length—ideal for high-resolution inputs.
- Maintains pure sequential modeling, avoiding heavy 2D priors.
This efficiency unlocks applications previously impractical with Transformers: gigapixel medical slides, massive satellite images, or long video streams—processed end-to-end, without complex tiling.
The paper shows that self-attention’s dominance in high-performance vision is not absolute. With Vim, the future of computer vision backbones might well be written in the language of State Space Models.