For the past few years, Vision Transformers (ViTs) have dominated computer vision. By treating images as sequences of patches and applying self-attention, these models have set new benchmarks in image classification, object detection, and semantic segmentation. However, this power comes at a steep computational cost.

The self-attention mechanism at the core of Transformers suffers from quadratic complexity. In plain terms, if you double the number of image patches (for example, by increasing resolution), the computation and memory demands don’t just double—they quadruple. This makes high-resolution image processing slow, memory-hungry, and often impractical without specialized hardware or cumbersome architectural workarounds.

But what if we could retain the global, context-aware capabilities of Transformers without the quadratic bottleneck? This question has led researchers to explore alternatives, and one promising candidate comes from an unexpected place: classical control theory. Enter State Space Models (SSMs) and their latest, most powerful incarnation, Mamba. In natural language processing, Mamba scales linearly with sequence length while matching, and sometimes surpassing, Transformer performance.

A new paper, “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model”, takes the next logical leap: adapting Mamba to computer vision. The proposed Vision Mamba (Vim) backbone uses SSMs to process images, achieving Transformer-level performance—or better—while being significantly more efficient.

Figure 1: Head-to-head comparison of Vim and the DeiT Vision Transformer. Vim achieves higher accuracy on classification, segmentation, and detection, while being significantly faster and using far less GPU memory on high-resolution images.

In this article, we’ll unpack the Vision Mamba paper. We’ll cover the fundamentals of State Space Models, see how the authors adapted them for vision, and examine the impressive results that could signal a shift in backbone design for computer vision.


From Transformers to State Space Models

To appreciate the impact of Vim, let’s start with the problem it addresses.

Vision Transformers (ViTs) work by slicing an image into patches (e.g., 16×16 pixels), flattening them into vectors, and treating these vectors as a sequence—akin to words in a sentence. Self-attention then computes interactions between every pair of patches, enabling the model to capture global context. For example, a patch of a cat’s ear can “attend” to a patch of its tail, no matter the spatial distance. The drawback? Calculating every pairwise interaction in sequences of length N leads to O(N²) complexity.
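To make the patch-sequence view concrete, here is a small NumPy sketch (ours, not from the paper; the 224×224 input and 16×16 patches are just illustrative) that splits an image into patches and counts the pairwise interactions self-attention must score:

```python
import numpy as np

# Toy image: 224x224 RGB, split into non-overlapping 16x16 patches.
image = np.random.rand(224, 224, 3)
patch = 16
grid = 224 // patch                                      # 14 patches per side
patches = image.reshape(grid, patch, grid, patch, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(grid * grid, patch * patch * 3)  # (196, 768) flattened patches

n = tokens.shape[0]
print(f"{n} tokens -> attention matrix with {n * n:,} entries")

# Doubling the resolution quadruples the token count and
# multiplies the number of pairwise interactions by 16.
n2 = (448 // patch) ** 2
print(f"{n2} tokens -> attention matrix with {n2 * n2:,} entries")
```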

State Space Models (SSMs) take a different approach. Inspired by continuous systems, they define a latent “state” \( h(t) \) that evolves over time according to:

\[ h'(t) = A h(t) + B x(t) \]

\[ y(t) = C h(t) \]

Here, \( A, B, C \) are parameters describing the system’s dynamics. To use SSMs for deep learning, we discretize the equations using a step size \( \Delta \), producing discrete parameters \( \overline{A} \) and \( \overline{B} \):

\[ \overline{A} = \exp(\Delta A), \qquad \overline{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\,\Delta B \]
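For reference, here is a minimal NumPy sketch of this discretization, assuming a diagonal \( A \) as in S4/Mamba-style models (the values are illustrative, not taken from the paper):

```python
import numpy as np

def discretize(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal A:
    Abar = exp(delta*A), Bbar = (delta*A)^-1 (exp(delta*A) - 1) * delta * B."""
    dA = delta * A_diag                      # (N,)
    Abar = np.exp(dA)                        # (N,) elementwise, since A is diagonal
    Bbar = (Abar - 1.0) / dA * delta * B     # (N,)
    return Abar, Bbar

# Illustrative values: a 4-dimensional state with negative (stable) dynamics.
A = -np.linspace(1.0, 4.0, 4)                # diagonal of A
B = np.ones(4)
Abar, Bbar = discretize(A, B, delta=0.1)
print(Abar, Bbar)
```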

Once discretized, the model can be expressed in two forms:

  • Recurrent form: processes the sequence step-by-step, updating the hidden state at each element: \( h_t = \overline{A}\, h_{t-1} + \overline{B}\, x_t \), \( y_t = C h_t \).

  • Convolutional form: unrolls the recurrence into a single long convolution over the entire sequence, enabling parallel training: \( y = x \ast \overline{K} \) with the structured kernel \( \overline{K} = (C\overline{B},\, C\overline{A}\overline{B},\, \ldots,\, C\overline{A}^{M-1}\overline{B}) \); a small numerical check of their equivalence follows this list.
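The two forms compute identical outputs. Here is a minimal single-channel NumPy check, using made-up discrete parameters rather than values from the paper:

```python
import numpy as np

# Made-up discrete parameters for one input/output channel, state size N = 4.
rng = np.random.default_rng(0)
N, M = 4, 12                                 # state size, sequence length
Abar = np.diag(rng.uniform(0.1, 0.9, N))     # stable diagonal state transition
Bbar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=M)

# Recurrent form: h_t = Abar h_{t-1} + Bbar x_t,  y_t = C h_t
h = np.zeros((N, 1))
y_rec = []
for t in range(M):
    h = Abar @ h + Bbar * x[t]
    y_rec.append((C @ h).item())

# Convolutional form: y = x * Kbar with kernel entries Kbar_k = C Abar^k Bbar
K = np.array([(C @ np.linalg.matrix_power(Abar, k) @ Bbar).item() for k in range(M)])
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(M)]

print(np.allclose(y_rec, y_conv))            # True: both forms agree
```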

The Mamba architecture improves on these SSMs by making the parameters \( B, C, \Delta \) data-dependent: they are computed from the current input rather than fixed, which lets the model selectively emphasize or ignore parts of the sequence, much like attention, while preserving linear complexity.
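A rough PyTorch-style sketch of that selection idea, with hypothetical layer names and simplified shapes (the real Mamba fuses this with a hardware-aware parallel scan):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produces input-dependent SSM parameters, in the spirit of Mamba's
    selection mechanism (simplified sketch, not the actual implementation)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)      # B depends on the token
        self.to_C = nn.Linear(d_model, d_state)      # C depends on the token
        self.to_delta = nn.Linear(d_model, 1)        # per-token step size
        # A stays a (log-parameterized) learned constant, as in Mamba.
        self.log_A = nn.Parameter(torch.randn(d_model, d_state))

    def forward(self, x):                            # x: (batch, length, d_model)
        B = self.to_B(x)                             # (batch, length, d_state)
        C = self.to_C(x)                             # (batch, length, d_state)
        delta = F.softplus(self.to_delta(x))         # positive step sizes
        A = -torch.exp(self.log_A)                   # negative values -> stable dynamics
        return A, B, C, delta

params = SelectiveParams(d_model=192)
A, B, C, delta = params(torch.randn(2, 196, 192))
print(B.shape, C.shape, delta.shape)  # (2, 196, 16), (2, 196, 16), (2, 196, 1)
```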


Adapting Mamba for Vision: The Vim Architecture

Transforming Mamba—built for 1D text sequences—into an image-processing model required clever engineering.

Figure 2: Overview of the Vision Mamba (Vim) architecture. The input image is divided into patches, embedded with positional information, and processed by a stack of bidirectional SSM-based Vim encoder blocks, mirroring a ViT's patch-and-embed pipeline but replacing self-attention.

The Vim pipeline:

  1. Patching: Non-overlapping image patches are extracted.
  2. Embedding: Each patch is flattened and linearly projected into a token; learned positional embeddings add spatial context, and a class token can be included, ViT-style: \( T_0 = [\,t_{cls};\ t_p^1 W;\ \ldots;\ t_p^J W\,] + E_{pos} \) (see the sketch after this list).
  3. Vim Encoder: Tokens pass through a stack of \( L \) bidirectional Vim blocks.
  4. Prediction: The classification token output is fed to an MLP head.
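To make steps 1–2 concrete, here is a minimal PyTorch sketch of a ViT-style patch-embedding front end of the kind Vim uses. The class name and dimensions are ours, and for simplicity the class token is prepended rather than inserted mid-sequence as the paper's ablation favours:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Steps 1-2: patchify with a strided convolution, then add a class token
    and learned position embeddings (standard ViT-style front end)."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, images):                                   # (B, 3, H, W)
        tokens = self.proj(images).flatten(2).transpose(1, 2)    # (B, n_patches, dim)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                 # prepend class token (simplified)
        return tokens + self.pos_embed                           # add positional information

embed = PatchEmbed()
print(embed(torch.randn(2, 3, 224, 224)).shape)                  # torch.Size([2, 197, 192])
```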

Bidirectional Vim Blocks: Capturing Context from All Directions

A standard Mamba block processes a sequence in one direction only, which suits text but not images: an image patch needs context from every spatial direction, not just from the patches that precede it. Vim addresses this with bidirectional processing:

Inside each Vim block:

  1. Input is normalized.
  2. Two projections are computed:
    • \( x \): the main branch features.
    • \( z \): a gating vector.
  3. Forward branch: processes patches from start to end using a Conv1d and SSM.
    Backward branch: processes patches in reverse order.
  4. Each branch captures context in its respective direction via its selective SSM.
  5. Outputs from both branches are gated with \( z \) and summed.
  6. A final projection and residual connection complete the block.

By fusing forward and backward contexts, Vim builds a holistic representation akin to self-attention—without its quadratic overhead.
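Putting those steps together, here is a heavily simplified PyTorch-style sketch of the block's data flow. The names are ours, and the `ssm_fwd`/`ssm_bwd` modules are stand-ins for Mamba's selective scan (replaced here by identity functions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VimBlockSketch(nn.Module):
    """Simplified bidirectional Vim-style block: norm -> project to (x, z) ->
    forward and backward Conv1d branches (each would feed a selective SSM) ->
    gate both branch outputs with z -> sum -> project back, with a residual."""
    def __init__(self, dim, expand=2):
        super().__init__()
        d = dim * expand
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * d)              # produces x and z
        self.conv_fwd = nn.Conv1d(d, d, 3, padding=1, groups=d)
        self.conv_bwd = nn.Conv1d(d, d, 3, padding=1, groups=d)
        self.ssm_fwd = nn.Identity()                      # stand-in for the forward selective SSM
        self.ssm_bwd = nn.Identity()                      # stand-in for the backward selective SSM
        self.out_proj = nn.Linear(d, dim)

    def branch(self, x, conv, ssm, reverse):
        if reverse:                                       # backward branch sees reversed patches
            x = torch.flip(x, dims=[1])
        x = conv(x.transpose(1, 2)).transpose(1, 2)       # Conv1d along the sequence axis
        x = ssm(F.silu(x))
        return torch.flip(x, dims=[1]) if reverse else x  # restore original order

    def forward(self, tokens):                            # tokens: (batch, length, dim)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)
        gate = F.silu(z)                                  # gating vector from z
        y = self.branch(x, self.conv_fwd, self.ssm_fwd, reverse=False) * gate \
          + self.branch(x, self.conv_bwd, self.ssm_bwd, reverse=True) * gate
        return tokens + self.out_proj(y)                  # final projection + residual

block = VimBlockSketch(dim=192)
print(block(torch.randn(2, 197, 192)).shape)              # torch.Size([2, 197, 192])
```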


Why Vim is Efficient

Three main optimizations make Vim stand out:

  1. Computational Efficiency:
    Self-attention scales as \( O(M^2 D) \), quadratic in the sequence length \( M \) (with hidden dimension \( D \)), while Vim's SSM scales as \( O(M D N) \), linear in \( M \) with a small, fixed state size \( N \). A back-of-envelope comparison follows this list.

  2. Memory Efficiency:
    Vim avoids storing massive attention matrices, keeping memory use linear with sequence length.

  3. IO Efficiency:
    Parameters are loaded once into fast GPU SRAM, computations run there, and only final outputs are written back—minimizing data transfer bottlenecks.
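A quick back-of-envelope script (ours, using only the dominant complexity terms above with illustrative values of \( D \) and \( N \)) shows how the gap widens as the token count grows:

```python
# Rough scaling comparison using only the dominant terms: M^2 * D for self-attention
# vs. M * D * N for the SSM. D and N are illustrative choices.
D, N = 192, 16
for M in (196, 784, 3136, 12544):   # token counts for 224^2, 448^2, 896^2, 1792^2 images at 16x16 patches
    attn = M * M * D
    ssm = M * D * N
    print(f"M={M:>6}: attention ~{attn:.2e}, SSM ~{ssm:.2e}, ratio ~{attn / ssm:.0f}x")
```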


Experimental Results

The paper benchmarks Vim against leading models across vision tasks.

ImageNet Classification

Table 1: ImageNet top-1 accuracy of ConvNets, Transformers, and SSM-based models. Vim consistently outperforms DeiT models of similar size, and long-sequence fine-tuning boosts accuracy further.

With “long sequence fine-tuning” (smaller patch stride), Vim captures finer detail, improving accuracy by 1–2 points. Vim-S† reaches 81.4%—matching much larger Transformer models.


High-Resolution Efficiency

Figure 3: Inference speed (FPS) versus image resolution. Speeds are similar at low resolution, but Vim's advantage grows as images get larger.

Figure 4: GPU memory versus image resolution. DeiT's memory consumption grows quadratically and eventually hits an out-of-memory (OOM) error, while Vim's grows linearly and stays manageable at high resolution.

At 1248×1248 resolution, Vim runs 2.8× faster and uses 86.8% less GPU memory than DeiT, which crashes due to OOM.


Downstream Tasks: Segmentation and Detection

On semantic segmentation (ADE20K), Vim outperforms DeiT:

Table 2: Semantic segmentation on ADE20K, measured in mean Intersection over Union (mIoU). Vim outperforms DeiT, and Vim-S matches ResNet-101's mIoU with far fewer parameters.

On object detection and instance segmentation (COCO), Vim-Ti surpasses DeiT-Ti by notable margins:

Table 3: Object detection and instance segmentation on COCO. Vim-Ti achieves higher AP than DeiT-Ti, even without 2D-specific priors such as windowed attention.

Vim handles high-resolution detection without architectural changes, whereas DeiT needs windowed attention to cope. Qualitative results show Vim capturing large objects that DeiT misses:

Figure 5: Vim-Ti segments a large B-29 bomber completely, while DeiT-Ti captures only part of it; superior long-range context lets Vim detect large objects more fully.


Ablation Studies

Bidirectionality Matters: Removing bidirectionality harms segmentation badly (−3.6 mIoU). The full Bidirectional SSM + Conv1d design yields the best scores:

Table 4: Ablation of bidirectional strategies. The full Bidirectional SSM + Conv1d design achieves the best scores across benchmarks; bidirectional processing is essential for dense prediction tasks.

Positioning the [CLS] Token: Surprisingly, placing it in the middle of the sequence performs best (76.1% accuracy), since the token can then aggregate context from both the forward and backward passes:

Table 5: Ablation of classification design. Placing the class token in the middle of the sequence yields the highest accuracy, as it maximizes aggregation of forward and backward context.


Conclusion: A New Contender Emerges

The Vision Mamba paper makes a strong case for SSM-based vision backbones. By adapting Mamba with bidirectional processing and position embeddings, Vim:

  1. Matches or exceeds highly optimized Transformers (e.g., DeiT) across diverse tasks.
  2. Scales linearly in speed and memory with sequence length—ideal for high-resolution inputs.
  3. Maintains pure sequential modeling, avoiding heavy 2D priors.

This efficiency unlocks applications previously impractical with Transformers: gigapixel medical slides, massive satellite images, or long video streams—processed end-to-end, without complex tiling.

The paper shows that self-attention’s dominance in high-performance vision is not absolute. With Vim, the future of computer vision backbones might well be written in the language of State Space Models.