For the past decade, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and, more recently, Vision Transformers (ViTs). CNNs are celebrated for their efficiency and strong inductive biases toward local patterns, while ViTs, powered by the self-attention mechanism, excel at capturing global relationships in images.

However, this power comes at a cost — the self-attention mechanism has quadratic complexity (\(O(N^2)\)) with respect to the number of image patches, making ViTs computationally expensive and slow, especially for high-resolution images common in tasks like object detection and segmentation.

What if there were a third way? An architecture that could combine the global context modeling of Transformers with the linear-time efficiency of CNNs?

Researchers in Natural Language Processing (NLP) have been exploring this with a new class of models known as State Space Models (SSMs). One of the most promising recent developments is Mamba, an SSM that achieves remarkable performance on long-sequence language tasks — all while maintaining linear complexity.

Inspired by Mamba’s breakthrough, the paper “VMamba: Visual State Space Model” adapts this powerful NLP architecture for computer vision. The researchers introduce VMamba, a new vision backbone that aims to achieve the best of both worlds:

  • A global receptive field and dynamic, content-aware weights like Transformers.
  • Efficiency and scalability akin to CNNs.

This article explores how VMamba works in depth, breaking down the core concepts, novel architecture, and its impressive benchmark results.


Background: From Convolutions to State Spaces

To understand VMamba’s novelty, it’s worth recapping the current landscape of vision models and introducing the state space concept.

CNNs vs. Vision Transformers

CNNs (e.g., ResNet, ConvNeXt) build hierarchical representations using learnable filters (kernels) applied across the image. They efficiently capture local features — like edges or textures — but rely on multiple layers to model long-range dependencies.

ViTs treat images as sequences of patches and apply self-attention, allowing all patches to directly interact with each other from the first layer. This yields a global receptive field but at quadratic computational cost, which becomes prohibitive for large images.

Comparison of self-attention and the new 2D-Selective-Scan (SS2D)

Figure 1: Self-attention (a) establishes all-to-all connectivity between patches, achieving global context but at high computational cost. SS2D (b) connects patches via structured scan paths, achieving global coverage more efficiently.


A Primer on State Space Models (SSMs)

SSMs originate from control theory, modeling systems that evolve over time. A continuous-time SSM can be expressed as:

\[ \mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}u(t) \]

\[ y(t) = \mathbf{C}\mathbf{h}(t) + \mathbf{D}\,u(t) \]

Where:

  • \(u(t)\): Input signal
  • \(\mathbf{h}(t)\): Hidden state summarizing past inputs
  • \(y(t)\): Output signal
  • \(\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}\): Learnable parameters; \(\mathbf{A}\) governs the state dynamics, \(\mathbf{B}\) the input, \(\mathbf{C}\) the readout, and \(\mathbf{D}\) a direct skip connection

For discrete data such as text tokens or image patches, the continuous system is discretized (commonly via a zero-order hold) into a recurrence:

\[ \mathbf{h}_k = \overline{\mathbf{A}}\,\mathbf{h}_{k-1} + \overline{\mathbf{B}}\,u_k \]

\[ y_k = \mathbf{C}\,\mathbf{h}_k + \mathbf{D}\,u_k \]

This recurrence is highly efficient at inference time, much like an RNN, and because the time-invariant form is equivalent to a convolution over the sequence, training can still be parallelized.
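
To make this concrete, here is a minimal NumPy sketch of the recurrence for a single input channel. The function and variable names mirror the symbols above and are purely illustrative; this is not code from the paper:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, D, u):
    """Run the discretized SSM recurrence over a 1D input sequence.

    A_bar: (N, N) discrete state matrix
    B_bar: (N,)   discrete input vector (single input channel)
    C:     (N,)   readout vector
    D:     float  direct skip connection
    u:     (L,)   input sequence
    """
    h = np.zeros(A_bar.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        h = A_bar @ h + B_bar * u_k   # h_k = A_bar h_{k-1} + B_bar u_k
        y[k] = C @ h + D * u_k        # y_k = C h_k + D u_k
    return y

# Toy usage with random parameters and a random length-10 signal.
rng = np.random.default_rng(0)
N, L = 4, 10
out = ssm_scan(0.1 * rng.normal(size=(N, N)), rng.normal(size=N),
               rng.normal(size=N), 0.5, rng.normal(size=L))
```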

However, traditional SSMs use fixed matrices (Linear Time-Invariant systems). This limits adaptability to different contexts. Mamba addressed this by introducing a Selective Scan Mechanism (S6), where parameters are generated dynamically based on input content. This lets the model decide what to retain or discard in memory, enabling richer representations.
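
The selective mechanism is easiest to see in code. Below is a heavily simplified, single-channel sketch of the idea: B, C, and the step size Delta are regenerated from every token, so the state update itself adapts to content. The projection matrices W_B, W_C, and w_delta are illustrative stand-ins; the actual Mamba/VMamba implementations fuse all of this into a hardware-aware parallel scan:

```python
import numpy as np

def selective_scan_1channel(x, u, A, W_B, W_C, w_delta):
    """Toy selective scan (S6) for a single feature channel.

    x:       (L,)    scalar input stream for this channel
    u:       (L, d)  full tokens, from which per-step parameters are generated
    A:       (N,)    fixed diagonal state matrix
    W_B:     (d, N)  produces a token-dependent B_k
    W_C:     (d, N)  produces a token-dependent C_k
    w_delta: (d,)    produces a token-dependent, positive step size Delta_k
    """
    h = np.zeros(A.shape[0])
    y = np.empty(len(x))
    for k in range(len(x)):
        delta = np.logaddexp(0.0, u[k] @ w_delta)  # softplus keeps the step size positive
        A_bar = np.exp(delta * A)                  # zero-order-hold discretization of diagonal A
        B_bar = delta * (u[k] @ W_B)               # simplified discretization of B
        h = A_bar * h + B_bar * x[k]               # content-dependent state update
        y[k] = (u[k] @ W_C) @ h                    # content-dependent readout
    return y
```

A small Delta_k lets the state coast past a token almost unchanged, while a large one overwrites it with fresh input, which is what "deciding what to retain or discard" means operationally.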


The Core Innovation: Building VMamba

The challenge in adapting Mamba for vision lies in moving from 1D sequential data (text) to 2D spatial data (images). Images have no inherent sequential order. VMamba addresses this with the 2D-Selective-Scan (SS2D) module.


VMamba Architecture

VMamba adopts a hierarchical design similar to Swin Transformer and ConvNeXt:

  1. A stem module divides the input image into small patches.
  2. Network stages, each containing multiple Visual State Space (VSS) blocks.
  3. Down-sampling layers between stages reduce spatial resolution and increase channel depth (a minimal code skeleton of this layout is sketched after Figure 3).

The architecture of VMamba and evolution of its core block

Figure 3: (a) VMamba’s hierarchical architecture. (b-d) Evolution from Mamba Block to Vanilla VSS Block to the final optimized VSS Block.
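
In code, this hierarchical layout reduces to a patch-embedding stem, stacked stages, and down-sampling layers in between. The sketch below is schematic: the widths and depths are plausible defaults rather than the paper's exact configuration, and the VSS block is left as a placeholder until the next subsection:

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Placeholder for the Visual State Space block (its internals are sketched below)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Identity()

    def forward(self, x):
        return self.body(x)

class VMambaLikeBackbone(nn.Module):
    """Schematic hierarchical backbone: stem -> stages of VSS blocks -> down-sampling."""
    def __init__(self, in_chans=3, dims=(96, 192, 384, 768), depths=(2, 2, 9, 2)):
        super().__init__()
        # Stem: embed non-overlapping 4x4 patches.
        self.stem = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)
        self.stages = nn.ModuleList(
            nn.Sequential(*[VSSBlock(d) for _ in range(n)]) for d, n in zip(dims, depths)
        )
        # Between stages: halve the spatial resolution, increase the channel count.
        self.downsamples = nn.ModuleList(
            nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2) for i in range(len(dims) - 1)
        )

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.stem(x)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.downsamples):
                x = self.downsamples[i](x)
        return x                                # (B, dims[-1], H/32, W/32)

features = VMambaLikeBackbone()(torch.randn(1, 3, 224, 224))  # -> (1, 768, 7, 7)
```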

The final VSS block mirrors a Transformer block structure:

  • Normalization → SS2D Module → Residual Connection → Normalization → MLP → Residual Connection.

SS2D replaces self-attention with a state space mechanism adapted to 2D data.
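
A schematic PyTorch rendering of that block might look as follows, and would replace the placeholder used in the backbone sketch above. The SS2D stand-in here is just a linear layer so the sketch runs end to end, and the 4x MLP expansion is a common Transformer default rather than a detail taken from the paper:

```python
import torch
import torch.nn as nn

class SS2D(nn.Module):
    """Stand-in for the 2D-Selective-Scan module (sketched in the next section);
    here it is just a linear layer so the block structure is runnable."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C), channels last
        return self.proj(x)

class VSSBlock(nn.Module):
    """Schematic VSS block: pre-norm SS2D with a residual, then a pre-norm MLP with a residual."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ss2d = SS2D(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                      # x: (B, H, W, C)
        x = x + self.ss2d(self.norm1(x))       # token mixing via the state space scan
        x = x + self.mlp(self.norm2(x))        # channel mixing via the MLP
        return x

out = VSSBlock(96)(torch.randn(2, 56, 56, 96))  # shape preserved: (2, 56, 56, 96)
```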


2D-Selective-Scan (SS2D)

The SS2D pipeline (Figure 2) unfolds the 2D feature map into multiple 1D sequences, processes them via Mamba-style S6 blocks, and merges them back:

The three-step process of 2D-Selective-Scan (SS2D)

Figure 2: Step 1 — Cross-Scan: Extract four directional sequences from the 2D map. Step 2 — Each sequence processed via an S6 block. Step 3 — Cross-Merge back into a 2D feature map.

Steps:

  1. Cross-Scan: Unfold the feature map along four traversal paths:

    • Row-wise, forward: top-left to bottom-right
    • Row-wise, reverse: bottom-right to top-left
    • Column-wise, forward: top-left to bottom-right, scanning down each column
    • Column-wise, reverse: bottom-right to top-left
  2. Selective Scanning (S6): Each sequence is processed independently by an S6 block, selectively accumulating context along its traversal.
  3. Cross-Merge: The four outputs are merged and reshaped back to the original spatial layout.

This structured scan method yields global context with linear complexity.
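
To make the data movement tangible, here is a plain-PyTorch sketch of Cross-Scan and Cross-Merge, realized as row-major and column-major traversals plus their reverses. VMamba implements these steps as fused Triton kernels (see the next subsection), so treat this purely as an illustration:

```python
import torch

def cross_scan(x):
    """Unfold a 2D feature map into four 1D scan orders: row-major, column-major,
    and their reverses.

    x: (B, C, H, W)  ->  (B, 4, C, H*W)
    """
    row_major = x.flatten(2)                           # rows left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)           # columns top-to-bottom, left-to-right
    seqs = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, L)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)     # append the reversed traversals

def cross_merge(y, H, W):
    """Fold the four processed sequences back onto the 2D grid and sum them.

    y: (B, 4, C, H*W)  ->  (B, C, H, W)
    """
    B, _, C, _ = y.shape
    y = y[:, :2] + y[:, 2:].flip(-1)                     # align reversed paths with the forward ones
    row_part = y[:, 0].view(B, C, H, W)
    col_part = y[:, 1].view(B, C, W, H).transpose(2, 3)  # undo the transpose of the column-major path
    return row_part + col_part

# Sanity check: with an identity "S6", merging the scans recovers the input (times 4 paths).
x = torch.randn(1, 8, 7, 5)
assert torch.allclose(cross_merge(cross_scan(x), 7, 5), 4 * x)
```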


Acceleration Optimizations

The initial VMamba was accurate but too slow for practical deployment. The authors applied several optimizations:

  • Reimplemented Cross-Scan/Cross-Merge in Triton for GPU efficiency.
  • Replaced einsum operations with optimized linear layers (see the small example below).
  • Tuned hyperparameters such as d_state and ssm-ratio for better throughput.
  • Removed the multiplicative branch to streamline computation.
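
The einsum-to-linear substitution mentioned in the list above amounts to the kind of rewrite below. The shapes are arbitrary; the point is that the two formulations are numerically equivalent, while nn.Linear dispatches to heavily optimized GEMM kernels:

```python
import torch
import torch.nn as nn

B, L, C_in, C_out = 2, 196, 96, 192
x = torch.randn(B, L, C_in)
weight = torch.randn(C_out, C_in)

# Per-token projection written as an einsum ...
y_einsum = torch.einsum("blc,dc->bld", x, weight)

# ... and the equivalent nn.Linear layer.
linear = nn.Linear(C_in, C_out, bias=False)
with torch.no_grad():
    linear.weight.copy_(weight)
y_linear = linear(x)

assert torch.allclose(y_einsum, y_linear, atol=1e-5)
```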

Experimental Results

Image Classification — ImageNet-1K

VMamba was tested against popular CNNs and Transformers.

Performance comparison on ImageNet-1K

Table 1: VMamba models outperform Swin and ConvNeXt counterparts at similar scales, with superior throughput.

Example: VMamba-T achieves 82.6% top-1 accuracy, surpassing Swin-T by 1.3 percentage points while running 35% faster.


Downstream Tasks

VMamba’s strengths carry over to dense prediction tasks:

Results for object detection and semantic segmentation

Table 2: COCO object detection and ADE20K segmentation show significant gains over Swin and ConvNeXt.

On COCO object detection:

  • VMamba-T beats Swin-T by +4.6 mAP.

On ADE20K segmentation:

  • VMamba-T outperforms Swin-T by +3.4 mIoU.

The performance plots in Figure 4(a) confirm this consistent lead:

VMamba’s adaptability and resolution stability

Figure 4: (a) VMamba yields better dense task performance for a given ImageNet accuracy. (b) Accuracy degradation is gentler as input resolution increases.


Linear Scaling Advantage

Vision Transformers’ quadratic scaling hampers high-res performance. VMamba, with SS2D, scales linearly — like CNNs:

Scaling of FLOPs, throughput, and memory

Figure 5: FLOPs/memory grow linearly with resolution, enabling high throughput even for large inputs.
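
A quick back-of-the-envelope calculation shows why this matters. Assuming a 4x4 patch embedding (a common choice for hierarchical backbones) and ignoring constant factors, channel widths, and later-stage down-sampling, the token count N and the relative attention-versus-scan costs diverge rapidly with resolution:

```python
# Illustrative token counts only; real FLOP counts depend on channel widths and stage layout.
for resolution in (224, 512, 1024):
    n_tokens = (resolution // 4) ** 2
    print(f"{resolution:>4} px  N = {n_tokens:>6}   "
          f"attention ~ N^2 = {n_tokens ** 2:.2e}   scan ~ N = {n_tokens:.2e}")
```

At a 1024 x 1024 input, N^2 is roughly 65,000 times larger than N, which is exactly the gap a linear-time scan avoids.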


Under the Hood: Interpretability

Effective Receptive Field (ERF)

Before training, most models show local ERFs. Post-training, VMamba develops a global ERF, similar to ViTs, but with more uniform 2D coverage:

ERF comparison before and after training

Figure 7: VMamba achieves a global ERF after training, enabling better long-range feature aggregation.
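
A common way to probe an ERF, and a reasonable approximation of what such figures show, is to back-propagate from the central output feature and inspect the gradient magnitude over the input. The sketch below assumes a model that returns a (B, C, H, W) feature map; it is not the paper's exact measurement protocol:

```python
import torch
import torch.nn as nn

def effective_receptive_field(model, input_size=224, channels=3):
    """Estimate an ERF map as the input-gradient magnitude of the central output unit."""
    x = torch.randn(1, channels, input_size, input_size, requires_grad=True)
    feat = model(x)                              # assumed shape: (1, C, H, W)
    _, _, H, W = feat.shape
    feat[0, :, H // 2, W // 2].sum().backward()  # back-propagate from the centre position
    return x.grad.abs().sum(dim=1)[0]            # (input_size, input_size) sensitivity map

# Toy usage with a small convolution; a trained backbone would be passed in the same way.
erf = effective_receptive_field(nn.Conv2d(3, 8, kernel_size=7, stride=4, padding=3), input_size=64)
```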


Activation Maps

The internal responses of SS2D can be visualized analogously to attention maps.
For example, a query patch on a sheep highlights relevant regions along its scan paths, showing that the model retains distant parts of the object.

Activation maps for a query patch

Figure 6: VMamba effectively attends to all relevant object regions, leveraging scan paths to accumulate context.


Conclusion & Outlook

VMamba represents a paradigm shift in vision backbones by leveraging state space modeling for 2D spatial data:

  • Linear Scaling: Efficient for high-res images, on par with CNNs.
  • Global Context: From initial layers, akin to Transformers.
  • Top-tier Performance: Consistent wins across classification, detection, and segmentation.

Future directions include:

  • Unsupervised pre-training for SSM-based architectures.
  • Exploring generalized scanning patterns for diverse modalities.
  • Scaling VMamba to larger architectures.

As demand grows for efficient, powerful vision models, architectures like VMamba may redefine the landscape — making room for a third titan alongside CNNs and Transformers.