For years, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs excel at capturing local features through sliding convolutional filters, while ViTs leverage self-attention to model global relationships across an entire image.

Now, a new contender has emerged from the world of sequence modeling: the State Space Model (SSM), and in particular its modern, high-performing variant, Mamba.

Mamba has shown remarkable prowess in handling long 1D sequences such as text and genomics, offering linear-time complexity and impressive performance. Naturally, researchers sought to bring its advantages to vision tasks. However, initial attempts such as Vision Mamba (Vim) and VMamba, while promising, have not decisively surpassed CNNs and ViTs. This raises a critical question:

Why has a model so powerful for 1D data struggled to unlock its full potential on 2D images?

A recent paper, “LocalMamba: Visual State Space Model with Windowed Selective Scan”, argues that the answer lies not in the model itself, but in how we feed it visual information. The core challenge is a mismatch: images are 2D grids of pixels with strong local correlations, while Mamba is a 1D sequence processor. Flattening an image into a long sequence breaks these crucial local relationships.

LocalMamba introduces a clever fix:

  1. A windowed selective scan that preserves local 2D dependencies.
  2. An automatic search for the best scan patterns per layer.

This isn’t just an incremental improvement — it fundamentally rethinks how SSMs should see an image, leading to significant gains across classification, detection, and segmentation tasks.


Roadmap

In this article, we’ll explore:

  • The basics of State Space Models and their efficiency.
  • The core problem of applying 1D sequence models to 2D images.
  • LocalMamba’s elegant solution: local scanning + automated direction search.
  • The experimental results proving this approach is a major step forward for vision SSMs.

Background: From Convolutions to State Spaces

A Primer on State Space Models (SSMs)

State Space Models originate from control theory, adapted into deep learning for sequence modeling. They describe the evolution of a sequence via an intermediate latent state:

\[ h'(t) = \boldsymbol{A}h(t) + \boldsymbol{B}x(t) \]

\[ y(t) = \boldsymbol{C}h(t) \]

Here:

  • \(x(t)\): input at time t
  • \(y(t)\): output at time t
  • \(h(t)\): hidden state
  • \(\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}\): matrices defining system dynamics.

When discretized:

\[ h_t = \bar{A} h_{t-1} + \bar{B} x_t \]

\[ y_t = C h_t \]

This recurrence resembles an RNN, but because the parameters of a classical SSM are fixed over time, the whole computation can be unrolled into a single global convolution and trained in parallel:

\[ y = x \circledast \overline{K} \]

where:

\[ \overline{K} = (C\overline{B}, C\overline{AB}, \dots, C \overline{A}^{L-1}\overline{B}) \]

This gives linear scaling with sequence length and CNN-like, parallel training efficiency.
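To make the two views concrete, here is a minimal NumPy sketch (scalar input/output and tiny sizes chosen purely for illustration) that computes the same output once with the recurrence and once with the convolution kernel \(\overline{K}\):

```python
# Minimal sketch of the two equivalent views of a discretized SSM:
# the step-by-step recurrence and the global convolution with kernel K_bar.
import numpy as np

L, N = 16, 4                         # sequence length, state size
rng = np.random.default_rng(0)
A_bar = 0.9 * np.eye(N)              # discretized state matrix (N, N)
B_bar = rng.normal(size=(N, 1))      # discretized input matrix (N, 1)
C = rng.normal(size=(1, N))          # output matrix (1, N)
x = rng.normal(size=L)               # input sequence of length L

# Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t,   y_t = C h_t
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional view: y = x * K_bar, with K_bar[k] = C A_bar^k B_bar
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
                  for k in range(L)])
y_conv = np.array([sum(K_bar[k] * x[t - k] for k in range(t + 1))
                   for t in range(L)])

assert np.allclose(y_rec, y_conv)    # both views produce the same output
```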

Mamba: Making SSMs Selective

Traditional SSMs use static parameters (\(\bar{A}, \bar{B}, C\)) regardless of the input. Mamba introduces input-dependent parameterization (“selective scan” or S6), generating \(\boldsymbol{B}, \boldsymbol{C}\), and a timescale parameter \(\Delta\) from the sequence itself.

This dynamic adaptability lets Mamba selectively store or forget information based on content, achieving state-of-the-art results in language modeling.
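The selection mechanism is easiest to see in code. The sketch below is a deliberately naive, sequential version: the class and projection names (`SelectiveSSMSketch`, `proj_B`, `proj_C`, `proj_dt`), the softplus on \(\Delta\), and the simplified discretization are assumptions for illustration, not Mamba's optimized parallel kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMSketch(nn.Module):
    """Illustrative selective (input-dependent) SSM with a naive sequential scan."""
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # A = -exp(A_log)
        self.proj_B = nn.Linear(d_model, d_state)   # B generated from the input
        self.proj_C = nn.Linear(d_model, d_state)   # C generated from the input
        self.proj_dt = nn.Linear(d_model, d_model)  # timescale Delta from the input

    def forward(self, x):                        # x: (B, L, d_model)
        Bsz, L, D = x.shape
        A = -torch.exp(self.A_log)               # (D, N), negative for stability
        Bmat = self.proj_B(x)                    # (B, L, N), input-dependent
        Cmat = self.proj_C(x)                    # (B, L, N), input-dependent
        dt = F.softplus(self.proj_dt(x))         # (B, L, D), input-dependent step
        h = x.new_zeros(Bsz, D, A.shape[1])      # hidden state (B, D, N)
        ys = []
        for t in range(L):                       # sequential scan for clarity
            A_bar = torch.exp(dt[:, t].unsqueeze(-1) * A)                # (B, D, N)
            B_bar = dt[:, t].unsqueeze(-1) * Bmat[:, t].unsqueeze(1)     # (B, D, N)
            h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))             # (B, D)
        return torch.stack(ys, dim=1)            # (B, L, D)
```

Because \(\bar{A}\), \(\bar{B}\), and \(C\) now change per token, the hidden state can emphasize or discard content on the fly, which is exactly the "selective" behavior described above.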


The Core Problem: A Square Peg in a Round Hole

Mamba works on sequences. But images are 2D grids — how do we feed them to a sequence model?

The straightforward method: flatten the image into patches arranged in a single sequence.

Illustration of different scan methods. The left shows a grid over an image of a snake. Panel (a) shows a row-by-row scan. Panel (b) shows a column scan. Panel (c) shows LocalMamba’s windowed scan, keeping neighboring pixels close.

Figure 1: Previous methods like Vim (a) flatten images row by row, greatly separating vertically adjacent pixels. VMamba (b) uses cross-scans but still processes distant regions in the same scan. LocalMamba (c) partitions into windows and scans within them, preserving local relationships.

Flattening destroys spatial locality: neighboring pixels may become hundreds of steps apart in a 1D sequence. The SSM is then forced to learn local patterns that CNNs detect effortlessly.
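A toy calculation makes this concrete; the 14 x 14 grid below is just an example (e.g. a 224 x 224 image split into 16 x 16 patches), not a value from the paper.

```python
# Under row-major flattening, vertically adjacent patches end up a full row apart.
grid_h, grid_w = 14, 14

def flat_index(row, col, width=grid_w):
    """Position of patch (row, col) after row-by-row flattening."""
    return row * width + col

# Horizontally adjacent patches stay adjacent...
print(flat_index(4, 4) - flat_index(4, 3))   # 1
# ...but vertically adjacent patches are 14 steps apart,
# and the gap grows as the token grid gets finer.
print(flat_index(5, 3) - flat_index(4, 3))   # 14
```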

VMamba improves matters by cross-scanning both horizontally and vertically, but each scan still traverses the entire image, so distant regions are mixed into the same sequence and local detail is only partially preserved.


The LocalMamba Method: Seeing in Windows

LocalMamba addresses this with:

  1. Local Scanning: Preserving neighborhoods.
  2. Spatial & Channel Attention (SCAttn): Merging features intelligently.
  3. Automated Scan Direction Search: Finding optimal scans for each layer.

1. Local Scan: Preserving Neighborhoods

Instead of scanning the entire image in one go, LocalMamba divides it into non-overlapping windows (e.g., \(2 \times 2\) or \(7 \times 7\) tokens) and scans each window individually.

Tokens close in the original image remain close in the processed sequence — enabling efficient capture of fine local details.
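As a rough sketch of what such a reordering could look like (the window size, tensor layout, and divisibility assumption are illustrative, not the authors' code):

```python
import torch

def local_scan_order(x, window=7):
    """Flatten a (B, C, H, W) feature map so that all tokens inside each
    non-overlapping window-by-window block appear consecutively.
    Assumes H and W are divisible by `window`; a real implementation
    would pad or handle remainders."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // window, window, W // window, window)
    # (B, C, nH, win, nW, win) -> (B, C, nH, nW, win, win): group by window first
    x = x.permute(0, 1, 2, 4, 3, 5).contiguous()
    return x.view(B, C, -1)          # (B, C, H*W), window-local token order

def global_scan_order(x):
    """Plain row-major flattening, as in Vim-style global scans."""
    B, C, H, W = x.shape
    return x.view(B, C, -1)

tokens = torch.arange(16.0).view(1, 1, 4, 4)
print(global_scan_order(tokens).flatten().tolist())
# [0, 1, 2, 3, 4, 5, ...]                                    row by row
print(local_scan_order(tokens, window=2).flatten().tolist())
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]     window by window
```

In the window-local order, the four tokens of each \(2 \times 2\) block sit next to each other in the sequence, which is exactly the locality the SSM needs.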

To retain global context, the LocalMamba block uses four parallel scans:

  • Standard global scan
  • Local windowed scan
  • Flipped versions of both (reverse order)

The architecture of a LocalVim block and the SCAttn module. Panel (a) shows four parallel scan branches feeding into an SSM and SCAttn module. Panel (b) details the SCAttn module, which uses channel and spatial attention to weigh features.

Figure 2: (a) The LocalVim block: four scan branches processed in parallel. (b) The Spatial and Channel Attention (SCAttn) module merges these features, weighting channels and spatial positions for optimal fusion.


2. SCAttn: Fusing Scans Intelligently

With multiple scans producing different features, the challenge is combining them without losing important cues.

SCAttn solves this by learning:

  • Channel attention: Which feature channels are most informative.
  • Spatial attention: Which tokens (patches) are most salient.

This ensures both local and global information are merged into a rich, unified representation.
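A plausible minimal sketch of such a gating module is shown below; the reduction ratio, layer choices, and the shared gate across branches are assumptions in the spirit of the description, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SCAttnSketch(nn.Module):
    """Illustrative spatial-and-channel attention gate for fusing scan branches."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(          # which channels matter
            nn.Linear(dim, dim // reduction), nn.SiLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(           # which tokens matter
            nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x):                            # x: (B, L, dim), one scan branch
        c = self.channel_gate(x.mean(dim=1, keepdim=True))   # (B, 1, dim)
        s = self.spatial_gate(x)                              # (B, L, 1)
        return x * c * s                                      # gated features

# Fusing four scan branches (global, local, and their flipped variants):
dim = 96
fuse = SCAttnSketch(dim)
branches = [torch.randn(2, 196, dim) for _ in range(4)]
out = sum(fuse(b) for b in branches)                 # (B, L, dim) merged features
```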


3. Automated Scan Direction Search: Optimal Scans per Layer

Which scan directions best suit each layer: horizontal, vertical, small local windows, or large local windows? The optimal choice likely varies throughout the model.

Manual design would be tedious and suboptimal. Inspired by DARTS, LocalMamba uses a differentiable search over 8 candidate scan directions:

  • Horizontal
  • Vertical
  • \(2 \times 2\) local
  • \(7 \times 7\) local
  • …and flipped versions of each.

Each layer learns a weighted sum over all candidates:

\[ \boldsymbol{y}^{(l)} = \sum_{s \in \mathcal{S}} \frac{\exp(\alpha_s^{(l)})}{\sum_{s' \in \mathcal{S}} \exp(\alpha_{s'}^{(l)})} \mathrm{SSM}_s(\boldsymbol{x}^{(l)}) \]

After training the “supernet”, the top 4 directions per layer are retained for the final model.
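The relaxation in the equation above, and the final pruning to the top 4 directions, can be sketched as follows; the `nn.Identity` placeholders simply stand in for per-direction SSM branches, which is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ScanDirectionSearchSketch(nn.Module):
    """DARTS-style relaxation over candidate scan directions for one layer."""
    def __init__(self, candidate_ssms):
        super().__init__()
        self.candidates = nn.ModuleList(candidate_ssms)
        # one architecture weight alpha_s per candidate direction
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ssms)))

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)       # softmax over alphas, as in the equation
        return sum(w * ssm(x) for w, ssm in zip(weights, self.candidates))

    def top_directions(self, k=4):
        """After supernet training, keep the k highest-weighted directions."""
        return torch.topk(self.alpha, k).indices.tolist()

# 8 candidates: horizontal, vertical, 2x2 local, 7x7 local, and flipped versions.
layer = ScanDirectionSearchSketch([nn.Identity() for _ in range(8)])
y = layer(torch.randn(2, 196, 96))
print(layer.top_directions())        # indices of the 4 directions kept for the final model
```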

Visualization of the scan directions chosen by the automated search for different models.

Figure 3: Search patterns for LocalVim. Local scans (orange, red) dominate early and late layers, while global scans (blue, green) appear more in the middle.


Experiments: Putting LocalMamba to the Test

The authors evaluate two model families:

  • LocalVim: Plain (uniform) structure like the original Vim.
  • LocalVMamba: Hierarchical structure like Swin Transformer.

ImageNet Classification

Table comparing different backbone models on ImageNet-1K classification accuracy.

Table 1: ImageNet-1K classification results. * marks models with local scan but without automated search.

LocalVim-T reaches 76.2%, a +3.1% jump over Vim-Ti with identical FLOPs (1.5G). Even the non-searched variant LocalVim-T* gets 75.8%, showing the local scan itself yields most of the improvement.

Bar chart showing the accuracy improvement of adding local scan to Vim models.

Figure 4: Adding local scan to Vim significantly boosts accuracy.


Object Detection on COCO

COCO object detection and segmentation results table.

Table 2: COCO results using Mask R-CNN. LocalVMamba beats VMamba and other strong backbones like Swin.

LocalVMamba-T improves box AP by +4.0 over Swin-T, with similar gains in mask AP.


Semantic Segmentation on ADE20K

ADE20K segmentation results table.

Table 3: ADE20K segmentation results. LocalVim-S surpasses Vim-S by +1.5 mIoU.

LocalVMamba-S reaches 51.0 mIoU (MS), setting a new benchmark for Mamba-based visual backbones.


Ablation Study

Ablation study table showing contributions of each component.

Table 4: Every component — local scan, combining scans, SCAttn — contributes to accuracy gains.

Step-by-step (each gain stacks on top of the previous configuration, from 73.1% up to the full 76.2%):

  1. Baseline (Vim-T): 73.1%
  2. Local Scan only: +1.0%
  3. Local + Global scans: +1.1%
  4. Add SCAttn: +0.6%
  5. Add Search: +0.4%

Conclusion

LocalMamba delivers a compelling answer to making SSMs effective for vision by solving the mismatch between 1D sequence processing and 2D image structure.

Key contributions:

  1. Windowed Local Scanning: Preserves spatial neighborhoods, enabling fine feature learning.
  2. Automated Direction Search: Finds optimal local/global scan mixes per layer.

The results are clear: substantial improvements over previous Vim and VMamba models, strong competition against leading CNNs and ViTs, and new state-of-the-art in certain settings.

For visual SSMs, how you scan is as important as what you scan.

By giving sequence models a more natural way to “see” images, LocalMamba points towards exciting new directions for adapting powerful architectures from other domains to vision challenges — blending efficiency with accuracy in ways that could reshape future deep learning backbones.