When you look at a photograph, your eyes don’t process every pixel with equal intensity. You instantly focus on the “important” parts—a person waving, a bright red car, or a cat sitting on a fence. This biological mechanism is what computer vision researchers call Salient Object Detection (SOD).

For years, the field of SOD has been a tug-of-war between two dominant architectures: Convolutional Neural Networks (CNNs) and Transformers. CNNs are efficient but struggle to understand the “whole picture” (limited receptive fields). Transformers are masters of global context but are computationally heavy, with complexity growing quadratically as image resolution increases.

Is there a third option? Enter Mamba, a State Space Model (SSM) that promises the global understanding of Transformers with the linear efficiency of CNNs.

In this post, we are diving deep into Samba, a novel framework presented at CVPR that adapts the Mamba architecture for general saliency detection. We will explore how the authors overcame the unique challenges of applying 1D sequence modeling to 2D images and achieved state-of-the-art results across various modalities.

The Problem: The 2D vs. 1D Mismatch

To understand why Samba is necessary, we first need to understand the limitations of current methods.

  1. CNNs: Great at local details (edges, textures) but struggle to connect distant parts of an image.
  2. Transformers: Use Self-Attention to connect every pixel to every other pixel. This captures global context perfectly but is incredibly slow and memory-hungry for high-resolution images.

Mamba (State Space Models) offers a compelling alternative. It models visual data as a sequence, processing information recursively. This allows it to remember long-range dependencies (global context) with linear computational complexity (\(O(N)\)).

However, Mamba was originally designed for text (1D sequences). Images are 2D grids. To use Mamba for vision, you have to “flatten” the image into a sequence of patches.

The Core Issue: Standard flattening methods (like scanning row-by-row) often break the spatial continuity of objects. If a “salient” object (like a dog) is sliced up and separated by long stretches of background pixels in the 1D sequence, the Mamba model “forgets” the object’s features by the time it encounters the next piece of it.
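
To make the mismatch concrete, here is a tiny NumPy sketch (not from the paper) showing how row-major flattening scatters a single object’s patches across the 1D sequence:

```python
import numpy as np

# A toy 6x6 patch grid; 1 marks patches belonging to a salient object.
grid = np.zeros((6, 6), dtype=int)
grid[1:5, 2] = 1           # a tall, thin object spanning four rows

# Row-major ("raster") flattening, as used by naive 1D scans.
sequence = grid.flatten()
object_positions = np.flatnonzero(sequence)
print(object_positions)    # [ 8 14 20 26]
# The object's patches end up 6 steps apart in the sequence, separated by
# background tokens, so a recurrent model must "remember" the object across
# long stretches of irrelevant patches.
```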

The Samba Architecture

The researchers propose Samba, a unified framework capable of handling standard RGB images, as well as complex tasks like RGB-D (Depth), RGB-T (Thermal), and Video SOD.

Figure 2. Overall architecture of the proposed Samba model for general SOD tasks.

As shown in Figure 2, the architecture follows a classic Encoder-Decoder structure but with Mamba DNA throughout:

  1. Encoder: A Siamese backbone based on Visual State Space (VSS) layers extracts multi-level features.
  2. Converter: A Multi-modal Fusion Mamba (MFM) block integrates extra data (like depth or thermal) if present.
  3. Decoder: This is where the magic happens. It uses two novel components—Saliency-Guided Mamba Blocks (SGMB) and Context-Aware Upsampling (CAU)—to reconstruct the high-precision saliency map.
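
To make the data flow easier to picture, here is a rough pseudo-PyTorch skeleton of that encoder–converter–decoder pipeline. All module names, feature shapes, and the exact wiring are illustrative assumptions, not the authors’ code:

```python
import torch.nn as nn

class SambaSketch(nn.Module):
    """Illustrative skeleton of the encoder-converter-decoder layout (simplified)."""
    def __init__(self, encoder, mfm, sgmb_blocks, cau_blocks, head):
        super().__init__()
        self.encoder = encoder                          # Siamese VSS backbone -> multi-level features
        self.mfm = mfm                                  # Multi-modal Fusion Mamba (used if an extra modality is given)
        self.sgmb_blocks = nn.ModuleList(sgmb_blocks)   # saliency-guided Mamba blocks
        self.cau_blocks = nn.ModuleList(cau_blocks)     # context-aware upsampling blocks
        self.head = head                                # projects decoder output to a 1-channel saliency map

    def forward(self, rgb, aux=None):
        feats = self.encoder(rgb)                       # e.g. [f1, f2, f3, f4], shallow to deep
        if aux is not None:                             # depth, thermal, or optical-flow input
            aux_feats = self.encoder(aux)               # Siamese: same backbone weights
            feats = [self.mfm(fr, fx) for fr, fx in zip(feats, aux_feats)]
        x = feats[-1]                                   # start decoding from the deepest level
        for sgmb, cau, skip in zip(self.sgmb_blocks, self.cau_blocks, reversed(feats[:-1])):
            x = sgmb(x)                                 # reorder + refine with saliency-guided scanning
            x = cau(x, skip)                            # learnable upsampling toward the skip's resolution
        return self.head(x)                             # final saliency map
```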

Understanding the Math Behind Mamba

Before dissecting the novel blocks, let’s briefly look at the mathematical foundation. Mamba builds on state space models (SSMs), which are rooted in continuous linear systems that map an input sequence \(x(t)\) to an output \(y(t)\) through a latent state \(h(t)\):

Equation 1: Continuous-time system

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
\]

To use this in deep learning, the system is discretized (broken into steps of size \(\Delta\)), allowing it to be expressed recursively:

Equation 2: Discretization parameters

\[
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
\]

Equation 3: Discrete-time recurrent form

\[
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
\]
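
As a sanity check on the discrete form, here is a minimal NumPy sketch of the recurrence with a scalar state and made-up parameter values (purely illustrative; real Mamba layers use learned, high-dimensional parameters):

```python
import numpy as np

# Toy scalar SSM parameters (illustrative values, not learned).
A, B, C, delta = -0.5, 1.0, 1.0, 0.1

# Zero-order-hold discretization, as in Equation 2.
A_bar = np.exp(delta * A)
B_bar = (A_bar - 1.0) / A * B        # scalar case of (ΔA)^{-1}(exp(ΔA) - I)·ΔB

x = np.array([1.0, 0.5, 0.0, 0.0])   # input sequence
h, ys = 0.0, []
for x_t in x:                        # Equation 3: h_t = A_bar·h_{t-1} + B_bar·x_t
    h = A_bar * h + B_bar * x_t
    ys.append(C * h)                 #             y_t = C·h_t
print(ys)
```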

Mamba’s key addition on top of this formulation is to make \(\Delta\), \(B\), and \(C\) depend on the input (the “selective” part of the selective scan, often called S6). The classic Visual State Space (VSS) block, used in the encoder, adapts this to vision: it processes features by splitting them into streams and running a Selective Scan (SS2D) module over them.

Figure 3. Diagram of the visual state space (VSS) block and selective scan (SS2D) module.

The SS2D module scans the flattened image along four fixed paths (row-wise and column-wise, each in both directions) to approximate global context. However, for Salient Object Detection, these fixed directions aren’t smart enough.
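
To see what those four fixed paths look like in code, here is a small NumPy sketch (illustrative, not the paper’s implementation) that generates the four scan orders for a patch grid:

```python
import numpy as np

H, W = 4, 4
idx = np.arange(H * W).reshape(H, W)   # patch indices laid out on the 2D grid

scan_paths = [
    idx.flatten(),            # row-wise,    top-left     -> bottom-right
    idx.flatten()[::-1],      # row-wise,    bottom-right -> top-left
    idx.T.flatten(),          # column-wise, top-left     -> bottom-right
    idx.T.flatten()[::-1],    # column-wise, bottom-right -> top-left
]
# Each path is a fixed 1D ordering of the same patches; SS2D runs a selective
# scan along every path and merges the results, regardless of where the
# salient object actually lies.
```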

Innovation 1: Rethinking the Scan with SGMB

The most significant contribution of this paper is the Saliency-Guided Mamba Block (SGMB).

The authors realized that for Mamba to detect objects effectively, the “scan” needs to stay inside the object as long as possible to maintain memory of its features.

The Spatial Neighboring Scanning (SNS) Algorithm

Standard scanning strategies are rigid. Look at Figure 1 below. Patterns (a), (b), and (c) are fixed geometric scans. If a salient object is irregularly shaped, these scans essentially chop the object into disconnected fragments in the 1D sequence.

Figure 1. Comparison between existing scanning strategies and our scanning strategy.

The authors propose Spatial Neighboring Scanning (SNS) (shown in Figure 1d). This algorithm treats the scanning process as a pathfinding problem. It tries to find the shortest path that traverses all salient patches while keeping them spatially close in the sequence.

How SNS works:

  1. It takes a “coarse” saliency map (a rough guess of where the object is).
  2. It scans row by row but dynamically decides whether to go “left-to-right” or “right-to-left” based on which end is closer to the salient pixels in the next row.
  3. This minimizes the “jump” between rows, keeping the object’s pixels continuous in the sequence.
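
Here is a simplified, hypothetical sketch of that row-by-row direction switching, driven by a binary coarse map (the real SNS algorithm in the paper may differ in its details):

```python
import numpy as np

def sns_like_order(coarse_mask):
    """Return a scan order over salient patches that alternates row direction
    so consecutive rows stay spatially close (a simplified, illustrative SNS)."""
    H, W = coarse_mask.shape
    order, prev_end_col = [], 0
    for r in range(H):
        cols = np.flatnonzero(coarse_mask[r])          # salient columns in this row
        if cols.size == 0:
            continue
        # Enter the row from whichever end is closer to where the last row ended.
        if abs(cols[0] - prev_end_col) > abs(cols[-1] - prev_end_col):
            cols = cols[::-1]
        order.extend(r * W + c for c in cols)
        prev_end_col = cols[-1]
    return np.array(order)

mask = np.array([[0, 1, 1, 0],
                 [0, 0, 1, 1],
                 [1, 1, 1, 0]])
print(sns_like_order(mask))   # [ 1  2  6  7 10  9  8]
```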

We can see this visualized for a skater in Figure 5. The scanning path (the line) adapts to the shape of the skater, ensuring the model processes the person as a continuous entity rather than as scattered fragments.

Figure 5. Scanning paths of salient regions generated by SNS.

The SGMB integrates this SNS strategy. It uses the coarse saliency map to generate these optimized indices (\(I_s\)) and reorders the features before feeding them into the Mamba block.

Figure 4. Diagram of the saliency guided Mamba block (SGMB).

By ensuring the 1D sequence respects the 2D spatial continuity of the object, the recurrent nature of Mamba becomes a strength rather than a weakness.
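
Mechanically, applying the SNS indices \(I_s\) amounts to gathering tokens into the new order, running the sequence model, and scattering the outputs back. The snippet below is an assumed illustration of that reorder-and-restore step (names and shapes are hypothetical):

```python
import torch

def reorder_and_restore(tokens, scan_idx, mamba_block):
    """tokens: (B, N, C) flattened patch features.
    scan_idx: (B, N) long tensor, a permutation of patch positions
    (e.g. salient patches first, in SNS order)."""
    idx = scan_idx.unsqueeze(-1).expand_as(tokens)      # (B, N, C)
    reordered = torch.gather(tokens, dim=1, index=idx)  # visit patches in SNS order
    processed = mamba_block(reordered)                  # any sequence model with (B, N, C) in/out
    restored = torch.empty_like(processed)
    restored.scatter_(dim=1, index=idx, src=processed)  # put tokens back in raster order
    return restored
```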

Innovation 2: Context-Aware Upsampling (CAU)

The second major issue the authors tackled is feature alignment. In a decoder, you typically need to merge high-level features (small resolution, rich semantic meaning) with low-level features (high resolution, edge details).

Most networks upscale the small features with simple interpolation (nearest-neighbor or bilinear). This is fast but “dumb”: it doesn’t learn anything and often leads to misalignment between the layers.

The authors propose Context-Aware Upsampling (CAU).

Figure 6. Diagram of the context-aware upsampling (CAU) method.

Instead of simple interpolation, CAU makes the upsampling process learnable and dependent on context:

  1. Patch Pairing: It pairs patches from the high-level features (\(f_{i+1}\)) with their corresponding spatial neighbors in the low-level features (\(f_i\)).
  2. Sequence Modeling: It concatenates these pairs into a sequence and feeds them into an S6 (Mamba) block.
  3. Causal Prediction: Because Mamba is a causal model (prediction depends on history), the S6 block learns to predict the distribution of the high-resolution features based on the low-resolution context.

This results in upsampled features that are semantically aligned with the high-resolution map, significantly sharpening the boundaries of the detected objects.
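
A loose sketch of the pairing idea follows. The patch grouping, the ordering within the sequence, and the choice to keep only the fine-grid positions are assumptions made for illustration; the paper’s exact construction may differ:

```python
import torch
import torch.nn.functional as F

def cau_like_pairing(f_high, f_low, s6_block, scale=2):
    """Illustrative pairing: f_high is (B, C, H, W) deep semantic features,
    f_low is (B, C, scale*H, scale*W) shallow detail features (same C assumed)."""
    B, C, H, W = f_high.shape
    high_tokens = f_high.flatten(2).transpose(1, 2)                  # (B, H*W, C)
    # Each high-level patch gets its scale x scale low-level neighborhood.
    low = F.unfold(f_low, kernel_size=scale, stride=scale)           # (B, C*scale*scale, H*W)
    low_tokens = low.transpose(1, 2).reshape(B, H * W, C, -1).permute(0, 1, 3, 2)
    # Interleave pairs: [high patch, its low-level neighbors, next high patch, ...]
    pairs = torch.cat([high_tokens.unsqueeze(2), low_tokens], dim=2) # (B, H*W, 1+scale^2, C)
    seq = pairs.reshape(B, -1, C)
    out = s6_block(seq)                                              # causal sequence modelling
    # Keep the positions that correspond to the fine (low-level) grid.
    out = out.reshape(B, H * W, 1 + scale * scale, C)[:, :, 1:, :]
    return out        # (B, H*W, scale^2, C): upsampled tokens, grouped per coarse patch
```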

Multi-Modal Fusion

Samba isn’t just for standard photos. It is designed to handle “General” SOD, which includes depth maps, thermal images, and optical flow (for video).

To handle this, the authors insert a Converter between the encoder and decoder. This module fuses the RGB features (\(f^r\)) with the auxiliary modality (\(f^x\)) using the Multi-modal Fusion Mamba (MFM) block introduced earlier.

Equation: Multi-modal fusion logic

The equation above shows how features are projected, concatenated, processed by an S6 block to learn cross-modal interactions, and then merged.
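
Since the equation itself isn’t reproduced here, the following is only a hedged sketch of that described flow: project each modality, concatenate, run an S6 block, and merge. Layer choices and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative cross-modal fusion: project, concatenate, S6, merge."""
    def __init__(self, dim, s6_block):
        super().__init__()
        self.proj_r = nn.Linear(dim, dim)    # project RGB tokens f^r
        self.proj_x = nn.Linear(dim, dim)    # project auxiliary tokens f^x (depth/thermal/flow)
        self.s6 = s6_block                   # sequence model for cross-modal interactions
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, f_r, f_x):             # both (B, N, C)
        seq = torch.cat([self.proj_r(f_r), self.proj_x(f_x)], dim=1)  # (B, 2N, C)
        seq = self.s6(seq)                                            # interactions across modalities
        fused = torch.cat(seq.chunk(2, dim=1), dim=-1)                # re-pair tokens: (B, N, 2C)
        return self.merge(fused)                                      # (B, N, C) fused features
```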

Experimental Results

The authors tested Samba on 21 different datasets across 5 different SOD tasks. The results show that Samba consistently outperforms both CNN-based and Transformer-based state-of-the-art (SOTA) methods.

Quantitative Analysis

In standard RGB Salient Object Detection, Samba achieves top-tier performance while maintaining a lower parameter count compared to heavy Transformer models (like SwinNet).

Table 1. Quantitative comparison of our Samba against other SOTA RGB SOD methods.

The advantage becomes even clearer in complex tasks like RGB-D Video SOD, where maintaining temporal and cross-modal consistency is difficult. Samba achieves significantly higher F-measure (\(F_m\)) and Structure-measure (\(S_m\)) scores compared to dedicated video models.

Table 5. Quantitative comparison of our Samba against other SOTA RGB-D VSOD methods.
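
For context, the F-measure (\(F_m\)) used throughout SOD benchmarks is the weighted harmonic mean of precision and recall, typically with \(\beta^2 = 0.3\) to emphasize precision (the exact evaluation protocol can vary between papers):

\[
F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}
\]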

Qualitative Analysis

Visual comparisons highlight Samba’s strength in handling difficult scenarios.

Look at Figure 7 below:

  • Row 1 (Gazebo): Notice the hollow sections between the pillars. Other models (like VSCode-S or ICON-S) blur these areas or fill them in. Samba correctly identifies the hollow space.
  • Row 2 (Crowd): Samba accurately separates the two individuals without merging them into a blob or missing limbs.
  • Row 4 (Cluttered Desk): The background is messy, but Samba isolates the monitor and objects perfectly.

Figure 7. Visual comparison against SOTA RGB SOD methods.

Ablation Studies

To prove that the new blocks (SGMB and CAU) are actually doing the heavy lifting, the authors performed ablation studies.

Table 6. Ablation study of Samba.

  • Variants A1/A2: Removing the SGMB or using a standard block drops performance significantly.
  • Variants A3-A5: Using standard scanning patterns (Z-scan, S-scan) instead of the proposed SNS results in lower scores, proving that how you scan the image matters.
  • Variants B1-B3: Replacing CAU with standard upsampling (B1) or other learnable upsamplers (B2/B3) also degrades performance.

Conclusion

The “Samba” paper represents a significant step forward in adapting State Space Models for computer vision. It highlights a critical insight: when adapting 1D sequence models to 2D images, the order of the sequence defines the model’s understanding of space.

By inventing Spatial Neighboring Scanning, the authors ensured that the “memory” of the Mamba model aligns with the physical structure of the salient object. Combined with a smarter, context-aware upsampling method, Samba offers a unified framework that is not only more accurate but also computationally efficient.

As the field moves toward efficient alternatives to Transformers, techniques like those introduced in Samba—specifically the dynamic scanning strategies—will likely become standard practice in high-performance computer vision.