Introduction

In the world of computer vision, Image Style Transfer (ST) is one of the most visually captivating tasks. It enables us to take a content image (like a photograph of a street) and a style image (like Van Gogh’s Starry Night) and merge them so the photograph looks as if Van Gogh had painted it himself.

While the artistic results are stunning, the engineering behind them faces a significant bottleneck: the trade-off between generation quality and computational efficiency.

To generate high-quality artistic images, a model needs a “global receptive field”—it needs to understand the entire image at once to capture large-scale style patterns and semantic structures.

  • CNNs (Convolutional Neural Networks) achieve this by stacking many layers, which becomes computationally heavy.
  • Transformers (using Self-Attention) naturally have global fields but suffer from quadratic complexity (\(O(N^2)\)), making them slow and memory-hungry.
  • Diffusion Models produce incredible details but require numerous iterative steps, taking significantly longer to infer.

Enter SaMam (Style-aware Mamba). In a recent paper, researchers propose utilizing the Mamba architecture—a State Space Model (SSM) known for linear complexity (\(O(N)\))—to solve this dilemma.

Figure 1. Trade-off between inference time t (ms) and ArtFID achieved by different methods. The size of a circle represents MACs (G).

As shown in Figure 1, SaMam (the green circle) occupies a “sweet spot.” It achieves an ArtFID score (a measure of quality, lower is better) comparable to heavy Diffusion models (like ZStar) but with a fraction of the inference time and computational cost (MACs).

In this post, we will deconstruct how SaMam adapts the Mamba architecture for vision tasks, specifically tailored for arbitrary style transfer.


Background: The Efficiency Problem

Before diving into SaMam, we need to understand why current methods struggle to balance speed and quality.

The Receptive Field Dilemma

For a neural network to stylize a specific pixel, it helps to know what is happening in the rest of the image.

  • CNN-based methods have limited local receptive fields. To “see” the whole image, they must stack deep hierarchies of layers, which increases Floating Point Operations (FLOPs).
  • Transformer-based methods calculate relationships between every patch and every other patch. As image resolution grows, the calculation time explodes quadratically.

The State Space Model (SSM) Solution

State Space Models, particularly the Mamba variant, have recently revolutionized Natural Language Processing (NLP) by modeling long sequences with linear complexity. They map a 1-D function or sequence \(x(t) \to y(t)\) through an implicit latent state \(h(t)\).

The continuous system is defined by linear ordinary differential equations (ODEs):

\[
h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t) + \mathbf{D}\,x(t)
\]

Here, \(\mathbf{A}\), \(\mathbf{B}\), \(\mathbf{C}\), and \(\mathbf{D}\) are weighting parameters. To use this in deep learning, we “discretize” these equations with a timescale parameter \(\Delta\) (commonly via a zero-order hold), converting continuous time into discrete steps. This transforms the equations into a recursive form that looks very similar to Recurrent Neural Networks (RNNs):

\[
h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t + \mathbf{D}\,x_t, \qquad \bar{\mathbf{A}} = e^{\Delta\mathbf{A}}, \quad \bar{\mathbf{B}} \approx \Delta\mathbf{B}
\]

The magic of Mamba lies in its ability to let parameters like \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) (timescale) be input-dependent. This allows the model to selectively “remember” or “ignore” information, creating a global receptive field without the massive computational cost of Transformers.
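
To make this concrete, here is a minimal, sequential PyTorch sketch of the discretized recurrence with input-dependent \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\). It is a reference loop for intuition only, not the paper’s (or Mamba’s) actual implementation, which relies on a hardware-efficient parallel scan; the shapes, the per-channel diagonal \(\mathbf{A}\), and the \(\bar{\mathbf{B}} \approx \Delta\mathbf{B}\) simplification are assumptions borrowed from common Mamba write-ups.

```python
import torch

def selective_scan(x, A, B, C, D, delta):
    """Reference (sequential) selective scan.

    x:     (L, d)      input sequence
    A:     (d, n)      state transition (stored in log-space in real Mamba; plain here)
    B, C:  (L, d, n)   input-dependent projection matrices
    D:     (d,)        skip connection
    delta: (L, d)      input-dependent timescale
    Returns y: (L, d)
    """
    L, d = x.shape
    n = A.shape[1]
    h = torch.zeros(d, n)
    ys = []
    for t in range(L):
        # Zero-order-hold discretization: A_bar = exp(delta * A), B_bar ~= delta * B
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)      # (d, n)
        B_bar = delta[t].unsqueeze(-1) * B[t]              # (d, n)
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)         # hidden state update
        y = (C[t] * h).sum(-1) + D * x[t]                  # readout + skip
        ys.append(y)
    return torch.stack(ys)

# Toy usage: length-16 sequence, 4 channels, state size 8
L, d, n = 16, 4, 8
x = torch.randn(L, d)
A = -torch.rand(d, n)          # negative values keep the recurrence stable
B = torch.randn(L, d, n)
C = torch.randn(L, d, n)
D = torch.randn(d)
delta = torch.rand(L, d)
y = selective_scan(x, A, B, C, D, delta)
print(y.shape)  # torch.Size([16, 4])
```

Because \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) vary per timestep, the state update rule itself depends on the current input, which is what lets the model selectively keep or discard information.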


The SaMam Framework

The researchers developed SaMam to adapt Mamba for the specific challenges of Style Transfer. The framework is not just a standard Mamba model; it introduces several novel mechanisms to handle 2D images and style injection.

1. Overall Architecture

The architecture, illustrated in Figure 2, consists of three main components:

  1. Style Mamba Encoder: Extracts features from the artwork.
  2. Content Mamba Encoder: Extracts features from the photograph.
  3. Style-aware Mamba Decoder: Fuses the two to generate the final image.

Figure 2. An overview of the SaMam framework (a) and an illustration of the selective scan methods (b).

Input images are first processed into patches (similar to Vision Transformers). These patches pass through Vision State Space Modules (VSSMs).
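
For intuition, the patchification step can be pictured as a strided convolution that projects each non-overlapping patch to an embedding vector, as in Vision Transformers. The patch size and embedding dimension below are placeholder values, not SaMam’s actual settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each one to an
    embedding. Hyperparameters are illustrative, not SaMam's settings."""
    def __init__(self, in_ch=3, embed_dim=192, patch_size=8):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, img):          # img: (B, 3, H, W)
        return self.proj(img)        # (B, embed_dim, H/patch, W/patch)

tokens = PatchEmbed()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 192, 32, 32])
```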

The Zigzag Scan

A major challenge with applying Mamba to images is that Mamba is designed for 1D sequences (like text). When you flatten a 2D image into a 1D sequence using a standard raster scan (row by row), pixels that are vertically adjacent in the image might end up very far apart in the sequence. This is called spatial discontinuity.

To solve this, SaMam introduces a Zigzag Scan (shown in Figure 2a).

  • Standard scans (like Sweep or Cross scans in Fig 2b) often jump from the right edge of one row to the left edge of the next.
  • The Zigzag scan traverses the image in a continuous snake-like pattern. This preserves semantic continuity, ensuring that adjacent tokens in the sequence are spatially close in the image (a minimal code sketch of this flattening follows below).
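
One straightforward way to implement such a snake-like flattening is to reverse every other row before flattening, as in the sketch below. The real SaMam zigzag scan may differ in details (e.g., the set of scan directions used), so treat this purely as an illustration of the idea.

```python
import torch

def zigzag_flatten(feat):
    """Flatten a (B, C, H, W) feature map row by row, reversing every other row
    so consecutive tokens stay spatially adjacent (snake order)."""
    B, C, H, W = feat.shape
    rows = feat.permute(0, 2, 3, 1).clone()   # (B, H, W, C)
    rows[:, 1::2] = rows[:, 1::2].flip(2)     # reverse odd rows
    return rows.reshape(B, H * W, C)          # (B, L, C) sequence for the SSM

seq = zigzag_flatten(torch.randn(1, 4, 3, 3))
print(seq.shape)  # torch.Size([1, 9, 4])
```

Because every odd row is reversed, token \(t\) and token \(t+1\) are always direct neighbors in the original grid, which is exactly the continuity property the authors are after.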

2. The Decoder and S7 Block

The core innovation of this paper is the Style-aware Vision State Space Module (SAVSSM) located in the decoder. Standard Mamba blocks are designed to process content, but they don’t have an inherent mechanism to be “conditioned” on a style.

The researchers created a new block called the S7 Block (Style-aware Selective Scan Structured State Space Sequence Block).

Figure 3. The detailed architecture of Style-aware Vision State Space Module (SAVSSM).

How the S7 Block Works

In a standard Mamba block (S6), the transition matrix \(\mathbf{A}\) is usually fixed or learned as a static parameter. In the S7 Block, the weighting parameters are predicted dynamically from the style embedding (\(\mathbf{E}_s\)).

Equation predicting A and D from Style Embedding

By predicting \(\mathbf{A}\) and \(\mathbf{D}\) from the style image (a code sketch of this idea follows the list below):

  1. Style Selectivity: The hidden state update (\(h_t\)) is now influenced by the artistic style. The model creates a dynamical system where the rules of transition themselves depend on whether the style is a Monet or a Picasso.
  2. Efficiency: Despite this dynamic adaptation, the operation remains linear in complexity, utilizing parallel scans.
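
The sketch below shows the general idea of predicting per-channel SSM parameters from a style embedding with small linear heads. The module name, shapes, and the log-parameterization that keeps \(\mathbf{A}\) negative (a common stability trick) are assumptions for illustration, not the paper’s exact S7 equations.

```python
import torch
import torch.nn as nn

class StyleSSMParams(nn.Module):
    """Sketch: predict per-channel SSM parameters A and D from a style embedding.

    Only illustrates making the transition matrix style-dependent; names and
    shapes are assumptions, not the S7 block's actual parameterization."""
    def __init__(self, style_dim=256, d_model=64, d_state=16):
        super().__init__()
        self.to_log_A = nn.Linear(style_dim, d_model * d_state)
        self.to_D = nn.Linear(style_dim, d_model)
        self.d_model, self.d_state = d_model, d_state

    def forward(self, style_emb):                      # style_emb: (B, style_dim)
        # Predict the log magnitude so A stays negative (stable decay).
        log_A = self.to_log_A(style_emb).view(-1, self.d_model, self.d_state)
        A = -torch.exp(log_A)                          # (B, d_model, d_state)
        D = self.to_D(style_emb)                       # (B, d_model)
        return A, D

A, D = StyleSSMParams()(torch.randn(2, 256))
print(A.shape, D.shape)  # torch.Size([2, 64, 16]) torch.Size([2, 64])
```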

3. Addressing Mamba’s Weaknesses

While Mamba is efficient, it has limitations when applied to vision. The authors introduce specific fixes for these issues.

Local Enhancement (LoE)

Because Mamba flattens images into sequences, there is a risk of “local pixel forgetting.” Even with Zigzag scans, some local neighbor relationships are lost.

To fix this, the researchers add a Local Enhancement (LoE) module at the end of the encoders. This module uses standard Convolution layers (which are excellent at local features) and channel attention to “compensate” for any local details the Mamba block might have missed.
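
A plausible shape for such a compensation branch is sketched below: a depth-wise convolution recovers local neighborhoods, and squeeze-and-excite style channel attention reweights them before adding back to the Mamba features. The specific layer choices are assumptions, not SaMam’s exact LoE design.

```python
import torch
import torch.nn as nn

class LocalEnhancement(nn.Module):
    """Sketch of a local-enhancement branch: depth-wise convolution for local
    detail plus channel attention. Layer choices are illustrative only."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat):                     # feat: (B, C, H, W) from the Mamba branch
        local = self.dwconv(feat)
        return feat + local * self.attn(local)   # compensate missed local details

out = LocalEnhancement()(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```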

Style-aware Modules

To ensure the style permeates every part of the generation, SaMam replaces standard components with style-aware versions:

  1. SConv (Style-aware Convolution): Instead of standard depth-wise convolutions, the kernels are generated based on the style embedding. This captures local geometric structures (like brush strokes).
  2. SCM (Style-aware Channel Modulation): This rescales channels in the residual branches based on style, helping the model emphasize specific features (like color palettes).
  3. SAIN (Style-aware Instance Norm): Normalization is crucial in style transfer. SAIN predicts the mean and variance for normalization directly from the style embedding (a minimal sketch follows this list).
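
To make the SAIN idea concrete, here is a hedged sketch of an instance norm whose scale and shift are predicted from the style embedding, with the scale head zero-initialized so training starts from plain instance normalization. Names and shapes are assumptions rather than the paper’s implementation.

```python
import torch
import torch.nn as nn

class StyleAwareInstanceNorm(nn.Module):
    """Sketch: normalize content features, then apply a scale and shift predicted
    from the style embedding (AdaIN-like). Details are assumptions."""
    def __init__(self, channels=64, style_dim=256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(style_dim, channels)
        self.to_shift = nn.Linear(style_dim, channels)
        nn.init.zeros_(self.to_scale.weight)   # start as a plain instance norm
        nn.init.zeros_(self.to_scale.bias)

    def forward(self, feat, style_emb):        # feat: (B, C, H, W), style_emb: (B, style_dim)
        scale = self.to_scale(style_emb)[..., None, None]   # (B, C, 1, 1)
        shift = self.to_shift(style_emb)[..., None, None]
        return (1 + scale) * self.norm(feat) + shift

out = StyleAwareInstanceNorm()(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The zero-initialized scale head is one way to realize the zero-initialization variant discussed below: the module begins as an ordinary instance norm and gradually learns style-dependent statistics.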

The effectiveness of using Style-aware Instance Norm (SAIN) versus other normalization strategies is plotted below. SAIN (specifically with zero-initialization) converges to better ArtFID scores.

Figure 4. Comparison of different norm strategies.


Experiments and Results

The researchers compared SaMam against state-of-the-art methods, including CNNs (AesPA), Transformers (StyTr2), and Diffusion models (ZStar, StyleID).

Qualitative Comparison

Visually, SaMam excels at balancing content preservation with style application.

Figure 5. Qualitative comparison with previous state-of-the-art methods.

In Figure 5, look at the row with the text (7th row).

  • ZStar and StyleID (Diffusion based) tend to hallucinate or destroy the text structures.
  • SaMam keeps the text legible while applying the texture accurately.
  • Similarly, in the 6th row (buildings), SaMam preserves the straight lines of the architecture better than the competitors while still applying the “speckled” style.

Quantitative Analysis

The efficiency gains are where SaMam truly shines. Table 1 below lists the performance metrics.

  • LPIPS: Measures content fidelity (Lower is better).
  • FID: Measures style similarity (Lower is better).
  • ArtFID: Combines both, computed as \((1 + \text{LPIPS}) \times (1 + \text{FID})\) (Lower is better).
  • MACs / Time: Computational cost and speed.

Table 1. Quantitative comparison of the ST methods.

Key Takeaways from the Data:

  1. Speed: SaMam processes an image in 0.034 seconds. Compare this to StyTr2 (Transformer) at 0.385s or ZStar (Diffusion) at a massive 42.439s. SaMam is roughly 1000x faster than diffusion methods.
  2. Quality: Despite the speed, SaMam achieves the lowest (best) ArtFID score (26.305) and LPIPS score (0.3884). It beats models that are orders of magnitude heavier.

Validating the Effective Receptive Field

One of the main claims of Mamba is that it achieves a global receptive field. The researchers visualized the Effective Receptive Field (ERF) to prove this.

Figure 6. The effective receptive field (ERF) visualization for SaMam.

As seen in Figure 6, the dark areas (representing the receptive field) are widely distributed across the whole image after training. This confirms that SaMam is indeed utilizing global context to make stylization decisions, rather than just looking at small local patches.

Ablation Studies: Do the components matter?

The researchers performed ablation studies to verify the contribution of individual components like the SConv layer and the S7 Block.

Impact of SConv: Figure 10 visualizes the difference between using a standard depth-wise convolution (DWConv) versus the proposed Style-aware Convolution (SConv).

Figure 10. Ablation study on SConv.

Notice the “circuit board” example (top row). The SConv result (middle column) successfully replicates the intricate circuit lines. The DWConv result (right column) looks blurry and fails to capture the sharp geometric structure of the style.


Conclusion

The SaMam paper presents a compelling argument for the use of State Space Models in vision tasks. By adapting Mamba with style-aware mechanisms (specifically the S7 Block) and spatial improvements (Zigzag scan), the authors have created a framework that breaks the traditional trade-off between quality and efficiency.

Key Implications:

  • Linear Complexity Works for Vision: We don’t always need the quadratic cost of Transformers to get global context.
  • Dynamic Parameters: Predicting SSM parameters (\(\mathbf{A}\), \(\mathbf{D}\)) from condition embeddings (like style) is a powerful way to adapt these models for generative tasks.
  • Real-time Potential: With inference times as low as 0.034s, SaMam opens the door for high-quality, real-time video style transfer on consumer hardware, something that remains difficult for Diffusion models.

SaMam demonstrates that Mamba is not just an NLP contender—it is a serious competitor in the world of computer vision and creative AI.