Transformers dominate many sequence-modeling tasks, but their core self-attention scales quadratically with context length. That design choice makes very long contexts expensive in compute and memory. At the same time, structured state-space models (SSMs) — exemplified by S4 and Mamba — offer linear scaling in sequence length and a constant-size state for autoregressive generation. The two model families have matured along largely separate paths: different mathematics, different optimizations, and different engineering tradeoffs.
The paper “Transformers are SSMs” builds a bridge between these paradigms and uses that connection to design a faster, more hardware-friendly SSM variant: Mamba-2. The key insight is a principled correspondence between (i) SSMs, (ii) a generalized family of attention mechanisms called structured masked attention (SMA), and (iii) a classical class of structured matrices known as semiseparable matrices. This Structured State Space Duality (SSD) framework yields algorithms that combine the asymptotic efficiency of SSM recurrences with the hardware friendliness of dense matrix multiplications.
In this article I’ll walk through the main ideas, show where the duality comes from, and explain the SSD algorithm that powers Mamba-2. Throughout, I’ll use figures from the paper to make the connections concrete.
Figure 1: A high-level roadmap. Structured matrices form the bridge between State-Space Models (SSMs) and Attention, creating the Structured State Space Duality (SSD) framework that enables the Mamba-2 architecture.
High-level plan for the article
- Revisit SSMs and attention in a way that makes the connection visible.
- Introduce semiseparable matrices and show SSM → structured matrix.
- Generalize linear attention to structured masked attention (SMA).
- Present the duality: particular SSMs ↔ particular SMAs.
- Explain the hybrid SSD algorithm (block decomposition) that is both linear in sequence length and matmul-friendly.
- Cover architectural choices, system optimizations, and empirical highlights for Mamba-2.
1 — Quick technical refresher
1.1 What is a structured SSM?
A structured state-space model (SSM) is a linear recurrence that updates an internal state and emits outputs. In discrete time, a selective (time-varying) SSM can be written as
\[ \begin{aligned} h_t &= A_t h_{t-1} + B_t x_t, \\ y_t &= C_t^\top h_t, \end{aligned} \]where \(x_t, y_t\) are the per-channel inputs and outputs (the same recurrence is applied independently across the \(P\) channels of a head) and \(h_t \in \mathbb{R}^N\) is the internal state, with \(N\) the state expansion factor. SSMs are called “structured” because the transition matrices \(A_t\) (or their parametrization) are constrained in ways that make the whole model computationally efficient.
Two common computational modes:
- Recurrent (linear) mode: update \(h_t\) step by step. This costs \(O(TN)\) for sequence length \(T\) (per head).
- Convolutional (parallel) mode: when \(A_t\) is time-invariant (LTI), the SSM is equivalent to a convolution and can be computed in parallel via FFTs or other methods.
Selective SSMs (time-varying \(A_t\)) can be more expressive for language but are harder to optimize on hardware because their efficient implementation often needs specialized kernels.
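To make the two computational modes above concrete, here is a toy sketch of my own (not from the paper): a diagonal, time-invariant SSM computed once as a recurrence and once as a causal convolution with kernel \(k_t = C^\top A^t B\). The sizes and variable names are purely illustrative.

```python
import torch

torch.manual_seed(0)
T, N = 16, 4
# A time-invariant (LTI) diagonal SSM: h_t = A h_{t-1} + B x_t, y_t = C^T h_t
A = torch.rand(N) * 0.9            # diagonal of A, stable (entries in (0, 0.9))
B = torch.randn(N)
C = torch.randn(N)
x = torch.randn(T)

# Recurrent mode: sequential updates, constant state for autoregressive generation
h, y_rec = torch.zeros(N), []
for t in range(T):
    h = A * h + B * x[t]
    y_rec.append(torch.dot(C, h))
y_rec = torch.stack(y_rec)

# Convolutional mode: y = k * x with kernel k_t = C^T A^t B (valid only when A is time-invariant)
k = torch.stack([torch.dot(C, (A ** t) * B) for t in range(T)])
y_conv = torch.stack([torch.dot(k[: t + 1].flip(0), x[: t + 1]) for t in range(T)])

print(torch.allclose(y_rec, y_conv, atol=1e-5))  # True
```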
1.2 What is attention / linear attention?
Standard (softmax) self-attention computes
\[ Y = \operatorname{softmax}(Q K^\top)\,V, \]where \(Q,K,V\in\mathbb{R}^{T\times N}\) and the \(T\times T\) Gram matrix incurs a quadratic cost in \(T\). Linear attention methods remove the softmax or replace it by a kernel feature map \(\psi(\cdot)\) so that
\[ QK^\top \approx \psi(Q)\psi(K)^\top, \]and then reorder matrix multiplications using associativity:
\[ (QK^\top)V = Q(K^\top V) \quad\longrightarrow\quad Q \underbrace{(K^\top V)}_{\text{compute once}}. \]With an additional causal mask on the Gram matrix, the reordered computation can lead to a recurrence (a cumulative sum or related structured multiplication), enabling linear complexity in \(T\) for autoregressive generation.
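As a minimal illustration of that reordering (a toy sketch of my own, not the paper's code): with the causal mask, the reordered computation becomes a running sum of outer products \(K_t V_t^\top\), which is exactly the constant-memory decoding recurrence.

```python
import torch

torch.manual_seed(0)
T, N, P = 8, 4, 3
Q, K = torch.randn(T, N), torch.randn(T, N)
V = torch.randn(T, P)

# Quadratic form: causal masked attention without softmax
L_mask = torch.tril(torch.ones(T, T))
Y_quad = (L_mask * (Q @ K.T)) @ V

# Recurrent form: carry a running state S_t = sum_{s<=t} K_s V_s^T (an N x P matrix),
# giving constant memory per decoding step
S = torch.zeros(N, P)
Y_rec = []
for t in range(T):
    S = S + torch.outer(K[t], V[t])
    Y_rec.append(Q[t] @ S)
Y_rec = torch.stack(Y_rec)

print(torch.allclose(Y_quad, Y_rec, atol=1e-5))  # True
```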
2 — SSMs as matrix transformations (the semiseparable viewpoint)
Rather than thinking of SSMs only as recurrences, we can view the entire \(x\mapsto y\) transformation as a single \(T\times T\) matrix \(M\) acting on the time axis:
\[ y = Mx, \qquad M_{j,i} = C_j^\top A_j A_{j-1}\cdots A_{i+1} B_i \quad\text{for } j\ge i. \]So an SSM defines a structured sequence-to-sequence matrix \(M\). The crucial property: when the internal dimension is \(N\), every submatrix of \(M\) that lies on or below the diagonal has rank at most \(N\). That is exactly the defining property of (lower-triangular) semiseparable matrices.
Intuitively
- The SSM recurrence gives a linear (recurrent) way to multiply \(M\) by a vector in \(O(TN)\).
- Explicitly constructing \(M\) and doing \(Mx\) is quadratic in \(T\) but uses dense matmuls (hardware-friendly).
This two-mode nature (linear recurrence vs. explicit matrix multiplication) is the computational duality that SSD leverages.
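Here is a small numerical check of both claims (my own toy sketch; the shapes and the single scalar input channel are illustrative): we materialize \(M\) from a random selective SSM, verify that \(Mx\) reproduces the recurrence, and confirm the low-rank property of an on-or-below-diagonal submatrix.

```python
import torch

torch.manual_seed(0)
T, N = 10, 3
A = torch.randn(T, N, N) * 0.3       # time-varying transitions
B = torch.randn(T, N)
C = torch.randn(T, N)
x = torch.randn(T)                   # a single input channel (P = 1) for simplicity

# Quadratic mode: materialize M[j, i] = C_j^T A_j A_{j-1} ... A_{i+1} B_i for j >= i
M = torch.zeros(T, T)
for j in range(T):
    prod = torch.eye(N)              # running product A_j ... A_{i+1}
    for i in range(j, -1, -1):
        M[j, i] = C[j] @ (prod @ B[i])
        prod = prod @ A[i]
y_matrix = M @ x

# Linear mode: the step-by-step recurrence (O(TN) per channel when A_t is structured, e.g. diagonal)
h, y_rec = torch.zeros(N), []
for t in range(T):
    h = A[t] @ h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = torch.stack(y_rec)

print(torch.allclose(y_matrix, y_rec, atol=1e-5))   # True: same sequence transformation

# Semiseparability: any submatrix taken on-or-below the diagonal has rank <= N
sub = M[5:, :6]                                     # rows 5..9, columns 0..5
print(torch.linalg.matrix_rank(sub).item() <= N)    # True, even though sub is 5 x 6
```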
Figure 2: State-space models viewed as matrix transformations. The sequence mixer \(M\in\mathbb{R}^{T\times T}\) is semiseparable: every on-and-below diagonal submatrix has rank at most \(N\), the SSM state dimension. This lets us compute \(Mx\) either by a recurrence or via structured matrix algorithms.
A useful special case is the scalar SSM (state dimension \(N=1\)), where \(A_t\) reduces to scalars \(a_t\). The corresponding \(M\) has entries \(M_{j,i}=a_{j}\cdots a_{i+1}\) (for \(j\ge i\)), and multiplying by \(M\) corresponds to the simple scalar recurrence
\[ y_t = a_t y_{t-1} + x_t, \]a combined cumulative-product/cumulative-sum operator (the cumprodsum). Many efficient primitives for 1-SS (1-semiseparable) matrix multiplication are known, and they generalize to the \(N>1\) semiseparable case.
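A toy sketch of this equivalence (mine, not the paper's code): the 1-SS matrix can be built from cumulative sums of \(\log a_t\) followed by exponentiation, the same "exponentials of cum-sums" trick used in the SSD listing later on.

```python
import torch

torch.manual_seed(0)
T = 8
a = torch.rand(T) * 0.8 + 0.1        # decay scalars a_t in (0.1, 0.9)
x = torch.randn(T)

# Linear mode: the scalar recurrence ("cumprodsum")
y, y_rec = torch.zeros(()), []
for t in range(T):
    y = a[t] * y + x[t]
    y_rec.append(y)
y_rec = torch.stack(y_rec)

# Quadratic mode: build the 1-SS matrix M[j, i] = a_j * ... * a_{i+1} from cumulative
# sums of log a, then exponentiate (zero above the diagonal via a -inf fill)
log_cum = torch.cumsum(torch.log(a), dim=0)
segsum = log_cum[:, None] - log_cum[None, :]                # sum of log a over (i, j]
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
M = torch.exp(segsum.masked_fill(~causal, float("-inf")))   # 1 on the diagonal, 0 above
y_mat = M @ x

print(torch.allclose(y_rec, y_mat, atol=1e-5))  # True
```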
3 — Structured Masked Attention (SMA): generalizing linear attention
Rewriting masked attention as a single tensor contraction makes the possible algorithmic orderings explicit. Masked kernel attention
\[ Y = (L \circ (QK^\top))\,V \]is a contraction of four tensors \((Q,K,V,L)\) over appropriate axes. Two natural contraction orders produce two algorithms:
- Quadratic (attention) mode: compute \(G=QK^\top\), \(M=L\circ G\), then \(Y=MV\).
- Linear (scan) mode: form the per-step outer products \(Z_t = K_t V_t^\top\), apply the mask along the time axis, \(H = L\,Z\), then read out \(Y_t = Q_t^\top H_t\). The step \(H = LZ\) is the bottleneck, and it is cheap exactly when \(L\) is structured.
Linear attention is exactly the case where \(L\) is the lower-triangular all-ones causal mask, because multiplying by that mask reduces to a cumulative sum. The insight in this paper: we can replace \(L\) by any structured matrix that admits subquadratic matrix-vector multiplication. This gives Structured Masked Attention (SMA):
Definition (informal). SMA computes the same 4-way contraction \(Y = \operatorname{contract}(Q,K,V,L)\) but allows \(L\) to be an arbitrary structured mask with fast matvecs. The two algorithmic modes (quadratic and linear) are the different contraction orders; the linear mode is efficient when \(L\) is structured.
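A small self-contained check of the two contraction orders (a sketch of my own, using an exponential-decay mask as the structured \(L\); the decay rate `gamma` is just an illustrative choice):

```python
import torch

torch.manual_seed(0)
T, N, P, gamma = 8, 4, 3, 0.9
Q, K = torch.randn(T, N), torch.randn(T, N)
V = torch.randn(T, P)

# A structured causal mask: exponential decay, L[t, s] = gamma**(t - s) for t >= s
idx = torch.arange(T, dtype=torch.float32)
L = torch.tril(gamma ** (idx[:, None] - idx[None, :]))

# Quadratic (attention) contraction order: materialize the T x T matrix
Y_quad = torch.einsum("tn,sn,ts,sp->tp", Q, K, L, V)

# Linear (scan) contraction order: never form Q K^T; the mask is applied along time.
# Because L is structured (here: decaying), H could instead be computed by the short
# recurrence S_t = gamma * S_{t-1} + K_t V_t^T, without materializing L at all.
Z = torch.einsum("sn,sp->snp", K, V)      # per-step outer products
H = torch.einsum("ts,snp->tnp", L, Z)     # structured-mask multiplication (the bottleneck)
Y_lin = torch.einsum("tn,tnp->tp", Q, H)

print(torch.allclose(Y_quad, Y_lin, atol=1e-4))  # True
```

Swapping in a different structured \(L\) changes only how the \(H = LZ\) step is computed; the surrounding contraction pattern stays the same.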
Figure 3: SMA constructs \(M = QK^\top \circ L\) for an arbitrary structured mask \(L\). Different \(L\)s produce different behaviors (causal, decaying, Toeplitz, 1-SS, Fourier, …). 1-SS masks (purple row) are central to SSD.
Important example classes for \(L\):
- Causal all-ones \(L\): yields standard linear attention.
- Exponential decay / Toeplitz masks: capture structures like RetNet's decay or relative positional decay.
- 1-semiseparable masks: the mask that arises from scalar SSMs (products of scalars \(a_t\)).
4 — The duality: when an SSM is an attention mechanism (and vice versa)
A scalar-identity SSM (each \(A_t = a_t I\), a scalar times the identity) produces a semiseparable matrix \(M\) that factors as
\[ M = L \circ (C B^\top), \qquad L_{j,i} = a_j a_{j-1}\cdots a_{i+1}. \]But that is exactly the quadratic masked kernel attention form with \(Q\leftarrow C\), \(K\leftarrow B\), and mask \(L\). So, the quadratic form of that SSM instance is precisely SMA with a 1-semiseparable \(L\).
Conversely, suppose you have an SMA with a structured, causal mask \(L\) that enjoys efficient autoregressive updates of bounded order. Then \(L\) must be semiseparable. Therefore, efficient autoregressive masked attentions correspond to semiseparable SMA — and many of these are SSMs (or closely related variants).
Put succinctly: there is a large intersection of models that have both a linear-time SSM recurrence and a quadratic attention-style form — these are the State-Space Dual (SSD) class.
Corollary (informal). 1-SS SMA (SMA with a 1-semiseparable mask) is a special case of a diagonal SSM (with scalar-identity transitions). The two algorithmic forms are dual: the SSM recurrence is the linear algorithm; forming \(M\) and doing \(Mx\) is the quadratic attention algorithm.
This equivalence is the heart of the paper and explains why we can translate algorithms and optimizations back and forth between the attention and SSM camps.
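A quick numerical check of the duality (my own toy sketch): run the scalar-identity SSM recurrence and compare it against the attention-style form \((L \circ (C B^\top))\,x\) with a 1-SS mask.

```python
import torch

torch.manual_seed(0)
T, N = 8, 4
a = torch.rand(T) * 0.8 + 0.1        # A_t = a_t * I: scalar-identity transitions
B = torch.randn(T, N)
C = torch.randn(T, N)
x = torch.randn(T)

# Linear form: the SSM recurrence  h_t = a_t h_{t-1} + B_t x_t,  y_t = C_t^T h_t
h, y_ssm = torch.zeros(N), []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_ssm.append(C[t] @ h)
y_ssm = torch.stack(y_ssm)

# Quadratic form: masked "attention" with Q <- C, K <- B and a 1-SS mask L
cp = torch.cumprod(a, dim=0)
causal = torch.tril(torch.ones(T, T))
L = (cp[:, None] / cp[None, :]) * causal     # L[j, i] = a_j * ... * a_{i+1} for j >= i
y_attn = (L * (C @ B.T)) @ x

print(torch.allclose(y_ssm, y_attn, atol=1e-5))  # True: same model, two algorithmic forms
```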
5 — SSD: a practical, hardware-efficient algorithm
Having two algorithmic forms is useful because each has complementary strengths:
- The SSM recurrence is asymptotically linear in \(T\), but its work is dominated by scalar and elementwise operations, which are not a good match for GPU tensor cores.
- The attention (quadratic) form is hardware friendly: large batched matrix multiplications, great for GPUs and TPUs — but quadratic in \(T\).
SSD’s solution is to combine both via a block decomposition of the semiseparable matrix \(M\). The main idea:
- Partition the sequence into \(B = T/Q\) chunks of length \(Q\). Equivalently, partition \(M\) into a \(B\times B\) block lower-triangular matrix.
- Diagonal blocks (intra-chunk interactions) are small (\(Q\times Q\)) semiseparable matrices — compute them via the quadratic SMA mode using fast batched matmuls (high hardware utilization).
- Off-diagonal blocks (inter-chunk interactions) are low-rank thanks to semiseparability. Factor them into left/center/right factors and:
- compute chunkwise right factors via matmuls (compact per-chunk states),
- run a smaller scalar-SSM (1-SS) scan across the chunk states (linear but on a much shorter sequence),
- convert states back to outputs via matmuls for left factors.
- Sum the diagonal (intra-chunk) outputs and off-diagonal (inter-chunk) outputs to produce the final result.
That hybrid lets SSD:
- keep asymptotically linear complexity in \(T\),
- perform most computation with large matrix multiplications (tensor-core friendly),
- drastically reduce scalar/elementwise work by shrinking the length of scans from \(T\) to \(T/Q\).
Figure 4: SSD algorithm via block decomposition. Diagonal blocks use the quadratic (attention) mode; off-diagonal blocks factor into low-rank terms whose center requires a short (chunkwise) recurrence.
The paper gives a concise PyTorch implementation of SSD (rearrangement + batched einsums). The code is readable and demonstrates the idea end-to-end (below is an adapted and annotated sketch):
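The sketch below is my reconstruction in the spirit of that listing, not the released code: the function and argument names are my own, and the segment-sum helper is the simple (less numerically careful) variant. Here `A` holds per-head log-decays, so \(\exp(A)\) plays the role of the scalar transitions.

```python
import torch
import torch.nn.functional as F
from einops import rearrange

def segsum(x):
    """Segment sums: segsum(x)[..., j, i] = x[i+1] + ... + x[j] for j >= i, -inf above the diagonal.
    Exponentiating this gives the 1-SS (cumprodsum) mask."""
    T = x.size(-1)
    x_cumsum = torch.cumsum(x, dim=-1)
    x_segsum = x_cumsum[..., :, None] - x_cumsum[..., None, :]
    mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
    return x_segsum.masked_fill(~mask, float("-inf"))

def ssd_chunked(X, A, B, C, chunk):
    """Chunked SSD sketch.
    X: (b, T, h, p) inputs; A: (b, T, h) per-head log-decays;
    B, C: (b, T, h, n) input/output projections; T must be a multiple of `chunk`.
    Returns Y: (b, T, h, p) and the final state (b, h, p, n)."""
    # Split the sequence into chunks of length `chunk`
    X, A, B, C = [rearrange(t, "b (c l) ... -> b c l ...", l=chunk) for t in (X, A, B, C)]
    A = rearrange(A, "b c l h -> b h c l")
    A_cumsum = torch.cumsum(A, dim=-1)

    # 1. Intra-chunk outputs (diagonal blocks): quadratic SMA mode, all batched matmuls
    L = torch.exp(segsum(A))
    Y_diag = torch.einsum("bclhn,bcshn,bhcls,bcshp->bclhp", C, B, L, X)

    # 2. Per-chunk states (right factors of the off-diagonal blocks)
    decay_states = torch.exp(A_cumsum[:, :, :, -1:] - A_cumsum)
    states = torch.einsum("bclhn,bhcl,bclhp->bchpn", B, decay_states, X)

    # 3. Inter-chunk recurrence (center factors): a short 1-SS "scan" over T/chunk states
    states = torch.cat([torch.zeros_like(states[:, :1]), states], dim=1)
    decay_chunk = torch.exp(segsum(F.pad(A_cumsum[:, :, :, -1], (1, 0))))
    new_states = torch.einsum("bhzc,bchpn->bzhpn", decay_chunk, states)
    states, final_state = new_states[:, :-1], new_states[:, -1]

    # 4. State-to-output conversion (left factors of the off-diagonal blocks)
    Y_off = torch.einsum("bclhn,bchpn,bhcl->bclhp", C, states, torch.exp(A_cumsum))

    # Combine intra-chunk and inter-chunk contributions
    return rearrange(Y_diag + Y_off, "b c l h p -> b (c l) h p"), final_state
```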
(The paper contains a fully working, vectorized implementation using einops/einsum, with the 1-SS cumprodsum handled efficiently via exponentials of cumulative sums.)
Complexity summary (when choosing state size \(N\), head dim \(P\), and chunk length \(Q\) appropriately):
- Training FLOPs: \(O(T N^2)\) (dominated by matmul work inside chunks).
- Inference FLOPs: \(O(N^2)\) per step (state size dependent).
- Memory: \(O(TN)\).
- Most heavy work is batched matrix multiplications — ideal for accelerator tensor cores.
Benchmarks in the paper show SSD is 2–8× faster than the previously optimized Mamba fused scan (for large state expansion sizes), and faster than highly optimized attention (FlashAttention-2) starting around sequence length \(2\text{k}\) and up. SSD is also much less sensitive to increasing state dimension \(N\) compared to scan-based implementations.
6 — Mamba-2: architecture and design choices
SSD supplies a fast core sequence mixer. Mamba-2 is an architecture that uses SSD as the core layer and borrows transformer design idioms to make the block tensor-parallel (TP) friendly and robust.
Key block-level changes (compared to original Mamba):
- Parallel parameter projections: produce \(A,B,C,X\) via parallel projections at the block input (analogous to Q, K, V in Transformers) rather than computing SSM parameters sequentially from \(X\). That reduces synchronizations and enables standard tensor parallel sharding.
- Extra normalization: an extra normalization (LayerNorm/GroupNorm/RMSNorm) before output improves stability (inspired by NormFormer).
- Head patterns: the authors formalize several multi-head analogues for SSMs:
- Multi-head SSM (MHS) — standard independent heads (like MHA).
- Multi-contract SSM (MCS) — analogous to multi-query attention (\(C\) varies per head while \(B\) is shared across heads).
- Multi-input SSM (MIS) — analogous to multi-value attention (Mamba uses this pattern). Ablations show MIS (multi-value) works best for language tasks.
- Grouped head variants (GVA/GIS) to trade off speed, parallelism, and accuracy.
Figure 5: Mamba-2 block design. Projections that produce SSM parameters are computed in parallel at the block start. An added normalization improves training stability and TP compatibility.
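To make the block-level structure concrete, here is a deliberately simplified sketch of a Mamba-2-style block (my own illustration, not the released implementation): it omits the short causal convolution, gating, and discretization details of the real block, and it reuses the `ssd_chunked` sketch from Section 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class SimplifiedMamba2Block(nn.Module):
    """Illustrative only: parallel A/B/C/X projections from the block input,
    the SSD mixer, an extra normalization, then the output projection."""
    def __init__(self, d_model, n_heads=8, d_state=64, chunk=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.chunk = n_heads, chunk
        # Parallel projections of the raw input u (analogous to Q/K/V projections)
        self.x_proj = nn.Linear(d_model, d_model)
        self.B_proj = nn.Linear(d_model, n_heads * d_state)
        self.C_proj = nn.Linear(d_model, n_heads * d_state)
        self.a_proj = nn.Linear(d_model, n_heads)      # per-head, per-step log-decay
        self.norm = nn.LayerNorm(d_model)              # extra normalization before output
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, u):                              # u: (batch, T, d_model), T % chunk == 0
        X = rearrange(self.x_proj(u), "b t (h p) -> b t h p", h=self.h)
        B = rearrange(self.B_proj(u), "b t (h n) -> b t h n", h=self.h)
        C = rearrange(self.C_proj(u), "b t (h n) -> b t h n", h=self.h)
        A = -F.softplus(self.a_proj(u))                # log-decays in (-inf, 0)
        Y, _ = ssd_chunked(X, A, B, C, self.chunk)     # SSD sketch from Section 5
        return self.out_proj(self.norm(rearrange(Y, "b t h p -> b t (h p)")))
```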
Systemic benefits:
- Tensor parallelism: by computing \(A,B,C,X\) from the raw input \(u\) (and sharding these projections), Mamba-2 needs only a single all-reduce at the end of each block, just like a Transformer layer. GroupNorm is chosen so normalization can be local to each tensor-parallel shard.
- Sequence (context) parallelism: SSD naturally supports chunking the sequence across devices. Each device computes its local outputs and passes a compact recurrent state to the next device (linear bandwidth), unlike attention which needs expensive all-to-all communication for cross-block interactions.
- Variable length batches: Mamba-2 can handle variable-length examples efficiently by manipulating \(A_t\) across sequence boundaries (setting \(A_t=0\) at example boundaries so no state is carried between packed sequences), avoiding heavy padding tricks.
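A toy check of this boundary-reset idea with a naive scalar recurrence (my own sketch, not the actual Mamba-2 kernels): zeroing the transition at the boundary makes the packed computation match running the second example on its own.

```python
import torch

def scalar_ssm(a, x):
    """Naive per-channel recurrence h_t = a_t * h_{t-1} + x_t (output = state)."""
    h, ys = torch.zeros_like(x[0]), []
    for t in range(x.shape[0]):
        h = a[t] * h + x[t]
        ys.append(h)
    return torch.stack(ys)

torch.manual_seed(0)
T1, T2, d = 5, 4, 3
x = torch.randn(T1 + T2, d)          # two examples packed back to back
a = torch.rand(T1 + T2)              # decays in (0, 1)

a_reset = a.clone()
a_reset[T1] = 0.0                    # zero the transition at the boundary: no state carries over

packed = scalar_ssm(a_reset, x)[T1:]        # second example, computed inside the packed batch
alone = scalar_ssm(a[T1:], x[T1:])          # second example, computed on its own
print(torch.allclose(packed, alone))        # True: the reset isolates the two examples
```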
7 — Empirical highlights
The paper provides a thorough empirical study. I’ll summarize the most salient outcomes.
7.1 Synthetic associative recall (MQAR)
The Multi-Query Associative Recall task (MQAR) stresses a model’s ability to memorize and retrieve multiple key/value pairs from long contexts — a known challenge for RNN-style recurrences. Mamba-2, with SSD enabling much larger state expansions \(N\) (e.g., 64, 256), substantially outperforms the original Mamba and competes with attention in many regimes. Increasing state size reliably improves recall performance.
Figure 6: MQAR performance. The SSD layer plus Mamba-2 block allows larger states and stronger recall.
7.2 Language modeling and scaling laws
Mamba-2 was trained on The Pile across model sizes from ~125M to 2.7B. Under the Chinchilla training regime the paper reports that Mamba-2 matches or outperforms the original Mamba and a strong Transformer++ (carefully tuned transformer recipe) over a compute range. In many cases Mamba-2 is Pareto-dominant: for a given compute budget it gives lower perplexity.
Figure 7: Scaling behavior. For comparable compute budgets Mamba-2 reaches lower perplexities than baselines in these experiments.
Downstream zero-shot evaluations (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande, OpenBookQA) show that Mamba-2 models at a given parameter count often match or exceed Pythia models at 2× the size.
7.3 Speed benchmarks
SSD is the headline systems contribution: the authors benchmark SSD vs Mamba fused scans and FlashAttention-2 on an A100 GPU. Results show:
- SSD is 2–8× faster than Mamba’s optimized scan for large state expansion \(N\).
- SSD becomes faster than FlashAttention-2 for sequence lengths around \(2\text{k}\) and larger.
- SSD’s runtime scales gently with \(N\) compared to scan-based implementations that slow linearly with \(N\).
Figure 8: Efficiency: SSD (purple) achieves substantial speedups over a fused scan implementation (red) and compares favorably with FlashAttention-2 (blue) as sequence length grows.
8 — Ablations and practical choices
The paper includes many ablations that confirm practical choices:
- Block design: computing the \(A,B,C,X\) projections in parallel slightly reduces the parameter count and improves both TP compatibility and perplexity.
- Normalization: adding a final normalization (as in NormFormer) stabilizes large models and provides a small quality boost.
- Head patterns: multi-input SSM (MIS) / multi-value attention pattern works substantially better than multi-query or multi-key analogues in the SSM setting at fixed parameter/state budgets.
- Kernel feature maps (Swish, ReLU, random features, cosFormer style): the experiments found no clear winner over a simple Swish/SiLU default for \( \psi(\cdot) \) in the Mamba-2 design; some kernel approximations can introduce numerical instability unless paired with careful normalization.
The authors also show hybrid models (mostly SSD with a small fraction of attention layers, ~10% attention) often perform better than either pure SSD or pure attention models because attention provides a fast retrieval mechanism and SSD provides compressed stateful memory.
9 — Why this matters and open directions
The SSD framework is valuable for several reasons:
- Conceptual unity. It shows SSMs and a large class of masked attention mechanisms are two sides of the same structured-matrix coin. That lets us port ideas (systems, parallelism, head patterns, kernel tricks) between the communities.
- Practical algorithms. SSD achieves strong hardware efficiency by combining chunk-local matmuls with much shorter scans across chunks. This brings SSMs closer to Transformer-level hardware friendliness while preserving subquadratic scaling.
- Architectural flexibility. With the SMA perspective, one can instantiate many masks \(L\) (decaying, Toeplitz, semiseparable, Fourier, …) and study the efficiency/inductive bias tradeoffs in a principled way.
- Systems friendliness. SSD naturally supports tensor and sequence parallelism patterns used in Transformer training, and it supports variable sequence lengths and efficient autoregressive decoding.
Open questions and future work
- Can we further close the gap to full softmax attention? SSD doesn't directly model the nonlinearity of softmax; can we find structured-matrix approximations that combine softmax-like normalization with efficient autoregression?
- Are there richer semiseparable or rank-structured factorizations that yield even better hardware tradeoffs?
- How do interpretability and mechanistic insights developed for attention transfer to SSD/SSM architectures?
- Can we extend SSD algorithms and block decompositions to other families of recurrences and structured matrices (beyond semiseparable)?
10 — Takeaway
- The paper rigorously connects selective SSMs and a generalization of linear attention (SMA) via semiseparable matrices.
- This Structured State Space Duality (SSD) yields both a conceptual dictionary and practical tools: new algorithms, better hardware utilization, and a strong SSM architecture, Mamba-2.
- SSD shows we don’t need to choose between the asymptotic efficiency of recurrences and the hardware efficiency of matmuls — we can get both.
If you care about long-context efficiency, the practical training of SSMs, or building hybrid models that balance compressed state memory against fast retrieval, the SSD viewpoint and Mamba-2 are ideas you'll want to understand and experiment with. The paper includes runnable code, models, and a good set of ablations: a nice resource if you want to try SSD in your own stack.
(If you want to dig deeper, the paper contains the formal proofs connecting semiseparability, the SSS representation, the tensor contraction derivations for linear attention, and the full PyTorch implementation of SSD used in experiments.)