Introduction: The Quest for Efficient Sequence Models

Modeling long sequences of data—whether audio waveforms, medical signals, text, or flattened images—is a fundamental challenge in machine learning. For years, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) were the standard tools. More recently, Transformers have risen to prominence with remarkable results. But all of these models face trade-offs, particularly when sequences get very long.

Enter State Space Models (SSMs). A recent architecture called S4 (Structured State Space sequence model) emerged as a powerful contender, outperforming previous approaches on tasks requiring long-range memory. Built on a mathematical foundation from classical control theory, S4 efficiently models continuous signals with a special state matrix called the HiPPO matrix, a design aimed at remembering information over long time horizons.

There’s a catch: the HiPPO matrix is complex. To make it useful in deep learning, S4 uses a diagonal plus low-rank (DPLR) structure. This representation is powerful but makes S4 harder to understand, implement, and customize—feeling at times like a locked black box.

This raises an exciting question: can we simplify it? What if we stripped away the complexity and used a purely diagonal state matrix? This would make both the math and code drastically simpler. Early attempts at such a simplification led to significant performance drops. However, a recent model called DSS showed that a specific diagonal matrix—derived directly from S4’s own HiPPO structure—can work surprisingly well.

This is where the paper On the Parameterization and Initialization of Diagonal State Space Models comes in. The researchers systematically explore how to build, parameterize, and initialize these simpler diagonal SSMs, introducing S4D, a diagonal variant that marries the simplicity of a diagonal state matrix with the principled parameterization and initialization of S4.

The result is a model that is:

  1. Simple: Its convolutional kernel computation can be expressed in just two lines of code.
  2. Principled: The authors provide the first theoretical explanation for why this diagonal approach works.
  3. Powerful: S4D matches the original S4 in performance across diverse tasks—image, audio, medical time-series—and achieves 85% average accuracy on the challenging Long Range Arena benchmark.

Let’s dive in and unpack how they pulled this off.

Figure 1: S4D inherits S4's strengths while being simpler. (Left) The diagonal structure lets the model be viewed as a collection of independent 1-D SSMs (its recurrent view). (Right) Its interpretable convolution kernel can be implemented in about two lines of code; colors denote independent 1-D SSMs, purple denotes trainable parameters.


Background: A Crash Course in State Space Models

At its core, a State Space Model describes a system using a hidden “state” vector \(x(t)\) that evolves over time via a linear ODE:

\[ \begin{array}{l} x'(t) = \mathbf{A} x(t) + \mathbf{B} u(t) \\ y(t) = \mathbf{C} x(t) \end{array} \]

Here:

  • \(u(t)\) is the input signal.
  • \(y(t)\) is the output signal.
  • \(x(t)\) is the hidden state of size \(N\).
  • \(\mathbf{A} \in \mathbb{C}^{N \times N}\) is the state matrix — the key component dictating internal dynamics.

This continuous-time system can also be viewed as a convolution:

\[ \mathbf{K}(t) = \mathbf{C} e^{t\mathbf{A}} \mathbf{B}, \quad y(t) = (\mathbf{K} * u)(t) \]
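To make the convolution view concrete, here is a tiny NumPy toy sketch (my own illustration, not code from the paper): it samples \(\mathbf{K}(t) = \mathbf{C} e^{t\mathbf{A}} \mathbf{B}\) on a time grid and applies it to an input by discrete convolution.

```python
import numpy as np
from scipy.linalg import expm

# Toy SSM with a small, roughly stable random state matrix (illustration only).
N, L, dt = 4, 200, 0.05
rng = np.random.default_rng(0)
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

# Sample the kernel K(t) = C exp(tA) B on the grid t = 0, dt, 2*dt, ...
K = np.array([(C @ expm(t * A) @ B).item() for t in dt * np.arange(L)])

# y(t) = (K * u)(t), approximated by a discrete convolution scaled by dt.
u = np.sin(0.3 * np.arange(L))
y = dt * np.convolve(u, K)[:L]
```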

S4’s magic comes from a carefully chosen HiPPO \(\mathbf{A}\) matrix that yields a kernel \(\mathbf{K}(t)\) with excellent memory properties over long sequences. But this matrix’s structure forces the use of a complex DPLR representation.

DSS made a surprising discovery: start from the HiPPO matrix, compute its DPLR form, then throw away the “low-rank” part. The remaining diagonal matrix performs nearly as well as full S4. This was an exciting empirical result, raising two questions:

  1. Why does this specific diagonal matrix succeed when random ones fail?
  2. Can we compute it more simply than DSS’s custom “complex softmax” method?

The paper answers both and delivers S4D.


The Core Method: Building S4D Step-by-Step

The authors break the design of a diagonal SSM into three key choices: discretization, kernel computation, and parameterization.

1. Discretization: From Continuous to Discrete

SSMs start in continuous time, but our data is discrete. We must convert the continuous parameters (\(\mathbf{A}, \mathbf{B}\)) into discrete ones (\(\overline{\mathbf{A}}, \overline{\mathbf{B}}\)) using a step size \(\Delta\) (itself typically a learned parameter), a process called discretization.

Two standard methods are:

  • Zero-Order Hold (ZOH)
  • Bilinear Transform
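Both map the continuous \((\mathbf{A}, \mathbf{B})\) and a step size \(\Delta\) to discrete \((\overline{\mathbf{A}}, \overline{\mathbf{B}})\); the standard formulas are:

\[ \text{ZOH:} \quad \overline{\mathbf{A}} = e^{\Delta \mathbf{A}}, \qquad \overline{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(e^{\Delta \mathbf{A}} - \mathbf{I}\right) \Delta \mathbf{B} \]

\[ \text{Bilinear:} \quad \overline{\mathbf{A}} = \left(\mathbf{I} - \tfrac{\Delta}{2}\mathbf{A}\right)^{-1}\left(\mathbf{I} + \tfrac{\Delta}{2}\mathbf{A}\right), \qquad \overline{\mathbf{B}} = \left(\mathbf{I} - \tfrac{\Delta}{2}\mathbf{A}\right)^{-1} \Delta \mathbf{B} \]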

Figure 2: The bilinear and ZOH discretization formulas convert continuous-time SSM parameters to discrete-time equivalents.

Experiments show the choice has little effect on performance for S4D—either works fine. This flexibility further simplifies the model design.

2. The Convolution Kernel: Simplicity Unleashed

In discrete form, the SSM's kernel is \(\overline{\mathbf{K}}_{\ell} = \mathbf{C}\,\overline{\mathbf{A}}^{\ell}\,\overline{\mathbf{B}}\). For a dense \(\mathbf{A}\), computing the matrix powers \(\overline{\mathbf{A}}^{\ell}\) is expensive, which is exactly what S4's intricate DPLR algorithm works around. For a diagonal \(\mathbf{A}\), it's trivial: each power is just an elementwise power of the diagonal entries, and the kernel reduces to

\[ \overline{\mathbf{K}}_{\ell} = \sum_{n=0}^{N-1} \mathbf{C}_n \,\overline{\mathbf{A}}_n^{\ell} \,\overline{\mathbf{B}}_n \]

This yields a neat formulation: stacking the powers into a Vandermonde matrix \(\mathcal{V}\) with entries \(\mathcal{V}_{n,\ell} = \overline{\mathbf{A}}_n^{\ell}\), the entire kernel becomes a single matrix-vector product \(\overline{\mathbf{K}} = (\mathbf{C} \circ \overline{\mathbf{B}})\,\mathcal{V}\).

Figure 3: The Vandermonde matrix formulation enables fast kernel computation for diagonal SSMs, matching S4's complexity without its algorithmic overhead.

A naive implementation of this product takes \(O(NL)\) time and memory, but because a Vandermonde product is highly structured, it can be computed in \(\tilde{O}(N+L)\) operations and \(O(N+L)\) memory, matching S4's near-linear complexity.
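As a concrete illustration, here is a minimal NumPy sketch of the kernel computation (my own rendering of the idea, not the authors' code), assuming ZOH discretization and that only one element of each conjugate pair is stored:

```python
import numpy as np

def s4d_kernel(A, B, C, dt, L):
    """Diagonal SSM convolution kernel via a Vandermonde product (ZOH discretization).

    A, B, C: complex arrays of shape (N,), one entry per stored conjugate pair.
    dt: step size; L: kernel length. Returns a real kernel of shape (L,).
    """
    dA = np.exp(dt * A)                # ZOH: A_bar_n = exp(dt * A_n)
    dB = (dA - 1.0) / A * B            # ZOH: B_bar_n = (A_bar_n - 1) / A_n * B_n
    V = dA[:, None] ** np.arange(L)    # Vandermonde matrix: V[n, l] = A_bar_n ** l
    return 2 * ((C * dB) @ V).real     # K_l = 2 Re( sum_n C_n A_bar_n^l B_bar_n )
```

The Vandermonde product itself is the claimed couple of lines; the rest is discretization and bookkeeping.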

3. Parameterization: The Devil is in the Details

With computation simplified, the key design questions are:

  • Parameterizing \(\mathbf{A}\): Stability demands that all eigenvalues have negative real parts. S4D guarantees this by storing a parameter \(\mathbf{A}_{\mathrm{Re}}\) and setting \(\Re(\mathbf{A}) = -\exp(\mathbf{A}_{\mathrm{Re}})\), which remains negative no matter how training updates the parameter (see the sketch after this list).
  • \(\mathbf{B}\) and \(\mathbf{C}\): The kernel depends only on their elementwise product. DSS learns that product as a single parameter; S4D keeps \(\mathbf{B}\) and \(\mathbf{C}\) separate, initializing \(\mathbf{B} = 1\) and training both. Ablations show that training \(\mathbf{B}\), rather than freezing it, yields small but consistent gains.
  • Conjugate Symmetry: To ensure real outputs from real inputs, complex eigenvalues and parameters are stored in conjugate pairs—simplifying implementation and halving storage.
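Here is a rough sketch of how such a parameterization might look (hypothetical helper names; only the mechanism is taken from the paper). The real part is stored in log space so it can never become positive, and only one element of each conjugate pair is kept, matching the factor of 2 in the kernel sketch above.

```python
import numpy as np

def init_s4d_params(A_init):
    """Split complex eigenvalues into trainable pieces with guaranteed stability."""
    log_neg_real = np.log(-A_init.real)   # trainable; -exp(.) is always negative
    A_imag = A_init.imag                  # trainable
    B = np.ones_like(A_imag)              # initialized to 1, optionally trained
    C = np.random.randn(*A_imag.shape) + 1j * np.random.randn(*A_imag.shape)  # trained
    return log_neg_real, A_imag, B, C

def materialize_A(log_neg_real, A_imag):
    # Reconstruct the diagonal of A; Re(A) = -exp(log_neg_real) < 0 by construction.
    return -np.exp(log_neg_real) + 1j * A_imag
```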

S4D: The Best of Both Worlds

By combining DSS’s diagonal \(\mathbf{A}\) with S4’s stable parameterization and efficient Vandermonde computation, S4D becomes a clean, controlled diagonal SSM. Direct comparisons with DPLR S4 are now possible.

Figure 4: A table of the design choices made by S4, DSS, and S4D. S4D merges DSS's diagonal structure with S4's robust parameterization and efficient kernel computation.


The Secret Sauce: Why Initialization is Everything

Diagonal SSMs are expressive in theory but perform poorly with random initialization—the issue is optimization. The initial structure of eigenvalues matters immensely.

S4D-LegS: Magic of the HiPPO Matrix

DSS's diagonal came from S4's HiPPO-LegS matrix: write it in diagonal-plus-low-rank form, \(\mathbf{A} = \mathbf{A}^{(D)} - \mathbf{P}\mathbf{P}^*\), and keep only the diagonal part \(\mathbf{A}^{(D)}\), discarding the low-rank term.

Theorem 3 in the paper proves that as the state size \(N \to \infty\), the convolution kernel produced by the diagonal \(\mathbf{A}^{(D)}\) converges to that of the full HiPPO \(\mathbf{A}\).

Figure 5: Visualization of Theorem 3. As \(N\) increases, the basis functions of the diagonal approximation (S4D-LegS, panels b and c) converge to those of the original S4-LegS (panel a), explaining DSS's success.

This doesn’t hold for arbitrary low-rank perturbations—HiPPO’s structure is special.

S4D-Inv & S4D-Lin: Even Simpler Recipes

Analyzing the eigenvalues of \(\mathbf{A}^{(D)}\) reveals that their imaginary parts approximately follow an inverse scaling law, which motivates a simple closed-form initialization:

\[ \text{S4D-Inv: } \quad A_n = -\frac12 + i\frac{N}{\pi}\left(\frac{N}{2n+1} - 1\right) \]

A simpler S4D-Lin uses equally spaced imaginary parts like Fourier frequencies:

\[ \text{S4D-Lin: } \quad A_n = -\frac12 + i\pi n \]
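Both recipes are a couple of lines of code; here is a direct transcription of the two formulas above into a minimal NumPy sketch:

```python
import numpy as np

def s4d_inv_init(N):
    # S4D-Inv: A_n = -1/2 + i * (N / pi) * (N / (2n + 1) - 1), n = 0, ..., N-1
    n = np.arange(N)
    return -0.5 + 1j * (N / np.pi) * (N / (2 * n + 1) - 1)

def s4d_lin_init(N):
    # S4D-Lin: A_n = -1/2 + i * pi * n, n = 0, ..., N-1
    n = np.arange(N)
    return -0.5 + 1j * np.pi * n
```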

Figure 6: Imaginary parts of the eigenvalues for different S4D initializations. S4D-LegS (blue) is the original; S4D-Inv (orange) closely approximates it with a simple formula; S4D-Lin (green) uses Fourier-like frequencies. Other simple scaling laws (red, purple) underperform, showing that the structured spread of imaginary parts is key.

Two principles emerge:

  1. Constant negative real part to control decay.
  2. Structured, spread-out imaginary parts to cover a range of frequencies.

Experiments: Putting S4D to the Test

Initialization Formula Matters

Even small deviations, such as scaling or randomizing the imaginary parts, hurt performance.

Figure 7: Ablations of the S4D initialization. Deviations from the derived eigenvalue formulas cause a significant drop in performance across tasks.

S4D vs S4: Battle of Equals

Benchmarks include:

  • Sequential CIFAR (image classification, seq. length 1024)
  • Speech Commands (audio classification, 16000 samples)
  • BIDMC (medical time-series regression, length 4000)

Figure 8: Results on these datasets with larger models. The diagonal S4D variants are highly competitive with the full DPLR S4 across all tasks; on Speech Commands, S4D-Inv and S4D-Lin even come out ahead.

The Ultimate Test: Long Range Arena (LRA)

LRA is a suite designed for explicit long-range dependency challenges.

Figure 9: Results on the Long Range Arena benchmark. S4D variants are highly competitive with S4; S4D-Inv averages 85.50%, close to S4's best (86.09%) and far above the Transformer baseline (53.66%).


Conclusion: Simplicity Unlocks Power

This work gives a clear guide to building powerful, efficient sequence models with diagonal state space matrices. The key takeaways:

  • Simplicity is viable: S4's DPLR structure can be replaced by a purely diagonal \(\mathbf{A}\) without losing performance, and the core computation is trivial to implement.
  • Initialization is key: It’s not about capacity; the magic is in structured, principled starting points (S4D-LegS, S4D-Inv, S4D-Lin).
  • A new default sequence model: With simplicity, theory, and performance, S4D is a compelling alternative to RNNs, CNNs, and Transformers for many domains.

By making these models accessible, S4D opens the door for broader adoption and fresh research directions. This is not just a refinement—it’s a leap towards practical, powerful sequence modeling for everyone.