Generating realistic, high-fidelity audio is one of the grand challenges in machine learning.
Think about what a raw audio waveform is: a sequence of tens of thousands of numbers—or samples—for every second of sound.

To produce even a few seconds of coherent music or speech, a model needs to understand intricate local patterns (like the texture of a piano note) while simultaneously maintaining global structure over hundreds of thousands of timesteps (like an evolving melody or a spoken sentence).

For years, this problem has been tackled by specialized versions of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
Models like SampleRNN and the celebrated WaveNet have pushed the boundaries of what’s possible, but each comes with fundamental trade-offs:

  • RNNs are slow to train because they process data sequentially—one sample at a time.
  • CNNs are faster to train thanks to parallelism, but are limited by their receptive field, making it difficult to capture very long-range dependencies.

What if we could have the best of both worlds?
A model that trains in parallel like a CNN and generates efficiently like an RNN, while modeling truly long-range structure?

A recent paper from Stanford University, It’s Raw! Audio Generation with State-Space Models, introduces exactly such a model: SASHIMI.
It leverages deep State-Space Models (SSMs) to achieve faster training, efficient generation, and audio that humans rate as significantly more musical and coherent than its predecessors.

In this post, we’ll unpack how SASHIMI works:

  1. A refresher on autoregressive audio modeling and its predecessors.
  2. The Structured State-Space sequence model (S4) at SASHIMI’s core.
  3. The stability fix that keeps SASHIMI’s generations sane.
  4. The multi-scale architecture that lets it span from micro-details to macro-structure.
  5. Results from music and speech generation benchmarks.

Background: The Quest for the Perfect Audio Model

Autoregressive Modeling: Predicting the Future One Sample at a Time

At its core, an autoregressive (AR) model learns the probability distribution of a sequence by predicting each timestep based on all previous ones.
Formally, for an audio waveform \(x = (x_0, x_1, \dots, x_{T-1})\):

\[ p(x) = \prod_{t=0}^{T-1} p(x_t \mid x_0, \dots, x_{t-1}) \]

Training: the model sees a real audio sequence and predicts the next sample at every step.
Generation: starting from a short seed of audio (or silence), it samples from its predicted distribution, appends it to the input, and repeats—building an entire waveform one sample at a time.
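
To make this concrete, here is a minimal sketch of that generation loop in Python. The `model`, `seed`, and quantization scheme are placeholders for an arbitrary trained AR model that maps a sequence of quantized samples to per-step logits; this is illustrative, not the paper's code.

```python
import torch

@torch.no_grad()
def generate(model, seed, num_samples):
    """Sample a waveform autoregressively: predict a distribution over the
    next quantized sample, draw from it, append it, and repeat."""
    sequence = list(seed)                              # short seed (or silence)
    for _ in range(num_samples):
        context = torch.tensor(sequence).unsqueeze(0)  # (1, T)
        logits = model(context)[0, -1]                 # logits for the next sample
        probs = torch.softmax(logits, dim=-1)
        next_sample = torch.multinomial(probs, 1).item()
        sequence.append(next_sample)
    return sequence
```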

This approach supports sequences of any length, but hinges on choosing the right architecture for \( p(x_t \mid x_{<t}) \).


CNNs (e.g., WaveNet): Parallel Training, Limited Context

A CNN applies a learned kernel (filter) across the input sequence:

\[ y = K * x \]

WaveNet famously used dilated convolutions, skipping inputs to exponentially expand its receptive field without inflating the parameter count.
Training is highly parallel, a perfect fit for GPUs, but a CNN can only “see” a fixed past window (its receptive field), and naive sample-by-sample generation requires re-running the network at every step.

For 16 kHz audio, a WaveNet might only access the last few seconds—limiting its ability to produce melodies or sentences with truly global structure.
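
As a rough sketch of where that window comes from, the snippet below stacks causal dilated 1-D convolutions and computes the resulting receptive field. The layer count, channel width, and kernel size are illustrative, not WaveNet's exact configuration.

```python
import torch.nn as nn

kernel_size = 2
dilations = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512

# A stack of dilated convolutions (causal left-padding omitted for brevity).
layers = nn.ModuleList(
    [nn.Conv1d(64, 64, kernel_size, dilation=d) for d in dilations]
)

# Receptive field of stacked dilated convs with kernel size k:
#   1 + (k - 1) * sum(dilations)
receptive_field = 1 + (kernel_size - 1) * sum(dilations)
print(receptive_field)                         # 1024 samples, about 64 ms at 16 kHz
```

Growing the window further means more layers or wider dilations, so very long contexts get expensive quickly.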


RNNs (e.g., SampleRNN): Infinite Context, Slow Training

An RNN processes sequences step-by-step, maintaining a hidden state \(h_t\) summarizing everything seen:

\[ h_t = f(h_{t-1}, x_t) \quad y_t = g(h_t) \]

This stateful design gives RNNs theoretically infinite memory of the past—and inference is fast (just one hidden-state update per sample).
But training is painfully slow: the hidden state must be computed sequentially.
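
A minimal sketch of that recurrence makes the bottleneck visible; a generic GRU cell and hypothetical sizes stand in for SampleRNN's actual architecture. The loop over `t` cannot be parallelized, which is exactly what makes training slow.

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=1, hidden_size=256)   # any recurrent cell works here
readout = nn.Linear(256, 256)                      # e.g. logits over 256 mu-law levels

def rnn_forward(x):                                # x: (batch, T, 1)
    h = torch.zeros(x.size(0), 256)                # initial hidden state h_0
    outputs = []
    for t in range(x.size(1)):                     # strictly sequential in t
        h = cell(x[:, t], h)                       # h_t = f(h_{t-1}, x_t)
        outputs.append(readout(h))                 # y_t = g(h_t)
    return torch.stack(outputs, dim=1)             # (batch, T, 256)
```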


State-Space Models: A New Hope

State-Space Models (SSMs) originate from control theory and are described by continuous-time linear differential equations:

\[ \begin{aligned} h'(t) &= Ah(t) + Bx(t) \\ y(t) &= Ch(t) + Dx(t) \end{aligned} \]

Here, \(x(t)\) is the input, \(h(t)\) the latent state, and \(y(t)\) the output.
When discretized for sequences like audio, SSMs can be computed:

  1. As an RNN: a simple linear recurrence—fast generation.
  2. As a CNN: a single, extremely long convolution—fully parallelizable training.

The convolutional kernel is effectively infinite, overcoming the receptive field limits of traditional CNNs.
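
The equivalence of the two views is easy to verify numerically. The sketch below uses a small random, already-discretized state-space model (not S4's actual parameterization) and omits the feedthrough term \(D\); it checks that unrolling the recurrence and convolving with the materialized kernel give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                   # state size, sequence length
A = 0.3 * rng.normal(size=(N, N))              # stand-in for the discretized A
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# View 1: a linear RNN, one state update per timestep.
h = np.zeros((N, 1))
y_rnn = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rnn.append((C @ h).item())

# View 2: materialize the kernel K = (CB, CAB, CA^2B, ...) and convolve.
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = [np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)]

assert np.allclose(y_rnn, y_conv)              # same outputs, two algorithms
```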


The S4 Model

S4 (Structured State-Space Sequence model) is a powerful instantiation of SSMs:

  • Parameterizes \(A\) as diagonal plus low-rank (DPLR) for speed.
  • Initializes using HiPPO theory to model long-range dependencies.

S4 can classify raw audio and generate sequences by switching between convolutional (training) and recurrent (generation) modes.
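
To see why the DPLR structure matters computationally, here is an illustrative numpy check that a matrix-vector product with \(A = \Lambda + pq^*\) never needs the dense matrix. S4's real algorithms exploit the same structure in more sophisticated ways (e.g., for the convolution kernel), so treat this only as the structural idea.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
Lam = rng.normal(size=N) + 1j * rng.normal(size=N)   # diagonal part
p = rng.normal(size=N) + 1j * rng.normal(size=N)     # low-rank factors
q = rng.normal(size=N) + 1j * rng.normal(size=N)
h = rng.normal(size=N) + 1j * rng.normal(size=N)

# Dense matvec: build A explicitly, O(N^2) work.
A = np.diag(Lam) + np.outer(p, np.conj(q))
y_dense = A @ h

# Structured matvec: use the DPLR form directly, O(N) work.
y_struct = Lam * h + p * np.vdot(q, h)               # vdot conjugates its first argument

assert np.allclose(y_dense, y_struct)
```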


Building a Better Audio Model with SASHIMI

SASHIMI builds on S4, adding two key innovations tailored for raw audio generation.

1. Stabilizing S4 for Generation

The original S4 worked fine in convolutional mode but sometimes became numerically unstable in recurrent mode—generation would explode into garbage.

Why?
In recurrent updates \( h_k = \overline{A}h_{k-1} + \dots \), stability depends on all eigenvalues of \( \overline{A} \) being inside the unit disk. This requires \(A\) to be Hurwitz (all eigenvalues have negative real parts).

The original parameterization:

\[ A = \Lambda + pq^* \]

didn’t guarantee this, and training often pushed \(A\) out of the Hurwitz space.

The Fix:
Tie p and q and flip the sign:

\[ A = \Lambda - pp^* \]

The term \(-pp^*\) is negative semidefinite, nudging eigenvalues leftward.
If \(\Lambda\) has entries with negative real parts, \(A\) is provably Hurwitz.
In practice, even unconstrained \(\Lambda\) stayed stable.
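
A quick numerical sanity check of this argument, using arbitrary random \(\Lambda\), \(p\), and \(q\) rather than trained values: with the tied, sign-flipped form, every eigenvalue of \(A\) stays in the left half-plane, so the bilinearly discretized \(\overline{A}\) stays inside the unit disk.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
Lam = -np.abs(rng.normal(size=N)) + 1j * rng.normal(size=N)   # Re(Lambda) < 0
p = rng.normal(size=(N, 1)) + 1j * rng.normal(size=(N, 1))
q = rng.normal(size=(N, 1)) + 1j * rng.normal(size=(N, 1))

A_old = np.diag(Lam) + p @ q.conj().T    # original form: Lambda + p q*
A_new = np.diag(Lam) - p @ p.conj().T    # Hurwitz form:  Lambda - p p*

print(np.linalg.eigvals(A_old).real.max())   # no sign guarantee: can drift positive
print(np.linalg.eigvals(A_new).real.max())   # provably <= max Re(Lambda) < 0

# Bilinear discretization with step dt; a Hurwitz A maps inside the unit disk.
dt, I = 1e-2, np.eye(N)
Abar = np.linalg.solve(I - dt / 2 * A_new, I + dt / 2 * A_new)
print(np.abs(np.linalg.eigvals(Abar)).max())  # < 1: the recurrence cannot explode
```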

Comparison of spectral radii for the S4 state matrix with the standard and Hurwitz parameterizations. The dotted line indicates the instability threshold (magnitude 1). The Hurwitz form keeps eigenvalues below 1. A table confirms stable generation without loss in NLL.


2. SASHIMI’s Multi-Scale Architecture

The second innovation is multi-scale processing—capturing audio at different resolutions simultaneously.
Raw audio has structure at multiple scales:
fine texture (milliseconds), notes/phonemes (tens of ms), melodies/sentences (seconds).

Diagram of the SASHIMI architecture. Input waveform flows through S4 blocks, down-pools to lower resolution with expanded channels, processes through more S4 blocks, then up-pools and merges with high-resolution features—U-Net style.

How it works:

  1. S4 Blocks: Residual blocks with stabilized S4 layers, LayerNorm, GELU, and linear transforms.
  2. Down-Pooling: Reshape & project to shorter, wider sequences (e.g., \(L \to L/4\), \(H \to 2H\)), condensing local context (sketched in code after this list).
  3. Hierarchical Processing: Stacks of S4 blocks at multiple tiers, each modeling longer-range dependencies at coarser resolutions.
  4. Up-Pooling & Skip Connections: Expand back to finer resolution and merge with higher tier outputs—combining global context with local detail.

This design efficiently models long-term dependencies without losing per-sample precision.
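
Here is a hedged sketch of the down-pooling step (step 2 above). The pooling factor, width expansion, and use of a plain linear projection are illustrative choices, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DownPool(nn.Module):
    """Shorten the sequence by `pool` and widen the channels by `expand`."""
    def __init__(self, dim, pool=4, expand=2):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(dim * pool, dim * expand)

    def forward(self, x):                                    # x: (batch, L, H)
        B, L, H = x.shape
        x = x.reshape(B, L // self.pool, H * self.pool)      # stack `pool` neighbors
        return self.proj(x)                                  # (batch, L/4, 2H)

x = torch.randn(8, 1024, 64)
print(DownPool(64)(x).shape)                                 # torch.Size([8, 256, 128])
```

Up-pooling reverses the operation (project, then reshape back to a longer, narrower sequence), and skip connections add the high-resolution features back in.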


Experiments: Benchmarking SASHIMI

The authors tested SASHIMI against WaveNet and SampleRNN on unconditional music and speech generation.


Datasets

Summary table of music and speech datasets used for AR generation: Beethoven piano, YouTubeMix piano, SC09 spoken digits. Includes duration, chunk length, sampling rate, quantization, splits.


Unbounded Music Generation

Music is a natural fit for AR models: it spans long time scales and can be generated indefinitely.

Beethoven Piano Sonatas:
SASHIMI achieved lower NLL than both WaveNet and SampleRNN.

Bar chart/table: Beethoven dataset results. SASHIMI achieves lowest NLL (0.946) and trains faster.

It also benefited dramatically from training on longer contexts:

Line plot: SASHIMI NLL drops from 1.364 (1s context) to 1.007 (8s context), showing gains from longer training contexts.


YouTubeMix Piano Dataset:
Listening tests (MOS) measured fidelity and musicality for 16-second clips.

MOS results: All models similar in fidelity (~2.9), but SASHIMI scores much higher in musicality (3.11) vs. WaveNet (2.71) and SampleRNN (1.82).

This shows that SASHIMI’s statistical gains translate into more coherent, pleasing music.


Efficiency:
A tiny 1.29 M-parameter SASHIMI beats a 4.24 M WaveNet on NLL, while training 3× faster.

Architectural ablations/efficiency table: Small SASHIMI outperforms bigger baselines; multi-scale pooling boosts speed/quality over isotropic S4.


Unconditional Speech Generation

SC09 Spoken Digits:
One-second clips require modeling words, speaker variability, accents, and noise.

SASHIMI AR models outperformed baselines in automated (FID ↓, IS ↑) and human (Quality ↑, Intelligibility ↑) metrics.

SC09 results table: SASHIMI achieves much better automated and human scores than SampleRNN/WaveNet.


SASHIMI as a Drop-in Backbone

DiffWave is a non-AR diffusion model with a WaveNet backbone.
The authors swapped WaveNet for a same-size SASHIMI—no tuning—and saw state-of-the-art SC09 performance.

DiffWave with SASHIMI backbone: beats WaveNet-based DiffWave on all metrics (FID, IS, MOS), sets new SC09 state-of-the-art.


Sample Efficiency:
SASHIMI-DiffWave matched WaveNet-DiffWave’s best performance in half the training time and scaled better to small models.

Training curves: SASHIMI (green) reaches low NLL faster than WaveNet (orange) and SampleRNN (blue).


Conclusion: Why SASHIMI Matters

The SASHIMI paper delivers a new architecture that wins on multiple fronts:

  1. Performance: State-of-the-art music & speech generation, judged better by both humans and metrics.
  2. Efficiency: Parallel training like CNNs, fast generation like RNNs—often with fewer parameters.
  3. Long-Range Modeling: Handles >100,000-step contexts for superior global coherence.
  4. Versatility: Drop-in replacement for WaveNet, improving models like DiffWave without extra tuning.

By diagnosing and fixing S4’s stability issues and embedding it in a smart multi-scale architecture, the authors have created a robust, efficient tool for audio ML.

SASHIMI not only pushes the limits of raw audio generation—it’s poised to reshape the landscape of audio synthesis systems entirely.
The next time you hear AI-generated music or speech, there’s a good chance a State-Space Model is playing under the hood.