Generating realistic, high-fidelity audio is one of the grand challenges in machine learning.
Think about what a raw audio waveform is: a sequence of tens of thousands of numbers—or samples—for every second of sound.
To produce even a few seconds of coherent music or speech, a model needs to understand intricate local patterns (like the texture of a piano note) while simultaneously maintaining global structure over hundreds of thousands of timesteps (like an evolving melody or a spoken sentence).
For years, this problem has been tackled by specialized versions of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
Models like SampleRNN and the celebrated WaveNet have pushed the boundaries of what’s possible, but each comes with fundamental trade-offs:
- RNNs are slow to train because they process data sequentially—one sample at a time.
- CNNs are faster to train thanks to parallelism, but are limited by their receptive field, making it difficult to capture very long-range dependencies.
What if we could have the best of both worlds?
A model that trains in parallel like a CNN and generates efficiently like an RNN, while modeling truly long-range structure?
A recent paper from Stanford University, It’s Raw! Audio Generation with State-Space Models, introduces exactly such a model: SASHIMI.
It leverages deep State-Space Models (SSMs) to achieve faster training, efficient generation, and audio that humans rate as significantly more musical and coherent than its predecessors.
In this post, we’ll unpack how SASHIMI works:
- A refresher on autoregressive audio modeling and its predecessors.
- The Structured State-Space sequence model (S4) at SASHIMI’s core.
- The stability fix that keeps SASHIMI’s generations sane.
- The multi-scale architecture that lets it span from micro-details to macro-structure.
- Results from music and speech generation benchmarks.
Background: The Quest for the Perfect Audio Model
Autoregressive Modeling: Predicting the Future One Sample at a Time
At its core, an autoregressive (AR) model learns the probability distribution of a sequence by predicting each timestep based on all previous ones.
Formally, for an audio waveform \(x = (x_0, x_1, \dots, x_{T-1})\), the joint probability factorizes as

\[ p(x) = \prod_{t=0}^{T-1} p(x_t \mid x_0, \dots, x_{t-1}). \]
- Training: the model sees a real audio sequence and predicts the next sample at every step.
- Generation: starting from a short seed of audio (or silence), it samples from its predicted distribution, appends it to the input, and repeats—building an entire waveform one sample at a time.
This approach supports sequences of any length, but hinges on choosing the right architecture for \( p(x_t \mid x_{<t}) \).
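To make the training/generation split concrete, here is a minimal sketch of the sampling loop in Python. The `model.predict_distribution` method is a hypothetical stand-in for whatever architecture estimates \( p(x_t \mid x_{<t}) \).

```python
import numpy as np

def generate(model, seed, num_samples):
    """Autoregressive generation: sample one value at a time and feed it back.

    `model.predict_distribution(prefix)` is a hypothetical method standing in
    for any architecture that returns p(x_t | x_{<t}) as a probability vector
    over the 256 possible 8-bit sample values.
    """
    rng = np.random.default_rng(0)
    waveform = list(seed)
    for _ in range(num_samples):
        probs = model.predict_distribution(np.array(waveform))  # p(x_t | x_{<t})
        next_sample = rng.choice(len(probs), p=probs)            # draw the next sample
        waveform.append(int(next_sample))                        # append and repeat
    return np.array(waveform)
```

Notice that each new sample requires another forward pass over the prefix unless the architecture carries a running state, which is exactly the trade-off the next sections explore.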
CNNs (e.g., WaveNet): Parallel Training, Limited Context

A CNN applies a learned kernel (filter) across the input sequence, e.g. \( y_t = \sum_{k=0}^{K-1} w_k\, x_{t-k} \) for a causal filter of length \(K\). WaveNet famously used dilated convolutions, skipping inputs to exponentially expand its receptive field without inflating the parameter count.

Training is highly parallel—perfect for GPUs—but inference is awkward: CNNs can only “see” a fixed past window. For 16 kHz audio, a WaveNet might only access the last few seconds—limiting its ability to produce melodies or sentences with truly global structure.
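To see why the receptive field matters, here is a quick back-of-the-envelope calculation in Python. The dilation schedule is illustrative, not WaveNet's exact configuration.

```python
# Receptive field of a stack of causal dilated convolutions:
# each layer with dilation d and kernel size k extends the context by (k - 1) * d samples.
def receptive_field(dilations, kernel_size=2):
    return 1 + (kernel_size - 1) * sum(dilations)

# Illustrative schedule: dilations 1, 2, 4, ..., 4096, repeated over 6 blocks.
dilations = [2 ** i for i in range(13)] * 6
samples = receptive_field(dilations)
print(samples, "samples =", samples / 16000, "seconds at 16 kHz")
# -> 49147 samples, roughly 3.1 seconds of context: long, but still a fixed window
```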
RNNs (e.g., SampleRNN): Infinite Context, Slow Training

An RNN processes sequences step-by-step, maintaining a hidden state \(h_t\) that summarizes everything seen so far:

\[ h_t = f(h_{t-1}, x_t), \qquad p(x_{t+1} \mid x_{\le t}) = g(h_t). \]

This stateful design gives RNNs theoretically infinite memory of the past—and inference is fast (just one hidden-state update per sample). But training is painfully slow: the hidden state must be computed sequentially.
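A few lines of NumPy show why this is hard to parallelize: each hidden state depends on the previous one, so the time loop cannot be unrolled across timesteps the way a convolution can. The simple tanh cell here is a generic stand-in, not SampleRNN's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, T = 1, 16, 16000
W_h = rng.standard_normal((d_hidden, d_hidden)) * 0.1  # recurrent weights
W_x = rng.standard_normal((d_hidden, d_in))            # input weights
x = rng.standard_normal((T, d_in))                     # one second of 16 kHz audio

h = np.zeros(d_hidden)
for t in range(T):                       # inherently sequential: h[t] needs h[t-1]
    h = np.tanh(W_h @ h + W_x @ x[t])    # one cheap update per sample at inference,
                                         # but 16,000 dependent steps during training
```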
State-Space Models: A New Hope

State-Space Models (SSMs) originate from control theory and are described by continuous-time linear differential equations:

\[ h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t). \]

Here, \(x(t)\) is the input, \(h(t)\) the latent state, and \(y(t)\) the output. When discretized for sequences like audio, SSMs can be computed in two equivalent ways: as a recurrence \( h_k = \overline{A} h_{k-1} + \overline{B} x_k \) (one state update per step, like an RNN), or as a convolution of the input with a kernel built from powers of \( \overline{A} \) (fully parallel, like a CNN). The convolutional kernel is effectively infinite, overcoming the receptive field limits of traditional CNNs.

The S4 Model

S4 (the Structured State-Space sequence model) is a powerful instantiation of SSMs: it can classify raw audio and generate sequences by switching between convolutional (training) and recurrent (generation) modes.
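Here is a tiny NumPy illustration of that dual view, using an already-discretized single-input, single-output SSM with made-up matrices. This sketches the general SSM idea rather than S4's actual implementation: the recurrent and convolutional computations produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32                              # state size, sequence length
A = np.diag(rng.uniform(-0.9, 0.9, N))    # a stable, already-discretized state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)                # input sequence (e.g. audio samples)

# 1) Recurrent mode: one state update per step, ideal for sample-by-sample generation.
h = np.zeros((N, 1))
y_rec = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())
y_rec = np.array(y_rec)

# 2) Convolutional mode: precompute the kernel K_k = C A^k B, then convolve in parallel.
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = np.convolve(x, K)[:L]            # causal convolution, truncated to length L

print(np.allclose(y_rec, y_conv))         # True: both modes give identical outputs
```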
Building a Better Audio Model with SASHIMI

SASHIMI builds on S4, adding two key innovations tailored for raw audio generation.
1. Stabilizing S4 for Generation

The original S4 worked fine in convolutional mode but sometimes became numerically unstable in recurrent mode—generation would explode into garbage.

Why? In recurrent updates \( h_k = \overline{A} h_{k-1} + \overline{B} x_k \), stability depends on all eigenvalues of \( \overline{A} \) lying inside the unit disk, which requires \(A\) to be Hurwitz (all eigenvalues have negative real parts). The original parameterization,

\[ A = \Lambda + pq^*, \]

didn’t guarantee this, and training often pushed \(A\) out of the Hurwitz space.

The fix: tie \(p\) and \(q\) and flip the sign:

\[ A = \Lambda - pp^*. \]

The term \(-pp^*\) is negative semidefinite, nudging eigenvalues leftward. If \(\Lambda\) has entries with negative real parts, \(A\) is provably Hurwitz. In practice, even unconstrained \(\Lambda\) stayed stable.
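A small numerical check makes the difference tangible. The matrices below follow the parameterizations described above with random values: with the untied low-rank term, eigenvalues can drift into the right half-plane, while the tied, sign-flipped form keeps them to the left whenever \(\Lambda\) does.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
Lam = -np.abs(rng.standard_normal(N)) + 1j * rng.standard_normal(N)  # Re(Λ) < 0
p = rng.standard_normal(N) + 1j * rng.standard_normal(N)
q = rng.standard_normal(N) + 1j * rng.standard_normal(N)

A_old = np.diag(Lam) + np.outer(p, q.conj())   # untied low-rank term: no guarantee
A_new = np.diag(Lam) - np.outer(p, p.conj())   # tied, sign-flipped term: Hurwitz

print(np.linalg.eigvals(A_old).real.max())  # often > 0: the recurrence can blow up
print(np.linalg.eigvals(A_new).real.max())  # provably < 0: generation stays stable
```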
2. SASHIMI’s Multi-Scale Architecture

The second innovation is multi-scale processing—capturing audio at different resolutions simultaneously. Raw audio has structure at multiple scales: fine texture (milliseconds), notes/phonemes (tens of milliseconds), melodies/sentences (seconds).

How it works: the waveform is processed by a stack of tiers, each built from S4 blocks. Pooling layers downsample the signal between tiers, so lower tiers see a coarser, slower version of the sequence; their outputs are then upsampled and added back to the finer tiers above.

This design efficiently models long-term dependencies without losing per-sample precision.
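Here is a rough two-tier sketch of that idea in NumPy: downsample, process at the coarser rate, upsample, and add the result back to the sample-rate tier. The pooling factor, the tier count, and the `s4_block` placeholder are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def s4_block(x):
    """Placeholder for a stack of S4 layers; here just a causal smoothing filter."""
    kernel = np.array([0.5, 0.3, 0.2])
    return np.convolve(x, kernel)[: len(x)]

def downsample(x, factor):
    # Pool consecutive samples (assumes len(x) is divisible by the factor).
    return x.reshape(-1, factor).mean(axis=1)

def upsample(x, factor):
    # Stretch the coarse sequence back to the finer sample rate.
    return np.repeat(x, factor)

def multi_scale(x, factor=4):
    """Two-tier sketch: a coarse tier adds long-range context to the sample-rate tier."""
    coarse = s4_block(downsample(x, factor))   # long-range structure, computed cheaply
    fine = s4_block(x)                         # per-sample detail at full resolution
    return fine + upsample(coarse, factor)     # fuse the two resolutions

audio = np.random.default_rng(0).standard_normal(16000)  # one second at 16 kHz
print(multi_scale(audio).shape)  # (16000,)
```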
Experiments: Benchmarking SASHIMI

The authors tested SASHIMI against WaveNet and SampleRNN on unconditional music and speech generation.

Datasets

- Beethoven Piano Sonatas: solo piano recordings, used for autoregressive music modeling.
- YouTubeMix: piano performances, used for longer-context music generation and listening tests.
- SC09: one-second clips of spoken digits, used for unconditional speech generation.
Unbounded Music Generation

Music is a natural fit for AR models: it spans long time scales and can be generated indefinitely.

Beethoven Piano Sonatas: SASHIMI achieved lower NLL than both WaveNet and SampleRNN, and it benefited dramatically from training on longer contexts.

YouTubeMix Piano Dataset: Listening tests (MOS) measured fidelity and musicality for 16-second clips; listeners rated SASHIMI’s samples as more musical than the baselines’, showing that its statistical gains translate into more coherent, pleasing music.

Efficiency: A tiny 1.29 M-parameter SASHIMI beats a 4.24 M-parameter WaveNet on NLL while training 3× faster.
Unconditional Speech Generation

SC09 Spoken Digits: one-second clips require modeling words, speaker variability, accents, and noise. SASHIMI AR models outperformed baselines on automated (FID ↓, IS ↑) and human (Quality ↑, Intelligibility ↑) metrics.

SASHIMI as a Drop-in Backbone

DiffWave is a non-AR diffusion model with a WaveNet backbone.
The authors swapped WaveNet for a same-size SASHIMI—no tuning—and saw state-of-the-art SC09 performance.
Sample Efficiency: SASHIMI-DiffWave matched WaveNet-DiffWave’s best performance in half the training time and scaled better to small models.

Conclusion: Why SASHIMI Matters

The SASHIMI paper delivers a new architecture that wins on multiple fronts: faster training, efficient generation, and audio that listeners rate as more musical and coherent. By diagnosing and fixing S4’s stability issues and embedding it in a smart multi-scale architecture, the authors have created a robust, efficient tool for audio ML. SASHIMI not only pushes the limits of raw audio generation—it’s poised to reshape the landscape of audio synthesis systems entirely.
The next time you hear AI-generated music or speech, there’s a good chance a State-Space Model is playing under the hood.