For the past several years, the Transformer architecture has been the undisputed champion of language modeling. From GPT-3 to PaLM, massive Transformer models have redefined the state of the art. But this power comes at a cost: the attention mechanism—at the heart of the Transformer—scales quadratically with sequence length. Processing a sequence twice as long takes four times the computation and memory. This makes working with very long documents, codebases, or audio files a significant challenge.
What if there were another way? An architecture that scales nearly linearly with sequence length—making it incredibly efficient for long sequences—while still matching the modeling power of attention?
Enter State Space Models (SSMs). SSMs have excelled in domains like audio generation and time-series analysis, but they’ve consistently lagged behind Transformers in the complex domain of language. A new paper from Stanford, Hungry Hungry Hippos: Towards Language Modeling with State Space Models, takes a hard look at this performance gap, diagnoses the underlying reasons, and proposes a novel architecture that not only closes the gap but, in some cases, surpasses the Transformer.
This paper makes two major contributions:
- H3 (Hungry Hungry Hippos): A new SSM-based layer designed to address the specific weaknesses of previous SSMs on language tasks.
- FLASHCONV: A hardware-aware algorithm that makes training and running SSMs dramatically faster, overcoming the practical efficiency hurdles that have often held them back.
Let’s dive in and see how these hungry hippos are challenging the Transformer’s reign.
Background: What Are State Space Models?
Before we get to the hippos, we need to understand the fundamentals of State Space Models. Originating in control theory, SSMs are powerful tools for modeling systems that evolve over time.
At their core, discrete-time SSMs map an input sequence \(u_1, u_2, \ldots\) to an output sequence \(y_1, y_2, \ldots\) through a hidden “state” vector \(x_i\). The state acts as a compressed summary of the sequence’s history up to that point. The process is governed by two simple linear equations:
\[
x_i = A\,x_{i-1} + B\,u_i, \qquad y_i = C\,x_i + D\,u_i
\]
Here the hidden state \(x_i\) is updated from \(x_{i-1}\) using the current input \(u_i\), and the output \(y_i\) is read out from the state. The matrices \(A\), \(B\), \(C\), and \(D\) are learned.
This formulation is inherently recurrent. To compute the output at step \(i\), you only need the current input \(u_i\) and the state from the previous step \(x_{i-1}\). This makes SSMs efficient for inference (like generating text one word at a time), since they don’t need to re-read the entire history at each step.
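To make the recurrent view concrete, here is a minimal NumPy sketch of a discrete-time SSM run step by step. It is my own illustration, not the paper's code: the function name `ssm_recurrent` and the toy diagonal \(A\) are just for exposition (real SSM layers like S4D parameterize \(A\), \(B\), \(C\) far more carefully).

```python
import numpy as np

def ssm_recurrent(u, A, B, C, D):
    """Run a discrete-time SSM step by step:
    x_i = A x_{i-1} + B u_i,  y_i = C x_i + D u_i.  (Toy sketch.)"""
    x = np.zeros(A.shape[0])
    ys = []
    for u_i in u:                      # one step per input token
        x = A @ x + B * u_i            # update the compressed history
        ys.append(C @ x + D * u_i)     # read out the current output
    return np.array(ys)

# Example: a 1-D input sequence with a 4-dimensional hidden state.
rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, size=4))   # stable, made-up diagonal dynamics
B, C, D = rng.standard_normal(4), rng.standard_normal(4), 0.5
y = ssm_recurrent(rng.standard_normal(16), A, B, C, D)
```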
The Convolution Connection
While the recurrent view is excellent for generation, it’s slow for training, where we process whole sequences at once. Luckily, there’s a neat mathematical trick: any linear time-invariant system like an SSM can also be represented as a convolution.
Unrolling the recurrence, the output \(y\) can be written as the convolution of the input sequence \(u\) with a special filter \(f\) determined by the learned matrices:
\[
f = [\,CB,\; CAB,\; CA^2B,\; \dots\,]
\]
Each filter tap \(CA^iB\) encodes how the input from \(i\) steps in the past influences the current output.
This means we can compute the entire output sequence in parallel by performing \(y = f * u\). And how do we compute convolutions efficiently? With the Fast Fourier Transform (FFT). Using FFTs, the computation scales as \(O(N \log N)\) with sequence length \(N\)—a massive asymptotic improvement over the Transformer’s \(O(N^2)\).
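Here is the same toy SSM computed as one FFT convolution. Again a sketch under my own naming (`ssm_fftconv`), not the paper's code; FLASHCONV, discussed later, layers hardware-aware tricks on top of exactly this pattern.

```python
import numpy as np

def ssm_fftconv(u, A, B, C):
    """Compute the SSM output as a single causal convolution y = f * u,
    evaluated with FFTs in O(N log N). Toy sketch; D term omitted."""
    N = len(u)
    # Filter taps f_i = C A^i B, unrolled from the recurrence.
    f = np.empty(N)
    Ai_B = B.copy()
    for i in range(N):
        f[i] = C @ Ai_B
        Ai_B = A @ Ai_B
    # Zero-pad to 2N so the circular FFT convolution equals the causal one.
    L = 2 * N
    return np.fft.irfft(np.fft.rfft(f, L) * np.fft.rfft(u, L), L)[:N]

# Matches the step-by-step recurrence from the earlier sketch (with D = 0):
# np.allclose(ssm_fftconv(u, A, B, C), ssm_recurrent(u, A, B, C, 0.0))
```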
So, if SSMs are so efficient, why haven’t they taken over language modeling?
The Problem: Why SSMs Struggle with Language
To find out why SSMs underperform attention, the researchers turned to synthetic language tasks—simple toy problems designed to test specific capabilities. If a model fails on these, it likely lacks fundamental building blocks needed for real language complexity.
They focused on two tasks:
Synthetic language tasks target specific memory and comparison skills.
- Induction Head: The model sees a sequence of random tokens, followed by a special token (e.g., \(\vdash\)), and then more random tokens ending with the special token again. The goal is to predict the token that came immediately after the special token earlier in the sequence.
- Associative Recall: The model sees a series of key–value pairs (e.g., `a 2 c 4 b 3`). At the end, it’s given a key (say `a`) and must output the corresponding value (`2`). (A toy generator for this task is sketched just after this list.)
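For concreteness, here is a tiny generator of associative-recall examples. It only illustrates the task format; the vocabulary, lengths, and function name are made up and do not reflect the paper's exact data pipeline.

```python
import random
import string

def associative_recall_example(num_pairs=4, seed=0):
    """Build one associative-recall example: a flat sequence of key-value
    pairs followed by a query key; the target is that key's value."""
    rng = random.Random(seed)
    keys = rng.sample(string.ascii_lowercase, num_pairs)
    values = [str(rng.randint(0, 9)) for _ in keys]
    sequence = [tok for pair in zip(keys, values) for tok in pair]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    return sequence + [query], target

tokens, answer = associative_recall_example()
print(tokens, "->", answer)   # prints the token sequence and the expected value
```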
These are trivial for Transformers, thanks to attention’s ability to directly “look back” and compare tokens. How did existing SSMs fare?
Existing SSMs fail on tasks requiring precise recall and cross-sequence comparison.
Popular SSM variants like S4D and Gated State Spaces (GSS) failed badly. This points to two missing capabilities:
- Recalling specific tokens seen earlier in the sequence.
- Comparing tokens across the sequence.
Attention excels at both: it builds a full \(N \times N\) matrix of comparisons (\(QK^T\)) and directly copies via the value vectors (\(V\)). The goal was to give SSMs similar abilities.
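For contrast, here is a minimal single-head causal attention sketch showing exactly those two operations: the \(N \times N\) comparison matrix and the weighted copy of value vectors. It is illustrative NumPy, not any particular library's implementation.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention: an N x N matrix of query-key
    comparisons (QK^T), then a weighted copy of the value vectors."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # all pairwise comparisons
    mask = np.tril(np.ones((n, n), bool))             # causal: no looking ahead
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v                                # copy values by weight
```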
The Solution, Part 1: Hungry Hungry Hippos (H3)
The H3 layer is a new SSM architecture designed explicitly to solve these shortcomings. The core idea: combine two different SSMs with multiplicative interactions to mimic attention’s compare–store–recall mechanics.
In H3, a shift SSM and a diagonal SSM work together with multiplicative gates to store and retrieve information over sequences.
High-level intuition:
- Shift SSM: Applied to the K projection, with A as a shift matrix. This creates a short-term memory of recent inputs—perfect for remembering what came just before the current token.
- Diagonal SSM: Based on S4D, this remembers information over the entire sequence.
- Multiplicative gating (\(\odot\)) provides selective storage and retrieval:
  - First gate: \(\text{SSM}_{\text{shift}}(K) \odot V\) stores a value only if the previous token matches a given key.
  - Second gate: \(Q \odot \text{SSM}_{\text{diag}}(\cdot)\) recalls the value only if the current token matches the query key.
The core H3 operation: multiplicative interactions make the memory conditional on matching keys.
As shown in Figure 1 (middle), this mechanism solves the associative recall task exactly: the shift SSM plus the first gate act as the “store” step, and the diagonal SSM plus the second gate act as the “recall” step.
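Here is a heavily simplified, real-valued NumPy sketch of that data flow. It drops H3's heads and head-dimension outer products, uses made-up parameter shapes, and is a reading aid for the diagram rather than the authors' implementation.

```python
import numpy as np

def shift_ssm(k, C):
    """Shift SSM: A is a shift matrix, so the state holds the last m inputs
    and the output is a learned mix of them (a short local memory)."""
    m = len(C)
    state = np.zeros((m, k.shape[1]))
    out = np.empty_like(k)
    for i in range(len(k)):
        state = np.roll(state, 1, axis=0)   # shift the history down one slot
        state[0] = k[i]                      # B u_i with B = e_1
        out[i] = C @ state                   # y_i = C x_i
    return out

def diag_ssm(z, a, b, c):
    """Diagonal SSM (S4D-style, real-valued here for simplicity): each state
    channel decays at its own rate, giving long-range memory."""
    x = np.zeros((len(a), z.shape[1]))
    out = np.empty_like(z)
    for i in range(len(z)):
        x = a[:, None] * x + b[:, None] * z[i][None, :]
        out[i] = c @ x
    return out

def h3_layer(u, Wq, Wk, Wv, C_shift, a, b, c):
    """Simplified H3 forward pass: Q ⊙ SSM_diag(SSM_shift(K) ⊙ V)."""
    q, k, v = u @ Wq, u @ Wk, u @ Wv
    gated_store = shift_ssm(k, C_shift) * v   # store V where the shifted K "matches"
    memory = diag_ssm(gated_store, a, b, c)   # carry stored values across the sequence
    return q * memory                          # recall where Q matches

# Toy usage with arbitrary sizes (n tokens, model width 16, inner width 8).
rng = np.random.default_rng(0)
u = rng.standard_normal((32, 16))
out = h3_layer(
    u,
    rng.standard_normal((16, 8)), rng.standard_normal((16, 8)),
    rng.standard_normal((16, 8)),
    rng.standard_normal(4),                               # shift-SSM readout
    rng.uniform(0.5, 0.99, 8), rng.standard_normal(8), rng.standard_normal(8),
)
```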
Real-language Benchmarks
Does this improved expressivity matter for actual language modeling? Yes—dramatically.
On OpenWebText, H3 nearly matches Transformer perplexity, and a hybrid model beats it.
The pure H3 model closes the gap with Transformers to just 0.4 perplexity points, a huge improvement over prior SSMs. A simple hybrid H3–attention model, with only 2 attention layers and the rest H3, outperforms a pure Transformer by a full point.
The Solution, Part 2: FLASHCONV for Hardware Efficiency
H3 has the modeling power—but what about speed? Asymptotically, SSMs are faster (\(O(N \log N)\) vs. \(O(N^2)\)), but on real GPUs they’re often slower. Why? The hardware lottery: modern GPUs are optimized for attention’s giant matrix multiplies, not FFTs.
The answer: FLASHCONV, an I/O-aware convolution algorithm inspired by FlashAttention, with two pillars:
- Fused Block FFTConv — for sequences up to 8K tokens:
  - Kernel fusion: combine the FFT, pointwise multiply, and inverse FFT in a single kernel to minimize costly trips between fast SRAM and slower HBM.
  - Block FFT: rewrite the FFT as a series of small matrix multiplies to exploit Tensor Cores (a toy decomposition is sketched just after this list).
- State passing — for sequences beyond 8K tokens:
  - Break the input into SRAM-sized chunks.
  - Pass the final state of each chunk forward as the initial state for the next.
  - Repeat until the full sequence is processed (see the second sketch below).
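To see what “rewrite the FFT as matrix multiplies” means, here is a toy NumPy decomposition of a length-\(N_1 N_2\) DFT into two batches of small matmuls plus a twiddle correction (the classic Cooley–Tukey factorization). The real kernel does this with Tensor Core matmuls inside a fused CUDA kernel; this sketch only shows the arithmetic idea.

```python
import numpy as np

def block_fft(x, n1, n2):
    """A length-(n1*n2) DFT written as two small matrix multiplies plus a
    twiddle correction. Toy illustration of the 'block FFT' idea."""
    n = n1 * n2
    f1 = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n1)) / n1)
    f2 = np.exp(-2j * np.pi * np.outer(np.arange(n2), np.arange(n2)) / n2)
    twiddle = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / n)
    a = x.reshape(n2, n1).T        # a[j1, j2] = x[j1 + n1 * j2]
    b = a @ f2                     # small DFT over the j2 axis (a matmul)
    c = twiddle * b                # twiddle factors couple the two stages
    d = f1 @ c                     # small DFT over the j1 axis (a matmul)
    return d.reshape(n)            # output index k = n2 * k1 + k2

x = np.random.randn(1024).astype(complex)
assert np.allclose(block_fft(x, 32, 32), np.fft.fft(x))
```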
State passing lets FLASHCONV handle arbitrarily long sequences efficiently.
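And here is a toy sketch of the state-passing bookkeeping, using the same scalar-channel SSM as the earlier sketches: each chunk is an FFT convolution plus a correction for the state carried in from the previous chunk, and only that state crosses chunk boundaries. The actual FLASHCONV kernel fuses all of this on the GPU; this is just the math.

```python
import numpy as np

def ssm_state_passing(u, A, B, C, chunk=1024):
    """Process an arbitrarily long input in chunks: FFT-convolve each chunk,
    add the effect of the carried-in state, and pass the final state forward.
    Toy sketch (scalar input/output, D term omitted)."""
    d = A.shape[0]
    # Quantities that depend only on the fixed SSM parameters.
    powers = [np.linalg.matrix_power(A, t) for t in range(chunk + 1)]
    f = np.array([C @ powers[t] @ B for t in range(chunk)])      # f_t = C A^t B
    CA = np.stack([C @ powers[t] for t in range(chunk + 1)])     # rows C A^t
    AB = np.stack([powers[t] @ B for t in range(chunk)])         # rows A^t B
    x, ys = np.zeros(d), []
    for start in range(0, len(u), chunk):
        u_c = u[start:start + chunk]
        c = len(u_c)
        L = 2 * c
        y_c = np.fft.irfft(np.fft.rfft(f[:c], L) * np.fft.rfft(u_c, L), L)[:c]
        y_c += CA[1:c + 1] @ x                     # effect of the carried-in state
        ys.append(y_c)
        x = powers[c] @ x + AB[:c][::-1].T @ u_c   # final state -> next chunk
    return np.concatenate(ys)

# Agrees with the single full-length convolution sketched earlier:
# np.allclose(ssm_state_passing(u, A, B, C, chunk=8), ssm_fftconv(u, A, B, C))
```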
FLASHCONV Performance
The speedups are huge. On the Long Range Arena (LRA) benchmark, S4 with FLASHCONV is 5.8× faster than a Transformer.
When benchmarking H3 against optimized FlashAttention, FLASHCONV maintains near-linear scaling and overtakes attention for long sequences.
Kernel fusion speeds short sequences, block FFT helps with medium ones, and state passing dominates for long sequences.
For sequences ≥16K, SSMs with state passing are dozens of times faster than even the fastest attention implementations.
Putting It All Together: Scaling Up H3
With an expressive new layer (H3) and a fast algorithm (FLASHCONV), the team scaled hybrid H3–attention models to sizes from 125M to 2.7B parameters, trained on the massive Pile dataset.
Perplexity Wins
On The Pile, OpenWebText, and WikiText-103, H3 hybrids beat same-sized Transformers.
Zero-/Few-Shot SuperGLUE
H3 hybrids match or exceed Transformer performance across most SuperGLUE tasks.
Faster Inference
Thanks to recurrence, H3 hybrids generate text 2.4× faster than Transformers—critical for real-time systems.
Conclusion: A New Contender for the Throne
Hungry Hungry Hippos makes a persuasive case that pure attention-based models may no longer be the sole rulers of sequence modeling. By diagnosing SSM weaknesses and designing H3 to fix them, the researchers created a powerful new block for language models.
With FLASHCONV, they removed efficiency barriers—making SSMs not just theoretically faster but practically faster.
A simple hybrid—mostly SSMs with just a pinch of attention—outperforms a pure Transformer. The future may well be about combining complementary strengths, not sticking to one architecture.
With better scaling, faster inference, and state-of-the-art performance, these hungry hippos have staked a serious claim to the sequence modeling crown.