For years, the Transformer has been the undisputed champion of sequence modeling, powering everything from large language models like GPT to breakthroughs in scientific and multimodal AI. Yet even kings have weaknesses—Transformers struggle with efficiency. Their computational cost grows quadratically with sequence length, meaning that processing a book is vastly more expensive than processing a sentence. This limitation has become a severe bottleneck as researchers push toward models that can understand entire codebases, long conversations, or even persistent streams of sensory data.
Enter State Space Models (SSMs). These architectures promise the performance of Transformers but with a crucial advantage: linear scaling with sequence length. In other words, doubling the sequence length merely doubles the computational cost—rather than quadrupling it. Recent advances such as Mamba have shown that SSMs can match or surpass Transformers in language modeling, vision, and beyond.
However, designing SSMs has remained something of an art form. Developers tweak parameters and recurrences based on heuristics, without a unifying theory to explain why certain designs work. The process often feels intuitive rather than principled.
A new paper from researchers at The University of Texas at Austin and Helixon—“LONGHORN: STATE SPACE MODELS ARE AMORTIZED ONLINE LEARNERS”—introduces a solid theoretical foundation for SSM design. The paper proposes a striking new perspective: viewing SSMs as online learning systems that continuously update their internal state while processing data, one token at a time.
This idea leads to a new, elegant architecture called Longhorn, derived directly from the mathematics of online associative recall. The results are impressive—Longhorn not only surpasses state-of-the-art models like Mamba, but does so with greater sample efficiency and outstanding length generalization.
Figure 1: (Left) Longhorn demonstrates a 1.8× improvement in sample efficiency over Mamba on downstream tasks. (Right) When trained on a context length of 2048, Longhorn generalizes to sequences 16× longer (32,768 tokens) during inference, showing remarkable length extrapolation.
Table of Contents
- A refresher on State Space Models
- How SSMs can be framed as online learners
- The Longhorn architecture and its closed-form update
- Experimental results and comparisons to leading models
- Future directions and the broader impact
Background: A Quick Tour of State Space Models
Before diving into Longhorn, let’s briefly recap what makes modern sequence models work.
Most large models—including Transformers and SSM-based networks—are built from stacked blocks, each performing two critical operations:
- Sequence Mixing: Information flows across the positions of a sequence. In Transformers, this step is achieved via self-attention, which lets each token interact with every other token.
- Channel Mixing: Information is processed within each token’s representation. Typically, a Multi-Layer Perceptron (MLP) performs this step.
SSMs are designed as an efficient alternative to the self-attention mechanism, focusing on the sequence mixing component.
Figure 2: A typical block in an SSM-based model like Mamba. The SSM (red path) handles sequence mixing, while an MLP-like component (blue path) performs channel mixing.
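In hypothetical PyTorch-style pseudocode, such a block might look as follows (the `mixer` argument stands in for self-attention or an SSM; the pre-norm layout and all names here are illustrative assumptions, not any specific model's code):

```python
import torch.nn as nn

class Block(nn.Module):
    """Generic sequence-model block: sequence mixing, then channel mixing."""

    def __init__(self, d: int, mixer: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.mixer = mixer  # attention or SSM: mixes information across positions
        self.mlp = nn.Sequential(  # channel mixing: acts within each token
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
        )

    def forward(self, x):  # x: (batch, seq_len, d)
        x = x + self.mixer(self.norm1(x))   # sequence mixing
        return x + self.mlp(self.norm2(x))  # channel mixing
```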
At the heart of SSMs lies a simple but powerful recurrence:
\[ S_t = A(x_t) \cdot S_{t-1} + B(x_t) \]

Here, \(x_t\) is the input token at time \(t\), and the state \(S_t\) summarizes all previous information. The input-dependent terms \(A_t = A(x_t)\) and \(B_t = B(x_t)\) dictate how the state evolves: how much of the past to retain, and what new information to add.
Although this recurrence looks inherently sequential, during training SSMs exploit parallelism through a prefix-scan algorithm, allowing efficient computation of all states at once. This property enables them to train like Transformers (full parallelism) but decode like RNNs (constant cost per generated token).
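To make the recurrence concrete, here is a minimal NumPy sketch of the naive sequential scan (variable names and the elementwise gating are our illustrative choices, not code from the paper):

```python
import numpy as np

def ssm_scan(A, B):
    """Naive sequential scan: S_t = A_t * S_{t-1} + B_t (elementwise gating).

    A, B: (T, d) float arrays of per-step transition and input terms.
    Training-time implementations replace this loop with a parallel
    prefix scan, since composing two steps is associative:
    ((a1, b1), (a2, b2)) -> (a1 * a2, a2 * b1 + b2).
    """
    S = np.zeros_like(B)
    s = np.zeros(B.shape[1])
    for t in range(len(B)):
        s = A[t] * s + B[t]  # retain a fraction of the past, add new input
        S[t] = s
    return S

T, d = 8, 4
A = np.full((T, d), 0.9)   # retain 90% of the past at every step
B = np.random.randn(T, d)  # input-dependent write term
states = ssm_scan(A, B)    # states[t] corresponds to S_t
```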
Different SSMs vary mainly in how they design \(A_t\), \(B_t\), and their update mechanism. Many designs are heuristic, balancing performance with computational feasibility. The Longhorn paper seeks to replace this ad hoc process with a general principle.
The Core Idea: State Space Models as Online Learners
The authors argue that every SSM’s recurrence can be interpreted as the solution to an online learning problem.
In online learning, an agent makes predictions sequentially. At each step, it observes new data, incurs a loss, and updates its internal state to better predict future data. Crucially, online learners must balance stability (not forgetting past knowledge) and plasticity (adapting to new information).
This trade-off can be formalized through Online Convex Programming (OCP):
\[ s_t = \underset{s}{\arg\min}\; L_t(s), \quad L_t(s) = D_{\phi}(s, s_{t-1}) + \beta_t \ell_t(s) \]

The two components serve complementary roles:
- Stability \(D_{\phi}(s, s_{t-1})\): Keeps the new state close to the old one, avoiding catastrophic forgetting.
- Plasticity \(\beta_t \ell_t(s)\): Encourages learning from new data, controlled by a learning-rate-like term \(\beta_t\).
Longhorn treats the SSM update as an implicit online learning step: at every time step, the state \(S_t\) is exactly the minimizer of such a per-step objective.
Figure 3: Diagram of the Longhorn framework. (Left) Information mixing in sequence models. (Middle) Framing this update as an online learning process. (Right) Longhorn’s recurrence is derived from an online associative recall objective.
Interpreting the SSM's update as the solution of a specific online optimization makes the model's design both interpretable and mathematically grounded. Instead of guessing how to mix information, we design a meaningful learning objective, and the update rule arises naturally from solving it.
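To see the stability–plasticity trade-off concretely, consider a toy scalar instance of this objective (our own illustration, not one from the paper): with \(D_\phi(s, s_{t-1}) = (s - s_{t-1})^2\) and \(\ell_t(s) = (s - y_t)^2\), the minimizer is a convex combination of the old state and the new observation.

```python
def ocp_step(s_prev, y, beta):
    """Minimize (s - s_prev)**2 + beta * (s - y)**2 in closed form.

    Stability pulls the solution toward s_prev; plasticity (scaled by
    beta) pulls it toward the new observation y. Setting the derivative
    to zero gives s = (s_prev + beta * y) / (1 + beta).
    """
    return (s_prev + beta * y) / (1.0 + beta)

s = 0.0
for y, beta in [(1.0, 0.5), (1.0, 0.5), (-1.0, 2.0)]:
    s = ocp_step(s, y, beta)  # larger beta => faster adaptation to new data
    print(f"{s:.3f}")
```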
The Longhorn Architecture: Learning to Recall
Guided by this principle, the researchers chose a simple yet powerful objective: online associative recall.
This objective relates directly to the Transformer’s “induction head” capability, a circuit widely credited with enabling in-context learning. The model encounters (key, value) pairs and learns to predict the correct value when later given the key. Longhorn explicitly embeds this behavior in its recurrence.
At each step, it observes a key \(k_t\) and value \(x_t\), and updates its state \(S_t \in \mathbb{R}^{d \times m}\) according to:
\[ S_t = \underset{S \in \mathbb{R}^{d \times m}}{\arg\min} \left\{ \|S - S_{t-1}\|_F^2 + \|S k_t - x_t\|_{\mathrm{diag}(\beta_t)}^2 \right\} \]

Here, \(\| \cdot \|_F\) is the Frobenius norm, and \(\beta_t\) controls how strongly new information influences the update.
This objective is quadratic in \(S\), so setting its gradient to zero and applying the Sherman-Morrison identity yields a closed-form solution:
\[ S_{t,i} = (I - \varepsilon_{t,i} k_t k_t^\top) S_{t-1,i} + \varepsilon_{t,i} k_t x_{t,i}, \quad \varepsilon_{t,i} = \frac{\beta_{t,i}}{1 + \beta_{t,i} k_t^\top k_t} \]

For computational efficiency, the authors replace \(k_t k_t^\top\) with its diagonal approximation \(k_t^{\odot 2}\), aligning the update with standard SSM parallelization. The final form fits the common SSM template:
\[ S_t = A_t \odot S_{t-1} + B_t, \quad A_t = (1_{d \times m} - \varepsilon_t \otimes k_t^{\odot 2}), \quad B_t = (\varepsilon_t \odot x_t) \otimes k_t \]

A particularly elegant consequence: the forget gate in Longhorn emerges naturally from the math rather than being manually parameterized. Forgetting becomes dynamically linked to the current key, seamlessly balancing remembering and adapting.
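Here is a minimal NumPy sketch of this final update, following the formulas above (our reading of the equations, not the authors' released code):

```python
import numpy as np

def longhorn_step(S_prev, k, x, beta):
    """One Longhorn recurrence step with the diagonal approximation.

    S_prev: (d, m) state; k: (m,) key; x: (d,) value; beta: (d,) gate.
    """
    eps = beta / (1.0 + beta * np.dot(k, k))  # per-channel step size eps_t
    A = 1.0 - np.outer(eps, k ** 2)           # forget gate: 1 - outer(eps, k^2)
    B = np.outer(eps * x, k)                  # write term: outer(eps * x, k)
    return A * S_prev + B                     # S_t = A_t (elementwise) S_{t-1} + B_t
```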
Experiments and Results
Multi-Query Associative Recall (MQAR)
To validate its theoretical foundation, Longhorn was first tested on the MQAR benchmark, which measures the ability to retrieve stored (key, value) pairs.
Figure 4: Longhorn (cyan) achieves near-perfect accuracy on MQAR, outperforming Mamba and other SSMs at longer sequences and smaller dimensions.
Longhorn delivers nearly perfect recall even for sequence lengths of 512 and small hidden dimensions, confirming that its update rule embodies effective associative memory.
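As a quick, informal sanity check of that claim (a toy experiment of our own, not the MQAR benchmark itself), one can store a few random (key, value) pairs with the `longhorn_step` sketch above and read them back via \(S k_t\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 64
S = np.zeros((d, m))
beta = np.full(d, 50.0)  # large beta => aggressive writes
pairs = [(rng.standard_normal(m) / np.sqrt(m), rng.standard_normal(d))
         for _ in range(3)]
for k, x in pairs:
    S = longhorn_step(S, k, x, beta)  # reuses the sketch above
for k, x in pairs:
    # With random, near-orthogonal keys, S @ k approximately recovers x.
    print(np.round(S @ k, 2), "vs", np.round(x, 2))
```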
Language Modeling Scaling Laws
Next, the researchers evaluated Longhorn on the OpenWebText dataset, training models between 120M and 350M parameters using context lengths of 1024 and 4096.
Figure 5: Longhorn consistently achieves lower validation loss than other SSMs, matching the strong LLaMA Transformer baseline.
Table 1: Detailed results on OpenWebText show Longhorn achieving the best validation loss among all models at the 350M scale.
Across all configurations, Longhorn delivers superior performance to Mamba, RWKV, and GLA—sometimes even outperforming the Transformer-based LLaMA.
Large-Scale Training on SlimPajama
Scaling up further, the team trained a 1.3B-parameter Longhorn on 100 billion tokens from the SlimPajama dataset, comparing results on eight downstream tasks.
Table 2: Longhorn achieves the highest average score across eight downstream tasks, surpassing Mamba with fewer parameters.
In these large-scale evaluations, Longhorn not only maintained strong general performance but also demonstrated 1.8× better sample efficiency than Mamba—reaching competitive perplexity with nearly half the training data.
Length Extrapolation
Transformers notoriously struggle to generalize beyond trained context lengths, but Longhorn’s online formulation yields exceptional extrapolation capability. When trained on 2048 tokens, Longhorn maintains stable perplexity up to 32K tokens—16× longer than its training context (see Figure 1, right).
Vision Tasks
To test cross-domain applicability, the authors adapted Longhorn for image classification (“Vision Longhorn” or ViL) and compared it to Vision Mamba (ViM) on ImageNet.
Table 3: Vision Longhorn (ViL) achieves slightly higher Top-1 accuracy than Vision Mamba (ViM) on ImageNet.
Even without additional tuning, Vision Longhorn matched or exceeded ViM’s accuracy, showing the model’s robustness beyond text.
Conclusion and Future Directions
The Longhorn paper introduces more than a new architecture—it establishes a principled framework for designing State Space Models through online learning theory.
Key takeaways:
- Unified Design Principle: SSMs can be understood as online learners optimizing stability–plasticity trade-offs.
- Simplicity and Efficiency: Longhorn’s recurrence arises from a closed-form solution, removing handcrafted gates and reducing parameters.
- Superior Performance: Longhorn achieves state-of-the-art results across language, vision, and synthetic tasks, with exceptional sample efficiency and context extrapolation.
Looking ahead, this viewpoint opens exciting avenues: exploring other online objectives for reasoning, tool use, or continual learning. Longhorn is not just another competitor to Transformers—it’s a glimpse of what comes next in the evolution of efficient, long-context models.
The field now has a guiding principle. Instead of designing by intuition, we can learn to learn—exactly as Longhorn does.