For most of the past decade, Transformers have defined the frontier of sequence modeling. Their ability to process long contexts in parallel unlocked the era of large language models (LLMs). But this progress also shifted attention away from the original sequential engines: recurrent neural networks and, in particular, LSTMs — the architecture Sepp Hochreiter helped invent.
xLSTM: Extended Long Short-Term Memory revisits that lineage and asks a deceptively simple question: if we scale LSTMs using modern engineering and remove their known weaknesses, how far can they go? The short answer: a long way. The paper introduces a family of LSTM extensions that restore decisive memory updates, massively expand storage capacity, and—crucially—make parts of the architecture parallelizable and competitive with modern alternatives. In several language modeling benchmarks and synthetic tests, the resulting xLSTM models match or exceed state-of-the-art Transformers and State Space Models.
This article walks through the core ideas, the motivation, and the experiments behind xLSTM. I’ll break the complex pieces into intuitive building blocks, show the architecture at a glance, and summarize the empirical evidence that gives xLSTM its weight.
What you’ll get out of this post
- A crisp review of LSTM essentials so the extensions make sense.
- Intuition for exponential gating and why it fixes “indecisive” memory updates.
- How scalar and matrix memories (sLSTM and mLSTM) address different limitations.
- The residual block recipe that turns those cells into scalable architectures.
- Key experimental takeaways: synthetic tasks, memory tests, language modeling at scale, and inference properties.
A visual roadmap first: the xLSTM family in one figure.
Figure 1: The extended LSTM (xLSTM) family. From left to right: 1) the classic LSTM memory cell (constant error carousel + sigmoid gating), 2) two new memory cells, sLSTM and mLSTM, both equipped with exponential gating (sLSTM keeps scalar memory with new mixing; mLSTM replaces the scalar with a matrix memory and a covariance update), 3) those cells are wrapped in residual blocks (xLSTM blocks), and 4) stacked residual blocks yield xLSTM architectures.
1 — A quick LSTM refresher (the essentials)
At the heart of standard LSTMs are two intertwined ideas:
- The constant error carousel: additive updates to the cell state let gradients flow across many time steps.
- Gates (sigmoid functions) that control how much information to forget, write, or expose.
Concretely, at time t the classical scalar LSTM update is:
\[ c_t = f_t \, c_{t-1} + i_t \, z_t, \qquad h_t = o_t \, \psi(c_t), \]
where f (forget), i (input), and o (output) are sigmoid gates, z is a candidate cell input (usually tanh-transformed), and ψ is a squashing nonlinearity on the cell state.
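To make the gating concrete, here is a minimal NumPy sketch of a single step of this classical cell. The gate pre-activations, which would normally come from learned projections of the current input and previous hidden state, are passed in directly; all shape and nonlinearity choices are mine, for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, i_pre, f_pre, o_pre, z_pre):
    """One classical LSTM step on a vector of scalar cells.

    The *_pre arguments are gate pre-activations (W x_t + R h_{t-1} + b);
    those learned projections are omitted here for brevity.
    """
    i = sigmoid(i_pre)          # input gate in (0, 1)
    f = sigmoid(f_pre)          # forget gate in (0, 1)
    o = sigmoid(o_pre)          # output gate in (0, 1)
    z = np.tanh(z_pre)          # candidate cell input
    c = f * c_prev + i * z      # constant error carousel: additive update
    h = o * np.tanh(c)          # psi = tanh on the cell state
    return c, h

# toy usage with a 4-dimensional cell state
rng = np.random.default_rng(0)
c, h = lstm_step(np.zeros(4), *rng.normal(size=(4, 4)))
print(c, h)
```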
These mechanisms powered LSTMs across many applications: language generation, sequence-to-sequence translation, reinforcement learning agents, and long time-series prediction. Yet three limitations ultimately curtailed their dominance in large-scale language modeling:
- Inability to revise storage decisions decisively. Sigmoid gates live in (0, 1), which makes it hard to sharply replace an existing memory item with a new one.
- Limited storage capacity. Each memory cell is scalar (or a vector of scalars) and compresses a lot of information, hurting fidelity for rare, specific tokens.
- Lack of parallelizability. Recurrent hidden-to-hidden connections require sequential processing, which GPUs exploit poorly compared to fully parallel attention.
xLSTM addresses these three points at once.
2 — Two core innovations
xLSTM introduces two orthogonal but complementary ideas:
- Exponential gating: replace (or augment) sigmoid gates with exponential activations and stabilize them.
- New memory structures: keep a scalar-flavored variant (sLSTM) that preserves expressive recurrence and introduce a matrix memory variant (mLSTM) that stores associations (keys ↔ values) and can be parallelized.
I’ll walk through each of these variants and the intuition behind them.
2.1 Exponential gating — making memory updates decisive
Sigmoid gates are bounded between 0 and 1, and they tend to produce “soft” updates. If you want to completely overwrite an old memory, you need a gate value near 1 for the input and near 0 for the forget gate simultaneously; that’s not easy to achieve robustly.
Exponential gating changes the input (and optionally the forget) gate activation to an exponential:
\[ i_t = \exp(\tilde i_t), \quad f_t = \sigma(\tilde f_t)\ \text{ or }\ \exp(\tilde f_t), \]
so the effective contribution of the new content can be dramatically larger (or smaller). Intuitively, because exp() spans (0, ∞), the model can make updates that truly dominate past content when needed — enabling decisive revision of stored information.
Caveat: exp() can overflow and cause numerical instability. The paper introduces a lightweight stabilization trick: maintain an additional stabilizer state m_t that tracks a running log-scale maximum of the gate contributions, and use it to renormalize i_t and f_t so the forward pass remains numerically safe. Importantly, the stabilization cancels out in the cell output (the ratio c_t / n_t is unchanged), so it only protects against overflow without altering what the network computes.
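One way to write this stabilization (my reading of the paper's formulation, with the gates expressed in log space) is:
\[ m_t = \max\big(\log f_t + m_{t-1},\ \log i_t\big), \qquad i'_t = \exp\big(\log i_t - m_t\big), \qquad f'_t = \exp\big(\log f_t + m_{t-1} - m_t\big), \]
and the recurrences below then use the stabilized gates \(i'_t, f'_t\) in place of \(i_t, f_t\).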
2.2 sLSTM — scalar memory with better mixing
sLSTM is the “recurrent specialist”: it keeps a scalar (or vector) cell state but augments LSTM with exponential gating and a normalizer state that tracks gate strength. The core sLSTM forward rules are:
\[ c_t = f_t c_{t-1} + i_t z_t, \qquad n_t = f_t n_{t-1} + i_t, \qquad \tilde h_t = \frac{c_t}{n_t}, \qquad h_t = o_t \, \tilde h_t. \]
Here n_t is a normalizer that accounts for the cumulative gate strength; dividing by it stabilizes the hidden output. Exponential gating lets sLSTM aggressively revise stored scalars when needed. sLSTM retains hidden-to-hidden recurrence (memory mixing), and it can be configured with multiple heads (block-diagonal recurrent matrices) so that different heads track different state aspects while remaining efficient in parameterization.
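Here is a per-step NumPy sketch of this recurrence, combining the exponential input gate, the normalizer, and the stabilizer from above. The sigmoid forget gate and the absence of weight matrices are simplifications of my own.

```python
import numpy as np

def slstm_step(c_prev, n_prev, m_prev, i_pre, f_pre, o_pre, z_pre):
    """One sLSTM step with exponential input gate and log-space stabilization.

    The *_pre arrays are gate pre-activations; in a real cell they come from
    input and recurrent (hidden-to-hidden) projections, omitted here.
    """
    log_f = np.log(1.0 / (1.0 + np.exp(-f_pre)))   # sigmoid forget gate, in log space
    log_i = i_pre                                   # exponential input gate: log i = pre-activation
    m = np.maximum(log_f + m_prev, log_i)           # stabilizer state
    i = np.exp(log_i - m)                           # stabilized input gate
    f = np.exp(log_f + m_prev - m)                  # stabilized forget gate
    z = np.tanh(z_pre)                              # candidate cell input
    c = f * c_prev + i * z                          # cell state
    n = f * n_prev + i                              # normalizer state
    h = (1.0 / (1.0 + np.exp(-o_pre))) * (c / n)    # output gate times normalized cell
    return c, n, m, h

# toy usage, 4-dimensional state
rng = np.random.default_rng(1)
c = np.zeros(4)
n = np.zeros(4)
m = np.full(4, -np.inf)                             # empty past: first forget contribution is zero
for _ in range(3):
    c, n, m, h = slstm_step(c, n, m, *rng.normal(size=(4, 4)))
print(h)
```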
When should you use sLSTM? For tasks requiring fine-grained state tracking and sequential reasoning — e.g., formal languages, code evaluation, and other problems where explicit state must be updated and queried step-by-step.
2.3 mLSTM — matrix memory for capacity and parallelism
sLSTM solves decisiveness but still compresses information into scalars. mLSTM answers the storage-capacity problem by upgrading the cell state to a matrix \(C_t \in \mathbb{R}^{d\times d}\). The memory acts like a correlation/associative store: for a key vector k_t and a value vector v_t, the simplest covariance update rule is
\[ C_t = f_t \, C_{t-1} + i_t \, v_t k_t^\top, \]
which stores the outer product v k^⊤ into the matrix. Retrieval uses a query q:
\[ \tilde h_t = \frac{C_t q_t}{\max\{|n_t^\top q_t|, 1\}}, \qquad h_t = o_t \odot \tilde h_t, \]
with n_t a normalizer (a weighted sum of keys) and a lower bound on the magnitude of the denominator to avoid instability. This is closely related to classic associative memories and to “fast weights” / outer-product memories explored in earlier work. But the crucial twist is packaging this into an LSTM-like framework where the forget gate acts as decay and the input gate scales the learning rate of the outer-product writes.
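A toy NumPy sketch of the recurrent mLSTM read/write path follows; scalar gates, no key scaling by 1/√d, and no stabilization are simplifications on my part.

```python
import numpy as np

def mlstm_step(C_prev, n_prev, q, k, v, i_gate, f_gate, o_gate):
    """One recurrent mLSTM step (scalar gates for clarity).

    C_prev: (d, d) matrix memory, n_prev: (d,) key normalizer.
    q, k, v: (d,) query, key, value vectors.
    """
    C = f_gate * C_prev + i_gate * np.outer(v, k)    # covariance-style write
    n = f_gate * n_prev + i_gate * k                 # normalizer update
    denom = max(abs(n @ q), 1.0)                     # lower-bounded denominator
    h = o_gate * (C @ q) / denom                     # gated retrieval
    return C, n, h

# toy usage: store two associations, then query with the second key
d = 8
rng = np.random.default_rng(2)
keys, vals = rng.normal(size=(2, d)), rng.normal(size=(2, d))
C, n = np.zeros((d, d)), np.zeros(d)
for k, v in zip(keys, vals):
    C, n, _ = mlstm_step(C, n, q=k, k=k, v=v, i_gate=1.0, f_gate=1.0, o_gate=1.0)
_, _, h = mlstm_step(C, n, q=keys[1], k=np.zeros(d), v=np.zeros(d),
                     i_gate=0.0, f_gate=1.0, o_gate=1.0)
print(h)   # approximately vals[1], plus crosstalk from the other stored pair
```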
Why a matrix? A matrix memory can store many different key-value associations without compressing them into a single scalar slot. This yields much higher capacity, especially for recalling rare tokens or large numbers of key-value pairs.
Why is mLSTM parallelizable? Because the covariance-style updates can be reformulated as matrix operations across an entire sequence and stabilized with the same log-normalizer trick used for sLSTM. That lets training be done in parallel over time steps (similar to attention implementations), while generation can still use the sequential recurrence form for efficient autoregressive decoding.
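To make the “parallel over time” claim concrete, the sketch below unrolls the same recurrence into batched matrix products over a whole sequence and checks it against the sequential form. It naively materializes a T × T decay matrix and skips stabilization and multi-head details, so treat it as an illustration rather than the paper's kernel.

```python
import numpy as np

def mlstm_parallel(Q, K, V, i_gates, f_gates):
    """Unrolled mLSTM retrieval for a whole sequence at once.

    Q, K, V: (T, d). i_gates, f_gates: (T,) positive gate values.
    Equivalent to running the recurrent step for t = 1..T and reading out
    h_t (output gate omitted). Materializes a (T, T) decay matrix.
    """
    # D[t, s] = i_s * prod_{r=s+1..t} f_r for s <= t, else 0
    log_f_cum = np.concatenate([[0.0], np.cumsum(np.log(f_gates))])   # (T+1,)
    D = np.exp(log_f_cum[1:, None] - log_f_cum[None, 1:]) * i_gates[None, :]
    D = np.tril(D)                                                    # causal mask
    S = D * (Q @ K.T)                                                 # decayed scores
    denom = np.maximum(np.abs(S.sum(axis=1)), 1.0)                    # |n_t^T q_t|, floored at 1
    return (S @ V) / denom[:, None]                                   # (T, d) outputs

# check the parallel form against the sequential recurrence
T, d = 5, 4
rng = np.random.default_rng(3)
Q, K, V = rng.normal(size=(3, T, d))
i_g, f_g = rng.uniform(0.5, 1.5, T), rng.uniform(0.8, 1.0, T)
H_par = mlstm_parallel(Q, K, V, i_g, f_g)

C, n, H_seq = np.zeros((d, d)), np.zeros(d), []
for t in range(T):
    C = f_g[t] * C + i_g[t] * np.outer(V[t], K[t])
    n = f_g[t] * n + i_g[t] * K[t]
    H_seq.append((C @ Q[t]) / max(abs(n @ Q[t]), 1.0))
print(np.allclose(H_par, np.array(H_seq)))   # True
```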
3 — Building xLSTM models (residual blocks + stacking)
A single cell is only part of a deep model. To scale, the authors integrate sLSTM and mLSTM into residual blocks that mirror contemporary LLM engineering (LayerNorm, MLPs, up/down projections). They use two block patterns:
- Post up-projection block (sLSTM): input → sLSTM → gated MLP → residual add. This mirrors Transformer blocks: the sequence mixing (here sLSTM) operates in the original embedding dimension, and the MLP afterwards projects up and back down.
- Pre up-projection block (mLSTM): the input is projected up first, then mLSTM is applied in the high-dimensional space, and the result is projected back down. The matrix memory benefits from operating in a higher-dimensional space.
A full xLSTM is then a residual stack of such blocks. The notation xLSTM[a:b] gives the ratio of mLSTM-based to sLSTM-based blocks; for example, xLSTM[7:1] means 7 mLSTM blocks and 1 sLSTM block per 8-block chunk.
Schematic view of the two block types:
Figure 3: xLSTM blocks. Left: a residual sLSTM block with post up-projection (like Transformers)—optionally with a small causal convolution and followed by a gated MLP. Right: a residual mLSTM block with pre up-projection (like some State Space Models)—mLSTM is wrapped inside MLPs, normalization, and component-wise gating.
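For a structural feel, here is a PyTorch-style sketch of the two block skeletons. The layer ordering follows the figure, but the expansion factors, the GELU-gated MLP, and the class and argument names are illustrative assumptions rather than the paper's exact design (the causal convolution is omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostUpProjectionBlock(nn.Module):
    """Residual sLSTM-style block: mix in the embedding dim, then a gated MLP."""
    def __init__(self, d_model, mixer, expand=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = mixer                                  # stand-in for an sLSTM layer
        self.up_gate = nn.Linear(d_model, expand * d_model)
        self.up = nn.Linear(d_model, expand * d_model)
        self.down = nn.Linear(expand * d_model, d_model)

    def forward(self, x):                                   # x: (batch, time, d_model)
        x = x + self.mixer(self.norm1(x))                   # residual around the sequence mixer
        y = self.norm2(x)
        y = self.down(F.gelu(self.up_gate(y)) * self.up(y)) # gated MLP
        return x + y                                        # residual add

class PreUpProjectionBlock(nn.Module):
    """Residual mLSTM-style block: project up, mix in the wide space, project down."""
    def __init__(self, d_model, mixer_factory, expand=2):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.up = nn.Linear(d_model, expand * d_model)
        self.gate = nn.Linear(d_model, expand * d_model)
        self.mixer = mixer_factory(expand * d_model)        # stand-in for an mLSTM layer
        self.down = nn.Linear(expand * d_model, d_model)

    def forward(self, x):
        y = self.norm(x)
        y = self.mixer(self.up(y)) * torch.sigmoid(self.gate(y))   # component-wise gating
        return x + self.down(y)                                     # residual add

# toy usage with identity "mixers" standing in for sLSTM / mLSTM
blocks = nn.Sequential(
    PreUpProjectionBlock(64, mixer_factory=lambda d: nn.Identity()),
    PostUpProjectionBlock(64, mixer=nn.Identity()),
)
print(blocks(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```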
4 — Why these choices matter (intuitions)
- Exponential gating gives the model the ability to “overwrite” memories decisively when new evidence arrives. This is essential for tasks like nearest-neighbor search within a sequence: when you find a better match, you want to replace the stored best value, not blend it softly.
- Matrix memory removes the bottleneck of scalar compression: rare tokens and large associative storages become feasible without exploding dimension sizes.
- Mixing sLSTM and mLSTM yields a practical hybrid: use mLSTM as the scalable parallel backbone for bulk memorization and sLSTM punctuated blocks where complex sequential reasoning is required.
- By reformulating parts of the mLSTM recurrence as parallel matrix ops (with stabilization), training becomes GPU-friendly.
5 — Experiments: probing capabilities and scale
The paper evaluates xLSTM across an extensive experimental suite: focused synthetic tasks, associative recall (MQAR), Long Range Arena, ablations, and large-scale language modeling (SlimPajama training at 15B and 300B token scales). I’ll summarize the most relevant findings.
5.1 Synthetic tasks — state-tracking and formal languages
Formal-language tasks probe the ability to maintain and update structured state (parity, context-free languages, stack-like behavior). Models without hidden-to-hidden recurrence (Transformers, SSMs) struggle on many of these tasks because they require genuine state tracking.
The xLSTM variants that include sLSTM (memory mixing + exponential gating) solve many of these tasks reliably, while plain Transformers and parallel SSMs often fail. This validates the hypothesis that memory mixing (the recurrent interactions across time) is necessary for some forms of algorithmic generalization.
A representative result: xLSTM models with sLSTM components solve parity-like and other Chomsky-hierarchy tasks that Transformers and many SSMs cannot.
5.2 Associative Recall (MQAR) — memory capacity
To test storage capacity, the authors used a Multi-Query Associative Recall (MQAR) benchmark where the model must store and retrieve many key-value pairs. Transformers can perform extremely well due to attention’s high-capacity behavior, serving as the gold standard. Among non-Transformer models, xLSTM variants with mLSTM (matrix memory) lead the pack — particularly xLSTM[1:1] and xLSTM[1:0]. As key-value count increases (e.g., up to 256 pairs) and sequence lengths grow (up to 2048), the mLSTM-based architectures maintain strong recall performance, showing the practical benefits of matrix memory.
Representative visualization:
Figure 5: Multi-Query Associative Recall (MQAR) experiments. Each panel shows accuracy for various models across model dimensions for different numbers of key-value pairs. xLSTM[1:1] and xLSTM[1:0] perform best among non-transformer models.
5.3 Long Range Arena — varied long-context tasks
On Long Range Arena benchmarks (retrieval, list operations, pixel-based image tasks), xLSTM shows consistent and strong performance, matching or exceeding other linear-time or long-context architectures.
5.4 Ablations on components
The authors performed careful ablations to separate the contributions:
- Adding a modern residual backbone (LayerNorm + skip connections) to vanilla LSTM improves training dramatically.
- Introducing exponential gating provided a large jump in perplexity reduction.
- Replacing some sLSTM blocks with mLSTM (matrix memory) delivered further gains.
- Ablations also show that making gates learnable and input-dependent yields incremental improvements: the full expressive gating setup works best.
The bottom line: both exponential gating and matrix memory are necessary contributors to xLSTM’s strong performance.
5.5 Language modeling at scale (15B tokens)
A controlled comparison trained many methods on the same 15B-token subset (SlimPajama). xLSTM variants achieved the best validation perplexities among the considered alternatives (Transformers, SSMs, RWKV, linear attention methods).
Selected results (validation perplexity on SlimPajama, 15B tokens):
| Model | ~#Params (M) | Perplexity |
|---|---|---|
| Llama | 407 | 14.25 |
| Mamba | 423 | 13.70 |
| RWKV-5 | 456 | 14.25 |
| xLSTM[1:0] | 409 | 13.43 |
| xLSTM[7:1] | 408 | 13.48 |
And when scaling along parameter budgets, xLSTM stayed consistently ahead across sizes.
A scaling visualization (validation perplexity vs parameter count) shows xLSTM with a favorable offset compared to other architectures, suggesting the trend likely holds at larger scales.
Figure 6: Scaling behavior on 15B SlimPajama tokens. xLSTM variants dominate across parameter counts.
5.6 Full LLM runs (300B tokens): extrapolation and downstream tasks
The paper scales representative models (xLSTM, Llama, Mamba, RWKV) to train on 300B tokens and evaluates multiple model sizes (125M–1.3B) on:
- Sequence-length extrapolation (trained at context=2048, tested to 16k),
- Validation perplexity,
- Downstream reasoning tasks (LAMBADA, HellaSwag, PIQA, ARC, Winogrande),
- PALOMA domain-specific next-token perplexity across 571 domains.
Two headline results:
Length extrapolation. xLSTM models trained at 2048 tokens show robust perplexity behavior up to 16k tokens, while Transformer perplexities explode beyond the training context. This echoes the theoretical advantage of recurrent state for some long-context generalization.
Downstream and domain performance. Across model sizes, xLSTM variants (both [1:0] and [7:1]) achieved the best validation perplexities and competitive or better downstream accuracy across common benchmarks. PALOMA evaluations showed xLSTM often had lower perplexities on the vast majority of individual domains.
Sequence extrapolation example:
Figure 7: Sequence extrapolation (trained context 2048, evaluated to 16k) for 1.3B-size models trained on 300B tokens. xLSTM maintains low perplexity across long contexts.
5.7 Inference speed and throughput
xLSTM brings practical inference advantages:
- Recurrent decoding does a constant amount of work per generated token, so total generation time is linear in generation length; for a Transformer, the per-token attention cost grows with the KV cache, so the total cost of generating over a long context grows super-linearly.
- The memory (GPU RAM) footprint is constant with respect to sequence length for xLSTM (the matrix memory has a fixed size per head), whereas a KV cache grows with the context length; this enables larger batch sizes at inference and higher tokens-per-second throughput (see the back-of-envelope sketch after this list).
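As a back-of-envelope illustration of the constant-memory point, the snippet below compares the fixed per-model state of a matrix-memory recurrent model with a growing KV cache. All sizes are hypothetical round numbers of my choosing, not the paper's model configurations.

```python
# Hypothetical sizes, chosen only to illustrate the scaling behaviour.
d_head, n_heads, n_layers = 128, 8, 24

def xlstm_state_floats():
    # per head: (d x d) matrix memory + d-dim normalizer + scalar stabilizer
    per_head = d_head * d_head + d_head + 1
    return n_layers * n_heads * per_head              # independent of context length

def kv_cache_floats(context_len):
    # per layer: keys and values for every past token, across all heads
    return n_layers * 2 * context_len * n_heads * d_head

for T in (2048, 16384):
    print(f"T={T:6d}  xLSTM state: {xlstm_state_floats():>12,} floats"
          f"   KV cache: {kv_cache_floats(T):>12,} floats")
```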
Empirical generation-time plots show linear scaling of generation time for xLSTM and other RNN-style models, versus super-linear growth for Transformer generation.
Figure 9: Inference generative speed and throughput for 1.3B models. Left: generation time (prefill 16) scales linearly for recurrent models; the Transformer shows larger growth. Right: tokens/s at different batch sizes; xLSTM supports larger batches thanks to constant memory footprint.
6 — Limitations and practical considerations
The authors acknowledge several practical limitations:
- sLSTM is not parallelizable (by design) because of recurrent mixing; their CUDA kernel is optimized, but it’s still somewhat slower than a fully parallel implementation.
- mLSTM’s matrix memory involves d × d matrix operations, which are computationally heavy; efficient GPU kernels or specialized implementations (e.g., FlashAttention-like optimizations) can reduce this gap.
- Careful initialization of forget gate biases matters for training stability.
- Large-scale experiments (especially larger xLSTM models) could benefit from more hyperparameter optimization and kernel engineering than was feasible in the study.
In short: the architecture shows strong promise, but there is room for engineering improvements to close implementation and runtime gaps with tuned Transformer kernels.
7 — Takeaways: where xLSTM fits in the landscape
- When a problem requires explicit, revisable state (algorithmic tasks, some forms of reasoning, long-context state tracking), sLSTM-style recurrence remains uniquely powerful.
- When capacity is critical (many key-value associations, rare-token memorization), matrix memory (mLSTM) offers a compelling way forward.
- The hybrid xLSTM family trades off expressivity and parallelism: use mostly mLSTM blocks for bulk processing and a sprinkle of sLSTM blocks for hard sequential phenomena.
- At scale, xLSTM matches or exceeds competing architectures on language modeling perplexity and generalization, while offering practical inference benefits (constant memory, linear generation cost).
8 — Final thought: are RNNs back?
xLSTM is not a nostalgic re-run of old architectures. Instead, it demonstrates that core LSTM ideas (the constant error carousel and gating) paired with modern innovations (exponential gating, stable normalization, matrix memories, and residual backbones) produce a viable and competitive family of sequence models for the modern LLM era.
If you think in terms of “design primitives,” xLSTM brings two that are worth remembering:
- Make memory updates decisive when the task benefits from revision (exponential gating).
- Give your model the right memory substrate: scalar recurrence for stateful reasoning, and matrix associative memory for large-capacity recall — then combine them in residual stacks.
The Transformer revolution was built on a specific balance of parallelism and associative capability. xLSTM shows that a different balance — smarter recurrence plus associative matrix memory — can reach the same heights and, in some cases, exceed them. That opens new directions for architecture design, especially in settings where long contexts, high-capacity memory, or constant-memory inference are desirable.
If you’d like to explore further, the xLSTM code is available from the authors (linked in the paper) and the paper contains detailed pseudocode, CUDA implementation remarks, and exhaustive ablations.
Acknowledgment: the research paper contains many detailed experimental tables and appendices that I summarized here to focus on the core ideas and practical implications. For low-level implementation details and full numerical results, consult the original paper.
References and notes
- The content and experimental results discussed above are from: “xLSTM: Extended Long Short-Term Memory” (Maximilian Beck et al.). The paper includes more comprehensive derivations, pseudocode, and appendices on numerical stability and parallel formulations.
- For historical context on LSTM core ideas, see Hochreiter & Schmidhuber (1997) and follow-up literature on attention, State-Space Models, and Fast Weights.
- Datasets mentioned: SlimPajama, PALOMA, MQAR setups, Long Range Arena.
Further reading (selected)
- Original LSTM papers (Hochreiter & Schmidhuber).
- Fast Weight Programmers, Hopfield networks and outer-product associative memory literature.
- Recent long-context and SSM papers (S4, Mamba, Retention).
- RWKV and other modern RNN-based LLM efforts.
If you want a guided walkthrough of the math and a code-level showcase of sLSTM and mLSTM implementations (including the stabilizer tricks), tell me which level you prefer — pseudocode, PyTorch sketch, or fully worked derivations — and I’ll prepare it.