For years Transformers have been the dominant architecture for sequence modeling, from language to code to long documents. Their softmax self-attention gives them unmatched flexibility to access arbitrary past tokens, but that flexibility comes at a cost: during inference, memory and per-token compute grow linearly with sequence length. That makes very long-context problems expensive, and it motivates revisiting an older idea: recurrent-style layers that maintain a fixed-size state and update it as tokens arrive.

A recent wave of models (Mamba, xLSTM, DeltaNet, and others) showed that carefully designed linearized-attention or “fast-weight” RNNs can reach strong language-modeling quality while keeping per-token memory and compute constant. These models share a useful unifying perspective: they can be read as performing test-time training, updating a small internal linear model (fast weights) as new key–value pairs arrive.

MesaNet pushes that idea further. Instead of making a small gradient step on a per-token loss, its Mesa layer solves a (regularized) least-squares problem optimally at each time step, producing a linear mapping from keys to values that minimizes cumulative squared error on the entire history. Computing that optimal solution naively is impractical, but the MesaNet paper introduces a numerically stable, chunkwise-parallel formulation that leverages conjugate gradient (CG) and hardware-friendly kernels. The result is a recurrent-layer architecture that:

  • Solves an explicit in-context optimization at every step (optimal test-time training).
  • Is chunkwise parallelizable for efficient training on GPUs/TPUs.
  • Dynamically allocates test-time compute by early-stopping the iterative solver.
  • Matches or outperforms other linear RNNs on synthetic tasks and becomes a competitive language model at the billion-parameter scale — while exposing an important trade-off: you can spend extra compute at inference to improve predictions.

In this post I’ll unpack the core ideas, show how the Mesa layer is implemented efficiently, and walk through the experiments and their implications for the RNN vs Transformer trade-offs.

Figure 1: MesaNet architecture. Each residual block contains a channel-mixing block (SwiGLU MLP) and a sequence-mixing block (the Mesa layer). The Mesa layer computes keys, queries, values and input/forget gates and then applies its optimal update/readout rule.

Why revisit RNN-style designs? Because constant memory and constant per-token compute become very attractive when context lengths get large. The trick in many modern linearized-attention RNNs is to view the recurrent state as a linear model Φ that maps keys to values, and to update Φ online whenever a new (k, v) pair arrives. Different update rules correspond to different local learning rules:

  • Hebbian-like updates (fast weights, GLA) add outer products v k^T (and optionally scale them).
  • Delta-rule updates (DeltaNet) are like a single gradient-descent step on a squared-error loss for the current token.
  • Mesa: solve the cumulative regularized least-squares problem optimally at each timestep.

The Mesa intuition is remarkably simple and powerful: if your layer’s state Φ is supposed to model the relationship v ≈ Φ k across the context, why not set Φ to the best possible linear map given all observed (k, v) pairs so far? That is the Mesa objective.
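
To make these rules concrete, here's a tiny NumPy sketch (my own toy code, not from the paper) that applies all three to the same stream of random key–value pairs. The step size `eta` and the scalar ridge `lam` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 32
lam = 1.0   # ridge regularizer, i.e. Lambda = lam * I (illustrative)
eta = 0.5   # Delta-rule step size (illustrative)

Phi_hebb = np.zeros((d_v, d_k))   # Hebbian / GLA-style fast weights
Phi_delta = np.zeros((d_v, d_k))  # Delta-rule fast weights
G = np.zeros((d_v, d_k))          # Mesa sufficient statistic: sum of v k^T
H = np.zeros((d_k, d_k))          # Mesa sufficient statistic: sum of k k^T

for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)                             # unit-norm key
    v = rng.normal(size=d_v)

    Phi_hebb += np.outer(v, k)                         # Hebbian: accumulate v k^T
    Phi_delta += eta * np.outer(v - Phi_delta @ k, k)  # Delta: one gradient step on ||v - Phi k||^2
    G += np.outer(v, k)                                # Mesa: update sufficient statistics ...
    H += np.outer(k, k)

# ... and read out the exact solution of the regularized least-squares problem.
Phi_mesa = G @ np.linalg.inv(H + lam * np.eye(d_k))
```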

What the Mesa layer optimizes

At timestep t the Mesa layer defines the cumulative objective (regularized least squares):

\[ \hat\Phi_t^{\text{mesa}} \;=\; \arg\min_{\Phi}\; \mathcal{L}_t(\Phi) \quad\text{with}\quad \mathcal{L}_t(\Phi) \;=\; \frac{1}{2}\sum_{t'=1}^t \zeta_{t t'}\|v_{t'}-\Phi k_{t'}\|^2 \;+\; \frac{1}{2}\mathrm{Tr}(\Phi^\top\Lambda\Phi). \]

Here:

  • \(k_{t'}, v_{t'}\) are past keys and values produced from token embeddings,
  • \(\zeta_{t t'}\) is a time-dependent weighting factor built from the input and forget gates (unrolling the recurrences below gives \(\zeta_{t t'} = \beta_{t'} \prod_{s=t'+1}^{t} \gamma_s\); with all gates equal to one, every past pair is weighted equally),
  • \(\Lambda\) is a positive definite regularizer (diagonal in practice, learned),
  • \(\hat\Phi_t^{\text{mesa}}\) is the optimal fast-weight matrix at time t.

The Mesa output for a query \(q_t\) is then the optimal linear readout

\[ \Delta e_t^{\text{mesa}} \;=\; \hat\Phi_t^{\text{mesa}} q_t. \]

Because \(\mathcal{L}_t\) is quadratic in \(\Phi\), the optimal \(\hat\Phi_t\) can be written in closed form in terms of two sufficient statistics:

\[ G_t = \sum_{t'=1}^t \zeta_{t t'} v_{t'} k_{t'}^\top, \qquad H_t = \sum_{t'=1}^t \zeta_{t t'} k_{t'} k_{t'}^\top, \]

so that formally

\[ \hat\Phi_t^{\text{mesa}} = G_t (H_t + \Lambda)^{-1}. \]

Directly computing and inverting \(H_t+\Lambda\) per time step would be expensive and numerically tricky. The innovations in MesaNet make this practical.
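
For intuition about where that cost comes from, here's a naive per-timestep reference implementation (my own sketch, without gating): it maintains \(G_t\) and \(H_t\) via rank-one updates and performs an explicit dense solve for every query, roughly \(O(d_k^3)\) per step and head, which is exactly what the chunkwise formulation below avoids.

```python
import numpy as np

def mesa_readout_naive(K, V, Q, lam=1.0):
    """Naive Mesa forward pass for one head (illustrative sketch, no gating).

    K, Q: (T, d_k) keys and queries; V: (T, d_v) values.
    Returns (T, d_v) outputs with out[t] = G_t (H_t + lam*I)^{-1} q_t,
    where G_t, H_t accumulate the pairs seen up to and including step t.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    G = np.zeros((d_v, d_k))
    H = np.zeros((d_k, d_k))
    out = np.empty((T, d_v))
    for t in range(T):
        G += np.outer(V[t], K[t])   # rank-one updates of the sufficient statistics
        H += np.outer(K[t], K[t])
        x = np.linalg.solve(H + lam * np.eye(d_k), Q[t])   # dense solve every step
        out[t] = G @ x
    return out
```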

A numerically stable, chunkwise-parallel Mesa layer

The paper avoids explicit dense inverses and solves the linear systems with conjugate gradient (CG). The principal forward computation becomes

\[ \Delta e_t^{\text{mesa}} = G_t\,\mathrm{linsolve}(H_t + \Lambda,\, q_t) \;=\; G_t\,x_t^*,\quad\text{where }(H_t+\Lambda)x_t^*=q_t. \]

Two crucial observations enable an efficient implementation:

  1. Both \(G_t\) and \(H_t\) follow simple linear recurrences with forget and input gates:

    \[ G_t = \gamma_t G_{t-1} + \beta_t v_t k_t^\top, \qquad H_t = \gamma_t H_{t-1} + \beta_t k_t k_t^\top. \]

    These are rank-one updates per step and can be accumulated without storing the entire history.

  2. The dominant operation inside CG is a matrix-vector multiply with \(H_t\), i.e. computing \((\sum_i \zeta_{t i} k_i k_i^\top)p\). That sum has the same algebraic structure as gated linear attention (GLA): a sum of weighted outer products applied to a vector. Therefore we can reuse the same chunkwise-parallel, matrix-matrix primitives used to accelerate linear attention to evaluate those products across many time steps in parallel.

Practically, training runs in a chunked fashion: split the training sequence into chunks of size C, precompute chunk-level accumulators, and run parallel matrix-matrix multiplications inside each chunk for both the GLA-like forward pass and for the matrix-vector products required by CG. The conjugate-gradient iterations themselves are also amenable to chunkwise-parallel evaluation because each CG matrix–vector product has the GLA structure.
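
Here is a minimal, single-timestep sketch (my own code, not the paper's kernels) of the matrix-free CG solve: the product \((H_t+\Lambda)p\) is evaluated directly from the stored keys and their weights as \(\Lambda p + K^\top(\zeta \odot (Kp))\), never materializing \(H_t\), and the loop stops on either an iteration cap or a residual tolerance. The chunkwise-parallel version batches exactly these products across timesteps as matrix–matrix multiplies.

```python
import numpy as np

def cg_solve_gla(K, zeta, lam_diag, q, max_iters=30, tol=1e-4):
    """Solve (H + Lambda) x = q by conjugate gradient, where
    H = sum_i zeta[i] * k_i k_i^T is never formed explicitly (sketch).

    K: (n, d_k) keys seen so far; zeta: (n,) nonnegative weights;
    lam_diag: (d_k,) diagonal of Lambda; q: (d_k,) query.
    """
    def matvec(p):
        # GLA-structured product: Lambda p + K^T (zeta * (K p)), cost O(n * d_k).
        return lam_diag * p + K.T @ (zeta * (K @ p))

    x = np.zeros_like(q)        # initial guess (could be warm-started)
    r = q - matvec(x)           # residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs) < tol:   # dynamic stopping on the residual norm
            break
        Hp = matvec(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```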

Putting it together:

  • Training is parallelizable over time (chunkwise), so GPU/TPU matrix-multiply units are well utilized.
  • At inference time you can choose to run a fixed number k of CG steps (deterministic cost) or use a stopping tolerance ε (dynamic cost that depends on the conditioning of \(H_t+\Lambda\) and the data).
  • Using CG makes the layer numerically stable even with nontrivial forget dynamics, at the cost of extra flops for the iterative solve.

Figure 2: Training and inference time and token throughput. Left: per-layer training timings on TPUv5 across sequence length and different CG steps. Center: per-layer inference timing. Right: token throughput on an H100 GPU for 1B models. Mesa with 15–30 CG steps remains competitive in throughput thanks to chunkwise parallelization.

Design choices and stability fixes

A few practical design choices are important to make the Mesa layer robust:

  • Normalize keys and queries (RMSNorm + SiLU + L2 normalization). This bounds magnitudes and stabilizes conditioning.
  • Parameterize \(\Lambda\) via a softplus and keep a positive lower bound (the authors use a lower bound like 0.25). This prevents the condition number from exploding.
  • Cap forget gates \(\gamma_t\) (e.g., slightly below 1) and modulate them relative to the input gate to avoid pathological cases with many identical repeated tokens (the authors observed issues with “screaming” sequences like long runs of the same token in code/text).
  • Initialize the CG solver smartly (use a diagonal preconditioner / good initial guess), and limit the maximum number of iterations K. In practice, the paper trains with a fixed number of CG steps (e.g., 30) and shows good robustness; at inference, they demonstrate effective dynamic stopping based on a tolerance ε.
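
Below is a minimal NumPy sketch of these stabilizers. The ingredients (RMSNorm, SiLU, L2 normalization, a softplus-parameterized \(\Lambda\) with a lower bound, capped forget gates) follow the list above, but the exact ordering, parameter names and the particular cap value are my own illustrative assumptions.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def make_keys_queries(e, W_k, W_q):
    # RMSNorm + SiLU + L2 normalization keeps key/query magnitudes bounded,
    # which keeps H_t + Lambda reasonably well conditioned. (Ordering assumed.)
    k = silu(rms_norm(e @ W_k))
    q = silu(rms_norm(e @ W_q))
    k /= np.linalg.norm(k, axis=-1, keepdims=True)
    q /= np.linalg.norm(q, axis=-1, keepdims=True)
    return k, q

def lambda_diag(raw_lambda, lower_bound=0.25):
    # Softplus keeps Lambda positive; the lower bound prevents the condition
    # number of H_t + Lambda from exploding.
    return lower_bound + np.log1p(np.exp(raw_lambda))

def forget_gate(raw_gamma, cap=0.9995):
    # Forget gates are capped slightly below 1 (the cap value here is illustrative).
    return cap / (1.0 + np.exp(-raw_gamma))
```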

Why this is conceptually appealing

Interpreting sequence layers as local learners gives us a clean explanatory lens for many designs:

  • GLA and Hebbian fast weights correspond to simple online accumulation strategies (one-shot adds).
  • DeltaNet-like rules correspond to single-step gradient descent (first-order) updates on per-step squared error.
  • Mesa corresponds to solving the full cumulative regularized least-squares objective — a second-order, locally optimal learner.

Viewed this way, Mesa is the optimal linear associative memory in the squared-error sense: it stores associations from keys to values and instantly retrieves the best linear mapping to answer queries given the entire past (subject to the capacity limits of the key dimension).

Experiments: synthetic benchmarks and language modeling

The paper evaluates MesaNet across a broad experimental suite: synthetic algorithmic tasks, in-context learning benchmarks, perplexity on real data, and downstream reasoning/recall tasks. I’ll summarize the salient findings and their implications.

Synthetic tasks (MAD & RegBench)

  • On MAD (a set of token manipulation and memory tasks), MesaNet achieves the highest average accuracy, matching Transformers and outperforming other linear RNNs. This highlights the Mesa layer’s ability to form and use tight in-context associations.
  • On RegBench (in-context grammar inference tasks), MesaNet surpasses other linear architectures and closes the gap to Transformers, demonstrating strong in-context generalization.

Figure 3: MAD benchmark — MesaNet achieves the highest average accuracy among the compared linear recurrent models, matching the Transformer.
Figure 4: RegBench — MesaNet outperforms other linearized architectures and approaches Transformers in in-context grammar inference.

Large-scale language modeling (SlimPajama; up to ~1B params)

  • MesaNet and a hybrid Hawk–Mesa variant are trained at scales up to ~940M parameters and evaluated on SlimPajama (and downstream splits).
  • On average perplexity (PPL) across standard validation sets, MesaNet matches or slightly improves on many linear-RNN baselines and achieves comparable scores to the Transformer baseline when measured by aggregate PPL.
  • But a crucial deeper analysis reveals a consistent pattern: MesaNet and other RNNs are much better than Transformers on early tokens in a sequence (e.g., up to 64 tokens) but fall behind for later tokens. The Transformer’s global access to full context allows it to gain an advantage as sequences grow longer.

Token-position analysis: where the models differ

Measuring average PPL alone hides this qualitative behavior. The authors therefore compute token-position-dependent negative log-likelihoods (NLL_k) and compare each model to a Transformer baseline:

\[ \Delta\mathrm{NLL}_k^{\text{model}} = \mathrm{NLL}_k^{\text{model}} - \mathrm{NLL}_k^{\text{MHA}}. \]

  • For small k (early tokens), \(\Delta\mathrm{NLL}_k\) is negative for Mesa and many RNNs — they predict early tokens better.
  • For larger k, \(\Delta\mathrm{NLL}_k\) becomes positive: Transformers overtake and eventually surpass RNNs for late tokens.
  • Mesa and Hawk–Mesa keep the early advantage longer than other RNNs (often beyond 512 tokens), but the trend remains: Transformers win at very long-range prediction.
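
In code, this analysis reduces to averaging per-token NLLs by position and subtracting the Transformer baseline. A minimal sketch with hypothetical array names:

```python
import numpy as np

def delta_nll_by_position(nll_model, nll_mha):
    """nll_model, nll_mha: (num_sequences, seq_len) arrays of per-token negative
    log-likelihoods, aligned by token position. Negative entries of the result
    mean the model beats the Transformer (MHA) baseline at that position."""
    return nll_model.mean(axis=0) - nll_mha.mean(axis=0)
```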

This suggests an inductive-bias split: RNN-style linear fast-weight models are excellent at local adaptation and quick in-context memorization, while Transformers remain superior when global, arbitrary-access recall across long past contexts matters.

Figure 5: Token-position NLL differences relative to a Transformer for 1B models. Negative values indicate better performance than the Transformer at that position. Most RNNs beat Transformers on early tokens; Transformers surpass RNNs at larger token positions. MesaNet extends the RNN advantage further into the sequence than many alternatives.

Long-context extrapolation and sliding-window baselines

  • The paper also evaluates extrapolation to much longer contexts (up to 32k tokens). Some RNNs catastrophically fail; MesaNet maintains reasonable extrapolation and compares favorably with many RNN baselines.
  • But the authors highlight a sobering baseline: a Transformer with sliding-window attention of size 1024 (SWA-1024) is surprisingly competitive even at very long evaluation lengths. This underlines two points: (1) in certain tasks local context suffices, and (2) perplexity alone may not always measure truly long-range understanding.

Downstream tasks: global vs local requirements

The authors split common benchmarks into “Global” and “Local” subsets based on how much performance improves with larger attention windows (SWA experiments). The results are revealing:

  • On Local reasoning tasks (benchmarks where short windows suffice), MesaNet and many RNNs perform on par with Transformers.
  • On Global reasoning and in-context recall tasks (benchmarks that benefit significantly from more context), all RNNs, including MesaNet, have a substantial gap vs the full Transformer baseline. MesaNet is the strongest RNN family member on those tasks, but it does not close the gap.

Figure 6: Downstream aggregated scores (400M and 1B). MesaNet leads recurrent families but does not match the Transformer on tasks that require long-range global context.

Few-shot learning and word-scrambling tasks

  • MesaNet shows strong few-shot performance on token-manipulation tasks (word-scrambling) and is competitive on many few-shot tasks, occasionally outperforming the Transformer on synthetic token-manipulation tasks.
  • For translation and other tasks that heavily rely on global alignment and complex multi-token mappings, Transformers remain superior.

Dynamic test-time compute: save flops without losing accuracy

One of the Mesa layer’s most attractive practical properties is dynamic compute allocation. Because CG is iterative, you can:

  • Run a fixed number k of CG steps (deterministic cost), or
  • Use a stopping tolerance ε and let each head/time-step stop individually when the residual is small enough.
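
Here is a sketch of how per-head dynamic stopping could be wired up (my own code, extending the CG sketch above): each head keeps iterating only while its residual exceeds ε, and the reported saving is in the average number of iterations actually executed.

```python
import numpy as np

def batched_cg_dynamic(matvec, q, eps=1e-4, max_iters=30):
    """Batched CG with per-head early stopping (illustrative sketch).

    matvec: maps (num_heads, d) -> (num_heads, d), applying each head's
            (H_t + Lambda); q: (num_heads, d) queries.
    Returns the solutions and how many iterations each head actually used.
    """
    x = np.zeros_like(q)
    r = q - matvec(x)
    p = r.copy()
    rs = np.sum(r * r, axis=-1)                   # squared residual per head
    iters = np.zeros(q.shape[0], dtype=int)
    for _ in range(max_iters):
        active = np.sqrt(rs) >= eps               # heads that still need work
        if not active.any():
            break
        Hp = matvec(p)
        alpha = np.where(active, rs / np.maximum(np.sum(p * Hp, axis=-1), 1e-30), 0.0)
        x += alpha[:, None] * p
        r -= alpha[:, None] * Hp
        rs_new = np.sum(r * r, axis=-1)
        beta = np.where(active, rs_new / np.maximum(rs, 1e-30), 0.0)
        p = np.where(active[:, None], r + beta[:, None] * p, p)
        rs = np.where(active, rs_new, rs)
        iters += active.astype(int)
    return x, iters
```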

The experiments show that:

  • Reducing CG steps uniformly across heads increases NLL, especially for late tokens (those that need more iterations to converge).
  • Using a dynamic stopping criterion (ε) can yield the same accuracy as a fixed 30-CG-step model while reducing the average number of CG steps — the paper reports reducing the mean from 30 to ≈9 steps when using ε = 1e−4 in one setting.

Figure 7: Test-time compute allocation. Fixed-k vs dynamic ε stopping. With ε = 1e−4 the MesaNet achieves comparable NLL to k=30 while using far fewer average CG iterations.

Interpretation, limitations, and practical takeaways

MesaNet shows how much mileage is available from the test-time training perspective: designing layers that explicitly solve an optimization problem at inference can be powerful. But the empirical story is nuanced:

  • Strengths

    • Mesa solves a principled, optimal in-context regression problem at every step, yielding strong in-context capabilities on synthetic tasks.
    • Chunkwise parallelization makes the approach competitive for training throughput, even though inference can be more expensive.
    • Dynamic stopping allows graceful trade-offs between compute and performance at inference.
  • Limitations and open issues

    • Test-time compute: the Mesa layer requires additional flops per token compared to simpler linearized-attention RNNs (roughly proportional to the number of CG iterations). The advantage of constant memory remains, but you pay in extra compute when you ask the layer to converge tightly.
    • Global long-range abilities: all modern RNNs in this family (including Mesa) still fall behind Transformers on tasks that need global access to arbitrary past tokens. The very regime where RNNs should shine (very long contexts) is where their predictive power is relatively weaker.
    • Engineering complexity: the chunkwise parallel CG implementation and practical stability fixes (λ lower bounds, gating caps, normalization) add engineering overhead compared to plain attention layers.

Paths forward suggested by the paper

  • Warm-starting CG across neighboring timesteps: many heads change their linear systems slowly over time, so using previous solutions as initial guesses could reduce CG iterations.
  • Hybrid runtime strategies: train with chunkwise-parallel CG, but at inference use Sherman–Morrison updates (or other sequential recursions) in regimes where forgetting is small and numerical stability permits.
  • Learned or amortized solvers: instead of plain CG, learn a small neural solver or parameterized preconditioner to accelerate the linear solve.
  • Architecture co-design: exploring backbones that are better matched to RNN-style sequence mixing (e.g., different MLP/positioning choices) could unlock further gains.
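
To make the Sherman–Morrison suggestion concrete, here is a minimal sketch (my own code) of one sequential inference step under the no-forgetting assumption (\(\gamma_t = 1\)): the running inverse of \(H_t+\Lambda\) is updated with a rank-one formula, so each step costs \(O(d_k^2)\) and needs no iterative solve.

```python
import numpy as np

def mesa_step_sherman_morrison(G, M_inv, k, v, q, beta=1.0):
    """One sequential Mesa inference step without CG (illustrative sketch).

    Assumes no forgetting (gamma_t = 1). M_inv holds (H_{t-1} + Lambda)^{-1},
    G holds the running sum of beta * v k^T. Returns the Mesa output
    G_t (H_t + Lambda)^{-1} q_t together with the updated state.
    """
    u = M_inv @ k
    # Sherman-Morrison: (A + beta k k^T)^{-1} = A^{-1} - beta u u^T / (1 + beta k^T u).
    M_inv = M_inv - (beta * np.outer(u, u)) / (1.0 + beta * (k @ u))
    G = G + beta * np.outer(v, k)
    out = G @ (M_inv @ q)
    return out, G, M_inv

# Initialization: before any tokens H_0 = 0, so M_inv starts at Lambda^{-1}.
d_k, d_v = 8, 8
lam = 1.0
state_G = np.zeros((d_v, d_k))
state_M_inv = np.eye(d_k) / lam
```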

Final thoughts

MesaNet is a clear conceptual and engineering advance: it brings a numerically stable, chunk-parallel, test-time-optimization-based layer to large-scale language model training and evaluation. The Mesa layer concretely implements the elegant idea of locally-optimal test-time training: at each token, find the best linear model that explains the context so far.

The experiments reveal a deeper lesson: RNN-like fast-weight programmers and linear attention models remain excellent at local, fast adaptation and early-token prediction, while Transformers keep their edge at late-token prediction and global recall, the exact regime where constant-memory RNNs are supposed to shine. That tension is not a failure of Mesa; rather, it's a diagnostic. MesaNet suggests targeted research directions: warm-started or learned solvers, recurrence/attention hybrids, and architecture changes that help recurrent compressors preserve the right global information.

MesaNet is a compelling reminder that the space of sequence-modeling primitives is still rich: optimization problems, iterative inference, and hardware-aware parallelization are all first-class ingredients that can reshape the performance/efficiency tradeoff. If you’re building models for very long contexts or want test-time adaptive compute, MesaNet is a model family worth studying and building on.

Acknowledgements and references are omitted here; the paper is “MesaNet: Sequence Modeling by Locally Optimal Test-Time Training” (von Oswald et al., 2024) and contains the full experimental details, proofs, and open-source implementation notes.