Recurrent Neural Networks (RNNs) have long lived in the shadow of Transformers. Transformers dominate modern sequence modeling because they can effectively use long contexts—predicting future tokens becomes easier the more history they have to condition on. The drawback is their quadratic complexity, which makes them slow and memory-hungry for long sequences. RNNs, in contrast, have linear complexity, but have historically struggled to take advantage of more context.

As the 2020 OpenAI scaling-laws paper famously showed, classic RNNs like LSTMs neither scaled as well as Transformers nor made good use of long context. Modern RNN architectures such as Mamba seemed to be closing the gap, until researchers found that even the best RNNs plateau once sequence lengths grow very long.

In contrast, Transformers just keep improving.

A graph showing that TTT layers and Transformers keep improving with more context, while Mamba’s performance plateaus.

Unlike Mamba, TTT-Linear and TTT-MLP continue to reduce perplexity as the context grows; Mamba stops improving beyond 16k tokens.

This creates an odd paradox: RNNs are most computationally efficient precisely in the regime where they perform worst. The culprit lies in how RNNs compress information.


The Problem with Fixed-Size Hidden States

At a high level, every sequence model processes a stream of input tokens \(x_1, x_2, ..., x_t\) and emits output tokens \(z_1, z_2, ..., z_t\). To do this, it maintains a hidden state \(s_t\) that summarizes what’s been seen so far.

A diagram showing the generic structure of a sequence model with an input, hidden state, and output at each timestep.

Sequence models process tokens one by one, updating a hidden state that captures prior context. Each architecture differs in how that state is represented and updated.

Different architectures handle this hidden state differently (the sketch after the list below contrasts the two):

  • RNNs (like LSTM or Mamba):
    The hidden state \(s_t\) is a fixed-size vector. Computation per token stays constant, which is efficient—but compressing thousands of tokens into a tiny vector inevitably loses information.
  • Self-Attention (Transformers):
    The hidden “state” is actually the Key-Value (KV) cache, which grows with sequence length. Every previous token remains accessible for future computations, enabling high expressiveness but incurring quadratic cost.
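To make the contrast concrete, here is a minimal sketch in NumPy (all shapes and update functions are illustrative, not taken from any particular implementation) of the two kinds of hidden state: a fixed-size vector that is overwritten at every step versus a KV cache that grows with every token.

```python
import numpy as np

d = 16                                # token / state dimensionality (illustrative)
rng = np.random.default_rng(0)
W_h, W_x = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
W_k, W_v = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))

def rnn_step(state, x):
    """RNN-style state: a fixed-size vector, overwritten at every step."""
    return np.tanh(W_h @ state + W_x @ x)

def attention_step(kv_cache, x):
    """Attention-style 'state': a KV cache that keeps every token ever seen."""
    kv_cache.append((W_k @ x, W_v @ x))
    return kv_cache

state, kv_cache = np.zeros(d), []
for x in rng.normal(size=(1000, d)):      # a stream of 1,000 tokens
    state = rnn_step(state, x)             # still just d numbers
    kv_cache = attention_step(kv_cache, x)

print(state.shape, len(kv_cache))          # (16,) 1000 -- constant vs. growing memory
```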

The challenge is clear: retain the RNN’s efficiency while achieving the Transformer’s ability to remember details over long sequences. But how can one compress thousands—or millions—of tokens into a fixed-size representation without losing meaning?

The key insight from the paper is remarkably simple and elegant: treat the hidden state as a machine learning model itself.


The Core Concept: Hidden States That Learn

Instead of a static vector, imagine the hidden state as the weights of a small model \(f\). The model itself evolves as it processes the sequence—learning in real time.

A schematic of the TTT layer, where the hidden state is a set of model weights W, and the update rule is a gradient descent step.

The hidden state \(W_t\) is treated as the parameters of an inner model \(f\). Each input token triggers an update step, like performing one iteration of training at test time.

The Test-Time Training (TTT) layer turns this idea into a concrete mechanism:

  1. Hidden State = Model Weights:
    At time \(t\), the hidden state is \(W_t\), the weights of \(f\).
  2. Output Rule (Prediction):
    Each output token is produced by the current model: \[ z_t = f(x_t; W_t) \]
  3. Update Rule (Learning Step):
    The hidden state updates by taking one gradient descent step on a self-supervised loss \(\ell\): \[ W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t) \] Here, \(\eta\) is the learning rate.

Every sequence input triggers an online learning process. Even during inference—when traditional networks merely “apply” learned patterns—the TTT layer is learning from the test sequence itself.

A simple version of the self-supervised loss is a reconstruction task:

\[ \ell(W; x_t) = \|f(\tilde{x}_t; W) - x_t\|^2 \]

where \(\tilde{x}_t\) is a corrupted version of the input. The model must learn relationships among dimensions to reconstruct the original token accurately.
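As a minimal sketch of this mechanism (not the paper's actual implementation), here is one naive TTT step with a linear inner model \(f(x; W) = Wx\), where the corruption \(\tilde{x}_t\) is a hypothetical additive-noise choice and the gradient of the squared error is written out in closed form.

```python
import numpy as np

def ttt_step(W_prev, x, lr=0.1, noise=0.1, rng=np.random.default_rng(0)):
    """One token of a naive TTT layer whose inner model is linear: f(x; W) = W @ x."""
    # Self-supervised loss: reconstruct x from a corrupted view x_tilde.
    x_tilde = x + noise * rng.normal(size=x.shape)   # hypothetical corruption
    err = W_prev @ x_tilde - x                       # reconstruction residual
    grad = 2.0 * np.outer(err, x_tilde)              # d/dW of ||W x_tilde - x||^2

    # Update rule: one gradient-descent step produces the new hidden state W_t.
    W = W_prev - lr * grad

    # Output rule: predict with the freshly updated weights.
    z = W @ x
    return z, W

d = 16
W = np.zeros((d, d))                                  # hidden state = inner-model weights
for x in np.random.default_rng(1).normal(size=(100, d)):
    z, W = ttt_step(W, x)                             # the layer keeps learning as it reads
```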

As the sequence progresses, the reconstruction error steadily decreases—meaning the hidden state model truly learns and refines its understanding of the data.

Graphs showing the self-supervised TTT loss decreasing as more tokens are processed, indicating that the inner model is learning.

The TTT loss \(\ell(W_t; x_t)\) decreases along the sequence, showing the hidden model learns during test-time inference.


Making It Work in Practice

The raw concept is thrilling, but practical performance demands more. The paper introduces key innovations to turn TTT from a thought experiment into a scalable architecture.

1. Learning the Self-Supervised Task Itself

Rather than hand-designing the self-supervised objective, the authors make the task learnable. The outer model learns which features the inner model should focus on.

They introduce three learnable projections: \(\theta_K\), \(\theta_V\), and \(\theta_Q\)—evocative of attention’s Key, Value, and Query.

  • Training View: \( \theta_K x_t \) — the “corrupted” version fed to \(f\).
  • Label View: \( \theta_V x_t \) — the target output used to train \(f\).
  • Test View: \( \theta_Q x_t \) — the input for prediction.

The loss becomes:

\[ \ell(W; x_t) = \|f(\theta_K x_t; W) - \theta_V x_t\|^2, \]

and predictions use:

\[ z_t = f(\theta_Q x_t; W_t). \]

By optimizing these views during main model training, the system effectively learns how to learn—choosing which self-supervised signals best support eventual language modeling.
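A sketch of how the three views change the per-token step from earlier (illustrative NumPy with the gradient in closed form; in the real architecture \(\theta_K\), \(\theta_V\), \(\theta_Q\) are trained by the outer loop, whereas here they are just random placeholders):

```python
import numpy as np

def ttt_step_multiview(W_prev, x, theta_K, theta_V, theta_Q, lr=0.1):
    """TTT step where the self-supervised task is defined by learned projections."""
    train_view = theta_K @ x     # low-rank/"corrupted" input fed to f
    label_view = theta_V @ x     # target that f must reconstruct
    test_view  = theta_Q @ x     # input actually used for the prediction

    # Gradient of ||W @ train_view - label_view||^2 with respect to W.
    err = W_prev @ train_view - label_view
    W = W_prev - lr * 2.0 * np.outer(err, train_view)   # update rule

    z = W @ test_view                                    # output rule
    return z, W

d, rng = 16, np.random.default_rng(0)
theta_K, theta_V, theta_Q = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
W = np.zeros((d, d))
for x in rng.normal(size=(100, d)):
    z, W = ttt_step_multiview(W, x, theta_K, theta_V, theta_Q)
```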


2. Making TTT Efficient: Mini-Batch and Dual Form

Naive TTT updates each hidden state sequentially, creating a bottleneck. To accelerate this, the authors apply two complementary techniques.

Mini-Batch Gradient Descent at Test Time

Instead of updating the hidden state strictly token by token, gradients for a small batch of consecutive tokens are all computed with respect to the state at the start of the batch. Because those gradients no longer depend on one another, they can be evaluated in parallel, which dramatically improves speed while only slightly reducing quality.
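Here is one way to sketch the mini-batch variant for a linear inner model (illustrative NumPy, not the paper's kernel): every gradient in the batch is taken at the same starting weights \(W_0\), so the per-token gradient computations are independent of each other.

```python
import numpy as np

def ttt_minibatch_update(W0, K, V, Q, lr=0.1):
    """Process one mini-batch of b tokens; every gradient is taken at W0.

    K, V, Q: (b, d) arrays holding the training, label, and test views of the batch.
    Returns the b outputs and the hidden state at the end of the batch.
    """
    # Per-token gradients of ||W0 k_t - v_t||^2, all at W0 -- independent, hence parallel.
    err = K @ W0.T - V                              # (b, d): W0 @ k_t - v_t for each token
    grads = 2.0 * err[:, :, None] * K[:, None, :]   # (b, d, d): outer products err_t k_t^T

    # Token t predicts with W0 minus the accumulated gradients of tokens 1..t.
    W_t = W0 - lr * np.cumsum(grads, axis=0)        # (b, d, d)
    Z = np.einsum("bij,bj->bi", W_t, Q)             # z_t = W_t @ q_t
    return Z, W_t[-1]
```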

A diagram of the computation graph for a TTT mini-batch, showing how gradients can be computed in parallel.

Mini-batch update scheme: gradients from multiple tokens are computed in parallel, enabling efficient updates for each batch.

As seen in Figure 7 of the paper, increasing the mini-batch size raises throughput but slightly worsens perplexity. The authors found \(b = 16\) to be the best trade-off.

Graphs showing the trade-off between TTT mini-batch size, perplexity, and computation time.

Mini-batch size trade-off: small batches yield better accuracy, large batches yield faster processing. The authors choose \(b=16\).

The Dual Form

The second breakthrough is the dual formulation of weight updates. Instead of explicitly computing massive gradient tensors, the dual form rewrites the computation to use matrix-matrix multiplications—operations perfectly suited for modern GPUs and TPUs.

This simple algebraic rework preserves the exact result while making the computation substantially faster in wall-clock time, enabling large-scale training and fast inference.
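For the linear inner model, the dual form can be sketched as two large matrix products plus a causal mask (an illustrative reconstruction of the idea, not the paper's fused kernel); it produces exactly the same outputs as the per-token loop sketched in the previous subsection.

```python
import numpy as np

def ttt_minibatch_dual(W0, K, V, Q, lr=0.1):
    """Dual form of the mini-batch update: only matrix-matrix products.

    K, V, Q: (b, d) views for one mini-batch; gradients are all taken at W0.
    """
    b = K.shape[0]
    err = K @ W0.T - V                      # (b, d): W0 @ k_t - v_t
    mask = np.tril(np.ones((b, b)))         # token t may only use tokens s <= t

    # z_t = W0 q_t - 2*lr * sum_{s<=t} (W0 k_s - v_s) * (k_s . q_t)
    Z = Q @ W0.T - 2.0 * lr * (mask * (Q @ K.T)) @ err

    # End-of-batch weights: W_b = W0 - 2*lr * sum_s (W0 k_s - v_s) k_s^T
    W_b = W0 - 2.0 * lr * err.T @ K
    return Z, W_b
```

If the per-token sketch from the previous subsection is in scope, `np.allclose` confirms that both versions return identical outputs and end-of-batch weights; the difference is purely that the dual form spends its time in dense matmuls, which GPUs and TPUs execute far more efficiently.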


A Unified View of Sequence Modeling

The most striking success of the TTT framework is how it unifies ideas from RNNs and Transformers.

TTT = Linear Attention

If you choose:

  1. A linear inner model \(f(x) = W x\), and
  2. Batch gradient descent (where each update sees the full sequence),

then the TTT equations simplify to those of linear attention, an efficient variant of self-attention:

\[ z_t = \sum_{s=1}^t (\theta_V x_s)(\theta_K x_s)^T (\theta_Q x_t) \]
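To see why, here is a short derivation (a sketch under the assumptions above, additionally taking \(W_0 = 0\) and absorbing the constant \(2\eta\) into the projections). With \(f(x; W) = Wx\) and the reconstruction loss from before, the gradient at the initialization is

\[ \nabla \ell(W_0; x_s) = 2\,(W_0 \theta_K x_s - \theta_V x_s)(\theta_K x_s)^T = -2\,(\theta_V x_s)(\theta_K x_s)^T, \]

so batch gradient descent gives

\[ W_t = W_0 - \eta \sum_{s=1}^t \nabla \ell(W_0; x_s) = 2\eta \sum_{s=1}^t (\theta_V x_s)(\theta_K x_s)^T, \]

and the output rule \(z_t = f(\theta_Q x_t; W_t)\) reproduces the linear-attention expression above, up to the constant \(2\eta\).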

This equivalence provides a baseline from which the authors iteratively add their improvements—mini-batch updates, residuals, learnable initializations, and others.

A table showing ablation study results, starting from linear attention and adding TTT features to improve perplexity.

Ablation study starting from linear attention: adding TTT features like mini-batched updates drastically improves perplexity.

TTT = Self-Attention

Replace the parametric inner model \(f\) with a nonparametric estimator—specifically, the Nadaraya-Watson kernel regression—and TTT becomes standard self-attention.

Here, the hidden state itself is the list of all past tokens \(x_1, ..., x_t\), and the update rule merely appends new entries. This captures attention as a special case of TTT: the inner learner stores all training examples instead of compressing them into weights.
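Concretely, the Nadaraya-Watson output rule can be written as follows (a sketch; the softmax kernel below is the standard choice for this correspondence, assumed rather than quoted from the paper):

\[ z_t = \frac{\sum_{s=1}^{t} \kappa(\theta_Q x_t, \theta_K x_s)\, \theta_V x_s}{\sum_{s=1}^{t} \kappa(\theta_Q x_t, \theta_K x_s)}, \qquad \kappa(q, k) = \exp\!\left(\frac{q^T k}{\sqrt{d}}\right). \]

With this kernel, the weights on the stored values are exactly the softmax over query-key dot products, i.e. standard self-attention.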

A Venn diagram illustrating that TTT layers can represent both RNNs (with parametric learners) and self-attention (with non-parametric learners).

TTT layers can unify RNNs (parametric learners with fixed-size hidden states) and self-attention (nonparametric learners with expanding states).

This framing doesn’t just connect models—it provides a conceptual bridge between efficiency and expressiveness, showing they are two ends of the same learning spectrum.


Experiments: Putting TTT to the Test

The authors introduce two practical versions (sketched in code after this list):

  • TTT-Linear: The inner model \(f\) is linear.
  • TTT-MLP: The inner model \(f\) is a two-layer MLP with GELU activations.
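A minimal sketch of the two inner models' shapes (illustrative NumPy; the 4x hidden expansion, and any normalization or residual details inside \(f\), are assumptions here rather than details quoted from the paper):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def f_linear(x, W):
    """TTT-Linear inner model: the hidden state is a single d x d weight matrix."""
    return W @ x

def f_mlp(x, W1, W2):
    """TTT-MLP inner model: the hidden state is the weights of a two-layer MLP."""
    return W2 @ gelu(W1 @ x)

d, rng = 16, np.random.default_rng(0)
x = rng.normal(size=d)
z_lin = f_linear(x, np.zeros((d, d)))
z_mlp = f_mlp(x, 0.1 * rng.normal(size=(4 * d, d)), 0.1 * rng.normal(size=(d, 4 * d)))
```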

They compare these against strong Transformer and Mamba baselines across model sizes from 125M to 1.3B parameters, evaluating both short and long context settings.

Short Context (2k–8k Tokens, The Pile)

TTT-Linear performs comparably to Mamba and Transformer at 2k context, and slightly better at 8k—especially in the Mamba backbone configuration that includes temporal convolutions.

Scaling plots on the Pile dataset for 2k and 8k context, showing competitive performance for TTT layers.

On short contexts (2k, 8k tokens), TTT layers match Mamba and Transformer performance, improving steadily as context increases.

Long Context (Up to 32k Tokens, Books3)

In long-context scenarios, TTT layers stand out clearly. Both TTT-Linear and TTT-MLP continue improving with longer sequences, while Mamba stagnates beyond 16k tokens.

Scaling plots on the Books dataset for 2k and 32k context, showing TTT layers outperforming Mamba at long context.

At 32k tokens, TTT layers outperform Mamba. Transformers improve too, but at significantly higher FLOP costs.

When evaluating perplexity over token position, TTT models behave like Transformers—the more they read, the better they predict—while Mamba plateaus around halfway through long sequences.


Efficiency: Retaining RNN Speed

Despite the inner learning loop, TTT layers stay efficient. Their per-token latency remains constant as context grows, unlike Transformers whose cost scales linearly.

Latency comparison graphs showing that TTT layers have constant time per token, unlike Transformers.

Latency comparison on an NVIDIA A100 GPU. Like RNNs, TTT layers maintain constant time per token as context length increases.

In fact, TTT-Linear runs faster than Mamba, thanks to the optimized dual-form implementation. TTT-MLP is modestly slower because its inner model is larger, but its cost still scales linearly with sequence length.


Rethinking Network Design

The “Learning to (Learn at Test Time)” paper is not just another RNN variant—it’s a shift in paradigm. By treating the hidden state as an active learner, the authors reframe sequence modeling as a nested learning problem:

  • The inner loop learns from the test sequence itself, updating weights as it goes.
  • The outer loop learns how that inner loop should learn—its self-supervised task, projections, and hyperparameters.

This new lens offers several exciting directions:

  • Explore richer inner learners (e.g., CNNs for vision, policy networks for robotics).
  • Use multi-level nesting—inner learners themselves performing TTT.
  • Optimize systems for parallelism through time, enabling million-token contexts.

Fundamentally, making hidden states learners unlocks a fascinating hybrid: models that are efficient like RNNs and expressive like Transformers.


The Takeaway

By merging dynamic learning into the hidden state itself, Test-Time Training layers redefine what it means for a network to remember and adapt. TTT-Linear and TTT-MLP show that by training during test time, we can have models that:

  • Scale linearly in time,
  • Continue improving with longer context, and
  • Unify RNNs and attention under one conceptual umbrella.

Perhaps the boundary between “efficient” RNNs and “expressive” Transformers isn’t fundamental after all.
With the right architecture, RNNs can learn like Transformers—and maybe surpass them.