Recurrent Neural Networks (RNNs) have long lived in the shadow of Transformers. Transformers dominate modern sequence modeling because they can effectively use long contexts—predicting future tokens becomes easier the more history they have to condition on. The drawback is their quadratic complexity, which makes them slow and memory-hungry for long sequences. RNNs, in contrast, have linear complexity, but have historically struggled to take advantage of more context.
As the 2020 OpenAI scaling law paper famously showed, classic RNNs like LSTMs could not scale or leverage long-context data as Transformers did. But today’s modern RNN architectures, such as Mamba, are closing the gap. Or so it seemed—until researchers discovered that even the best RNNs plateau when sequence lengths grow very long.
In contrast, Transformers just keep improving.
TTT-Linear and TTT-MLP continue to reduce perplexity as the context grows longer, while Mamba stops improving beyond 16k tokens.
This creates an odd paradox: RNNs are most computationally efficient precisely in the regime where they perform worst. The culprit lies in how RNNs compress information.
The Problem with Fixed-Size Hidden States
At a high level, every sequence model processes a stream of input tokens \(x_1, x_2, ..., x_t\) and emits output tokens \(z_1, z_2, ..., z_t\). To do this, it maintains a hidden state \(s_t\) that summarizes what’s been seen so far.
Sequence models process tokens one by one, updating a hidden state that captures prior context. Each architecture differs in how that state is represented and updated.
Different architectures tackle this hidden state differently:
- RNNs (like LSTM or Mamba):
The hidden state \(s_t\) is a fixed-size vector. Computation per token stays constant, which is efficient—but compressing thousands of tokens into a tiny vector inevitably loses information.
- Self-Attention (Transformers):
The hidden “state” is actually the Key-Value (KV) cache, which grows with sequence length. Every previous token remains accessible for future computations, enabling high expressiveness but incurring quadratic cost.
The challenge is clear: retain the RNN’s efficiency while achieving the Transformer’s ability to remember details over long sequences. But how can one compress thousands—or millions—of tokens into a fixed-size representation without losing meaning?
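To make the contrast concrete, here is a minimal NumPy sketch of the two kinds of hidden state; the dimensions and update rules are toy placeholders rather than any particular architecture:

```python
# Minimal sketch: fixed-size RNN state vs. a growing KV cache (toy update rules).
import numpy as np

d = 16                       # token / state dimension (arbitrary for the sketch)

# RNN-style state: one fixed-size vector, constant memory per step.
s = np.zeros(d)
def rnn_step(s, x, W=np.eye(d)):
    # toy update; the real rule depends on the architecture (LSTM, Mamba, ...)
    return np.tanh(W @ s + x)

# Transformer-style "state": the KV cache, which grows with every token.
kv_cache = []                # list of (key, value) pairs, one per past token
def attention_step(kv_cache, x):
    kv_cache.append((x, x))  # keys/values are learned projections of x in a real model
    keys = np.stack([k for k, _ in kv_cache])
    vals = np.stack([v for _, v in kv_cache])
    scores = np.exp(keys @ x / np.sqrt(d))
    return (scores / scores.sum()) @ vals

for x in np.random.randn(8, d):
    s = rnn_step(s, x)                  # memory: O(d), fixed
    z = attention_step(kv_cache, x)     # memory: O(t * d), grows with the sequence
```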
The key insight from the paper is remarkably simple and elegant: treat the hidden state as a machine learning model itself.
The Core Concept: Hidden States That Learn
Instead of a static vector, imagine the hidden state as the weights of a small model \(f\). The model itself evolves as it processes the sequence—learning in real time.
The hidden state \(W_t\) is treated as the parameters of an inner model \(f\). Each input token triggers an update step, like performing one iteration of training at test time.
The Test-Time Training (TTT) layer turns this idea into a concrete mechanism:
- Hidden State = Model Weights:
At time \(t\), the hidden state is \(W_t\), the weights of \(f\).
- Output Rule (Prediction):
Each output token is produced by the current model: \[ z_t = f(x_t; W_t) \]
- Update Rule (Learning Step):
The hidden state updates by taking one gradient descent step on a self-supervised loss \(\ell\): \[ W_t = W_{t-1} - \eta \nabla \ell(W_{t-1}; x_t) \] Here, \(\eta\) is the learning rate.
Every incoming token thus triggers a step of online learning. Even during inference—when traditional networks merely “apply” learned patterns—the TTT layer keeps learning from the test sequence itself.
A simple version of the self-supervised loss is a reconstruction task:
\[ \ell(W; x_t) = \|f(\tilde{x}_t; W) - x_t\|^2 \] where \(\tilde{x}_t\) is a corrupted version of the input. The model must learn relationships among dimensions to reconstruct the original token accurately.
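Here is a minimal sketch of this naive TTT loop, assuming a linear inner model \(f(x; W) = Wx\) and additive noise as one illustrative choice of corruption (the paper replaces hand-designed corruption with learned views, described below):

```python
# Naive TTT: the hidden state is the weight matrix of a small linear model,
# updated by one gradient step per token on a reconstruction loss.
import numpy as np

d, eta = 16, 0.1
W = np.zeros((d, d))                         # hidden state = inner-model weights

def loss_grad(W, x, x_tilde):
    # gradient of ||W x_tilde - x||^2 with respect to W
    return 2.0 * np.outer(W @ x_tilde - x, x_tilde)

outputs = []
for x in np.random.randn(32, d):             # the test sequence itself
    x_tilde = x + 0.1 * np.random.randn(d)   # illustrative corruption of the token
    W = W - eta * loss_grad(W, x, x_tilde)   # update rule: one gradient step
    outputs.append(W @ x)                    # output rule: z_t = f(x_t; W_t)
```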
As the sequence progresses, the reconstruction error steadily decreases—meaning the hidden state model truly learns and refines its understanding of the data.
The TTT loss \(\ell(W_t; x_t)\) decreases along the sequence, showing the hidden model learns during test-time inference.
Making It Work in Practice
The raw concept is thrilling, but practical performance demands more. The paper introduces key innovations to turn TTT from a thought experiment into a scalable architecture.
1. Learning the Self-Supervised Task Itself
Rather than hand-designing the self-supervised objective, the authors make the task learnable. The outer model learns which features the inner model should focus on.
They introduce three learnable projections: \(\theta_K\), \(\theta_V\), and \(\theta_Q\)—evocative of attention’s Key, Value, and Query.
- Training View: \( \theta_K x_t \) — the “corrupted” version fed to \(f\).
- Label View: \( \theta_V x_t \) — the target output used to train \(f\).
- Test View: \( \theta_Q x_t \) — the input for prediction.
The loss becomes:
\[ \ell(W; x_t) = \|f(\theta_K x_t; W) - \theta_V x_t\|^2, \] and predictions use:
\[ z_t = f(\theta_Q x_t; W_t). \] By optimizing these views during main model training, the system effectively learns how to learn—choosing which self-supervised signals best support eventual language modeling.
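A sketch of how the three views enter the inner loop, again assuming a linear inner model; here \(\theta_K\), \(\theta_V\), and \(\theta_Q\) are random placeholders standing in for projections trained by the outer loop:

```python
# Learnable views: the inner model trains on (theta_K x -> theta_V x)
# and predicts from theta_Q x.
import numpy as np

d, eta = 16, 0.1
theta_K, theta_V, theta_Q = (np.random.randn(d, d) * 0.1 for _ in range(3))
W = np.zeros((d, d))

outputs = []
for x in np.random.randn(32, d):
    k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
    grad = 2.0 * np.outer(W @ k - v, k)   # gradient of ||f(k; W) - v||^2
    W = W - eta * grad                    # inner-loop update on the learned task
    outputs.append(W @ q)                 # prediction uses the query view
```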
2. Making TTT Efficient: Mini-Batch and Dual Form
Naive TTT updates each hidden state sequentially, creating a bottleneck. To accelerate this, the authors apply two complementary techniques.
Mini-Batch Gradient Descent at Test Time
Instead of single-step updates per token, gradients are computed across small batches of sequential tokens, all with respect to the state at the start of the batch. This parallelization dramatically improves speed while only slightly reducing quality.
Mini-batch update scheme: gradients from multiple tokens are computed in parallel, enabling efficient updates for each batch.
As seen in Figure 7, increasing the batch size raises throughput but slightly hurts perplexity. The authors found \(b = 16\) to be the best trade-off.
Mini-batch size trade-off: small batches yield better accuracy, large batches yield faster processing. The authors choose \(b=16\).
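A sketch of the mini-batch variant under the same linear-inner-model assumption: within each batch of \(b\) tokens, every gradient is taken with respect to the state at the start of the batch, so the \(b\) gradients can be computed in parallel.

```python
# Mini-batch TTT: all gradients in a batch are taken at W0 (the batch-start state),
# then accumulated cumulatively to produce per-token states and outputs.
import numpy as np

d, b, eta = 16, 4, 0.1
W = np.zeros((d, d))
theta_K, theta_V, theta_Q = (np.random.randn(d, d) * 0.1 for _ in range(3))
X = np.random.randn(32, d)                        # the test sequence

outputs = []
for start in range(0, len(X), b):
    batch = X[start:start + b]
    W0 = W                                        # state at the start of the batch
    K, V, Q = batch @ theta_K.T, batch @ theta_V.T, batch @ theta_Q.T
    grads = [2.0 * np.outer(W0 @ k - v, k) for k, v in zip(K, V)]  # all w.r.t. W0
    for t, (g, q) in enumerate(zip(grads, Q), start=1):
        W_t = W0 - eta * sum(grads[:t])           # cumulative sum of batch gradients
        outputs.append(W_t @ q)
    W = W0 - eta * sum(grads)                     # state carried to the next batch
```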
The Dual Form
The second breakthrough is the dual formulation of weight updates. Instead of explicitly computing massive gradient tensors, the dual form rewrites the computation to use matrix-matrix multiplications—operations perfectly suited for modern GPUs and TPUs.
This simple algebraic rework preserves correctness while improving performance by over 5×, allowing large-scale training and fast inference.
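To make the idea tangible, here is an illustrative reconstruction for the linear inner model and squared loss used above (not the paper's exact formulation): the token-by-token loop and the matmul-only version compute the same outputs and the same end-of-batch state.

```python
# Dual form sketch: replace per-token d x d gradient tensors with a few
# matrix-matrix products over the whole mini-batch. K, V, Q are d x b.
import numpy as np

d, b, eta = 16, 4, 0.1
W0 = np.random.randn(d, d) * 0.1
K, V, Q = (np.random.randn(d, b) for _ in range(3))

# Primal form: explicit per-token gradients, sequential accumulation.
Z_primal = np.zeros((d, b))
grads = [2.0 * np.outer(W0 @ K[:, i] - V[:, i], K[:, i]) for i in range(b)]
for t in range(b):
    W_t = W0 - eta * sum(grads[:t + 1])
    Z_primal[:, t] = W_t @ Q[:, t]

# Dual form: the same outputs and end-of-batch state via matmuls only.
E = W0 @ K - V                                # errors of the batch-start model
mask = np.triu(np.ones((b, b)))               # keep contributions with i <= j (causal)
Z_dual = W0 @ Q - 2.0 * eta * E @ (mask * (K.T @ Q))
W_b = W0 - 2.0 * eta * E @ K.T                # end-of-batch hidden state

assert np.allclose(Z_primal, Z_dual)
```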
A Unified View of Sequence Modeling
The most striking success of the TTT framework is how it unifies ideas from RNNs and Transformers.
TTT = Linear Attention
If you choose:
- A linear inner model \(f(x) = W x\), and
- Batch gradient descent, where every gradient is taken with respect to the initial weights \(W_0\),
then the TTT equations simplify to those of linear attention, an efficient variant of self-attention:
\[ z_t = \sum_{s=1}^t (\theta_V x_s)(\theta_K x_s)^T (\theta_Q x_t) \] This equivalence provides a baseline from which the authors iteratively add their improvements—mini-batch updates, residuals, learnable initializations, and others.
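As a quick sanity check rather than the paper's full derivation: take \(W_0 = 0\), compute every gradient at \(W_0\), and absorb the constant from the squared loss by setting \(\eta = 1/2\). Then
\[ \nabla_W \ell(W_0; x_s) = 2\,(W_0 \theta_K x_s - \theta_V x_s)(\theta_K x_s)^\top = -2\,(\theta_V x_s)(\theta_K x_s)^\top, \]
\[ W_t = W_0 - \eta \sum_{s=1}^{t} \nabla_W \ell(W_0; x_s) = \sum_{s=1}^{t} (\theta_V x_s)(\theta_K x_s)^\top, \qquad z_t = W_t\,(\theta_Q x_t), \]
which is exactly the linear-attention sum above.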
Ablation study starting from linear attention: adding TTT features like mini-batched updates drastically improves perplexity.
TTT = Self-Attention
Replace the parametric inner model \(f\) with a nonparametric estimator—specifically, the Nadaraya-Watson kernel regression—and TTT becomes standard self-attention.
Here, the hidden state itself is the list of all past tokens \(x_1, ..., x_t\), and the update rule merely appends new entries. This captures attention as a special case of TTT: the inner learner stores all training examples instead of compressing them into weights.
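A minimal sketch of this nonparametric view, using the token itself for all three views rather than learned projections:

```python
# Nonparametric "inner model": the hidden state is the list of all past
# (key, value) pairs; the output is Nadaraya-Watson kernel regression on the query.
# With an exponential kernel, this is softmax attention.
import numpy as np

d = 16
keys, values = [], []                       # hidden state = every past token

def ttt_nonparametric_step(q, k, v):
    keys.append(k)
    values.append(v)
    K, V = np.stack(keys), np.stack(values)
    kernel = np.exp(K @ q / np.sqrt(d))     # exponential kernel ~ attention scores
    weights = kernel / kernel.sum()         # Nadaraya-Watson normalization
    return weights @ V                      # z_t: kernel-weighted average of values

for x in np.random.randn(8, d):
    z = ttt_nonparametric_step(x, x, x)     # real models use theta_Q/K/V views of x
```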
TTT layers can unify RNNs (parametric learners with fixed-size hidden states) and self-attention (nonparametric learners with expanding states).
This framing doesn’t just connect models—it provides a conceptual bridge between efficiency and expressiveness, showing they are two ends of the same learning spectrum.
Experiments: Putting TTT to the Test
The authors introduce two practical versions:
- TTT-Linear: The inner model \(f\) is linear.
- TTT-MLP: The inner model \(f\) is a two-layer MLP with GELU activations.
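As a rough sketch of the two inner models (the dimensions here are placeholders, not the paper's exact configuration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

d = 16          # token dimension (placeholder)
h = 4 * d       # hidden width (placeholder expansion factor)

# TTT-Linear: the hidden state is a single weight matrix W, and f(x; W) = W x.
W = np.zeros((d, d))
def f_linear(x):
    return W @ x

# TTT-MLP: the hidden state is the pair (W1, W2) of a two-layer MLP with GELU.
W1 = np.random.randn(h, d) * 0.1
W2 = np.random.randn(d, h) * 0.1
def f_mlp(x):
    return W2 @ gelu(W1 @ x)
```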
They compare these against strong Transformer and Mamba baselines across model sizes from 125M to 1.3B parameters, evaluating both short and long context settings.
Short Context (2k–8k Tokens, The Pile)
TTT-Linear performs comparably to Mamba and Transformer at 2k context, and slightly better at 8k—especially in the Mamba backbone configuration that includes temporal convolutions.
On short contexts (2k, 8k tokens), TTT layers match Mamba and Transformer performance, improving steadily as context increases.
Long Context (Up to 32k Tokens, Books3)
In long-context scenarios, TTT layers stand out clearly. Both TTT-Linear and TTT-MLP continue improving with longer sequences, while Mamba stagnates beyond 16k tokens.
At 32k tokens, TTT layers outperform Mamba. Transformers improve too, but at significantly higher FLOP costs.
When evaluating perplexity over token position, TTT models behave like Transformers—the more they read, the better they predict—while Mamba plateaus around halfway through long sequences.
Efficiency: Retaining RNN Speed
Despite the inner learning loop, TTT layers stay efficient. Their per-token latency remains constant as context grows, unlike Transformers whose cost scales linearly.
Latency comparison on NVIDIA A100 GPU. TTT layers maintain constant time per token with increasing context length, like RNNs.
In fact, TTT-Linear runs faster than Mamba, thanks to the optimized dual implementation. TTT-MLP is modestly slower because its inner model is larger, but its cost still grows only linearly with context length.
Rethinking Network Design
The “Learning to (Learn at Test Time)” paper is not just another RNN variant—it’s a shift in paradigm. By treating the hidden state as an active learner, the authors reframe sequence modeling as a nested learning problem:
- The inner loop learns from the test sequence itself, updating weights as it goes.
- The outer loop learns how that inner loop should learn—its self-supervised task, projections, and hyperparameters.
This new lens offers several exciting directions:
- Explore richer inner learners (e.g., CNNs for vision, policy networks for robotics).
- Use multi-level nesting—inner learners themselves performing TTT.
- Optimize systems for parallelism through time, enabling million-token contexts.
Fundamentally, making hidden states learners unlocks a fascinating hybrid: models that are efficient like RNNs and expressive like Transformers.
The Takeaway
By merging dynamic learning into the hidden state itself, Test-Time Training layers redefine what it means for a network to remember and adapt. TTT-Linear and TTT-MLP show that by training during test time, we can have models that:
- Scale linearly in time,
- Continue improving with longer context, and
- Unify RNNs and attention under one conceptual umbrella.
Perhaps the boundary between “efficient” RNNs and “expressive” Transformers isn’t fundamental after all.
With the right architecture, RNNs can learn like Transformers—and maybe surpass them.