LaCT: Why Bigger Is Better for Test-Time Training and Long-Context AI

The ability to process and understand long sequences of information—be it a lengthy document, a high-resolution image collection, or a long video—is one of the defining frontiers in artificial intelligence. Transformers revolutionized how neural networks handle sequential data, yet their core self-attention mechanism scales quadratically with sequence length, making it inefficient for long contexts. This has fueled a wave of research aimed at finding faster, more memory-efficient architectures.

One promising direction is Test-Time Training (TTT), which draws inspiration from Recurrent Neural Networks (RNNs). TTT models include a small, adaptive sub-network whose parameters—called fast weights—are updated dynamically during inference. These fast weights act as a contextual memory, helping the model remember information from previous tokens in the sequence. Unfortunately, most existing TTT methods update these fast weights far too frequently—every few tokens—causing poor GPU utilization and limiting their scalability: hardware utilization often falls below 5%, making large-context modeling prohibitively slow.

In their paper “Test-Time Training Done Right”, researchers from MIT and Adobe Research propose flipping this conventional wisdom. Their approach, called Large Chunk Test-Time Training (LaCT), updates memory using large chunks of data—ranging from thousands to over a million tokens—rather than tiny minibatches. This simple but transformative shift drastically improves efficiency, enabling larger memory states and scaling to multimodal tasks such as million-token image synthesis and 14-billion-parameter video generation. Let’s explore how LaCT works and why this “go big or go home” approach revolutionizes long-context AI.


A Quick Refresher: What Is Test-Time Training?

A conventional neural network learns parameters (“slow weights”) during training and keeps them frozen at inference. TTT introduces a secondary neural network with fast-adapting parameters, called fast weights, that change during inference. This fast-weight learner acts like a temporary memory buffer that stores relationships between elements of the current sequence.

At its core, TTT performs two operations repeatedly:

  1. Update Operation:
    The fast-weight network \(f_W\) updates its weights so that an input key vector \(k\) aligns with its associated value vector \(v\):

    \[ W \leftarrow W - \eta \nabla_W \mathcal{L}(f_W(k), v) \]

    where \(\mathcal{L}\) is a self-supervised loss (typically mean-squared or dot-product loss) and \(\eta\) is the learning rate. This step “writes” information into memory.

  2. Apply Operation:
    Once updated, \(f_W\) processes a query vector \(q\) to produce an output:

    \[ o = f_W(q) \]

    This corresponds to reading from memory.

Like RNNs, this process lets the model accumulate contextual knowledge over long sequences. However, updating every few tokens introduces severe inefficiencies: GPUs thrive on large, parallel workloads, which small TTT minibatches cannot provide.
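To make these two primitives concrete, here is a minimal PyTorch sketch (not the paper's implementation) using the simplest possible fast-weight network, a single linear map \(f_W(x) = Wx\), with a mean-squared loss; the dimensions and learning rate are illustrative:

```python
import torch

def ttt_update(W, k, v, lr=0.1):
    """Write: one gradient step so that f_W(k) = W @ k moves toward v (MSE loss)."""
    pred = W @ k
    grad = 2 * torch.outer(pred - v, k)   # d/dW of ||W k - v||^2
    return W - lr * grad

def ttt_apply(W, q):
    """Read: run the fast-weight network on a query."""
    return W @ q

h = 8
W = torch.zeros(h, h)            # fast-weight memory starts empty
k, v, q = torch.randn(3, h)
W = ttt_update(W, k, v)          # write the (k -> v) association into memory
o = ttt_apply(W, q)              # read memory with a query
```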


The Bottleneck of Traditional TTT

Traditional TTT’s poor hardware utilization stems from its small update batch sizes. The compute-to-memory ratio of a simple fast-weight update can be written as:

\[ r = \frac{2h^2b}{2h^2 + 4hb} = \frac{b}{1 + \frac{2b}{h}} \leq \min\left(\frac{h}{2}, b\right) \]

where \(h\) is the hidden dimension and \(b\) the chunk size. With tiny \(b\), the GPU spends most of its time moving data rather than computing.
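Plugging numbers into this ratio makes the problem concrete. A quick back-of-the-envelope script (the hidden size and chunk sizes below are illustrative):

```python
def compute_to_memory_ratio(h, b):
    """Compute-to-memory ratio r = 2*h^2*b / (2*h^2 + 4*h*b) from the formula above."""
    return (2 * h * h * b) / (2 * h * h + 4 * h * b)

h = 2048                                    # hidden dimension
for b in (16, 64, 2048, 65536, 1_000_000):  # chunk size
    print(f"b = {b:>9}:  r = {compute_to_memory_ratio(h, b):7.1f}")
# For tiny chunks r is roughly b (memory-bound); as b grows, r approaches
# h/2 = 1024, enough to keep the GPU busy with compute rather than data movement.
```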

By contrast, LaCT dramatically increases \(b\) to thousands or even millions of tokens, making operations compute-bound instead of memory-bound. The result: orders of magnitude higher throughput with clean, native PyTorch code—no custom kernels required.

LaCT achieves substantially higher GPU throughput than traditional TTT methods: (a) hardware utilization grows drastically with chunk size, (b) enabling efficient scaling to larger memory states, which in turn yields (c) improved validation losses and (d) higher PSNR in less training time.

Larger chunk sizes in LaCT lead to far better GPU utilization and scale to larger fast-weight memory states, improving both speed and accuracy across diverse benchmarks.


The LaCT Architecture: Hybrid Design for Local and Global Context

LaCT’s architecture fuses two complementary elements—local attention and long-range memory—into a single, coherent framework. Each LaCT block includes three main components:

Diagram of a LaCT block showing parallel Window Attention, Large-Chunk TTT Layer, and Feed-Forward Layer, with fast weight W flowing between chunks.

A single LaCT block. The window attention captures local details; large-chunk TTT handles long-range context with massive efficiency; feed-forward layers mix channels as in Transformers.

  1. Window Attention: Captures local dependencies within small regions of the input, such as nearby words or pixels, using standard attention restricted to a limited window size.
  2. Large-Chunk TTT Layer: The core of LaCT. It splits the input into large chunks, updates its fast weights using all tokens in a chunk, and applies the updated weights to generate outputs.
  3. Feed-Forward Network: A conventional Transformer channel-mixing layer.

This hybrid configuration combines the local precision of attention with the global memory of large-chunk TTT, achieving both high accuracy and linear-time scalability.
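As a rough illustration of this layout, here is a sketch of a LaCT-style block in PyTorch. The module names, the single-head window attention, and the residual/normalization placement are readability assumptions rather than the authors' code, and the fast-weight state handling of the TTT layer is left abstract:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """Standard attention restricted to non-overlapping local windows (single head for brevity)."""
    def __init__(self, dim, window):
        super().__init__()
        self.window = window
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, length, dim), length divisible by window
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # fold windows into the batch dimension so attention stays strictly local
        q, k, v = (t.reshape(B * L // self.window, self.window, D) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.reshape(B, L, D))

class LaCTBlock(nn.Module):
    """One block: local window attention + large-chunk TTT layer + feed-forward network."""
    def __init__(self, dim, window, chunk_ttt_layer):
        super().__init__()
        self.attn = WindowAttention(dim, window)
        self.ttt = chunk_ttt_layer             # any module mapping (B, L, D) -> (B, L, D)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.attn(self.n1(x))          # local dependencies
        x = x + self.ttt(self.n2(x))           # long-range context via fast weights
        x = x + self.ffn(self.n3(x))           # channel mixing
        return x
```

For a quick shape check, `LaCTBlock(64, 16, nn.Identity())` runs on a `(2, 256, 64)` tensor; in a real model the third argument would be a large-chunk TTT layer like the one sketched in the next section.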


Inside the Large-Chunk TTT Layer

Instead of per-token updates, the large-chunk TTT layer computes a single gradient across an entire chunk of tokens and then performs one weight update:

\[ g = \nabla_W \sum_{i=1}^{b} \eta_i \, \mathcal{L}(f_W(k_i), v_i) \]

\[ W \leftarrow \text{weight-update}(W, g) \]

All queries within the chunk are then processed with the same updated fast weights. The fast-weight network uses a SwiGLU-MLP structure:

\[ f_W(x) = W_2[\mathrm{SiLU}(W_1x) \circ (W_3x)] \]

and employs a negative dot-product loss to associate keys and values:

\[ \mathcal{L}(f_W(k_i), v_i) = - f_W(k_i)^{\top} v_i \]
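Putting these pieces together, the following sketch implements one large-chunk update-then-apply step in plain PyTorch, using autograd for the chunk gradient. It uses a plain gradient-descent weight update for simplicity (the paper favors the Muon update discussed below), and the dimensions, initialization, and per-token learning rates are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fast_forward(W1, W2, W3, x):
    """SwiGLU fast-weight network: f_W(x) = W2 [SiLU(W1 x) * (W3 x)].  x: (chunk, dim)."""
    return (F.silu(x @ W1.T) * (x @ W3.T)) @ W2.T

def large_chunk_ttt(W, k, v, q, lr):
    """Update: one gradient over the whole chunk of keys/values.
    Apply: process all of the chunk's queries with the updated fast weights."""
    W = [w.detach().requires_grad_(True) for w in W]
    # negative dot-product loss, weighted per token and summed over the chunk
    loss = -(lr * (fast_forward(*W, k) * v).sum(-1)).sum()
    grads = torch.autograd.grad(loss, W)
    W = [(w - g).detach() for w, g in zip(W, grads)]   # plain GD step for illustration
    o = fast_forward(*W, q)                            # read memory with the queries
    return o, W

dim, hidden, chunk = 64, 256, 4096
W = [torch.randn(hidden, dim) / dim ** 0.5,            # W1
     torch.randn(dim, hidden) / hidden ** 0.5,         # W2
     torch.randn(hidden, dim) / dim ** 0.5]            # W3
k, v, q = (torch.randn(chunk, dim) for _ in range(3))
lr = torch.full((chunk,), 1e-2)                        # per-token learning rates (eta_i)
o, W = large_chunk_ttt(W, k, v, q, lr)                 # o: (chunk, dim); W carries over to the next chunk
```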

Flexible Update–Apply Orders

One of LaCT’s most powerful features is its flexible ordering of update and apply operations. This makes it possible to simulate various attention patterns depending on the task.

Illustration of different Update–Apply scheduling resulting in different effective attention masks (full, block-wise causal, shifted causal, and strided causal).

Different update-apply sequencing styles yield effective attention masks that suit diverse data modalities.

  • Full Mask: Applying after updating—bidirectional within chunks.
  • Block-Wise Causal: Alternating update-apply—causal across chunks.
  • Shifted Block-Wise Causal: Applying before updating—perfect for language models to avoid future-token leakage.
  • Strided Block-Wise Causal: Updating selectively on context chunks—ideal for tasks like novel view synthesis.
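In code, these orderings are just different loop schedules over the same update and apply primitives. A rough sketch, assuming generic `update(W, chunk)` and `apply_fn(W, chunk)` callables like those sketched earlier:

```python
def block_wise_causal(chunks, W, update, apply_fn):
    """Update on a chunk, then apply to that same chunk:
    tokens attend bidirectionally within their chunk and causally across chunks."""
    outputs = []
    for c in chunks:
        W = update(W, c)                     # write the current chunk into memory
        outputs.append(apply_fn(W, c))       # read with the same chunk's queries
    return outputs, W

def shifted_block_wise_causal(chunks, W, update, apply_fn):
    """Apply before updating: each chunk reads only memory written by earlier chunks,
    preventing future-token leakage (the ordering used for language modeling)."""
    outputs = []
    for c in chunks:
        outputs.append(apply_fn(W, c))       # read using weights from previous chunks only
        W = update(W, c)                     # then write the current chunk
    return outputs, W
```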

Smarter Memory Updates: The Muon Optimizer

With large chunks, updates are infrequent yet substantial—enabling use of more sophisticated, non-linear optimizers without performance penalties. Among these, Muon stands out. It orthogonalizes gradients before applying updates and normalizes their spectral norms, improving stability:

\[ \text{weight-update}(W, g) = \mathrm{L2\text{-}Normalize}(W - \mathrm{Muon}(g)) \]

\[ \mathrm{Muon}(g) \approx U V^{T} \quad \text{for } g = U \Sigma V^{T} \]

This normalization stabilizes learning, reduces sensitivity to step size, and enhances retention of long-range information—all made practical by LaCT’s chunk-level efficiency.
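A minimal sketch of this update rule, using an explicit SVD for clarity (practical Muon implementations typically approximate the orthogonalization with Newton-Schulz iterations, and the exact normalization axis here is an assumption):

```python
import torch

def muon_orthogonalize(g):
    """Muon(g) ≈ U V^T for g = U Σ V^T: keep the gradient's directions, drop its singular values."""
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)
    return U @ Vh

def weight_update(W, g):
    """weight-update(W, g) = L2-Normalize(W - Muon(g)); rows are normalized here as an assumption."""
    W_new = W - muon_orthogonalize(g)
    return W_new / W_new.norm(dim=-1, keepdim=True)
```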


Scaling Further: Context Parallelism

LaCT’s design naturally supports Context Parallelism, where a long sequence is partitioned across multiple GPUs. Each device holds a shard of \(s\) tokens from the current chunk, computes the gradient over its shard, and the shard gradients are summed via a distributed all-reduce:

\[ g = \sum_{j=1}^{\text{shards}} \nabla_W \sum_{i=1}^{s} \eta_i \, \mathcal{L}_i \]

This design scales seamlessly to million-token contexts and extremely large models, including billion-parameter video transformers—without custom kernels or hardware-specific optimizations.
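Because the chunk loss is a sum over tokens, the full-chunk gradient is simply the sum of per-shard gradients, which maps directly onto an all-reduce. A hedged sketch (it assumes `torch.distributed` is already initialized, e.g. via `torchrun`, and that `local_grads` holds one gradient tensor per fast-weight matrix):

```python
import torch.distributed as dist

def all_reduce_chunk_gradients(local_grads):
    """Sum each shard's gradient across ranks; the result equals the gradient
    computed over the entire chunk, since the loss is additive over tokens."""
    for g in local_grads:                        # one tensor per fast-weight matrix (W1, W2, W3)
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
    return local_grads
```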


LaCT in Action Across Modalities

The authors tested LaCT on three distinct domains to highlight its flexibility.

Summary table of LaCT setups across Novel View Synthesis, Language Modeling, and Video Diffusion, showing chunk sizes and parallelism strategies.

LaCT adapts efficiently to very different data types by adjusting chunk sizes, state sizes, and parallelism modes.

1. Novel View Synthesis (Image Sets)

Task: Render new views of a 3D scene from multiple posed input images.
LaCT Design: Treat all input images as a single huge chunk—up to one million tokens. Update fast weights once using all inputs, then apply those weights to generate novel views.
Result: Rendering quality comparable to full-attention models at more than ten times the speed, while outperforming 3D Gaussian Splatting on dense scene datasets.

Performance comparisons for novel view synthesis—LaCT matches full attention quality with far lower latency, and scales to 1M tokens outperforming Gaussian Splatting.

LaCT matches full attention in quality while vastly reducing prefill time, scaling to million-token contexts.


2. Language Modeling (Text Sequences)

Task: Autoregressive next-token prediction across long contexts.
LaCT Design: Split text into large fixed-size chunks (2K–4K tokens). Use “shifted block-wise causal” scheduling to avoid future-token leakage and supplement with sliding window attention for fine-grained local information.
Result: In both 760M and 3B parameter models, LaCT (especially with Muon updates) achieves lower loss on long sequences and higher retrieval accuracy in needle-in-a-haystack tasks than efficient baselines DeltaNet and Gated Linear Attention.

Language modeling results showing lower validation loss at longer sequence positions and higher retrieval accuracy compared with baselines.

LaCT models sustain lower loss and higher retrieval accuracy as sequence length grows, outperforming established sub-quadratic baselines.


3. Autoregressive Video Diffusion (Image Sequences)

Task: Generate long, coherent videos via autoregressive denoising of frame chunks.
LaCT Design: Interleave clean and noisy frame chunks:

\[ S = [X_1^{\mathrm{noise}}, X_1, X_2^{\mathrm{noise}}, X_2, \dots, X_N^{\mathrm{noise}}] \]

Update fast weights only on clean frames, then apply them to denoise subsequent noisy chunks—ensuring causal consistency.
Result: A fine-tuned 14B-parameter video diffusion model achieved validation losses matching full attention while outperforming Mamba-based and sliding-window baselines. It generates stable videos with up to 56K visual tokens efficiently.
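The interleaved schedule above amounts to a strided update-apply loop: denoise each noisy chunk with fast weights built only from earlier clean frames, then write the corresponding clean chunk into memory. A rough sketch, reusing generic `update`/`apply_fn` callables as before:

```python
def autoregressive_video_ttt(noisy_chunks, clean_chunks, W, update, apply_fn):
    """Follows S = [X1_noise, X1, X2_noise, X2, ..., XN_noise]: apply on noisy chunks,
    update only on clean chunks, so each denoising step sees only past clean context."""
    denoised = []
    for i, x_noise in enumerate(noisy_chunks):
        denoised.append(apply_fn(W, x_noise))    # denoise using memory of earlier clean frames
        if i < len(clean_chunks):
            W = update(W, clean_chunks[i])       # memorize the clean frames for later chunks
    return denoised, W
```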

Video generation results—LaCT matches full-attention loss while surpassing efficient baselines across window sizes and long video lengths.

LaCT attains full-attention-level quality while maintaining high efficiency, extending to long clips of over 50K tokens.


Why LaCT Works: Insights from Ablations

Careful analysis highlights the design elements driving LaCT’s success.

  • Bigger State Is Better: Scaling up fast-weight memory (state size) directly improves performance across all tasks. The largest configuration uses fast weights equal to 40% of total model parameters.
  • Muon Optimizer Shines: Muon consistently surpasses vanilla gradient descent and momentum optimizers, achieving faster convergence and better stability.
  • Nonlinear Fast Weights: SwiGLU-MLP fast weights perform better than simple linear mappings, even when linear models use more parameters.
  • Large Chunks Beat Per-Token Recurrence: Large-chunk recurrence outperforms per-token linear recurrent models like Mamba-2 on image tasks. Combined with Muon and nonlinear states, it even surpasses per-token baselines in language modeling.

(a) Performance improves with larger fast-weight states; (b) Muon optimizer consistently outperforms gradient descent methods.

Scaling the fast-weight state and using Muon updates both lead to substantial accuracy gains.

Comparisons of fast-weight design and recurrence regime: nonlinear SwiGLU fast weights outperform linear ones; large-chunk recurrence beats per-token recurrence when combined with large nonlinear state and Muon.

Nonlinear memory networks and large-chunk recurrence outperform traditional per-token recurrence across modalities.


Conclusion: Rethinking Efficiency for Long Context AI

LaCT redefines efficiency in long-context modeling. By moving from frequent, tiny weight updates to fewer, massive chunk updates, it unlocks huge gains in parallel hardware utilization. This efficiency enables larger, more expressive memory states—allowing non-linear fast weights and advanced optimizers like Muon to shine.

Successfully applied across image, text, and video domains, LaCT’s hybrid design—combining local attention with large-chunk memory—shows remarkable generality. Its simple, kernel-free PyTorch implementation makes experimentation accessible to everyone, democratizing research in this emerging space.

“Test-Time Training Done Right” isn’t just a performance improvement; it’s a paradigm shift in how we think about memory and efficiency. LaCT demonstrates that in the quest for smarter long-context AI, sometimes the best shortcut is simply to go bigger.