Large Language Models (LLMs) are undeniably powerful, but they share a fundamental limitation: they are typically static. Once trained on massive datasets, their parameters are frozen, and they are deployed into the world—never learning again. This “train once, test forever” paradigm works well when test data resembles the training distribution. But what happens when the model encounters something new—a novel question type or a subtle domain shift? Its performance can drop dramatically.

Imagine a student who memorizes the answers for an exam but never learns how to solve unfamiliar problems. They ace the test if the questions are familiar but freeze when confronted with something different. Most LLMs today behave like that student.

This is where a long-standing idea in machine learning reemerges with new relevance: Test-Time Training (TTT). Instead of keeping the model frozen, what if we allow it to adapt on the fly using the very data it is being tested on? Doing so would enable the model to dynamically adjust to new domains and environments, leading to more robust and effective reasoning.

But TTT poses a central challenge: at test time, there are no ground-truth labels. How can a model learn without knowing the correct answers? A recent research paper, Continuous Self-Improvement of Large Language Models by Test-Time Training with Verifier-Driven Sample Selection, proposes an elegant solution called Verifier-Driven Sample Selection for Test-Time Training (VDS-TTT). This framework creates a self-improvement loop where the LLM generates candidate answers, a verifier evaluates them, and the model fine-tunes itself using only the most reliable, high-confidence samples.

Let’s explore how this works—and why it’s an important step toward more intelligent, self-adaptive AI models.


The Challenge of Learning Without Labels

Before diving into the VDS-TTT mechanism, it’s helpful to understand the landscape of test-time adaptation and why it’s such a hard problem.

At test time, models lack labels—the “answer key”—required for standard supervised learning. Researchers have developed several methods to approximate learning signals in this context:

  • Self-Supervised Tasks: Vision models often use auxiliary tasks like predicting image rotations to adapt to new data. But in language models, modifying word order or structure breaks meaning, limiting this strategy’s usefulness.
  • Entropy Minimization: Methods such as TENT adjust the model to produce more confident predictions by minimizing the entropy of its own outputs (a minimal sketch follows this list). However, without regularization, models risk collapsing into oversimplified outputs.
  • Retrieval-Based Methods: These give the model an “open-book” of similar examples. While effective, they demand enormous memory—terabytes of storage—and heavy computational overhead.
  • Reinforcement Learning (RL): RL approaches use a reward model at test time. While flexible, RL is complex, unstable, and sensitive to task difficulty.
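To make the entropy-minimization idea concrete, here is a minimal, TENT-style sketch. It is not code from TENT or from the VDS-TTT paper: `model`, `inputs`, and `optimizer` are placeholders, the model is assumed to return Hugging Face-style outputs with a `.logits` field, and in TENT proper only normalization-layer parameters would be made trainable.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, inputs, optimizer):
    """One TENT-style adaptation step: make predictions more confident
    by minimizing the entropy of the output distribution (no labels needed)."""
    logits = model(**inputs).logits                      # assumes HF-style output object
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()                                     # in TENT, only norm-layer params are trainable
    return entropy.item()
```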

Each approach is valuable but also costly in terms of computation, memory, or stability. This is the gap VDS-TTT addresses, offering a simpler, verifier-guided way for an LLM to refine itself intelligently and efficiently.


The VDS-TTT Framework: A Three-Step Loop for Self-Improvement

The core innovation behind VDS-TTT lies in generating high-confidence pseudo-labels at test time and using them for efficient fine-tuning. The pipeline unfolds in three sequential stages, shown schematically below.

A diagram showing the VDS-TTT framework. An input query passes through a Base LLM that generates multiple candidate responses. A verifier scores each response, selects the one with the highest score above a threshold, and uses that query-response pair to fine-tune the LLM in a continuous adaptation loop.

Figure 1: Overview of the VDS-TTT workflow. Candidate responses are generated, scored by a verifier, filtered by confidence, and used for lightweight LoRA-based adaptation.


Stage 1: Candidate Generation for Self-Annotation

For each input query—say, a math word problem—the LLM doesn’t produce just one answer. Instead, it generates a set of \( N \) diverse candidate responses. This diversity arises from temperature sampling, a technique that controls randomness in generation. A temperature of 0 makes decoding deterministic, so every sample comes out identical; a positive temperature introduces variability, allowing the model to explore different reasoning trajectories.

This step collects a rich pool of hypotheses that capture possible solutions to the input problem.
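As a concrete illustration of this stage, the sketch below draws \( N \) candidates with temperature sampling via Hugging Face Transformers. The model name, sampling parameters, and example query are assumptions for the sketch, not the authors' exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices; the paper's models and sampling settings may differ.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"   # placeholder model
N = 8                                        # number of candidate responses per query

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

query = "A train travels 60 miles in 1.5 hours. What is its average speed?"  # toy example
inputs = tokenizer(query, return_tensors="pt").to(model.device)

# Temperature sampling: do_sample=True with temperature > 0 yields diverse candidates.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=N,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the generated responses remain.
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```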


Stage 2: Confidence-Guided Annotation

Now comes the critical “Verifier-Driven” component. To decide which candidate responses are worth learning from, the system uses a verifier—a separate, pre-trained evaluation model capable of judging correctness or reliability.

Each candidate is scored by the verifier. The process has two levels of filtering:

  1. Best-of-N Selection: The verifier assigns a confidence score to each candidate, and the one with the highest score, \( r^* \), is chosen.
  2. Threshold Filtering: Even the best candidate must pass a confidence threshold \( \tau \) (for example, 0.99). If no candidate meets the threshold, the model skips training on that query.

This two-step quality gate ensures the model learns only from trustworthy, high-confidence examples. The surviving pair \( (Q_i, r^*) \) forms a pseudo-labeled instance for adaptation.
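A minimal sketch of this selection logic, assuming a hypothetical `verifier_score(question, response)` function that returns a confidence in [0, 1] (standing in for the paper's pre-trained verifier) and the example threshold \( \tau = 0.99 \):

```python
from typing import Callable, Optional, Tuple

def select_pseudo_label(
    question: str,
    candidates: list[str],
    verifier_score: Callable[[str, str], float],
    tau: float = 0.99,
) -> Optional[Tuple[str, str, float]]:
    """Best-of-N selection followed by threshold filtering.

    Returns (question, best_response, score) if the highest-scoring candidate
    clears the confidence threshold tau; otherwise None (skip training on this query).
    """
    scored = [(verifier_score(question, r), r) for r in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    if best_score < tau:
        return None                      # no trustworthy candidate; do not adapt
    return question, best_response, best_score
```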


Stage 3: Test-Time Training

With this pseudo-labeled example, the model then fine-tunes itself—but only partially. Updating all parameters of an LLM during test time is inefficient and can cause catastrophic forgetting, where existing knowledge erodes.

Instead, VDS-TTT uses LoRA (Low-Rank Adaptation)—a technique for parameter-efficient fine-tuning. LoRA adds small trainable adapter matrices (denoted \( \Delta \)) within the model layers and updates only these lightweight components while keeping the base model frozen.

This approach delivers both efficiency and stability:

  • Fine-tuning millions instead of billions of parameters dramatically reduces compute requirements.
  • Freezing the base ensures that learned generalizations remain intact.

The optimization objective minimizes the standard language modeling loss:

The loss function for VDS-TTT, showing the minimization over the LoRA adapter parameters Δ.

Equation (1): During test-time training, only LoRA adapter parameters \( \Delta \) are updated while core LLM weights remain fixed.
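To ground this stage, here is one way such an update could be implemented with the peft library: the objective is the standard causal language modeling (cross-entropy) loss on the selected pair \( (Q_i, r^*) \), with gradients flowing only into the adapter parameters \( \Delta \). The rank, target modules, learning rate, and prompt masking below are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"              # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach small trainable adapters (Δ); the base weights stay frozen.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_cfg)
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4  # only Δ is trainable
)

def ttt_step(question: str, best_response: str) -> float:
    """One test-time training step on a verifier-selected pseudo-labeled pair."""
    text = question + "\n" + best_response
    enc = tokenizer(text, return_tensors="pt")
    labels = enc["input_ids"].clone()
    # Mask the prompt tokens (approximately) so the loss covers only the response.
    prompt_len = tokenizer(question + "\n", return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100
    loss = model(**enc, labels=labels).loss              # standard LM cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```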

Through this generate–verify–train loop, VDS-TTT empowers LLMs to continually refine their reasoning under domain shifts—without the need for labeled data or external supervision.


Putting VDS-TTT to the Test

The authors rigorously evaluated VDS-TTT on three demanding mathematical reasoning benchmarks:

  • GSM8K — grade-school arithmetic word problems.
  • Math-500 — complex competition-level math problems.
  • AIME1983–2024 — challenging questions from decades of American Invitational Mathematics Exams.

They applied the method across multiple modern LLMs, such as Llama-3 and Qwen variants, comparing different adaptation strategies.


VDS-TTT Consistently Boosts Performance

The comparison among three setups—Base, Verifier-Based (VB), and VDS-TTT—reveals the power of this framework.

Table 1 showing accuracy improvements for different models and datasets. VDS-TTT consistently surpasses both the Base and Verifier-Based (VB) versions across all benchmarks.

Table 1: Accuracy gains across Math-500, GSM8K, and AIME1983–2024 tasks for various LLMs. VDS-TTT consistently delivers strong improvements over baselines.

Key takeaways:

  • VDS-TTT outperforms both baselines. While VB alone gives notable gains, the additional fine-tuning step boosts accuracy further, proving that active adaptation is the cornerstone of self-improvement.
  • Largest gains appear on the hardest tasks. For example, the Qwen-1.5B model improves from 0.54% base accuracy to 4.22% after one iteration of VDS-TTT—a substantial jump even in a domain with minimal pretrained competence.
  • Diminishing returns for large sample counts. Moving from \( N=2 \) to \( N=4 \) gives sizable boosts; beyond \( N=8 \), benefits taper off. This suggests that relatively small candidate pools are sufficient for strong adaptation.

Training stability is also impressive. Figure 2 shows that loss curves during test-time training steadily decline and stabilize, signaling smooth convergence.

Three plots showing test-time training loss curves for different model–dataset pairs. All curves fall monotonically, indicating stable optimization.

Figure 2: Examples of TTT loss progression for different configurations, demonstrating robust and consistent convergence.


Iterative Improvement: Surpassing Even a Perfect Verifier

Could further iterations of the VDS-TTT loop yield additional improvements? The authors explored exactly this—applying the process repeatedly to the same model to test cumulative adaptation.

Two plots showing the accuracy of VDS-TTT across multiple iterations. The blue lines indicate rising accuracy over iterations, eventually surpassing the performance of the Oracle Verifier (red dashed line).

Figure 3: Iterative VDS-TTT adaptation across successive rounds. The blue accuracy trajectories exceed even the Oracle Verifier baseline (red dashed line).

The outcome is remarkable: after several iterations, VDS-TTT surpasses the performance of an Oracle Verifier—a hypothetical verifier with access to perfect ground-truth labels. This means the model not only learns to pick the right answers but also internalizes deeper reasoning patterns, producing better responses from scratch in later rounds.

To control computation, the authors introduce an early stopping criterion: halt iterations when accuracy improvement between consecutive passes becomes negligible.

Equation for the early stopping criterion, which halts further iterations once accuracy gains are smaller than a set threshold ε for two successive steps.

Equation (2): Formal stopping criterion for iterative VDS-TTT, balancing computational efficiency and performance gains.
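As a sketch of how the stopping rule might be applied in code: the helper names, the accuracy proxy, and the value of \( \epsilon \) below are assumptions, while the rule itself follows the criterion described above (stop once the gain stays below \( \epsilon \) for two successive iterations).

```python
def run_iterative_vds_ttt(adapt_one_pass, estimate_accuracy,
                          epsilon: float = 0.005, max_iters: int = 10) -> list[float]:
    """Repeat the generate-verify-train pass until gains become negligible.

    adapt_one_pass()     -- hypothetical callable running one full VDS-TTT pass
    estimate_accuracy()  -- hypothetical callable returning a (proxy) accuracy
    """
    history = [estimate_accuracy()]
    small_gains = 0
    for _ in range(max_iters):
        adapt_one_pass()
        history.append(estimate_accuracy())
        gain = history[-1] - history[-2]
        small_gains = small_gains + 1 if gain < epsilon else 0
        if small_gains >= 2:             # negligible improvement twice in a row
            break
    return history
```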


Comparing Against Reinforcement Learning-Based Methods

Finally, VDS-TTT was compared to TTRL, a recent reinforcement learning approach to test-time adaptation.

Table 2 comparing VDS-TTT and TTRL methods on three benchmarks. VDS-TTT achieves consistently higher average performance and significant gains on tough datasets like AIME and AMC.

Table 2: Comparison of VDS-TTT and TTRL on AIME, AMC, and Math-500 tasks. VDS-TTT delivers higher average improvements and proves especially effective on difficult reasoning problems.

Despite the complexity of RL approaches, VDS-TTT achieves comparable or better results—particularly on out-of-distribution benchmarks—while remaining simpler to implement, more stable, and much cheaper computationally. The results illustrate a key point: strategic verifier-driven selection can rival (and exceed) reward-based optimization at test time.


Conclusion and Future Directions

VDS-TTT offers a practical yet powerful framework for the continuous self-improvement of LLMs. By merging three core ideas—temperature-based generation, verifier-guided sample selection, and LoRA-powered adaptation—it enables models to learn dynamically from unlabeled data during real-time inference.

Key insights include:

  • Verifier-driven pseudo-labeling works. It produces high-quality training signals from unlabeled inputs.
  • Fine-tuning during test time is crucial. Even with minimal data, lightweight adaptation yields tangible self-improvement.
  • Iterative self-training fosters true learning. Across rounds, models internalize better reasoning patterns and surpass static baselines—including oracle-level selectors.

There are, however, clear limitations. Current verifiers are trained on math reasoning tasks, restricting cross-domain application. Extending the concept to coding, commonsense reasoning, or multi-modal tasks will require either general-purpose verifiers or modular, mixture-of-experts architectures capable of switching between domains dynamically.

Ultimately, VDS-TTT marks an important shift—from static models locked after training to adaptive systems capable of refinement during deployment. It brings us closer to a vision where AI systems continually learn from their environment, achieving greater robustness and intelligence over time.