Large Language Models (LLMs) are incredibly powerful, yet they struggle with one persistent weakness: complex, multi-step reasoning. Ask a model to solve an Olympiad-level math question or a competitive programming puzzle, and its first attempt is often wrong. The challenge isn’t just generating an answer; it’s learning from failure effectively.

Humans learn from mistakes. We rarely repeat the same error twice because we internalize what went wrong. Could LLMs do something similar? Could they learn from feedback while they’re being tested—improving with every iteration?

Traditionally, there have been two main strategies for enabling multiple attempts:

  1. Sequential Revision: The model sees its previous answers and feedback as part of a growing context—like a student reviewing a marked exam. This approach promotes learning but quickly becomes inefficient and unwieldy, as long contexts cause attention noise and “lost-in-the-middle” problems.
  2. Parallel Sampling: Instead of learning, the model tries multiple independent answers at once—like a group of students each taking the same test separately and hoping one gets it right. It’s fast, but it doesn’t improve reasoning.

What if there were a third way—something both efficient and memory-aware? That’s exactly what the paper “Learning to Reason from Feedback at Test-Time” by Yanyang Li et al. proposes. Their method, Feedback-based Test-Time Training (FTTT), lets LLMs update their internal weights—their “memory”—in real time, learning directly from feedback. They further enhance this approach with OPTUNE, a miniature, trainable optimizer crafted to improve how models update themselves.


The Problem with Trial and Error

To visualize the difference, imagine how each method handles repeated attempts.

A diagram comparing three methods: (a) Sequential Revision shows a single chain of attempts, (b) Parallel Sampling shows multiple independent attempts, and (c) FTTT shows attempts with feedback loops that update the model.

Figure 1: Comparison of reasoning strategies. Sequential revision builds long dependency chains, parallel sampling generates isolated attempts, and FTTT introduces feedback-driven weight updates.

  • (a) Sequential Revision: Each new attempt depends on all prior ones, creating lengthy contexts that are difficult for LLMs to process.
  • (b) Parallel Sampling: All attempts are independent. It’s efficient but wastes valuable information from previous errors.
  • (c) Feedback-based Test-Time Training (FTTT): After each failed attempt, the model adjusts its weights, learning internally before producing the next answer.

Humans don’t rehash every mistake in memory, nor do they start fresh every time—they internalize their experiences. FTTT aims to do the same: store knowledge from failed attempts in the model’s parameters instead of its context.


FTTT: Training on the Fly

The magic of FTTT lies in treating every failed attempt as a micro training opportunity. When a model answers a question \(Q\) and produces an attempt \(A_n\), a verifier checks whether \(A_n\) is correct, providing binary feedback.

If the answer is wrong, FTTT formulates this scenario as a supervised learning task. The model is trained to predict the verbal feedback \(F\), which here is simply “Your answer is incorrect.” This is captured by the FTTT loss, where \(M_{n-1}\) is the model after \(n-1\) test-time updates and \(l_0\) is the token length of \(F\):

\[ \mathcal{L}_{\text{FTTT}}(Q, A_n) = -\frac{1}{l_0} \log M_{n-1}(F \mid Q, A_n) \]

This might look trivial, but it’s conceptually deep. To correctly predict its own failure, the model must understand why it was wrong—building internal representations of the error patterns that caused failure.
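Concretely, with a Hugging Face-style causal LM this loss is just a masked cross-entropy over the feedback string. The sketch below is illustrative rather than the authors' code; the separator between \(Q\) and \(A_n\) and the reliance on the model's built-in label masking are assumptions.

```python
import torch

def fttt_loss(model, tokenizer, question: str, attempt: str,
              feedback: str = "Your answer is incorrect.") -> torch.Tensor:
    """L_FTTT: average negative log-likelihood of the feedback tokens F,
    conditioned on the question Q and the failed attempt A_n.
    A minimal sketch assuming a Hugging Face causal LM and its tokenizer."""
    prompt_ids = tokenizer(question + "\n" + attempt,
                           return_tensors="pt").input_ids
    feedback_ids = tokenizer(feedback, return_tensors="pt",
                             add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, feedback_ids], dim=1)

    # Mask the prompt so only the feedback tokens contribute to the loss;
    # the model's built-in loss averages over them, matching the 1/l_0 term.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    return model(input_ids=input_ids, labels=labels).loss
```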


Adding Depth with Self-Reflection

Binary feedback is minimal—just “right” or “wrong.” To provide richer learning signals, the authors introduce self-reflection. After an incorrect attempt, they use the unmodified original model \(M_0\) to generate a brief explanation of what went wrong, such as:

“Here is the summary of the mistakes in the previous solution…”

This generated reflection \(R_n\) becomes a silver-standard training label. The current model is then trained to reproduce this reflection via an auxiliary distillation loss, where \(l_n\) is the token length of \(R_n\):

\[ \mathcal{L}_{\text{aux}}(Q, A_n, R_n) = -\frac{1}{l_n} \log M_{n-1}(R_n \mid Q, A_n, F) \]

The final loss combines the two objectives:

\[ \mathcal{L}_{\text{final}} = \mathcal{L}_{\text{FTTT}} + \mathcal{L}_{\text{aux}} \]
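The auxiliary and final losses follow the same pattern: condition on \((Q, A_n, F)\), score only the reflection tokens, then add the two terms. The sketch below builds on `fttt_loss` above and makes the same assumptions about the tokenizer and prompt layout.

```python
def aux_loss(model, tokenizer, question, attempt, reflection,
             feedback="Your answer is incorrect."):
    """L_aux: average negative log-likelihood of the reflection R_n
    given (Q, A_n, F). Same masking trick as fttt_loss above."""
    prefix_ids = tokenizer(question + "\n" + attempt + "\n" + feedback,
                           return_tensors="pt").input_ids
    reflection_ids = tokenizer(reflection, return_tensors="pt",
                               add_special_tokens=False).input_ids
    input_ids = torch.cat([prefix_ids, reflection_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100   # train only on the R_n tokens
    return model(input_ids=input_ids, labels=labels).loss


def final_loss(model, tokenizer, question, attempt, reflection):
    """L_final = L_FTTT + L_aux."""
    return (fttt_loss(model, tokenizer, question, attempt)
            + aux_loss(model, tokenizer, question, attempt, reflection))
```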

Each failed attempt teaches the model incrementally. The sequence works like this (a code sketch follows the list):

  1. Generate an answer \(A_n\).
  2. Check correctness:
    • If correct → stop.
    • If incorrect → proceed with the FTTT update (steps 3–5).
  3. Generate reflection \(R_n\) using \(M_0\).
  4. Compute final loss (\(\mathcal{L}_{\text{final}}\)) and update weights.
  5. Produce a new answer \(A_{n+1}\) using the updated model.
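Put together, one test-time episode might look like the sketch below. `generate_answer`, `verify`, and `reflect` are hypothetical stand-ins for decoding, the external verifier, and the reflection prompt sent to the frozen \(M_0\); the optimizer choice and learning rate are likewise placeholders, not values from the paper.

```python
import copy
import torch

def fttt_solve(model_0, tokenizer, question, max_attempts=8, lr=1e-5):
    """Sketch of the FTTT loop: answer, verify, learn from the failure, retry."""
    model = copy.deepcopy(model_0)          # M_0 stays frozen for reflections
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    attempt = None
    for n in range(max_attempts):
        attempt = generate_answer(model, tokenizer, question)        # step 1
        if verify(question, attempt):                                # step 2
            return attempt                                           # correct -> stop
        reflection = reflect(model_0, tokenizer, question, attempt)  # step 3

        optimizer.zero_grad()
        loss = final_loss(model, tokenizer, question, attempt, reflection)  # step 4
        loss.backward()
        optimizer.step()            # weights updated; the next iteration is step 5
    return attempt                  # best effort once the attempt budget is spent
```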

By integrating reflection, the model not only learns that it failed, but why it failed—without requiring long context memory.

A table comparing FTTT to other methods across features like Self-Reflection, Memory, and Length Generalization. FTTT is the only method that checks all three boxes.

Table 1: FTTT uniquely combines self-reflection, internal memory, and scalability across longer contexts.


OPTUNE: A Smarter Way to Update Weights

FTTT can use any standard optimizer (like Adam) for updates. But what if we could design an optimizer specifically tailored for reasoning feedback?

That’s where OPTUNE enters—a lightweight neural network trained to predict optimal parameter updates. Instead of relying on fixed rules for gradient descent, OPTUNE learns how to adjust the model’s weights based on feedback and context. It’s a concrete application of the Learning to Optimize (L2O) paradigm.

However, optimizing millions of parameters directly is infeasible. The authors solve this with three design choices:

The architecture of OPTUNE, showing a bottleneck design with a residual connection that processes gradients to predict weight updates.

Figure 2: The OPTUNE architecture operates in gradient space, using compression and decomposition to generate efficient weight updates.

  1. Gradient-based Input Compression:
    OPTUNE takes gradients (not raw text) as input. These gradients naturally summarize how the current parameters influenced the error. This compresses variable-length textual data into a fixed-size numerical tensor.

  2. Gradient Decomposition:
    Large gradients (\(\nabla W_i \in \mathbb{R}^{d \times d}\)) are decomposed into two smaller vectors \(u_i\) and \(\delta_{i+1}\), reducing the dimensionality from \(d^2\) to \(2d\). For a linear layer this decomposition comes essentially for free: backpropagation already expresses the weight gradient (for a single input) as the outer product of the layer input \(u_i\) and the gradient at the layer output \(\delta_{i+1}\).
    OPTUNE then predicts modified vectors \(\tilde{u}_i\) and \(\tilde{\delta}_{i+1}\), which reconstruct the update \(\tilde{\nabla}_{W_i} = \tilde{\delta}_{i+1} \tilde{u}_i^{T}\).

  3. Residual Bottleneck Design:
    OPTUNE applies normalization, dropout, and linear projections in a bottleneck pattern to prevent overfitting. The architecture is parameter-efficient yet expressive:

\[ \begin{aligned} [\bar{u}_i, \bar{\delta}_{i+1}] &= \operatorname{Norm}([u_i, \delta_{i+1}]) \\ h_i &= \theta_2 \operatorname{Dropout}(\theta_1 [\bar{u}_i, \bar{\delta}_{i+1}]) \\ [\tilde{u}_i, \tilde{\delta}_{i+1}] &= h_i + [\bar{u}_i, \bar{\delta}_{i+1}] \end{aligned} \]

This design closely resembles a bottleneck adapter in PEFT (parameter-efficient fine-tuning), but instead of fine-tuning activations, OPTUNE fine-tunes gradients—creating smarter, feedback-driven updates.
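As a rough illustration of this residual bottleneck, here is a minimal PyTorch module with that shape. The hidden width `r`, the dropout rate, the use of `LayerNorm`, and treating \(u_i\) and \(\delta_{i+1}\) as single \(d\)-dimensional vectors are assumptions made for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OpTuneBlock(nn.Module):
    """Residual bottleneck over the concatenated gradient factors [u_i, delta_{i+1}],
    mirroring the equations above (a sketch, not the paper's code)."""
    def __init__(self, d: int, r: int = 16, p: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(2 * d)
        self.theta1 = nn.Linear(2 * d, r, bias=False)   # down-projection
        self.theta2 = nn.Linear(r, 2 * d, bias=False)   # up-projection
        self.dropout = nn.Dropout(p)

    def forward(self, u: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        d = u.shape[-1]
        x = self.norm(torch.cat([u, delta], dim=-1))     # [u_bar, delta_bar]
        h = self.theta2(self.dropout(self.theta1(x)))    # bottleneck projection
        out = h + x                                      # residual connection
        u_tilde, delta_tilde = out[..., :d], out[..., d:]
        # Reconstruct the rank-1 weight update: delta_tilde u_tilde^T
        return torch.outer(delta_tilde, u_tilde)
```

In practice \(u_i\) and \(\delta_{i+1}\) would be captured with forward and backward hooks on each linear layer, and the reconstructed \(\tilde{\nabla}_{W_i}\) would replace the raw gradient in that layer's weight update.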


Do These Ideas Work? The Results

The authors evaluated both FTTT and OPTUNE on two models—Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3—across four reasoning benchmarks:

  • Math: MATH, GSM8K
  • Code: MBPP, HumanEval

FTTT vs. Other Test-Time Methods

A table of results showing that FTTT outperforms baselines like Revision, Beam Search, and Best-of-N across four datasets for both Llama-3.1-8B and Mistral-7B models.

Table 2: FTTT surpasses state-of-the-art test-time baselines across multiple datasets, even without self-reflection.

FTTT consistently beats baselines such as Best-of-N and Self-Refine. On GSM8K, for example, FTTT requires only 3–4 GPU hours versus 20 for sequential revision, while achieving higher accuracy. It blends learning from past attempts with computational efficiency.

Line graphs showing performance of different methods as the attempt budget increases. FTTT (blue line) consistently performs at or near the top, especially at lower budgets.

Figure 3: As test-time budgets increase, FTTT scales effectively and maintains top performance with minimal computational cost.

Self-reflection improves results when the model can accurately analyze its failures (as with Llama-3.1). Overall, FTTT scales smoothly with more attempts—unlike revision-based methods that struggle with context limitations.


OPTUNE vs. Other Fine-Tuning Methods

OPTUNE was also compared against traditional PEFT methods—LoRA, Adapter, IA³, LN-Tuning—and full fine-tuning.

A table showing that OPTUNE outperforms other PEFT methods like LoRA and Adapter, as well as full fine-tuning, on average across three datasets.

Table 3: OPTUNE achieves the best average performance while being the most parameter-efficient (only 439K trainable parameters).

Despite being lightweight, OPTUNE exceeds LoRA and Adapter in accuracy and even outperforms full fine-tuning while training orders of magnitude fewer parameters. It fine-tunes smarter, not harder.

Line graphs showing fine-tuning performance versus budget. OPTUNE (light blue line) starts slightly lower but overtakes other methods as the budget increases beyond two attempts.

Figure 4: OPTUNE scales strongly as more feedback is available. It quickly outperforms other fine-tuning methods at moderate budgets.

At small budgets (two attempts), OPTUNE starts slower because its initial attempt uses the raw LLM output. But as soon as feedback is integrated, it achieves superior reasoning accuracy.


A Case Study: Deeper Reasoning

A table with two math word problems. For the first problem, LoRA makes a calculation error, while OPTUNE correctly reasons through the steps.

Table 4: OPTUNE correctly interprets problem constraints (left), while LoRA miscalculates due to superficial understanding.

In GSM8K examples, LoRA misjudged key problem details—such as misreading “to the 40-yard line and back.” OPTUNE correctly reasoned through the logic, arriving at the right answer. These cases illustrate OPTUNE’s improved interpretive depth in reasoning tasks.


Why This Matters

This work addresses one of the most important gaps in modern AI: the ability to learn from mistakes without retraining the whole model.

  • FTTT enables LLMs to integrate feedback directly into their weights at test time.
  • OPTUNE learns how to optimize—producing efficient, high-quality updates tailored to reasoning.

Together, they deliver faster convergence, better scaling, and richer understanding—all from sparse feedback like “Your answer is incorrect.”


Looking Ahead

While this study focuses on binary feedback, the same framework could extend naturally to continuous signals from reward models or human evaluations. Imagine LLMs that learn dynamically from nuanced, graded feedback, improving their reasoning skill through the very act of reasoning.

For now, FTTT and OPTUNE mark substantial progress toward that vision: LLMs that not only answer questions but learn from every attempt.


In short: Instead of seeing test time as the end of learning, this research turns it into the beginning of a smarter, reflective reasoning process.