The landscape of Large Language Models (LLMs) has shifted rapidly from simple chatbots to Language Agents—systems capable of reasoning, planning, and interacting with external environments to solve complex tasks. Whether it’s browsing the web to answer multi-hop questions or writing code to pass unit tests, agents represent the next frontier of AI utility.

However, building these agents presents a significant bottleneck: Data.

To make a generic LLM (like Llama-2 or Llama-3) act as a competent agent, we typically need to fine-tune it on high-quality “trajectories”—step-by-step examples of reasoning and acting. Historically, there have been two ways to get this data: human annotation (slow and expensive) or “distillation,” where we ask a massive, proprietary model like GPT-4 to generate examples for the smaller model to learn from.

But what if we want to build powerful open-source agents without relying on OpenAI’s APIs or massive human effort? What if the model could teach itself?

In this post, we dive into Re-ReST (Reflection-Reinforced Self-Training), a paper from UCLA that proposes a novel framework for autonomous agent improvement. We will explore how combining self-training with a mechanism called reflection allows smaller models to achieve performance comparable to—or even better than—models trained on GPT-4 data.

The Problem with Teacher-Student Learning

Before we get into the solution, we need to understand the status quo. The dominant paradigm for training open-source agents has been Knowledge Distillation.

In this setup, a “Teacher” model (usually GPT-4) generates successful trajectories for a specific task. A “Student” model (like Llama-2) is then fine-tuned on this data to mimic the teacher. While effective, this approach has limitations:

  1. Cost & Dependency: It relies on closed-source APIs, which can be expensive and creates a dependency on proprietary tech.
  2. Upper Bound: The student is usually limited by the teacher’s capabilities and the specific distribution of the teacher’s data.

The alternative is Self-Training. Here, the model generates its own samples, checks if they are correct (using an environment reward), and retrains on its own successes. Ideally, this creates a virtuous cycle of improvement.

Figure 1: Comparison of Previous Work vs. Re-ReST.

As shown in Figure 1, previous methods (left) rely on a one-way flow of knowledge from GPT to Llama. Re-ReST (right), however, creates a closed loop. The model generates samples, but importantly, it handles unsuccessful trajectories differently. Instead of discarding failures, it uses a Reflector to fix them, turning trash into treasure before feeding them back into the training loop.

The Challenge of Self-Training

Pure self-training sounds ideal, but it often fails in practice for complex reasoning tasks. Why?

The “Low-Quality Sample” Trap. For a language agent to learn effectively via self-training, it needs to generate successful trajectories to learn from. In difficult tasks (like writing complex code or solving multi-step logic puzzles), a base model might fail 95% of the time. If the model only learns from that tiny 5% slice of successes, improvement is slow and unstable.

Furthermore, simply discarding the 95% of failures is a massive waste of compute and information. The model tried to solve the problem; it often just made a small mistake in reasoning or syntax.

This is where Re-ReST introduces its core innovation: utilizing a Reflector to rescue those failed samples.

The Re-ReST Method

Re-ReST stands for Reflection-Reinforced Self-Training. The core idea is to separate the generation of data from the refinement of data.

The framework consists of two distinct LLMs (though they can be initialized from the same base model):

  1. The Agent (\(\mathcal{M}\)): The model trying to solve the task (e.g., generate code).
  2. The Reflector (\(\mathcal{R}\)): A model trained to look at the Agent’s failed attempt and the environment’s error message, and then generate a corrected solution.
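
To make the Reflector’s job concrete, here is a minimal, hypothetical prompt template for \(\mathcal{R}\). The wording and helper name are assumptions for illustration, not taken from the paper; what matters is the three inputs the Reflector conditions on: the task, the failed attempt, and the environment feedback.

```python
# Hypothetical Reflector prompt template (wording is illustrative, not from the paper).
# It shows the three inputs the Reflector conditions on: the task input, the agent's
# failed attempt, and the environment's error feedback.

REFLECTOR_TEMPLATE = """Task:
{task}

Previous attempt (failed):
{attempt}

Environment feedback:
{feedback}

Reflect on what went wrong, then produce a corrected solution."""

def build_reflector_prompt(task: str, attempt: str, feedback: str) -> str:
    return REFLECTOR_TEMPLATE.format(task=task, attempt=attempt, feedback=feedback)
```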

How the Process Works

Let’s break down the workflow illustrated in Figure 2.

Figure 2: Overview of the Re-ReST method workflow.

  1. Sampling (The Agent): The agent receives a task input. It attempts to solve it, generating multiple candidate trajectories (Outputs 1 through K).
  2. Environment Feedback: Each output is tested against the environment (e.g., running the code, checking the answer).
  • Success (\(\checkmark\)): If an output is correct, it is immediately added to the training set.
  • Failure (\(\times\)): If an output fails, it is not discarded. It is passed to the next step along with the feedback (e.g., “SyntaxError on line 5”).
  3. Reflection (The Reflector): The Reflector takes the original input, the failed attempt, and the specific error feedback. It “reflects” on what went wrong and produces a refined, corrected output (Output K’).
  4. Verification: If the refined output is now correct, it is added to the training set.

This process significantly enriches the training data. Instead of learning only from the easy problems the agent could already solve, the agent also learns from difficult problems that required correction.
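
Put as pseudocode, the loop looks roughly like the sketch below. The agent, reflector, and env interfaces (sample, check, correct) are assumed stand-ins for the paper’s components, not an actual API.

```python
# A minimal sketch of the Re-ReST data-collection loop described above.
# `agent`, `reflector`, and `env` are hypothetical interfaces: agent.sample()
# returns K candidate trajectories, env.check() returns (success, feedback),
# and reflector.correct() proposes a fixed trajectory given the failure.

def collect_training_data(tasks, agent, reflector, env, k=5):
    dataset = []
    for x in tasks:
        for y in agent.sample(x, num_samples=k):         # 1. sampling
            success, feedback = env.check(x, y)           # 2. environment feedback
            if success:
                dataset.append((x, y))                    # keep successes directly
                continue
            y_fixed = reflector.correct(x, y, feedback)   # 3. reflection on the failure
            ok, _ = env.check(x, y_fixed)                 # 4. verification
            if ok:
                dataset.append((x, y_fixed))              # rescue the hard example
    return dataset
```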

The Math Behind the Training

The training process occurs in two phases. First, we must ensure the Reflector is actually good at fixing mistakes. Then, we use the Reflector to train the Agent.

Phase 1: Training the Reflector The Reflector is trained using pairs of (Failed Attempt \(\rightarrow\) Corrected Attempt). We can gather these pairs by letting the base model generate samples, filtering for success/failure, and using the successes as ground truth for what the failed samples should have looked like.
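
A minimal sketch of that pairing step, under the same assumed interfaces as before (none of these helpers come from the paper’s code): sample several attempts per task, then pair each failure (plus its feedback) with a success on the same task as the correction target.

```python
# Sketch (interfaces assumed) of assembling the Reflector's training pairs:
# failures on a task are paired with successes on the same task, so the
# Reflector learns the mapping (input, failed attempt, feedback) -> correction.

def build_reflector_pairs(tasks, agent, env, k=5):
    pairs = []
    for x in tasks:
        successes, failures = [], []
        for y in agent.sample(x, num_samples=k):
            ok, feedback = env.check(x, y)
            (successes if ok else failures).append((y, feedback))
        for y_fail, fb in failures:
            for y_good, _ in successes:
                pairs.append({"input": x, "failure": y_fail,
                              "feedback": fb, "target": y_good})
    return pairs
```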

The Reflector is trained with a standard maximum-likelihood (MLE) objective: maximize the log-likelihood of the corrected output, conditioned on the input (\(x\)) and the failed attempt (\(y^l\)):

Equation 1: Reflector Training Objective.
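
Spelled out, a reconstruction consistent with this description (the notation is assumed rather than copied from the paper) is:

\[
\max_{\theta_{\mathcal{R}}} \; \sum_{(x,\; y^{l},\; y)} \log p_{\mathcal{R}}\!\left(y \mid x,\, y^{l};\, \theta_{\mathcal{R}}\right),
\]

where \(y\) is the corrected output paired with the failed attempt \(y^{l}\), and the environment feedback is included in the Reflector’s context as described earlier.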

Phase 2: Training the Agent Once we have a capable Reflector, we run the Re-ReST loop to generate a massive dataset of successful trajectories (both originally successful and reflector-corrected). The Agent is then fine-tuned on this combined dataset.

Equation 2: Agent Training Objective.
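
A reconstruction consistent with the surrounding text (again, notation assumed): the Agent maximizes the log-likelihood of every verified-successful trajectory, regardless of which route produced it:

\[
\max_{\theta_{\mathcal{M}}} \; \sum_{(x,\; y)\,\in\, \mathcal{D}_{\mathcal{M}} \,\cup\, \mathcal{D}_{\mathcal{R}}} \log p_{\mathcal{M}}\!\left(y \mid x;\, \theta_{\mathcal{M}}\right)
\]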

The beauty of this equation is that the dataset \(\mathcal{D}_{\mathcal{R}}\) (data from the reflector) often contains “harder” examples than \(\mathcal{D}_{\mathcal{M}}\) (data the agent generated itself), pushing the agent to learn more robust reasoning patterns.

Why Reflection Beats “Just Sampling More”

A common counter-argument to complex pipelines is: “Why not just sample 100 times instead of 10? Eventually, you’ll get a right answer.”

The researchers investigated this by comparing Re-ReST against standard self-training with increasing sample counts.

Figure 3: Performance comparison vs. Number of Samples.

Figure 3 reveals a crucial insight. In standard self-training (the blue line), increasing the number of samples per instance does improve performance initially, but it quickly plateaus. The model runs out of “luck”—it simply lacks the capability to solve the harder instances, no matter how many times it guesses.

Re-ReST (the red dashed line) breaks this ceiling. By actively correcting errors using feedback, the Reflector “unlocks” training examples that were previously inaccessible via random sampling. It is not just about more data; it is about better data covering a wider distribution of problem difficulties.

Experimental Results

The authors tested Re-ReST across a wide variety of domains: Multi-hop QA, Decision Making, Code Generation, and Image Generation. The results consistently show that Re-ReST allows open-source models (like Llama-2-13B) to punch above their weight class.

1. Multi-Hop Reasoning (HotpotQA)

HotpotQA requires an agent to search Wikipedia, find multiple documents, and reason across them to answer a question.

Table 1: Results on HotpotQA.

Table 1 highlights three key findings:

  1. Self-Training works: Just letting Llama-2 teach itself improves Exact Match (EM) scores from 20.0% (Few-Shot) to 27.6%.
  2. Re-ReST works better: Adding reflection boosts performance further to 29.6%.
  3. Beating GPT-4 Distillation: Look at the comparison with FireAct, which uses GPT-4 to generate its training data. Re-ReST, using only Llama-2 and Wikipedia API feedback, achieves comparable results. When Re-ReST is allowed a tiny amount of GPT-4 seed data (0.5k examples), it outperforms the pure GPT-4 distillation methods (35.8% vs. 34.4%).

2. Sequential Decision Making (ALFWorld)

ALFWorld is a text-based simulation where an agent acts as a household robot (e.g., “go to the kitchen, find the apple, wash it, put it in the fridge”). This requires long-horizon planning where one mistake ruins the whole trajectory.

Table 2: Results on ALFWorld.

The results in Table 2 are dramatic. The base Few-Shot model fails almost completely (8.9% success). Self-training helps (37.3%), but Re-ReST jumps to 51.4%. This suggests that for tasks requiring strict sequential logic, the ability to “debug” a trajectory via reflection is incredibly potent.

3. Code Generation (MBPP)

In coding, “feedback” is very precise: unit tests either pass or fail. This makes it an ideal domain for Re-ReST.
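
As a concrete illustration of how precise this feedback is, here is a minimal sketch of an MBPP-style checker; the helper name and toy task are hypothetical, not the paper’s evaluation harness.

```python
# Minimal sketch of an MBPP-style checker: a candidate program "passes" only if
# every assert in the task's test list executes without raising.

def passes_unit_tests(candidate_code: str, test_asserts: list) -> bool:
    namespace = {}
    try:
        exec(candidate_code, namespace)      # define the candidate function(s)
        for test in test_asserts:
            exec(test, namespace)            # each assert raises on failure
        return True
    except Exception:
        return False

# Toy usage: a correct candidate passes and would be kept for the training set.
code = "def add(a, b):\n    return a + b"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
print(passes_unit_tests(code, tests))  # True
```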

Table 3: Results on Code Generation and Visual Programming.

Table 3 shows consistent gains. On the MBPP (Python programming) dataset, Re-ReST improves the Pass@1 rate from 54.5% (standard self-training) to 56.4%. While the gap is smaller here than in ALFWorld, it confirms the method’s versatility.

Is the Reflector Actually Learning?

A critical question is whether we actually need to train the Reflector. Can’t we just prompt Llama-2 to “fix the error”?

Table 5: Impact of Training the Reflector.

Table 5 answers this. Using a base LLM as a reflector (without training it specifically to reflect) does help—it boosts the agent to 28.8% on HotpotQA. However, training the reflector on success/failure pairs pushes that score to 29.6%. This indicates that “correcting errors” is a specific skill that models can improve upon with practice.

Beyond Training: Inference and Optimization

The paper concludes with two fascinating extensions of the Re-ReST framework.

Inference Without Ground Truth

A major limitation of reflection methods (like Reflexion) is that they usually need ground-truth feedback (e.g., a unit test or a reward signal) during test time to work. In the real world, we often don’t have an answer key.

The authors propose using Self-Consistency with reflection. They let the Reflector “fix” the agent’s output even without knowing if it’s wrong, effectively generating alternative reasoning paths. They then vote on the final answer.
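
A minimal sketch of this inference-time procedure, with assumed agent, reflector, and answer-extraction helpers (none of these are the paper’s actual interfaces):

```python
# Reflection-augmented self-consistency at inference time: the reflector revises
# each sampled output without any ground-truth signal, and the final answer is
# chosen by majority vote over the combined pool of original and revised outputs.

from collections import Counter

def answer_with_reflection(x, agent, reflector, extract_answer, k=5):
    candidates = agent.sample(x, num_samples=k)
    revised = [reflector.correct(x, y, feedback=None) for y in candidates]
    answers = [extract_answer(y) for y in candidates + revised]
    return Counter(answers).most_common(1)[0][0]   # majority-voted final answer
```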

Table 6: Inference time results using Self-Consistency.

Table 6 shows that this method carries the benefits of reflection over to inference, improving accuracy without needing a teacher or an environment oracle.

Direct Preference Optimization (DPO)

Since Re-ReST generates pairs of (Incorrect Attempt, Corrected Attempt), it creates a perfect dataset for DPO, a method that aligns models by training them on “A is better than B” examples.

Table 7: Compatibility with DPO.

Table 7 demonstrates that Re-ReST is compatible with DPO. By treating the corrected trajectory as the “winner” and the failed one as the “loser,” the model can be aligned even more effectively than with standard supervised fine-tuning.
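
A sketch of how those preference pairs could be assembled; the "prompt"/"chosen"/"rejected" field names follow common open-source DPO data conventions rather than anything specified in the paper.

```python
# Convert Re-ReST's (failed, corrected) trajectories into DPO-style preference pairs:
# the reflector-corrected trajectory is the preferred response, the original failure
# is the dispreferred one.

def build_dpo_pairs(reflection_records):
    pairs = []
    for rec in reflection_records:          # each rec: {"input", "failure", "corrected"}
        pairs.append({
            "prompt": rec["input"],
            "chosen": rec["corrected"],     # corrected trajectory = winner
            "rejected": rec["failure"],     # failed trajectory = loser
        })
    return pairs
```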

Conclusion

Re-ReST offers a compelling blueprint for the future of open-source AI. It demonstrates that we don’t always need a bigger model or a smarter teacher to improve. By structuring the learning process to value mistakes—and systematically fixing them through reflection—smaller models can bootstrap their own intelligence.

For students and researchers, this highlights a shift in focus: from simply curating static datasets to designing dynamic training loops where the model interacts, fails, learns, and evolves.

Key Takeaways:

  • Self-Training allows models to improve without external supervision, but it is bottlenecked by the scarcity of high-quality successful samples.
  • Re-ReST fixes this by using a Reflector to correct failed attempts using environment feedback.
  • This approach outperforms standard self-training and rivals methods that rely on expensive GPT-4 distillation.
  • It is effective across reasoning, coding, and decision-making tasks, proving that autonomous self-improvement is a viable path for Language Agents.