Large Language Models (LLMs) are breaking out of the chatbot box. We’re increasingly seeing them power autonomous agents that can interact with software, play games, and browse the web to accomplish complex goals. But there’s a catch: when these agents make a mistake, how do they learn not to repeat it?

Traditionally, the answer in AI has been Reinforcement Learning (RL)—a process of trial and error where an agent is rewarded for good actions and penalized for bad ones. However, applying traditional RL to massive LLMs is incredibly slow and computationally expensive, often requiring months of training and enormous GPU resources to fine-tune billions of parameters. As a result, most LLM agents today learn only from a handful of carefully designed examples in their prompt.

What if there were a better way?
What if an agent could learn from its mistakes almost instantly, without any expensive retraining?

This is the core idea behind a fascinating new paper, “Reflexion: Language Agents with Verbal Reinforcement Learning.” The researchers propose a framework where an LLM agent doesn’t just try, fail, and try again. Instead, it tries, fails, pauses to think about why it failed, and then uses that self-reflection to guide its next attempt.

This simple but powerful idea of “verbal reinforcement” leads to astonishing results. On a challenging coding benchmark, a Reflexion agent achieved 91% pass@1 accuracy, soaring past the 80% scored by GPT-4 itself.

Let’s dive in.


The Problem with Learning from Experience

Imagine telling an LLM-powered agent to perform a task in a text-based adventure game, like:

Task: “Clean the pan and put it on the countertop.”

The agent might generate:

  1. > take pan from stoveburner
    Obs: Nothing happens.
  2. > clean pan with sink
    Obs: Nothing happens.

The agent has failed. A standard agent might simply try again with a slightly different random approach, potentially repeating the same mistake. It receives a blunt “fail” signal but struggles with credit assignment—figuring out which action in a long sequence caused the failure.

In this example, the agent hallucinates that a pan is on the stove when it’s not. The “fail” at the end doesn’t tell it that its very first assumption was wrong.

Traditional RL would fix this by running thousands or millions of trials, slowly nudging the model’s internal weights away from the choices that led to failure. But Reflexion asks: can we do this more efficiently, the way a human would?
When humans fail, we often reflect:

“Ah, I see—the pan wasn’t on the stove. Next time, I’ll look around to find the pan first.”

Reflexion is designed to give LLM agents exactly this capacity.


A table showing examples of the Reflexion process across three tasks: decision making, programming, and reasoning. It lists the initial task, the agent’s trajectory, evaluation, and the resulting self-reflection.

Figure 1: Examples of Reflexion in decision-making, programming, and reasoning tasks. Each shows how a task is attempted, evaluated, and distilled into a valuable, reusable reflection.


The Reflexion Framework: A Three-Part Mind

The authors designed Reflexion as a modular system with three components operating in a loop:

  • Actor – the doer
  • Evaluator – the critic
  • Self-Reflection model – the thinker

Central to the process is memory, which allows the agent to improve across trials.

A diagram of the Reflexion agent architecture and reinforcement algorithm. Components include Actor, Evaluator, Self-Reflection, short-term (Trajectory) and long-term (Experience) memory, interacting with an Environment.

Figure 2: Reflexion architecture and its iterative reinforcement algorithm.

1. The Actor — The Doer

The Actor interacts with the environment. It’s an LLM prompted to generate text and actions to solve a task, like clicking buttons in a web navigation scenario or writing Python code. Its decisions are conditioned on the current state plus memory from prior trials. The team leverages advanced prompting strategies such as Chain-of-Thought (CoT) and ReAct to encourage reasoning and planning.
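
To make this concrete, here’s a minimal sketch of what one Actor step might look like in Python. The `llm()` helper, the `actor_step` name, and the prompt wording are illustrative assumptions of mine, not the paper’s implementation:

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to any LLM completion endpoint."""
    raise NotImplementedError("wire up your model of choice here")


def actor_step(task: str, trajectory: list[str], reflections: list[str]) -> str:
    """Generate the next action, conditioned on the task description, the current
    trajectory (short-term memory), and past reflections (long-term memory)."""
    reflection_block = "\n".join(f"- {r}" for r in reflections) or "(none yet)"
    history_block = "\n".join(trajectory) or "(no actions taken yet)"
    prompt = (
        f"Task: {task}\n\n"
        f"Lessons from previous failed attempts:\n{reflection_block}\n\n"
        f"Actions so far:\n{history_block}\n\n"
        "Think step by step, then output the next action."
    )
    return llm(prompt)
```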

2. The Evaluator — The Critic

Once the Actor finishes a trial, the Evaluator scores its performance. Depending on the task, the Evaluator might use:

  • Exact Match: Does the final answer match ground truth? (QA tasks)
  • Heuristics: Detect loops or excessive steps in decision-making environments.
  • LLM-Based Judgment: Prompt another LLM to assess the quality of the output.
  • Unit Tests: Run produced code against a suite of checks (programming tasks).

The output is a simple success/fail or scalar score.
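
For a programming task, the Evaluator can be as simple as executing the candidate code against a set of assert-style tests and returning the failures as feedback. The sketch below is a simplified illustration, not the paper’s actual harness (which has to be more careful about sandboxing and timeouts):

```python
def evaluate_code(candidate_code: str, tests: list[str]) -> tuple[bool, str]:
    """Run the candidate code, then each assert-style test.
    Returns (success, feedback); the feedback string is what gets handed
    to the Self-Reflection step when the trial fails."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)      # defines the candidate function(s)
    except Exception as e:
        return False, f"Code failed to execute: {e}"

    failures = []
    for test in tests:
        try:
            exec(test, namespace)            # e.g. "assert add(2, 3) == 5"
        except AssertionError:
            failures.append(f"Failed: {test}")
        except Exception as e:
            failures.append(f"Error in {test}: {e}")

    if failures:
        return False, "\n".join(failures)
    return True, "All tests passed."
```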

3. The Self-Reflection Model — The Thinker

When a trial fails, Reflexion triggers the Self-Reflection model. This LLM sees:

  • The Actor’s trajectory (short-term memory)
  • The evaluation signal
  • Past reflections (long-term memory)

It generates a concise, natural-language summary explaining what went wrong and suggesting how to improve.

For example:

“I tried to pick up the pan from the stove, but it wasn’t there. This led to failed actions. Next time, I should explore the room to find the pan before interacting.”
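
In code, the reflection step is just one more LLM call over the failed trajectory and the Evaluator’s feedback. The prompt below is an illustrative guess at its shape, reusing the same hypothetical `llm()` placeholder as in the Actor sketch:

```python
def llm(prompt: str) -> str:  # same placeholder as in the Actor sketch
    raise NotImplementedError


def reflect(task: str, trajectory: list[str], feedback: str) -> str:
    """Distill a failed trial into a short, reusable natural-language lesson."""
    prompt = (
        f"Task: {task}\n\n"
        "Your attempt:\n" + "\n".join(trajectory) + "\n\n"
        f"Evaluation: {feedback}\n\n"
        "In two or three sentences, diagnose why the attempt failed and state a "
        "concrete plan to avoid that mistake next time."
    )
    return llm(prompt)
```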


Memory and The Learning Loop

These reflections are stored in a long-term memory buffer.

The loop:

  1. Trial t – Actor uses instructions + memory to create trajectory τ_t.
  2. Evaluate – Evaluator scores τ_t, producing a feedback signal r_t.
  3. Reflect – If failed, Self-Reflection model writes sr_t.
  4. Update Memory – Append sr_t to memory (keep last 1–3).
  5. Repeat – Actor tries again with updated context.

Over a few trials, the Actor learns effective new strategies—without touching its weights.
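
Put together, the whole loop fits in a dozen lines. This sketch assumes Actor, Evaluator, and Self-Reflection helpers like the ones above, and caps long-term memory at a handful of reflections; it is one possible wiring, not the paper’s exact code:

```python
MAX_REFLECTIONS = 3  # keep long-term memory small (e.g. the last 1-3 lessons)


def reflexion_loop(task: str, run_trial, evaluate, reflect, max_trials: int = 10):
    """Illustrative loop; run_trial / evaluate / reflect stand in for the
    Actor, Evaluator, and Self-Reflection components sketched above."""
    reflections: list[str] = []                         # long-term memory
    for _ in range(max_trials):
        trajectory = run_trial(task, reflections)       # Actor produces tau_t
        success, feedback = evaluate(task, trajectory)  # Evaluator scores it -> r_t
        if success:
            return trajectory                           # solved, no retraining needed
        lesson = reflect(task, trajectory, feedback)    # Self-Reflection writes sr_t
        reflections.append(lesson)                      # update memory
        reflections = reflections[-MAX_REFLECTIONS:]    # keep it bounded
    return None                                         # gave up within the trial budget
```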


Putting Reflexion to the Test

The researchers challenged Reflexion in three domains:
(1) Sequential decision-making, (2) Reasoning, (3) Programming.


1. Sequential Decision-Making: ALFWorld

ALFWorld is a suite of text-based simulations where agents perform household tasks like moving objects or cleaning items. Baseline agents used ReAct.

Two line graphs showing Reflexion’s performance in ALFWorld. Left: success rates over trials, where Reflexion agents quickly reach ~97% while baseline plateaus at ~75%. Right: hallucination error rates drop near zero with Reflexion.

Figure 3: Reflexion rapidly boosts success rates in ALFWorld and nearly eliminates “hallucination” failure modes.

The baseline ReAct agent plateaued at ~75% success and never solved certain tasks. Reflexion agents climbed to 97% success over 12 trials.
Self-reflection diagnosed mistakes like “I thought I had the knife, but I never picked it up”, drastically reducing hallucinations.


2. Reasoning: HotPotQA

HotPotQA requires multi-hop reasoning over Wikipedia content. The team tested both CoT and ReAct agents with Reflexion.

Three line charts showing Reflexion’s boost on HotPotQA. All Reflexion variants outperform baselines across trials.

Figure 4: Across all setups, Reflexion agents continually improve, unlike the baselines, which remain flat.

Key finding: Baseline agents never solved, in a later trial, a task they had initially failed; Reflexion agents steadily learned from their mistakes.

An ablation study showed that just feeding the last failed trajectory (episodic memory) gave minimal gains. Adding explicit self-reflection provided a much larger boost—confirming it’s the critical ingredient.


3. Programming: State-of-the-Art Results

Programming tasks included HumanEval, MBPP, and a new LeetcodeHardGym benchmark. Reflexion agents first write unit tests for the task, then write code to pass those tests.
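
A rough sketch of that pipeline is below, with an assumed `llm()` placeholder and invented prompt wording (the paper’s actual prompts differ); the generated tests then plug into an Evaluator like the one sketched earlier:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the model call


def generate_tests(problem: str, n: int = 5) -> list[str]:
    """Ask the model for assert-style unit tests derived from the problem statement."""
    prompt = (
        f"{problem}\n\n"
        f"Write {n} Python assert statements that a correct solution must satisfy, "
        "one per line, with no other text."
    )
    return [line.strip() for line in llm(prompt).splitlines()
            if line.strip().startswith("assert")]


def generate_code(problem: str, reflections: list[str]) -> str:
    """Ask the model for an implementation, conditioned on lessons from past failures."""
    lessons = "\n".join(f"- {r}" for r in reflections) or "(none)"
    prompt = (
        f"{problem}\n\n"
        f"Lessons from previous attempts:\n{lessons}\n\n"
        "Write the complete Python function."
    )
    return llm(prompt)
```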

A table comparing pass@1 accuracy for Reflexion and prior SOTA models across Python and Rust programming benchmarks.

Table 1: Reflexion sets new SOTA on HumanEval Python and Rust.

On HumanEval (Python): Reflexion hit 91.0% pass@1, beating GPT-4’s 80.1%.
But on MBPP, it slightly underperformed. Why?

The breakdown below reveals the bottleneck.

Unit test performance analysis table showing True Positive, False Negative, False Positive, and True Negative rates for Reflexion’s self-tests.

Table 2: High false positive rates (weak test suites) limit MBPP performance.

For MBPP Python, Reflexion’s false positive rate was 16.3%—weak tests let incorrect code slip through. In contrast, HumanEval’s rate was just 1.4%. This shows that the agent’s capacity to improve is bounded by the accuracy of its evaluator.
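
Here’s a made-up toy example of what such a false positive looks like: the implementation is wrong, but the self-generated tests are too weak to catch it, so the Evaluator reports success and no reflection is ever triggered.

```python
def is_even(n: int) -> bool:
    # Buggy: rejects negative even numbers because of the extra n >= 0 check.
    return n % 2 == 0 and n >= 0


# A weak self-generated test suite only probes small non-negative inputs,
# so the buggy function passes every test and the agent never reflects.
assert is_even(0)
assert is_even(4)
assert not is_even(7)

# A stronger suite would expose the bug and turn this into a useful failure:
# assert is_even(-2)   # fails with the implementation above
```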


To prove both core components matter, the team ran an ablation on tough Rust problems.

A table with ablation study results: Reflexion without self-reflection or test generation performs worse than full Reflexion on HumanEval Rust.

Table 3: Both self-reflection and test generation are essential for gains.

Results:

  • No Self-Reflection: Stuck at baseline accuracy (60%).
  • No Test Generation: Drops below baseline (52%).
  • Full Reflexion: 68% accuracy.

Without grounded feedback (tests) or structured reasoning (reflections), learning collapses.


Conclusion: A Human-Like Path to Smarter Agents

Reflexion is a lightweight, interpretable, and effective way to make LLM agents smarter:

  1. Efficiency – No weight updates, cheaper and faster than RL fine-tuning.
  2. Effectiveness – Achieves SOTA across reasoning and code generation.
  3. Interpretability – We can read the agent’s self-reflections to understand why it changes behavior.

Limitations:
The method leans heavily on an LLM’s ability to produce useful reflections. Poor evaluators (or weak test suites) bottleneck performance—as seen in MBPP. Agents can also get stuck in local minima without creative exploration.


Despite these challenges, Reflexion demonstrates that the path to more capable AI agents might not be just about scaling models. It’s about giving them robust, human-like mechanisms to reason, reflect, and learn from experience. In doing so, we may unlock agents that improve themselves—not just in performance, but in transparency and alignment with our intentions.