Introduction

Imagine you are trying to learn a new, complex video game. You play a level, make a mistake, and lose. The next time you play, you remember that mistake and try a different strategy. Over time, you develop a “gut feeling” for which actions lead to victory and which lead to defeat.

Now, consider Large Language Models (LLMs). They possess incredible encyclopedic knowledge and commonsense reasoning. However, when acting as autonomous agents in interactive environments, they suffer from a critical flaw: they struggle to learn from their own past experiences effectively. Standard approaches either fine-tune the model only on “perfect” demonstrations (ignoring the educational value of failure) or try to stuff past experiences into the model’s context window (which quickly hits memory limits).

What if we could give an LLM that “gut feeling” based on past experience without cluttering its immediate thought process?

This is the core proposition of Retrospex, a novel framework proposed by researchers at Nanjing University. Retrospex separates the agent’s general reasoning capabilities from its experience-based value judgment. It combines a standard LLM with a specialized Reinforcement Learning (RL) Critic. This critic is trained offline to judge actions based on past successes and failures, guiding the LLM toward better decisions without requiring massive context windows.

In this post, we will tear down the architecture of Retrospex, exploring how it marries the linguistic prowess of LLMs with the strategic foresight of offline Reinforcement Learning.

The Context: Why LLM Agents Struggle with Experience

To understand why Retrospex is necessary, we first need to look at how current LLM agents operate.

The earliest LLM agents, such as ReAct, operate in a simple loop: observe, reason, act. While effective for simple tasks, these agents suffer from amnesia: they carry no long-term lessons from one task to the next.

To solve this, researchers developed architectures like Reflexion and Rememberer. These systems introduce a form of memory. When an agent fails, it records the experience. In future attempts, the agent retrieves these memories and adds them to the LLM’s prompt (context).

Figure 1: Comparing different architectures for LLM-based Agents

As shown in Figure 1 above, the progression is clear:

  1. ReAct: No long-term memory.
  2. Reflexion: Uses a self-reflection loop to update context.
  3. Rememberer: Retrieves past experiences from a database.
  4. Retrospex (The focus of this post): Takes a different approach. Instead of feeding raw text experiences back into the LLM (which consumes expensive tokens), it trains a separate module—an RL Critic—to evaluate the value of an action.

The limitations of previous methods are clear. LLMs have a fixed context window. If an agent runs for thousands of steps or attempts hundreds of tasks, you cannot possibly fit all that experience into the prompt. Furthermore, simply reading about a past mistake is not the same as mathematically weighting an action as “bad.” Retrospex aims to solve this by condensing experience into a lightweight neural network.

The Retrospex Methodology

The Retrospex framework operates in three distinct phases: the Warm-up Stage, the Retrospection Stage, and the Inference Stage. This separation allows the system to gather data, learn from it offline, and then apply that knowledge dynamically.

Figure 2: The training process of Retrospex includes two stages

Let’s break down each stage of the pipeline illustrated in Figure 2.

Phase 1: The Warm-up Stage (Imitation Learning)

Before an agent can learn from its own history, it needs a baseline capability. You wouldn’t teach advanced strategy to someone who doesn’t know the rules of the game.

In this stage, the researchers fine-tune a base LLM (like Flan-T5 or LLaMA) using Imitation Learning (IL). They take “golden trajectories”—sequences of actions taken by humans or expert algorithms that successfully complete a task—and treat them as a text generation problem.

The objective is to minimize the difference between the LLM’s predicted action and the expert’s action. The mathematical objective is the standard Negative Log-Likelihood (NLL) loss:

\[
\mathcal{L}_{IL} = -\,\mathbb{E}_{x}\Big[\log \pi\big(\pi^{*}(x) \mid x\big)\Big]
\]

Here, the model \(\pi\) tries to maximize the likelihood of the expert action \(\pi^*(x)\) given the context \(x\).
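As a rough illustration (not the paper's training code), here is a minimal sketch of this NLL objective for a single (context, expert action) pair using Hugging Face Transformers; the model name, learning rate, and toy strings are placeholders.

```python
# Minimal sketch of the warm-up objective for one (context, expert action)
# pair. The model name, learning rate, and toy strings are placeholders,
# not the paper's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

context = "Task: boil water. Observation: You are in the kitchen."
expert_action = "open cupboard"  # the golden trajectory's next action

inputs = tokenizer(context, return_tensors="pt")
labels = tokenizer(expert_action, return_tensors="pt").input_ids

# Passing `labels` to a seq2seq model returns the token-level
# cross-entropy (negative log-likelihood) of the expert action.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```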

Once this base agent is trained, it is deployed into the environment to perform tasks. Crucially, Retrospex collects everything this agent does—both the successful trajectories and the failures. This creates a rich dataset of experiences, denoted as memory \(\mathcal{D}\). This dataset contains tuples of \((state, action, next\_state, reward)\).
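Conceptually, the memory \(\mathcal{D}\) is just a pool of such tuples. A minimal sketch, with illustrative field contents rather than the paper's exact schema:

```python
from typing import NamedTuple

class Transition(NamedTuple):
    state: str       # textual observation plus interaction history so far
    action: str      # action the warm-up agent actually took
    next_state: str  # observation returned by the environment
    reward: float    # environment reward (failures are kept too)

# The memory D is simply a pool of such transitions, good and bad alike.
memory: list[Transition] = []
memory.append(Transition(
    state="Task: boil water. You are in the kitchen.",
    action="open cupboard",
    next_state="The cupboard is open. You see a pot.",
    reward=0.0,
))
```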

Phase 2: The Retrospection Stage (Offline RL)

This is where Retrospex gets its name. The agent “looks back” at its gathered experiences to learn.

The goal here is to train a Critic. In Reinforcement Learning terms, a Critic is a function (usually a neural network) that estimates the Q-value of an action. A Q-value, \(Q(s, a)\), represents the expected total reward an agent will get if it is in state \(s\), takes action \(a\), and acts optimally thereafter.

If the Q-value is high, the action is good (likely leads to success). If low, the action is bad (likely leads to failure).
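In standard RL notation, with a discount factor \(\gamma \in (0, 1]\) (the textbook definition, not restated in the paper excerpt above), this is

\[
Q(s, a) = \mathbb{E}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_t = s,\ a_t = a \right],
\]

where the expectation is over what happens after the agent takes \(a\) in \(s\) and then keeps acting optimally.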

Why Offline RL?

Standard RL (Online RL) updates the model while the agent interacts with the environment. This is slow, expensive, and unstable. Retrospex uses Offline RL, meaning it learns strictly from the static dataset collected in Phase 1.

The specific algorithm used is Implicit Q-Learning (IQL). Why IQL? In offline RL, a common problem is “overestimation.” If the model sees a state it hasn’t encountered before, it might guess a wildly high reward for an action. IQL is designed to be conservative, only estimating values for actions that are actually supported by the data.

The Training Objectives

The Critic network (implemented here as a lightweight GRU network) is trained using three loss functions derived from the Bellman equation.

1. The Q-Function Objective (TD Error): The Critic attempts to minimize the temporal difference (TD) error. It wants the Q-value of the current step to match the reward plus the value of the next step.

\[
\delta = r(s, a) + \gamma\, V(s') - Q(s, a)
\]

2. The Value Function Objective: To stabilize training, IQL introduces a state-value function \(V(s)\), which estimates how good a state is regardless of the specific action taken. \(V\) is trained with expectile regression (controlled by a parameter \(\tau\)), which pulls it toward the better actions actually observed in the data:

\[
L_V = \mathbb{E}_{(s, a) \sim \mathcal{D}}\Big[ L_2^{\tau}\big(Q(s, a) - V(s)\big) \Big],
\qquad
L_2^{\tau}(u) = \big|\tau - \mathbb{1}(u < 0)\big|\, u^{2}
\]

3. The Final Q-Update: Using the estimated value function \(V\), the Q-network is updated to ensure consistency.

\[
L_Q = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[ \big(r + \gamma\, V(s') - Q(s, a)\big)^{2} \Big]
\]

By the end of this stage, we have a specialized neural network—the RL Critic—that can look at a situation and an action, and output a number representing how “smart” that move is based on historical data.
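To make these objectives concrete, here is a minimal PyTorch sketch of an IQL-style update, assuming the GRU has already encoded states and actions into fixed-size vectors; the layer sizes, expectile \(\tau\), and discount \(\gamma\) below are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative IQL-style critic update. The encoders that turn text states
# and actions into vectors (the GRU) are assumed to exist upstream; layer
# sizes, TAU, and GAMMA are placeholder values, not the paper's settings.
ENC_DIM, GAMMA, TAU = 256, 0.99, 0.7

q_net = nn.Sequential(nn.Linear(2 * ENC_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
v_net = nn.Sequential(nn.Linear(ENC_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(v_net.parameters(), lr=3e-4)

def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    # L2_tau(u) = |tau - 1(u < 0)| * u^2 : an asymmetric squared error
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_update(s, a, r, s_next):
    """One offline update on a batch of encoded (s, a, r, s') transitions."""
    # 1) Value objective: fit V(s) to an upper expectile of Q(s, a) over the
    #    actions actually present in the data (this is what keeps IQL
    #    conservative about unseen actions).
    with torch.no_grad():
        q_sa = q_net(torch.cat([s, a], dim=-1))
    v_loss = expectile_loss(q_sa - v_net(s), TAU)
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()

    # 2) Q objective: regress Q(s, a) onto the TD target r + gamma * V(s').
    with torch.no_grad():
        target = r + GAMMA * v_net(s_next)
    q_loss = F.mse_loss(q_net(torch.cat([s, a], dim=-1)), target)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()
    return v_loss.item(), q_loss.item()
```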

Phase 3: The Inference Stage (Dynamic Action Rescoring)

Now we have two brains:

  1. The LLM: Good at language, commonsense, and generating candidate actions.
  2. The RL Critic: Good at long-term planning and avoiding past mistakes based on rewards.

How do we combine them? Retrospex uses a technique called Dynamic Action Rescoring.

Figure 3: Dynamic Action Rescoring in Retrospex

The process works as follows:

  1. Action Generation: The LLM looks at the current context and generates the Top-\(K\) candidate actions.
  2. LLM Scoring: We calculate the probability score \(p\) for each action based on the LLM’s confidence.
  3. Critic Scoring: We feed the same actions into the RL Critic to get the Q-value \(q\) for each action.

Both scores are normalized to be on the same scale:

Equations: normalization of the LLM probabilities and of the Q-values

The Dynamic Weight \(\alpha(t)\)

This is the clever twist. The authors realized that the importance of the LLM vs. the Critic changes depending on where you are in the task.

At the beginning of a task (\(t=0\)), the history is short. The RL Critic (which relies on state history) might not have enough specific information. However, the LLM has strong commonsense priors. Therefore, early in the task, we should trust the LLM.

As the task progresses and the trajectory gets longer, the specific history becomes more important than general commonsense. Here, the RL Critic’s insight into long-term rewards becomes crucial.

Retrospex defines a dynamic weight \(\alpha(t)\) that decays over time:

\[
\alpha(t) = \max\big(d^{\,t},\; b\big)
\]

Here, \(d\) is a decay factor (e.g., 0.97) and \(b\) is a lower bound (e.g., 0.6) to ensure the LLM is never completely ignored.
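For example, with the decay-and-floor form above and \(d = 0.97\), \(b = 0.6\): \(\alpha(0) = 1\), \(\alpha(10) \approx 0.74\), and the floor of 0.6 is reached around step 17, after which the Critic's share \(1 - \alpha(t)\) stays fixed at 0.4.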

The final score \(S(a)\) for an action is a weighted combination:

\[
S(a) = \alpha(t)\, p(a) + \big(1 - \alpha(t)\big)\, q(a)
\]

where \(p(a)\) and \(q(a)\) are the normalized LLM probability and Q-value for action \(a\).

As shown in Figure 4 below, the weight \(\alpha(t)\) starts high (trusting the LLM) and drops as steps increase (trusting the RL Critic more), eventually plateauing at the lower bound \(b\).

Figure 4: \(\alpha(t)\) for different values of the step \(t\)

The agent simply picks the action with the highest combined score \(S(a)\).
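Here is a minimal sketch of the whole rescoring step, assuming we already have the Top-\(K\) candidates with their LLM probabilities and Critic Q-values in hand; the min-max normalization over the candidates is an illustrative choice, not necessarily the paper's exact formula.

```python
def minmax(xs):
    """Scale scores to [0, 1] across the K candidates.
    (Illustrative normalization; the paper defines its own scheme.)"""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [1.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def alpha(t, d=0.97, b=0.6):
    """Dynamic weight: exponential decay in t with a lower bound b."""
    return max(d ** t, b)

def rescore(candidates, llm_probs, q_values, t):
    """Pick the action with the highest combined score S(a)."""
    p = minmax(llm_probs)   # normalized LLM confidence
    q = minmax(q_values)    # normalized Critic Q-values
    a_t = alpha(t)
    scores = [a_t * pi + (1 - a_t) * qi for pi, qi in zip(p, q)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

# Early in an episode (t=1) the LLM's confidence dominates; by t=30 the
# Critic contributes its full 40% share (with d=0.97, b=0.6).
candidates = ["open fridge", "go to kitchen", "pick up fork"]
action, scores = rescore(candidates,
                         llm_probs=[0.55, 0.30, 0.15],
                         q_values=[-0.2, 0.8, 0.1],
                         t=1)
print(action)  # "open fridge": the LLM's top pick wins at this early step
```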

Experiments and Results

The researchers evaluated Retrospex on three challenging text-based simulation environments: ScienceWorld (scientific reasoning), ALFWorld (household tasks), and Webshop (e-commerce navigation).

Training Data

Table 1 summarizes the data used in the warm-up and retrospection stages. A key advantage of Retrospex is that the RL Critic is lightweight: a GRU with only ~2.7 million parameters, compared to the billions of parameters in the LLM, so the inference overhead is negligible.

Table 1: Training data used in the warm-up and retrospection stages
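For a sense of scale, a toy GRU critic with hypothetical layer sizes (not the paper's exact configuration) already lands in the low millions of parameters:

```python
import torch.nn as nn

class GRUCritic(nn.Module):
    """Toy Q-critic: a GRU encodes the embedded (history, action) sequence
    and a linear head emits a scalar Q-value. Sizes are hypothetical."""
    def __init__(self, emb_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, 1)

    def forward(self, seq):          # seq: (batch, time, emb_dim)
        _, h = self.gru(seq)         # h: (num_layers, batch, hidden)
        return self.q_head(h[-1]).squeeze(-1)

critic = GRUCritic()
print(sum(p.numel() for p in critic.parameters()))  # ~1.25M with these toy sizes
```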

Performance on ScienceWorld

ScienceWorld is complex, requiring agents to perform multi-step scientific experiments (e.g., “measure the conductivity of a fork”).

The results in Table 3 are telling. Retrospex significantly outperforms the base Imitation Learning agent (IL-T5).

Table 3: Average score (AS) and success rate (SR) on ScienceWorld

  • IL-T5 (The base model): Achieves a success rate (SR) of 27.0.
  • Retrospex: Jumps to a success rate of 36.0.
  • Comparison: It even outperforms Reflexion (based on GPT-4) on average score, despite Retrospex using a much smaller Flan-T5 model. This proves that a smaller model with a dedicated experience critic can punch above its weight class.

Performance on ALFWorld and Webshop

ALFWorld tests household navigation (e.g., “put a clean spoon on the table”), while Webshop tests online shopping skills.

In Table 4 (ALFWorld), Retrospex achieves an 87.0% success rate, improving over the base model (83.5%) and outperforming Reflexion (GPT-3.5).

Table 4: Overall results on ALFWorld

In Webshop (Table 5), the trend continues. Retrospex consistently improves upon the base learner and competitive baselines like Rememberer and AgentLM across different test sets.

Table 5: Overall results on Webshop

Why Dynamic Scoring Matters

Is the dynamic weighting \(\alpha(t)\) actually necessary? Could we just average the scores 50/50?

The authors conducted an ablation study (removing parts of the system to see if they matter). Table 7 shows the results on ScienceWorld.

Table 7: Results on ScienceWorld with different dynamic scoring parameters

  • Column 1 (IL-T5): Using only the LLM works okay (Score: 48.80).
  • Column 2 (d=0, b=0): Using only the RL Critic fails miserably (Score: 36.7). This confirms the RL Critic is a guide, not a replacement for the LLM’s language skills.
  • Last Column (Static): A fixed weight (0.6 LLM + 0.4 Critic) yields a score of 54.37.
  • Retrospex Column (d=0.97): The dynamic decay yields the highest score of 55.98.

The data confirms that trusting the LLM early and the Critic late is the optimal strategy.

Task Complexity Analysis

One final interesting finding is how Retrospex handles tasks of different lengths. In Table 6, tasks are split into Short, Medium, and Long.

Table 6: Average reward scores across different task complexities

Retrospex provides the most dramatic improvement in Medium and Long tasks. This aligns perfectly with the theory: as trajectories get longer, the LLM is more prone to getting “lost” or hallucinating, and the RL Critic’s value estimation becomes the stabilizing anchor that keeps the agent on track.

Conclusion and Implications

Retrospex represents a significant step forward in the design of autonomous language agents. By acknowledging that “acting” (language generation) and “evaluating” (value estimation) are distinct skills, the authors have created a modular system that is both efficient and effective.

Here are the key takeaways:

  1. Decoupling Memory: We don’t need to stuff context windows with raw text of past mistakes. We can distill that experience into a mathematical value function (the Critic).
  2. Offline Learning is Safe: By using offline RL (IQL), agents can learn from static datasets without the risks and costs of live, trial-and-error training during deployment.
  3. Dynamic Collaboration: The “brain” (LLM) and the “gut” (Critic) work best when their influence is balanced dynamically based on the stage of the task.

As LLMs continue to grow in size, frameworks like Retrospex offer a path to making them not just smarter, but wiser—capable of looking back at their history to navigate the future.