Beyond Imitation: How Early Experience Lets Agents Learn from Their Own Mistakes
The long-standing dream of AI is an agent that learns by acting in the world — experimenting, failing, and improving without a human constantly telling it what to do. Language agents built on large language models (LLMs) are a huge step toward that dream: they can navigate websites, call APIs, chain tools, and even help with scientific workflows. But training them remains trapped between two extremes.
On one side is imitation learning: collect expert demonstrations and teach the agent to copy them. Simple to implement and reward-free, but brittle and expensive to scale. On the other side is reinforcement learning (RL): let the agent explore and optimize for reward through trial and error. Powerful when rewards are available and well-defined, but many realistic language-agent environments either lack verifiable reward signals or require extremely long, unstable rollouts.
“Agent Learning via Early Experience” proposes a practical middle path: Early Experience. The idea is to let agents propose actions during training, execute those actions in the environment, and use the resulting future states — not reward signals — as supervision. Those future states are grounded, informative, and scalable: they reveal the consequences of taking non-expert actions, and they can be harvested without a hand-crafted reward function.
This post walks through the core idea, two concrete methods the paper develops (Implicit World Modeling and Self-Reflection), and the evidence that this simple-sounding idea yields consistent, sometimes large gains across a wide range of environments.
Figure 1: Training paradigms for language agents. Left: the Era of Human Data (imitation learning) uses expert demonstrations — reward-free but not easy to scale. Right: the Era of Experience (RL) optimizes for rewards but relies on verifiable reward signals. Center: Early Experience (this work) lets agents propose actions and collect the resulting future states as scalable, reward-free supervision.
The problem in a nutshell
We typically formalize an agent’s problem as an MDP:
\[ \mathcal{M}=(\mathcal{S},\mathcal{A},T,R,\gamma,\rho_0), \]where states \(s\in\mathcal{S}\) describe what the agent observes (web pages, tool outputs, textual scene descriptions), actions \(a\in\mathcal{A}\) are candidate decisions (click, call a tool, type text), and \(T(s,a)\) gives the next-state dynamics. A policy \(\pi_\theta(a\mid s)\) maps states to action probabilities.
When reliable rewards \(R(s,a)\) exist, RL can optimize long-term performance. But many real-world language-agent environments either:
- produce no verifiable immediate reward (e.g., the website shows a page but not whether the form submission was correct), or
- require long, delayed interactions to reveal success/failure (e.g., multi-step tool use), which makes RL unstable and costly.
Imitation learning sidesteps rewards by training on a dataset of expert state–action pairs
\[ \mathcal{D}_{\text{expert}}=\{(s_i,a_i)\}_{i=1}^N, \]minimizing
\[ \mathcal{L}_{\mathrm{IL}}(\theta) = -\sum_{i=1}^N \log \pi_\theta(a_i\mid s_i). \]But this ignores the consequences of the agent's own actions: the agent never observes what happens after deviating from the expert, so errors compound under distribution shift at test time.
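To make the objective concrete, here is a minimal sketch of how \(\mathcal{L}_{\mathrm{IL}}\) reduces to a masked token-level cross-entropy when the policy is a causal LLM. The tensor shapes, the mask convention, and the assumption that logits are already aligned with their targets are ours, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def imitation_loss(logits: torch.Tensor,
                   target_ids: torch.Tensor,
                   action_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the expert-action tokens given the state prompt.

    logits:      (batch, seq_len, vocab) next-token logits from the policy LLM,
                 assumed already shifted so logits[:, t] predicts target_ids[:, t]
    target_ids:  (batch, seq_len) token ids of the serialized (state, expert action)
    action_mask: (batch, seq_len) 1.0 where a token belongs to the expert action a_i,
                 0.0 for state/prompt tokens, so only the action contributes to the loss
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -(token_ll * action_mask).sum() / action_mask.sum()
```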
Early Experience asks: if rewards aren’t available, can we still let the agent interact during training and convert what it observes into supervision?
Early Experience: core idea
Start from the expert dataset \(\mathcal{D}_{\text{expert}}\). For every expert state \(s_i\) we let the agent propose \(K\) alternative actions sampled from its current policy:
\[ \mathcal{A}_i=\{a_i^1,\dots,a_i^K\}. \]Execute those alternative actions in the environment to obtain resulting next states \(s_i^j \sim T(s_i,a_i^j)\). Collect the rollout triples:
\[ \mathcal{D}_{\text{rollout}}=\{(s_i,a_i^j,s_i^j)\mid i\in\{1,\dots,N\},\, j\in\{1,\dots,K\}\}. \]These triples encode grounded feedback: they show how the environment responds to off-expert actions (error messages, different pages, corrupted tool outputs, etc.). The paper explores two practical ways to turn these rollouts into training signals.
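Before looking at those two methods, here is a minimal sketch of the rollout-collection step itself. The `propose_actions` and `step_env` callables are hypothetical stand-ins for the policy's action sampler and the environment's transition function; the paper does not prescribe these interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RolloutTriple:
    state: str       # s_i: textual observation from an expert trajectory
    action: str      # a_i^j: an alternative action proposed by the current policy
    next_state: str  # s_i^j: the environment's response to that action

def collect_early_experience(
    expert_states: List[str],
    propose_actions: Callable[[str, int], List[str]],  # policy: (state, K) -> K candidate actions
    step_env: Callable[[str, str], str],               # environment: (state, action) -> next state
    k: int = 4,
) -> List[RolloutTriple]:
    """Build D_rollout by executing K policy-proposed actions at each expert state."""
    rollouts: List[RolloutTriple] = []
    for state in expert_states:
        for action in propose_actions(state, k):
            next_state = step_env(state, action)  # grounded consequence; no reward needed
            rollouts.append(RolloutTriple(state, action, next_state))
    return rollouts
```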
Figure 2: Two training paths built on top of expert trajectories. Left: Implicit World Modeling augments a policy by training it to predict resulting states from (state, action) pairs. Right: Self-Reflection uses alternative rollouts to generate natural-language explanations that contrast expert vs. alternative actions; those explanations become training targets.
Method A — Implicit World Modeling (IWM)
Key idea: train the same LLM-based policy to predict the next state textually given (state, action). Because states are already natural-language-like (web DOM summaries, tool outputs, textual scene descriptions), next-state prediction reduces to a familiar next-token prediction task.
For rollout triples \((s,a,s')\in\mathcal{D}_{\text{rollout}}\), the world-modeling objective is
\[ \mathcal{L}_{\mathrm{IWM}} = -\sum_{(s,a,s')\in\mathcal{D}_{\text{rollout}}}\log p_\theta(s'\mid s,a). \]Why this helps:
- The policy internalizes coarse transition dynamics: which actions tend to produce error messages, which change the page structure, and which progress the task.
- Training is lightweight — no separate simulator or planner — because the model itself learns to predict consequences as part of its parameters.
- In practice the authors use a two-stage pipeline: first train on the IWM objective to internalize dynamics, then fine-tune with the imitation loss \(\mathcal{L}_{\mathrm{IL}}\) so the policy remains grounded in expert behavior.
IWM is especially effective when state transitions are predictable and structured (e.g., transactional web flows, embodied simulators).
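As a sketch of how rollout triples might be serialized into IWM training examples (reusing the `RolloutTriple` records from the collection sketch above): the prompt template and the prompt/target split are our assumptions, not the paper's; the essential point is that the language-modeling loss falls only on the next-state tokens.

```python
from typing import Dict, List

def build_iwm_example(state: str, action: str, next_state: str) -> Dict[str, str]:
    """Format one (s, a, s') triple as a next-token-prediction example."""
    prompt = (
        "Observation:\n" + state + "\n\n"
        "Action taken:\n" + action + "\n\n"
        "Next observation:\n"
    )
    # Stage 1: apply the LM loss to the target (next-state) tokens only.
    # Stage 2: fine-tune on expert (state, action) pairs with the imitation loss.
    return {"prompt": prompt, "target": next_state}

def build_iwm_dataset(rollouts: List) -> List[Dict[str, str]]:
    # `rollouts` are the RolloutTriple records collected earlier.
    return [build_iwm_example(r.state, r.action, r.next_state) for r in rollouts]
```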
Method B — Self-Reflection (SR)
Key idea: use observed differences between expert and alternative outcomes to generate contrastive, human-readable rationales, and train the agent to predict those rationales together with the expert action.
Procedure:
- Execute the expert action \(a_i\) at \(s_i\) to get \(s_{i+1}\).
- Execute an alternative action \(a_i^j\) to get \(s_i^j\).
- Prompt a language model (often the same family of LLMs) to produce a chain-of-thought style contrastive explanation \(c_i^j\) that answers: “Why is the expert action \(a_i\) better than \(a_i^j\) given the observed outcomes?”
- Collect triplets \((s_i,a_i^j,c_i^j)\) and train the model to generate the concatenated target \(c_i^j \circ a_i\) conditioned on \(s_i\).
The training loss is:
\[ \mathcal{L}_{\mathrm{SR}} = -\sum_{(s_i,a_i^j,c_i^j)\in\mathcal{D}_{\mathrm{refl}}}\log p_\theta(c_i^j, a_i\mid s_i). \]Why this helps:
- The model learns why certain choices are preferable (the principles), not just what to do.
- Natural-language rationales teach transferable constraints (e.g., “respect budget”, “avoid malformed queries”) that generalize across contexts.
- Because rationales are grounded in actual outcomes \(s_{i+1}\) vs. \(s_i^j\), they avoid the pitfalls of ungrounded synthetic rationales that can hallucinate.
The authors use SR alongside the expert dataset: chain-of-thought targets are included when available from the expert data, and SR examples are mixed into fine-tuning.
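Below is a sketch of how one (expert, alternative) outcome pair could become a Self-Reflection training example. The `explain` callable stands in for whichever LLM drafts the rationale, and the prompt wording is illustrative rather than the paper's exact template.

```python
from typing import Callable, Dict

def build_sr_example(
    state: str,
    expert_action: str,
    expert_next_state: str,
    alt_action: str,
    alt_next_state: str,
    explain: Callable[[str], str],  # LLM call that writes the contrastive rationale c_i^j
) -> Dict[str, str]:
    """Turn grounded outcomes into a (state -> rationale + expert action) training pair."""
    reflection_prompt = (
        "State:\n" + state + "\n\n"
        "Expert action:\n" + expert_action + "\nled to:\n" + expert_next_state + "\n\n"
        "Alternative action:\n" + alt_action + "\nled to:\n" + alt_next_state + "\n\n"
        "Explain why the expert action is better, given these observed outcomes."
    )
    rationale = explain(reflection_prompt)
    # The training target is the rationale followed by the expert action,
    # conditioned on the state, matching L_SR above.
    return {"prompt": state, "target": rationale + "\n" + expert_action}
```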
How well does it work?
The paper runs a large-scale empirical study across eight diverse environments:
- Embodied and scientific simulators: ALFWorld, ScienceWorld
- Long-horizon planning: TravelPlanner
- Multi-turn tool use: BFCLv3, Tau-Bench
- Search and retrieval: SearchQA
- Web navigation: WebShop, WebArena-Lite
They evaluate multiple model families and sizes (Llama and Qwen variants, up to a 70B Llama) and compare three training setups:
- A prompted instruction-tuned model;
- Imitation learning (supervised fine-tuning on \(\mathcal{D}_{\text{expert}}\));
- The two early-experience variants: Implicit World Modeling (IWM) and Self-Reflection (SR).
Table 1 (overview of environments) and Table 2 (main results) from the paper are summarized below; the overall pattern is consistent.
Table 1: Benchmarks spanning web navigation, tool use, embodied tasks, planning, and retrieval.
Main takeaways
- Both IWM and SR consistently outperform imitation learning across environments and model sizes. Average absolute gains are substantial (the paper reports ~+9.6% success-rate improvements overall).
- IWM shines in environments with stable, predictable dynamics (e.g., WebShop, ALFWorld) where next-state prediction provides strong signals.
- SR yields especially large improvements on tasks requiring multi-step reasoning or constraint satisfaction (e.g., TravelPlanner, ScienceWorld) by teaching the model problem-solving principles.
- Early Experience improves out-of-domain robustness. In several OOD splits (ALFWorld, SearchQA) the gains from early experience were even larger than in-domain gains, indicating better generalization.
- Early Experience also provides a stronger warm start for RL. When subsequent reward-based fine-tuning (GRPO in the paper) is applied, checkpoints initialized from IWM/SR reach higher final performance than imitation-only starts.
Table 2: Summary of results across eight benchmarks (success rates or F1 as appropriate). IWM / SR consistently improve over imitation learning.
Table 3: Out-of-domain evaluations — early experience methods consistently recover a substantial portion of the OOD gap relative to imitation learning.
Figure 3: Reinforcement learning (GRPO) starting points. Models pre-trained with IWM or SR achieve higher ceilings after RL than imitation-only starts. Early experience gives a stronger initialization for reward-driven refinement.
Deeper analyses
Several ablations and analyses help illuminate when and why Early Experience works.
- Data efficiency: On WebShop and ALFWorld, models trained with early experience using only a fraction of expert trajectories often beat imitation-learning models trained on the full dataset. Early experience effectively multiplies the value of expert data.
- Branching factor \(K\) (how many alternative actions are executed per expert state): IWM benefits steadily from larger \(K\) (more diverse dynamics to learn). SR benefits at small-to-moderate \(K\) but can degrade if many alternatives are themselves successful — that reduces the contrast and makes it harder for the model to extract a clean “why” signal.
- Model scaling: The gains persist across model sizes, including large models (70B) with LoRA tuning. Early experience complements model capacity rather than substituting for it — bigger models still improve, and the advantage of IWM/SR remains.
- Grounded vs. ungrounded rationales: Generating rationales without executing alternative actions (i.e., ungrounded STaR-style rationales) performs worse or can even harm performance. The crucial ingredient is grounding the rationale in actual observed outcomes of actions.
Figure 4: (a) Success rate versus fraction of expert trajectories — early experience dominates at all data levels. (b) Success rate versus branching factor \(K\) — IWM benefits with higher \(K\); SR prefers moderate \(K\).
Figure 5: Performance across model scales on WebArena. Early experience retains its advantage across sizes, showing it scales with model capacity.
Where this fits in the training pipeline
Think of Early Experience as a mid-training bridge:
- Start with standard LLM pretraining and instruction tuning.
- Warm up with Early Experience (IWM and/or SR) using expert trajectories and agent-generated rollouts to internalize dynamics and reasoning principles.
- When and if a verifiable reward function becomes available, fine-tune with RL from the stronger Early-Experience-initialized checkpoint.
This sequence is practical and flexible: Early Experience requires no hand-crafted rewards, produces grounded supervision at scale, and improves both immediate policy performance and subsequent RL outcomes.
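For illustration, here is a rough sketch of that ordering under assumed stage interfaces (in practice the paper runs IWM as a two-stage warm-up and mixes SR examples into fine-tuning, so the strict split below is a simplification):

```python
from typing import Any, Callable, Optional, Sequence

def early_experience_pipeline(
    policy: Any,
    warmup_examples: Sequence,            # IWM and/or SR (prompt, target) pairs
    expert_pairs: Sequence,               # expert (state, action) pairs
    finetune: Callable[[Any, Sequence], Any],
    rl_finetune: Optional[Callable[[Any], Any]] = None,  # e.g., a GRPO loop, if rewards exist
) -> Any:
    """Ordering sketch only; stage internals are sketched earlier in this post."""
    policy = finetune(policy, warmup_examples)  # internalize dynamics / reasoning principles
    policy = finetune(policy, expert_pairs)     # stay grounded in expert behavior
    if rl_finetune is not None:
        policy = rl_finetune(policy)            # refine from the stronger initialization
    return policy
```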
Limitations and opportunities
The authors are careful to note limitations and open directions:
- Short-horizon focus: The current IWM and SR formulations operate on immediate next-state rollouts. Extending these ideas to address long-horizon credit assignment without rewards (e.g., aggregating long chains of consequences) is a clear next step.
- Computational cost: Generating rollouts requires executing agent proposals in the environment. For very expensive environments, rollout collection may be costly; however, many web and simulated environments are efficient enough to amortize this cost.
- Quality of generated rationales: SR depends on the language model producing clear, faithful contrastive explanations. Low-quality or misleading explanations can limit benefit; the paper uses filtering and canonicalization to mitigate this.
- Safety and distribution of rollouts: In real-world deployments, unconstrained exploration could produce undesirable or unsafe actions. Practical systems will need guardrails (admissible-action constraints, safety filters) when collecting early experience.
Future avenues include combining early experience with richer self-supervised objectives, cross-environment transfer of learned dynamics/rules, and continual online learning where agents keep harvesting experience from deployment.
Takeaway
Early Experience is a pragmatic and effective step toward agents that truly learn from acting. By harvesting the outcomes of the agent’s own decisions and turning those observations into prediction and explanation targets, the methods studied here:
- provide scalable, reward-free supervision;
- teach agents both what to do and why;
- improve in-domain success and out-of-domain robustness; and
- produce a stronger initialization for later RL when reward signals become available.
If you care about making language agents more robust, more generalizable, and more capable of learning from interaction, Early Experience is a simple, well-founded, and empirically validated idea worth adding to the training toolbox.
Paper: “Agent Learning via Early Experience” — Kai Zhang et al. (Meta & The Ohio State University).