Imagine asking a robot to “heat up a cup of coffee.” To you, this is a simple request. To a robot (or an embodied AI agent), this is a logistical nightmare. It involves navigation, object detection, grasping, opening a microwave, and understanding the concept of “heating.”

Large Language Models (LLMs) like GPT-4 or Llama have shown incredible reasoning capabilities, but applying them to long-horizon physical tasks remains a massive hurdle. The standard approach requires feeding the model thousands of human-annotated examples of exactly how to perform a task. But human annotation is slow, expensive, and unscalable.

What if the agent could learn from the environment itself? What if it could try things out, see what works, and teach itself the difference between a good action and a bad one?

In this post, we are diving into a research paper titled “EPO: Hierarchical LLM Agents with Environment Preference Optimization.” The researchers propose a framework that not only breaks complex tasks down into manageable chunks but also introduces a novel training method—Environment Preference Optimization (EPO)—that allows agents to learn from unannotated data by interpreting feedback from the world around them.

Let’s unpack how this works, from the architectural hierarchy to the mathematical loss functions that make it possible.


The Problem: Long Horizons and Scarce Data

There are two primary challenges in building LLM-based agents for complex environments (like a household simulator):

  1. The Planning Horizon: LLMs are fundamentally next-token predictors. Maintaining a coherent plan over hundreds of steps (e.g., finding an apple, washing it, putting it on a table) is difficult. The model often “forgets” the broader goal or gets stuck on low-level details.
  2. The Data Bottleneck: Supervised fine-tuning (SFT) is the standard way to train these agents. You show the model a task and the exact sequence of actions to solve it. However, obtaining “internet-scale” data for robot actions is impossible compared to the text data available for chatbots. We have plenty of tasks (instructions), but very few solutions (ground-truth action sequences).

The authors of this paper tackle the first problem with a Hierarchical Framework and the second with EPO.


Part 1: The Hierarchical Framework

To solve the planning problem, the researchers stop trying to make one single LLM do everything. Instead, they adopt a “Divide and Conquer” strategy. They split the agent into two distinct modules, both powered by LLMs (specifically Llama-2 in this study).

Module 1: The High-Level Planner (Subgoal Decomposition)

Think of this module as the “Manager.” It doesn’t care about how to move the robot’s joints or navigate around a chair. Its job is to look at the high-level human instruction and the visual environment, then break the task into a sequence of logical subgoals.

For example, if the instruction is “Heat the cup,” the Manager might output:

  1. Find Cup
  2. Pickup Cup
  3. Find Microwave
  4. Heat Cup

Module 2: The Low-Level Actor (Interaction)

This module is the “Worker.” It receives a specific subgoal from the Manager (e.g., Heat Cup) and figures out the exact low-level actions required to execute it. This involves checking if the agent is holding the cup, navigating to the microwave, opening the door, and placing the object inside.
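
To make the division of labor concrete, here is a minimal sketch of how the two modules might be prompted. The function names, prompt templates, and the `llm` text-completion callable are illustrative assumptions, not the paper’s exact setup.

```python
def plan_subgoal(llm, instruction: str, symbols: str, done_subgoals: list) -> str:
    """High-level "Manager": map the instruction and a symbolic observation
    to the next subgoal, e.g. "Heat Cup". (Hypothetical prompt format.)"""
    prompt = (
        f"Task: {instruction}\n"
        f"Observed: {symbols}\n"
        f"Completed subgoals: {', '.join(done_subgoals) or 'none'}\n"
        "Next subgoal:"
    )
    return llm(prompt).strip()


def plan_action(llm, subgoal: str, symbols: str) -> str:
    """Low-level "Worker": map the current subgoal and observation to one
    atomic action, e.g. "Open Microwave". (Hypothetical prompt format.)"""
    prompt = (
        f"Subgoal: {subgoal}\n"
        f"Observed: {symbols}\n"
        "Next action:"
    )
    return llm(prompt).strip()
```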

Visualizing the Architecture

As shown in Figure 1, the flow of information is structured and cyclical.

An illustration of the hierarchical framework showing the flow from visual input to high-level policy (subgoals) and then low-level policy (actions).

  1. Input: The agent receives a visual frame (e.g., a kitchen counter) and a task instruction.
  2. Symbolic Representation: The visual input is converted into text (e.g., “observed: microwave, cup”).
  3. High-Level Policy: The LLM predicts the next subgoal (Heat Cup).
  4. Low-Level Policy: A second LLM takes that subgoal and generates an atomic action (Pickup, Open, etc.).
  5. Environment: The action is executed, the environment changes, and the loop repeats.

This hierarchy simplifies the problem. The High-Level policy only needs to plan a few steps ahead in terms of subgoals, while the Low-Level policy only needs to worry about the immediate few seconds of interaction.
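
Putting it together, the control loop itself is short. Below is a sketch under assumed interfaces (an `env` with `reset`/`step`, a `to_symbols` vision-to-text converter, and the two planner functions sketched above); it is not the authors’ code.

```python
def run_episode(env, llm, instruction: str, to_symbols, max_steps: int = 100):
    """Hierarchical loop: observe, plan a subgoal, pick an action, act, repeat."""
    obs = env.reset()                                  # initial visual frame
    done_subgoals = []
    task_done = False
    for _ in range(max_steps):
        symbols = to_symbols(obs)                      # e.g. "observed: microwave, cup"
        subgoal = plan_subgoal(llm, instruction, symbols, done_subgoals)
        action = plan_action(llm, subgoal, symbols)    # e.g. "Open Microwave"
        obs, success, task_done = env.step(action)     # assumed step() signature
        if success and subgoal not in done_subgoals:
            done_subgoals.append(subgoal)              # track completed subgoals
        if task_done:
            break
    return task_done
```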


Part 2: Environment Preference Optimization (EPO)

The hierarchy solves the planning issue, but what about the data shortage? This is where the paper makes its most significant contribution. The authors propose a way to train these agents using unannotated datasets—tasks where we know the goal, but we don’t have a human-labeled guide on how to solve it.

They achieve this by leveraging multimodal environment feedback to create their own training signals.

Step 1: The Reward Model

To learn without human labels, the agent needs a way to judge its own performance. The researchers train a Reward Model (also an LLM) to act as a critic.

The Reward Model takes three inputs:

  1. Feedback (\(F\)): Visual data (what objects are visible?) and interaction status (did the last action succeed or fail?).
  2. Task (\(T\)): The instruction (e.g., “Pick up the apple”).
  3. Predicted Output (\(P\)): What the agent wants to do (e.g., “Pickup Apple”).

The model outputs a scalar reward score, \(\hat{r}\), representing how “correct” the proposed action is given the feedback and the task.

Equation for the reward model taking Feedback, Task, and Prediction as inputs.
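
Written out with the variables above (the symbols here are a reconstruction from the description, not necessarily the paper’s exact notation), the critic is just a learned scoring function:

\[ \hat{r} = \mathcal{R}_{\phi}(F, T, P), \]

where \(\phi\) denotes the reward model’s parameters.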

If the vision system sees an apple and the agent proposes “Pickup Apple,” the reward is high. If the agent proposes “Pickup Chair” (which isn’t interactable or relevant), the reward is low.

Step 2: Generating Preference Data

Once the Reward Model is trained (on a small set of labeled data), it can be used to label a massive amount of unlabeled data.

Here is the pipeline illustrated in Figure 2:

A diagram of the pipeline: Training a reward model on annotated data, using it to rank outputs on unannotated data, and creating an EPO dataset.

  1. Inference: On unannotated tasks, the agent proposes multiple potential subgoals or actions.
  2. Ranking: The Reward Model evaluates these proposals.
  • Option A: “Pickup Statue” \(\to\) Reward: 1.0 (High alignment with environment)
  • Option B: “Pickup Dog” \(\to\) Reward: 0.0 (The “dog” is just a statue, or doesn’t exist)
  3. Dataset Creation: This creates a Preference Dataset. We now know that Option A is preferred over Option B (\(p_w \succ p_l\)), even though a human never told us that.
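
Here is a minimal sketch of that self-labeling step. The `agent.sample` and `reward_model(...)` interfaces are assumptions made for illustration; the essential idea is simply “sample several candidates, score them, keep the best and worst as a preference pair.”

```python
def build_preference_pairs(tasks, agent, reward_model, n_samples: int = 4):
    """Turn unannotated tasks into (winner, loser) pairs using the learned
    reward model as the judge. Interfaces are hypothetical."""
    pairs = []
    for task in tasks:
        feedback = task.feedback                  # visual + interaction status (F)
        candidates = [agent.sample(task.instruction, feedback)
                      for _ in range(n_samples)]  # several proposed outputs (P)
        scored = sorted(candidates,
                        key=lambda p: reward_model(feedback, task.instruction, p),
                        reverse=True)
        p_w, p_l = scored[0], scored[-1]          # highest- vs. lowest-reward proposal
        if p_w != p_l:                            # skip degenerate ties
            pairs.append((task.instruction, feedback, p_w, p_l))
    return pairs
```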

Step 3: The EPO Loss Function

Now that we have pairs of “better” (\(p_w\)) and “worse” (\(p_l\)) actions, how do we train the agent? The authors adapt Direct Preference Optimization (DPO).

DPO is a technique used to align LLMs (like ChatGPT) with human preferences without needing a complex Reinforcement Learning loop. However, standard DPO only enforces “soft,” relative alignment: it asks the model to score the winner higher than the loser, without guaranteeing that the winner’s exact tokens become likely. In robotics, we need “hard” alignment; the agent must actually generate the specific, correct command tokens to function.

The researchers introduce the EPO Loss function, which combines standard DPO with a token-level alignment constraint.

The total EPO loss function combining probability maximization and the DPO term.
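
Reconstructed from the breakdown below (the paper’s exact form may include additional weighting coefficients), the objective looks roughly like:

\[ \mathcal{L}_{\text{EPO}} \;=\; -\,p_w \log \pi_\theta(\hat{p} \mid T) \;+\; \mathcal{L}_D \]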

This equation has two main parts:

  1. \(-p_w \log(\pi_\theta(\hat{p} | T))\): This term forces the model to increase the probability of generating the winning action (hard alignment).
  2. \(\mathcal{L}_D\): This is the preference optimization term (derived from DPO).

The \(\mathcal{L}_D\) term specifically looks like this:

The DPO component of the loss function, measuring the log ratio between chosen and rejected responses.
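
For reference, the standard DPO objective that this term derives from has the following shape, where \(\pi_{\text{ref}}\) is a frozen reference policy, \(\beta\) controls the strength of the preference margin, and \(\sigma\) is the logistic sigmoid (the paper’s version may differ in details):

\[ \mathcal{L}_D = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(p_w \mid T)}{\pi_{\text{ref}}(p_w \mid T)} \;-\; \beta \log \frac{\pi_\theta(p_l \mid T)}{\pi_{\text{ref}}(p_l \mid T)}\right) \]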

By minimizing this combined loss, the agent learns to distinguish between good and bad actions based on the environment’s feedback, effectively teaching itself from the unlabeled data.
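
In code, that combination is only a few lines. The sketch below uses PyTorch and assumes the per-sequence log-probabilities (summed over output tokens) have already been computed for the winner and loser under both the trainable policy and a frozen reference model; it is an illustration of the idea, not the authors’ implementation.

```python
import torch.nn.functional as F

def epo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """EPO-style objective: cross-entropy on the preferred output ("hard"
    alignment) plus a DPO preference term against a frozen reference."""
    # Hard alignment: make the winning command tokens themselves more likely.
    hard_term = -logp_w
    # Preference term: widen the policy's winner-vs-loser margin relative
    # to the reference model (standard DPO form).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_term = -F.logsigmoid(margin)
    return hard_term + dpo_term
```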


Experiments and Results

The researchers evaluated their framework on ALFRED, a rigorous benchmark for household instruction following. ALFRED tasks are long and require understanding natural language, navigating through rooms, and interacting with objects (e.g., “Rinse off a mug and place it in the coffee maker”).

State-of-the-Art Performance

The results were impressive. As shown in Table 1, the EPO framework significantly outperformed previous state-of-the-art methods.

Comparison table showing EPO achieving state-of-the-art results on the ALFRED benchmark.

On “Unseen” tasks (environments the agent had never been in during training), EPO achieved a success rate of 62.35%, compared to the previous best of roughly 50%. This suggests that the model isn’t just memorizing rooms; it’s learning robust, generalizable decision-making policies.

The Power of Unannotated Data

Perhaps the most interesting finding is how EPO utilizes data. The authors compared Supervised Fine-Tuning (SFT) against EPO using different splits of data.

Looking at Table 2, we see a clear trend:

Table comparing SFT and EPO across different data splits, showing EPO’s superiority with unannotated data.

Even when 90% of the data was unannotated (the “10/90” split), EPO (a 50.91% success rate) significantly outperformed SFT (46.89%). This validates the core hypothesis: we can improve agents by using the environment to label the data for us.

Qualitative Improvements

What does this improvement look like in practice? Figure 3 provides a visual comparison.

Visual comparison of Baseline vs. EPO policies. The EPO agent correctly identifies the ‘mug’ instead of a generic ‘cup’ and adjusts its pose to interact successfully.

  • Left (Subgoal Level): The baseline agent sees a white object and vaguely predicts “Pickup Cup.” This fails because the environment actually contains a mug, so the predicted object class never matches anything interactable. The EPO agent, trained on environment feedback, correctly identifies the subgoal as “Pickup Mug,” matching the object the environment actually exposes.
  • Right (Action Level): The baseline tries to put a pen on a desk but fails because it is standing too far away or at a bad angle. The EPO agent learns to perform a pose adjustment (LookDown) before attempting the action, leading to success. This subtle pose-adjustment behavior is exactly the kind of nuance that is hard to hand-code but easy to learn through preference optimization.

Why This Matters

The EPO paper represents a significant step forward for Embodied AI for several reasons:

  1. Breaking the Dependency on Labels: By creating a mechanism to learn from unannotated data, the framework opens the door to training on vastly larger datasets. We can potentially unleash agents into simulators, let them attempt millions of tasks, and have them self-improve using EPO.
  2. Bridging High-Level and Low-Level: The hierarchical structure successfully marries the reasoning power of LLMs with the granular control required for robotics.
  3. Multimodal Feedback as a Teacher: It treats the environment not just as a place to act, but as a source of supervision. Visual signals and interaction results become the “teacher.”

Conclusion

“EPO: Hierarchical LLM Agents with Environment Preference Optimization” demonstrates that we don’t always need a human over the robot’s shoulder telling it exactly what to do. By equipping agents with a hierarchical brain and a method to derive preferences from the environment, we can build systems that plan better, act more precisely, and learn more efficiently.

As we move toward general-purpose service robots, techniques like EPO that maximize data efficiency and self-correction will likely become the standard for training the next generation of embodied intelligence.