Imagine you have trained a robot to push a bottle across a table. It does the job perfectly—the bottle gets from point A to point B. However, there is a catch: the robot pushes the bottle from the very top, making it wobble dangerously. As a human observer, you want to teach the robot a preference: “Push the bottle, but please push it from the bottom so it’s stable.”
This is a classic problem in Interactive Imitation Learning (IIL). You don’t want to reprogram the robot from scratch; you just want to tweak its behavior through interventions. You watch the robot, and whenever it reaches for the top of the bottle, you take control and move its hand to the bottom.
The problem? Most current algorithms are incredibly inefficient at this. They treat every intervention as a brand-new lesson, often ignoring everything the robot already knew about the task (like how to move its arm or where the goal is). This leads to “catastrophic forgetting,” where the robot unlearns its basic skills while trying to satisfy your new preference.
In this post, we are diving deep into MEREQ (Max-Ent Residual-Q Inverse RL), a novel approach presented at CoRL 2025. This method respects the robot’s prior knowledge and learns only the difference (the residual) between what the robot was doing and what the human wants. The result? Drastically faster alignment with significantly less effort from the human teacher.
The Problem: Why is Teaching Robots So Hard?
In a standard “Human-in-the-Loop” setup, a human expert observes a policy (the robot’s brain) executing a task. When the robot deviates from the human’s preference, the human intervenes, taking over control.
Traditional approaches usually fall into two buckets:
- Behavior Cloning (BC): The robot simply mimics the human’s corrective actions. This often fails because it ignores the sequential nature of the task: small errors early on compound into much bigger failures later (compounding errors).
- Inverse Reinforcement Learning (IRL): The robot tries to infer the “reward function” (the hidden goal) inside the human’s head.
The issue with standard IRL in this context is that it tries to infer the entire reward function from scratch. If the robot already knows how to navigate a highway but simply needs to learn “stay in the right lane,” relearning the concept of “driving” and “avoiding collisions” just to learn “lane preference” is a waste of data. This inefficiency means the human has to intervene hundreds of times, which is exhausting.
The Solution: MEREQ
The core insight of MEREQ is simple but powerful: Don’t relearn the whole reward. Just learn the residual.
MEREQ assumes the robot has a Prior Policy (\(\pi\)) that is already good at the base task (optimizing a reward \(r\)). The human has a different internal reward function (\(r_{expert}\)). MEREQ aims to find the Residual Reward (\(r_R\)) such that:
\[r_{expert} \approx r + r_R\]

By focusing only on \(r_R\)—the discrepancy between the robot’s prior behavior and the human’s preference—the algorithm can leverage the robot’s existing skills while fine-tuning the behavior with very few samples.

As shown in Figure 1 above, the workflow is a loop (sketched in code below):
- The robot executes its current policy.
- The human intervenes if necessary (providing “Bad” samples vs. “Good” samples).
- MEREQ uses Inverse RL to figure out the residual reward (\(r_R\)) that explains the human’s corrections.
- MEREQ uses Residual Q-Learning (RQL) to update the policy to maximize this new combined reward.
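To make the loop concrete, here is a minimal Python sketch of how the pieces fit together. The callables passed in (`rollout_with_interventions`, `fit_residual_reward`, `residual_q_update`) and the `feature_dim` argument are hypothetical placeholders for the three stages above, not the authors’ implementation.

```python
import numpy as np

def mereq_style_loop(prior_policy, rollout_with_interventions,
                     fit_residual_reward, residual_q_update,
                     feature_dim, n_rounds=10):
    """High-level sketch of the interaction loop; the callables are placeholders
    for the three stages described above, not the authors' implementation."""
    theta_R = np.zeros(feature_dim)          # residual reward weights, start at zero
    policy = prior_policy                    # start from the prior policy
    expert_segments, pseudo_expert_segments = [], []

    for _ in range(n_rounds):
        # 1. Robot acts; the human takes over whenever it violates the preference.
        human_segs, robot_segs = rollout_with_interventions(policy)
        expert_segments += human_segs                # human-controlled corrections
        pseudo_expert_segments += robot_segs         # "good enough" robot behavior

        # 2. Inverse RL: update the residual reward that explains the corrections.
        theta_R = fit_residual_reward(theta_R, expert_segments,
                                      pseudo_expert_segments, policy)

        # 3. Residual Q-learning: adapt the policy to (prior reward + residual).
        policy = residual_q_update(prior_policy, theta_R)

    return policy, theta_R
```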
Deep Dive: How MEREQ Works
Let’s break down the mathematics and architecture that make this possible. The method relies on two foundational pillars: Maximum Entropy IRL and Residual Q-Learning.
1. The Probabilistic Model (MaxEnt IRL)
To understand what the human wants, MEREQ uses the Maximum Entropy principle. It assumes that humans are “Boltzmann Rational”—meaning they are exponentially more likely to choose actions that lead to higher rewards, but there is some randomness involved.
The probability of seeing a specific trajectory of actions (\(\zeta\)) given a reward parameter (\(\theta\)) is:
\[P(\zeta \mid \theta) = \frac{1}{Z(\theta)} \exp\big(\theta^{\top} \mathbf{f}(\zeta)\big)\]

where \(Z(\theta)\) is the partition function that normalizes this exponential over all possible trajectories.
Here, \(\mathbf{f}(\zeta)\) represents the features of the trajectory (e.g., speed, distance to obstacle). Standard IRL tries to find the weights (\(\theta^*\)) that maximize the likelihood of the expert’s demonstrations:
\[\theta^{*} = \arg\max_{\theta} \sum_{\zeta \in \mathcal{D}} \log P(\zeta \mid \theta)\]

where \(\mathcal{D}\) is the set of expert demonstrations.
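To make the Boltzmann-rationality idea concrete, here is a toy NumPy example that scores a handful of candidate trajectories with a linear reward \(\theta^{\top}\mathbf{f}(\zeta)\) and normalizes over that finite set (a stand-in for the full partition function \(Z(\theta)\)). All feature values and weights are invented for illustration.

```python
import numpy as np

# Three candidate trajectories described by two features (speed, clearance).
features = np.array([
    [1.0, 0.2],   # fast but passes close to the obstacle
    [0.6, 0.8],   # moderate speed, keeps its distance
    [0.2, 0.9],   # slow and far away
])
theta = np.array([0.5, 2.0])   # hypothetical weights: clearance matters more than speed

logits = features @ theta                  # theta^T f(zeta) for each trajectory
weights = np.exp(logits - logits.max())    # Boltzmann weights (numerically stabilized)
probs = weights / weights.sum()            # normalizing plays the role of 1/Z(theta)

print(probs)   # higher-reward trajectories are exponentially more likely
```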
2. Learning the Residual Reward
This is where MEREQ innovates. Instead of solving for the full reward weights \(\theta\), it solves for the Residual weights \(\theta_R\).
The objective function changes. We are now trying to maximize the likelihood of the expert’s data assuming the total reward is the sum of the known prior reward \(r\) and the unknown residual reward. The loss function becomes:
\[\mathcal{L}(\theta_R) = -\frac{1}{|\mathcal{D}|}\sum_{\zeta \in \mathcal{D}} \log P(\zeta \mid \theta_R), \qquad P(\zeta \mid \theta_R) \propto \exp\big(r(\zeta) + \theta_R^{\top} \mathbf{f}(\zeta)\big)\]

where \(r(\zeta)\) is the prior reward accumulated along the trajectory \(\zeta\).
Calculating the gradient of this loss function allows the system to update the residual weights. The gradient calculation involves comparing the expert’s feature counts against the expected feature counts of the current policy.
\[\nabla_{\theta_R} \mathcal{L}(\theta_R) = -\frac{1}{|\mathcal{D}|}\sum_{\zeta \in \mathcal{D}} \mathbf{f}(\zeta) \;+\; \mathbb{E}_{\zeta \sim \hat{\pi}}\big[\mathbf{f}(\zeta)\big]\]
To make this practical, MEREQ approximates the expectation in the second term by rolling out the current policy \(\hat{\pi}\) in a simulator. This “imagination” step allows the robot to understand how its current behavior differs from the expert’s interventions.
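Under the linear-reward model, a single update of the residual weights reduces to feature matching: nudge \(\theta_R\) so that the policy’s average features move toward the expert’s. A minimal sketch, assuming trajectory features have already been extracted (illustrative code, not the paper’s implementation):

```python
import numpy as np

def residual_irl_step(theta_R, expert_features, rollout_features, lr=0.1):
    """One feature-matching update of the residual weights (illustrative only).

    expert_features:  (N, d) trajectory features from human-corrected segments
    rollout_features: (M, d) trajectory features from rolling out the current policy
    """
    expert_mean = expert_features.mean(axis=0)    # empirical expert feature counts
    policy_mean = rollout_features.mean(axis=0)   # Monte-Carlo estimate of E_{zeta ~ pi_hat}[f]

    # Descend the loss above: move theta_R so the policy's features match the expert's.
    grad_loss = policy_mean - expert_mean
    return theta_R - lr * grad_loss
```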

3. Updating the Policy with Residual Q-Learning (RQL)
Once the algorithm has estimated the residual reward (\(r_R\)), it needs to update the robot’s brain (the policy) to act on it. Re-training a Reinforcement Learning (RL) agent from scratch would be slow.
Instead, MEREQ uses Residual Q-Learning. Standard RL learns a Q-function (which predicts future rewards). RQL learns a residual Q-function (\(Q_R\)) that sits on top of the prior Q-function.
Schematically, the update looks like this:

\[\big(Q + Q_R\big)(s, a) \;\leftarrow\; r(s, a) + r_R(s, a) + \gamma\, \mathbb{E}_{s'}\big[V(s')\big]\]

where \(Q\) is the frozen prior Q-function and \(V(s')\) is the (soft) value of the next state under the combined Q-function.
This equation essentially says: “The value of an action is the residual reward plus the prior reward plus the expected future value.” By keeping the prior reward \(r\) in the loop, the robot retains its original capabilities while adapting to the new preference.
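To illustrate the principle, here is a tabular sketch that learns a residual Q-table on top of a frozen prior Q-table by running soft value iteration on the combined reward \(r + r_R\). It assumes known tabular rewards and transitions, so it is a simplified stand-in for the paper’s update rule, not a reproduction of it.

```python
import numpy as np

def residual_soft_q(Q_prior, r, r_R, P, gamma=0.95, alpha=0.1, n_iters=500):
    """Tabular residual soft-Q sketch: learn Q_R on top of a frozen prior Q-table.

    Q_prior: (S, A) Q-table of the prior policy (kept frozen)
    r, r_R:  (S, A) prior and residual reward tables
    P:       (S, A, S) transition probabilities
    """
    Q_R = np.zeros_like(Q_prior)                 # residual Q-function
    for _ in range(n_iters):
        Q_total = Q_prior + Q_R                  # the combined value the robot acts on
        # Soft state value: V(s) = alpha * log sum_a exp(Q_total(s, a) / alpha)
        m = Q_total.max(axis=1, keepdims=True)
        V = (m + alpha * np.log(np.exp((Q_total - m) / alpha)
                                .sum(axis=1, keepdims=True))).ravel()
        # Bellman target under the combined reward r + r_R
        target = r + r_R + gamma * (P @ V)       # shape (S, A)
        Q_R = target - Q_prior                   # Q_prior stays fixed; only Q_R moves
    return Q_R
```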
4. The “Pseudo-Expert” Trick
A clever heuristic MEREQ uses is the concept of Pseudo-Expert Trajectories.
When a human doesn’t intervene, it usually means the robot is doing a good job. Many algorithms throw this data away, focusing only on the moments the human grabbed the controller. MEREQ looks at the “Good-Enough” segments (where the human stayed silent) and treats them as expert demonstrations. This stabilizes the learning process significantly because it prevents the robot from over-correcting based only on failure data.
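In code, the trick is essentially a labeling step. A hypothetical sketch: each rollout is split into human-corrected segments (treated as expert data) and untouched segments (treated as pseudo-expert data).

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    human_controlled: bool = False   # True if the human had taken over here

def split_rollout(segments):
    """Label a rollout: interventions become expert data, the rest pseudo-expert."""
    expert = [s for s in segments if s.human_controlled]
    pseudo_expert = [s for s in segments if not s.human_controlled]
    return expert, pseudo_expert
```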
Experiments and Results
Does it actually work? The researchers tested MEREQ in both simulation and the real world across several tasks.
The Tasks
- Highway Driving: The robot knows how to drive but needs to learn a preference for the right-most lane.
- Bottle Pushing: The robot pushes a bottle but needs to learn to push from the bottom (to prevent tipping).
- Pillow Grasping: The robot needs to learn to grasp a pillow from the center.
Figure 2: The Highway-Sim task. The goal is to keep the car in the right-most lane.
Sample Efficiency
The primary metric was Sample Efficiency: How many expert interventions are needed to reach a low intervention rate (i.e., the robot acts correctly on its own)?

As shown in the graphs above, MEREQ (red) reduces the intervention rate much faster than baselines such as plain MaxEnt IRL (green) or interactive behavior-cloning methods like HG-DAgger (purple).
- Top Graph: The intervention rate for MEREQ plummets quickly, meaning the human can stop correcting the robot much sooner.
- Bottom Graph: The total number of samples required to reach success is significantly lower for MEREQ across different difficulty thresholds (\(\delta\)).
Real-World Human Effort
Simulations are fine, but what about real humans? The researchers ran “Human-in-the-Loop” experiments where real people controlled a robot arm using a 3D mouse.

The results were stark. MEREQ required significantly less effort from the human teachers.

In Figure 4 (above), you can see the “Expert Intervention Rate” over time. In the Highway and Bottle Pushing tasks, MEREQ approaches near-zero interventions much faster than the alternatives. This means the human teacher spends less time fighting the robot and more time watching it succeed.
Qualitative Success
What does this alignment look like in practice?

The “Before” images show the Prior Policy failing—knocking the bottle over or missing the pillow grasp. The “After” images show the MEREQ-trained policy successfully executing the task with the human’s preferences (low contact point on the bottle, centered grasp on the pillow).
Conclusion & Why This Matters
MEREQ represents a significant step forward in making robots teachable. By acknowledging that robots often come with prior knowledge, and focusing the learning process only on the residual difference between that knowledge and human preference, we can:
- Reduce Human Burden: Teachers don’t need to provide thousands of corrections.
- Preserve Basic Skills: Robots don’t forget how to move or avoid obstacles while learning new preferences.
- Speed Up Deployment: Robots can be fine-tuned for specific user needs in minutes rather than hours.
The paper notes that future work involves handling “noisy” humans (who might give inconsistent corrections) and moving beyond linear reward functions. But for now, MEREQ offers a compelling blueprint for the future of interactive robot learning: Don’t relearn what you already know; just learn what’s missing.