Imagine a waiter carrying a tray of drinks through a crowded restaurant. To succeed, they must perform two distinct tasks simultaneously: they must navigate the room with their legs (locomotion) while keeping their hand perfectly level to avoid spilling the drinks (manipulation).

For humans, this coordination is second nature. For robots, specifically quadrupedal robots equipped with robotic arms, this is a profound engineering challenge. This domain is known as Loco-Manipulation.

The core problem lies in conflicting objectives. Efficient walking often requires a stable, horizontal base. However, to reach a specific object or keep a hand steady, the robot might need to tilt, twist, or lunge its body. When we train robots using Reinforcement Learning (RL), these conflicting goals often confuse the agent. Should it prioritize not falling over, or should it prioritize the precise movement of the arm?

In the paper “Multi-critic Learning for Whole-body End-effector Twist Tracking,” researchers from ETH Zurich and ANYbotics AG propose a novel framework to solve this. They introduce a Multi-Critic architecture that decouples these conflicting rewards, paired with a Twist-based control scheme that ensures smooth motion.

Figure 1: A quadruped robot performing whole-body manipulation, locomotion with end-effector control, and chicken-head stabilization.

As shown in Figure 1, this approach enables a robot to perform coordinated whole-body behaviors, maintain precise control while walking, and even perform “chicken head” stabilization (keeping the hand fixed in space while the body moves underneath it).

In this post, we will deconstruct how they achieved this, exploring the limitations of current methods and the specific architectural innovations that allow for such fluid motion.


The Core Conflicts in Whole-Body Control

To understand why this paper is significant, we first need to understand the limitations of standard Reinforcement Learning (RL) in robotics.

In a typical RL setup for a legged robot, we define a “Reward Function.” This function is a mathematical score that tells the robot how well it is doing. If we want the robot to walk and move its arm, we usually sum up the rewards:

\[ \text{Total Reward} = (\text{Walk Reward}) + (\text{Arm Reward}) \]
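
In code, this weighted-sum approach looks something like the minimal sketch below (illustrative names, not the paper's actual reward terms); in practice each term also carries a hand-tuned weight, which is exactly where the trouble starts:

```python
# Minimal sketch of reward scalarization (illustrative names, not the paper's terms).
def scalarized_reward(walk_reward: float, arm_reward: float,
                      w_walk: float = 1.0, w_arm: float = 1.0) -> float:
    # The critic only ever sees this single number, so credit for the two
    # sub-tasks is blended together before any learning happens.
    return w_walk * walk_reward + w_arm * arm_reward
```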

This “scalarization” of rewards creates a tug-of-war.

  1. The Stability Conflict: Locomotion policies usually learn to keep the base flat and horizontal to save energy and stay stable. Manipulation policies often require the base to act as a lever or an extension of the arm, tilting to extend reachability.
  2. The Tuning Nightmare: Balancing the weights between these rewards is difficult. If the “Walk Reward” is too high, the robot ignores the arm commands to ensure it walks perfectly. If the “Arm Reward” is too high, the robot might fling its body to reach a target, causing it to fall over.

The Problem with Pose Tracking

Furthermore, most current approaches use Pose Tracking. The robot is given a target position (\(x, y, z\)) and orientation. It tries to minimize the error between its current pose and the target pose.

While this sounds logical, it lacks information about how to get there. It doesn’t specify velocity. This often results in jerky, stiff motions because the robot tries to “teleport” its end-effector to the target state at every timestep, rather than following a smooth trajectory.
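
As a rough illustration, a typical pose-tracking reward might look like the sketch below (an assumed form, not the paper's formulation): the score depends only on where the hand currently is, never on how fast it should be moving along the way.

```python
import numpy as np

# A common pose-tracking reward shape (assumed for illustration):
# reward is high when the end-effector is near the goal position,
# with no notion of a desired velocity profile.
def pose_tracking_reward(p_ee: np.ndarray, p_goal: np.ndarray,
                         sigma: float = 0.2) -> float:
    position_error = np.linalg.norm(p_ee - p_goal)   # metres from the target
    return float(np.exp(-(position_error / sigma) ** 2))
```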


The Solution: A Multi-Critic Architecture

The researchers propose a solution that fundamentally changes how the robot evaluates its own performance. Instead of collapsing all objectives into a single scalar reward, they separate the learning process using a Multi-Critic Actor Architecture.

The Teacher-Student Pipeline

The overall system uses a teacher-student training pipeline, a common technique for transferring robust policies from simulation to real hardware. A “Teacher” policy is first trained in simulation with access to “privileged information” (perfect knowledge of ground friction, exact masses, and external forces). Once the Teacher has mastered the task, it supervises a “Student” policy that relies only on data available from the real robot’s sensors (proprioception).
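
A minimal sketch of one distillation step, assuming a simple supervised (behavior-cloning) loss and hypothetical observation splits, might look like this:

```python
import torch
import torch.nn as nn

# Sketch of one student-distillation step (hypothetical shapes and names).
# The teacher acts on privileged simulation state; the student only sees
# proprioceptive observations that exist on the real robot.
def distillation_step(teacher: nn.Module, student: nn.Module,
                      privileged_obs: torch.Tensor, proprio_obs: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    with torch.no_grad():
        target_actions = teacher(privileged_obs)   # expert actions from privileged info
    predicted_actions = student(proprio_obs)       # student acts from onboard sensing only
    loss = nn.functional.mse_loss(predicted_actions, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```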

Figure 2: Diagram showing the policy optimization loop with command generation, the simulator, and the multi-critic network.

Figure 2 illustrates this architecture. The key innovation is on the right side of the diagram: the Multi-critic network.

Decomposing the Critic

In standard Actor-Critic RL, the “Actor” decides what to do (move joints), and the “Critic” predicts how much reward that action will yield.

In this work, the researchers split the Critic into three distinct networks, each responsible for a specific aspect of the task:

  1. Locomotion Critic (\(V_{loco}\)): Evaluates base velocity tracking and stability.
  2. Manipulation Critic (\(V_{mani}\)): Evaluates end-effector (hand) position and orientation tracking.
  3. Contact Schedule Critic (\(V_{cs}\)): Evaluates the timing of footfalls (gait).

By calculating the “Advantage” (how much better an action was than expected) separately for each component and then summing the results, the policy receives cleaner gradient signals.

Why does this help? If a robot takes an action that is great for the arm but slightly bad for walking, a single combined critic might output a “neutral” value, effectively washing out the learning signal. With multiple critics, the Actor receives specific feedback: “This was excellent for the arm, but you need to adjust your feet timing.” This prevents the conflicting goals from canceling each other out during the learning process.
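
A minimal sketch of this idea, assuming separate value heads per reward group and a simple one-step advantage (a full implementation would use a PPO-style estimator such as GAE), could look like this:

```python
import torch
import torch.nn as nn

REWARD_GROUPS = ["loco", "mani", "cs"]  # locomotion, manipulation, contact schedule

class MultiCritic(nn.Module):
    """One value head per reward group (a sketch, not the authors' exact network)."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.heads = nn.ModuleDict({
            group: nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 1))
            for group in REWARD_GROUPS
        })

    def forward(self, obs: torch.Tensor) -> dict:
        # Each head predicts the value of its own reward stream.
        return {group: head(obs).squeeze(-1) for group, head in self.heads.items()}

def summed_advantage(rewards: dict, values: dict, next_values: dict,
                     gamma: float = 0.99) -> torch.Tensor:
    # One-step advantage per reward group, summed afterwards so the actor
    # still receives a single policy-gradient signal.
    total = torch.zeros_like(values[REWARD_GROUPS[0]])
    for group in REWARD_GROUPS:
        td_error = rewards[group] + gamma * next_values[group] - values[group]
        total = total + td_error
    return total
```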


Twist-Based Task Formulation

The second major contribution is how the robot is told to move. Instead of just giving the robot a Goal Pose (\(T_g\)), the researchers define a command based on Twist.

In physics, a “Twist” represents the velocity of a rigid body: it combines linear velocity (\(v\)) and angular velocity (\(\omega\)).
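
In standard rigid-body notation (generic symbols, not necessarily the paper’s), the two are stacked into a single six-dimensional vector:

\[ \text{Twist} = (v, \; \omega) \in \mathbb{R}^{6}, \quad v \in \mathbb{R}^{3}, \; \omega \in \mathbb{R}^{3} \]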

Why Velocity Matters

Previous methods used hierarchical structures where a high-level planner sent pose targets to a low-level controller. This lacks velocity information, leading to the jerky motions mentioned earlier.

This paper formulates the command \(c_t\) to include explicit velocity targets for both the base and the end-effector:

Equation showing the command vector components including base velocity, end-effector velocity, and goal pose.

Here, \(v_{EE}\) and \(\omega_{EE}\) are the desired linear and angular velocities of the robotic hand.

To generate these commands during training, the system samples a start pose and a goal pose and interpolates a path between them. The command sent to the policy at every timestep tells the robot how fast it should be moving right now to stay on that trajectory.

Equation showing the calculation of end-effector velocity and angular velocity based on the interpolated target.

This forces the policy to learn dynamic control. It isn’t just trying to be somewhere; it’s trying to move through space smoothly.
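
As a simplified illustration (translation only, with hypothetical names; the paper also handles orientation), deriving such a velocity command from an interpolated path might look like this:

```python
import numpy as np

# Sketch of deriving a velocity command from an interpolated path
# (translation only for brevity; hypothetical names, not the authors' code).
def interpolated_command(p_start: np.ndarray, p_goal: np.ndarray,
                         t: float, duration: float):
    """Return the desired end-effector position and linear velocity at time t."""
    alpha = np.clip(t / duration, 0.0, 1.0)
    p_target = (1.0 - alpha) * p_start + alpha * p_goal   # point on the straight-line path
    v_target = (p_goal - p_start) / duration if t < duration else np.zeros(3)
    return p_target, v_target
```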


Experimental Results

The researchers validated their method in simulation and on real hardware, using an ANYmal D quadruped equipped with a Dynaarm manipulator.

1. Trajectory Tracking Accuracy

The primary test was whether the robot could follow specific shapes with its hand: straight lines, circles, and semicircles.

Figure 3: Graphs showing position and velocity tracking for linear, circular, and semicircular trajectories.

Figure 3 displays the tracking performance. The red lines represent the commanded trajectory, and the green lines represent the measured execution.

  • Precision: Notice how tightly the green lines hug the red lines. The position error is minimal.
  • Velocity: The bottom rows show velocity tracking. The robot isn’t just hitting the waypoints; it is matching the speed profile requested by the controller.

Table 1 further quantifies this, showing that even at varying speeds (0.05 m/s to 0.2 m/s), the robot maintains low tracking error (\(\delta r\)).

Table 1: Tracking errors for different velocities on hardware.

2. The Power of Multi-Critic vs. Single-Critic

Does splitting the critic actually matter? The researchers compared their method against a standard single-critic approach (where all rewards are summed up before the critic sees them).

Figure 8: Graph comparing training steps between the multi-critic and single-critic approaches.

Figure 8 reveals a stark difference.

  • Single-Critic (Dashed Lines): The policy learns to track the end-effector (the red dashed line drops), but it fails to learn locomotion properly. To minimize error, the single-critic agent tends to simply stand still while manipulating, ignoring base velocity commands.
  • Multi-Critic (Solid Lines): The agent learns to minimize both position error and velocity error simultaneously. It successfully learns to walk and work.

We can also look at reward sensitivity. In RL, tuning the “weight” of a reward is tedious.

Figure 6: Graph showing the reward sensitivity analysis.

Figure 6 shows that the Multi-Critic approach (MC) is robust to scaling. Even if you multiply the rewards by 5x or 10x, the performance remains stable. The Single-Critic (SC) approach, however, collapses; if you scale up the manipulation reward, locomotion quality degrades, and vice versa.

3. Emergent Behavior: Trotting

One of the most fascinating results is an “emergent behavior”—something the robot learned to do that it wasn’t explicitly forced to do in that specific way.

The robot was trained with a static walking gait pattern. However, because the Contact Schedule was separated into its own critic, the policy learned a generalized understanding of foot timings.

Figure 4: Comparison of foot heights during a static walk vs. a trot.

As seen in Figure 4, when the robot needs to move faster, it spontaneously adapts its gait to a trot (moving diagonal legs together), shown on the right. This adaptation happens at runtime without explicit programming for a trot gait, showing that the multi-critic architecture allows the policy to “understand” the underlying mechanics of locomotion better than a rigid reward structure would.

4. “Chicken Head” Stabilization

Finally, to prove the decoupling of base and arm, the researchers tested “Chicken Head” control. This involves commanding the base to walk back and forth while commanding the end-effector to stay locked at a specific point in the world (XYZ coordinates).
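
Conceptually, the command looks something like the sketch below (hypothetical field names, not the authors’ actual interface): the base velocity target is nonzero while the end-effector’s world-frame twist target is zero, so the arm must actively cancel out every motion of the base.

```python
import numpy as np

# "Chicken head" command sketch (a hypothetical structure, not the authors' interface).
chicken_head_command = {
    "base_lin_vel": np.array([0.2, 0.0, 0.0]),   # nonzero base velocity (up to 0.2 m/s in the test)
    "ee_lin_vel":   np.zeros(3),                 # hand should not translate in the world frame
    "ee_ang_vel":   np.zeros(3),                 # ...nor rotate
    "ee_goal_pos":  np.array([0.6, 0.0, 0.8]),   # an arbitrary fixed world-frame point
}
```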

Figure 9: Graph showing end-effector position error and base velocity during chicken head mode.

Figure 9 shows the results. The Blue line shows the base moving significantly (up to 0.2 m/s). The Red line shows the end-effector error. Despite the body moving, the hand deviates by less than 3 centimeters on average. This requires the arm to actively compensate for every lurch and step of the legs—a classic example of successful whole-body control.


Conclusion and Future Implications

The paper “Multi-critic Learning for Whole-body End-effector Twist Tracking” provides a compelling blueprint for the next generation of mobile manipulators.

By moving away from static pose tracking to twist (velocity) tracking, the researchers enabled smoother, more dynamic arm movements. More importantly, by adopting a Multi-Critic architecture, they solved the “Jack of all trades, master of none” problem that plagues multi-objective Reinforcement Learning.

Key Takeaways:

  1. Decouple your critics: When tasks conflict (like stability vs. reach), let separate neural networks evaluate them. This simplifies tuning and improves performance.
  2. Control the velocity, not just the pose: For smooth interaction with the world, robots need to understand trajectories, not just destinations.
  3. Emergent capabilities: A well-structured learning environment allows robots to develop useful behaviors (like trotting) that weren’t explicitly hard-coded.

As we look toward a future where robots help in households and industrial sites, the ability to walk and handle objects simultaneously—without stopping to stabilize every few seconds—will be the difference between a novelty and a genuinely useful machine. This research brings us one smooth step closer to that reality.