Reinforcement Learning (RL) has revolutionized how robots move. We can now train quadrupedal robots to run over rough terrain and robotic arms to reach targets with impressive reliability. However, there is often a stark difference between a robot that successfully completes a task and one that looks natural doing it.
Pure RL policies tend to produce jittery, mechanical, or “weird” behaviors because the reward function prioritizes raw efficiency (minimizing energy or maximizing speed) while ignoring the nuances of biological motion. To fix this, researchers often turn to Imitation Learning, feeding the robot motion-capture data from humans or animals.
But here is the catch: Demonstrations are rarely perfect.
If you teach a robot to walk using data recorded on a flat treadmill, what happens when it encounters stairs? If the robot strictly copies the style, it trips. If it strictly focuses on the task, it loses the style. Balancing these two has traditionally been a tedious game of manual tuning.
In this post, we dive into ConsMimic, a new framework presented by researchers from ETH Zurich. They propose a clever way to mathematically force a robot to be “as stylish as possible” only after ensuring it is doing its job correctly.
The Core Problem: The Task-Style Trade-off
Imagine you want a humanoid robot to walk across stepping stones. You have motion capture data of a human walking on a flat floor.
- Task Objective: Don’t fall, step on the stones, move forward.
- Style Objective: Move your arms and legs like the human in the data.
These two goals conflict. The stride length required for the stones might not match the stride length in the human data.
In traditional approaches, engineers create a weighted reward function:
\[ Reward = w_{task} \cdot R_{task} + w_{style} \cdot R_{style} \]

Finding the right balance for \(w_{task}\) and \(w_{style}\) is difficult. If \(w_{style}\) is too high, the robot prioritizes looking human over not falling, leading to failure. If \(w_{task}\) is too high, the robot ignores the human data and develops an unnatural gait. This is known as the Task-Style Trade-off.
Enter ConsMimic: Constrained Markov Decision Processes
The researchers propose that instead of hoping fixed weights will work, we should formulate the problem as a Constrained Markov Decision Process (CMDP).
The philosophy is simple: “The task is non-negotiable; the style is a bonus.”
ConsMimic sets up the learning problem to maximize the style reward subject to the constraint that the task performance remains near-optimal.

As shown in Figure 1, the architecture splits the reward signals. The agent receives feedback from a Task Critic (how well am I doing the job?) and a Style Critic (how cool do I look?). The magic happens in how these signals are combined using a constraint.
The Math Behind the Magic
The core objective of ConsMimic can be written as an optimization problem. We want to find a policy \(\pi\) that maximizes style value (\(v^s\)) but ensures the task value (\(v^g\)) is at least a fraction (\(\alpha\)) of the best possible task performance (\(v^{g*}\)):
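Written out in the notation above, the constrained objective takes roughly this form (a reconstruction from the description; the paper’s exact formulation may additionally involve discounting and expectations over states):

\[ \max_{\pi} \; v^{s}(\pi) \quad \text{subject to} \quad v^{g}(\pi) \;\geq\; \alpha \, v^{g*} \]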

Here, \(\alpha\) is a user-defined threshold (e.g., 0.9). It tells the robot: “I don’t care how you move, as long as you achieve at least 90% of the optimal success rate.”
To solve this constrained problem, the authors use the Lagrangian Multiplier method. This converts the constraint into a penalty term in the loss function.
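In its simplest form, a standard Lagrangian relaxation of the objective above looks like this (again a sketch in the post’s notation, not the paper’s exact training loss):

\[ \mathcal{L}(\pi, \lambda) \;=\; v^{s}(\pi) \;+\; \lambda \left( v^{g}(\pi) - \alpha \, v^{g*} \right), \qquad \lambda \geq 0 \]

The policy is updated to maximize \(\mathcal{L}\), while \(\lambda\) is updated in the opposite direction to keep the constraint honest.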

Think of \(\lambda\) (lambda) as a dynamic “price” for violating the rule.
- If the robot’s task performance drops below the threshold, \(\lambda\) increases. This makes the “cost” of failing the task very high, forcing the neural network to focus purely on the task.
- If the robot is comfortably succeeding at the task, \(\lambda\) decreases. The network is then free to “spend” its optimization budget on improving style.
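A minimal sketch of this dual update, assuming scalar estimates of the task value \(v^g\) and the running optimum \(v^{g*}\) (the function name and learning rate are illustrative, not taken from the paper):

```python
def update_lambda(lmbda: float, v_g: float, v_g_star: float,
                  alpha: float = 0.9, lr: float = 0.01) -> float:
    """Dual ascent step on the constraint multiplier.

    Raises lambda when the task constraint is violated and lets it decay
    toward zero when the task is comfortably satisfied.
    """
    violation = alpha * v_g_star - v_g   # > 0 means task performance is below the threshold
    lmbda = lmbda + lr * violation       # gradient step on the dual variable
    return max(lmbda, 0.0)               # the multiplier must stay non-negative
```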
Adapting to the Unknown
There is a tricky variable in the equation above: \(v^{g*}\) (the optimal task performance). How do we know what the optimal performance is before we’ve even trained the robot?
If we guess too high, the constraint is impossible to satisfy. If we guess too low, the robot settles for mediocrity.
ConsMimic solves this with an online update rule.
- Warm-up: The training starts with \(\lambda\) set very high (pure task learning). The robot learns to solve the task without worrying about style.
- Record High Score: The system records the best task reward achieved so far.
- Dynamic Adjustment: During training, if the robot discovers a better way to perform the task, \(v^{g*}\) is updated.
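Putting the three steps together, a hypothetical training loop might look like the sketch below, reusing the update_lambda helper from above (evaluate_task_value and policy_update are placeholder callbacks standing in for the actual RL machinery):

```python
def train(policy, evaluate_task_value, policy_update,
          num_steps=10_000, warmup_steps=1_000, alpha=0.9):
    v_g_star = float("-inf")  # best task value observed so far
    lmbda = 10.0              # start high: warm-up is effectively task-only learning
    for step in range(num_steps):
        v_g = evaluate_task_value(policy)   # task critic's current estimate
        v_g_star = max(v_g_star, v_g)       # "record high score": online update of the optimum
        if step >= warmup_steps:            # after warm-up, let the multiplier adapt
            lmbda = update_lambda(lmbda, v_g, v_g_star, alpha=alpha)
        policy_update(policy, lmbda)        # lambda balances task vs. style in the loss
    return policy
```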

This ensures the constraint is always realistic but aspirational. The robot effectively determines its own performance ceiling and then tries to inject style without dropping too far below that ceiling.
How to Measure Style?
Before constraining the style, the robot needs to know what “style” is. The paper uses two methods depending on the task:
Motion Clip Tracking: For manipulation tasks (like a robotic arm), the reward is based on how closely the joints match a specific trajectory.
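A common form of such a tracking reward is an exponential kernel on the joint error; the sketch below is illustrative (the kernel width sigma and the exact tracked quantities are assumptions, not the paper’s terms):

```python
import numpy as np

def tracking_style_reward(q_robot: np.ndarray, q_ref: np.ndarray,
                          sigma: float = 0.5) -> float:
    """Reward close to 1 when robot joints match the reference clip, decaying with error."""
    err = float(np.sum((q_robot - q_ref) ** 2))  # squared joint-position error
    return float(np.exp(-err / sigma ** 2))
```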

Adversarial Imitation (AMP): For locomotion (walking/running), exact tracking is too restrictive. Instead, they use a Discriminator (similar to GANs). The Discriminator tries to tell the difference between the robot’s motion and the reference motion. The robot gets a reward if it can fool the Discriminator.
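Under the common AMP formulation, the style reward comes from the Discriminator’s output; the sketch below assumes a hypothetical discriminator callable that returns a raw logit for a state-transition feature vector (the exact reward shaping in the paper may differ):

```python
import numpy as np

def amp_style_reward(discriminator, transition: np.ndarray) -> float:
    """Reward the policy for transitions the discriminator mistakes for reference motion."""
    logit = discriminator(transition)
    d = 1.0 / (1.0 + np.exp(-logit))                     # probability of "reference-like"
    return float(-np.log(np.clip(1.0 - d, 1e-4, 1.0)))  # higher when the policy fools it
```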

The Symmetry Problem
A common issue in Adversarial Imitation Learning is “mode collapse.” If the demonstration data isn’t perfect, the robot might learn a weird, asymmetric gait (e.g., limping) because the Discriminator focuses on the wrong features.
To fix this, ConsMimic augments the style reward with Symmetry. It mathematically flips the robot’s state (mirroring left and right) and averages the style reward across these transformations.
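As a sketch, the symmetry-augmented reward can be written as an average over the original and mirrored states (mirror_state is a hypothetical helper that swaps left/right joints and flips lateral coordinates):

```python
def symmetric_style_reward(style_reward_fn, state, mirror_state):
    """Average the style reward over the original state and its left/right mirror."""
    r_original = style_reward_fn(state)
    r_mirrored = style_reward_fn(mirror_state(state))
    return 0.5 * (r_original + r_mirrored)
```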

This forces the robot to learn a balanced gait, even if the demonstration data is slightly biased or imperfect.
Experimental Results
The researchers tested ConsMimic on three distinct platforms:
- Franka Emika Panda: A robotic arm reaching for targets.
- ANYmal-D: A quadruped robot (robot dog).
- GR1: A full-body humanoid robot.
Simulation Performance
The results were compared against a “Task-Only” baseline (no style) and fixed-weight baselines (mixing task and style rewards manually).

In Figure 2 above, look at the green bars (ConsMimic):
- Top Row (Task Reward): ConsMimic maintains high task performance, almost matching the “Task-Only” (purple) baseline. The fixed-weight baseline (dark gray) often fails to solve the task (see Anymal-Lateral).
- Bottom Row (Imitation Score): ConsMimic achieves significantly higher style scores than the Task-Only baseline. While the aggressive fixed-weight baseline (dark gray) sometimes has higher style, it comes at the cost of failing the task (as seen in the top row).
Visualizing the Trade-off
One of the most intuitive results comes from the Franka arm experiment. The goal is to reach a target (Task), but the demonstration suggests a curved, stylish path (Style).

Figure 4 demonstrates the power of the \(\alpha\) parameter:
- \(\alpha = 1.0\) (Left): The robot is forced to be 100% optimal on the task. It ignores the yellow demonstration curve and takes the shortest path (green line matches red line).
- \(\alpha = 0.9\) (Middle): The robot is allowed 10% slack in task optimality. It starts to curve its path to match the style.
- \(\alpha = 0.8\) (Right): With 20% slack, the robot almost perfectly mimics the demonstration curve while still reaching the target.
Humanoid on Rough Terrain
The GR1 humanoid experiments showed that ConsMimic helps generalization. The robot was given walking demonstrations recorded on flat ground but was tasked with walking on stairs and stepping stones.

Thanks to the symmetry augmentation and the adaptive constraint, the robot adapted its gait. It lifted its feet higher for the stairs (Task Requirement) while maintaining the upper-body posture and rhythm of the human demonstration (Style Requirement).
Real-World Validation: ANYmal-D
Finally, does it work on real hardware? The team deployed the policy on the ANYmal-D robot.

The visual difference is subtle but important. The “Task-Only” policy (top) often results in stomping or stiff legs. The ConsMimic policy (bottom) produces a more “agile trotting gait.”
Quantitatively, the benefits were clear:
- Mechanical Energy: Reduced by 14.5%.
- Foot Clearance: The robot lifted its feet higher and moved more smoothly.
By moving more “naturally” (imitating biological motion), the robot actually became more energy-efficient, proving that style isn’t just about aesthetics—it’s often about physical efficiency.
Conclusion
ConsMimic provides a robust framework for bridging the gap between rigid robotic control and natural biological motion. By formulating the problem as a Constrained MDP with an adaptive Lagrangian multiplier, the method removes the need for tedious reward tuning.
The key takeaways are:
- Safety First: Prioritize task completion constraints over style maximization.
- Adaptive Learning: Automatically adjust the definition of “optimality” as the agent learns.
- Symmetry Matters: Enforce geometric constraints to prevent unnatural limping or mode collapse.
This approach paves the way for robots that can operate in human environments not just effectively, but with the grace and agility we expect from biological counterparts. Whether it’s a humanoid navigating a construction site or a robotic dog making a delivery, looking natural is a key step toward social acceptance and physical robustness.