Introduction: The Battery Bottleneck

Imagine buying a state-of-the-art quadruped robot. It’s agile, intelligent, and capable of traversing rough terrain. You deploy it for a search and rescue mission or a routine inspection, and 60 minutes later, it shuts down. The battery is dead.

This is the reality for many untethered robots today. For instance, the Unitree Go2 typically operates for only 1 to 4 hours on a single charge. If we want robots to be truly useful in the real world, we cannot just focus on what they do (task performance); we must focus on how they do it (energy efficiency).

In Reinforcement Learning (RL), the standard way to teach a robot to be efficient is to punish it for using energy. You give the robot a reward for moving forward, but you subtract points based on how much torque it uses. This sounds simple, but in practice, it creates a frustrating balancing act. If the penalty is too small, the robot ignores it and drains the battery. If the penalty is too high, the robot becomes “lazy”—it might just stand still or crawl to save energy, failing the mission entirely.

In this post, we are diving into a research paper that proposes a clever solution to this problem: PEGrad (Projecting Energy Gradients). Instead of forcing engineers to manually tune the trade-off between performance and energy, PEGrad mathematically ensures that energy minimization never conflicts with the primary task.

The Problem with Reward Shaping

To understand why PEGrad is necessary, we first need to look at how energy efficiency is currently handled in RL.

Typically, we train a robot using a reward function \(r\). If we want energy efficiency, we create a composite reward:

\[ r = r_{task} + \lambda r_{energy} \]

Here, \(r_{task}\) is the reward for doing the job (e.g., running at 1 m/s), and \(r_{energy}\) is a penalty for energy use (usually negative). The symbol \(\lambda\) (lambda) is a weighting factor that controls how much we care about energy.
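
As a concrete sketch of how such a shaped reward is computed, here is a tiny Python example. The squared-torque penalty is a common proxy for energy use and is an assumption here, not necessarily the exact penalty used in the paper.

```python
import numpy as np

def shaped_reward(r_task: float, torques: np.ndarray, lam: float) -> float:
    """Composite reward: task reward plus a weighted energy penalty.

    The penalty is the negative sum of squared joint torques, a common
    proxy for instantaneous energy use (illustrative choice).
    """
    r_energy = -float(np.sum(np.square(torques)))
    return r_task + lam * r_energy

# Same torques, three different lambdas: the shaped reward swings from
# "energy barely matters" to "energy dominates the task signal".
torques = np.array([8.0, -5.0, 3.0])  # hypothetical joint torques
for lam in (0.0, 0.01, 0.1):
    print(f"lambda={lam}: r={shaped_reward(1.0, torques, lam):.2f}")
```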

The problem is finding the right \(\lambda\).

Figure 1: Comparison of different lambda values showing the trade-off between return and torque.

As shown in Figure 1 above, the choice of \(\lambda\) is incredibly sensitive.

  • Left Plot (Dog-Run): Notice the difference between \(\lambda=0.01\) and \(\lambda=0.1\). At 0.01, the robot runs well. At 0.1, the penalty is so high that the robot likely resorts to crawling or barely moving to avoid “spending” torque.
  • Right Plot (Humanoid Sit): Here, both \(\lambda\) values work fine.

This inconsistency is the core issue. A \(\lambda\) that works for walking might fail for running. A \(\lambda\) that works for a quadruped might fail for a humanoid. This forces researchers to run expensive grid searches, retraining policies over and over to find a “Goldilocks” value that saves energy without destroying performance.

Ideally, we don’t want to balance these objectives; we want to order them. We want to tell the robot: “Prioritize the task above all else. But, provided you are succeeding at the task, do it using the least amount of energy possible.”

Background: Multi-Objective Reinforcement Learning

Before diving into the solution, let’s briefly establish the mathematical landscape. The authors build their method on top of standard Actor-Critic algorithms such as Soft Actor-Critic (SAC) or Proximal Policy Optimization (PPO).

In a standard Actor-Critic setup, we have:

  1. The Actor (\(\pi\)): The policy that decides what action to take.
  2. The Critic (\(Q\)): The network that estimates how good that action is (expected return).

The Critic tries to minimize the prediction error (Bellman error):

Equation 1: The Critic loss function.
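
For reference, in standard SAC-style notation this is the mean squared Bellman error; the paper’s exact symbols may differ, and SAC additionally includes an entropy term in the target:

\[ L_Q(\phi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\Big[ \big( Q_\phi(s_t, a_t) - \big( r_t + \gamma\, Q_{\bar{\phi}}(s_{t+1}, a_{t+1}) \big) \big)^2 \Big], \quad a_{t+1} \sim \pi(\cdot \mid s_{t+1}) \]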

The Actor tries to maximize the value predicted by the Critic (plus an entropy term for exploration):

Equation 2: The Actor loss function.
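
In the same notation, a standard SAC-style actor loss (minimized, which is equivalent to maximizing the Q-value plus entropy) looks like:

\[ L_\pi(\theta) = \mathbb{E}_{s_t}\big[ \alpha_{ent} \log \pi_\theta(a_t \mid s_t) - Q_\phi(s_t, a_t) \big], \quad a_t \sim \pi_\theta(\cdot \mid s_t) \]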

In a Multi-Objective setting, we don’t just have one Critic. We use two:

  1. \(Q^r\): Estimates the future task reward.
  2. \(Q^e\): Estimates the future energy consumption.

The standard approach essentially combines these estimates linearly using that troublesome \(\lambda\) parameter we discussed earlier:

Equation 4: The standard linear combination of task and energy objectives.
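
Written out, the actor effectively maximizes a weighted sum of the two critics (up to the sign convention chosen for the energy critic):

\[ Q(s, a) = Q^r(s, a) + \lambda\, Q^e(s, a) \]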

This equation mathematically encodes the conflict. If the gradient of \(Q^r\) points “North” (improve task) and the gradient of \(Q^e\) points “South” (save energy), the optimizer averages them out based on \(\lambda\). If \(\lambda\) is large, “South” wins, and the robot stops doing the task.

The PEGrad Method: Orthogonal Projection

The researchers propose PEGrad to eliminate the need for \(\lambda\). The core insight is geometric.

Imagine the neural network’s parameters as coordinates on a map. There is a “Task Hill” we want to climb to get high rewards. There is also an “Energy Valley” we want to descend into.

If we simply try to walk downhill into the Energy Valley, we might accidentally walk down the Task Hill, losing performance. We want to move towards lower energy only if that movement doesn’t lower our elevation on the Task Hill. In mathematical terms, we want to move along the level set (contour line) of the task performance.

Step 1: Separating the Gradients

First, the authors separate the policy loss into two distinct components: \(L_R\) (Reward Loss) and \(L_E\) (Energy Loss).

Equation 5: Definition of Reward Loss and Energy Loss.
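
Roughly speaking (signs chosen so that both losses are minimized; the paper’s exact definitions may also carry entropy terms), these are:

\[ L_R(\theta) = -\,\mathbb{E}_{s}\big[ Q^r(s, \pi_\theta(s)) \big], \qquad L_E(\theta) = \mathbb{E}_{s}\big[ Q^e(s, \pi_\theta(s)) \big] \]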

We can compute the gradients for these separately:

  • \(g_R\): The direction that improves the task.
  • \(g_E\): The direction that reduces energy usage.

Step 2: The Taylor Expansion Logic

Why does moving along the “level set” work? We can look at the first-order Taylor approximation of the reward loss. If we change the policy parameters \(\theta\) by a small step \(d\), the new loss is approximately:

Equation 6: Taylor approximation of the reward loss.
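
Concretely, with \(g_R = \nabla_\theta L_R(\theta)\), the first-order expansion reads:

\[ L_R(\theta + d) \approx L_R(\theta) + g_R^\top d \]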

If our step \(d\) is orthogonal (perpendicular) to the task gradient \(g_R\), then the dot product \(g_R^T d\) is zero. This means \(L_R(\theta + d) \approx L_R(\theta)\).

In plain English: If we move the weights in a direction perpendicular to the task gradient, the task performance remains unchanged (locally).

Step 3: Projecting the Energy Gradient

This brings us to the main formula of PEGrad. We want to descend the energy gradient \(g_E\), but we must strip away any part of that gradient that conflicts with the task gradient \(g_R\).

We do this using orthogonal projection (a single Gram-Schmidt step). We calculate the component of \(g_E\) that is parallel to \(g_R\) and subtract it. What remains is \(g_{E_{\perp R}}\)—the part of the energy gradient that lies in the subspace orthogonal to the task gradient.

The final update direction \(d\) combines the standard task update with this “safe” energy update:

Equation 7: The update direction combining task gradient and projected energy gradient.
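
Using the projection written out in the algorithm summary below, the update direction takes the form:

\[ d = -\,\alpha\, g_R \;-\; \beta \left( g_E - \frac{g_R^\top g_E}{g_R^\top g_R}\, g_R \right) = -\,\alpha\, g_R \;-\; \beta\, g_{E_{\perp R}} \]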

Here is what is happening in this equation:

  1. \(-\alpha g_R\): This is the standard learning step. Improve the task.
  2. The term in parentheses is the projection: it subtracts the component of the energy gradient that lies along the task gradient, leaving only the part orthogonal (90 degrees) to it.
  3. \(-\beta g_{E_{\perp R}}\): This step minimizes energy without affecting the task.

Step 4: Adaptive Scaling (\(\beta\))

You might notice a new parameter \(\beta\) in the equation above. Is this just another hyperparameter we have to tune, replacing \(\lambda\)?

No. The authors define \(\beta\) adaptively.

The logic is that the “correction” (the energy adjustment) should never be larger than the primary learning signal (the task update). If the energy gradient is massive, we cap it so it doesn’t destabilize training.

Equation 8: Adaptive definition of beta.
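
One way to realize this cap, consistent with the description above but only a sketch of the paper’s exact rule, is to clip the energy step so its magnitude never exceeds the task step:

\[ \beta = \alpha \cdot \min\!\left( 1,\; \frac{\lVert g_R \rVert}{\lVert g_{E_{\perp R}} \rVert} \right) \]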

This formula ensures that the step size for energy minimization is scaled relative to the task learning rate \(\alpha\) but clipped so it never dominates. In effect, PEGrad has no energy trade-off hyperparameter to tune: you don’t pick \(\beta\); the math picks it for you at every update step.

Implementation Simplicity

One of the strengths of this method is how easy it is to implement in modern Deep RL frameworks like PyTorch.

Algorithm 1 (PEGrad) Summary:

  1. Compute the backward pass for Task Reward \(\rightarrow\) get \(g_R\).
  2. Compute the backward pass for Energy \(\rightarrow\) get \(g_E\).
  3. Calculate the projection: \(g_{E_{\perp R}} = g_E - \frac{g_R^T g_E}{g_R^T g_R} g_R\).
  4. Rescale \(g_{E_{\perp R}}\) if it’s too big (adaptive \(\beta\)).
  5. Feed the combined gradient (\(g_R + g_{E_{\perp R}}\)) to the optimizer.

This requires two backward passes (one for each objective), which adds a slight computational cost, but it removes the human cost of tuning \(\lambda\) for days.
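
To make the recipe concrete, here is a minimal PyTorch-style sketch of a single PEGrad policy update. It assumes you already have the two scalar losses (loss_R and loss_E, computed from the two critics) and an optimizer over the policy parameters; the names and the exact \(\beta\) rule are illustrative, not the authors’ reference implementation.

```python
import torch

def pegrad_update(policy, loss_R, loss_E, optimizer):
    """One PEGrad-style update: descend the task gradient plus the part of the
    energy gradient that is orthogonal to it (sketch, not the official code)."""
    params = [p for p in policy.parameters() if p.requires_grad]

    # Backward pass 1: task gradient g_R (keep the graph for the second pass).
    g_R = torch.autograd.grad(loss_R, params, retain_graph=True)
    # Backward pass 2: energy gradient g_E.
    g_E = torch.autograd.grad(loss_E, params)

    # Flatten both gradients into single vectors for the projection.
    g_R_flat = torch.cat([g.reshape(-1) for g in g_R])
    g_E_flat = torch.cat([g.reshape(-1) for g in g_E])

    # Gram-Schmidt step: remove the component of g_E that lies along g_R.
    coeff = (g_R_flat @ g_E_flat) / (g_R_flat @ g_R_flat + 1e-12)
    g_E_perp = g_E_flat - coeff * g_R_flat

    # Adaptive scaling (illustrative rule): never let the energy correction
    # outweigh the task step in norm.
    scale = min(1.0, (g_R_flat.norm() / (g_E_perp.norm() + 1e-12)).item())
    combined = g_R_flat + scale * g_E_perp

    # Write the combined gradient back into .grad and take an optimizer step.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = combined[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```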

Experimental Results

The authors tested PEGrad on two major benchmarks: DM-Control (Quadruped, Dog) and HumanoidBench (H1 Humanoid). They compared PEGrad against:

  1. Base: Standard RL with no energy constraints (\(\lambda=0\)).
  2. Fixed \(\lambda\): Standard RL with various manual weights.
  3. PCGrad: A competitor method (Gradient Surgery) usually used for multi-task learning.

Simulation Results: DM-Control

The goal here is simple: High Return (Y-axis) and Low Torque (X-axis).

Figure 2: Results on DM-Control Suite showing Return vs Torque.

Look at the scatter plots in Figure 2.

  • The triangles (\(\lambda=0\)) have high returns but huge energy usage (far right).
  • The hexagons (high \(\lambda\)) are far to the left (low energy) but often drop in return (lower down).
  • PEGrad (The Star): Consistently lands in the top-left corner. This is the Pareto optimal zone. It achieves returns comparable to the unconstrained baseline but with significantly less torque.

For the Dog-Run task (bottom left of Figure 2), notice how the fixed \(\lambda\) baselines struggle. They are either efficient-but-fail or successful-but-expensive. PEGrad finds a policy that is both successful and efficient.

Simulation Results: HumanoidBench

Humanoid robots are notoriously harder to control than quadrupeds.

Figure 3: Results on HumanoidBench tasks.

In Figure 3, we see similar success.

  • Sample Efficiency: Look at the “Avg. Return” line graphs (top right of the panels). PEGrad (orange/red line) often learns faster than the baseline.
  • Why? In high-dimensional control (like a humanoid), “flailing” your limbs is a bad exploration strategy. By constraining the policy to be energy-efficient, PEGrad implicitly guides the robot toward smoother, more natural movements, which helps it learn to walk and run sooner.
  • PCGrad Failure: The purple dots/lines representing PCGrad often fail completely (near zero return). The authors speculate this is because PCGrad lacks the priority ordering—it treats energy and task as equals, leading to cancellations that stop the robot from learning the task.

Sim2Real: Transfer to Physical Robot

Simulation is great, but does it work on hardware? The authors deployed their policy on a Unitree Go2 quadruped.

Figure 6: Real world setup with Unitree Go2.

They trained the robot to Stand and Walk using PPO + PEGrad and compared it to a manually tuned baseline and the factory controller. They measured the actual current drawn from the battery.

Table 1: Current and Torque usage comparison.

Table 1 shows the results:

  • Walking Task: PEGrad reduced current draw by roughly 19.7% compared to the best manually tuned baseline (\(\lambda=0.0002\)).
  • Factory Comparison: PEGrad used significantly less current (5.65 A) compared to the Factory controller (6.46 A).
  • Safety: The baseline with \(\lambda=0\) (no energy penalty) was so aggressive it wasn’t even safe to run for the full walking test—it took dangerous jumps. PEGrad remained stable and efficient.

Conclusion and Implications

The “Non-conflicting Energy Minimization” paper presents a compelling argument against the traditional method of reward shaping. By treating energy minimization as a gradient projection problem rather than a reward weighting problem, PEGrad offers several advantages:

  1. Hyperparameter-Free: No more grid searches for \(\lambda\).
  2. Safety: It inherently prioritizes task success, preventing the “lazy robot” failure mode.
  3. Efficiency: It achieved up to 64% energy reduction in simulations compared to unconstrained policies.
  4. Transferability: It works on real hardware, translating theoretical torque reductions into actual battery life gains.

Why This Matters

As robots move from labs to the real world—delivery bots, inspection drones, home assistants—energy efficiency becomes a critical driver of economic viability. A delivery robot that lasts 6 hours is twice as valuable as one that lasts 3.

PEGrad provides a principled way to squeeze that extra runtime out of the battery without requiring engineers to spend weeks fine-tuning reward functions for every new task or environment. It allows the robot to learn the most efficient way to succeed, dynamically and automatically.

Limitations

The authors note that PEGrad can sometimes conflict with “Style Rewards” (rewards used to make the robot look natural, like “don’t wiggle your feet”). If minimizing energy conflicts with “looking natural,” the robot might choose to look unnatural to save power. Future work will likely look at how to chain multiple objectives (Task > Style > Energy) using recursive projections.

For now, however, PEGrad represents a significant step forward in making RL-controlled robots practical, durable, and energy-aware.