Introduction

Deep Reinforcement Learning (RL) has revolutionized robotics, enabling legged machines to walk, run, and recover from falls. Yet, when we look at the most impressive demonstrations—robots doing backflips, leaping across wide gaps, or climbing vertically—there is often a hidden cost. That cost is human engineering effort.

To achieve agile locomotion, researchers often have to manually design complex reward functions, gather expensive expert demonstrations, or carefully architect curriculum learning phases that guide the robot step-by-step. If you simply ask a robot to “move forward” across a gap, it will likely fall into the gap repeatedly and fail to learn that it needs to jump. It gets stuck in a local optimum.

So, how do we get robots to learn creative, agile solutions without holding their hands?

In a recent paper titled “Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion,” researchers from Georgia Tech propose a framework called SDAX (Skill Discovery as Exploration). This approach leverages the power of unsupervised learning to autonomously explore diverse behaviors. By treating skill discovery as a sophisticated exploration mechanism, SDAX enables quadrupedal robots to learn parkour-style maneuvers—like wall-jumping and gap leaping—with minimal reward engineering.

Figure 1: We deployed our policy on a real robot. The robot successfully leaps over the gap.

In this post, we will deconstruct how SDAX works, the mathematics behind its bi-level optimization, and how it allows robots to teach themselves to overcome physical obstacles.

The Core Problem: Exploration vs. Exploitation

In reinforcement learning, an agent faces the classic trade-off:

  1. Exploitation: Do what you know yields a reward (e.g., shuffling feet to move forward slightly).
  2. Exploration: Try something new to potentially find a bigger reward (e.g., trying to jump).

For complex physical tasks, simple random exploration (adding noise to the motors) is rarely enough. A random twitch of a leg won’t result in a coordinated jump over a 48cm gap. The robot needs high-level exploration—it needs to explore different strategies (like running fast, jumping high, or crouching low) rather than just exploring different motor twitches.

Enter Unsupervised Skill Discovery

Unsupervised Skill Discovery is a sub-field of RL where an agent learns a repertoire of behaviors (skills) without any external task reward. The agent is usually given a latent vector \(z\) (think of it as a “style” or “command” code). The goal is to learn a policy that behaves differently for every distinct \(z\).
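
To make this concrete, here is a minimal sketch of what a skill-conditioned policy \(\pi(a \mid s, z)\) can look like in PyTorch. The network sizes, the 8-dimensional unit-sphere skill, and the 12 joint targets are illustrative assumptions on my part, not the authors' architecture; the point is simply that \(z\) is concatenated to the observation, so one network can produce different behaviors for different \(z\).

```python
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    """pi(a | s, z): the same network behaves differently for different skill codes z."""
    def __init__(self, state_dim, skill_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, skill):
        # The skill vector is simply concatenated to the observation.
        return self.net(torch.cat([state, skill], dim=-1))

# Sample a random "style" code z and query the policy (sizes are illustrative).
state = torch.randn(1, 48)                                     # proprioceptive observation
z = torch.nn.functional.normalize(torch.randn(1, 8), dim=-1)   # skill code on the unit sphere
policy = SkillConditionedPolicy(state_dim=48, skill_dim=8, action_dim=12)
action = policy(state, z)                                      # e.g., 12 joint targets for a quadruped
```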

Traditionally, this is done to pre-train a robot so it has a library of moves (walking, rolling, hopping) that can be used later. The authors of SDAX asked a different question: Can we use this skill discovery process during training to solve a specific hard task?

The SDAX Framework

The intuition behind SDAX is to combine two objectives:

  1. The Task Objective: Solve the problem (e.g., maximize forward velocity).
  2. The Diversity Objective: Behave differently depending on the skill vector \(z\).

By pushing for diversity, the robot is forced to try distinct strategies. Some \(z\) vectors might make the robot jump high, while others make it run low. If “jumping high” helps cross a gap, the task reward will spike, and the agent will learn that strategy.

However, simply adding these two rewards together is dangerous. If the diversity reward is too strong, the robot will just perform random gymnastics and ignore the goal. If it’s too weak, the robot stops exploring.

SDAX solves this with a bi-level optimization scheme that automatically tunes the balance between exploration (diversity) and exploitation (task).

Figure 2: Bi-level optimization of \(\pi_\theta\) and \(\lambda\). The task reward provides the gradient signal for training \(\lambda\), and the sum of both reward sources provides the gradient signal for optimizing \(\pi_\theta\).

As shown in Figure 2, the framework consists of:

  • The Policy (\(\pi_\theta\)): Conditioned on the state \(s\) and the skill \(z\). It tries to maximize the combined reward.
  • The Balancing Parameter (\(\lambda\)): A learnable weight that determines how much the robot should care about diversity.
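
To see how these two components interact, here is a heavily simplified, schematic training loop (my own sketch, not the authors' implementation). `collect_rollout`, `update_policy`, and `update_lambda` are dummy stand-ins for the real on-policy machinery, and the initial value of \(\lambda\) is illustrative.

```python
import torch

# Schematic overview of the two optimization levels in Figure 2.
lam = 1.0                                        # initial balancing weight (illustrative)

def collect_rollout():
    # Stand-in: returns per-step task and diversity rewards from simulation.
    return torch.rand(64), torch.rand(64)

def update_policy(total_reward):
    pass                                         # stand-in for a PPO update on the combined reward

def update_lambda(lam, r_task):
    return lam                                   # stand-in for the bi-level lambda step (see below)

for iteration in range(1000):
    r_task, r_div = collect_rollout()
    update_policy(r_task + lam * r_div)          # inner level: policy sees task + lambda * diversity
    lam = update_lambda(lam, r_task)             # outer level: lambda tuned on the task reward only
```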

1. The Policy Objective

The policy \(\pi_\theta\) is trained to maximize the total return, which is the sum of the task reward and the weighted diversity reward.

\[
\max_{\theta} \;\; \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \left( r^{task}_t + \lambda \, r^{div}_t \right) \right]
\]

Here, \(r^{task}\) is the simple objective (e.g., “move forward”), and \(r^{div}\) is the intrinsic reward coming from the skill discovery module. \(\lambda\) scales the diversity reward.
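
As a quick sanity check of the combined reward, here is a minimal sketch; the reward values and \(\lambda = 0.5\) are purely illustrative.

```python
import torch

def combined_reward(r_task: torch.Tensor, r_div: torch.Tensor, lam: float) -> torch.Tensor:
    """Per-step reward the policy is trained on: task term plus lambda-weighted diversity term."""
    return r_task + lam * r_div

# Example: a batch of three parallel environments (values are illustrative).
r_task = torch.tensor([0.8, 0.0, 1.2])   # e.g., forward-velocity tracking
r_div  = torch.tensor([0.3, 0.5, 0.1])   # intrinsic reward from the skill-discovery module
print(combined_reward(r_task, r_div, lam=0.5))   # tensor([0.9500, 0.2500, 1.2500])
```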

2. The Diversity Reward (\(r^{div}\))

SDAX is agnostic to the specific skill-discovery algorithm used to compute the diversity reward; the authors primarily use METRA (and compare against DIAYN). METRA's goal is to maximize the Wasserstein Dependency Measure. In simple terms, it rewards the agent for reaching states that are both distinguishable based on the skill \(z\) and far apart in the state space.

The objective for the diversity reward can be expressed as:

\[
\max_{\pi,\, \phi} \;\; \mathbb{E}\!\left[ \left( \phi(s_{t+1}) - \phi(s_t) \right)^{\top} z \right]
\quad \text{s.t.} \quad \left\| \phi(s_t) - \phi(s_{t+1}) \right\| \le 1,
\]

where \(\phi\) is a learned state encoder and the per-step diversity reward is \( r^{div}_t = \left( \phi(s_{t+1}) - \phi(s_t) \right)^{\top} z \).

This encourages the robot to find behaviors that result in distinct, distinguishable state trajectories.
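
Here is a minimal sketch of what a METRA-style diversity reward could look like in code. The encoder \(\phi\), its architecture, and all sizes are assumptions on my part, and the constraint-enforcing training of \(\phi\) is omitted entirely; the point is only that the reward measures latent-space displacement along the skill direction \(z\).

```python
import torch
import torch.nn as nn

# Sketch: phi is a learned state encoder. Its training (which enforces METRA's
# constraint that consecutive states stay within unit distance in latent space)
# is omitted here.
phi = nn.Sequential(nn.Linear(48, 256), nn.ELU(), nn.Linear(256, 8))

def metra_diversity_reward(s: torch.Tensor, s_next: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    latent_step = phi(s_next) - phi(s)       # displacement of the state in latent space...
    return (latent_step * z).sum(dim=-1)     # ...projected onto the skill direction z

# Example: a batch of 32 transitions with 8-dimensional skills (sizes are illustrative).
s, s_next = torch.randn(32, 48), torch.randn(32, 48)
z = torch.nn.functional.normalize(torch.randn(32, 8), dim=-1)
r_div = metra_diversity_reward(s, s_next, z)     # shape: (32,)
```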

3. The Bi-Level Optimization (The “Secret Sauce”)

The most innovative part of SDAX is how it handles \(\lambda\). Instead of manually tuning \(\lambda\) (which is tedious and brittle), the authors treat it as a learnable parameter.

The Logic: We want to adjust the amount of exploration (\(\lambda\)) such that it maximizes the Task Reward. If more exploration leads to better task performance, increase \(\lambda\). If exploration is distracting the robot from the goal, decrease \(\lambda\).

Mathematically, \(\lambda\) is optimized to maximize only the task reward:

\[
\max_{\lambda} \; J^{task}\!\left( \theta^{*}(\lambda) \right),
\qquad \text{where } \theta^{*}(\lambda) = \arg\max_{\theta} \; \mathbb{E}_{\pi_\theta}\!\left[ \sum_t \left( r^{task}_t + \lambda\, r^{div}_t \right) \right]
\]

This creates a dependency chain. Changing \(\lambda\) changes the total reward, which changes the policy \(\pi\), which in turn changes the expected task reward. To update \(\lambda\), we need to compute the gradient of the task reward with respect to \(\lambda\).

Using the chain rule, we can decompose this gradient:

\[
\frac{\partial J^{task}}{\partial \lambda} \;=\; \left( \frac{\partial \theta}{\partial \lambda} \right)^{\!\top} \frac{\partial J^{task}}{\partial \theta}
\]

This asks: “How do the policy parameters \(\theta\) change when we change \(\lambda\), and how does that change in \(\theta\) affect the task performance?”

Through a derivation involving the policy gradient theorem (detailed in the paper’s appendix), the authors arrive at a practical update rule:

\[
\nabla_{\lambda} J^{task} \;\propto\; \mathbb{E}\!\left[ \nabla_{\theta} \log \pi_\theta(a \mid s, z)\, A^{task} \right]^{\top} \mathbb{E}\!\left[ \nabla_{\theta} \log \pi_\theta(a \mid s, z)\, A^{div} \right]
\]

Interpreting the Update Rule: Look at the equation above. It involves the dot product of two gradients:

  1. The gradient of the policy weighted by the Task Advantage (\(A^{task}\)).
  2. The gradient of the policy weighted by the Diversity Advantage (\(A^{div}\)).

Intuitively, this means:

  • If the gradient direction that improves the task is similar to the gradient direction that improves diversity (dot product is positive), then diversity is helping! Increase \(\lambda\).
  • If the task wants the policy to go one way, and diversity wants it to go the opposite way (dot product is negative), then diversity is hindering performance. Decrease \(\lambda\).

This allows the system to dynamically regulate exploration. Early in training, or when the robot is stuck, diversity might align with finding a solution. Once the task is solved, diversity might become a distraction, and \(\lambda\) will naturally decay.
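
To make the update rule tangible, here is a schematic sketch of the dot-product test described above. It is not the paper's exact estimator: the stand-in policy, the advantage placeholders, the learning rate, and the clamping of \(\lambda\) at zero are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Compare the policy-gradient direction for the task with the one for diversity,
# and move lambda in proportion to their dot product.
policy = nn.Linear(10, 4)                 # stand-in policy network (illustrative sizes)
params = list(policy.parameters())
obs = torch.randn(32, 10)
actions = torch.randn(32, 4)
A_task = torch.randn(32)                  # task advantage estimates (placeholders)
A_div = torch.randn(32)                   # diversity advantage estimates (placeholders)

dist = torch.distributions.Normal(policy(obs), 1.0)
log_prob = dist.log_prob(actions).sum(dim=-1)

def flat_grad(scalar):
    grads = torch.autograd.grad(scalar, params, retain_graph=True)
    return torch.cat([g.flatten() for g in grads])

g_task = flat_grad((log_prob * A_task).mean())   # policy gradient weighted by A_task
g_div = flat_grad((log_prob * A_div).mean())     # policy gradient weighted by A_div

lam, lr = 1.0, 0.01
# Aligned gradients -> diversity is helping the task -> increase lambda (and vice versa).
lam = max(0.0, lam + lr * torch.dot(g_task, g_div).item())   # clamping at 0 is an assumption
```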

Experimental Results

The authors tested SDAX on three standard parkour tasks—Leap, Climb, and Crawl—and one “super agile” task, Wall-Jump.

Simulation Performance

The baselines for comparison were:

  • Task Only: Standard PPO with only the task reward.
  • Div Only: Learning skills without any task goal.
  • RND: Using Random Network Distillation (a popular exploration bonus) instead of skill discovery.
  • SDAX (Ours): Using METRA or DIAYN for the diversity component.

The training curves show a clear advantage for SDAX (specifically with METRA).

Figure 3: Training curves showing the number of obstacles passed versus the number of training updates. Our method with METRA solves all the tasks and exhibits better sample efficiency.

As seen in Figure 3:

  • Leap (a): The “Task Only” (blue) baseline plateaus quickly. It likely falls into the gap and gives up. SDAX (green) keeps exploring different jump velocities until it clears the gap, achieving high success.
  • Crawl (c): This is tricky because the robot must lower its height. Standard rewards usually encourage standing upright to balance. SDAX discovers the “crouching” skill autonomously.

Visualizing Exploration

Does the robot actually learn diverse skills? Yes. During training, different \(z\) vectors produce different behaviors.

Figure 4: Visualization of the diverse skills explored by the robot during training.

In Figure 4, we see the robot trying different strategies for the same obstacle.

  • Leap: Some skills result in a short hop (failure), while others trigger a powerful launch (success).
  • Climb: Some skills cause the robot to slip (failure), while others find the traction to ascend (success).

Crucially, because \(\lambda\) is tuned to maximize the task reward, the system eventually gravitates toward the successful skills.

The Adaptive \(\lambda\) vs. Fixed \(\lambda\)

A major claim of the paper is that automatic tuning of \(\lambda\) is better than finding a “magic number.”

Figure 5: SDAX outperforms all baselines that use a fixed value of \(\lambda\).

Figure 5(a) shows that SDAX (green line) outperforms fixed \(\lambda\) values (purple, blue, cyan). Figure 5(b) reveals the behavior of the auto-tuned \(\lambda\). It starts high (encouraging exploration early on) and drastically drops as the robot begins to master the task. This confirms the hypothesis: explore first, then exploit.

The “Positive Collapse” Phenomenon

One might worry: If we train a conditioned policy \(\pi(a|s,z)\), do we need to hunt for the perfect ‘z’ vector at test time?

Surprisingly, the authors found that as training progresses, most skill vectors converge to the solution. They call this “positive collapse.” Because the task reward is shared across all skills, and the task reward eventually dominates, the policy learns that “solving the task” is the best way to get rewards, regardless of what \(z\) is.

Table 1 in the paper (not pictured here, but referenced in results) shows that for the Leap task, success rates across random \(z\) vectors go from 43% to 97% over time. This makes deployment easy: just sample a random \(z\).

The Boss Level: Wall-Jump

To push the limits, the researchers designed a Wall-Jump task. This requires the robot to run, jump at a wall, kick off it (changing orientation), and land.

Standard task rewards (tracking a guideline) failed because the robot would just crash into the wall, unable to figure out the complex orientation change required to kick off.

Figure 25: Our method enables robots to solve the wall-jump task.

By including the robot’s orientation (roll/pitch/yaw) in the skill discovery observation space, SDAX explored different angles of approach.

  • Panel (b): The task-only robot crashes into the wall.
  • Panels (c) & (d): The SDAX robot learns to tilt, kick, and backflip off the wall.
  • Panel (e): The green line shows SDAX achieving nearly double the reward of the baseline.

The resulting motion is highly dynamic and was learned without explicit motion capture demonstrations.

Figure 6: Wall-jump sequence visualization.

Real-World Deployment

Simulation is useful, but hardware is the truth. The authors took the policies trained in Isaac Gym and transferred them to a Unitree A1 robot. They used domain randomization (varying friction, mass, and motor strength) during training to ensure robustness.
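
For readers curious what domain randomization looks like in practice, here is an illustrative sketch; the parameter ranges below are placeholders of mine, not the values used in the paper.

```python
import random

# Illustrative domain-randomization sketch: each training episode samples new
# physics parameters so the policy cannot overfit to one simulator setting.
def sample_physics_params():
    return {
        "ground_friction": random.uniform(0.4, 1.2),
        "added_base_mass_kg": random.uniform(-0.5, 1.5),
        "motor_strength_scale": random.uniform(0.8, 1.2),
    }

params = sample_physics_params()   # applied to the simulated robot at episode reset
```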

The results transferred impressively.

Figure 7: Both the climbing (top) and crawling (bottom) policies’ skills were tested on the real robot.

The robot was able to climb 25cm platforms and crawl under 27cm obstacles. Furthermore, the crawling policy proved robust to different terrain friction, transitioning from wood to a rubber mat without retraining.

Figure 8: The crawling policy demonstrates robustness by successfully navigating under obstacles on different terrains: wood (top) and rubber mat (bottom).

Conclusion and Implications

SDAX represents a shift in how we think about “solving” robotic tasks. Instead of telling the robot exactly how to move (via reward shaping) or showing it what to do (via demonstrations), we simply tell it: “Solve this task, and try to do it in as many different ways as possible.”

By automating the balance between these two instructions via bi-level optimization, SDAX allows the robot to act as its own scientist—hypothesizing different movement strategies (skills), testing them against the environment, and adopting the ones that work.

Key Takeaways:

  1. Exploration via Skills: Using skill discovery provides semantically meaningful exploration (jumping vs. running) rather than just random noise.
  2. Auto-Tuning: The bi-level optimization of \(\lambda\) removes the need for tedious hyperparameter tuning.
  3. Sim-to-Real: The discovered skills are physically feasible and robust enough for real hardware.

This work paves the way for robots that can autonomously learn to navigate increasingly complex environments, reducing the burden on human engineers and allowing for more general-purpose agile machines.