Introduction: The Blank Slate Problem
In the natural world, animals are born with a capacity to learn that far outstrips that of our current robotic systems. A foal learns to stand, walk, and eventually run, not by following a hard-coded script, but by interacting with its environment, failing, adjusting, and discovering what works. When an animal gets injured, it instinctively adapts its gait to favor the injured limb. It doesn’t need a software update; it relies on a diverse repertoire of movement skills it has acquired over time.
Robotics has long struggled to replicate this adaptability. Traditional methods often rely on rigid, pre-defined behaviors. Even modern Reinforcement Learning (RL), which allows robots to learn from trial and error, typically focuses on optimizing a single, specific task—like walking forward as fast as possible. If the robot encounters a situation it wasn’t explicitly trained for, such as a broken motor or a slippery floor, that single optimized policy often fails catastrophically.
To bridge this gap, researchers from Imperial College London have introduced Unsupervised Real-world Skill Acquisition (URSA). This new framework allows quadrupedal robots to start as a tabula rasa—a blank slate—and autonomously discover a wide range of movement skills directly in the real world.

As shown in the heatmap above, URSA enables a robot to build a library of distinct behaviors (represented by the colored dots). By learning not just one way to move, but many, the robot gains the resilience to adapt to physical damage without human intervention. In this post, we will deconstruct how URSA works, the math behind its “curiosity,” and how it enables robots to survive damage that would incapacitate standard systems.
Background: Quality-Diversity and the Reality Gap
To understand URSA, we must first understand the paradigm it improves upon: Quality-Diversity (QD).
In standard optimization, the goal is to find the single global maximum—the highest peak on a fitness landscape. QD algorithms, however, ask a different question: “How many different high-performing solutions can we find?” For a robot, we don’t just want the fastest gait; we want a fast gait, a stable gait, a low-crouch gait, and a high-step gait. This diversity is the key to robustness.
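For intuition, here is what the canonical QD data structure looks like: an archive that keeps the best solution found so far for every region of behavior space. This is a minimal MAP-Elites-style sketch in Python (MAP-Elites is the classic QD algorithm; URSA's own repertoire mechanism works differently, as described later):

```python
import numpy as np

class QDArchive:
    """Keeps the best-performing solution per behavior-space cell (MAP-Elites style)."""

    def __init__(self, cells_per_dim: int = 10):
        self.cells_per_dim = cells_per_dim
        self.archive = {}  # cell index -> (solution, fitness)

    def _cell(self, behavior: np.ndarray) -> tuple:
        # Discretize a behavior descriptor in [0, 1]^d into a grid cell.
        idx = np.clip((behavior * self.cells_per_dim).astype(int),
                      0, self.cells_per_dim - 1)
        return tuple(idx)

    def insert(self, solution, behavior: np.ndarray, fitness: float) -> bool:
        """Add the solution if its cell is empty or it beats the incumbent."""
        cell = self._cell(behavior)
        if cell not in self.archive or fitness > self.archive[cell][1]:
            self.archive[cell] = (solution, fitness)
            return True
        return False
```

The payoff is the archive itself: instead of a single champion, optimization leaves behind a map of good solutions covering many distinct behaviors.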
The Challenge of the Real World
While QD algorithms are powerful, they have historically been trapped in simulation. Running a QD algorithm requires millions of trials to explore the behavior space. Doing this on a physical robot is impractical for two reasons:
- Sample Efficiency: It would take years of continuous running to collect enough data, during which the robot would wear out.
- Safety: Exploration involves failure. In a simulator, a robot flailing its legs is fine. In the real world, bad behaviors break expensive hardware.
URSA addresses these challenges by combining QD with World Models (specifically, the DayDreamer architecture) and Constrained Reinforcement Learning. This allows the robot to “imagine” the consequences of its actions to learn faster and apply strict safety constraints to prevent self-destruction.
The URSA Framework
The core philosophy of URSA is to decouple the discovery of skills from the supervision of skills. The robot isn’t told “lift your left leg”; it is given a reward for forward motion and stability, and it must figure out the mechanics itself.

The architecture, illustrated in Figure 2, operates as a continuous loop. Let’s break down the three critical pillars that make this system work:
- Unsupervised Skill Discovery (The “What”)
- Safety-Aware Optimization (The “How”)
- Imagination-Based Training (The “Where”)
1. Unsupervised Skill Discovery
How does a robot know it has discovered a “new” skill if we don’t define what a skill is? URSA solves this using a Variational Autoencoder (VAE).
The robot continuously observes its own state (joint angles, body orientation). A VAE compresses this high-dimensional data into a compact, low-dimensional latent space (denoted as \(z\)).
- Encoding: The robot takes a complex physical state and maps it to a point in the latent space.
- The Repertoire: URSA maintains a “repertoire” (\(\mathcal{R}\)) of these latent points. Every time the robot tries a movement that results in a state significantly different from what it has seen before, it adds this new experience to its repertoire.
The definition of a “skill” essentially becomes: A specific vector \(z\) in the latent space that the robot tries to reproduce.
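A minimal sketch of this encode-and-store loop, assuming an already-trained VAE encoder (the `encoder` call signature, the Euclidean novelty test, and the `min_dist` threshold are illustrative assumptions, not the paper's exact mechanism):

```python
import torch

@torch.no_grad()
def maybe_add_skill(encoder, repertoire: list, state: torch.Tensor,
                    min_dist: float = 0.5) -> bool:
    """Encode a robot state and add it to the repertoire if it is novel enough."""
    mu, _logvar = encoder(state)   # VAE encoder -> parameters of q(z|s)
    z = mu                         # use the posterior mean as the skill vector
    if not repertoire:
        repertoire.append(z)
        return True
    # Novelty check: distance to the nearest skill already in the repertoire.
    dists = torch.stack([torch.norm(z - z_k) for z_k in repertoire])
    if dists.min() > min_dist:
        repertoire.append(z)
        return True
    return False
```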
To ensure the robot keeps learning new things rather than repeating known comfortable moves, URSA employs a Kernel Density Estimator (KDE). The KDE creates a probability distribution over the skills the robot has already mastered. To push the boundaries of its abilities, the system samples target skills from this distribution, effectively asking the robot to reach specific areas of the behavior space.
Mathematically, the system maximizes the entropy of the skill distribution. By trying to flatten the distribution (make it uniform), the robot is forced to explore under-represented areas of the skill space. The researchers derive a lower bound for this entropy to make the objective tractable to optimize.

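One standard way to obtain such a bound, shown here for intuition (this illustrative form assumes a Gaussian KDE with bandwidth \(h\) and is not necessarily the paper's exact expression), uses the fact that Shannon entropy is lower-bounded by the Rényi-2 (collision) entropy, which has a closed form under a Gaussian KDE:

\[
\mathcal{H}(p) \;\ge\; -\log \int p(z)^2 \, dz \;=\; -\log \Bigg( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}\big(z_i - z_j;\; 0,\; 2h^2 I\big) \Bigg)
\]

The right-hand side grows as the repertoire points \(z_i\) spread apart, so it can be maximized directly.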
A bound of this kind might look intimidating, but its function is intuitive: it pushes the robot to maximize the distance between the skills in its repertoire (diversity) while ensuring it can reliably reproduce them.
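In code, the sampling step over the repertoire's density might look like the sketch below. It uses SciPy's Gaussian KDE; biasing the choice toward low-density candidates is one plausible reading of "exploring under-represented areas," not a claim about URSA's exact sampler:

```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_target_skill(repertoire: np.ndarray, n_candidates: int = 256) -> np.ndarray:
    """Pick a target skill z, biased toward sparsely covered regions of skill space.

    repertoire: array of shape (n_skills, latent_dim) of previously stored skills.
    """
    kde = gaussian_kde(repertoire.T)            # SciPy expects shape (dim, n_points)
    candidates = kde.resample(n_candidates).T   # perturbed samples around known skills
    density = kde(candidates.T)                 # estimated density at each candidate
    return candidates[np.argmin(density)]       # target the least-covered region
```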
2. Safety-Aware Optimization
Exploration is dangerous. To prevent the robot from learning skills that involve falling over or stripping gears, URSA treats safety as a hard constraint rather than just a negative reward.
This is framed as a Constrained Markov Decision Process (CMDP). The robot must maximize its reward (moving forward) subject to two constraints:
- Skill Matching: The robot’s behavior must match the target skill \(z\) it is trying to perform.
- Safety: The robot must stay within a set of “safe states” (e.g., staying upright).
The optimization problem is solved using a Lagrangian method. In simple terms, the system has “budget” parameters (Lagrange multipliers) that dynamically adjust priorities. If the robot is in danger of falling, the “Safety” multiplier spikes, forcing the neural network to prioritize staying upright over moving fast or matching the skill.
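The multiplier dynamics can be sketched as a plain dual-ascent update (an illustrative rule; the names and the budget formulation here are assumptions rather than URSA's exact implementation):

```python
def update_multiplier(lmbda: float, avg_cost: float, budget: float,
                      lr: float = 0.01) -> float:
    """Dual ascent: grow the multiplier while the safety budget is exceeded.

    avg_cost: average safety-constraint violation over recent rollouts.
    budget:   allowed level of violation (0 for a hard "never fall" constraint).
    """
    lmbda = lmbda + lr * (avg_cost - budget)
    return max(0.0, lmbda)  # multipliers are non-negative by definition
```

When the constraint is satisfied, `avg_cost - budget` is negative, so the multiplier decays back toward zero and the reward term dominates again.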
The resulting objective combines the task value with penalty terms for the two constraints.

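As a hedged sketch of that Lagrangian (the multiplier names \(\lambda_{\text{skill}}\) and \(\lambda_{\text{safe}}\) and the mismatch measure \(d\) are illustrative stand-ins for the paper's notation):

\[
\mathcal{L}(\pi, \lambda) \;=\; V(s, z) \;-\; \lambda_{\text{skill}} \, d\big(\psi(s), z\big) \;-\; \lambda_{\text{safe}} \, C(s)
\]

The policy is trained to maximize \(\mathcal{L}\), while the multipliers are adjusted to enforce the constraints.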
Here, \(V(s, z)\) is the value (expected reward), while the subtracted terms are the penalties for failing to match the target skill (measured through the successor features \(\psi\)) and for violating the safety constraint (\(C\)). This ensures the robot only adds skills it can perform safely to its repertoire.
3. Imagination-Based Training
To learn efficiently, URSA does not rely solely on physical movement. It uses the DayDreamer world model.
The robot builds a neural network model of the physics of the world. Once this model is accurate enough, the robot can pause its physical body and run thousands of “imagined” trajectories in a fraction of a second on its GPU.

As visualized above, the system samples a skill \(z\), and then the World Model (\(\mathcal{W}\)) predicts a sequence of future states (\(s_1, s_2, ...\)). The policy (\(\pi\)) and critics are updated based on these dreams. This allows URSA to learn complex behaviors with only about 5 hours of real-world data—a fraction of what model-free methods require.
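Schematically, one imagination update looks like the following sketch, where `world_model`, `policy`, and `critic` are hypothetical stand-ins for URSA's actual modules rather than DayDreamer's real API:

```python
def imagination_update(world_model, policy, critic, repertoire, horizon: int = 15):
    """One "dream" update: roll the policy inside the learned model, no robot needed.

    All components here are hypothetical stand-ins for the system's real modules.
    """
    z = sample_target_skill(repertoire)    # pick a skill to practice (earlier sketch)
    s = world_model.initial_state()        # latent state to start dreaming from
    trajectory = []
    for _ in range(horizon):
        a = policy.act(s, z)               # skill-conditioned action
        s, r, c = world_model.step(s, a)   # imagined next state, reward, safety cost
        trajectory.append((s, a, r, c))
    policy.update(trajectory, z)           # policy and critics learn from the dream
    critic.update(trajectory, z)
```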
Experimental Results
The researchers deployed URSA on a Unitree A1 quadruped robot. The goal was to see if the robot could learn to walk forward in diverse ways without being explicitly told how to coordinate its legs.
RQ1: Did it learn diverse skills?
The primary baseline comparisons were DayDreamer (which just tries to maximize reward, usually resulting in one optimal gait) and DOMiNiC (another diversity-seeking algorithm).
The results were striking. While DayDreamer converged to a single, high-performance gait, URSA filled the behavior space with a variety of movement strategies.

Figure 3 illustrates the coverage of joint angles. The DayDreamer agent (center) occupies a tiny fraction of the space—it found one way to walk and stuck to it. URSA (left) explored a massive range of joint configurations. It learned high-stepping gaits, low-crouching gaits, and various rhythmic patterns, all autonomously.
RQ2: The Ultimate Test — Damage Adaptation
The true value of diversity is revealed when things go wrong. The researchers simulated severe damage to the robot: locking joints, cutting power to a thigh motor, or even disabling an entire leg.
Because DayDreamer only knew one way to walk, it failed catastrophically when damage made that specific gait physically impossible. URSA, however, had a library of thousands of skills. By combining URSA with a rapid adaptation algorithm called Intelligent Trial and Error (ITE), the robot could quickly test different skills from its repertoire to see which ones still worked.
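The adaptation loop itself is compact. Below is an ITE-flavored sketch (simplified: the original ITE algorithm refines a Gaussian-process prior over the whole repertoire, whereas here a plain replace-prior-with-observation update stands in, and all names are illustrative):

```python
import numpy as np

def adapt_after_damage(robot, repertoire, prior_perf, n_trials: int = 20,
                       alpha: float = 0.9):
    """Try the most promising skills on the damaged robot, updating beliefs as we go.

    prior_perf: performance each skill achieved before damage (from training).
    """
    expected = np.array(prior_perf, dtype=float)
    best_skill, best_perf = None, -np.inf
    for _ in range(n_trials):
        i = int(np.argmax(expected))                       # most promising skill now
        perf = robot.execute_and_measure(repertoire[i])    # one real-world trial
        expected[i] = perf                                 # observation overrides prior
        if perf > best_perf:
            best_skill, best_perf = repertoire[i], perf
        if best_perf >= alpha * expected.max():            # close enough to the best hope
            return best_skill
    return best_skill
```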

In the simulation results above (Figure 4), notice the “Without BR (Back Right) Thigh” scenario. The baseline DayDreamer (pink) collapses to near-zero performance. URSA (blue bars), however, maintains high performance. It simply switches to a gait that relies less on the back right thigh.
This resilience held up in the physical world as well.

In real-world tests (Figure 5, Left), URSA consistently outperformed DayDreamer when legs were disabled. The middle graph shows the adaptation process: initially, performance drops (the robot tries to walk normally and fails), but within just a few trials (iterations), it finds a new, effective gait, and performance recovers.
RQ3: Controllability
Finally, one might wonder if these “diverse skills” are actually useful or just random flailing. The researchers tested this by conditioning the skills on forward and angular velocity—essentially asking the robot, “Can you move forward at exactly 0.5 m/s while turning left?”

Figure 6 shows the tracking error. The dark blue regions indicate low error, meaning the robot could precisely control its velocity across a wide range of commands. The spread of the dots on the left plot confirms that the robot learned to cover the entire velocity space, from standing still to fast walking, and from straight lines to sharp turns.
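Concretely, "tracking error" here is just the gap between commanded and achieved velocity. A minimal illustrative metric (not necessarily the paper's exact formula):

```python
import numpy as np

def tracking_error(commanded: np.ndarray, measured: np.ndarray) -> float:
    """Mean Euclidean gap between commanded and achieved (v_forward, v_angular)."""
    return float(np.linalg.norm(measured - commanded, axis=-1).mean())
```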
Conclusion
The URSA framework represents a significant step away from rigid, task-specific robot programming and toward emergent autonomy. By combining the safety of constrained RL, the efficiency of world models, and the curiosity of Quality-Diversity algorithms, the authors have created a system that prepares robots for the unexpected.
The key takeaways from this work are:
- Diversity is Security: A robot that knows only one way to move is fragile. A robot that knows 1,000 ways to move is robust.
- Unsupervised Learning Works on Hardware: We don’t need to hand-code skills. With the right architecture, robots can discover them.
- Imagination Saves Time: “Dreaming” in a world model allows for the massive data throughput needed for deep learning without grinding physical gears to dust.
As we move toward robots that must operate in unstructured environments—from disaster zones to domestic households—the ability to adapt to damage and changing conditions will be more valuable than raw performance on a single metric. URSA demonstrates that the path to this adaptability lies in letting robots explore, learn, and diversify on their own terms.