Introduction: The Blank Slate Problem
In the natural world, animals are born with a capacity to learn that far outstrips that of our current robotic systems. A foal learns to stand, walk, and eventually run, not by following a hard-coded script, but by interacting with its environment, failing, adjusting, and discovering what works. When an animal gets injured, it instinctively adapts its gait to favor the injured limb. It doesn’t need a software update; it relies on a diverse repertoire of movement skills it has acquired over time.
Robotics has long struggled to replicate this adaptability. Traditional methods often rely on rigid, pre-defined behaviors. Even modern Reinforcement Learning (RL), which allows robots to learn from trial and error, typically focuses on optimizing a single, specific task—like walking forward as fast as possible. If the robot encounters a situation it wasn’t explicitly trained for, such as a broken motor or a slippery floor, that single optimized policy often fails catastrophically.
To bridge this gap, researchers from Imperial College London have introduced Unsupervised Real-world Skill Acquisition (URSA). This new framework allows quadrupedal robots to start as a tabula rasa—a blank slate—and autonomously discover a wide range of movement skills directly in the real world.

As shown in the heatmap above, URSA enables a robot to build a library of distinct behaviors (represented by the colored dots). By learning not just one way to move, but many, the robot gains the resilience to adapt to physical damage without human intervention. In this post, we will deconstruct how URSA works, the math behind its “curiosity,” and how it enables robots to survive damage that would incapacitate standard systems.
Background: Quality-Diversity and the Reality Gap
To understand URSA, we must first understand the paradigm it improves upon: Quality-Diversity (QD).
In standard optimization, the goal is to find the single global maximum—the highest peak on a fitness landscape. QD algorithms, however, ask a different question: “How many different high-performing solutions can we find?” For a robot, we don’t just want the fastest gait; we want a fast gait, a stable gait, a low-crouch gait, and a high-step gait. This diversity is the key to robustness.
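For intuition, here is what the canonical QD data structure looks like: an archive that keeps the best solution found so far for every region of behavior space. This is a minimal MAP-Elites-style sketch in Python (MAP-Elites is the classic QD algorithm; URSA's own repertoire mechanism works differently, as described later):

```python
import numpy as np

class QDArchive:
    """Keeps the best-performing solution per behavior-space cell (MAP-Elites style)."""

    def __init__(self, cells_per_dim: int = 10):
        self.cells_per_dim = cells_per_dim
        self.archive = {}  # cell index -> (solution, fitness)

    def _cell(self, behavior: np.ndarray) -> tuple:
        # Discretize a behavior descriptor in [0, 1]^d into a grid cell.
        idx = np.clip((behavior * self.cells_per_dim).astype(int),
                      0, self.cells_per_dim - 1)
        return tuple(idx)

    def insert(self, solution, behavior: np.ndarray, fitness: float) -> bool:
        """Add the solution if its cell is empty or it beats the incumbent."""
        cell = self._cell(behavior)
        if cell not in self.archive or fitness > self.archive[cell][1]:
            self.archive[cell] = (solution, fitness)
            return True
        return False
```

The payoff is the archive itself: instead of a single champion, optimization leaves behind a map of good solutions covering many distinct behaviors.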
The Challenge of the Real World
While QD algorithms are powerful, they have historically been trapped in simulation. Running a QD algorithm requires millions of trials to explore the behavior space. Doing this on a physical robot is impractical for two reasons:
- Sample Efficiency: It would take years of continuous running to collect enough data, during which the robot would wear out.
- Safety: Exploration involves failure. In a simulator, a robot flailing its legs is fine. In the real world, bad behaviors break expensive hardware.
URSA addresses these challenges by combining QD with World Models (specifically, the DayDreamer architecture) and Constrained Reinforcement Learning. This allows the robot to “imagine” the consequences of its actions to learn faster and apply strict safety constraints to prevent self-destruction.
The URSA Framework
The core philosophy of URSA is to decouple the discovery of skills from the supervision of skills. The robot isn’t told “lift your left leg”; it is given a reward for forward motion and stability, and it must figure out the mechanics itself.

The architecture, illustrated in Figure 2, operates as a continuous loop. Let’s break down the three critical pillars that make this system work:
- Unsupervised Skill Discovery (The “What”)
- Safety-Aware Optimization (The “How”)
- Imagination-Based Training (The “Where”)
1. Unsupervised Skill Discovery
How does a robot know it has discovered a “new” skill if we don’t define what a skill is? URSA solves this using a Variational Autoencoder (VAE).
The robot continuously observes its own state (joint angles, body orientation). A VAE compresses this high-dimensional data into a compact, low-dimensional latent space (denoted as \(z\)).
- Encoding: The robot takes a complex physical state and maps it to a point in the latent space.
- The Repertoire: URSA maintains a “repertoire” (\(\mathcal{R}\)) of these latent points. Every time the robot tries a movement that results in a state significantly different from what it has seen before, it adds this new experience to its repertoire.
The definition of a “skill” essentially becomes: A specific vector \(z\) in the latent space that the robot tries to reproduce.
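A minimal sketch of this encode-and-store loop, assuming an already-trained VAE encoder (the `encoder` call signature, the Euclidean novelty test, and the `min_dist` threshold are illustrative assumptions, not the paper's exact mechanism):

```python
import torch

@torch.no_grad()
def maybe_add_skill(encoder, repertoire: list, state: torch.Tensor,
                    min_dist: float = 0.5) -> bool:
    """Encode a robot state and add it to the repertoire if it is novel enough."""
    mu, _logvar = encoder(state)   # VAE encoder -> parameters of q(z|s)
    z = mu                         # use the posterior mean as the skill vector
    if not repertoire:
        repertoire.append(z)
        return True
    # Novelty check: distance to the nearest skill already in the repertoire.
    dists = torch.stack([torch.norm(z - z_k) for z_k in repertoire])
    if dists.min() > min_dist:
        repertoire.append(z)
        return True
    return False
```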
To ensure the robot keeps learning new things rather than repeating known comfortable moves, URSA employs a Kernel Density Estimator (KDE). The KDE creates a probability distribution over the skills the robot has already mastered. To push the boundaries of its abilities, the system samples target skills from this distribution, effectively asking the robot to reach specific areas of the behavior space.
Mathematically, the system maximizes the entropy of the skill distribution. By trying to flatten the distribution (make it uniform), the robot is forced to explore under-represented areas of the skill space. The researchers derive a lower bound for this entropy to make the objective tractable to optimize.

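One standard way to obtain such a bound, shown here for intuition (this illustrative form assumes a Gaussian KDE with bandwidth \(h\) and is not necessarily the paper's exact expression), uses the fact that Shannon entropy is lower-bounded by the Rényi-2 (collision) entropy, which has a closed form under a Gaussian KDE:

\[
\mathcal{H}(p) \;\ge\; -\log \int p(z)^2 \, dz \;=\; -\log \Bigg( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}\big(z_i - z_j;\; 0,\; 2h^2 I\big) \Bigg)
\]

The right-hand side grows as the repertoire points \(z_i\) spread apart, so it can be maximized directly.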
A bound of this kind might look intimidating, but its function is intuitive: it pushes the robot to maximize the distance between the skills in its repertoire (diversity) while ensuring it can reliably reproduce them.
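In code, the sampling step over the repertoire's density might look like the sketch below. It uses SciPy's Gaussian KDE; biasing the choice toward low-density candidates is one plausible reading of "exploring under-represented areas," not a claim about URSA's exact sampler:

```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_target_skill(repertoire: np.ndarray, n_candidates: int = 256) -> np.ndarray:
    """Pick a target skill z, biased toward sparsely covered regions of skill space.

    repertoire: array of shape (n_skills, latent_dim) of previously stored skills.
    """
    kde = gaussian_kde(repertoire.T)            # SciPy expects shape (dim, n_points)
    candidates = kde.resample(n_candidates).T   # perturbed samples around known skills
    density = kde(candidates.T)                 # estimated density at each candidate
    return candidates[np.argmin(density)]       # target the least-covered region
```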
2. Safety-Aware Optimization
Exploration is dangerous. To prevent the robot from learning skills that involve falling over or stripping gears, URSA treats safety as a hard constraint rather than just a negative reward.
This is framed as a Constrained Markov Decision Process (CMDP). The robot must maximize its reward (moving forward) subject to two constraints:
- Skill Matching: The robot’s behavior must match the target skill \(z\) it is trying to perform.
- Safety: The robot must stay within a set of “safe states” (e.g., staying upright).
The optimization problem is solved using a Lagrangian method. In simple terms, the system has “budget” parameters (Lagrange multipliers) that dynamically adjust priorities. If the robot is in danger of falling, the “Safety” multiplier spikes, forcing the neural network to prioritize staying upright over moving fast or matching the skill.
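The multiplier dynamics can be sketched as a plain dual-ascent update (an illustrative rule; the names and the budget formulation here are assumptions rather than URSA's exact implementation):

```python
def update_multiplier(lmbda: float, avg_cost: float, budget: float,
                      lr: float = 0.01) -> float:
    """Dual ascent: grow the multiplier while the safety budget is exceeded.

    avg_cost: average safety-constraint violation over recent rollouts.
    budget:   allowed level of violation (0 for a hard "never fall" constraint).
    """
    lmbda = lmbda + lr * (avg_cost - budget)
    return max(0.0, lmbda)  # multipliers are non-negative by definition
```

When the constraint is satisfied, `avg_cost - budget` is negative, so the multiplier decays back toward zero and the reward term dominates again.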
The resulting objective combines the task value with penalty terms for the two constraints.

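As a hedged sketch of that Lagrangian (the multiplier names \(\lambda_{\text{skill}}\) and \(\lambda_{\text{safe}}\) and the mismatch measure \(d\) are illustrative stand-ins for the paper's notation):

\[
\mathcal{L}(\pi, \lambda) \;=\; V(s, z) \;-\; \lambda_{\text{skill}} \, d\big(\psi(s), z\big) \;-\; \lambda_{\text{safe}} \, C(s)
\]

The policy is trained to maximize \(\mathcal{L}\), while the multipliers are adjusted to enforce the constraints.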
Here, \(V(s, z)\) is the value (expected reward), while the subtracted terms are the penalties for failing to match the target skill (measured through the successor features \(\psi\)) and for violating the safety constraint (\(C\)). This ensures the robot only adds skills it can perform safely to its repertoire.
3. Imagination-Based Training
To learn efficiently, URSA does not rely solely on physical movement. It uses the DayDreamer world model.
The robot builds a neural network model of the physics of the world. Once this model is accurate enough, the robot can pause its physical body and run thousands of “imagined” trajectories in a fraction of a second on its GPU.

As visualized above, the system samples a skill \(z\), and then the World Model (\(\mathcal{W}\)) predicts a sequence of future states (\(s_1, s_2, ...\)). The policy (\(\pi\)) and critics are updated based on these dreams. This allows URSA to learn complex behaviors with only about 5 hours of real-world data—a fraction of what model-free methods require.
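Schematically, one imagination update looks like the following sketch, where `world_model`, `policy`, and `critic` are hypothetical stand-ins for URSA's actual modules rather than DayDreamer's real API:

```python
def imagination_update(world_model, policy, critic, repertoire, horizon: int = 15):
    """One "dream" update: roll the policy inside the learned model, no robot needed.

    All components here are hypothetical stand-ins for the system's real modules.
    """
    z = sample_target_skill(repertoire)    # pick a skill to practice (earlier sketch)
    s = world_model.initial_state()        # latent state to start dreaming from
    trajectory = []
    for _ in range(horizon):
        a = policy.act(s, z)               # skill-conditioned action
        s, r, c = world_model.step(s, a)   # imagined next state, reward, safety cost
        trajectory.append((s, a, r, c))
    policy.update(trajectory, z)           # policy and critics learn from the dream
    critic.update(trajectory, z)
```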
Experimental Results
The researchers deployed URSA on a Unitree A1 quadruped robot. The goal was to see if the robot could learn to walk forward in diverse ways without being explicitly told how to coordinate its legs.
RQ1: Did it learn diverse skills?
The primary baseline comparisons were DayDreamer (which just tries to maximize reward, usually resulting in one optimal gait) and DOMiNiC (another diversity-seeking algorithm).
The results were striking. While DayDreamer converged to a single, high-performance gait, URSA filled the behavior space with a variety of movement strategies.

Figure 3 illustrates the coverage of joint angles. The DayDreamer agent (center) occupies a tiny fraction of the space—it found one way to walk and stuck to it. URSA (left) explored a massive range of joint configurations. It learned high-stepping gaits, low-crouching gaits, and various rhythmic patterns, all autonomously.
RQ2: The Ultimate Test — Damage Adaptation
The true value of diversity is revealed when things go wrong. The researchers simulated severe damage to the robot: locking joints, cutting power to a thigh motor, or even disabling an entire leg.
Because DayDreamer only knew one way to walk, it failed catastrophically when damage made that specific gait physically impossible. URSA, however, had a library of thousands of skills. By combining URSA with a rapid adaptation algorithm called Intelligent Trial and Error (ITE), the robot could quickly test different skills from its repertoire to see which ones still worked.
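The adaptation loop itself is compact. Below is an ITE-flavored sketch (simplified: the original ITE algorithm refines a Gaussian-process prior over the whole repertoire, whereas here a plain replace-prior-with-observation update stands in, and all names are illustrative):

```python
import numpy as np

def adapt_after_damage(robot, repertoire, prior_perf, n_trials: int = 20,
                       alpha: float = 0.9):
    """Try the most promising skills on the damaged robot, updating beliefs as we go.

    prior_perf: performance each skill achieved before damage (from training).
    """
    expected = np.array(prior_perf, dtype=float)
    best_skill, best_perf = None, -np.inf
    for _ in range(n_trials):
        i = int(np.argmax(expected))                       # most promising skill now
        perf = robot.execute_and_measure(repertoire[i])    # one real-world trial
        expected[i] = perf                                 # observation overrides prior
        if perf > best_perf:
            best_skill, best_perf = repertoire[i], perf
        if best_perf >= alpha * expected.max():            # close enough to the best hope
            return best_skill
    return best_skill
```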

In the simulation results above (Figure 4), notice the “Without BR (Back Right) Thigh” scenario. The baseline DayDreamer (pink) collapses to near-zero performance. URSA (blue bars), however, maintains high performance. It simply switches to a gait that relies less on the back right thigh.
This resilience held up in the physical world as well.

In real-world tests (Figure 5, Left), URSA consistently outperformed DayDreamer when legs were disabled. The middle graph shows the adaptation process: initially, performance drops (the robot tries to walk normally and fails), but within just a few trials (iterations), it finds a new, effective gait, and performance recovers.
RQ3: Controllability
Finally, one might wonder if these “diverse skills” are actually useful or just random flailing. The researchers tested this by conditioning the skills on forward and angular velocity—essentially asking the robot, “Can you move forward at exactly 0.5 m/s while turning left?”

Figure 6 shows the tracking error. The dark blue regions indicate low error, meaning the robot could precisely control its velocity across a wide range of commands. The spread of the dots on the left plot confirms that the robot learned to cover the entire velocity space, from standing still to fast walking, and from straight lines to sharp turns.
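Concretely, "tracking error" here is just the gap between commanded and achieved velocity. A minimal illustrative metric (not necessarily the paper's exact formula):

```python
import numpy as np

def tracking_error(commanded: np.ndarray, measured: np.ndarray) -> float:
    """Mean Euclidean gap between commanded and achieved (v_forward, v_angular)."""
    return float(np.linalg.norm(measured - commanded, axis=-1).mean())
```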
Conclusion
The URSA framework represents a significant step away from rigid, task-specific robot programming and toward emergent autonomy. By combining the safety of constrained RL, the efficiency of world models, and the curiosity of Quality-Diversity algorithms, the authors have created a system that prepares robots for the unexpected.
The key takeaways from this work are:
- Diversity is Security: A robot that knows only one way to move is fragile. A robot that knows 1,000 ways to move is robust.
- Unsupervised Learning Works on Hardware: We don’t need to hand-code skills. With the right architecture, robots can discover them.
- Imagination Saves Time: “Dreaming” in a world model allows for the massive data throughput needed for deep learning without grinding physical gears to dust.
As we move toward robots that must operate in unstructured environments—from disaster zones to domestic households—the ability to adapt to damage and changing conditions will be more valuable than raw performance on a single metric. URSA demonstrates that the path to this adaptability lies in letting robots explore, learn, and diversify on their own terms.