Imagine trying to learn a new dance move. You don’t read a physics textbook about torque and angular velocity; you simply watch someone else do it, and you try to mimic them. You look at their feet, the rhythm of their steps, and you adjust your body until you match what you see.

For a long time, teaching robots to move has been the exact opposite. Engineers have had to painstakingly define “rewards” using complex mathematical functions. If you want a robot dog to trot, you have to mathematically describe what a trot looks like—defining foot heights, joint velocities, and contact timings—and then use Reinforcement Learning (RL) to “train” the robot to maximize that score. It is a brittle, time-consuming process that often requires expensive Motion Capture (MoCap) labs.

But what if a robot could learn like we do? What if it could just watch a video of a dog running and say, “Okay, I get it,” and then teach itself to do the same?

That is the premise behind SDS (“See it, Do it, Sorted”), a new framework from the Robot Perception Lab at University College London. This paper introduces a pipeline that allows quadrupedal robots (like the Unitree Go1) to learn diverse locomotion skills from a single video demonstration—without human labels, MoCap data, or manual reward engineering.

Figure 1: SDS’s ability to imitate demonstrated skills (top), in simulation (center), and real-world (bottom). The blue tape corresponds to the rear legs, and the red tape to the left-side legs.

In this deep dive, we will explore how SDS leverages the power of GPT-4o to bridge the gap between visual observation and motor control, effectively giving robots the ability to “see, do, and sort” their own skills.

The Bottleneck in Robot Learning

To understand why SDS is significant, we first need to look at how legged locomotion is usually taught.

In Deep Reinforcement Learning (DRL), an agent (the robot) explores an environment and receives a “reward” signal telling it how well it’s doing. The “Reward Function” is the critical piece of code that defines the task.

  • Too simple? The robot finds a loophole (e.g., diving forward instead of running).
  • Too complex? The robot never figures out how to get a positive score.

Traditionally, these functions are hand-coded by experts. If you want to change the gait from a “trot” to a “bound,” you have to rewrite the math. Recent attempts to automate this using Large Language Models (LLMs) like Eureka have shown promise, but they typically rely on text descriptions (“Write a reward for a backflip”). They don’t see the movement.

SDS changes this by creating a Visual-Language feedback loop. It uses GPT-4o not just to write code, but to act as a visual critic, comparing the robot’s attempts against the original video and refining the code iteratively.

The SDS Pipeline: An Overview

The SDS method is a closed-loop system. It doesn’t just look at the video once; it constantly observes the robot’s progress and updates its instructions.

Figure 2: SDS Method Overview. The process cycles from video input to reward generation, training, and evaluation.

As shown in the overview above, the process can be broken down into seven distinct steps:

  1. Input & Analysis: The system takes a raw video and processes it into a format the AI can understand (\(G_v\)).
  2. Reward Generation: GPT-4o writes a candidate Reward Function (\(\mathcal{RF}\)) in Python.
  3. Training: The robot trains in a physics simulator (Isaac Gym) using that function.
  4. Feedback Recording: The system records the robot’s attempt, capturing video frames (\(G_s\)) and contact patterns.
  5. Evaluation: GPT-4o watches the simulation footage and compares it to the original video.
  6. Selection: The best performing reward function is kept.
  7. Evolution: The system refines the code based on the visual feedback and repeats the loop.
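
To make step 2 concrete, here is the flavor of a candidate reward function for a trotting gait. This is a hedged sketch, not the paper's generated code: the function name, tensor shapes, target velocity, and weights are all illustrative assumptions.

```python
import torch

def compute_trot_reward(base_lin_vel, base_ang_vel, foot_contacts, foot_heights):
    """Illustrative candidate reward for a trot (hypothetical, not SDS output).

    base_lin_vel:  (N, 3) base linear velocity for N parallel simulated robots
    base_ang_vel:  (N, 3) base angular velocity
    foot_contacts: (N, 4) boolean contact flags, ordered [FL, FR, RL, RR]
    foot_heights:  (N, 4) foot heights above the ground plane (meters)
    """
    # Track an assumed target forward velocity of 1.0 m/s.
    vel_reward = torch.exp(-4.0 * (base_lin_vel[:, 0] - 1.0) ** 2)

    # Trot = diagonal pairs move together: FL+RR and FR+RL share contact state.
    diag_sync = (foot_contacts[:, 0] == foot_contacts[:, 3]).float() \
              + (foot_contacts[:, 1] == foot_contacts[:, 2]).float()
    gait_reward = 0.5 * diag_sync

    # Reward swing-foot clearance (capped at 5 cm) to discourage foot-dragging.
    swing = (~foot_contacts).float()
    clearance = torch.mean(swing * torch.clamp(foot_heights, max=0.05), dim=1)

    # Penalize roll/pitch rates so the torso stays level.
    stability_penalty = 0.1 * torch.sum(base_ang_vel[:, :2] ** 2, dim=1)

    return vel_reward + gait_reward + 2.0 * clearance - stability_penalty
```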

Let’s break down the technical innovations that make this possible.

Innovation 1: The Spatio-Temporal Grid (\(\mathcal{G}_v\))

One of the biggest challenges in using Visual-Language Models (VLMs) like GPT-4o for video is that they don’t actually “watch” video in the way humans do. They process tokens. Feeding a full high-framerate video into a model is computationally expensive and often results in the model hallucinating details or losing track of time.

SDS solves this with a clever preprocessing step called Grid-Frame Prompting.

Instead of a video file, the system converts the demonstration into a single static image containing a grid of frames. This grid, denoted as \(\mathcal{G}_v\), captures the temporal evolution of the movement in a spatial layout.

Figure 7: Demonstration videos arranged in a grid formation (\(\mathcal{G}_v\)), serving as input to GPT-4o for SDS processing.

The researchers don’t just pick random frames. They use an adaptive sampling strategy based on the velocity of the movement. If the animal is moving slowly, the system samples frames more densely to capture subtle shifts. If the movement is fast, the sampling is spread out.

The number of frames (\(n\)) and the time interval (\(\tau\)) are calculated as:

Equation for frame sampling based on velocity and duration.

Here, \(T\) is the video duration and \(v\) is the velocity. The frames are arranged into a square grid (e.g., \(4 \times 4\) or \(6 \times 6\)). This allows GPT-4o to analyze the entire gait cycle in a single “glance” without needing a video memory buffer.
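
As a rough illustration, here is how such a velocity-adaptive grid could be assembled with OpenCV. The proportionality constant `k`, the minimum frame count, and the uniform frame selection are assumptions made for this sketch; the paper's exact formula and constants may differ.

```python
import math
import numpy as np
import cv2

def make_grid_image(video_path: str, velocity: float, k: float = 0.05) -> np.ndarray:
    """Sample frames adaptively and tile them into one square grid image."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    T = total / fps                          # video duration (seconds)

    tau = max(k * velocity, 1e-3)            # faster motion -> wider interval
    n = max(4, int(T / tau))                 # slower motion -> more frames
    side = math.ceil(math.sqrt(n))           # square grid, e.g. 4x4 or 6x6
    cells = side * side

    frames = []
    for i in range(cells):                   # uniform selection over the clip
        cap.set(cv2.CAP_PROP_POS_FRAMES, min(int(i * total / cells), total - 1))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    while len(frames) < cells:               # pad with black if any read failed
        frames.append(np.zeros_like(frames[0]))

    rows = [np.hstack(frames[r * side:(r + 1) * side]) for r in range(side)]
    return np.vstack(rows)                   # single image GPT-4o can ingest
```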

To further help the AI understand what it’s looking at, SDS augments these grids with ViTPose++, a pose-estimation model that detects the animal’s joints and overlays a skeletal stick figure onto each frame.

Figure 4: ViTPose++ Estimation on the demonstration frame and corresponding simulation frame.

By providing the “skeleton” view, SDS reduces ambiguity. The AI doesn’t have to guess where the knee is; it is explicitly drawn on the image.
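
For intuition, the overlay step is straightforward once a pose estimator has produced pixel keypoints. The edge list below is hypothetical; ViTPose++'s actual animal keypoint layout comes from its dataset configuration (e.g., AP-10K) and should be read from there.

```python
import cv2
import numpy as np

# Hypothetical skeleton edges between keypoint indices; the real layout
# should be taken from the pose model's keypoint definition.
EDGES = [(0, 1), (1, 2), (3, 4), (4, 5), (6, 7), (7, 8), (9, 10), (10, 11)]

def draw_skeleton(frame: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Draw a stick figure from (K, 2) pixel keypoints onto a copy of the frame."""
    out = frame.copy()
    for a, b in EDGES:
        pa = tuple(int(v) for v in keypoints[a])
        pb = tuple(int(v) for v in keypoints[b])
        cv2.line(out, pa, pb, (0, 255, 0), 2)                      # limb segments
    for x, y in keypoints:
        cv2.circle(out, (int(x), int(y)), 3, (0, 0, 255), -1)      # joint markers
    return out
```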

Innovation 2: SUS Prompting (See it, Understand it, Sorted)

Giving an AI a picture of a dog and saying “make the robot do this” is usually too vague. To generate precise mathematical reward functions, the AI needs to reason about physics, timing, and stability.

The authors introduce SUS, a “Chain-of-Thought” prompting strategy that breaks the cognitive process into four distinct “agents” or personas.

Figure 3: SDS Prompting Techniques for GPT-4o: a) Demonstration video frames, arranged in a grid (\(\mathcal{G}_v\)); b) SUS skill decomposition into 4 task-specific agents.

  1. Task-Descriptor Agent: Looks at the grid and identifies the subject and the general action (e.g., “A dog running from left to right”).
  2. Gait-Analyzer Agent: Specifically looks at the feet. It identifies the contact sequence—which legs touch the ground and when. This is crucial for distinguishing a trot (diagonal pairs) from a pace (side pairs).
  3. Task-Requirement Agent: Analyzes physics constraints. Is the torso stable? Is the movement fast or slow? What is the orientation relative to the ground?
  4. SUS-Prompt-Generator Agent: Synthesizes all this observation into a structured prompt that is finally used to generate the Python code.

This structured decomposition prevents the AI from skipping steps and ensures the generated code is grounded in specific visual evidence.
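
Concretely, the chain can be run as a sequence of GPT-4o calls over the same grid image. The prompts below are condensed paraphrases and the helper is a sketch; the paper's actual prompt text and orchestration differ.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Condensed paraphrases of the observation roles, not the paper's actual prompts.
AGENT_PROMPTS = {
    "task_descriptor": "Identify the subject and the overall action in this frame grid.",
    "gait_analyzer": "Describe the foot-contact sequence visible across the grid.",
    "task_requirements": "State physics constraints: torso stability, speed, orientation.",
}

def run_sus(grid_png_path: str) -> str:
    """Run the observation agents, then synthesize the reward-writing prompt."""
    with open(grid_png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    findings = []
    for name, prompt in AGENT_PROMPTS.items():
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]}],
        )
        findings.append(f"{name}: {resp.choices[0].message.content}")
    # Final agent: fuse the observations into one structured code-generation prompt.
    return ("Using these observations, write a Python reward function:\n"
            + "\n".join(findings))
```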

The Evolutionary Loop: Learning from Mistakes

Once the Python reward function is generated, the system spins up a simulation in Isaac Gym. This is a high-performance physics engine that can simulate thousands of robots in parallel.

The robot tries to maximize the reward it was given using an algorithm called PPO (Proximal Policy Optimization). After a short training burst, the system records the result.
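
PPO's core idea is a clipped surrogate objective that keeps each policy update close to the previous policy. A minimal PyTorch version of the standard loss (textbook form, not the paper's training code) looks like this:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss over a batch of transitions."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic bound; negate because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```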

But how do we know if the robot succeeded?

In standard pipelines, a human would check. In SDS, GPT-4o checks. The system generates a simulation grid (\(\mathcal{G}_s\))—a grid of images looking exactly like the input grid, but featuring the robot.

Figure 8: Simulation footage arranged in a grid formation (\\(G_s\\)), serving as input as GPT-4o input.

The AI compares the input grid (real dog) with the simulation grid (robot). It looks for:

  • Visual Similarity: Do the poses match?
  • Contact Patterns (CP): Does the rhythm of footsteps match? (A simple matching metric is sketched after this list.)
  • Stability: Is the robot falling over?
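
A crude version of the contact-pattern check can be written as a per-timestep comparison. This helper is hypothetical; in SDS, the contact plots are given to GPT-4o as images rather than scored by a fixed metric.

```python
import numpy as np

def contact_match(cp_demo: np.ndarray, cp_sim: np.ndarray) -> float:
    """Fraction of timesteps where all four feet match the demo's contact state.

    cp_demo, cp_sim: (T, 4) boolean arrays, one column per foot.
    """
    T = min(len(cp_demo), len(cp_sim))
    return float(np.mean(np.all(cp_demo[:T] == cp_sim[:T], axis=1)))
```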

This evaluation creates a feedback loop. If the robot is dragging its feet, GPT-4o notices the visual discrepancy in the grids and modifies the reward function in the next iteration to penalize foot-dragging.
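
Putting it together, the outer loop reads roughly as follows. Every helper name here is hypothetical shorthand for a pipeline stage described above, not the paper's API:

```python
def sds_loop(demo_grid_path: str, iterations: int = 5):
    """High-level sketch of the SDS evolutionary loop (helper names hypothetical)."""
    sus_prompt = run_sus(demo_grid_path)          # SUS analysis of the demo video
    best_fn, best_score = None, float("-inf")
    for i in range(iterations):
        reward_fn = generate_reward(sus_prompt)   # GPT-4o emits candidate Python
        policy = train_ppo(reward_fn)             # short Isaac Gym training burst
        sim_grid, contacts = record_rollout(policy)
        critique, score = evaluate(demo_grid_path, sim_grid, contacts)  # GPT-4o critic
        if score > best_score:                    # selection: keep the best candidate
            best_fn, best_score = reward_fn, score
        sus_prompt += f"\nIteration {i} feedback: {critique}"  # evolve the prompt
    return best_fn
```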

Figure 10: Evolution of task behavior of all skills at a matched gait phase (T=5s) across the 5 SDS reward iterations.

As seen in Figure 10, the first iteration often results in messy or incorrect movement. By iteration 5, the robot’s movement has “evolved” to closely match the reference animal, purely through this automated visual critique.

Experimental Results

The researchers tested SDS on four distinct gaits: Pace, Trot, Hop, and Bound. These represent a spectrum of difficulty and coordination.

Accuracy and Imitation

The primary metric used was Dynamic Time Warping (DTW) distance. DTW measures the similarity between two temporal sequences (the trajectories of the real animal vs. the robot), even if they play at slightly different speeds.
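
For reference, the classic DTW recurrence is only a few lines; a minimal NumPy version for 1-D trajectories (illustrative, not the paper's evaluation code) is:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two 1-D trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignment moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```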

The results were impressive. SDS achieved DTW distances on the order of \(10^{-6}\), indicating extremely high fidelity. More importantly, it achieved 100% contact-sequence (gait) matching for most skills.

Figure 5: Gait evaluation results. Contact sequences and base height stability.

In the figure above, panel (a) shows the contact patterns. The black/orange blocks represent feet touching the ground. You can see the distinct rhythm of each gait (e.g., the diagonal synchronization in Trotting vs. the front/back synchronization in Bounding). The robot’s patterns (Sim) almost perfectly align with the expected biological patterns.

Stability

One of the dangers of learning from video is that an animal might look stable, but a robot mimicking it might fall over due to different weight distributions. SDS policies proved highly robust. Panel (b) in Figure 5 shows the base height stability. The fluctuations are minimal (millimeters), meaning the robot isn’t bouncing uncontrollably or jittering.

Stability-to-Speed (StS) Metric

The researchers introduced a combined score for stability, penalizing deviation from a safe center of mass and excessive wobbling (angular velocity).

Equation for Stability-to-Speed (StS) ratio.
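
The paper's exact formulation is not reproduced here; one form consistent with the description above, with assumed weights \(\alpha, \beta\) and an assumed safe reference point \(p_{\mathrm{ref}}\) for the center of mass, would be:

\[
\mathrm{StS} \;=\; \frac{\exp\!\big(-\alpha\,\lVert p_{\mathrm{CoM}} - p_{\mathrm{ref}} \rVert \;-\; \beta\,\lVert \boldsymbol{\omega} \rVert\big)}{\lVert \mathbf{v} \rVert}
\]

Read this way, higher scores mean the robot stays centered and steady relative to how fast it is moving.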

Even when pushed with external forces (up to 110N lateral perturbation), the SDS-trained policies maintained high stability scores.

Table 10: Policy Stability Score (StS) under 0N and 110N lateral perturbations.

Generalization: Different Body, Same Skill

A major test for any robotic learning algorithm is morphology generalization. If you train a system on a video of a dog, can it control a robot that looks different?

The team tested SDS on the ANYmal-D robot. Unlike the Unitree Go1 (which looks somewhat like the dogs in the videos), the ANYmal is larger, heavier, and has inverted rear knee joints. This kinematic difference usually breaks standard controllers.

Figure 11: Demonstration of the generalization capabilities of SDS on the ANYmal quadruped robot.

Remarkably, SDS successfully taught the ANYmal to trot and bound. Because the system relies on visual goals (where the foot should be) rather than copying joint angles directly, it allows the optimizer to figure out how to get the foot there, regardless of which way the knee bends.

Why Does It Work? (Ablation Studies)

The authors performed “ablation studies”—systematically removing parts of the system to see what breaks.

  1. No Grid (\(\mathcal{G}_v\)): If you remove the grid prompting, the AI loses the concept of time and sequence. Imitation fails completely.
  2. No SUS: Without the decomposition agents, the generated reward functions are generic and fail to capture specific gait details.
  3. No Contact Patterns: If the feedback loop only looks at the robot’s body pose but ignores the specific timing of foot contacts, the robot learns to “slide” or “pose” without actually walking properly.

Figure 6: (a) Mean SDS reward signal and ablated variants. (b) Evolution of trotting behavior.

Figure 6(a) shows the learning curves. The black line (Full SDS) consistently achieves higher rewards than versions missing components (the colored lines).

Conclusion and Future Outlook

SDS represents a significant step toward “Zero-Shot” robot learning. By effectively translating visual data into executable code through the lens of a Large Language Model, the framework removes the need for:

  • Manual reward tuning.
  • Complex Motion Capture setups.
  • Human-in-the-loop corrections.

The implications extend beyond just robot dogs. This approach suggests that in the future, we might be able to teach household robots to cook, clean, or use tools simply by showing them a YouTube video.

While limitations remain—it currently handles periodic gaits best and relies on a side-view perspective—SDS proves that the gap between “seeing” an action and “doing” it is closing rapidly. The era of robots that learn by watching is just beginning.

For more details on the specific prompting architecture and open-source code, you can refer to the full paper resources linked in the abstract.