Introduction
One of the “holy grails” of robotics and artificial intelligence is the ability to teach a robot new skills simply by showing it a video of a human performing a task. Imagine if, instead of programming a robot step-by-step or teleoperating it for hours to collect data, you could show it a YouTube clip of a person folding a shirt, and the robot would immediately understand how to do it.
While this sounds intuitive to us, it presents a massive computational challenge known as the embodiment gap. Humans have soft hands with five fingers; robots often have rigid grippers or suction cups. Humans move with a specific kinematic structure; robotic arms have different joint limits and degrees of freedom. To a computer vision system, a video of a human hand picking up an apple looks statistically very different from a video of a metal gripper doing the same thing.
Traditionally, researchers have tried to bridge this gap using paired datasets—recording a human and a robot doing the exact same thing in the exact same environment to map the differences. But this is expensive and hard to scale.
In this post, we are diving deep into UniSkill, a new framework presented at CoRL 2025. UniSkill proposes a clever way to learn “universal” skill representations from massive, unlabeled video datasets. The core idea? Focus on the dynamics—the change between frames—rather than the appearance of the agent. By doing so, UniSkill allows a robot to watch a human video and translate that visual information into executable robot actions, without needing aligned data or complex labels.

The Core Problem: The Embodiment Gap
Learning from videos is a scalable approach because video data is abundant. We have millions of hours of human activities available online (e.g., the “Something-Something” dataset). However, extracting robot actions from these videos is difficult.
Most current methods fall into two traps:
- Dependency on Paired Data: They require datasets where humans and robots perform identical tasks in identical scenes. This effectively negates the benefit of using “in-the-wild” internet videos.
- Explicit Translation: They try to map human hand poses to robot gripper poses. This often requires complex 3D tracking and fails when the robot’s morphology doesn’t map 1:1 to a human hand.
UniSkill bypasses both traps by asking a different question: Can we learn a representation of a “skill” that is agnostic to who is performing it?
If we define a skill not by how the arm looks but by how the environment changes (e.g., “the cup was lifted,” “the drawer was closed”), we can create a shared language between humans and robots.
The UniSkill Method
The UniSkill framework is built on the intuition that a “skill” is effectively a compression of the dynamics between two points in time. If we look at a video frame at time \(t\) and another at time \(t+k\), the difference between them represents the skill being executed.
The framework consists of three main stages:
- Skill Representation Learning: Learning a universal embedding space using large-scale videos.
- Policy Learning: Training a robot to execute actions based on these embeddings.
- Inference: Extracting embeddings from a human video to guide the robot.
Let’s break down the architecture, illustrated below.

1. Universal Skill Representation Learning
The heart of UniSkill is how it learns to encode a video clip into a compact vector \(z_t\) (the skill representation). To do this without labels, the authors use a self-supervised approach involving two models: an Inverse Skill Dynamics (ISD) model and a Forward Skill Dynamics (FSD) model.
Inverse Skill Dynamics (ISD)
The ISD model acts as the encoder. It looks at the current frame \(I_t\) and a future frame \(I_{t+k}\) and attempts to extract the “skill” \(z_t\) that explains the transition between them.

Crucially, the authors found that relying solely on RGB pixel data caused the model to overfit to appearance (e.g., memorizing that “a white arm moving” is a different skill than “a black arm moving”). To fix this, they incorporate Depth Estimation. By using depth maps, the model focuses more on the geometry and movement of objects rather than the texture or color of the agent’s arm.
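To make this concrete, here is a minimal PyTorch sketch of what an inverse skill dynamics encoder could look like. The backbone, the channel layout (RGB plus an estimated depth channel per frame), and the embedding size are illustrative assumptions rather than the paper’s exact architecture; the point is simply that two frames go in and a compact skill vector \(z_t\) comes out.

```python
import torch
import torch.nn as nn

class InverseSkillDynamics(nn.Module):
    """Encode the transition between two RGB-D frames into a skill vector z_t."""
    def __init__(self, skill_dim: int = 256):
        super().__init__()
        # Each frame is RGB (3 channels) + estimated depth (1 channel);
        # the two frames are stacked along channels -> 8 input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(8, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, skill_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_tk: torch.Tensor) -> torch.Tensor:
        # frame_t, frame_tk: (B, 4, H, W) RGB-D frames at times t and t+k.
        x = torch.cat([frame_t, frame_tk], dim=1)  # (B, 8, H, W)
        return self.encoder(x)                     # (B, skill_dim) skill vector z_t


# Quick shape check with dummy frames.
z = InverseSkillDynamics()(torch.randn(2, 4, 128, 128), torch.randn(2, 4, 128, 128))
print(z.shape)  # torch.Size([2, 256])
```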
Forward Skill Dynamics (FSD)
The FSD model acts as the decoder/predictor. It takes the current frame \(I_t\) and the extracted skill \(z_t\), and tries to predict the future frame \(I_{t+k}\).

This structure is inspired by image editing models like InstructPix2Pix. In image editing, you give a model an image and a text instruction (e.g., “add a hat”), and it generates the new image. Here, the “instruction” is the latent skill vector \(z_t\).
By forcing the system to reconstruct the future frame \(I_{t+k}\) using only the current frame and the skill vector, the model is compelled to pack all the necessary dynamic information (what moved, where it went) into \(z_t\). Because the training data includes videos of humans, Franka robots, and WidowX robots, the model learns a generalized concept of movement that applies across all these embodiments.
The training objective essentially minimizes the difference between the predicted future frame and the actual future frame. This forces \(z_t\) to capture the “verbs” of the video (push, pull, lift) rather than the “nouns” (hand, gripper, sleeve color).
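Here is a sketch of that training signal. UniSkill’s actual FSD is a diffusion-based image-editing model in the spirit of InstructPix2Pix; the tiny decoder and plain pixel loss below are stand-ins that just show the mechanics: \(z_t\) is the only channel through which information about the future frame can flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardSkillDynamics(nn.Module):
    """Predict the future frame I_{t+k} from the current frame I_t and the skill z_t."""
    def __init__(self, skill_dim: int = 256):
        super().__init__()
        self.conv_in = nn.Conv2d(4, 64, kernel_size=3, padding=1)
        self.skill_proj = nn.Linear(skill_dim, 64)   # broadcast z_t over the feature map
        self.decoder = nn.Sequential(
            nn.ReLU(), nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(), nn.Conv2d(64, 4, kernel_size=3, padding=1),  # predicted RGB-D frame
        )

    def forward(self, frame_t: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
        h = self.conv_in(frame_t)
        h = h + self.skill_proj(z_t)[:, :, None, None]  # condition every pixel on z_t
        return self.decoder(h)


# One self-supervised training step (reusing the InverseSkillDynamics class
# from the sketch above; dummy tensors stand in for real video frames).
isd, fsd = InverseSkillDynamics(), ForwardSkillDynamics()
opt = torch.optim.Adam(list(isd.parameters()) + list(fsd.parameters()), lr=1e-4)

frame_t  = torch.randn(8, 4, 128, 128)   # batch of RGB-D frames at time t
frame_tk = torch.randn(8, 4, 128, 128)   # corresponding frames at time t+k

z_t = isd(frame_t, frame_tk)             # compress the transition into z_t
pred = fsd(frame_t, z_t)                 # hallucinate I_{t+k} from (I_t, z_t) alone
loss = F.mse_loss(pred, frame_tk)        # reconstruction error forces the dynamics into z_t
loss.backward()
opt.step()
```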
2. Universal Skill-Conditioned Policy
Once the ISD model is trained on massive datasets (like diverse human videos and robot datasets), we freeze it. Now we need to teach a specific robot how to execute these skills.
We train a policy \(\pi\) (using a Diffusion Policy architecture) on a dataset of robot demonstrations. For every trajectory in the robot dataset:
- We take two frames, \(I_t\) and \(I_{t+k}\).
- We pass them through the frozen ISD to get the skill \(z_t\).
- We train the policy to predict the robot’s physical actions \(a_t\) given the current observation \(o_t\) and the skill \(z_t\).

The Augmentation Trick: There is still a slight gap between training and testing. During training, the skill \(z_t\) comes from a robot video; at test time, it will come from a human video. To make the policy robust to this shift, the authors heavily augment the images fed into the ISD during training (color changes, jitter, and so on). This simulates the visual domain gap, forcing the policy to rely on the underlying structural dynamics encoded in \(z_t\) rather than on specific visual cues.
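Below is a sketch of this policy-learning stage. The paper uses a Diffusion Policy head; the simple MLP regressing actions here is a stand-in, and `robot_dataloader`, the observation features, and the action dimension are assumptions for illustration. Note that the augmentation is applied only to the frames that feed the frozen ISD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)

def augment_rgbd(frames: torch.Tensor) -> torch.Tensor:
    # Jitter the RGB channels only (assumed normalized to [0, 1]); leave depth untouched.
    rgb, depth = frames[:, :3], frames[:, 3:]
    return torch.cat([color_jitter(rgb), depth], dim=1)

class SkillConditionedPolicy(nn.Module):
    """Map (observation features, skill z_t) to a robot action (stand-in for Diffusion Policy)."""
    def __init__(self, obs_dim: int = 512, skill_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + skill_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, obs_feat: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs_feat, z_t], dim=-1))


# `isd` is the frozen InverseSkillDynamics from the previous stage;
# `robot_dataloader` is assumed to yield (obs_feat, I_t, I_{t+k}, action) from robot demos.
isd.eval()
for p in isd.parameters():
    p.requires_grad_(False)

policy = SkillConditionedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for obs_feat, frame_t, frame_tk, action in robot_dataloader:
    with torch.no_grad():
        # Augment only the frames fed to the ISD, simulating the visual
        # domain shift of human prompts at test time.
        z_t = isd(augment_rgbd(frame_t), augment_rgbd(frame_tk))
    loss = F.mse_loss(policy(obs_feat, z_t), action)
    opt.zero_grad()
    loss.backward()
    opt.step()
```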
3. Inference: Cross-Embodiment Imitation
At test time, we want the robot to mimic a human.
- We record a Human Video Prompt (e.g., a human pushing a block).
- We feed frames from this video into the ISD.
- The ISD extracts a sequence of skill vectors \(z\).
- These vectors are fed into the Robot Policy.
- The robot executes the actions, effectively “mimicking” the human’s intent, even though it has never seen that specific human video before.
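A minimal sketch of this inference loop might look like the following; `isd`, `policy`, `encode_obs`, and the `robot` interface are placeholders for the trained components and the robot’s own API, and the frame gap `k` is assumed to match the one used during training.

```python
import torch

@torch.no_grad()
def imitate_human_prompt(human_frames, isd, policy, robot, encode_obs, k=8, steps_per_skill=16):
    """Follow a human prompt video: extract one skill per k-frame window and execute it.

    `human_frames` is a sequence of preprocessed RGB-D frames from the human video.
    """
    for t in range(0, len(human_frames) - k, k):
        # The skill comes from a *human* video the policy has never seen.
        z_t = isd(human_frames[t].unsqueeze(0), human_frames[t + k].unsqueeze(0))
        for _ in range(steps_per_skill):
            obs_feat = encode_obs(robot.get_observation())   # current robot camera view
            action = policy(obs_feat, z_t)
            robot.step(action.squeeze(0))
```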
Figure 9 (below) visualizes how this inference pipeline differs from standard Goal-Conditioned Behavioral Cloning (GCBC). While GCBC tries to reach a specific pixel-level goal image (one that still contains a human hand, which confuses the robot), UniSkill follows the abstract skill representation.

Why Does It Work? Visualizing the “Skill”
One of the most compelling aspects of UniSkill is that the learned representations are interpretable. We can check if the model actually understands “dynamics” by using the Forward Skill Dynamics (FSD) model.
If we take a static image and inject a “skill” extracted from a completely different video, the FSD should be able to hallucinate a future frame where that skill has been executed.

In the figure above, look at the “Current Image” on the right. When conditioned with “Skill A” (picking up), the model predicts the robot arm lifting. When conditioned with “Skill C” (moving sideways), it predicts the arm moving sideways. This confirms that \(z_t\) is indeed encoding motion instructions.
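In code, this sanity check is just a forward pass through the trained models: extract a skill from one clip and decode it against an unrelated image. The clip and image variables below are placeholders, with `t` and `k` chosen within the source clips.

```python
import torch

# `isd` and `fsd` are the trained models from the sketches above; `video_a` and `video_c`
# are (T, 4, H, W) source clips, and `static_image` is a (1, 4, H, W) frame from a third scene.
with torch.no_grad():
    z_pick  = isd(video_a[t:t + 1], video_a[t + k:t + k + 1])   # skill from a "picking up" clip
    z_slide = isd(video_c[t:t + 1], video_c[t + k:t + k + 1])   # skill from a "move sideways" clip

    pred_pick  = fsd(static_image, z_pick)    # predicted future: arm lifts in the new scene
    pred_slide = fsd(static_image, z_slide)   # predicted future: arm slides in the new scene
```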
Furthermore, we can look at how these skills cluster. In the t-SNE plot below, embeddings cluster by Task (colors), not by Embodiment (shapes): circles (human) and crosses (robot) performing the same task land close together in the embedding space. This is strong evidence that the representation is embodiment-agnostic.

Note: Figure 12 also highlights the importance of Depth. Without depth (bottom right), the clustering is messy. With depth (bottom left), the tasks separate cleanly.
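Reproducing this kind of plot is straightforward once you have the skill vectors; the sketch below assumes arrays of embeddings and per-clip task/embodiment labels and uses scikit-learn’s t-SNE.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed inputs: `skills` is an (N, skill_dim) array of z vectors from human and robot
# clips, `tasks` holds integer task labels, `embodiments` holds "human" / "robot" strings.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(skills)
tasks, embodiments = np.asarray(tasks), np.asarray(embodiments)

for emb, marker in [("human", "o"), ("robot", "x")]:
    mask = embodiments == emb
    plt.scatter(xy[mask, 0], xy[mask, 1], c=tasks[mask], marker=marker, cmap="tab10")
plt.title("Skill embeddings: color = task, marker = embodiment")
plt.show()
```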
Experiments and Results
The researchers evaluated UniSkill on both real-world robots (Franka Emika Panda) and simulation benchmarks (LIBERO). They used massive datasets for training, including:
- Human Data: Something-Something V2, H2O.
- Robot Data: DROID, BridgeV2, LIBERO.
They compared UniSkill against GCBC (Goal-Conditioned Behavioral Cloning) and XSkill (a prior state-of-the-art method for cross-embodiment).
Real-World Tabletop Tasks
In the real world, they tasked the robot with manipulating objects like tissues, towels, and trash bins. They provided prompts from a held-out robot (Franka) and, more importantly, from Humans.

The results in Figure 3(a) are striking:
- Franka Prompts: UniSkill achieves 81% success, beating XSkill (61%) and GCBC (60%).
- Human Prompts: This is the hardest setting. XSkill scored 0%, and GCBC scored 11%. UniSkill achieved 36%. While 36% might sound low, it is a massive leap over the baselines, which essentially failed completely.
It is worth noting that for simpler tasks, the success rate was much higher. For example, in the “Pull out the tissue” task, UniSkill achieved 93% with robot prompts and 57% with human prompts (see Table 2 in the paper).
Generalization to Unseen Embodiments (The “Anubis” Robot)
To push the limits, they tested on a custom robot named “Anubis”, which was completely unseen during training.

As shown in Figure 4, UniSkill (green bars) consistently outperforms the baseline (blue bars) regardless of whether the prompt comes from a Franka, a Human, or the novel Anubis robot. This confirms the “Universal” claim in UniSkill’s name.
Simulation Results (LIBERO)
Simulation allows for more rigorous, large-scale testing. In the LIBERO benchmark, they created “Human Prompts” by having humans mimic the simulation tasks in the real world.

Here, UniSkill achieves 48% success on human prompts, compared to just 9% for GCBC. The visual comparison in Figure 5 shows the significant visual domain gap between the clean simulation environment (left) and the real-world human demonstration (right). UniSkill successfully bridges this gap.
Ablation: The Power of Big Data
Does adding more data actually help? Yes. The authors conducted an ablation study showing that training on just robot data yields decent results, but adding large-scale human video datasets (Something-Something V2 and H2O) boosts performance significantly.

Looking at Table 6(a), simply adding Human videos increased the success rate from 19% to 49%. This validates the hypothesis that robots can indeed learn better skills by watching humans, provided the representation is learned correctly.
Conclusion
UniSkill represents a significant step forward in robot learning. By shifting the focus from pixel-perfect reconstruction to dynamics modeling, it allows robots to utilize the vast ocean of human video data available on the internet.
Key Takeaways:
- Embodiment Independence: UniSkill learns skills that represent what is happening, not who is doing it.
- Scalability: It leverages unlabeled, in-the-wild datasets, removing the bottleneck of expensive data collection.
- Cross-Embodiment Transfer: It enables a robot to watch a human (or a different robot) and execute the corresponding task without explicit paired training data.
While there are still limitations—such as sensitivity to video speed and extreme viewpoint changes—UniSkill paves the way for a future where general-purpose robots can learn new chores simply by watching us do them. The integration of such skill representations with Vision-Language Models (VLMs) is likely the next frontier, potentially allowing robots to understand both the “what” (language) and the “how” (UniSkill dynamics) of complex tasks.