Introduction

Imagine trying to learn a new skill, like playing a specific song on the piano. A great teacher doesn’t just wait until you’ve finished the piece to tell you “Pass” or “Fail.” Instead, they provide continuous feedback while you play: “That’s the right chord,” “You’re slowing down too much here,” or “You missed that note, try again.”

In the world of robotics, this kind of dense, informative feedback is crucial. Typically, we teach robots with Imitation Learning, where we show them exactly what to do thousands of times, or Reinforcement Learning (RL), where we give them a reward signal when they succeed. However, both share a major bottleneck: scaling.

If you want a robot to perform a brand new task, you usually have to collect piles of new human demonstrations or manually write a complex mathematical function to define “success” for that specific task. This is slow, expensive, and requires expert knowledge.

But what if a robot could understand a new task just by reading a language instruction, like “open the red trash bin,” without needing a human to physically demonstrate it again?

This is the problem tackled in the paper “ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations.” The authors introduce a framework that acts like that “great teacher.” It learns a generalizable reward function from a small set of data and uses it to train robots on completely new tasks using only language descriptions.

In this post, we will break down how ReWiND works, the clever “video rewind” trick it uses to understand failure, and how it outperforms existing methods in both simulation and the real world.

The Problem: The Cost of New Tasks

To understand why ReWiND is necessary, we need to look at the limitations of current robot learning.

  1. Manual Reward Engineering is Hard: Designing a reward function that guides a robot from start to finish (dense rewards) is difficult. It’s easy to define “success” (the door is open), but hard to mathematically define “moving towards the handle” without the robot finding loopholes.
  2. Demonstrations are Expensive: Collecting human demonstrations for every slight variation of a task (e.g., opening a blue bin vs. a red bin) is impractical for deploying robots in the wild.
  3. Prior Solutions are Limited: Previous attempts to use language for rewards often cheated by assuming the robot has perfect knowledge of the world (ground-truth states) or by requiring massive online training sessions that are unsafe for real hardware.

ReWiND (short for Rewards Without New Demonstrations) proposes a different path. It uses a modest amount of initial data to learn a Reward Model and a Policy. Once trained, this system can accept a new language instruction and teach itself the task through trial and error, without needing a human to move the robot’s arm even once.

Figure 1: Overview. We pre-train a policy and reward model from a small set of language-labeled demos. Then, we solve unseen task variations via language-guided RL without additional demos.

As shown in Figure 1, the workflow shifts the heavy lifting to a pre-training phase, allowing the deployment phase (learning a new task) to be guided simply by a text prompt and the robot’s own interaction.

The ReWiND Framework

The ReWiND framework operates in three distinct stages. To make this digestible, we will explore the core innovation first: the reward model, since it powers everything else.

Figure 2: (a): We train a reward model on a small demonstration dataset and a curated subset of Open-X, augmented with instructions and video rewinding. (b): We use the trained reward model to label demos with rewards and pre-train a policy. (c): For an unseen task, we fine-tune the policy with online rollouts.

Stage 1: Learning the Teacher (The Reward Model)

The heart of ReWiND is a neural network that takes in a video of the robot and a text description, and outputs a “progress” score (from 0 to 1). If the robot completes the task described in the text, the score goes up. If it fails or does nothing, the score stays low.

The Architecture

The authors designed a Cross-Modal Sequential Aggregator. That sounds complex, but let’s break it down using the architecture diagram below.

Figure 10: ReWiND’s Reward Model Architecture. Frozen language and image embeddings are projected to a hidden dimension. These are fed to a transformer that predicts per-timestep rewards.

  1. Frozen Encoders: They don’t train the vision or language parts from scratch. They use DINOv2 (a powerful self-supervised vision model) and MiniLM (a compact language model for embedding the instruction). This allows the system to leverage the “common sense” knowledge these models learned from the internet.
  2. Aggregation: These visual and text features are fed into a Transformer. This model looks at the sequence of frames and the instruction to decide “How much progress has been made?”
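
To make the data flow concrete, below is a minimal PyTorch-style sketch of such a cross-modal aggregator. The dimensions, the causal masking, and the sigmoid progress head are illustrative assumptions rather than the authors' exact implementation, and the frozen DINOv2/MiniLM features are assumed to be precomputed.

```python
import torch
import torch.nn as nn

class CrossModalAggregator(nn.Module):
    """Sketch: frozen video + language embeddings -> shared space -> transformer -> per-frame progress."""

    def __init__(self, img_dim=768, txt_dim=384, hidden=256, n_layers=4, n_heads=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project frozen DINOv2 frame features
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project the frozen MiniLM instruction feature
        layer = nn.TransformerEncoderLayer(hidden, n_heads, dim_feedforward=4 * hidden, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.progress_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # progress score in [0, 1]

    def forward(self, frame_emb, text_emb):
        # frame_emb: (B, T, img_dim); text_emb: (B, txt_dim). The instruction becomes the first token.
        tokens = torch.cat([self.txt_proj(text_emb).unsqueeze(1), self.img_proj(frame_emb)], dim=1)
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        out = self.transformer(tokens, mask=causal)  # the reward at time t only sees frames up to t
        return self.progress_head(out[:, 1:]).squeeze(-1)  # (B, T) per-timestep progress
```

The per-timestep output is what later gives the RL agent dense feedback instead of a single end-of-episode score.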

The Challenge of “Failure”

Here is the cleverest part of the paper. To train a reward model, you typically show it examples of success (demonstrations). But to be a good teacher, the model also needs to know what failure looks like.

If you only train on perfect demonstrations, the model might think any robot movement is good. But collecting thousands of failed attempts on a real robot is dangerous and tedious.

The Solution: Video Rewind

The authors introduce a data augmentation technique called Video Rewind. They take a successful demonstration video and mechanically “rewind” parts of it.

Imagine a video of a robot picking up a cup.

  • Forward: The hand moves to the cup, grasps it, and lifts it. (Success)
  • Rewind: The hand moves to the cup, grasps it… and then the footage is played backwards, so the hand retreats from the cup as if it were dropping it. (Failure)

Figure 3: Video rewind. We split a demo at intermediate timestep i into forward/reverse sections. The reverse section resembles dropping the object.

By playing the video in reverse (as shown in Figure 3), they artificially generate “failure” trajectories where the robot undoes its progress. They then train the reward model to predict decreasing reward for these rewound segments.
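
As a minimal sketch, this is roughly what the augmentation could look like in code. Here the clip is rewound all the way back to the first frame; the paper may rewind only a portion of the demo, but the idea is the same.

```python
import numpy as np

def rewind_augment(frames, rng=None):
    """Create a synthetic failure from one successful demo by playing part of it backwards.

    frames: array of shape (T, H, W, C) for a successful demonstration (T > 2 assumed).
    """
    rng = rng if rng is not None else np.random.default_rng()
    T = len(frames)
    i = int(rng.integers(2, T))     # intermediate split point
    forward = frames[:i]            # progress is made up to frame i...
    rewound = frames[:i][::-1]      # ...then the clip plays backwards, undoing that progress
    return np.concatenate([forward, rewound], axis=0)
```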

The mathematical objective for this includes a standard progress loss (Equation 1) and this specific rewind loss (Equation 2):

Equation 2: Rewind loss.

This forces the reward model to be sensitive to the robot losing progress, providing the dense feedback necessary for Reinforcement Learning.
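
The exact objectives are shown as equation images above. As a rough sketch only (not the paper's verbatim Equations 1 and 2), a per-frame progress regression with falling targets on the rewound frames could be written as:

$$
\mathcal{L}_{\text{progress}} = \frac{1}{T}\sum_{t=1}^{T}\Big(R_\theta(o_{1:t}, z) - \tfrac{t}{T}\Big)^2,
\qquad
\mathcal{L}_{\text{rewind}} = \frac{1}{T'}\sum_{t=1}^{T'}\Big(R_\theta(\tilde{o}_{1:t}, z) - \tilde{y}_t\Big)^2,
$$

where $R_\theta$ is the reward model, $o_{1:t}$ are the frames seen so far, $z$ is the language embedding, $\tilde{o}$ is a rewound sequence of length $T'$, and the targets $\tilde{y}_t$ decrease back toward zero over the rewound frames. Whatever the exact form in the paper, the intent is the same: predicted progress must fall whenever the robot undoes its own work.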

The Data: Open-X

To ensure the robot understands more than just the few tasks in the lab, the authors include data from the Open X-Embodiment (Open-X) dataset. This is a large, diverse dataset of robots doing various things. Even if the specific tasks aren’t exactly what the test robot will do, seeing this variety helps the vision and language encoders generalize better.

Stage 2: Offline Policy Pre-training

Once the reward model (the teacher) is trained, we need a student (the policy).

Before letting the robot try new tasks in the real world (which is slow), ReWiND uses Offline Reinforcement Learning. It takes the existing demonstrations and re-labels them using the trained reward model.

Equation 4: Offline reward labeling.
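
Concretely, the relabeling step might look like the sketch below, reusing the aggregator sketch from earlier. The paper's Equation 4 defines the exact mapping from predicted progress to per-step reward, so treat this as one plausible choice rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def relabel_demo(reward_model, frame_emb, text_emb):
    """Replace a demo's missing reward signal with the learned reward model's predictions.

    frame_emb: (T, img_dim) frozen visual features of one demo; text_emb: (txt_dim,) instruction feature.
    """
    progress = reward_model(frame_emb.unsqueeze(0), text_emb.unsqueeze(0))[0]  # (T,) predicted progress
    # One plausible labeling: reward each transition (o_t, a_t, o_{t+1}) with the progress predicted at o_{t+1}.
    return progress[1:]
```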

Using a method called Implicit Q-Learning (IQL), the robot learns a “base policy.” Think of this as teaching a child basic motor skills. They might not know how to “open this specific red bin” yet, but they know how to move their arm, grasp objects, and generally interact with the table.

Stage 3: Online Learning of New Tasks

Now comes the test. We give the robot a new instruction: “Separate the blue and orange cups.” The robot has never seen a demonstration for this specific task.

  1. The robot executes its pre-trained policy (exploring based on its general skills).
  2. The Reward Model watches the video of the attempt.
  3. Based on the text instruction, the Reward Model assigns a reward score (Did the robot separate the cups?).
  4. The robot updates its policy using these rewards (Online RL).

Equation 8: Online reward labeling.

Because the reward model gives dense feedback (e.g., “You are getting closer”), the robot can adjust its behavior much faster than if it only received a simple “Success/Fail” signal at the very end.
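
Putting the loop together, here is a hedged sketch of the online phase. The environment is assumed to follow the classic gym interface, and `policy`, `reward_model`, and the two frozen encoders stand in for the corresponding ReWiND components; the actual paper uses its own specific RL algorithm and update rule.

```python
import numpy as np

def online_finetune(env, policy, reward_model, encode_frames, encode_text, instruction, n_episodes=200):
    """Sketch of language-guided online RL: roll out, score the video, update the policy."""
    text_emb = encode_text(instruction)                   # frozen language embedding of the new task
    for _ in range(n_episodes):
        obs, frames, actions, done = env.reset(), [], [], False
        while not done:
            action = policy.act(obs)                      # explore with the pre-trained policy
            obs, _, done, _ = env.step(action)            # ignore any environment reward
            frames.append(obs["image"])                   # assumes image observations
            actions.append(action)
        frame_emb = encode_frames(np.stack(frames))       # frozen per-frame visual features, (T, img_dim)
        progress = reward_model(frame_emb[None], text_emb[None])[0]  # dense per-timestep rewards, (T,)
        policy.update(frames, actions, rewards=progress)  # any off-policy RL update on the relabeled data
    return policy
```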

Does the Teacher Actually Understand? (Experiments)

Before seeing if the robot can learn, the researchers checked if the Reward Model actually works. A good reward model should produce high rewards for the correct task and low rewards for mismatched tasks.

Confusion Matrices

The authors tested this by feeding the model videos of Task A and instructions for Task B.

Figure 4: Video-Language Reward Confusion Matrix. ReWiND produces the most diagonal-heavy confusion matrix, indicating strong alignment between unseen demos and instructions.

In Figure 4, the vertical axis represents different video tasks, and the horizontal axis represents different language instructions.

  • A perfect model would show a bright diagonal line (Task A video matches Task A text) and dark colors everywhere else.
  • ReWiND (Far Right) shows a very clean diagonal.
  • Baselines like RoboCLIP and VLC show much more “confusion” (horizontal or vertical stripes), meaning they struggle to distinguish between different tasks or instructions effectively.
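
For intuition, a matrix like this can be computed with a simple double loop over held-out demos and instructions, as in the sketch below. The exact scoring the authors use may differ (for example, final versus average predicted reward).

```python
import torch

@torch.no_grad()
def reward_confusion_matrix(reward_model, video_embs, text_embs):
    """Score every (task video, task instruction) pair; a well-aligned model is diagonal-heavy.

    video_embs: list of N tensors, each (T_i, img_dim), one held-out demo per task.
    text_embs:  (N, txt_dim) tensor holding the matching instructions' embeddings.
    """
    N = len(video_embs)
    matrix = torch.zeros(N, N)
    for i, frames in enumerate(video_embs):        # row: which task the video actually shows
        for j in range(N):                         # column: which instruction the model is conditioned on
            progress = reward_model(frames.unsqueeze(0), text_embs[j].unsqueeze(0))[0]
            matrix[i, j] = progress[-1]            # score with the final predicted progress
    return matrix
```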

Analyzing Rollouts

It is also vital that the reward model works on partial successes. Below is an example of a robot trying to push a button but getting stuck.

Figure 9: Unsuccessful policy rollout for the “Push the Button” task in Meta-World and its corresponding rewards. ReWiND predicts calibrated rewards that better reflect partial progress.

In Figure 9, look at the reward graphs. Most baselines (LIV, VLC) give a flat reward near zero because the button wasn’t fully pressed. However, ReWiND (bottom right) gives a high sustained reward. It recognizes that the robot is at the button and trying, even if it hasn’t clicked it yet. This “partial credit” is what allows the RL algorithm to learn.

Does the Student Learn? (Results)

Finally, does this framework actually teach the robot to perform new tasks?

Simulation Results (MetaWorld)

The authors tested ReWiND on 8 unseen tasks in the MetaWorld simulator.

Figure 5: MetaWorld final performance. ReWiND achieves 79% success rate, significantly outperforming baselines.

The results in Figure 5 are stark.

  • ReWiND (Maroon line) achieves nearly 80% success rate.
  • The best baseline (VLC) hovers around 40%.
  • Standard methods like sparse rewards (learning only from “did I finish?”) fail almost completely (near 0%).

Real-World Robot Results

Simulation is one thing, but the real world involves lighting changes, physics noise, and visual clutter. The authors deployed ReWiND on a bimanual (two-armed) robot setup.

Figure 12: Real World Bimanual Robot Setup with Koch v1.1 arms.

They tested on 5 distinct tasks, including tasks requiring spatial reasoning (“Put the orange cup in the red plate”) and semantic understanding (“Put the fruit-colored object in the box”).

Figure 6: Real-robot RL results. Online RL with ReWiND improves a pre-trained policy by an absolute 56% across all five tasks.

As shown in Figure 6, the pre-trained policy (before online learning) only had a 12% success rate. After online training with ReWiND:

  • ReWiND improved to 68% success.
  • The baseline VLC only improved to 10%.
  • Overall, ReWiND improved the policy by more than 5x relative to its starting success rate.

Qualitative examples of these tasks show the robot handling visual variations that weren’t in the training set:

Figure 13: Rollouts for 5 tasks used for online RL, showing visual, spatial, and semantic generalization.

Why Does ReWiND Win? (Ablation Study)

The authors performed an “ablation study,” which involves removing parts of the system to see what breaks. This helps identify the most critical components.

Table 2: Ablation Study showing the impact of removing Open-X, Video Rewind, and Instruction Generation.

Looking at Table 2, we can see:

  1. Removing “Video Rewind” (- Video Rewind): The “Policy Rollout Ranking” drops significantly. This confirms that the “fake failure” videos are essential for the reward model to distinguish between good and bad robot behavior.
  2. Removing Open-X (- Open-X Subset): The ability to generalize to unseen demos drops. The broad data from the internet is crucial for understanding new objects and words.
  3. Removing Target Env Data: If you rely only on Open-X, the model fails to align with the specific robot’s embodiment. You need a mix of broad internet data and a small amount of specific robot data.

Conclusion

ReWiND represents a significant step toward scalable robot learning. By mimicking a teacher that provides dense, continuous feedback, it allows robots to learn tasks they have never seen before using only language instructions.

The key takeaways are:

  • Language is a powerful interface: We can define tasks with text rather than expensive demonstrations.
  • Synthetic data matters: “Video Rewind” turns successful data into failure data, solving the lack of negative examples in robot datasets.
  • Hybrid data helps: Combining a small amount of domain-specific data with large open-source datasets (Open-X) creates robust reward models.

While there are still limitations—someone currently has to reset the environment when the robot creates a mess—ReWiND moves us closer to a future where we can simply ask a robot to “clean the kitchen,” and it learns how to do it on the fly.