Dreaming of Success: How Robots Can Fine-Tune Skills Entirely Offline
In the world of robotics, there is a massive gap between “watching” a task and “mastering” it.
Imagine you are learning to play tennis. You can watch a professional player (Imitation Learning), and you might pick up the form. But to actually get good, you need to step onto the court and hit the ball thousands of times, adjusting your swing based on where the ball lands (Reinforcement Learning).
For humans, this practice is tiring but safe. For robots, “practice” in the real world is expensive, slow, and dangerous. A robot flailing around trying to learn a new skill might break the object it’s holding, damage its own motors, or hurt someone nearby.
This brings us to a significant bottleneck in modern AI robotics: How do we fine-tune a robot’s policy without the risks of real-world trial and error?
A new framework called DiWA (Diffusion Policy Adaptation with World Models) offers a compelling solution. It allows robots to “dream” about practicing. By learning a mental model of the world, the robot can run thousands of practice iterations in its head—fine-tuning its movements entirely offline—before ever moving a muscle in the real world.
In this post, we will deconstruct how DiWA works, why it combines the best of Diffusion Models and World Models, and look at the impressive results it achieves on real hardware.
The Problem: The High Cost of Real-World Practice
To understand why DiWA is necessary, we first need to look at the two dominant paradigms in robot learning:
- Imitation Learning (IL): The robot copies a human expert.
- Pros: Safe and learns quickly.
- Cons: Brittle. If the robot drifts slightly off the path the human took, it doesn’t know how to recover because it has never seen that situation before.
- Reinforcement Learning (RL): The robot learns by trial and error.
- Pros: Robust. The robot learns to recover from mistakes to maximize a reward.
- Cons: Sample Inefficiency. It often takes millions of interactions to learn a skill. Doing this on a physical robot is practically impossible for complex tasks.
Recent advances have given us Diffusion Policies, which treat robot action generation as a “denoising” process (similar to how DALL-E generates images). These are excellent at capturing complex, multi-modal human behaviors. However, they still suffer from the limitations of Imitation Learning—they struggle to adapt to new situations without more data.
Researchers have tried to apply RL to diffusion policies (a method called DPPO, short for Diffusion Policy Policy Optimization), but it still requires online interaction with the environment.

As shown in the figure above, DiWA (c) breaks this dependency. Instead of interacting with the real world, it interacts with a learned World Model.
The Core Concept: Learning in a Dream
DiWA stands for Diffusion Policy Adaptation with World Models. The intuition is to build a simulator inside the robot’s neural network. If the simulator is good enough, the robot can practice there.
The framework operates in four distinct phases:
- World Model Training: Learn how the world works from unstructured play data.
- Policy Pre-training: Learn a basic behavior by imitating experts.
- Reward Estimation: Learn to recognize what “success” looks like.
- Offline Fine-Tuning: The core innovation—using Reinforcement Learning inside the world model to improve the policy.
Let’s break these down step-by-step.
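To make the shape of the pipeline concrete before diving in, here is a minimal Python skeleton of the four phases. Every name and signature below is a hypothetical stub for illustration, not DiWA’s actual code.

```python
# Hypothetical skeleton of the four phases. All names and signatures here
# are illustrative stubs, not the authors' actual API.

def train_world_model(play_data):
    """Phase 1: learn latent dynamics from cheap, unlabeled play data."""
    ...

def pretrain_policy(expert_demos):
    """Phase 2: imitation-learn a diffusion policy from expert demonstrations."""
    ...

def train_reward_classifier(expert_demos):
    """Phase 3: learn to recognize successful latent states."""
    ...

def finetune_offline(policy, world_model, reward_fn):
    """Phase 4: reinforcement learning entirely inside the world model."""
    ...

world_model = train_world_model(play_data=None)
policy = pretrain_policy(expert_demos=None)
reward_fn = train_reward_classifier(expert_demos=None)
policy = finetune_offline(policy, world_model, reward_fn)
```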

1. The World Model (The Simulator)
Before the robot can practice, it needs a playground. DiWA uses a Latent World Model. Instead of predicting raw pixels (which is computationally expensive and difficult), the model compresses visual observations into a compact “latent state” (\(z_t\)).
The researchers trained this model on “play data”—unstructured data where a human teleoperates the robot to just mess around with objects. This data is cheap to collect because it doesn’t need to be labeled with success or failure. The World Model learns the physics of the environment: “If I am in state \(z_t\) and I apply action \(a_t\), what will the next state \(z_{t+1}\) look like?”
The transition dynamics are learned using a recurrent state-space model, which (roughly speaking) learns a predictive distribution over the next latent state:

\[
z_{t+1} \sim p_\phi\left(z_{t+1} \mid z_t,\, a_t\right)
\]

This equation essentially says the next latent state depends on the previous latent state and the action taken.
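To make this concrete, here is a minimal PyTorch sketch of a recurrent latent dynamics model. The architecture (a single GRU cell, the layer sizes, the name `LatentDynamics`) is an illustrative assumption, not the paper’s exact world model.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Illustrative recurrent latent dynamics: predict z_{t+1} from (z_t, a_t).

    A simplified stand-in for a recurrent state-space model, not DiWA's exact architecture.
    """

    def __init__(self, latent_dim=32, action_dim=7, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        self.next_latent = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z_t, a_t, h_t):
        # Fuse the current latent state and action, update the recurrent state,
        # then decode a prediction of the next latent state.
        h_next = self.rnn(torch.cat([z_t, a_t], dim=-1), h_t)
        z_next_pred = self.next_latent(h_next)
        return z_next_pred, h_next

# Usage: roll the model forward in "imagination" without touching the real robot.
dyn = LatentDynamics()
z = torch.zeros(1, 32)      # encoded observation
h = torch.zeros(1, 256)     # recurrent hidden state
for _ in range(10):
    a = torch.randn(1, 7)   # action from some policy
    z, h = dyn(z, a, h)
```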
2. The Diffusion Policy
DiWA uses a diffusion policy, which is currently the state-of-the-art for robotic manipulation.
In standard robotics, a policy outputs an action directly. In a diffusion policy, the network starts with random noise and iteratively “denoises” it \(K\) times to produce an action sequence. This allows the policy to capture very complex, precise movements that simple networks miss.
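A toy version of that denoising loop looks like the following. The tiny noise-prediction network and the simplified update rule are placeholders to show the structure, not the actual sampler a real diffusion policy would use.

```python
import torch
import torch.nn as nn

K = 10           # number of denoising steps
action_dim = 7

# Stand-in noise-prediction network: in a real diffusion policy this would be
# a much larger model conditioned on the (latent) observation.
eps_net = nn.Sequential(nn.Linear(action_dim + 32 + 1, 128), nn.ReLU(),
                        nn.Linear(128, action_dim))

def sample_action(z, num_steps=K):
    """Start from pure noise and iteratively denoise it into an action."""
    a = torch.randn(1, action_dim)                    # a^K: pure noise
    for k in range(num_steps, 0, -1):
        k_embed = torch.full((1, 1), float(k) / num_steps)
        eps_hat = eps_net(torch.cat([a, z, k_embed], dim=-1))
        a = a - eps_hat / num_steps                   # simplified denoising update
    return a                                          # clean action after K steps

z = torch.zeros(1, 32)    # latent state from the world model's encoder
action = sample_action(z)
```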
3. Latent Reward Estimation (The Coach)
In a real simulator, the code tells you if you won (e.g., return +1 if drawer_is_open). In the real world, we don’t have that function. Since DiWA runs offline, it needs a way to judge success inside its own dreams.
The researchers train a Latent Reward Classifier. They take a small set of expert demonstrations (where the task was successfully completed) and train a neural network to look at the latent state \(z_t\) and predict the probability of success.
To make this robust, they use a contrastive loss function (NT-Xent). This pulls the “success” states into a tight cluster in latent space, far away from the “failure” states.
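For reference, here is a generic NT-Xent implementation in the SimCLR style. The batch layout and temperature are assumptions for illustration; the point is simply that matched “success” embeddings are pulled together while everything else in the batch is pushed apart.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_i, z_j, temperature=0.1):
    """Generic NT-Xent loss for a batch of paired embeddings (SimCLR-style).

    Each row of z_i is treated as a positive pair with the same row of z_j;
    all other rows in the batch act as negatives.
    """
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=-1)   # (2N, D) unit vectors
    sim = z @ z.t() / temperature                            # cosine similarities
    n = z_i.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))               # drop self-similarity
    # The positive for row i is row i + n (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage with random stand-in latent states (batch of 8, 32-dim).
loss = nt_xent(torch.randn(8, 32), torch.randn(8, 32))
```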

4. The Dream Diffusion MDP
This is the most technically novel part of the paper. Standard Reinforcement Learning assumes a Markov Decision Process (MDP): State \(\to\) Action \(\to\) Next State.
However, a Diffusion Policy is unique because generating one action involves multiple denoising steps. DiWA models this entire process as a Dream Diffusion MDP.
Imagine the robot is in a dream state. It needs to decide what to do.
- It starts with noise.
- It performs a denoising step (this is treated as a “transition” in the MDP).
- It repeats this \(K\) times until it has a clean action.
- It executes that action in the World Model to get a new state.
This formulation allows the researchers to apply standard RL algorithms (specifically PPO, Proximal Policy Optimization) to the diffusion process itself. The “state” in this special MDP includes both the world state (\(z_t\)) and the partially denoised action (\(\bar{a}^k_t\)) at denoising step \(k\).
The reward is sparse: every intermediate denoising step receives zero reward, and only the final step of the denoising chain (when \(k=1\)) is scored by the latent reward classifier. Conceptually:

\[
R\big(z_t, \bar{a}^k_t\big) =
\begin{cases}
\hat{r}(z_{t+1}) & \text{if } k = 1,\\
0 & \text{otherwise,}
\end{cases}
\]

where \(\hat{r}\) is the latent reward classifier’s predicted probability of success.
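Putting the pieces together, a deliberately simplified rollout of this two-level loop might look like the sketch below. Here `denoise_step`, `world_model`, and `reward_classifier` are trivial stand-ins for the learned components described above, shown only to make the MDP structure explicit.

```python
import torch

K = 10            # denoising steps per action
HORIZON = 20      # imagined environment steps
action_dim, latent_dim = 7, 32

# Trivial stand-ins for the learned components, just to show the loop structure.
def denoise_step(a_k, z, k):     # one reverse-diffusion step of the policy
    return a_k - 0.1 * a_k
def world_model(z, a):           # latent dynamics: z_{t+1} = f(z_t, a_t)
    return z + 0.01 * torch.randn_like(z)
def reward_classifier(z):        # predicted probability of task success
    return torch.sigmoid(z.mean())

z = torch.zeros(latent_dim)
transitions = []                 # (state, action, reward) tuples for RL
for t in range(HORIZON):
    a = torch.randn(action_dim)                  # a^K: pure noise
    for k in range(K, 0, -1):
        a_next = denoise_step(a, z, k)
        # In the Dream Diffusion MDP, the RL "state" is (world latent, noisy action, k)
        # and the RL "action" is the next, less-noisy action. Reward is zero everywhere
        # except at the final denoising step (k == 1), where the latent reward
        # classifier scores the imagined next world state.
        if k == 1:
            z_next = world_model(z, a_next)
            r = reward_classifier(z_next)
        else:
            z_next, r = z, torch.tensor(0.0)
        transitions.append(((z, a, k), a_next, r))
        a, z = a_next, z_next
```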
To ensure the robot doesn’t “hallucinate” strategies that cheat the world model (a common problem where the agent finds bugs in the physics engine to get high scores), DiWA adds a Behavior Cloning Regularization. This creates a mathematical anchor, forcing the new policy to stay somewhat close to the original expert demonstrations.
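One straightforward way to realize such an anchor (a sketch of the general idea, not necessarily the paper’s exact regularizer) is to add a behavior-cloning penalty to the RL objective:

```python
import torch

def bc_regularized_loss(ppo_loss, predicted_actions, expert_actions, bc_coef=0.1):
    """Add a behavior-cloning penalty to the RL objective.

    `ppo_loss` is the usual clipped PPO surrogate computed on dream transitions;
    the mean-squared-error term anchors the fine-tuned policy's denoised actions
    to the expert demonstrations, discouraging it from exploiting inaccuracies
    ("bugs") in the learned world model. `bc_coef` trades off improvement
    against staying close to the experts.
    """
    bc_term = ((predicted_actions - expert_actions) ** 2).mean()
    return ppo_loss + bc_coef * bc_term

# Usage with stand-in values (a scalar surrogate loss and a batch of 8 actions).
loss = bc_regularized_loss(torch.tensor(0.5), torch.randn(8, 7), torch.randn(8, 7))
```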

Does It Work?
The researchers evaluated DiWA on the CALVIN benchmark, a standard simulator for robotic manipulation, and on real-world hardware.
Simulation Results
The comparison here is stark. The baseline method, DPPO, requires interacting with the environment to learn. DiWA interacts only with its internal model.
In the table below, look at the “Total Physical Interactions” row. DiWA achieves success rates comparable to or better than online methods, but with zero physical interactions during fine-tuning.

Notice tasks like close-drawer or turn-on-lightbulb. The pre-trained policy (Base) often fails (e.g., 59% success). After “dreaming” with DiWA, success jumps to 91%. The competitor, DPPO, requires millions of steps in the environment to reach similar levels.
The graph below visualizes this efficiency. The DiWA line (blue) represents the performance of the policy fine-tuned offline. The other lines show online methods slowly catching up over hundreds of thousands of steps.

Real-World “Dreaming”
Simulation is one thing, but the real world is messy. Can a World Model trained on real camera data actually predict the future accurately enough to train a policy?
The researchers collected 4 hours of teleoperation data to train the world model. They then tested its ability to “hallucinate” the future.
The results are visually impressive. In the figure below, the robot observes a state, and the World Model predicts the next 80 steps (approx. 10 seconds) of video. The predictions (bottom rows) maintain the geometry of the robot arm and the objects remarkably well.

Because the world model is accurate, the policy trained inside it transfers to reality. The team tested three skills: opening a drawer, closing a drawer, and pushing a slider.

As shown in Figure 4(b), the success rate (y-axis) climbs steadily as the model trains in its imagination (x-axis). For example, the “Close Drawer” skill (green line) went from a low success rate to nearly perfect, purely through mental practice.
Why This Matters
DiWA represents a shift in how we think about robotic training.
- Safety: By moving the trial-and-error phase into a virtual “mind,” we protect hardware and surroundings.
- Scalability: We have massive amounts of “play data” (videos of robots moving, humans doing things). We have very little “expert data” (perfectly labeled tasks). DiWA allows us to use the massive play data to build a world model, and then squeeze maximum performance out of the scarce expert data.
- Data Efficiency: The ability to effectively “recycle” offline data to improve policies removes the need for constant, expensive real-world data collection.
Conclusion
DiWA bridges the gap between the stability of Imitation Learning and the adaptability of Reinforcement Learning. By formulating the diffusion denoising process as a Markov Decision Process and solving it inside a learned World Model, robots can now “dream” their way to mastery.
While there are still limitations—the world model must be high quality, and it can’t fix physics it hasn’t seen—this approach opens the door for robots that can adapt and improve continuously without needing constant hand-holding or dangerous real-world experimentation.
Key Takeaway: The next time a robot perfectly executes a complex task, it might be because it spent the last few hours practicing it in its dreams.