Imagine you are about to stack a heavy ceramic bowl onto a fragile glass cup. Before you even move your hand, your brain runs a split-second simulation. You visualize the bowl slipping, the glass shattering, and the mess that follows. Consequently, you adjust your grip and approach before you even make contact.
This ability to “predict the future” based on our actions is fundamental to human dexterity. In robotics, this concept is implemented through World Models—internal simulators that allow robots to forecast the consequences of their actions. However, giving robots this foresight is notoriously difficult. Predicting every pixel of a future video frame is computationally expensive and often results in blurry, physically impossible hallucinations.
In this post, we are diving deep into LaDi-WM (Latent Diffusion-based World Model), a new framework presented at CoRL 2025. This paper proposes a clever workaround: instead of predicting pixels, robots should predict the “latent” essence of the world—specifically its geometry and semantics—using diffusion models.
Let’s explore how LaDi-WM works, why it outperforms prior state-of-the-art methods by roughly 15 points (nearly 30% relative) on the LIBERO-LONG benchmark, and how it enables robots to “refine” their thoughts before acting.
The Problem: Pixels vs. Latents
To understand why LaDi-WM is necessary, we must first look at the limitations of current predictive manipulation.
The Trap of Pixel Prediction
Traditional world models often try to predict future video frames (pixels). While visually impressive, this is inefficient for control. A robot doesn’t need to know the exact color value of a pixel on a wall 5 meters away; it needs to know the geometry of the handle it’s trying to grasp and the semantic meaning of “cup” vs. “bowl.” Pixel-level diffusion models are heavy, slow, and struggle to generalize to new environments.
The Limitation of Simple Latent Models
Another approach is predicting latent states—compressed numerical representations of the world. Methods like DreamerV3 do this well for video games. However, in robotics, these latent spaces are usually trained just to reconstruct the image. They often miss the fine-grained geometric details (shapes, edges, depth) and semantic context (what objects are) required for precise manipulation tasks like stacking bowls or opening drawers.
Enter LaDi-WM: The Best of Both Worlds
The core insight of LaDi-WM is that we don’t need to train a representation from scratch. We already have powerful Visual Foundation Models (VFMs) that understand the world.
The researchers propose combining two specific types of pre-trained knowledge:
- DINO (Geometric Features): Excellent at understanding local geometry, object parts, and correspondences.
- SigLIP/CLIP (Semantic Features): Excellent at understanding global context and language-aligned semantics.
By predicting how these specific features evolve over time, rather than pixels, the robot gets a prediction that is both lightweight and rich in relevant information.

As shown in Figure 2 above, the system is split into two phases:
- Learning the World Model (Left): Training a diffusion model to predict future latent states from task-agnostic videos.
- Policy Learning (Right): A robot uses this trained world model to imagine future states and refine its actions.
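To make the two-phase split concrete, here is a tiny runnable sketch in PyTorch. Everything in it is an illustrative assumption: the encoder, world model, and policy are toy linear layers, and a deterministic one-step predictor stands in for the latent diffusion dynamics described next.

```python
import torch
import torch.nn as nn

# Toy stand-ins (all assumptions): a frozen encoder plays the role of the
# visual foundation models, and a linear layer stands in for the latent
# dynamics, just to show the two-phase training split.
encode = nn.Linear(32, 16).requires_grad_(False)   # frozen "foundation" features
world_model = nn.Linear(16 + 4, 16)                # latent dynamics: (z, a) -> z'
policy = nn.Linear(16 + 16, 4)                     # sees state + imagined future

wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs, act, next_obs = torch.randn(8, 32), torch.randn(8, 4), torch.randn(8, 32)

# Phase 1: fit the world model on task-agnostic (obs, action, next_obs) data.
pred = world_model(torch.cat([encode(obs), act], dim=-1))
wm_loss = (pred - encode(next_obs)).pow(2).mean()
wm_loss.backward()
wm_opt.step()

# Phase 2: freeze the world model; the policy trains against imagined futures.
world_model.requires_grad_(False)
z = encode(obs)
a_init = policy(torch.cat([z, torch.zeros_like(z)], dim=-1))  # no imagination yet
imagined = world_model(torch.cat([z, a_init], dim=-1))        # imagined next state
a_ref = policy(torch.cat([z, imagined], dim=-1))              # refined action
pi_loss = (a_ref - act).pow(2).mean()                         # imitation target
pi_loss.backward()
pi_opt.step()
```

The design point worth noticing is in Phase 2: the world model is frozen, but gradients still flow through it into the policy, so the policy learns from imagination without disturbing the learned dynamics.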
1. The Dual Latent Space
The first step is extracting the state. Rather than operating on the raw image \(I_t\), the model encodes the observation into a combined latent vector \(\mathbf{z}_t\):
\[
\mathbf{z}_t = [\mathbf{z}_t^D; \mathbf{z}_t^S] = [f_{\text{dino}}(I_t); f_{\text{sigl}}(I_t)],
\]
Here, \(\mathbf{z}_t^D\) represents the DINO features (geometry) and \(\mathbf{z}_t^S\) represents the SigLIP features (semantics). By concatenating them, the robot perceives the world through a lens that captures where things are and what things are simultaneously.
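Here is a minimal sketch of this dual-latent extraction, assuming stub encoders in place of the real frozen DINO and SigLIP backbones; the patch handling and feature dimensions (384 and 768) are illustrative choices, not the paper’s.

```python
import torch
import torch.nn as nn

# Stub encoders: in LaDi-WM these are frozen pre-trained DINO and SigLIP
# backbones; here tiny linear projections keep the sketch self-contained.
class StubEncoder(nn.Module):
    def __init__(self, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, out_dim)  # one 16x16 RGB patch

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)                    # (num_patches, out_dim)

f_dino = StubEncoder(out_dim=384)    # geometric features z_t^D
f_siglip = StubEncoder(out_dim=768)  # semantic features  z_t^S

def encode_state(patches: torch.Tensor) -> torch.Tensor:
    """z_t = [z_t^D ; z_t^S]: concatenate geometry and semantics per patch."""
    with torch.no_grad():            # the foundation models stay frozen
        z_d = f_dino(patches)        # where things are
        z_s = f_siglip(patches)      # what things are
    return torch.cat([z_d, z_s], dim=-1)

# Example: 196 flattened 16x16 patches from a 224x224 image
z_t = encode_state(torch.randn(196, 3 * 16 * 16))
print(z_t.shape)  # torch.Size([196, 1152])
```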
2. Interactive Diffusion Modeling
Standard diffusion models add noise to data and learn to reverse the process to generate new data. However, LaDi-WM has two different types of data (Geometry and Semantics) that follow different distributions. If you simply concatenated them and diffused them together, the model might struggle to learn the distinct dynamics of each. If you trained them separately, you lose the connection between an object’s shape and its identity.
The authors introduce Interactive Diffusion. This mechanism allows the geometric stream and the semantic stream to “talk” to each other during the denoising process.
The diffusion process aims to predict the future sequence of latent states \(\mathbf{z}_{t+1:t+k+1}\) given history and actions:
\[
\mathbf{z}_{t+1:t+k+1} \sim p_\theta(\mathbf{z}_{t+1:t+k+1} \mid \mathbf{z}_{t-l:t}, a_{t:t+k}),
\]
In the reverse (generation) process, the model estimates the “clean” (denoised) version of the geometry to help guide the semantics, and vice versa.
The clean components \(C_\theta\) are estimated using decomposition networks:
\[
\begin{aligned}
C_\theta^D &= f_{\theta_1}(\mathbf{z}_{t+1:t+k+1,n}^D, n, \mathbf{z}_{t-l:t}^D, a_{t:t+k}), \\
C_\theta^S &= f_{\theta_2}(\mathbf{z}_{t+1:t+k+1,n}^S, n, \mathbf{z}_{t-l:t}^S, a_{t:t+k}).
\end{aligned}
\]
These clean estimates are then cross-fed into the denoising networks, so each stream is denoised under guidance from the other:
\[
\begin{aligned}
\mathbf{z}_{t+1:t+k+1,\theta}^D, \eta_\theta^D &= f_{\theta_3}(\mathbf{z}_{t+1:t+k+1,n}^D, n, C_\theta^S, a_{t:t+k}), \\
\mathbf{z}_{t+1:t+k+1,\theta}^S, \eta_\theta^S &= f_{\theta_4}(\mathbf{z}_{t+1:t+k+1,n}^S, n, C_\theta^D, a_{t:t+k}),
\end{aligned}
\]
where \(n\) indexes the diffusion step and the subscript \(n\) marks noised latents.
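To see how the four networks fit together, here is a minimal sketch of a single interactive denoising step. The dimensions, the simple MLPs standing in for \(f_{\theta_1}\) through \(f_{\theta_4}\), the pooled history vectors, and the single-step future are all assumptions; the noise estimates \(\eta_\theta\) are dropped for brevity.

```python
import torch
import torch.nn as nn

D_GEO, D_SEM, D_ACT = 384, 768, 7  # illustrative dimensions, not the paper's

def mlp(d_in: int, d_out: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_in, 256), nn.GELU(), nn.Linear(256, d_out))

# Decomposition networks: estimate each stream's clean component from its own
# noisy future latents, the diffusion step n, its history, and the actions.
f_theta1 = mlp(D_GEO + 1 + D_GEO + D_ACT, D_GEO)   # -> C^D
f_theta2 = mlp(D_SEM + 1 + D_SEM + D_ACT, D_SEM)   # -> C^S

# Denoising networks: each stream is denoised while conditioning on the
# *other* stream's clean estimate (the cross-feeding interaction).
f_theta3 = mlp(D_GEO + 1 + D_SEM + D_ACT, D_GEO)   # C^S guides geometry
f_theta4 = mlp(D_SEM + 1 + D_GEO + D_ACT, D_SEM)   # C^D guides semantics

def interactive_denoise_step(zD_n, zS_n, n, histD, histS, actions):
    """One reverse-diffusion step with cross-stream interaction."""
    step = torch.full_like(zD_n[..., :1], float(n))  # scalar noise-level token
    C_D = f_theta1(torch.cat([zD_n, step, histD, actions], dim=-1))
    C_S = f_theta2(torch.cat([zS_n, step, histS, actions], dim=-1))
    zD = f_theta3(torch.cat([zD_n, step, C_S, actions], dim=-1))
    zS = f_theta4(torch.cat([zS_n, step, C_D, actions], dim=-1))
    return zD, zS

# Toy shapes: batch of 1, a single future step, pooled history vectors
zD, zS = interactive_denoise_step(
    torch.randn(1, D_GEO), torch.randn(1, D_SEM), n=10,
    histD=torch.randn(1, D_GEO), histS=torch.randn(1, D_SEM),
    actions=torch.randn(1, D_ACT),
)
print(zD.shape, zS.shape)  # torch.Size([1, 384]) torch.Size([1, 768])
```

Notice that \(C_\theta^S\) feeds the geometric denoiser and \(C_\theta^D\) feeds the semantic one: each stream keeps its own dynamics while staying consistent with the other.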

Why does this matter? The authors visualize attention connections between features as red lines. On the left (“Without interaction”), the model struggles to connect geometric features with their semantic counterparts; the attention is scattered. On the right (“With interaction”), the red lines clearly connect relevant parts of the scene: the model understands that the geometry of the handle belongs to the semantic concept of the drawer.
Predictive Policy: Thinking Before Acting
Having a world model is only half the battle. The robot needs a policy (a brain) that uses this model to act. LaDi-WM introduces a Predictive Manipulation Policy with a mechanism called Iterative Refinement.
The Imagination Loop
Most robot policies are “reactive”: See state \(\to\) Output action. LaDi-WM is “reflective”:
- Initial Guess: The policy outputs a preliminary action sequence based on current history.
- Imagination: The World Model takes this action and predicts the future latent states (what will happen if I do this?).
- Refinement: The policy takes the imagined future as a new input and corrects its action.

As illustrated in Figure 1 above:
- Step 1: The policy generates an initial action (\(a^{init}\)).
- Step 2: The LaDi-WM predicts the trajectory (Imagined States).
- Step 3: The policy sees these imagined states and produces a refined action (\(a^{ref}\)).
Convergence of Thought
This refinement process isn’t just a one-off. It can be looped. The authors found that iterating this process significantly improves the quality of the action.
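In code, the loop is a handful of lines. The sketch below uses toy stand-ins for the policy and world model (their interfaces here are assumptions, not the paper’s API), but it captures the propose-imagine-correct cycle:

```python
import torch

# Hypothetical stand-ins; in LaDi-WM both are learned networks.
def policy(history, imagined=None):
    ctx = history if imagined is None else history + imagined.mean()
    return torch.tanh(ctx.mean(dim=-1, keepdim=True)).repeat(1, 7)  # 7-DOF action

def world_model(history, action):
    return history + 0.1 * action.mean()  # toy latent rollout

def refine_action(history, num_iters=4):
    """Propose -> imagine -> correct, looped num_iters times."""
    action = policy(history)                     # a^init from history alone
    for _ in range(num_iters):
        imagined = world_model(history, action)  # predicted future latents
        action = policy(history, imagined)       # a^ref conditioned on them
    return action

a = refine_action(torch.randn(1, 128))
print(a.shape)  # torch.Size([1, 7])
```

Each pass conditions the policy on a fresher imagined future, which is the mechanism behind the sharpening action distributions discussed below.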

The heatmaps above (Figure 7) visualize the probability distribution of the robot’s actions. In Iteration 1, the distribution is diffuse—the robot is uncertain. By Iteration 4, the “hot spots” (bright yellow) are tight and focused. The robot has “thought through” the problem and is confident in its plan.
Experimental Results
The theory sounds solid, but does it work? The authors tested LaDi-WM on LIBERO-LONG (a challenging long-horizon manipulation benchmark) and in real-world scenarios.
Simulation Performance
The results on LIBERO-LONG are staggering compared to previous state-of-the-art methods like Seer and ATM.

As seen in Table 1, LaDi-WM achieves an Average Success Rate (Avg.SR) of 68.7%, compared to 53.6% for Seer and 44.0% for ATM. This is a massive leap in performance. Notably, it dominates in tasks requiring precise interaction, like “Turn on the stove and put the moka pot on it” (Task 1), scoring 88.3% versus the nearest competitor’s 71.7%.
Real-World Generalization
Robotics papers often struggle to bridge the “Sim-to-Real” gap. LaDi-WM, however, shows impressive transfer capabilities. The authors tested it on a physical 7-DOF robot arm performing tasks like stacking bowls and organizing groceries.

In the real-world experiments (Table 5 in the paper), the full LaDi-WM system achieved a 60.0% success rate, significantly higher than vanilla Behavior Cloning (40.0%).
Crucially, the ablation study in the real-world table reveals two key findings:
- Pixel diffusion underperforms: switching to pixel-based diffusion dropped the success rate to 51.4%, confirming that the latent space is more robust for control.
- Diffusion itself matters: removing the diffusion-based dynamics entirely dropped the success rate to 49.3%.
Scalability
A hallmark of modern AI is scalability—does the model get better if you throw more data at it?

Figure 3 confirms that LaDi-WM scales effectively.
- (a): As the number of demonstrations increases, the World Model’s prediction error (MSE) drops.
- (b): Increasing policy training data leads to consistent success rate gains, maintaining a lead over baselines.
- (c): Simply making the model larger (more parameters) continues to improve performance on both LIBERO and CALVIN benchmarks.
Why Interaction Matters: A Deeper Look
The authors performed ablation studies to prove that the “Interactive” part of their diffusion model wasn’t just a gimmick.

Table 3 shows that removing the interaction (treating the DINO and SigLIP streams independently) drops the success rate from 60.7% to 58.9%. While that might seem small, the gap widens on specific complex tasks. Furthermore, removing the semantic features (SigLIP) entirely causes a larger drop, to 57.3%, showing that geometry alone isn’t enough: the robot needs to understand what it is holding.
Conclusion
LaDi-WM represents a significant step forward in embodied AI. By moving away from pixel prediction and embracing a dual latent space of geometry and semantics, it solves the problem of “blurry” future predictions.
The key takeaways are:
- Don’t predict pixels: Predict the latent features of Foundation Models (DINO + SigLIP).
- Interaction is key: Geometry and Semantics must be diffused together to maintain consistency.
- Think before acting: Using the World Model to iteratively refine the policy’s actions leads to much higher success rates and lower uncertainty.
This approach demonstrates that for robots to act intelligently in the physical world, they need an internal simulator that is fast, semantic-aware, and geometrically precise. LaDi-WM provides exactly that blueprint.