Imagine you are about to stack a heavy ceramic bowl onto a fragile glass cup. Before you even move your hand, your brain runs a split-second simulation. You visualize the bowl slipping, the glass shattering, and the mess that follows. Consequently, you adjust your grip and approach before you even make contact.
This ability to “predict the future” based on our actions is fundamental to human dexterity. In robotics, this concept is implemented through World Models—internal simulators that allow robots to forecast the consequences of their actions. However, giving robots this foresight is notoriously difficult. Predicting every pixel of a future video frame is computationally expensive and often results in blurry, physically impossible hallucinations.
In this post, we are diving deep into LaDi-WM (Latent Diffusion-based World Model), a new framework presented at CoRL 2025. This paper proposes a clever workaround: instead of predicting pixels, robots should predict the “latent” essence of the world—specifically its geometry and semantics—using diffusion models.
Let’s explore how LaDi-WM works, why it outperforms prior state-of-the-art methods by roughly 15 points (nearly 30% relative) on the LIBERO-LONG benchmark, and how it enables robots to “refine” their thoughts before acting.
The Problem: Pixels vs. Latents
To understand why LaDi-WM is necessary, we must first look at the limitations of current predictive manipulation.
The Trap of Pixel Prediction
Traditional world models often try to predict future video frames (pixels). While visually impressive, this is inefficient for control. A robot doesn’t need to know the exact color value of a pixel on a wall 5 meters away; it needs to know the geometry of the handle it’s trying to grasp and the semantic meaning of “cup” vs. “bowl.” Pixel-level diffusion models are heavy, slow, and struggle to generalize to new environments.
The Limitation of Simple Latent Models
Another approach is predicting latent states—compressed numerical representations of the world. Methods like DreamerV3 do this well for video games. However, in robotics, these latent spaces are usually trained just to reconstruct the image. They often miss the fine-grained geometric details (shapes, edges, depth) and semantic context (what objects are) required for precise manipulation tasks like stacking bowls or opening drawers.
Enter LaDi-WM: The Best of Both Worlds
The core insight of LaDi-WM is that we don’t need to train a representation from scratch. We already have powerful Visual Foundation Models (VFMs) that understand the world.
The researchers propose combining two specific types of pre-trained knowledge:
- DINO (Geometric Features): Excellent at understanding local geometry, object parts, and correspondences.
- SigLIP/CLIP (Semantic Features): Excellent at understanding global context and language-aligned semantics.
By predicting how these specific features evolve over time, rather than pixels, the robot gets a prediction that is both lightweight and rich in relevant information.

As shown in Figure 2 above, the system is split into two phases:
- Learning the World Model (Left): Training a diffusion model to predict future latent states from task-agnostic videos.
- Policy Learning (Right): A robot uses this trained world model to imagine future states and refine its actions.
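To make the two-phase split concrete, here is a tiny runnable sketch in PyTorch. Everything in it is an illustrative assumption: the encoder, world model, and policy are toy linear layers, and a deterministic one-step predictor stands in for the latent diffusion dynamics described next.

```python
import torch
import torch.nn as nn

# Toy stand-ins (all assumptions): a frozen encoder plays the role of the
# visual foundation models, and a linear layer stands in for the latent
# dynamics, just to show the two-phase training split.
encode = nn.Linear(32, 16).requires_grad_(False)   # frozen "foundation" features
world_model = nn.Linear(16 + 4, 16)                # latent dynamics: (z, a) -> z'
policy = nn.Linear(16 + 16, 4)                     # sees state + imagined future

wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs, act, next_obs = torch.randn(8, 32), torch.randn(8, 4), torch.randn(8, 32)

# Phase 1: fit the world model on task-agnostic (obs, action, next_obs) data.
pred = world_model(torch.cat([encode(obs), act], dim=-1))
wm_loss = (pred - encode(next_obs)).pow(2).mean()
wm_loss.backward()
wm_opt.step()

# Phase 2: freeze the world model; the policy trains against imagined futures.
world_model.requires_grad_(False)
z = encode(obs)
a_init = policy(torch.cat([z, torch.zeros_like(z)], dim=-1))  # no imagination yet
imagined = world_model(torch.cat([z, a_init], dim=-1))        # imagined next state
a_ref = policy(torch.cat([z, imagined], dim=-1))              # refined action
pi_loss = (a_ref - act).pow(2).mean()                         # imitation target
pi_loss.backward()
pi_opt.step()
```

The design point worth noticing is in Phase 2: the world model is frozen, but gradients still flow through it into the policy, so the policy learns from imagination without disturbing the learned dynamics.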
1. The Dual Latent Space
The first step is extracting the state. Rather than operating on the raw image \(I_t\), the model encodes the observation into a combined latent vector \(\mathbf{z}_t\):
\[
\mathbf{z}_t = [\mathbf{z}_t^D; \mathbf{z}_t^S] = [f_{\text{dino}}(I_t); f_{\text{sigl}}(I_t)],
\]
Here, \(\mathbf{z}_t^D\) represents the DINO features (geometry) and \(\mathbf{z}_t^S\) represents the SigLIP features (semantics). By concatenating them, the robot perceives the world through a lens that captures where things are and what things are simultaneously.
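Here is a minimal sketch of this dual-latent extraction, assuming stub encoders in place of the real frozen DINO and SigLIP backbones; the patch handling and feature dimensions (384 and 768) are illustrative choices, not the paper’s.

```python
import torch
import torch.nn as nn

# Stub encoders: in LaDi-WM these are frozen pre-trained DINO and SigLIP
# backbones; here tiny linear projections keep the sketch self-contained.
class StubEncoder(nn.Module):
    def __init__(self, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, out_dim)  # one 16x16 RGB patch

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)                    # (num_patches, out_dim)

f_dino = StubEncoder(out_dim=384)    # geometric features z_t^D
f_siglip = StubEncoder(out_dim=768)  # semantic features  z_t^S

def encode_state(patches: torch.Tensor) -> torch.Tensor:
    """z_t = [z_t^D ; z_t^S]: concatenate geometry and semantics per patch."""
    with torch.no_grad():            # the foundation models stay frozen
        z_d = f_dino(patches)        # where things are
        z_s = f_siglip(patches)      # what things are
    return torch.cat([z_d, z_s], dim=-1)

# Example: 196 flattened 16x16 patches from a 224x224 image
z_t = encode_state(torch.randn(196, 3 * 16 * 16))
print(z_t.shape)  # torch.Size([196, 1152])
```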
2. Interactive Diffusion Modeling
Standard diffusion models add noise to data and learn to reverse the process to generate new data. However, LaDi-WM has two different types of data (Geometry and Semantics) that follow different distributions. If you simply concatenated them and diffused them together, the model might struggle to learn the distinct dynamics of each. If you trained them separately, you lose the connection between an object’s shape and its identity.
The authors introduce Interactive Diffusion. This mechanism allows the geometric stream and the semantic stream to “talk” to each other during the denoising process.
The diffusion process aims to predict the future sequence of latent states \(\mathbf{z}_{t+1:t+k+1}\) given history and actions:
\[
\mathbf{z}_{t+1:t+k+1} \sim p_\theta(\mathbf{z}_{t+1:t+k+1} \mid \mathbf{z}_{t-l:t}, a_{t:t+k}),
\]
In the reverse (generation) process, the model estimates the “clean” (denoised) version of the geometry to help guide the semantics, and vice versa.
The clean components \(C_\theta\) are estimated using decomposition networks:
\[
\begin{aligned}
C_\theta^D &= f_{\theta_1}(\mathbf{z}_{t+1:t+k+1,n}^D, n, \mathbf{z}_{t-l:t}^D, a_{t:t+k}), \\
C_\theta^S &= f_{\theta_2}(\mathbf{z}_{t+1:t+k+1,n}^S, n, \mathbf{z}_{t-l:t}^S, a_{t:t+k}).
\end{aligned}
\]
These clean estimates are then cross-fed into the denoising networks, so each stream is denoised under guidance from the other:
\[
\begin{aligned}
\mathbf{z}_{t+1:t+k+1,\theta}^D, \eta_\theta^D &= f_{\theta_3}(\mathbf{z}_{t+1:t+k+1,n}^D, n, C_\theta^S, a_{t:t+k}), \\
\mathbf{z}_{t+1:t+k+1,\theta}^S, \eta_\theta^S &= f_{\theta_4}(\mathbf{z}_{t+1:t+k+1,n}^S, n, C_\theta^D, a_{t:t+k}),
\end{aligned}
\]
where \(n\) indexes the diffusion step and the subscript \(n\) marks noised latents.
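To see how the four networks fit together, here is a minimal sketch of a single interactive denoising step. The dimensions, the simple MLPs standing in for \(f_{\theta_1}\) through \(f_{\theta_4}\), the pooled history vectors, and the single-step future are all assumptions; the noise estimates \(\eta_\theta\) are dropped for brevity.

```python
import torch
import torch.nn as nn

D_GEO, D_SEM, D_ACT = 384, 768, 7  # illustrative dimensions, not the paper's

def mlp(d_in: int, d_out: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_in, 256), nn.GELU(), nn.Linear(256, d_out))

# Decomposition networks: estimate each stream's clean component from its own
# noisy future latents, the diffusion step n, its history, and the actions.
f_theta1 = mlp(D_GEO + 1 + D_GEO + D_ACT, D_GEO)   # -> C^D
f_theta2 = mlp(D_SEM + 1 + D_SEM + D_ACT, D_SEM)   # -> C^S

# Denoising networks: each stream is denoised while conditioning on the
# *other* stream's clean estimate (the cross-feeding interaction).
f_theta3 = mlp(D_GEO + 1 + D_SEM + D_ACT, D_GEO)   # C^S guides geometry
f_theta4 = mlp(D_SEM + 1 + D_GEO + D_ACT, D_SEM)   # C^D guides semantics

def interactive_denoise_step(zD_n, zS_n, n, histD, histS, actions):
    """One reverse-diffusion step with cross-stream interaction."""
    step = torch.full_like(zD_n[..., :1], float(n))  # scalar noise-level token
    C_D = f_theta1(torch.cat([zD_n, step, histD, actions], dim=-1))
    C_S = f_theta2(torch.cat([zS_n, step, histS, actions], dim=-1))
    zD = f_theta3(torch.cat([zD_n, step, C_S, actions], dim=-1))
    zS = f_theta4(torch.cat([zS_n, step, C_D, actions], dim=-1))
    return zD, zS

# Toy shapes: batch of 1, a single future step, pooled history vectors
zD, zS = interactive_denoise_step(
    torch.randn(1, D_GEO), torch.randn(1, D_SEM), n=10,
    histD=torch.randn(1, D_GEO), histS=torch.randn(1, D_SEM),
    actions=torch.randn(1, D_ACT),
)
print(zD.shape, zS.shape)  # torch.Size([1, 384]) torch.Size([1, 768])
```

Notice that \(C_\theta^S\) feeds the geometric denoiser and \(C_\theta^D\) feeds the semantic one: each stream keeps its own dynamics while staying consistent with the other.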

Why does this matter? The authors visualize attention connections between features as red lines. On the left (“Without interaction”), the model struggles to connect geometric features with their semantic counterparts; the attention is scattered. On the right (“With interaction”), the red lines clearly connect relevant parts of the scene: the model understands that the geometry of the handle belongs to the semantic concept of the drawer.
Predictive Policy: Thinking Before Acting
Having a world model is only half the battle. The robot needs a policy (a brain) that uses this model to act. LaDi-WM introduces a Predictive Manipulation Policy with a mechanism called Iterative Refinement.
The Imagination Loop
Most robot policies are “reactive”: See state \(\to\) Output action. LaDi-WM is “reflective”:
- Initial Guess: The policy outputs a preliminary action sequence based on current history.
- Imagination: The World Model takes this action and predicts the future latent states (what will happen if I do this?).
- Refinement: The policy takes the imagined future as a new input and corrects its action.

As illustrated in Figure 1 above:
- Step 1: The policy generates an initial action (\(a^{init}\)).
- Step 2: The LaDi-WM predicts the trajectory (Imagined States).
- Step 3: The policy sees these imagined states and produces a refined action (\(a^{ref}\)).
Convergence of Thought
This refinement process isn’t just a one-off. It can be looped. The authors found that iterating this process significantly improves the quality of the action.
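In code, the loop is a handful of lines. The sketch below uses toy stand-ins for the policy and world model (their interfaces here are assumptions, not the paper’s API), but it captures the propose-imagine-correct cycle:

```python
import torch

# Hypothetical stand-ins; in LaDi-WM both are learned networks.
def policy(history, imagined=None):
    ctx = history if imagined is None else history + imagined.mean()
    return torch.tanh(ctx.mean(dim=-1, keepdim=True)).repeat(1, 7)  # 7-DOF action

def world_model(history, action):
    return history + 0.1 * action.mean()  # toy latent rollout

def refine_action(history, num_iters=4):
    """Propose -> imagine -> correct, looped num_iters times."""
    action = policy(history)                     # a^init from history alone
    for _ in range(num_iters):
        imagined = world_model(history, action)  # predicted future latents
        action = policy(history, imagined)       # a^ref conditioned on them
    return action

a = refine_action(torch.randn(1, 128))
print(a.shape)  # torch.Size([1, 7])
```

Each pass conditions the policy on a fresher imagined future, which is the mechanism behind the sharpening action distributions discussed below.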

The heatmaps above (Figure 7) visualize the probability distribution of the robot’s actions. In Iteration 1, the distribution is diffuse—the robot is uncertain. By Iteration 4, the “hot spots” (bright yellow) are tight and focused. The robot has “thought through” the problem and is confident in its plan.
Experimental Results
The theory sounds solid, but does it work? The authors tested LaDi-WM on LIBERO-LONG (a challenging long-horizon manipulation benchmark) and in real-world scenarios.
Simulation Performance
The results on LIBERO-LONG are staggering compared to previous state-of-the-art methods like Seer and ATM.

As seen in Table 1, LaDi-WM achieves an Average Success Rate (Avg.SR) of 68.7%, compared to 53.6% for Seer and 44.0% for ATM. This is a massive leap in performance. Notably, it dominates in tasks requiring precise interaction, like “Turn on the stove and put the moka pot on it” (Task 1), scoring 88.3% versus the nearest competitor’s 71.7%.
Real-World Generalization
Robotics papers often struggle to bridge the “Sim-to-Real” gap. LaDi-WM, however, shows impressive transfer capabilities. The authors tested it on a physical 7-DOF robot arm performing tasks like stacking bowls and organizing groceries.

In the real-world experiments (Table 5 in the paper), the full LaDi-WM system achieved a 60.0% success rate, significantly higher than vanilla Behavior Cloning (40.0%).
Crucially, the ablation study in the real-world table reveals two key findings:
- Pixel diffusion underperforms: switching to pixel-based diffusion dropped the success rate to 51.4%, confirming that the latent space is more robust for control.
- Diffusion itself matters: removing the diffusion-based dynamics entirely dropped the success rate to 49.3%.
Scalability
A hallmark of modern AI is scalability—does the model get better if you throw more data at it?

Figure 3 confirms that LaDi-WM scales effectively.
- (a): As the number of demonstrations increases, the World Model’s prediction error (MSE) drops.
- (b): Increasing policy training data leads to consistent success rate gains, maintaining a lead over baselines.
- (c): Simply making the model larger (more parameters) continues to improve performance on both LIBERO and CALVIN benchmarks.
Why Interaction Matters: A Deeper Look
The authors performed ablation studies to prove that the “Interactive” part of their diffusion model wasn’t just a gimmick.

Table 3 shows that removing the interaction (treating the DINO and SigLIP streams independently) drops the success rate from 60.7% to 58.9%. While that might seem small, the gap widens on specific complex tasks. Furthermore, removing the semantic features (SigLIP) entirely causes a larger drop, to 57.3%, showing that geometry alone isn’t enough: the robot needs to understand what it is holding.
Conclusion
LaDi-WM represents a significant step forward in embodied AI. By moving away from pixel prediction and embracing a dual latent space of geometry and semantics, it solves the problem of “blurry” future predictions.
The key takeaways are:
- Don’t predict pixels: Predict the latent features of Foundation Models (DINO + SigLIP).
- Interaction is key: Geometry and Semantics must be diffused together to maintain consistency.
- Think before acting: Using the World Model to iteratively refine the policy’s actions leads to much higher success rates and lower uncertainty.
This approach demonstrates that for robots to act intelligently in the physical world, they need an internal simulator that is fast, semantic-aware, and geometrically precise. LaDi-WM provides exactly that blueprint.