Introduction
Robotics has achieved remarkable feats in industrial settings, particularly with rigid objects. We have robots that can weld car chassis with sub-millimeter precision or assemble electronics at lightning speed. However, move that robot into a kitchen and ask it to fold a dumpling or wrap a piece of dough, and it will likely struggle.
The challenge lies in deformable object manipulation. Unlike a rigid box, a piece of dough or cloth has infinite degrees of freedom. Its shape changes based on contact, gravity, and material properties. When you combine this with a dexterous robot hand (which has high dimensionality itself), the search space for finding a successful movement trajectory becomes computationally explosive.
Traditional methods, such as trajectory optimization, often fail here. They get stuck in local optima or struggle because the “cost function” (the mathematical way we tell the robot if it’s doing a good job) doesn’t provide useful feedback until the task is nearly done.
In this article, we take a deep dive into D-Cubed (Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation), a novel approach presented by researchers at the University of Oxford. This paper proposes a fascinating intersection of three complex fields: variational autoencoders (VAEs) to learn skills, latent diffusion models (LDMs) to generate plans, and gradient-free optimization to guide the robot toward a goal.

As illustrated in Figure 1, D-Cubed takes a task-agnostic dataset, learns how to move, and then optimizes trajectories to turn a pile of dough into a specific target shape. Let’s explore how this architecture bridges the gap between simulation and reality.
The Core Problem: Why is this Hard?
To appreciate D-Cubed, we must first understand why standard approaches fail.
The Curse of Dimensionality
In dexterous manipulation, you control a hand with many joints (often 20+). You are manipulating an object (like cloth or plasticine) that can deform in infinite ways. To solve a task, you need to find a sequence of actions over a long horizon (e.g., hundreds of time steps). The mathematical space of all possible actions is vast: a 20-joint hand controlled over 300 time steps already defines a 6,000-dimensional continuous search space, before even accounting for the object's state. Searching through it randomly is impossible.
The “Gradient” Problem
A common way to solve robot control problems is using differentiable physics simulators. These simulators calculate gradients—mathematical signals that tell the robot, “If you move your finger slightly to the left, the cost decreases.”
However, with deformable objects, these gradients are often unreliable or “sparse.” If the robot’s finger isn’t touching the object, the gradient is zero (no information). Even when touching, the complex contact physics can make the gradients noisy or inaccurate. This makes standard gradient-based optimization methods prone to failure.
Background Concepts
D-Cubed relies on a few foundational machine learning concepts. Let’s briefly review them.
Denoising Diffusion Probabilistic Models (DDPMs)
Diffusion models have taken the world by storm (powering tools like DALL-E and Midjourney). They work on a simple principle:
- Forward Process: Slowly add noise to data until it becomes pure static.
- Reverse Process: Train a neural network to remove that noise step-by-step to recover the original data.
In robotics, we can treat a trajectory (a sequence of robot movements) as an “image.” We can train a diffusion model to generate realistic robot movements by denoising random static into smooth trajectories.
Mathematically, the reverse process predicts the mean \(\mu\) and variance \(\Sigma\) of the previous step given the current noisy step. With a network \(G\) that predicts the clean data, the mean is updated as follows:

\[
\mu_{\theta}(\mathbf{x}_i, i) = \frac{\sqrt{\bar{\alpha}_{i-1}}\, \beta_i}{1 - \bar{\alpha}_i}\, G(\mathbf{x}_i, i) + \frac{\sqrt{\alpha_i}\, (1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i}\, \mathbf{x}_i
\]
And the variance is usually fixed or learned:
\[
\Sigma_{\theta}(\mathbf{x}_i, i) = \sigma_i^2 \mathbf{I} = \widetilde{\beta}_i \mathbf{I}, \qquad \widetilde{\beta}_i = \frac{1 - \bar{\alpha}_{i-1}}{1 - \bar{\alpha}_i} \beta_i
\]
The model is trained to minimize the difference between the predicted clean data and the actual data:
The model is trained to minimize the difference between the predicted clean data and the actual data:

\[
\mathcal{L}_{\mathrm{diffusion}} = \mathbb{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\, i \sim [1, N]} \left[ \left\| \mathbf{x}_0 - G(\mathbf{x}_i, i) \right\|_2^2 \right]
\]
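To make this concrete, here is a minimal PyTorch sketch of the training objective above, treating each trajectory as a `(T, action_dim)` array. The linear noise schedule, the tensor shapes, and the denoiser interface `G(x_i, i)` are illustrative assumptions, not the paper's implementation.

```python
import torch

# Illustrative linear noise schedule (an assumption, not the paper's choice).
N = 100                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, N)      # beta_i
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # bar(alpha)_i

def diffusion_loss(G, x0):
    """L_diffusion: train G to predict the clean trajectory x0 from a noised x_i.

    x0: batch of clean trajectories, shape (B, T, action_dim).
    G:  denoiser network taking (x_i, i) and returning a prediction of x0.
    """
    B = x0.shape[0]
    i = torch.randint(0, N, (B,))                       # random step per sample
    eps = torch.randn_like(x0)                          # Gaussian noise
    a_bar = alpha_bars[i].view(B, 1, 1)
    x_i = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward process q(x_i | x0)
    return ((x0 - G(x_i, i)) ** 2).mean()               # ||x0 - G(x_i, i)||_2^2
```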
Cross-Entropy Method (CEM)
CEM is a popular optimization algorithm that doesn’t need gradients (a minimal code sketch follows this list). It works by:
- Sampling a batch of solutions from a distribution (usually Gaussian).
- Evaluating them.
- Selecting the top k performers (the “elites”).
- Fitting a new Gaussian distribution to those elites.
- Repeating until convergence.
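Here is a minimal, self-contained NumPy sketch of that loop; the population size, elite count, and toy cost function are arbitrary choices for illustration.

```python
import numpy as np

def cem(cost_fn, dim, iters=20, pop=64, n_elites=8, seed=0):
    """Minimal Cross-Entropy Method: no gradients, just sample / rank / refit."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))  # 1. sample candidates
        costs = np.array([cost_fn(s) for s in samples])   # 2. evaluate them
        elites = samples[np.argsort(costs)[:n_elites]]    # 3. keep the top k
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6  # 4. refit the Gaussian
    return mu                                             # 5. converged mean

# Toy usage: find the point closest to (3, 3, 3, 3, 3).
best = cem(lambda s: np.sum((s - 3.0) ** 2), dim=5)
```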
D-Cubed will cleverly combine these two concepts.
The D-Cubed Architecture
The D-Cubed method is split into two phases: Learning and Trajectory Optimisation.

Part 1: Learning from “Play” Data
One of the biggest bottlenecks in robotics is gathering expert demonstrations. Teaching a robot to fold a shirt by manually moving its arms 1,000 times is expensive and tedious.
D-Cubed bypasses this by using a task-agnostic play dataset. This dataset contains generic hand motions—opening, closing, wiggling fingers, flexing wrists—mimicking human hand movements. It isn’t trying to achieve a specific goal; it’s just exploring the capabilities of the hand.
Step A: The Skill Latent Space (VAE)
The raw action space of a robot hand is high-frequency and jittery. Optimizing every single joint angle for every millisecond is inefficient.
The researchers use a Variational Autoencoder (VAE) to compress short sequences of these actions (e.g., 10 time steps) into a compact “skill” vector, \(z\).
- Encoder: Takes a short action sequence and squashes it into latent code \(z\).
- Decoder: Takes \(z\) and reconstructs the action sequence.
The VAE is trained using the Evidence Lower Bound (ELBO) loss:
\[
\mathcal{L}^{\mathrm{ELBO}} = \mathbb{E}_{\mathbf{z} \sim q_{\psi}(\mathbf{z} \mid \mathbf{a}^{t:t+H})} \left[ \log p_{\psi}^{dec}(\mathbf{a}^{t:t+H} \mid \mathbf{z}) \right] - D_{\mathrm{KL}} \left[ q_{\psi}(\mathbf{z} \mid \mathbf{a}^{t:t+H}) \,\big\|\, p(\mathbf{z}) \right]
\]
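As a concrete (and heavily simplified) illustration, here is a toy PyTorch version of such a skill VAE trained with the negative ELBO. The MLP architecture, the dimensions, and the Gaussian decoder are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    """Toy VAE compressing an H-step action chunk into a skill vector z."""
    def __init__(self, action_dim=24, H=10, z_dim=16):
        super().__init__()
        flat = action_dim * H
        self.enc = nn.Sequential(nn.Linear(flat, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim))  # -> (mu, log_var)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, flat))
        self.H, self.action_dim = H, action_dim

    def forward(self, a):                                    # a: (B, H, action_dim)
        mu, log_var = self.enc(a.flatten(1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterise
        recon = self.dec(z).view(-1, self.H, self.action_dim)
        recon_loss = ((a - recon) ** 2).sum(dim=(1, 2)).mean()  # Gaussian -log p_dec
        kl = -0.5 * (1 + log_var - mu**2 - log_var.exp()).sum(-1).mean()
        return recon_loss + kl                               # negative ELBO to minimise
```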
Why do this? This abstraction reduces the search space. Instead of planning thousands of individual motor commands, the system now plans a sequence of “skills” (e.g., “close fingers,” “rotate wrist”).
Step B: The Latent Diffusion Model (LDM)
Now that we have “skills,” we need a way to string them together into long trajectories. This is where the Latent Diffusion Model comes in.
The authors train an LDM (using a Transformer backbone) on the play dataset. The model learns to generate sequences of skill latents (\(z^{1:T_{skill}}\)). Because it is a diffusion model, it is excellent at modeling complex, multi-modal distributions. It can generate diverse, physically plausible sequences of skills.
The LDM training objective minimizes the error in predicting the clean skill sequence from the noisy one:
\[
\mathcal{L}_{\mathrm{ldm}} = \mathbb{E}_{\mathbf{z}^{1:T_{skill}} \sim \mathcal{D}_{play},\, i \sim [1, N]} \left[ \left\| \mathbf{z}_0^{1:T_{skill}} - G(\mathbf{z}_i^{1:T_{skill}}, i) \right\|_2^2 \right]
\]
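Conceptually, this is the same denoising objective as before, just applied to sequences of skill latents rather than raw actions. Here is a hedged sketch of the data preparation, reusing the `SkillVAE` and `diffusion_loss` sketches above; the fixed-window chunking and the use of the posterior mean as \(z\) are my assumptions.

```python
import torch

def encode_to_skills(vae, play_actions, H=10):
    """Turn long play trajectories into skill sequences with a trained SkillVAE.

    play_actions: (B, T_skill * H, action_dim), assumed divisible into H-step chunks.
    """
    B, TH, A = play_actions.shape
    with torch.no_grad():
        chunks = play_actions.view(B * (TH // H), H, A)      # H-step windows
        mu, _ = vae.enc(chunks.flatten(1)).chunk(2, dim=-1)  # posterior mean as z
    return mu.view(B, TH // H, -1)                           # (B, T_skill, z_dim)

# The LDM loss is then the same denoising objective applied to skill
# sequences instead of raw trajectories, e.g.:
#   skills = encode_to_skills(vae, play_actions)
#   loss = diffusion_loss(transformer_denoiser, skills)
```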
Part 2: Gradient-Free Guided Sampling (The Optimization)
This is the most innovative part of the paper. We have a model that can dream up robot hand motions (LDM), but we need it to solve a specific task, like folding dough.
Standard diffusion guidance (classifier guidance) typically uses gradients to push the generation toward a goal. But as discussed earlier, gradients in deformable simulators are unreliable.
Instead, D-Cubed embeds the Cross-Entropy Method (CEM) directly into the reverse diffusion process.
The Algorithm Step-by-Step
- Initialize: Start with a batch of pure Gaussian noise trajectories.
- Reverse Diffusion Loop (Step \(N\) to 1):
  - Exploration: The LDM predicts the mean \(\mu_i\) for the next denoising step.
  - Sampling: D-Cubed samples a batch of noisy skill trajectories around this mean.
  - Evaluation: These noisy skill trajectories are decoded into actual motor actions and executed in the simulator. The cost (distance to the target shape) is calculated.
  - Selection: The trajectory with the lowest cost is identified as \(z_{best}\).
  - Guidance: This “best” trajectory is used to update the mean for the next diffusion step.
The update equation for the mean effectively pulls the diffusion process toward the low-cost trajectory:
\[
\mu_{\theta}(\mathbf{z}_i^{1:T_{skill}}, i) = \frac{\sqrt{\bar{\alpha}_{i-1}}\, \beta_i}{1 - \bar{\alpha}_i}\, G_{\theta}(\mathbf{z}_{best}^{1:T_{skill}}, i) + \frac{\sqrt{\alpha_i}\, (1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i}\, \mathbf{z}_{best}^{1:T_{skill}}
\]
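Putting the loop together, here is a schematic sketch of the guided reverse process under the notation above. `decode_and_rollout` is a hypothetical stand-in for decoding skill sequences with the VAE and evaluating their cost in the simulator; the schedule and batch sizes are illustrative, not the paper's settings.

```python
import torch

def guided_reverse_process(G, decode_and_rollout, T_skill, z_dim,
                           n_samples=32, N=100):
    """Gradient-free guided sampling (schematic): CEM-style selection
    inside each reverse diffusion step."""
    betas = torch.linspace(1e-4, 0.02, N)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(n_samples, T_skill, z_dim)          # 1. start from pure noise
    for i in reversed(range(1, N)):
        costs = decode_and_rollout(z)                   # 2. simulate each candidate
        z_best = z[costs.argmin()]                      # 3. elite: lowest cost

        # 4. guidance: posterior mean computed from z_best, not from each sample
        x0_hat = G(z_best[None], torch.tensor([i]))     # predicted clean sequence
        mu = (alpha_bars[i-1].sqrt() * betas[i] / (1 - alpha_bars[i])) * x0_hat \
           + (alphas[i].sqrt() * (1 - alpha_bars[i-1]) / (1 - alpha_bars[i])) * z_best
        sigma = ((1 - alpha_bars[i-1]) / (1 - alpha_bars[i]) * betas[i]).sqrt()

        # 5. resample a fresh batch around the guided mean (explore -> exploit)
        z = mu + sigma * torch.randn(n_samples, T_skill, z_dim)
    return z[decode_and_rollout(z).argmin()]            # best final trajectory
```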
Why this works
In the early steps of diffusion (high noise), the variance is large. This means the model generates very diverse trajectories—it effectively “explores” the search space. As the steps progress (low noise), the variance shrinks, and the model “exploits” the best trajectory, refining the fine movements to minimize the cost.
By filtering the samples through the simulator at every step, D-Cubed ensures that the final trajectory isn’t just a hallucination of the diffusion model—it’s physically verified to work.
Experiments and Results
The authors evaluated D-Cubed on a suite of six challenging tasks involving simulated dough and rope. These included single-hand tasks (Folding, Flipping) and dual-hand tasks (Dumpling making, Bun shaping).

The qualitative results (Figure 3) show the robot performing complex maneuvers, utilizing friction and contacts to reshape the objects.
Quantitative Comparison
D-Cubed was compared against strong baselines:
- Grad TrajOpt: Gradient-based optimization.
- MPPI: A standard sampling-based method.
- PPO: A popular Reinforcement Learning algorithm.
- LDM with Classifier Guidance: Using standard gradient guidance.
The metric used is the “Normalized Improvement in Earth Mover’s Distance (EMD),” where 1.0 is a perfect match to the target shape.
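Based on that description, one plausible reading of the metric (my assumption of the exact formula, not the paper's definition) is the fractional reduction in EMD relative to the starting shape:

```python
def normalized_improvement(emd_initial, emd_final):
    """Fractional reduction in EMD: 1.0 when the final shape matches the
    target exactly (EMD = 0), 0.0 when there is no improvement at all."""
    return (emd_initial - emd_final) / emd_initial

normalized_improvement(0.08, 0.01)  # -> 0.875
```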
Table 1: The averaged normalised improvement in EMD and standard deviation, reported for each method and averaged over 3 seeds. The scores for Grad TrajOpt and PPO are taken from previous work [14].
Key Takeaways from Table 1:
- D-Cubed dominates: It outperforms all baselines by a significant margin across all tasks.
- Gradients fail: Notice how Grad TrajOpt and LDM w/ Classifier Guidance struggle (scores near 0.0 or very low). This confirms that simulator gradients are too noisy to guide complex deformation tasks.
- Baselines struggle: Even PPO and standard MPPI cannot handle the complexity of the state space.
Why does it perform so well? (Ablation Studies)
The authors performed ablation studies to prove the necessity of their design choices.
1. Importance of Sampling Count. In Figure 4(a), they show that increasing the number of trajectories sampled during the reverse process improves performance. This makes sense: more samples = better chance of finding a good path.

2. Gradients Don’t Help. Figure 4(b) is particularly interesting. It shows that adding gradient guidance on top of their method offers no statistically significant improvement. The “black box” simulator evaluation is sufficient and more robust.
3. The Power of Skill Latents. Does the VAE (skill latent) actually help, or could we just run diffusion on raw actions? Figure 5 compares D-Cubed with and without the skill latent space.

For simple tasks (Flip, Folding), raw actions work okay. But for complex tasks like “Rope” and “Wrap,” the skill latent version performs drastically better. The skill space acts as a “macro” language, allowing the planner to think in terms of behaviors rather than muscle twitches.
4. Refinement Over Time. Figure 7 demonstrates the optimization process over diffusion timesteps.

We can see a sharp improvement early on. This indicates that the “Exploration” phase (high noise) quickly identifies the general motion required to make contact with the object, while the later steps refine the shape.
Sim-to-Real Transfer
Finally, the holy grail of robotics research: Does it work in the real world?
The authors transferred the trajectories optimized in simulation to a real-world setup using a LEAP hand. Despite the “reality gap” (differences in friction, sensor noise, etc.), the trajectory successfully flipped a deformable object.

This is possible because D-Cubed generates open-loop trajectories that are robust enough to work without constant real-time correction, provided the initial conditions are similar.
Conclusion and Implications
D-Cubed represents a significant step forward in robotic manipulation. By moving away from gradient dependency and embracing generative modeling, it solves tasks that were previously intractable.
Here are the key takeaways for students and researchers:
- Skills are better than Actions: Abstracting low-level controls into “skills” (via VAEs) makes long-horizon planning feasible.
- Diffusion as a Planner: Diffusion models aren’t just for images. They are powerful priors for generating complex time-series data like robot trajectories.
- Simulation is a powerful critic: Even if simulator gradients are noisy, the simulator’s output state is a valuable signal for evaluating and selecting trajectories.
The limitations? D-Cubed is currently computationally heavy because it requires running many simulations during the inference (planning) phase. However, as parallel simulation technology (like Isaac Gym) improves, methods like D-Cubed could become fast enough for real-time applications.
This paper is a great example of how modern generative AI techniques can be creatively applied to classical robotics problems, opening the door for robots that can finally handle the messy, squishy world we live in.