Introduction
In the world of robotics, there is a constant tug-of-war between creativity and safety.
On one side, we have data-driven methods, particularly diffusion models. These are the “artists.” They have watched thousands of demonstrations and learned to generate complex, human-like motions. They can navigate cluttered rooms or manipulate objects with dexterity. However, like many artists, they don’t always like following strict rules. If you present a diffusion model with a safety constraint it hasn’t seen before, it might hallucinate a path right through a wall.
On the other side, we have model-based optimization. These are the “inspectors.” They rely on physics, hard constraints, and control theory (like Control Barrier Functions). They guarantee safety and stability. But they are often rigid, computationally expensive, and lack the ability to “improvise” in complex scenarios.
The standard solution in robotics has been to let the artist draw a plan, and then have the inspector mark it up with a red pen (post-hoc safety filtering). But what happens when the artist draws something so unsafe that the inspector can’t fix it without ruining the drawing?
In a recent paper, researchers from Georgia Tech propose a new framework called Joint Model-based Model-free Diffusion (JM2D). Instead of a sequential “plan-then-fix” approach, JM2D has the data-driven planner and the model-based optimizer generate a solution together, in a single joint sampling process.

The Problem: The “Sequential” Trap
To understand why JM2D is necessary, we first need to look at why current methods fail.
Imagine a robot arm trying to reach a cup. A diffusion planner suggests a trajectory based on its training data. A safety filter (the model-based module) then looks at this trajectory. If the trajectory is about to hit an obstacle, the safety filter intervenes, nudging the robot away.
This is called Sequential Sampling. The problem is that the diffusion planner is “blind” to the safety filter’s capabilities. It might propose a trajectory that is so aggressive that the safety filter has no choice but to slam on the brakes, causing the robot to freeze or jerk violently. The two modules are misaligned.
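To make the pattern concrete, here is a minimal Python sketch of the sequential pipeline (the objects and method names are hypothetical placeholders, not the paper's or any library's API):

```python
def sequential_plan_then_filter(obs, diffusion_policy, safety_filter):
    """Plan-then-fix: the planner proposes, the filter corrects afterwards.

    `diffusion_policy` and `safety_filter` are hypothetical stand-ins for a
    learned planner and a model-based safety module (e.g., a CBF-based filter).
    """
    # The "artist" proposes a trajectory using only its training prior;
    # it knows nothing about the filter's capabilities or the new obstacle.
    proposed_traj = diffusion_policy.sample(obs)

    # The "inspector" corrects it after the fact. Because it never shaped the
    # proposal, it may have to intervene heavily, braking or jerking the robot.
    safe_traj, did_intervene = safety_filter.correct(proposed_traj, obs)
    return safe_traj, did_intervene
```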
Some researchers have tried Constraint-Guided Diffusion, where gradients from a cost function guide the diffusion process. However, this often requires differentiable constraints (which real-world obstacles often aren’t) and can push the robot into “traps”—local minima where the robot gets stuck because the guidance pushed it off the manifold of realistic motions.
The Solution: Joint Sampling (JM2D)
The core insight of JM2D is simple but profound: Don’t just sample the plan. Sample the plan and the safety correction at the same time.
The researchers formulate a Joint Model-Free Model-Based Generation (JM2G) problem.
- Let \(x\) be the robot’s plan (the trajectory).
- Let \(k\) be the model-based parameters (e.g., the specific control tweaks or safety backup plan).
Instead of generating \(x\) and then solving for \(k\), JM2D generates a joint pair \((x, k)\) from a joint distribution. The “glue” holding them together is an Interaction Potential, denoted as \(V(x, k)\). This potential function simply asks: “Are \(x\) and \(k\) compatible?” High compatibility means the plan \(x\) is feasible given the safety capabilities of \(k\), and \(k\) is appropriate for the plan \(x\).
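As a rough sketch (the exact form and weighting used in the paper may differ), you can think of the joint target distribution as the learned prior over plans tilted by the interaction potential:

\[
p(x, k) \;\propto\; p_{\text{data}}(x)\,\exp\!\big(V(x, k)\big),
\]

so pairs in which the plan and the safety parameters agree are exponentially more likely to be drawn.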
How It Works: The Joint Diffusion Process
The framework treats both the high-level plan and the low-level optimization parameters as variables to be “denoised.”
- Start with Noise: The system starts with random noise for both the plan (\(x_I\)) and the safety parameters (\(k_I\)).
- Look Ahead (The Magic Step): At every step of the denoising process, the model needs to know which direction to step. Standard diffusion uses a score function trained on data. JM2D needs a joint score that accounts for both the data prior and the interaction potential.
This presents a mathematical challenge: The interaction potential usually involves hard constraints (like “don’t hit the wall”) or complex optimizations that are non-differentiable. You can’t just take the gradient of a brick wall.
To solve this, the authors use Importance Sampling. Instead of trying to differentiate the constraints, they use a Monte Carlo approach.
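Schematically, the resulting self-normalized estimate looks like this (a sketch reconstructed from the description that follows; the paper's exact expression may carry additional diffusion-specific scaling terms):

\[
\mathbb{E}\big[(x_0, k_0)\mid x_i, k_i\big] \;\approx\;
\frac{\sum_{n=1}^{N} \exp\!\big(V(\hat{x}_0^{(n)}, \hat{k}_0^{(n)})\big)\,\big(\hat{x}_0^{(n)}, \hat{k}_0^{(n)}\big)}
     {\sum_{n=1}^{N} \exp\!\big(V(\hat{x}_0^{(n)}, \hat{k}_0^{(n)})\big)},
\]

where each \((\hat{x}_0^{(n)}, \hat{k}_0^{(n)})\) is one sampled guess of the clean output, and the weighted estimate is then turned into the score used for the next denoising step.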

Here is the intuition behind the equation above:
- From the current noisy state, the algorithm shoots out multiple “guesses” of what the final clean, denoised result might look like (\(\hat{x}_0, \hat{k}_0\)).
- It evaluates the Interaction Potential \(V(\hat{x}_0, \hat{k}_0)\) for each guess. Basically, it checks: “In this guessed future, did we crash?”
- It calculates a weighted average. Guesses that resulted in safe, compatible outcomes get higher weights.
- This weighted average forms the gradient (score) that steers the diffusion process.
By doing this, JM2D guides the diffusion process toward regions where the plan and the safety backup are in harmony, without ever needing to calculate a gradient of the constraint itself.
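Putting the pieces together, a minimal Python sketch of one guided denoising step might look like this (the `denoiser` and `interaction_potential` interfaces and the update rule are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def guided_denoise_step(x_i, k_i, step, denoiser, interaction_potential,
                        num_samples=32):
    """One joint denoising step steered by Monte Carlo estimates of V(x, k).

    x_i, k_i : current noisy plan and model-based parameters (np.ndarray)
    denoiser : callable (x_i, k_i, step) -> (x0_hat, k0_hat), a stochastic
               guess of the clean pair (illustrative interface)
    interaction_potential : callable V(x0, k0) -> float; higher means the
               plan and safety parameters are more compatible. It may be
               non-differentiable (collision checks, QP feasibility, ...).
    """
    guesses, log_weights = [], []
    for _ in range(num_samples):
        # 1. Shoot out a guess of the final clean, denoised result.
        x0_hat, k0_hat = denoiser(x_i, k_i, step)
        # 2. Evaluate the interaction potential: "in this guessed future,
        #    did the plan and the safety backup agree (e.g., no crash)?"
        log_weights.append(interaction_potential(x0_hat, k0_hat))
        guesses.append(np.concatenate([np.ravel(x0_hat), np.ravel(k0_hat)]))

    # 3. Self-normalized importance weights: compatible futures count more.
    log_w = np.asarray(log_weights, dtype=float)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # 4. The weighted average of the guesses is the target that steers the
    #    next (less noisy) iterate of the joint sample (x, k).
    return (w[:, None] * np.asarray(guesses)).sum(axis=0)
```

The important point is that `interaction_potential` is only ever evaluated, never differentiated, which is what lets this scheme handle hard constraints such as collision checks.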
Visualizing the Difference
The authors demonstrate this behavior using a “Donut” toy domain. The goal is to plan a path from a start to a goal within a grey donut shape, while avoiding a red obstacle that appears only at test time.

- Sequential Sampling (Left): The planner picks a spot based on training data, ignoring the red obstacle. The optimization tries to fix it but fails because the initial guess was too far off.
- JM2D (Right): The diffusion process “feels” the interaction potential. It naturally converges on the area where the plan is both valid (on the donut) and safe (compatible with the optimization).
Experiments: Does it work on robots?
The researchers tested JM2D on both simulation benchmarks and real hardware.
1. The PointMaze Challenge
In this experiment, a robot must navigate a maze. The tricky part? At test time, the walls are “inflated” (padded), making the corridors narrower than what the robot saw during training.
- RAIL (Baseline): A standard method that generates a plan and then applies a safety filter.
- JM2D: The joint sampling approach proposed in the paper.

The results in Figure 4 are telling. As the walls get thicker (moving right on the x-axis):
- Safe Success Rate (Left): The vanilla diffusion policy (red line) fails miserably because it doesn’t know the walls moved. JM2D (blue line) maintains a near-perfect success rate.
- Intervention Rate (Middle): This is the key metric. The RAIL baseline (orange) stays safe, but the safety filter has to intervene constantly (up to 40% of the time). This makes the robot jerky and slow. JM2D has a much lower intervention rate because the generated plans are already compliant with the safety constraints.
- Task Horizon (Right): Because JM2D fights the safety filter less, it completes the maze faster (lower is better).
2. Real-World Manipulation
They deployed JM2D on a Franka Emika Panda robot arm. The task was to pick up a mug. The catch? They placed random obstacles (boxes, other objects) in the scene that were not present in the training data.

A standard diffusion planner would happily smash the robot into the new obstacles because they look “close enough” to the empty table it trained on.

As shown above, JM2D successfully guides the arm around the obstacles. The “backup planner” (the model-based part) informs the diffusion process that the direct path is dangerous. The diffusion process then searches the “joint space” and finds a path that is kinematically feasible and obstacle-free.
Why Not Just “Project” the Noise?
A common alternative in this field is Projection. If a diffusion step suggests a noisy point outside the safe region, why not just mathematically “snap” it to the nearest safe point?
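In symbols, that "snap" is a Euclidean projection onto the safe set \(\mathcal{S}\) (written here purely for illustration):

\[
\Pi_{\mathcal{S}}(x) \;=\; \arg\min_{x' \in \mathcal{S}} \|x - x'\|_2 .
\]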
The authors argue this destroys the “data fidelity”—the naturalness of the motion.

In the figure above, look at the difference between JM2D (a) and DPCC (c), a projection-based method.
- JM2D: produces smooth, logical curves. The constraints act as a gentle guide throughout the generation process.
- DPCC: produces erratic, tangled paths. By forcefully projecting the noisy samples, the method breaks the internal logic of the diffusion model, resulting in “safe” but unusable or unnatural trajectories.
Conclusion
JM2D represents a significant step forward in bridging the gap between modern AI and classical robotics.
By formulating planning as a joint sampling problem, the authors created a system where the “artist” (diffusion) and the “inspector” (optimization) work in tandem rather than in opposition. The use of Monte Carlo estimation allows this framework to handle the messy, non-differentiable constraints of the real world—like walls, tables, and cups—without needing perfect differentiable simulators.
For students and researchers, the key takeaway is this: Alignment matters. It is not enough to simply stack modules on top of each other. True robustness comes when the components of a robotic system—learning, planning, and control—share information throughout the decision-making process.