The integration of Generative AI into robotics has sparked a revolution. Specifically, Diffusion Policy (DP) has emerged as a state-of-the-art approach for “behavior cloning”—teaching robots to perform tasks by mimicking human demonstrations. Unlike older methods that try to average out human movements into a single mean trajectory, Diffusion Policy embraces the fact that humans solve tasks in many different ways. It models the distribution of possible actions, allowing a robot to handle multimodal behavior (e.g., grasping a cup from the left or the right).

However, there is a catch. Diffusion models are inherently stochastic. They work by denoising random Gaussian noise into a trajectory. While this allows for diversity, it introduces a “dice roll” effect at inference time. Sometimes, the model samples a trajectory that is technically possible but lies on the fringe of the distribution—an outlier. In high-precision tasks, like threading a needle or stacking cubes, these outliers can lead to failure.

In this post, we are doing a deep dive into KDPE (Kernel Density Estimation for Diffusion Policy Trajectory Selection). This research proposes a clever, inference-time “filter” that significantly improves the reliability of diffusion-based robots without requiring expensive retraining.

The Problem: The Double-Edged Sword of Stochasticity

To understand KDPE, we first need to look at the limitations of standard Diffusion Policy.

When a robot trained with DP sees an observation (an image of the workspace), it starts from a sample of random Gaussian noise and iteratively removes that noise to reveal a clean action trajectory. Because the starting point is random, the output trajectory is different every time you run the policy, even for the exact same observation.

Usually, this is fine. But sometimes, the denoising process produces a trajectory that is:

  1. Jittery or inconsistent: The stochasticity impacts the quality of the motion.
  2. An Outlier: Since DP is a supervised learning approach, it might learn bad habits or outliers present in the training data. If the model samples one of these “rare” modes during execution, the robot might veer off course.

Most current solutions involve throwing more data at the problem or combining DP with other complex algorithms. The authors of KDPE propose a different angle: Why not generate multiple options and pick the best one?

The Solution: KDPE

KDPE stands for Kernel Density Estimation for Diffusion Policy Trajectory Selection. The core idea is elegant in its simplicity but sophisticated in its mathematical execution.

Instead of generating a single trajectory and hoping for the best, the robot generates a population of \(N\) trajectories in parallel. Then, it uses statistical analysis—specifically Kernel Density Estimation (KDE)—to determine which of these trajectories is the most “representative” of the learned distribution. In simpler terms, it looks for the trajectory that sits in the “densest” part of the action space, effectively filtering out the weird outliers.

Background: Diffusion Policy Mechanics

Before we get into the density estimation, let’s briefly recap how the trajectories are generated. Diffusion Policy models the robot’s policy as a Denoising Diffusion Probabilistic Model (DDPM).

At inference time, the process looks like this:

Equation for the denoising step in Diffusion Policy.
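
A standard way to write this update (with \(\alpha\), \(\gamma\), and \(\sigma\) determined by the noise schedule) is:

\[
\mathbf{A}^{k-1} = \alpha\left(\mathbf{A}^{k} - \gamma\,\epsilon_\theta(\mathbf{o}_t, \mathbf{A}^{k}, k)\right) + \mathcal{N}\!\left(0, \sigma^{2} I\right)
\]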

Here, \(\mathbf{A}^K\) is the starting random noise. The model works backward from step \(K\) to \(0\). At each step, a neural network \(\epsilon_\theta\) predicts the noise, which is then subtracted to refine the action \(\mathbf{A}\).

The network is trained using a Mean Squared Error (MSE) loss, trying to predict the noise added to ground-truth demonstrations:

Loss function for training the noise prediction network.
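
In its usual form, the network \(\epsilon_\theta\) is asked to recover the noise \(\epsilon^k\) that was added to a ground-truth action sequence \(\mathbf{A}^0\):

\[
\mathcal{L} = \operatorname{MSE}\!\left(\epsilon^{k},\ \epsilon_\theta\!\left(\mathbf{o}_t,\ \mathbf{A}^{0} + \epsilon^{k},\ k\right)\right)
\]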

The result is a trajectory of actions \(\mathbf{A}^0\) comprising the end-effector’s position, orientation, and gripper state over a time horizon \(T\).

Core Method: Manifold-Aware Action Selection

The heart of this paper is not the generation of trajectories (which uses standard DP), but the selection mechanism.

1. Sampling a Population

The robot takes the current observation \(\mathbf{o}_t\) and samples \(N\) independent trajectories (e.g., \(N=100\)). Because the denoising process is stochastic, we get a cloud of slightly different potential futures.

2. Focusing on the End-Effector

The researchers found that analyzing the last action in the predicted trajectory provides a strong signal for the quality of the whole path. Let’s call the last action of the \(i\)-th trajectory \(\mathbf{a}_i\).

3. The Challenge of “Density” in Robotics

The goal is to find which action \(\mathbf{a}_i\) has the highest probability density. If we were just dealing with 2D points on a graph, we could use standard Gaussian KDE. But a robotic action is complex. It consists of:

  • Position: Euclidean space (\(\mathbb{R}^3\)).
  • Gripper State: Euclidean (usually 1 dimension, open/closed).
  • Orientation: Rotation space (\(SO(3)\)).

You cannot simply subtract two rotation matrices to find the distance between them. The space of rotations is non-Euclidean; it lives on a manifold. To perform Kernel Density Estimation effectively, we need a kernel that understands this geometry.
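
To make the geometry explicit, each action can be written as a point on a product of spaces:

\[
\mathbf{a} = \left(\mathbf{p},\ R,\ g\right) \in \mathbb{R}^{3} \times SO(3) \times \mathbb{R},
\]

where \(\mathbf{p}\) is the end-effector position, \(R\) its orientation, and \(g\) the gripper state.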

4. The Manifold-Aware Kernel

The authors propose a unified kernel function that handles position, rotation, and gripper state simultaneously. The kernel \(k(\mathbf{a}_i, \mathbf{a}_j)\) measures the similarity between two actions:

Equation defining the multivariate kernel function.
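
A Gaussian-style kernel consistent with this description, written with a bandwidth (covariance) matrix \(\mathbf{H}\), looks like:

\[
k(\mathbf{a}_i, \mathbf{a}_j) \propto \exp\!\left(-\tfrac{1}{2}\,\boldsymbol{\Delta}_{ij}^{\top}\,\mathbf{H}^{-1}\,\boldsymbol{\Delta}_{ij}\right)
\]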

In this equation, \(\mathbf{H}\) is a covariance matrix that weights the different components (position, rotation, gripper). The critical part is \(\boldsymbol{\Delta}_{ij}\), which represents the “difference” between two actions.

This difference vector is constructed by concatenating the differences of the individual components, respecting their specific manifolds:

Equation showing the construction of the difference vector delta.
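
Written out component by component, the stacked difference vector takes roughly this form:

\[
\boldsymbol{\Delta}_{ij} =
\begin{bmatrix}
\mathbf{p}_i - \mathbf{p}_j \\
\log\!\left(R_j^{\top} R_i\right)^{\vee} \\
g_i - g_j
\end{bmatrix}
\]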

Let’s break down the term \(\log(R_j^T R_i)^\vee\):

  1. \(R_i\) and \(R_j\) are rotation matrices.
  2. \(R_j^T R_i\) computes the relative rotation between them.
  3. \(\log(\cdot)\) is the Lie group logarithm map, which maps the rotation from the manifold \(SO(3)\) to the tangent space \(\mathfrak{so}(3)\).
  4. \((\cdot)^\vee\) (the “vee” operator) converts that tangent space element into a vector in \(\mathbb{R}^3\).

Essentially, this calculates the angular distance between two 3D orientations in a mathematically rigorous way.
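
If you want to compute this rotation difference in practice, SciPy's `Rotation` class returns \(\log(\cdot)^\vee\) of a rotation matrix directly as a rotation vector. A minimal sketch (not the authors' code):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_difference(R_i: np.ndarray, R_j: np.ndarray) -> np.ndarray:
    """Return log(R_j^T R_i)^vee as a 3-vector (axis * angle)."""
    relative = R_j.T @ R_i                              # relative rotation in SO(3)
    return Rotation.from_matrix(relative).as_rotvec()   # tangent-space vector in R^3
```

The norm of this vector is the rotation angle between the two orientations, which is exactly the geodesic distance that appears below.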

When we plug this into the exponent of the kernel function, we get a weighted sum of squared distances:

Expanded exponent of the kernel function showing weighted distances.
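
If \(\mathbf{H}\) is taken to be block-diagonal with separate bandwidths \(h_p\), \(h_R\), and \(h_g\) for position, rotation, and gripper (an illustrative assumption), the exponent splits into three terms:

\[
-\frac{1}{2}\left(
\frac{\lVert \mathbf{p}_i - \mathbf{p}_j \rVert^{2}}{h_p^{2}} +
\frac{\lVert \log(R_j^{\top} R_i)^{\vee} \rVert^{2}}{h_R^{2}} +
\frac{(g_i - g_j)^{2}}{h_g^{2}}
\right)
\]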

Here, the term involving the rotations is the squared geodesic distance (the length of the shortest path on the rotation manifold):

Equation defining the geodesic distance on SO(3).
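
In standard notation:

\[
d_{SO(3)}(R_i, R_j) = \left\lVert \log\!\left(R_j^{\top} R_i\right)^{\vee} \right\rVert_2
\]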

This math ensures that “closeness” in orientation is treated just as naturally as “closeness” in centimeters for position.

5. Selecting the Best Trajectory

With the kernel defined, the algorithm can now calculate the density score \(\rho\) for any specific action \(\tilde{\mathbf{a}}\). The density is the average similarity to all other generated actions in the population:

Equation for calculating the probability density of an action.
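
Following that description, the estimate takes the familiar KDE form:

\[
\rho(\tilde{\mathbf{a}}) = \frac{1}{N} \sum_{i=1}^{N} k\!\left(\tilde{\mathbf{a}},\ \mathbf{a}_i\right)
\]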

Finally, KDPE selects the trajectory associated with the action that has the maximum \(\rho\) value.
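
Putting the selection step together, here is a minimal sketch of the idea (not the authors' implementation; the bandwidth values are illustrative, and the batched Diffusion Policy rollout that produces the samples is assumed to happen elsewhere):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def kdpe_select(positions, rotations, grippers, h_p=0.02, h_r=0.1, h_g=0.5):
    """Return the index of the most 'typical' final action among N samples.

    positions: (N, 3)    final end-effector positions
    rotations: (N, 3, 3) final end-effector rotation matrices
    grippers:  (N,)      final gripper states
    """
    N = positions.shape[0]
    log_k = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            d_pos = np.sum((positions[i] - positions[j]) ** 2)
            rel = rotations[j].T @ rotations[i]
            d_rot = np.sum(Rotation.from_matrix(rel).as_rotvec() ** 2)
            d_grip = (grippers[i] - grippers[j]) ** 2
            log_k[i, j] = -0.5 * (d_pos / h_p**2 + d_rot / h_r**2 + d_grip / h_g**2)
    density = np.exp(log_k).mean(axis=1)   # KDE score of each sampled action
    return int(np.argmax(density))         # index of the densest sample

# Usage: roll out N trajectories, score their final actions, execute the winner.
```

The robot then executes the full trajectory whose final action won this "vote".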

Visualization of the Density

To visualize how this works, look at the figure below. It shows a heat map of gripper actions. The green and red arrows represent open and closed grippers.

Visualization of PDF estimated via KDE.

As you move from left to right, the “probe” orientation changes, and the heat map (the purple/yellow blobs) changes shape with it. The KDE spikes only when the position, orientation, and gripper state all align with the majority of the sampled data, showing that the kernel successfully fuses these distinct data types into a single density landscape.

Experimental Setup

To prove that picking the “densest” trajectory is better than picking a random one, the authors tested KDPE on a suite of standard benchmarks.

Simulation Tasks: They utilized RoboMimic and MimicGen, covering tasks like lifting objects, assembling pieces, and manipulating coffee machines.

RoboMimic and MimicGen task environments.

They compared four methods:

  1. DP (Baseline): Standard Diffusion Policy (executing a single randomly sampled trajectory).
  2. KDPE (Ours): Selecting the highest density action.
  3. KDPE-OOD: Selecting the lowest density action (to prove outliers are bad).
  4. Tr-KDPE: A variant that looks at the density of the whole trajectory, not just the last step.

Real Robot Tasks: They also deployed the system on a physical Franka Emika Panda robot for tasks requiring dexterity, such as PickPlush, CubeSort, and CoffeeMaking.

Real-world robot tasks including plush picking and coffee making.

Results: Precision and Consistency

The results confirm the hypothesis: filtering for the mode of the distribution improves success rates.

Simulation Benchmarks

In the table below, “ph” stands for proficient-human (clean demonstrations from a single skilled operator) and “mh” for multi-human (mixed-quality demonstrations from multiple operators).

Table comparing success rates of DP and KDPE variants.

Key Takeaways from the Data:

  • KDPE Wins: KDPE outperforms the standard DP baseline in almost every category.
  • Messy Data: The improvement is most pronounced on “mh” datasets. This makes sense—if the training data contains bad demonstrations, the model produces more outliers, and KDPE filters these out effectively.
  • Precision Matters: On the ToolHang task, which requires high precision, KDPE shows a large improvement (up to 12% in the out-of-distribution experiments).
  • Outliers are Fatal: The KDPE-OOD row shows abysmal performance. This confirms that the “tails” of the diffusion distribution contain failure modes. If you pick the least dense trajectory, the robot almost certainly fails.

Robustness to Visual Perturbation

One common issue in robotics is that models break if the lighting or colors change slightly. The authors tested this by modifying object colors in the simulation.

Visual perturbation experiments showing color-shifted objects.

Even in these shifted domains, KDPE maintained a higher success rate than standard DP, suggesting that the “mode” of the distribution is more robust to visual noise than random samples are.

Real World Success

On the physical robot, the results mirrored the simulation.

Success rates for real-world tasks.

In the CoffeeMaking task—a long-horizon task requiring precise insertion of a pod—KDPE showed smoother behavior and higher reliability. The authors noted that the robot’s motion was visibly less jittery, as the KDE selection likely filtered out trajectories that had high-frequency noise or erratic movements.

Why not use the whole trajectory?

You might wonder why KDPE focuses only on the last action rather than the whole path. The authors actually tested a “Trajectory-KDPE” (Tr-KDPE) which uses conditional probabilities to model the full sequence:

Equations for conditional KDE used in Tr-KDPE.
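
Schematically (a generic chain-rule factorization rather than the paper's exact notation), the density of a whole trajectory is built from conditional densities of each action given the ones before it:

\[
p(\mathbf{a}_1, \dots, \mathbf{a}_T) = p(\mathbf{a}_1) \prod_{t=2}^{T} p\!\left(\mathbf{a}_t \mid \mathbf{a}_{1:t-1}\right)
\]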

However, as seen in the results table, Tr-KDPE did not perform as well as the simpler KDPE. The reason is the Curse of Dimensionality. As you increase the number of dimensions (from one action to a sequence of actions), the data becomes sparse, and estimating density becomes exponentially harder and computationally expensive. Focusing on the final action acts as a sufficient heuristic without breaking the compute budget.

Conclusion and Implications

KDPE offers a compelling “inference-time scaling” strategy for robotic manipulation. It demonstrates that we don’t always need to retrain a model to get better performance; sometimes, we just need to be smarter about how we sample from the model we already have.

By treating the output of a Diffusion Policy not as a single answer but as a probabilistic field of options, KDPE leverages the “Wisdom of the Crowd” (or rather, the wisdom of the generated samples). By using a manifold-aware kernel, it correctly navigates the complex geometry of robotic poses, ensuring the robot sticks to the most reliable, “normal” behavior it learned, while discarding the dangerous creative liberties the diffusion model might occasionally take.

For students and researchers, KDPE highlights an important lesson: Generative models are powerful, but their stochastic nature requires control. Statistical tools like KDE provide the guardrails needed to turn a probabilistic guess into a reliable robotic action.

Visualization Tool

The authors also released a visualizer to help debug these trajectories, using point clouds to see exactly where the distribution is centered.

Visualization tool showing trajectory densities.

This work paves the way for more reliable autonomous agents that can benefit from the creativity of generative AI without suffering from its hallucinations.