Don’t Retrain, Just Steer: How DSRL Adapts Diffusion Robots via Latent Space
In the rapidly evolving world of robot learning, Behavioral Cloning (BC) has emerged as the dominant paradigm. By collecting demonstrations from humans (teleoperation) and training neural networks to mimic those actions, we have enabled robots to perform impressive manipulation tasks. Recently, Diffusion Models—the same tech behind DALL-E and Stable Diffusion—have taken over robotics, providing policies that can model complex, multi-modal action distributions with high precision.
But there is a catch.
BC policies are brittle. If a robot encounters a scenario slightly different from its training data, or if the human demonstrations weren’t perfect, the robot fails. In the “open world,” we cannot possibly collect demonstrations for every edge case. We need robots that can learn from their own mistakes and adapt online.
Typically, we turn to Reinforcement Learning (RL) for this. But applying RL to diffusion policies is a nightmare. It usually requires back-propagating gradients through a deep, iterative denoising process, which is computationally expensive and unstable. Or it requires fine-tuning massive “generalist” models (like \(\pi_0\)), which risks catastrophic forgetting of their pre-trained capabilities.
What if we didn’t have to touch the weights of the neural network at all?
In a fascinating new paper, Steering Your Diffusion Policy with Latent Space Reinforcement Learning, researchers from UC Berkeley, University of Washington, and Amazon propose a clever workaround. Instead of retraining the robot’s “brain,” they propose optimizing the “thought starter”—the random noise injected into the diffusion model.
This method, Diffusion Steering via Reinforcement Learning (DSRL), treats the pre-trained policy as a fixed engine and simply learns how to steer it. The result is a highly sample-efficient method that can adapt giant foundation models using simple black-box access.
Let’s dive into how this works.
The Problem: The High Cost of Adaptation
To understand DSRL, we first need to look at why current methods struggle.
The Diffusion Policy Paradigm
In a standard diffusion policy, the robot observes the state \(s\) (e.g., camera images). To decide on an action, it samples a random noise vector \(\boldsymbol{w}\) from a standard Gaussian distribution (\(\boldsymbol{w} \sim \mathcal{N}(0, I)\)). It then progressively “denoises” this vector, conditioned on the state \(s\), to produce an action \(\boldsymbol{a}\).
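To make the sampling loop concrete, here is a minimal PyTorch sketch. The `policy.denoise_step` helper and `policy.action_dim` attribute are assumed names for illustration, not an actual library API:

```python
import torch

def sample_action(policy, s: torch.Tensor, num_steps: int = 20) -> torch.Tensor:
    """Sketch of standard diffusion-policy inference (hypothetical API).

    `policy.denoise_step(s, x, t)` is an assumed helper that returns a less
    noisy action estimate at denoising step t, conditioned on the state s.
    """
    # Draw the latent noise w ~ N(0, I) in the action space.
    x = torch.randn(policy.action_dim)
    # Progressively denoise, conditioned on the observation s.
    for t in reversed(range(num_steps)):
        x = policy.denoise_step(s, x, t)
    return x  # the physical action a
```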
This works great for mimicking data. But if you want to improve the policy using RL (rewards), you typically have two bad options:
- Retrain the whole model: You treat the diffusion model parameters as the policy weights. This is slow and prone to breaking the pre-trained behaviors.
- Backprop through the denoising chain: You try to calculate how a change in weights affects the final action by differentiating through every denoising step (often 20 to 100 of them). This is memory-heavy and numerically unstable.
The DSRL Insight
The authors ask a different question: Why is the input noise always random?
In standard deployment, we assume \(\boldsymbol{w}\) must be random Gaussian noise. But mathematically, the denoising process (specifically when using DDIM or Flow Matching) is a deterministic function. If you fix the state \(s\) and fix the noise \(\boldsymbol{w}\), the output action \(\boldsymbol{a}\) is always the same.
This means the diffusion policy is actually just a function \(f(s, \boldsymbol{w}) = \boldsymbol{a}\).
If the robot is failing to grab a mug, maybe the problem isn’t the policy’s weights. Maybe we just sampled an “unlucky” \(\boldsymbol{w}\) that led to a clumsy grasp. If we could find a “lucky” \(\boldsymbol{w}\) that leads to a perfect grasp, we solve the task without changing a single weight in the diffusion network.

As shown in Figure 1, standard deployment (top) relies on random chance. DSRL (bottom) inserts a small, learned policy \(\pi^w\) that looks at the state and says, “Don’t use random noise; use this specific noise vector \(\boldsymbol{w}\).” This steers the frozen diffusion model toward high-reward actions.
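A toy sketch of that swap is shown below. The tiny random MLP stands in for the frozen diffusion policy, purely to illustrate the interface: with deterministic sampling (DDIM or flow matching), the real policy is likewise just a fixed map \(f(s, \boldsymbol{w}) = \boldsymbol{a}\), and DSRL only changes where \(\boldsymbol{w}\) comes from. All names and sizes here are illustrative.

```python
import torch
import torch.nn as nn

STATE_DIM, NOISE_DIM, ACTION_DIM = 16, 8, 8

# Stand-in for the frozen diffusion policy: a deterministic map f(s, w) -> a.
frozen_policy = nn.Sequential(nn.Linear(STATE_DIM + NOISE_DIM, 64),
                              nn.Tanh(),
                              nn.Linear(64, ACTION_DIM))
frozen_policy.requires_grad_(False)  # the weights are never touched

def f(s: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return frozen_policy(torch.cat([s, w], dim=-1))

# The small steering policy pi_w(s): the only thing DSRL trains.
pi_w = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                     nn.Linear(64, NOISE_DIM))

s = torch.randn(STATE_DIM)

# Standard deployment: random noise, so the action is a matter of luck.
a_random = f(s, torch.randn(NOISE_DIM))

# DSRL deployment: the learned steering policy picks the noise instead.
a_steered = f(s, pi_w(s))

# Determinism check: the same (s, w) always yields the same action.
w = torch.randn(NOISE_DIM)
assert torch.allclose(f(s, w), f(s, w))
```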
Core Method: Steering via Latent-Noise Space
The core contribution of this paper is formalizing the idea of Latent-Action MDPs.
1. The Latent-Action MDP
In a standard Markov Decision Process (MDP), the agent selects a physical action \(\boldsymbol{a}\) (e.g., joint velocities). In DSRL, we redefine the agent’s job.
The authors propose a transformed environment where:
- Action Space: The latent noise space \(\mathcal{W}\) (usually \(\mathbb{R}^d\)).
- Transition: When the agent picks a noise vector \(\boldsymbol{w}\), the environment internally runs the fixed diffusion policy to get \(\boldsymbol{a} = \pi_{\mathrm{dp}}(s, \boldsymbol{w})\), executes \(\boldsymbol{a}\), and returns the next state \(s'\) along with the reward for that physical action.
From the perspective of the RL algorithm, the huge, complex diffusion policy is just part of the environment dynamics. This is brilliant because it makes the diffusion policy a black box. The RL agent doesn’t need to know gradients, weights, or architecture. It just needs to know: “If I output noise \(\boldsymbol{w}\), I get reward \(r\).”
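In code, this construction is just an environment wrapper. The gym-style sketch below is an illustration under assumed names (the `gymnasium` API and a `diffusion_policy` callable), not the authors' implementation:

```python
import numpy as np
import gymnasium as gym

class LatentActionEnv(gym.Env):
    """Wraps a base env so the RL agent acts in the latent-noise space W.

    The frozen diffusion policy is folded into the environment dynamics:
    the agent emits a noise vector w, the wrapper decodes it into a
    physical action a = diffusion_policy(s, w), and the base env steps.
    """

    def __init__(self, base_env: gym.Env, diffusion_policy, noise_dim: int):
        self.base_env = base_env
        self.diffusion_policy = diffusion_policy  # deterministic f(s, w) -> a
        self.observation_space = base_env.observation_space
        self.action_space = gym.spaces.Box(-np.inf, np.inf, shape=(noise_dim,))
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs, info = self.base_env.reset(**kwargs)
        return self._last_obs, info

    def step(self, w):
        # Decode the latent action into a physical action with the frozen policy.
        a = self.diffusion_policy(self._last_obs, w)
        obs, reward, terminated, truncated, info = self.base_env.step(a)
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```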
2. The “Steering” Metaphor
Think of the pre-trained diffusion policy as a car engine that is already running. The standard approach (Gaussian sampling) is like letting the car drive randomly. DSRL puts a driver behind the wheel. The driver (the RL agent) doesn’t build the engine; they just turn the steering wheel (select \(\boldsymbol{w}\)) to guide the car toward the destination.

Figure 2 visualizes this geometry. The pink box is the pre-trained policy. By shifting the input \(w\), we shift the output \(a\). Interestingly, because diffusion models map a high-dimensional noise space to a manifold of likely actions, different noise vectors (\(w_2\) and \(w_3\)) might map to very similar actions (aliasing).
3. Noise-Aliased DSRL (DSRL-NA)
While you could just run any RL algorithm (like SAC or PPO) on this new Latent-Action MDP, the authors introduce a specialized algorithm called DSRL-NA to make it sample-efficient.
The challenge is that offline data (demonstrations) comes in the form of \((s, a)\) pairs. We know the action \(a\), but we don’t know the noise \(\boldsymbol{w}\) that generated it (inverting a diffusion model is hard). Standard RL on the latent space can’t easily use this rich offline data.
DSRL-NA solves this by training two critics:
- The Action Critic (\(Q^{\mathcal{A}}\)): Standard Q-learning on the physical \((s, a)\) space. It learns from offline data and online interaction. It tells us “How good is this physical grasp?”
- The Latent Critic (\(Q^{\mathcal{W}}\)): This critic learns the value of noise vectors. It distills knowledge from \(Q^{\mathcal{A}}\). It simply asks \(Q^{\mathcal{A}}\): “If I pick noise \(\boldsymbol{w}\), I’ll get action \(\boldsymbol{a}\). How good is \(\boldsymbol{a}\)?”
This creates a bridge. We can leverage massive offline datasets to learn the dynamics of the task (\(Q^{\mathcal{A}}\)), and then propagate that knowledge into the steering policy (\(\pi^w\)).

Algorithm 1 details this process. The key is line 4, where the Latent Critic \(Q^{\mathcal{W}}\) updates by querying the Action Critic \(Q^{\mathcal{A}}\). This allows DSRL to be incredibly sample-efficient, adapting in minutes rather than hours.
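A minimal sketch of the two critic updates plus the steering-policy update, under assumed shapes and callables (`q_a`, `q_w`, `pi_w`, and the frozen decoder `f` are illustrative stand-ins, not the paper's exact losses):

```python
import torch
import torch.nn.functional as F

def dsrl_na_losses(q_a, q_w, pi_w, f, batch, gamma=0.99):
    """Sketch of the DSRL-NA updates (shapes and API are illustrative).

    q_a(s, a): action-space critic      q_w(s, w): latent-noise critic
    pi_w(s):   steering policy          f(s, w):   frozen diffusion decoder
    """
    s, a, r, s_next, done = batch  # works for both offline and online data

    # 1) Action critic: standard TD learning in the physical (s, a) space.
    with torch.no_grad():
        a_next = f(s_next, pi_w(s_next))          # next action via steering
        target = r + gamma * (1 - done) * q_a(s_next, a_next)
    loss_q_a = F.mse_loss(q_a(s, a), target)

    # 2) Latent critic: distill Q^A through the frozen decoder, i.e.
    #    Q^W(s, w) should match Q^A(s, f(s, w)).
    w = torch.randn_like(pi_w(s))                 # sampled probe noise
    with torch.no_grad():
        q_target = q_a(s, f(s, w))
    loss_q_w = F.mse_loss(q_w(s, w), q_target)

    # 3) Steering policy: pick noise that the latent critic rates highly.
    loss_pi = -q_w(s, pi_w(s)).mean()

    return loss_q_a, loss_q_w, loss_pi
```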
Experimental Results
The paper puts DSRL to the test across simulations, real-world robots, and different model sizes.
1. Online Adaptation (Simulation)
The first question is: Can DSRL fix a suboptimal policy faster than existing methods?
The authors compared DSRL against state-of-the-art methods like DPPO (Diffusion Policy Policy Optimization) and IDQL on standard benchmarks (Robomimic and Gym).


Figures 3 and 4 show the results. In almost every task, DSRL (the dark blue line) shoots up to high success rates much faster than the baselines.
- Sample Efficiency: Look at the x-axes. DSRL often solves the task in a fraction of the timesteps required by other methods.
- Stability: Unlike standard RL, which often collapses or oscillates, DSRL maintains steady improvement.
2. The Power of Offline Data
Because of the “Noise Aliasing” architecture (DSRL-NA), the method can digest offline data to jumpstart learning.

In Figure 5, the dashed light blue line (DSRL + offline) learns even faster than pure online DSRL. Compare this to standard offline-to-online methods like RLPD (yellow) or Cal-QL (red), which barely make progress on these specific diffusion tasks. This confirms that DSRL is well suited to leveraging prior demonstrations.
3. Real-World Robotics
Simulation is nice, but does it work on hardware? The authors tested DSRL on a Franka Emika Panda and a WidowX robot.

The tasks (shown in Figure 6) included Pick-and-Place, Drawer Closing, and Block Stacking. The base policies were trained on limited data and often failed to complete the tasks reliably.

Figure 7 is striking. The green dashed line is the base diffusion policy—it often hovers near 0% or 20% success.
- DSRL (Blue Diamonds): Within roughly 4,000 to 6,000 timesteps (a manageable afternoon of robot operation), DSRL adapts the policy to near 100% success.
- RLPD (Red Triangles): A strong baseline for standard RL, RLPD struggles to match DSRL’s learning speed on these tasks.
4. Steering Generalist Giants (\(\pi_0\))
Perhaps the most exciting application is steering Foundation Models. The authors applied DSRL to \(\pi_0\), a 3.3-billion parameter Vision-Language-Action model.
Fine-tuning a 3.3B parameter model is computationally prohibitive for most labs. However, DSRL only learns the input noise, which is a tiny vector compared to the model weights.
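To get a feel for the scale gap, here is a back-of-the-envelope comparison with illustrative sizes (the 512-dimensional observation embedding and 32-dimensional noise vector are assumptions, not figures from the paper):

```python
import torch.nn as nn

# Hypothetical steering head: observation embedding (say 512-d) -> latent
# noise for an action chunk (say 32-d). Sizes are illustrative only.
steering = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 32))
n_steer = sum(p.numel() for p in steering.parameters())

print(f"steering policy params: {n_steer:,}")   # ~140 thousand, trained
print(f"pi_0 backbone params:  {3.3e9:,.0f}")   # ~3.3 billion, frozen
```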

The charts in Figure 8 show the success rate of \(\pi_0\) on Libero and Aloha tasks. The base model (green dashed) fails completely (0% success). DSRL (blue) steers this giant model to success efficiently.
They even validated this on a real robot with \(\pi_0\).
- Toaster Task: Base \(\pi_0\) success: 5/20. Steered \(\pi_0\) success: 18/20.
- Spoon Task: Base \(\pi_0\) success: 15/20. Steered \(\pi_0\) success: 19/20.
This demonstrates that DSRL enables effective “black-box” fine-tuning of proprietary or massive models where accessing the weights is impossible or impractical.
Why DSRL Works (Ablations)
You might wonder: “Does the base policy need to be good for this to work?”
The authors investigated this by training base policies on “Better”, “Okay”, and “Worse” quality data (based on operator skill).

Figure 12 (third panel) shows the result. While the base policy trained on “Worse” data (red dotted line) starts with terrible performance, DSRL is eventually able to steer it to the same high performance as the “Better” policy.
This implies that even an imperfect clone of a clumsy human demonstrator captures enough understanding of the “manifold of valid actions” that DSRL can find the good actions hidden within it.
Conclusion and Implications
DSRL represents a shift in how we think about adapting robot policies. Instead of treating the neural network as a blank slate that must be constantly rewritten, we treat it as a library of capabilities. The role of Reinforcement Learning shifts from learning how to move to learning which movement to select.
Key Takeaways:
- Efficiency: By optimizing a small latent policy rather than a massive diffusion backbone, DSRL learns faster and with less compute.
- Stability: It avoids the notorious instability of differentiating through diffusion chains.
- Universality: It works on specific task policies and massive generalist models like \(\pi_0\), and requires only black-box access.
For students and researchers, this opens up a new workflow: Download a massive, pre-trained robot brain (like Octo or \(\pi_0\)), freeze it, and simply train a lightweight “steering wheel” to handle your specific edge cases. It’s a pragmatic, scalable path toward robots that can finally adapt to the open world.