We propose DSRL: steering diffusion policies to desired behaviors by running reinforcement learning over their latent-noise space. DSRL is highly sample-efficient, enables real-world improvement of diffusion policies for robotic control, and achieves state-of-the-art performance on simulated benchmarks.
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior—an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies—a state-of-the-art BC methodology—we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
Standard deployment of a BC-trained diffusion policy \(\pi_{\mathrm{dp}}\) first samples noise \(\boldsymbol{w} \sim \mathcal{N}(0,I)\) that is then denoised through the reverse diffusion process to produce an action \(\boldsymbol{a}\). We propose modifying the initial distribution of \(\boldsymbol{w}\) with an RL-trained latent-noise space policy \(\pi^{\mathcal{W}}\) that, instead of choosing \(\boldsymbol{w} \sim \mathcal{N}(0,I)\), chooses \(\boldsymbol{w}\) to steer the distribution of actions produced by \(\pi_{\mathrm{dp}}\) in a desirable way:
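In pseudocode, the change at deployment time is only in how the initial noise is chosen. The sketch below is illustrative: the `noise_policy` and `diffusion_policy.denoise` interfaces are assumptions for exposition, not the released API.

```python
import torch

def standard_act(obs, diffusion_policy, noise_dim):
    """Standard deployment: sample Gaussian noise and denoise it into an action."""
    w = torch.randn(noise_dim)                      # w ~ N(0, I)
    return diffusion_policy.denoise(obs, init_noise=w)

def dsrl_act(obs, diffusion_policy, noise_policy):
    """DSRL deployment: an RL-trained noise-space policy picks w instead.

    The diffusion policy's weights are untouched; only black-box access to
    its denoising process is required.
    """
    w = noise_policy(obs)                           # steered latent noise
    return diffusion_policy.denoise(obs, init_noise=w)
```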
We refer to our approach as DSRL: Diffusion Steering via Reinforcement Learning. Notably, DSRL:
DSRL enables highly sample-efficient adaptation of real-world diffusion policies for robot control, and achieves state-of-the-art performance on simulated benchmarks.
Performance of a single-task diffusion policy trained on 10 demonstrations before and after DSRL adaptation.
Diffusion policy with standard noise sampling
Uncut DSRL training timelapse
Diffusion policy with DSRL-learned noise policy
Performance of a multi-task diffusion policy trained on the BridgeData V2 dataset before and after DSRL adaptation.
Diffusion policy with standard noise sampling
Uncut DSRL training timelapse
Diffusion policy with DSRL-learned noise policy
Diffusion policy with standard noise sampling
Diffusion policy with DSRL-learned noise policy
Diffusion policy with standard noise sampling
Diffusion policy with DSRL-learned noise policy
Performance of the pretrained generalist policy \(\pi_0\) before and after DSRL adaptation, using the publicly available DROID weights for \(\pi_0\).
\(\pi_0\) zero-shot
Uncut DSRL training timelapse
\(\pi_0\) with DSRL-learned noise policy
\(\pi_0\) zero-shot
Uncut DSRL training timelapse
\(\pi_0\) with DSRL-learned noise policy
DSRL
RLPD
We compare the exploration behavior of DSRL to that of a traditional RL algorithm, in this case RLPD. Each clip shows the first 9 episodes of online RL for each approach. The behaviors induced by RLPD are effectively random and provide no meaningful learning signal. In contrast, DSRL plays behaviors that are "reasonable" from the start, attempting to pick up the mushroom on nearly every episode and providing much more useful data for learning.
While in principle DSRL can be instantiated with virtually any RL algorithm, we introduce an approach which makes particular use of the diffusion policy's structure to increase sample efficiency and incorporate offline data.
DSRL treats the noise \(\boldsymbol{w}\) as an action: if we select noise \(\boldsymbol{w}\) and denoise it through the diffusion policy \(\pi_{\mathrm{dp}}\) to produce an action \(\boldsymbol{a}\), then instead of the transition \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\) we train on the transition \((\boldsymbol{s}, \boldsymbol{w}, \boldsymbol{r}, \boldsymbol{s}')\).
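Concretely, a generic instantiation only changes what gets written to the replay buffer, so any off-the-shelf RL algorithm can then be run over the relabeled tuples. This is a sketch assuming a gymnasium-style `env` and the illustrative interfaces from the earlier snippet.

```python
def collect_noise_mdp_step(env, obs, noise_policy, diffusion_policy, replay_buffer):
    """One environment step in the latent-noise MDP (illustrative sketch).

    The environment is still driven by the denoised action a, but the replay
    buffer stores the noise w as if it were the action, yielding
    (s, w, r, s') transitions for a standard RL algorithm.
    """
    w = noise_policy(obs)                              # "action" in noise space
    a = diffusion_policy.denoise(obs, init_noise=w)    # actual robot action
    next_obs, reward, terminated, truncated, info = env.step(a)
    replay_buffer.add(obs, w, reward, next_obs, terminated)
    return next_obs, terminated or truncated
```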
There may, however, exist \(\boldsymbol{w}'\neq \boldsymbol{w}\) such that, when denoised, \(\boldsymbol{w}'\) produces the same action \(\boldsymbol{a}\) as \(\boldsymbol{w}\):
Thus, in principle, we can infer that \(\boldsymbol{w}'\) has the same behavior in our environment as \(\boldsymbol{w}\) without actually playing it. Naively applying an RL algorithm to transitions
\((\boldsymbol{s}, \boldsymbol{w}, \boldsymbol{r}, \boldsymbol{s}')\) ignores this potential aliasing of noise actions, however.
Furthermore, it is unclear how to make use of offline data, where we only have transitions of the form \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\): since we do not know which \(\boldsymbol{w}\) produced \(\boldsymbol{a}\), we cannot directly use this data to learn a noise-space policy.
To exploit the noise-aliasing nature of the diffusion policy and enable the use of offline data, we propose the following algorithm:
Here we train one \(Q\)-function on the original action space using only transitions of the form \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\), and then train a second \(Q\)-function on the noise space via distillation: we use the diffusion policy \(\pi_{\mathrm{dp}}\) to generate \((\boldsymbol{a},\boldsymbol{w})\) pairs and train the noise-space \(Q\)-function at \(\boldsymbol{w}\) to match the value of the action-space \(Q\)-function at \(\boldsymbol{a}\). This enables more sample-efficient, fully off-policy learning, and naturally incorporates offline data.
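A compressed sketch of this two-critic update is shown below in PyTorch-style code; the module and optimizer interfaces, the random sampling of distillation noises, and the exact actor update are assumptions for illustration, not the paper's released implementation (target-network updates and entropy terms are also omitted).

```python
import torch
import torch.nn.functional as F

def dsrl_na_update(batch, q_a, q_a_target, q_w, noise_policy,
                   diffusion_policy, optimizers, gamma=0.99, noise_dim=64):
    """One update of the two-critic scheme described above (sketch).

    q_a        : critic over the original action space, Q_A(s, a)
    q_a_target : target copy of q_a used for TD bootstrapping
    q_w        : critic over the latent-noise space, Q_W(s, w)
    """
    s, a, r, s_next, done = batch          # (s, a, r, s') transitions, online or offline
    q_a_opt, q_w_opt, pi_opt = optimizers

    # 1) Standard TD update of Q_A. The bootstrap action is produced by
    #    steering the frozen diffusion policy with the current noise policy.
    with torch.no_grad():
        w_next = noise_policy(s_next)
        a_next = diffusion_policy.denoise(s_next, init_noise=w_next)
        td_target = r + gamma * (1.0 - done) * q_a_target(s_next, a_next)
    q_a_loss = F.mse_loss(q_a(s, a), td_target)
    q_a_opt.zero_grad(); q_a_loss.backward(); q_a_opt.step()

    # 2) Distill Q_A into Q_W: generate (w, a) pairs with the diffusion policy
    #    and regress Q_W(s, w) onto Q_A(s, a). Noises that alias to the same
    #    action automatically receive the same value.
    w = torch.randn(s.shape[0], noise_dim)
    with torch.no_grad():
        a_dp = diffusion_policy.denoise(s, init_noise=w)
        distill_target = q_a(s, a_dp)
    q_w_loss = F.mse_loss(q_w(s, w), distill_target)
    q_w_opt.zero_grad(); q_w_loss.backward(); q_w_opt.step()

    # 3) Actor step: the noise policy maximizes the noise-space critic.
    #    Gradients never flow through the diffusion policy itself.
    pi_loss = -q_w(s, noise_policy(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Because the first critic is learned over the original action space, demonstration and offline transitions plug in directly, and the distillation step transfers their values into the noise space where the steering policy is optimized.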
Online performance of DSRL compared to state-of-the-art methods for RL with diffusion policies on the OpenAI Gym and Robomimic benchmarks.
Offline performance of DSRL compared to state-of-the-art methods for offline RL, aggregated across 10 tasks from the OGBench benchmark.
Offline-to-online performance of DSRL compared to state-of-the-art offline-to-online RL methods on the Robomimic benchmark.
\(\pi_0\) zero-shot
\(\pi_0\) after DSRL adaptation
Online performance of DSRL steering \(\pi_0\) on the Libero benchmark and a simulated bimanual Aloha setup, compared to several existing approaches for online adaptation of generalist policies.
@article{wagenmaker2025steering,
author = {Wagenmaker, Andrew and Nakamoto, Mitsuhiko and Zhang, Yunchu and Park, Seohong and Yagoub, Waleed and Nagabandi, Anusha and Gupta, Abhishek and Levine, Sergey},
title = {Steering Your Diffusion Policy with Latent Space Reinforcement Learning},
journal = {arXiv},
year = {2025},
}