We propose DSRL: steering diffusion policies to desired behaviors by running reinforcement learning over their latent-noise space. DSRL is highly sample-efficient, enables real-world improvement of diffusion policies for robotic control, and achieves state-of-the-art performance on simulated benchmarks.
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior—an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies—a state-of-the-art BC methodology—we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
Standard deployment of a BC-trained diffusion policy \(\pi_{\mathrm{dp}}\) first samples noise \(\boldsymbol{w} \sim \mathcal{N}(0,I)\) that is then denoised through the reverse diffusion process to produce an action \(\boldsymbol{a}\). We propose modifying the initial distribution of \(\boldsymbol{w}\) with an RL-trained latent-noise space policy \(\pi^{\mathcal{W}}\) that, instead of choosing \(\boldsymbol{w} \sim \mathcal{N}(0,I)\), chooses \(\boldsymbol{w}\) to steer the distribution of actions produced by \(\pi_{\mathrm{dp}}\) in a desirable way:
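In pseudocode, the change at deployment time is only in how the initial noise is chosen. The sketch below is illustrative: the `noise_policy` and `diffusion_policy.denoise` interfaces are assumptions for exposition, not the released API.

```python
import torch

def standard_act(obs, diffusion_policy, noise_dim):
    """Standard deployment: sample Gaussian noise and denoise it into an action."""
    w = torch.randn(noise_dim)                      # w ~ N(0, I)
    return diffusion_policy.denoise(obs, init_noise=w)

def dsrl_act(obs, diffusion_policy, noise_policy):
    """DSRL deployment: an RL-trained noise-space policy picks w instead.

    The diffusion policy's weights are untouched; only black-box access to
    its denoising process is required.
    """
    w = noise_policy(obs)                           # steered latent noise
    return diffusion_policy.denoise(obs, init_noise=w)
```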
We refer to our approach as DSRL: Diffusion Steering via Reinforcement Learning. Notably, DSRL:
DSRL enables highly sample-efficient adaptation of real-world diffusion policies for robot control, and achieves state-of-the-art performance on simulated benchmarks.
Performance of a single-task diffusion policy trained on 10 demonstrations before and after DSRL adaptation.
Diffusion policy with standard noise sampling
Uncut DSRL training timelapse
Diffusion policy with DSRL-learned noise policy
Performance of a multi-task diffusion policy trained on the BridgeData V2 dataset before and after DSRL adaptation.
Diffusion policy with standard noise sampling
Uncut DSRL training timelapse
Diffusion policy with DSRL-learned noise policy
Diffusion policy with standard noise sampling
Diffusion policy with DSRL-learned noise policy
Diffusion policy with standard noise sampling
Diffusion policy with DSRL-learned noise policy
Performance of the pretrained generalist policy \(\pi_0\) before and after DSRL adaptation, using the publicly available DROID weights for \(\pi_0\).
\(\pi_0\) zero-shot
Uncut DSRL training timelapse
\(\pi_0\) with DSRL-learned noise policy
\(\pi_0\) zero-shot
Uncut DSRL training timelapse
\(\pi_0\) with DSRL-learned noise policy
DSRL
RLPD
We compare the exploration behavior of DSRL to that of a traditional RL algorithm, in this case RLPD. Each clip shows the first 9 episodes of online RL for each approach. The behaviors induced by RLPD are effectively random and provide no meaningful learning signal. In contrast, DSRL plays behaviors that are "reasonable" from the start, attempting to pick up the mushroom on nearly every episode and providing much more useful data for learning.
While in principle DSRL can be instantiated with virtually any RL algorithm, we introduce an approach which makes particular use of the diffusion policy's structure to increase sample efficiency and incorporate offline data.
DSRL treats the noise \(\boldsymbol{w}\) as an action: if we select noise \(\boldsymbol{w}\) and denoise it through the diffusion policy \(\pi_{\mathrm{dp}}\) to produce an action \(\boldsymbol{a}\), then instead of the transition \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\) we train on the transition \((\boldsymbol{s}, \boldsymbol{w}, \boldsymbol{r}, \boldsymbol{s}')\).
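Concretely, a generic instantiation only changes what gets written to the replay buffer, so any off-the-shelf RL algorithm can then be run over the relabeled tuples. This is a sketch assuming a gymnasium-style `env` and the illustrative interfaces from the earlier snippet.

```python
def collect_noise_mdp_step(env, obs, noise_policy, diffusion_policy, replay_buffer):
    """One environment step in the latent-noise MDP (illustrative sketch).

    The environment is still driven by the denoised action a, but the replay
    buffer stores the noise w as if it were the action, yielding
    (s, w, r, s') transitions for a standard RL algorithm.
    """
    w = noise_policy(obs)                              # "action" in noise space
    a = diffusion_policy.denoise(obs, init_noise=w)    # actual robot action
    next_obs, reward, terminated, truncated, info = env.step(a)
    replay_buffer.add(obs, w, reward, next_obs, terminated)
    return next_obs, terminated or truncated
```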
There may, however, exist \(\boldsymbol{w}'\neq \boldsymbol{w}\) such that, when denoised, \(\boldsymbol{w}'\) produces the same action \(\boldsymbol{a}\) as \(\boldsymbol{w}\):
Thus, in principle, we can infer that \(\boldsymbol{w}'\) has the same behavior in our environment as \(\boldsymbol{w}\) without actually playing it. Naively applying an RL algorithm to transitions
\((\boldsymbol{s}, \boldsymbol{w}, \boldsymbol{r}, \boldsymbol{s}')\) ignores this potential aliasing of noise actions, however.
Furthermore, it is unclear how to make use of offline data, where we only have transitions of the form \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\): since we do not know which \(\boldsymbol{w}\) produced \(\boldsymbol{a}\), we cannot directly use this data to learn a noise-space policy.
To exploit the noise-aliasing nature of the diffusion policy and enable the use of offline data, we propose the following algorithm:
Here we train one \(Q\)-function on the original action space using only transitions of the form \((\boldsymbol{s}, \boldsymbol{a}, \boldsymbol{r}, \boldsymbol{s}')\), and then train a second \(Q\)-function on the noise space via distillation: we use the diffusion policy \(\pi_{\mathrm{dp}}\) to generate \((\boldsymbol{a},\boldsymbol{w})\) pairs and train the noise-space \(Q\)-function at \(\boldsymbol{w}\) to match the value of the action-space \(Q\)-function at \(\boldsymbol{a}\). This enables more sample-efficient, fully off-policy learning, and naturally incorporates offline data.
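A compressed sketch of this two-critic update is shown below in PyTorch-style code; the module and optimizer interfaces, the random sampling of distillation noises, and the exact actor update are assumptions for illustration, not the paper's released implementation (target-network updates and entropy terms are also omitted).

```python
import torch
import torch.nn.functional as F

def dsrl_na_update(batch, q_a, q_a_target, q_w, noise_policy,
                   diffusion_policy, optimizers, gamma=0.99, noise_dim=64):
    """One update of the two-critic scheme described above (sketch).

    q_a        : critic over the original action space, Q_A(s, a)
    q_a_target : target copy of q_a used for TD bootstrapping
    q_w        : critic over the latent-noise space, Q_W(s, w)
    """
    s, a, r, s_next, done = batch          # (s, a, r, s') transitions, online or offline
    q_a_opt, q_w_opt, pi_opt = optimizers

    # 1) Standard TD update of Q_A. The bootstrap action is produced by
    #    steering the frozen diffusion policy with the current noise policy.
    with torch.no_grad():
        w_next = noise_policy(s_next)
        a_next = diffusion_policy.denoise(s_next, init_noise=w_next)
        td_target = r + gamma * (1.0 - done) * q_a_target(s_next, a_next)
    q_a_loss = F.mse_loss(q_a(s, a), td_target)
    q_a_opt.zero_grad(); q_a_loss.backward(); q_a_opt.step()

    # 2) Distill Q_A into Q_W: generate (w, a) pairs with the diffusion policy
    #    and regress Q_W(s, w) onto Q_A(s, a). Noises that alias to the same
    #    action automatically receive the same value.
    w = torch.randn(s.shape[0], noise_dim)
    with torch.no_grad():
        a_dp = diffusion_policy.denoise(s, init_noise=w)
        distill_target = q_a(s, a_dp)
    q_w_loss = F.mse_loss(q_w(s, w), distill_target)
    q_w_opt.zero_grad(); q_w_loss.backward(); q_w_opt.step()

    # 3) Actor step: the noise policy maximizes the noise-space critic.
    #    Gradients never flow through the diffusion policy itself.
    pi_loss = -q_w(s, noise_policy(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Because the first critic is learned over the original action space, demonstration and offline transitions plug in directly, and the distillation step transfers their values into the noise space where the steering policy is optimized.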
Online performance of DSRL compared to state-of-the-art methods for RL with diffusion policies on the OpenAI Gym and Robomimic benchmarks.
Offline performance of DSRL compared to state-of-the-art methods for offline RL, aggregated across 10 tasks from the OGBench benchmark.
Offline-to-online performance of DSRL compared to state-of-the-art offline-to-online RL methods on the Robomimic benchmark.
\(\pi_0\) zero-shot
\(\pi_0\) after DSRL adaptation
Online performance of DSRL steering \(\pi_0\) on the Libero benchmark and a simulated bimanual Aloha setup, compared to several existing approaches for online adaptation of generalist policies.
@article{wagenmaker2025steering,
author = {Wagenmaker, Andrew and Nakamoto, Mitsuhiko and Zhang, Yunchu and Park, Seohong and Yagoub, Waleed and Nagabandi, Anusha and Gupta, Abhishek and Levine, Sergey},
title = {Steering Your Diffusion Policy with Latent Space Reinforcement Learning},
journal = {arXiv},
year = {2025},
}