FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

Abstract

Real-world reinforcement learning for robotic manipulation remains challenging, and this difficulty is amplified for flow matching policies: applying policy gradient methods to these policies is fundamentally limited by the need to backpropagate through time (BPTT) along the multi-step ODE that maps noise to actions, which is computationally prohibitive and numerically fragile. We propose FlowDPG, a DDPG-style method specifically designed for flow matching policies that distills the critic gradient into the velocity field at training time, bypassing BPTT entirely. Intuitively, FlowDPG combines two complementary vectors: the demonstration-driven velocity that keeps the action feasible, and the critic-driven correction that steers it toward higher value. Our contributions are threefold: (1) a BPTT-free distillation framework that enables stable DDPG-style policy improvement on flow matching policies, (2) a formal connection between the FlowDPG update direction and vanilla Deterministic Policy Gradient via three explicit approximations, and (3) real-world validation on a long-horizon, multi-stage, dual-arm AirPods assembly task, where FlowDPG attains a 92% end-to-end success rate, substantially outperforming recent RL methods spanning value-conditioning, auxiliary-module adaptation, and adjoint-based critic-gradient approaches.

Method Overview

Classical flow matching transports noise $\epsilon$ to a demonstration-feasible action $a$ via the velocity field $v_\theta$, but it cannot explore actions outside the demonstration distribution. FlowDPG adds a critic-driven correction along $\nabla_a Q(s, a)$ to reach a value-improved target $a^*$, then distills the resulting velocity $v'_\theta$ back into the flow field. Through this process, the policy can explore new actions with high value.
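The overview above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Euler sampler, the step size `step_size`, the straight-line interpolation path, and the toy functions `toy_velocity`, `toy_q`, and `toy_grad_q` are all assumptions introduced here for illustration. The sketch samples an action by integrating the flow, shifts it along $\nabla_a Q(s, a)$ to obtain $a^*$, and then forms a regression loss on the velocity field with $a^*$ held constant, so no gradient flows back through the ODE solve.

```python
import numpy as np


def euler_sample(velocity, state, noise, num_steps=10):
    """Integrate dx/dt = v(x, t, s) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt, state)
    return x


def corrected_target(action, state, grad_q, step_size=0.1):
    """Assumed form of the value-improved target: a* = a + step_size * grad_a Q(s, a)."""
    return action + step_size * grad_q(state, action)


def distillation_loss(velocity, state, noise, target, t):
    """Regress the velocity field toward a* (assumed straight path, t in [0, 1)).

    On x_t = (1 - t) * noise + t * a*, the target velocity is a* - noise.
    `target` is a plain array, so it acts as a constant in this loss.
    """
    x_t = (1.0 - t) * noise + t * target
    residual = velocity(x_t, t, state) - (target - noise)
    return float(np.mean(residual ** 2))


# Hypothetical toy stand-ins for the learned networks.
def toy_velocity(x, t, state):
    # Straight-line field whose flow ends at a "demonstration" action equal to `state`.
    return (state - x) / (1.0 - t)


def toy_q(state, action):
    # Quadratic critic whose maximum lies at state + 1, away from the demonstration.
    return -float(np.sum((action - (state + 1.0)) ** 2))


def toy_grad_q(state, action):
    # Analytic gradient of toy_q with respect to the action.
    return -2.0 * (action - (state + 1.0))


rng = np.random.default_rng(0)
state = np.array([0.5, -0.5])
noise = rng.standard_normal(2)

action = euler_sample(toy_velocity, state, noise)       # demonstration-feasible action a
target = corrected_target(action, state, toy_grad_q)    # value-improved target a*
loss = distillation_loss(toy_velocity, state, noise, target, t=0.5)
```

In this toy setup the sampled action coincides with the demonstration action, the corrected target has a higher critic value, and the loss is nonzero because the current field still points at the uncorrected action. In a real training loop, `toy_velocity` would be a neural network and the loss would be minimized with an automatic-differentiation framework.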

Assembling 25 AirPod Cases in 1 Hour Uncut

The policy trained by FlowDPG autonomously assembles 25 AirPods cases (50 AirPods) in a row without human intervention. It recovers from dropped grasps by re-grasping, corrects misalignments with fine adjustments, and removes an incorrectly inserted AirPod before retrying the insertion.

Disturbance Robustness

The policy trained by FlowDPG is robust to a wide range of disturbances, including a human removing an already grasped case or AirPod, closing an already opened AirPod case, reshuffling the AirPod layout in the tray, and dislodging an already inserted AirPod.

Recovery from Unseen Failures

The left video shows that a policy trained only via behavior cloning on the original dataset, or via value-conditioned reweighting of different data points without exploring actions outside the dataset, cannot recover from an unseen failure. By contrast, the right video shows that by exploring higher-value actions during training, FlowDPG produces emergent re-grasp behaviors that enable recovery from such failures.

Robust Insertion Beyond Demonstrations

The left video shows the policy repeatedly alternating between thinking and action, adjusting the insertion angle multiple times before succeeding. The right video shows that even after accidentally grasping two AirPods at once, the policy can still insert one of them, a situation that is entirely out of distribution. These examples demonstrate that the trained policy can handle such corner cases.