In the rapidly evolving world of robotic imitation learning, we are constantly trying to bridge the gap between how a robot “thinks” (computation) and how it “acts” (execution).
For the past few years, this field has been dominated by Diffusion Policies. These models are incredibly powerful; they can learn complex, multi-modal behaviors, like knowing that to get around an obstacle you can go left or right, but not straight through the middle. However, they come with a significant cost: latency.
Imagine driving a car, but every few seconds you have to close your eyes, calculate your next ten seconds of steering, and only then open your eyes to execute the first second of that plan. This is essentially how conventional diffusion and flow-matching policies work. They generate a whole “chunk” of future actions from scratch (noise) before the robot moves a muscle. This “stop-and-think” approach introduces lag, making tight, reactive control difficult.
In this post, we are diving deep into a new paper titled “Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories.” This research proposes a brilliant simplification: instead of generating a whole trajectory in a latent “diffusion time,” why not generate actions in real-time and stream them to the robot immediately?
Let’s unpack how this works, why it matters, and the mathematics that make it possible.
The Problem: The “Trajectory of Trajectories”
To understand the innovation here, we first need to look at the current state of the art.
Conventional Diffusion Policies and Flow-Matching Policies treat robot control as a generative modeling problem. They take a history of observations (what the robot sees) and try to generate a sequence of future actions (an action chunk).
The standard process looks like this:
- Start with a sequence of pure Gaussian noise.
- Iteratively “denoise” this sequence over \(N\) refinement steps.
- Each step refines the entire trajectory.
- Only after the \(N\)-th step is finished do you have a usable action plan.
The authors describe this as sampling a “trajectory of trajectories.” The model generates a diffusion trajectory where every point on that path is itself a full action trajectory.

As shown in Figure 1(a) above, the robot sits idle while the computation climbs the vertical axis (diffusion/flow time). This creates a bottleneck: if your control loop needs to run at 100 Hz but your diffusion model takes 50 ms per inference, you have a problem.
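To make the latency concrete, here is a minimal sketch of this inference pattern in Python. The `denoise_step` function, the shapes, and the step count are placeholders for illustration, not the actual implementation from any of these papers:

```python
import numpy as np

# Illustrative shapes: a chunk of T future actions, each A-dimensional.
T, A, N_STEPS = 16, 7, 100

def denoise_step(chunk, step, obs):
    # Placeholder for one reverse-diffusion / flow-matching refinement step;
    # a real model would return a slightly less noisy chunk.
    return chunk

def generate_action_chunk(obs):
    chunk = np.random.randn(T, A)     # start from pure Gaussian noise
    for step in range(N_STEPS):       # refine the *entire* trajectory N times
        chunk = denoise_step(chunk, step, obs)
    return chunk                      # only now is any action executable

chunk = generate_action_chunk(obs=None)
```

The robot waits through all `N_STEPS` of computation before even the first action of the chunk can be sent to the controller.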
The Solution: Streaming Flow Policy (SFP)
The researchers propose a paradigm shift. Instead of treating the generation process (flow time) and the robot’s movement (execution time) as separate dimensions, Streaming Flow Policy (SFP) aligns them.
As seen in Figure 1(b), the algorithm starts from the last known action (or the current state) and incrementally integrates a velocity field to find the next action. This means as soon as the first step of computation is done, the first action is ready to be sent to the robot.
There are three major conceptual changes here:
- Initialization: Instead of starting from pure noise, start from a narrow Gaussian around the last executed action.
- Space: Instead of learning a flow over the space of trajectories (\(\mathcal{A}^T\)), learn a flow over the action space (\(\mathcal{A}\)).
- Streaming: Execute actions on-the-fly during the sampling process.
This approach transforms the complex “denoising a whole path” problem into a simpler “where do I go next?” problem, solvable with a neural ordinary differential equation (ODE).
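Here is a rough sketch of the first two changes, with assumed shapes and a made-up noise scale; the streaming loop itself is sketched in the integration section below:

```python
import numpy as np

T, A = 16, 7        # illustrative chunk length and action dimension
sigma0 = 0.05       # assumed width of the narrow initial Gaussian

# Diffusion/flow-matching policies initialize an entire chunk from pure noise,
# i.e. a point in trajectory space A^T ...
chunk_init = np.random.randn(T, A)

# ... whereas Streaming Flow Policy initializes a single action in action
# space A, close to the last executed command.
last_action = np.zeros(A)   # placeholder for the most recent command
a0 = last_action + sigma0 * np.random.randn(A)
```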
Deep Dive: The Core Method
This is the heart of the paper. How do we actually train a network to stream actions? The method relies on Flow Matching, a technique for learning continuous dynamics.
1. Constructing the Target Flow
To train a neural network to guide the robot, we first need to define what the “ideal” flow looks like. In standard flow matching, you define a probability path that turns noise into data. Here, the authors construct a Conditional Flow.
Given a demonstration trajectory \(\xi\) (the ground truth path the robot should take), the authors mathematically construct a vector field that creates a “tube” around this path.

Look at Figure 2(b) above. The white arrows represent the constructed target velocity field. If you are slightly off-track (off the red or blue line), the velocity field pushes you back toward the demonstration.
The math behind this constructed flow is elegant. It combines two forces:
- Trajectory Velocity: The speed and direction of the demonstration itself (\(\dot{\xi}(t)\)).
- Stabilization Term: A correction force that pulls the state back to the demonstration if it drifts away.
The equation for this target velocity \(v_{\xi}\) is:
\[
v_{\xi}(a, t) \;=\; \dot{\xi}(t) \;+\; k\,\bigl(\xi(t) - a\bigr)
\]
Here, \(k\) is a stabilizing gain. This is similar to a PD controller. It ensures that the flow doesn’t just “go with the flow” but actively tries to stay on the path. This stabilization is crucial because it reduces distribution shift—a common killer in imitation learning where small errors compound over time until the robot fails.
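As a sanity check, here is a small sketch of this constructed target using the gain form described above. The gain value and the toy demonstration are made up for illustration:

```python
import numpy as np

K_GAIN = 5.0   # stabilizing gain k (value chosen arbitrarily for illustration)

def target_velocity(a, t, xi, xi_dot):
    """Constructed target velocity v_xi(a, t) around one demonstration xi.

    a      : current action, shape (A,)
    t      : time in [0, 1]
    xi     : callable returning the demonstrated action xi(t)
    xi_dot : callable returning the demonstrated velocity d(xi)/dt at t
    """
    # Feed-forward demonstration velocity plus a pull back toward the path.
    return xi_dot(t) + K_GAIN * (xi(t) - a)

# Toy 1-D demonstration: move linearly from 0 to 1 as t goes from 0 to 1.
xi = lambda t: np.array([t])
xi_dot = lambda t: np.array([1.0])

# Slightly off-track at t = 0.5 (a = 0.3 while xi(0.5) = 0.5): the stabilization
# term adds k * 0.2 = 1.0 on top of the feed-forward velocity of 1.0.
print(target_velocity(np.array([0.3]), 0.5, xi, xi_dot))   # -> [2.]
```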
2. The Integration Process
Once we have this target, we train a neural network \(v_{\theta}(a, t \mid h)\) to predict this velocity given the current action \(a\), the time \(t\), and the observation history \(h\).
At inference time (when the robot is actually running), we don’t have the ground truth trajectory. We rely on the network. We start at the robot’s current position (or last command) and “integrate” the network forward.
The mathematical formulation for generating the trajectory is:
\[
a(t) \;=\; a(0) \;+\; \int_{0}^{t} v_{\theta}\bigl(a(s),\, s \mid h\bigr)\, ds,
\qquad a(0) \sim \mathcal{N}\bigl(a_{0},\, \sigma_{0}^{2} I\bigr),
\]
where \(a_{0}\) is the last executed action and \(\sigma_{0}\) is small.
In practice, this integration is done using a numerical solver (like Euler’s method). The beauty of SFP is that you can execute the result of the integral at \(t=0.1\) immediately, without waiting to solve for \(t=1.0\).
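Here is a minimal sketch of that streaming loop with Euler integration, assuming a trained velocity network `v_theta` and a placeholder `send_to_robot` controller interface:

```python
import numpy as np

A_DIM = 7                        # illustrative action dimension
N_STEPS = 10                     # Euler steps over flow time t in [0, 1]
DT = 1.0 / N_STEPS

def v_theta(a, t, history):
    return np.zeros_like(a)      # placeholder for the trained velocity network

def send_to_robot(a):
    pass                         # placeholder for the real-time controller interface

def stream_actions(last_action, history, sigma0=0.05):
    a = last_action + sigma0 * np.random.randn(A_DIM)   # narrow Gaussian init
    t = 0.0
    for _ in range(N_STEPS):
        a = a + DT * v_theta(a, t, history)   # one Euler step of the neural ODE
        t += DT
        send_to_robot(a)         # execute immediately; no need to wait for t = 1.0
    return a                     # becomes `last_action` for the next cycle

stream_actions(np.zeros(A_DIM), history=None)
```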
3. Handling Multi-Modality
One of the biggest strengths of Diffusion Policy is its ability to model multi-modal distributions (e.g., passing a cup to the left or right hand). Standard regression policies fail here—they average the two modes and try to pass the cup to the “middle” (which might be empty space or an obstacle).
You might wonder: If SFP starts from a single point (the current state) and follows a deterministic vector field, how can it be multi-modal?
The magic lies in Flow Matching. The training loss asks the network to match the constructed conditional velocities, in expectation over demonstrations, times, and actions sampled around them:
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{\xi,\, t,\, a}\,\bigl\|\, v_{\theta}(a, t \mid h) - v_{\xi}(a, t) \,\bigr\|^{2}
\]
Because the target flow is constructed from all demonstrations, the learned network \(v_{\theta}\) learns a marginal velocity field. As shown in Figure 2(c), the learned field (the grey/black arrows) creates a bifurcation.
If the robot is slightly to the left, the flow catches the “left mode” current. If it’s slightly to the right, it catches the “right mode.” Even though the policy can be run deterministically at test time (by setting initial noise to zero), the training process ensures the vector field preserves the distinct modes present in the data.
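For intuition, here is a rough sketch of one conditional flow-matching training sample, reusing the toy target from earlier; the tube width and gain are assumptions:

```python
import numpy as np

SIGMA = 0.05    # assumed width of the Gaussian "tube" around each demonstration
K_GAIN = 5.0    # stabilizing gain, as in the target-velocity sketch above

def cfm_loss_sample(v_theta, xi, xi_dot, history):
    """Squared error against the conditional target for one random (t, a) sample."""
    t = np.random.rand()                                 # sample a time in [0, 1]
    a = xi(t) + SIGMA * np.random.randn(*xi(t).shape)    # sample inside the tube
    target = xi_dot(t) + K_GAIN * (xi(t) - a)            # conditional target velocity
    pred = v_theta(a, t, history)
    return np.sum((pred - target) ** 2)

# Averaging this loss over every demonstration in the dataset makes the network
# regress toward the *marginal* velocity field, which is what lets a single
# deterministic field preserve distinct modes.
v_theta = lambda a, t, h: np.zeros_like(a)   # placeholder network
xi = lambda t: np.array([t])                 # toy demonstration from earlier
xi_dot = lambda t: np.array([1.0])
print(cfm_loss_sample(v_theta, xi, xi_dot, history=None))
```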
4. The Limitations of Marginal Matching
It is important to be academically honest about what this method cannot do. While SFP matches the marginal distribution of actions at each timestep, it does not necessarily guarantee the correct joint distribution across time.
The authors provide a fascinating counter-intuitive example to illustrate this.

In Figure 4, the training data contains “S” shapes and “2” shapes (Panel a). These trajectories cross each other in the middle. Because SFP learns a velocity field that depends only on the current action and time \((a, t)\), it must assign a single velocity at any point where the trajectories intersect at the same time, blending the demonstrations that pass through it.
Panel (d) shows the result: the model creates “E” shapes and “3” shapes. It enters the intersection on one trajectory and exits on the other.
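This behavior follows directly from the flow-matching objective: the best the network can do at a point \((a, t)\) is to output the average of the conditional velocities of the demonstrations passing through that point, so branch identity is lost at a crossing. As a sketch,
\[
v_{\theta}^{*}(a, t \mid h) \;=\; \mathbb{E}_{\xi \,\sim\, p(\xi \,\mid\, a_t = a,\, h)}\bigl[\, v_{\xi}(a, t) \bigr].
\]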
- The Good News: At any specific timestamp, the robot is in a valid state found in the training data (Marginal distribution matches).
- The Bad News: The overall shape (Joint distribution) is a mix-and-match of training samples.
However, the authors argue that for robotics, this “compositionality” is often a feature, not a bug. If both paths are valid, switching between them might be acceptable.
Experimental Results
So, does it work on actual robots?
The authors compared Streaming Flow Policy against standard Diffusion Policy (DP), a faster version (DDIM), and other baselines on benchmarks like Push-T (a 2D pushing task) and RoboMimic (simulated robot arm manipulation).
Latency and Success
The primary hypothesis was that SFP would be faster and more reactive. The data supports this strongly.
Looking at the Push-T results (Table 2 in the paper), SFP achieves 95.1% average success with state inputs, outperforming standard Diffusion Policy (92.9%) while running significantly faster.
Real-world experiments were conducted on a Franka Research 3 robot arm.

The authors note that the motion produced by SFP is noticeably smoother. Because diffusion policies output discrete “chunks” that are often executed open-loop, you can sometimes get jerky transitions between chunks. SFP generates a continuous stream, leading to fluid motion.
The Sweet Spot of “Chunking”
Even though SFP can stream continuously, in practice, it helps to predict a short horizon ahead. This is known as the “Action Chunking” horizon (\(T_{chunk}\)).
If \(T_{chunk}\) is too small, the robot is reacting purely to the now, which can be unstable if there are sensor delays or noise. If \(T_{chunk}\) is too large, the robot is essentially moving with its eyes closed for too long.

Figure 8 shows the performance relative to chunk size. There is a clear “sweet spot” around a chunk size of 8 to 16. SFP allows for this flexibility; you can integrate 8 steps, send them to the robot, and then recalculate, achieving a balance between long-term planning and high-frequency control.
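Putting the pieces together, here is a rough receding-horizon sketch that re-observes between chunks, assuming flow time \([0, 1]\) spans one chunk and using placeholder observation and control interfaces:

```python
import numpy as np

A_DIM = 7
T_CHUNK = 8                      # chunk size from the "sweet spot" discussion
DT = 1.0 / T_CHUNK               # assumption: flow time [0, 1] spans one chunk

def v_theta(a, t, history):
    return np.zeros_like(a)      # placeholder for the trained velocity network

def get_observation_history():
    return None                  # placeholder for the perception stack

def send_to_robot(a):
    pass                         # placeholder for the controller interface

def control_loop(last_action, n_cycles=3):
    for _ in range(n_cycles):
        history = get_observation_history()    # re-observe between chunks
        a, t = last_action.copy(), 0.0
        for _ in range(T_CHUNK):               # integrate a short horizon ...
            a = a + DT * v_theta(a, t, history)
            t += DT
            send_to_robot(a)                   # ... streaming each step as it is ready
        last_action = a                        # warm-start the next chunk

control_loop(np.zeros(A_DIM))
```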
Conclusion and Implications
Streaming Flow Policy represents a maturing of generative control in robotics. While Diffusion Policy proved that generative models could handle multi-modal behavior, SFP refines how we apply them.
By aligning the mathematical “flow time” with the physical “execution time,” the researchers have created a method that is:
- Fast: No waiting for a full denoising loop.
- Stable: Using control theory concepts (stabilization gain) to keep the neural network grounded.
- Simple: It re-uses existing architectures, just changing inputs/outputs and the loss function.
For students and researchers entering the field, this paper highlights an important lesson: simply throwing a powerful model (like Diffusion) at a problem isn’t the end of the road. By understanding the underlying dynamics—velocity fields, integration, and stability—we can re-engineer these models to fit the specific constraints of physical robots, resulting in behavior that is not just smart, but smooth and responsive.
The future of robot control isn’t just about generating better plans; it’s about generating them fast enough to catch a falling cup.