Introduction
Imagine you are reaching for a coffee mug on a cluttered desk. You don’t consciously hallucinate a photorealistic video of your hand moving, frame by frame, texture by texture, before you act. Instead, your brain operates on an intuitive, implicit level. It predicts the consequences of your movement—the spatial feeling of the grasp, the weight of the cup, the avoidance of the stapler—without needing to render every pixel of the scene in your mind’s eye.
This predictive capability is known as world modeling. In robotics, giving machines this ability is the “Holy Grail” for achieving dexterous, intelligent behavior.
Traditionally, researchers have tried to teach robots to model the world by having them predict future video frames. The logic is sound: if a robot can generate a video of itself successfully picking up a cup, it surely “understands” the physics and dynamics involved. However, this approach has a heavy cost. Generating high-fidelity video is computationally expensive and slow. Worse, it creates a conflict of interest during training: the model spends so much capacity trying to render the perfect texture of a wooden table (which is irrelevant to the task) that it loses focus on the actual robot arm dynamics.
Enter FLARE (Future LAtent REpresentation Alignment).
In a recent paper from NVIDIA and partners, researchers propose a new framework that gives robots the benefits of world modeling without the heavy burden of video generation. Instead of predicting pixels, FLARE predicts latent representations—compact, mathematical summaries of the future state.
In this post, we will deconstruct FLARE. We will explore how it modifies standard diffusion policies, why “action-aware” embeddings are the secret sauce, and how this method allows robots to learn from human videos without any action labels at all.

The Core Problem: Policy Learning vs. World Modeling
To understand why FLARE is necessary, we first need to look at the current state of Robot Learning.
The dominant paradigm in modern robotics is Visuomotor Policy Learning. You feed a robot current observations (images from cameras) and an instruction (e.g., “pick up the apple”), and the robot outputs an action (motor commands). Recently, Diffusion Policies and Flow-Matching models have taken over the field. These models learn to generate complex action sequences by refining random noise into precise trajectories.
While effective, these “Policy Only” methods (shown in the top half of Figure 1 above) are reactive. They map current sight to current action. They lack a mechanism to explicitly reason about the long-term consequences of their movements.
The alternative is World Model-based learning, where the robot predicts future observations \(O_{t+H}\). As mentioned, doing this in pixel space is inefficient. FLARE bridges this gap by aligning the policy’s internal state with the latent embedding of the future, rather than the raw pixels.
Background: Flow Matching for Control
Before diving into the architecture, let’s briefly establish the mathematical foundation: Flow Matching. This is the generative engine powering the robot’s actions.
In this setup, we have an action chunk \(A_t\) (a sequence of future actions) drawn from expert demonstrations. We add noise to this action based on a timestep \(\tau\) (ranging from 0 to 1). When \(\tau=0\), it’s pure noise; when \(\tau=1\), it’s the clean action.
The noisy action is defined as:

$$A_t^{\tau} = \tau\,A_t + (1 - \tau)\,\epsilon$$
Here, \(\epsilon\) is sampled noise. The goal of the neural network (the policy) is to predict the "velocity": the direction to move from the noise toward the clean action. The network, denoted \(V_\theta\), takes the current observation embedding \(\phi_t\), the noisy action \(A_t^\tau\), and the robot's proprioceptive state \(q_t\) as input.
The training objective (loss function) minimizes the difference between the network's prediction and the actual direction needed to reconstruct the action:

$$\mathcal{L}_{\text{fm}} = \mathbb{E}_{\tau,\,\epsilon}\Big[\big\lVert V_\theta(\phi_t, A_t^{\tau}, q_t) - (A_t - \epsilon)\big\rVert^2\Big]$$
At inference time (when the robot is actually running), the model starts with random noise and iteratively refines it using the predicted velocity to generate a smooth action trajectory:

$$A_t^{\tau + \Delta\tau} = A_t^{\tau} + \Delta\tau \cdot V_\theta(\phi_t, A_t^{\tau}, q_t), \qquad A_t^{0} \sim \mathcal{N}(0, I)$$
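To make this concrete, here is a minimal PyTorch-style sketch of the training step and the Euler-integration sampling loop described above. The `policy` callable, its argument order, the conditioning on the flow timestep \(\tau\), and the number of integration steps are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def flow_matching_loss(policy, obs_emb, proprio, actions):
    """One flow-matching training step (sketch).

    actions: clean expert action chunk A_t, shape (B, T, action_dim).
    """
    B = actions.shape[0]
    tau = torch.rand(B, 1, 1)                  # flow timestep in [0, 1]
    eps = torch.randn_like(actions)            # sampled noise
    noisy = tau * actions + (1 - tau) * eps    # A_t^tau: pure noise at tau=0, clean at tau=1
    target_velocity = actions - eps            # direction from noise toward the clean action
    pred_velocity = policy(obs_emb, noisy, proprio, tau)   # V_theta (here also conditioned on tau)
    return ((pred_velocity - target_velocity) ** 2).mean()

@torch.no_grad()
def sample_actions(policy, obs_emb, proprio, action_shape, steps=10):
    """Inference: start from random noise and Euler-integrate the predicted velocity."""
    a = torch.randn(action_shape)              # A_t^0: pure noise
    dt = 1.0 / steps
    for i in range(steps):
        tau = torch.full((action_shape[0], 1, 1), i * dt)
        a = a + dt * policy(obs_emb, a, proprio, tau)
    return a                                   # refined action chunk, approximately A_t^1
```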
The authors use a Diffusion Transformer (DiT) for this process. Now, let’s see how FLARE modifies this standard architecture.
The FLARE Method
FLARE introduces a “lightweight” modification to the standard Vision-Language-Action (VLA) architecture. The goal is to force the policy to “think” about the future while it is generating actions.
1. Architecture: Adding Future Tokens
The standard DiT policy processes a sequence of tokens. Usually, these tokens represent the current robot state and the noisy actions. FLARE adds a set of Learnable Future Tokens to this sequence.
Take a look at the architecture below:

Here is the flow of information:
- Inputs: The model takes the current observation (vision + text), the current joint states (\(q_t\)), and the noisy actions (\(A_t^\tau\)).
- Future Tokens: A set of \(M\) learnable tokens is appended to the input sequence.
- Processing: The DiT blocks process all tokens via self-attention. This means the “Future Tokens” can attend to the “Action Tokens” and vice versa. They share information.
- Extraction: At a specific intermediate layer \(L\), the activations corresponding to the Future Tokens are pulled out.
- Alignment: These extracted features are projected and compared against the actual embedding of the future observation (\(O_{t+H}\)).
By forcing these extra tokens to predict the future embedding, the gradients backpropagate through the whole network. This forces the entire model to learn internal representations that capture the dynamics of the environment.
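Below is an illustrative PyTorch skeleton of this token layout: \(M\) learnable future tokens are concatenated to the state and action tokens, everything interacts through self-attention, and the future-token activations are read out at an intermediate layer. The block type, dimensions, and layer indices are assumptions chosen for readability, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FlareStyleDiT(nn.Module):
    """Skeleton of a DiT-style policy with learnable future tokens (sketch).

    A plain TransformerEncoderLayer stands in for the real DiT block; text/vision
    conditioning and the flow-timestep embedding are omitted for brevity.
    """
    def __init__(self, dim=512, num_layers=8, num_future_tokens=32, align_layer=6):
        super().__init__()
        self.future_tokens = nn.Parameter(torch.randn(1, num_future_tokens, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        self.align_layer = align_layer           # layer whose future-token activations get aligned
        self.proj = nn.Linear(dim, dim)          # projection before comparing with g(O_{t+H})

    def forward(self, state_tokens, action_tokens):
        B, S = state_tokens.shape[0], state_tokens.shape[1]
        fut = self.future_tokens.expand(B, -1, -1)
        M = fut.shape[1]
        x = torch.cat([state_tokens, action_tokens, fut], dim=1)  # one joint sequence
        future_features = None
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)                         # self-attention: future and action tokens share information
            if i == self.align_layer:
                future_features = self.proj(x[:, -M:])   # extract future-token activations
        velocity = x[:, S:-M]                    # predicted velocity for the action tokens
        return velocity, future_features
```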
2. The Alignment Loss
The core innovation is the Future Latent Alignment Loss. Instead of reconstructing pixels, the model minimizes the cosine distance between its predicted future latent state and the ground-truth future latent state.
Let \(f_\theta\) be the policy's prediction of the future tokens, and \(g\) be a frozen encoder that produces the ground-truth embedding of the future image (\(\phi_{t+H}\)). The loss is:

$$\mathcal{L}_{\text{align}} = 1 - \cos\!\big(f_\theta,\; \phi_{t+H}\big), \qquad \phi_{t+H} = g(O_{t+H})$$
This implies that if the robot is about to pick up an apple, its internal “future tokens” should look mathematically similar to the embedding of an image where the apple has been picked up.
The final training objective combines the standard action generation (flow matching) with this new world-modeling term:

$$\mathcal{L} = \mathcal{L}_{\text{fm}} + \lambda\,\mathcal{L}_{\text{align}}$$
The authors found that a weighting coefficient of \(\lambda = 0.2\) works best. This ensures the world modeling objective supports the action learning without overpowering it.
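Here is a compact sketch of how the two terms might be combined in code, assuming the future encoder \(g\) is kept out of the gradient path and the alignment term is a token-averaged cosine distance; the function and argument names are illustrative, not taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def flare_total_loss(pred_velocity, target_velocity,
                     predicted_future, future_obs, future_encoder, lam=0.2):
    """Combined objective: flow matching + future latent alignment (sketch)."""
    # standard flow-matching term on the action tokens
    fm_loss = ((pred_velocity - target_velocity) ** 2).mean()

    # g(O_{t+H}): target embedding of the future observation, no gradients
    with torch.no_grad():
        target_embedding = future_encoder(future_obs)

    # 1 - cosine similarity, averaged over tokens and batch
    align_loss = (1 - F.cosine_similarity(predicted_future, target_embedding, dim=-1)).mean()

    return fm_loss + lam * align_loss
```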
3. The “Action-Aware” Embedding
A critical question remains: What model should we use to encode the future (\(g\))?
You might think to use a standard pre-trained model like CLIP or SigLIP. While FLARE supports this, the authors found that generic vision models capture too much irrelevant detail (like the color of the wall or lighting shadows).
For optimal performance, the robot needs to focus on task-relevant features: the relationship between the gripper and the object, or the geometry of the tool.
To achieve this, the authors pretrained a custom Action-Aware Vision-Language Embedding. They utilized a massive dataset of cross-embodiment robot data (over 2,000 hours):

They used a Q-former architecture (similar to BLIP-2). This module compresses visual features into a small set of tokens (32 tokens). Crucially, this embedding model was trained to predict actions. This forces the embeddings to discard background noise and retain only the information necessary for control.
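Conceptually, the embedder might look like the sketch below: a set of learnable queries cross-attends to patch features from a vision backbone, compressing the scene into 32 tokens, and during pretraining an action head forces those tokens to carry control-relevant information. All sizes and module choices here are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ActionAwareEmbedder(nn.Module):
    """Q-former-style action-aware embedding (illustrative sketch)."""
    def __init__(self, vis_dim=1024, dim=512, num_queries=32, action_dim=32, horizon=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # action head used only during pretraining: it makes the tokens action-aware
        self.action_head = nn.Linear(num_queries * dim, horizon * action_dim)

    def forward(self, visual_features):
        """visual_features: patch features from a vision backbone, shape (B, N, vis_dim)."""
        B = visual_features.shape[0]
        kv = self.vis_proj(visual_features)
        q = self.queries.expand(B, -1, -1)
        tokens, _ = self.cross_attn(q, kv, kv)   # compress the scene into 32 tokens
        return tokens                            # later reused as the targets phi_{t+H} in FLARE

    def predict_actions(self, visual_features):
        """Pretraining objective: regress actions from the compressed tokens."""
        tokens = self.forward(visual_features)
        return self.action_head(tokens.flatten(1))
```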

During the downstream training of FLARE, this embedding model acts as the “teacher” (the \(g\) function), providing the targets for the future latent alignment.
Experimental Results
The researchers evaluated FLARE on two rigorous benchmarks: RoboCasa (simulation) and real-world tasks with a GR1 Humanoid robot.
1. Simulation Benchmarks
RoboCasa involves complex kitchen tasks like opening doors, turning on stoves, and arranging objects. The GR1 simulation tasks focus on dexterous manipulation.

The results were decisive. FLARE significantly outperformed standard Diffusion Policies and even other world-model baselines like UWM (Unified World Models).

Key takeaways from Table 1:
- FLARE vs. Policy Only: On the hard “Pick and Place” tasks, FLARE jumps to 53.2% success compared to 43.8% for the policy-only baseline.
- FLARE vs. Diffusion Policy: The gap is even wider (53.2% vs 29.2%).
- Efficiency: FLARE achieves these gains without the massive computational overhead of generating pixels.
2. Real-World Humanoid Control
Simulation is useful, but the real test is physical hardware. The team tested FLARE on a GR1 humanoid robot performing table-top manipulation.

The results in the real world mirrored the simulation. FLARE showed a distinct advantage, particularly in data-limited regimes.

In the chart above (Right), notice the Average performance. FLARE achieves a 95.3% success rate across real-world tasks, whereas the standard policy lags at 81.2%.
Qualitative Difference: The authors noted that the baseline policy often acted “greedily,” trying to move directly to a target and knocking over obstacles (like a water bottle) in the process. The FLARE policy, because it predicts future states, implicitly “foresaw” the collision. It learned to maneuver around the bottle to reach the target, exhibiting safer and more intelligent behavior.

3. The “Superpower”: Learning from Action-Free Videos
Perhaps the most exciting capability of FLARE is its ability to learn from Human Egocentric Videos.
Collecting robot data is hard because you need to teleoperate the robot and record the precise actions (motor angles). Collecting video of a human doing a task is easy—just strap a GoPro to someone's head. However, human videos don't come with robot action labels (\(A_t\)), so the standard flow-matching objective can't use them.
FLARE changes the game. Because it has a world-modeling loss (the future latent alignment objective), it can train on human videos by optimizing only the alignment term. The model learns: "If I am in state A, the future state should look like B." It learns the dynamics of the task from humans, even without action labels.
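One plausible way to implement this mixing is sketched below: robot demonstrations contribute both loss terms, while action-free human clips contribute only the alignment term. The masking scheme and argument names are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def mixed_batch_loss(pred_velocity, target_velocity,
                     pred_future, target_future, has_actions, lam=0.2):
    """Sketch: robot demos get both losses; action-free human clips only align.

    has_actions: boolean mask over the batch, False for human egocentric clips.
    """
    # world-modeling term applies to every sample, labeled or not
    align = (1 - F.cosine_similarity(pred_future, target_future, dim=-1)).mean()

    # flow-matching term only where ground-truth actions exist
    if has_actions.any():
        fm = ((pred_velocity[has_actions] - target_velocity[has_actions]) ** 2).mean()
    else:
        fm = torch.zeros((), device=pred_future.device)

    return fm + lam * align
```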
The researchers tested this on Novel Objects that the robot had never seen before.

The results (shown in the bar chart above) are striking:
- Green Bar: FLARE trained with Human Videos + only 10 robot demos.
- Gray Bar: FLARE trained on only robot demos.
With just 10 robot demonstrations, adding human videos doubled the success rate (from ~40% to 80%). This suggests that FLARE can effectively transfer the “concept” of a task from human videos to robot control.

Why Does It Work? (Ablation Studies)
The paper includes detailed ablations to verify their design choices.
Does the specific embedding model matter? Yes. A standard SigLIP encoder already helps, but the custom "Action-Aware" embedding delivers the best performance (55.0% vs. 49.6%).

Where should we attach the loss? The DiT has multiple layers. If you force the alignment too early (Layer 4), performance drops. The model needs sufficient depth to process the action tokens before it can accurately predict the future. Layer 6 (out of 8) was found to be the sweet spot.

Handling Distribution Shift (EMA): Because the policy is learning, its internal representations shift. To keep the target embeddings stable yet adaptive, the authors used an Exponential Moving Average (EMA) to update the target encoder. As shown below, a coefficient of \(\rho=0.995\) yielded the highest success rate.
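For reference, an EMA update of this kind is only a few lines. The sketch below assumes an online encoder whose weights are slowly copied into the target encoder that produces the alignment targets; the function name is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, online_encoder, rho=0.995):
    """Move each target parameter a small step toward its online counterpart."""
    for p_tgt, p_online in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_tgt.mul_(rho).add_(p_online, alpha=1 - rho)
```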


Conclusion
FLARE represents a significant step forward in robotic intelligence. It successfully integrates the intuition of World Models with the precision of Flow-Matching Policies.
By predicting latent futures rather than pixels, FLARE remains lightweight and scalable. It requires minimal architectural changes—just a few extra tokens—yet delivers state-of-the-art performance. Most importantly, it unlocks the vast reservoir of human video data for robot training, allowing robots to learn the dynamics of tasks from us, even when they don’t know the precise motor commands we used.
As robots move from controlled labs into the messy real world, this ability to implicitly predict “what happens next” will be the key to safe, reliable, and generalized autonomy.