Imagine you are trying to move a large sofa through a narrow doorway. Before you even lift a finger, you likely simulate the process in your head. You visualize tilting it at a 45-degree angle, realize that the legs will get stuck, and then revise your plan to take the legs off first. This process—mental simulation followed by critique and revision—is second nature to humans. In cognitive science, this is often linked to “System 2” thinking: slow, deliberative, and logical.

Robots, however, typically struggle with this. While recent Vision-Language Models (VLMs) like GPT-4V or Gemini have given robots an impressive ability to “see” and “describe” the world, they often act like “System 1” thinkers: fast, intuitive, but prone to impulsive errors. They might recognize a block that needs to be moved, but fail to realize that moving it now will cause a stack of other blocks to collapse five steps later.

In the research paper “Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation,” researchers introduce a novel framework that grants VLMs this missing capability: the ability to imagine the future and reflect on it.

This post explores how this method, dubbed ReflectVLM, combines the semantic knowledge of large models with a “mind’s eye” (a diffusion model) to solve complex, long-horizon puzzles that stump even the most advanced commercial models.

The Problem: Why VLMs Struggle with Physics

Vision-Language Models are trained on internet-scale data. They know that “apples” are “red” and “spoons” go in “bowls.” However, they lack a grounded understanding of physical interactions and temporal causality.

When a robot faces a multi-stage manipulation task—like assembling a complex interlocking puzzle—it cannot simply look at the current state and guess the next move. It needs to look ahead.

  1. Long-Horizon Dependencies: A mistake made at step 1 might not cause a failure until step 10.
  2. Physical Constraints: Blocks might be interlocking. You can’t insert Block B until Block A is removed, but the VLM might just see “Block B fits in the hole” and try to jam it in.
  3. Error Compounding: In a 50-step task, a 95% success rate per step compounds to an overall success rate of less than 8% (see the quick calculation below). One small slip, and the whole plan fails.

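A quick back-of-the-envelope check of that compounding claim (illustrative values only, not from the paper):

```python
# If each of 50 steps succeeds independently with probability 0.95,
# the chance the whole plan succeeds is 0.95 ** 50.
print(f"{0.95 ** 50:.3f}")  # 0.077, i.e. under 8%
```
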
Traditional robotics solves this with “Task and Motion Planning” (TAMP), which uses rigid, symbolic logic. But TAMP requires perfect knowledge of every object’s geometry and mass, which is rarely available in the real world. We want the flexibility of a VLM with the planning rigor of TAMP.

The Solution: Reflective Planning

The researchers propose a test-time framework that enables a VLM to “think” before it acts. The core intuition is simple: Don’t just predict the next action. Predict the result of that action, look at it, and ask, “Is this what I wanted?”

As illustrated in Figure 1 below, the system consists of a VLM (the planner) and a Diffusion Dynamics Model (the simulator).

Figure 1. Reflective planning. Our method uses a VLM to propose actions and a diffusion dynamics model to imagine the future state of executing the plan. The imagined future helps the VLM reflect on the initial plan and propose a better action.

The process works in a loop:

  1. Propose: The VLM suggests an action (e.g., “Pick up purple block”).
  2. Imagine: A generative model creates an image of what the world will look like after that action.
  3. Reflect: The VLM looks at this “imagined future.” If the result looks bad (e.g., the block falls over), it revises its plan (e.g., “Actually, pick up yellow block”).

Let’s dive into the technical architecture that makes this possible.

Core Method: Inside the Architecture

The framework is built on two pillars: Interactive Post-Training (teaching the VLM how to reflect) and the Diffusion Dynamics Model (giving the VLM a way to see the future).

1. Teaching the VLM to Reflect

A standard VLM is trained to describe images or answer questions. It is not naturally trained to critique its own robotic plans. To fix this, the authors use a post-training strategy inspired by imitation learning.

They assume access to an “expert policy” (an oracle) during training time. This expert knows the optimal moves. The training data is generated by rolling out the VLM in the environment and correcting it.

Crucially, they generate two types of training examples from these interactions, as shown in Figure 2:

Figure 2. Training data generation. Training data for the reflection mechanism is collected by relabeling the rollouts. For each timestep, two training examples are generated: (Q1, A1) for action proposal and (Q2, A2) for reflection. H is the imagination horizon, and h is the history length. a_t^* is the action label given by the expert policy.

  • Task 1: Action Proposal (Q1, A1): “Here is the current state \(I_t\) and the goal \(I_g\). What should I do?” The label is the expert action \(a^*_t\).
  • Task 2: Reflection (Q2, A2): “Here is the current state, the goal, and a photo of what the future looks like (\(I_{t+H}\)) if you follow your current plan. What should I do?” The label is still the expert action \(a^*_t\).

This second task is the secret sauce. It forces the VLM to connect visual outcomes to correct decisions: if the imagined future \(I_{t+H}\) shows a mess, the VLM learns to output an action that avoids that mess.

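To make the relabeling concrete, here is a minimal sketch of how such example pairs could be assembled from a rollout. The `Step` container, field names, and prompt wording are hypothetical stand-ins; the paper's actual data format may differ.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One timestep of a rollout (hypothetical container, not the paper's format)."""
    image: str            # observation I_t (e.g., a file path)
    expert_action: str    # expert label a*_t
    imagined_future: str  # I_{t+H}: image reached by following the VLM's own plan for H steps

def make_training_examples(goal_image: str, rollout: list[Step]) -> list[dict]:
    """Relabel a rollout into (Q1, A1) action-proposal and (Q2, A2) reflection examples."""
    examples = []
    for step in rollout:
        # (Q1, A1): propose an action from the current state and the goal.
        examples.append({
            "images": [step.image, goal_image],
            "prompt": "Here is the current state and the goal. What should I do?",
            "answer": step.expert_action,
        })
        # (Q2, A2): same expert label, but the prompt also shows the imagined future,
        # so the model learns to map bad-looking futures to corrective actions.
        examples.append({
            "images": [step.image, goal_image, step.imagined_future],
            "prompt": ("Here is the current state, the goal, and what the future looks "
                       "like if you follow your current plan. What should I do?"),
            "answer": step.expert_action,
        })
    return examples
```
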
The model is optimized using a combined Cross-Entropy loss function:

Equation for the loss function combining action proposal and reflection.

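Written out from the (Q1, A1) / (Q2, A2) definitions above (and omitting the history context \(h\) for brevity), the combined objective plausibly takes the form of a summed cross-entropy, i.e. negative log-likelihood, over both example types; the paper's exact notation may differ:

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_t\Big[\log \pi_\theta\big(a^*_t \mid Q_1(I_t, I_g)\big) + \log \pi_\theta\big(a^*_t \mid Q_2(I_t, I_g, I_{t+H})\big)\Big]
\]
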
This equation ensures the model gets better at both proposing actions and refining them based on visual feedback simultaneously.

2. The Imagination Engine: Diffusion Dynamics Model

For the VLM to reflect, it needs an image to reflect on. In the real world, you can’t “undo” a robot arm crashing through a table. You need a simulator.

The authors train a Diffusion Dynamics Model (DDM). This acts as a learned world model. It takes the current image (\(I_t\)) and a text description of an action (\(a_t\)) and generates the next image (\(I_{t+1}\)).

Figure 3. Architecture of Diffusion Dynamics Model, which consists of a latent encoder, text encoder, Diffusion UNet and latent decoder. The latent encoder and text encoder are frozen during training, while Diffusion UNet and latent decoder are finetuned on our task data. N: random noise.

As shown in Figure 3, the architecture leverages a pre-trained “InstructPix2Pix” model.

  • Encoders: The image and the text action are encoded into latent representations (\(z_t\) and \(z_{a_t}\)).
  • Diffusion U-Net: Starting from random noise, this network denoises it into the latent of the future state, conditioned on the current state and the action.
  • Decoder: Transforms that latent representation back into a pixel-space image of the predicted future.

This approach is far more flexible than coding a physics engine from scratch because it learns directly from visual data.

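As a rough illustration of how an InstructPix2Pix-style pipeline can serve as a one-step visual dynamics model, here is a sketch using the Hugging Face diffusers library. It loads the public base checkpoint rather than the authors' fine-tuned weights, and the action phrasing is made up, so treat it as an interface sketch rather than the paper's implementation.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Public base checkpoint; the paper fine-tunes the U-Net and decoder on its own task data.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def imagine_next(image: Image.Image, action_text: str) -> Image.Image:
    """Predict I_{t+1} from I_t and a text action, treating the image editor as a dynamics model."""
    out = pipe(
        prompt=action_text,          # e.g. "pick up the purple block" (hypothetical phrasing)
        image=image,                 # current observation I_t
        num_inference_steps=20,
        image_guidance_scale=1.5,    # how strongly to stay close to the input image
        guidance_scale=7.5,          # how strongly to follow the action text
    )
    return out.images[0]

# Applying imagine_next repeatedly for H steps yields the imagined future I_{t+H}.
```
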
3. Inference: The “Reflective Planning” Algorithm

At test time (when the robot is actually working), the system puts these pieces together, as sketched in code after the list below.

  1. Initial Proposal: The VLM sees the current board and the goal. It suggests a sequence of actions.
  2. Simulation Loop: The diffusion model takes these actions and generates frames autoregressively, one step at a time: \(I_t \rightarrow I_{t+1} \rightarrow \dots \rightarrow I_{t+H}\).
  3. Reflection: The VLM is fed the original goal, the current state, and the final imagined state (\(I_{t+H}\)).
  4. Decision: The VLM decides whether to stick with the plan or output a corrective action.

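Putting the four steps together, a minimal sketch of the test-time loop might look like the following. The `vlm.propose`, `vlm.reflect`, and `ddm.predict` interfaces are hypothetical stand-ins for the planner and the diffusion dynamics model, and the real system works with richer prompts and action sequences than shown here.

```python
def reflective_step(vlm, ddm, current_image, goal_image, horizon: int):
    """One planning step: propose, imagine H steps ahead, then reflect and possibly revise."""
    # 1. Initial proposal: a candidate plan of the next `horizon` actions.
    plan = vlm.propose(current_image, goal_image, num_actions=horizon)

    # 2. Simulation loop: roll the diffusion dynamics model forward through the plan.
    imagined = current_image
    for action in plan:
        imagined = ddm.predict(imagined, action)   # I_t -> I_{t+1} -> ... -> I_{t+H}

    # 3 + 4. Reflection and decision: show the VLM the imagined future and let it
    # either keep the first action of the plan or output a corrective one.
    revised_action = vlm.reflect(current_image, goal_image, imagined, proposed=plan[0])
    return revised_action
```
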
Figure 4 below provides a beautiful “filmstrip” example of this in action. Look at step 15.

Figure 4. Filmstrip of our method solving a complicated assembly task. Frames are indexed by timestep. The goal image is in the top-left corner (with a green border). Each frame is the observation after executing the action (in black) above it. The other action in gray is the original action proposed by the VLM if it is revised after reflection. We highlight the reflection process at timestep 15, where the VLM first proposes an action to pick up the purple brick, but after reflection, it chooses to pick up the yellow brick instead as the generated future state (red-bordered image) shows little progress towards the goal.

At step 15, the VLM initially wants to “Pick up purple.” However, the system simulates this and realizes (via the diffusion model) that this leads to a bad state. The reflection mechanism kicks in, critiques the plan, and switches the action to “Pick up yellow.” This ability to self-correct is what prevents the error compounding that plagues other methods.

Experiments & Results

To test this, the researchers created a procedural generation engine for interlocking puzzle tasks. These aren't simple block-stacking problems; they are assembly tasks where pieces must be inserted in a specific order (e.g., you can't place the roof before the walls).

Figure 5. Task examples. (a) Generated multi-stage manipulation tasks with interlocking pieces. Top: initial configurations. Bottom: goal configurations. (b) The graph shows the dependencies between the objects in the blue assembly board on the left. Each node represents an object, and each directed edge indicates the predecessor object should be assembled before the successor object.

As seen in Figure 5, the dependencies can be modeled as a graph, though the robot never sees this graph—it only sees the RGB images.

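To make that ordering constraint concrete, here is a small sketch of such a dependency graph as an adjacency list, plus a check that a proposed assembly order respects it. The object names are invented for illustration; the actual puzzles are defined by the procedural generator.

```python
# Directed edges: predecessor -> objects that may only be assembled after it.
dependencies = {
    "base": ["wall_left", "wall_right"],
    "wall_left": ["roof"],
    "wall_right": ["roof"],
    "roof": [],
}

def respects_dependencies(order: list[str]) -> bool:
    """True if every object appears after all of its predecessors."""
    position = {obj: i for i, obj in enumerate(order)}
    return all(
        position[pred] < position[succ]
        for pred, successors in dependencies.items()
        for succ in successors
    )

print(respects_dependencies(["base", "wall_left", "wall_right", "roof"]))  # True
print(respects_dependencies(["roof", "base", "wall_left", "wall_right"]))  # False
```
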
How did it perform?

The results were stark. The team compared ReflectVLM against state-of-the-art commercial models (GPT-4o, GPT-o1, Gemini 2.0) and traditional planning methods like Monte Carlo Tree Search (MCTS).

Figure 6. Performance of our method and baselines. Success rate (%) on 100 tasks. For the zero-shot test of state-of-the-art VLMs and MCTS, the experiments were conducted once; for other methods, the results are the average of five seeds.

Figure 6 reveals several key takeaways:

  1. Commercial VLMs Fail: Even powerful reasoning models like GPT-o1 achieved only a 15% success rate. They simply cannot handle the fine-grained physical causality required for these puzzles.
  2. MCTS Struggles: Surprisingly, adding Monte Carlo Tree Search (a standard planning algorithm) to a VLM only reached 24%. MCTS is computationally heavy and sensitive to the “value function” (how the robot scores a state). In these puzzles, most states look “neutral” until the very end, making standard search algorithms struggle.
  3. ReflectVLM Dominates: The proposed method (Ours w/ diffusion) achieved an 82.4% success rate. It even rivals the performance of using a ground-truth simulator (“Ours w/ sim” at 85.4%), proving that the learned diffusion model is accurate enough to be useful.

Is it efficient?

A common critique of “thinking” robots is that they are too slow. If you have to run a simulation for every move, does the robot freeze for minutes?

Table 2. Inference computation cost. Inference wall clock time per step. MCTS result is averaged over 100 tasks and 1 seed; the others are averaged over 100 tasks and 5 seeds. All experiments are done on a single A100 GPU.

Table 2 compares the inference time.

  • MCTS: ~391 seconds per step. This is agonizingly slow for a real-time robot.
  • ReflectVLM (w/ diffusion): ~11 seconds per step.

While 11 seconds is not “real-time” in the sense of a reflex, it is perfectly acceptable for high-level planning in complex assembly tasks. It is orders of magnitude faster than traditional search methods while being significantly more accurate.

Conclusion and Implications

The paper Reflective Planning provides a compelling blueprint for the next generation of robotic agents. It moves us away from the paradigm of “blindly following instructions” toward “deliberative reasoning.”

By coupling the semantic understanding of VLMs with the visual foresight of diffusion models, the authors created a system that can:

  1. Visualize the consequences of its actions.
  2. Critique its own plans based on physical constraints.
  3. Recover from potential errors before they happen.

This “System 2” approach—thinking before acting—is likely the path forward for general-purpose robots that can operate in unstructured environments like our homes and workshops. As diffusion models become faster and more accurate, the ability for robots to “dream” about the future will only become more powerful.