Introduction: The Long-Horizon Challenge

Imagine asking a robot to “make dinner.” To you, this is a single request. To a robot, it is a staggering sequence of complex, physically grounded actions: open the fridge, identify the ingredients, grasp the onion, place it on the cutting board, pick up the knife, and so on.

In the field of robot learning, we call these long-horizon manipulation tasks. They are notoriously difficult because errors compound. If the robot fumbles opening the fridge, the rest of the plan is irrelevant.

To solve this, researchers often turn to Imitation Learning—teaching robots by showing them demonstrations. Ideally, we want robots to learn from “unlabeled play data.” This is data collected by humans just messing around with the robot, moving objects, and exploring the environment without specific goal labels. It is cheap to collect but incredibly messy to learn from.

Recent trends have tried to solve this using Generative AI. The logic goes: “If we can use a video generation model (like a mini-Sora) to hallucinate a video of the robot solving the task, we can just ask the robot to follow that video.”

However, there are two massive problems with generative video planners:

  1. Hallucination: Diffusion models dream up pixels, not physics. They might generate a video where a bowl disappears or a robotic arm gains a second elbow.
  2. Speed: Generating video frames pixel-by-pixel is computationally expensive, often taking seconds per plan—an eternity in real-time control.

Enter Vis2Plan, a new framework that takes a different approach. Instead of dreaming up new videos, it extracts symbolic plans from existing data and retrieves real, physically valid images to guide the robot.

In this deep dive, we will explore how Vis2Plan combines the best of both worlds: the reasoning power of symbolic AI and the perceptual richness of modern computer vision.

The Architecture of Vis2Plan

Vis2Plan is a hierarchical framework. It separates the high-level “thinking” (planning) from the low-level “doing” (control). But unlike traditional symbolic planners that require humans to manually code rules (like PDDL), Vis2Plan extracts its symbols automatically from raw video.

Let’s look at the high-level architecture:

Figure 1: Vis2Plan Framework Overview.

As shown in Figure 1, the system works in two distinct phases:

  1. Offline Learning (Bottom Stream): The system digests unlabeled play data to learn symbols, a transition graph, and policy modules.
  2. Online Inference (Top Stream): When given a goal, the planner finds a symbolic path, converts it into a sequence of visual subgoals (images), and executes them using a low-level controller.

The beauty of this approach is that it is white-box. Unlike a black-box neural network that outputs a vector and says “trust me,” Vis2Plan produces a readable plan of symbols. You can inspect the steps the robot intends to take.

Stage 1: From Pixels to Symbols

How do you turn a video of a messy kitchen into clean, discrete symbols without human labeling? Vis2Plan leverages Vision Foundation Models (VFMs).

The researchers recognized that while raw pixel data is high-dimensional and noisy, the state of a task usually changes only when objects interact. For example, a “pot” has a stable state when it’s sitting on the stove, and a different stable state when it’s in the sink. The movement in between is just the transition.

1.1 Video Preprocessing and Tracking

First, the system needs to understand what is in the scene. It uses SAM2 (Segment Anything Model 2) to track objects and SigLIP2 to extract rich visual features from those objects. This converts a video from a stream of pixels into a stream of object-centric feature vectors.
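
To make this concrete, here is a minimal sketch of the preprocessing step. The `track_object_masks()` helper is a hypothetical stand-in for SAM2's video tracker, and a SigLIP checkpoint loaded through Hugging Face `transformers` stands in for SigLIP2; the paper's actual pipeline and model choices may differ.

```python
# Sketch: convert a video into per-object feature trajectories.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# SigLIP stands in for SigLIP2 here (an assumption, not the paper's exact model).
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
encoder = AutoModel.from_pretrained("google/siglip-base-patch16-224")

def track_object_masks(frames):
    """Hypothetical stand-in for SAM2: for each frame, return a dict
    {object_id: (x0, y0, x1, y1)} of tracked object bounding boxes."""
    raise NotImplementedError

def object_features(frames):
    """frames: list of HxWx3 uint8 arrays. Returns, per frame, a dict
    {object_id: normalized embedding} describing each tracked object."""
    per_frame = []
    for frame, boxes in zip(frames, track_object_masks(frames)):
        feats = {}
        for obj_id, (x0, y0, x1, y1) in boxes.items():
            crop = Image.fromarray(frame[y0:y1, x0:x1])
            inputs = processor(images=crop, return_tensors="pt")
            with torch.no_grad():
                emb = encoder.get_image_features(**inputs)  # (1, D)
            feats[obj_id] = torch.nn.functional.normalize(emb, dim=-1)[0]
        per_frame.append(feats)
    return per_frame
```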

1.2 Skill Segmentation via Stable States

The core insight here is Stable State Identification. The system analyzes the feature vectors over time. When the visual features of objects remain similar for a duration, that represents a “stable state.” When they change rapidly, an action is occurring.

Figure 2: Symbolic skill extraction process.

As illustrated in Figure 2:

  1. Input: Raw video frames.
  2. Segmentation: By calculating temporal similarity (how similar is frame \(t\) to frame \(t+1\)?), the system identifies peaks of change. These peaks mark the boundaries between sub-skills.
  3. Clustering: The system uses agglomerative clustering on the stable states. If the “pot on stove” looks similar in video A and video B, they are clustered into the same Symbolic State.

This effectively turns a continuous video into a sequence of discrete nodes, allowing the construction of a Symbolic Transition Graph.
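
Here is a minimal sketch of the segmentation-and-clustering idea, assuming each frame has already been reduced to a single L2-normalized feature vector (for example, by pooling the per-object embeddings above). The thresholds and linkage settings are illustrative defaults, not values from the paper.

```python
# Sketch: split a play trajectory into stable states and cluster them.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def find_stable_frames(feats, change_thresh=0.2):
    """feats: (T, D) array of L2-normalized per-frame features.
    A frame is 'stable' when it barely differs from its successor;
    high-change frames mark boundaries between sub-skills."""
    change = 1.0 - np.sum(feats[:-1] * feats[1:], axis=1)   # 1 - cosine similarity
    return np.where(change < change_thresh)[0]

def cluster_stable_states(stable_feats, distance_threshold=0.5):
    """Group stable-state features into discrete symbolic states, so the
    same configuration seen in different videos maps to the same symbol."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                       # let the threshold decide
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(stable_feats)   # one symbol id per stable frame
```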

1.3 The Symbolic Graph

The result of Stage 1 is a directed graph where nodes are object states and edges are possible transitions.

Figure 12: Example of a symbolic graph node structure.

Figure 12 shows a visualization of this graph. Each node (like Node 19) represents a specific configuration of the world (e.g., “Object 2 is in state 3”). This graph is the map the robot will use to navigate tasks. Because it is built from real data, every transition in the graph is something the robot has actually seen happen.
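
With per-segment symbol labels in hand, building the graph itself takes only a few lines. A sketch using networkx, where each trajectory has already been reduced to the sequence of symbols it visits (the numbers below are made up for illustration):

```python
# Sketch: build a symbolic transition graph from symbol sequences.
import networkx as nx

def build_transition_graph(trajectories):
    """trajectories: list of symbol sequences, e.g. [[3, 7, 2], [3, 5, 2]].
    Nodes are symbolic states; an edge means the transition was observed."""
    graph = nx.DiGraph()
    for symbols in trajectories:
        for src, dst in zip(symbols[:-1], symbols[1:]):
            if src == dst:
                continue  # skip self-transitions within the same stable state
            # Count how often each transition appears in the play data.
            if graph.has_edge(src, dst):
                graph[src][dst]["count"] += 1
            else:
                graph.add_edge(src, dst, count=1)
    return graph

graph = build_transition_graph([[3, 7, 2], [3, 5, 2], [7, 2, 9]])
print(graph.number_of_nodes(), graph.number_of_edges())  # 5 nodes, 5 edges
```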

Stage 2: Learning the Planning Modules

Having a graph is great, but to execute a task, the robot needs three specific learned capabilities:

  1. State Prediction: “Where am I now?”
  2. Reachability: “Can I actually get from Image A to Image B?”
  3. Control: “How do I move my arm to get to Image B?”

2.1 The Next-Symbolic-State Predictor

Since the robot sees the world through a camera, it needs to map its current image observation \(O_t\) to a node in the symbolic graph. The researchers train a classifier \(C_{\theta}\) that takes an image and predicts the next likely symbolic state.

Equation for the symbolic state predictor loss.

This predictor acts as the bridge between the continuous real world and the discrete symbolic map.
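
A minimal PyTorch sketch of such a classifier, assuming image observations have already been encoded into feature vectors (e.g., by the frozen vision foundation model from Stage 1). The MLP architecture and cross-entropy loss here are a plausible setup, not necessarily the paper's exact design.

```python
# Sketch: next-symbolic-state classifier over frozen VFM features.
import torch
import torch.nn as nn

class NextStatePredictor(nn.Module):
    def __init__(self, feat_dim: int, num_symbols: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_symbols),
        )

    def forward(self, obs_feat):           # obs_feat: (B, feat_dim)
        return self.net(obs_feat)          # logits over symbolic states

# One training step (labels come from the Stage-1 clustering):
model = NextStatePredictor(feat_dim=768, num_symbols=32)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

obs = torch.randn(16, 768)                 # dummy batch of VFM features
next_symbol = torch.randint(0, 32, (16,))  # symbol id of the next stable state
loss = loss_fn(model(obs), next_symbol)
loss.backward()
optim.step()
```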

2.2 The Reachability Estimator

This is arguably the most critical component for keeping plans physically feasible. Just because two states are connected in the graph doesn’t mean the robot can physically transition between them from its current specific pose.

Vis2Plan trains a Reachability Estimator \(R_{\psi}\). This is a neural network trained using Contrastive Reinforcement Learning.

Figure 13: Architecture of the reachability network.

As shown in Figure 13, the network takes a current observation and a potential subgoal image. It outputs a score indicating how feasible the transition is.

The training uses an MC-InfoNCE loss function:

Equation for MC-InfoNCE loss.

In simple terms, this equation forces the network to assign high scores to transitions that actually occurred in the dataset (positive pairs) and low scores to random pairings (negative pairs). If the planner suggests a jump that violates physics or the robot’s capabilities, this estimator will output a low score, allowing the planner to reject it.
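
The sketch below shows the core contrastive recipe: a two-tower network whose dot product is the reachability score, trained with an InfoNCE-style loss over in-batch negatives. The paper's exact MC-InfoNCE objective likely differs in how positives over future states are sampled; this keeps only the core idea of scoring observed transitions above random pairings.

```python
# Sketch: contrastive reachability estimator R_psi(o, g) = phi(o) . psi(g).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reachability(nn.Module):
    def __init__(self, feat_dim=768, embed_dim=128):
        super().__init__()
        self.obs_enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, embed_dim))
        self.goal_enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                      nn.Linear(256, embed_dim))

    def forward(self, obs_feat, goal_feat):
        # Score every observation against every goal in the batch: (B, B).
        return self.obs_enc(obs_feat) @ self.goal_enc(goal_feat).T

def info_nce_loss(scores):
    """scores[i, j] = R(o_i, g_j). Diagonal entries are positives
    (goals actually reached from o_i in the data); off-diagonal entries
    are in-batch negatives (random pairings)."""
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)

model = Reachability()
obs = torch.randn(32, 768)    # current observation features
goal = torch.randn(32, 768)   # features of states actually reached later
loss = info_nce_loss(model(obs, goal))
loss.backward()
```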

2.3 The Low-Level Controller

Finally, the system needs a policy to move the robot. Vis2Plan uses a Goal-Conditioned Policy. It takes the current image and the next visual subgoal (selected by the planner) and outputs the motor actions (joint angles, gripper position).

Equation for training the low-level policy.

This policy is trained via behavioral cloning on short snippets of the play data. It doesn’t need to know the long-term goal; it just needs to know how to get from “here” to the “immediate next step.”
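
A sketch of the goal-conditioned behavioral-cloning setup: the policy sees the current observation plus the next visual subgoal and regresses the demonstrated action. The MSE loss and plain MLP here are illustrative; the paper's policy class may be different (for example, a transformer or diffusion policy).

```python
# Sketch: goal-conditioned behavioral cloning on short play snippets.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, feat_dim=768, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, action_dim),   # e.g. 6-DoF arm delta + gripper command
        )

    def forward(self, obs_feat, subgoal_feat):
        return self.net(torch.cat([obs_feat, subgoal_feat], dim=-1))

policy = GoalConditionedPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

# One BC step: (observation, subgoal, expert action) triples are cut
# from short windows of the play data.
obs = torch.randn(64, 768)
subgoal = torch.randn(64, 768)
expert_action = torch.randn(64, 7)
loss = nn.functional.mse_loss(policy(obs, subgoal), expert_action)
loss.backward()
optim.step()
```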

Stage 3: Symbolic-Guided Visual Planning (Inference)

Now we put it all together. The robot is placed in a kitchen and told: “Put the pot on the stove.”

The process, visualized in Figure 3, follows a “Search-Retrieve-Validate” pattern.

Figure 3: Symbolic-Guided Visual Planning Framework.

Step 1: Symbolic Planning (The Search). The robot looks at the scene (\(O_t\)) and identifies its current symbolic state, and it maps the task goal to a goal symbol (\(z_g\)). It then runs an A* search algorithm on the Symbolic Graph built in Stage 1, which finds the shortest path of symbols: \(z_{start} \rightarrow z_1 \rightarrow z_2 \rightarrow \dots \rightarrow z_{goal}\).
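
The graph search itself is lightweight. A sketch with networkx, assuming unit edge costs and no task-specific heuristic (which makes A* behave like uniform-cost search); the paper's planner may weight edges or use a more informative heuristic.

```python
# Sketch: A* over the symbolic transition graph. With no heuristic and
# unit edge costs this reduces to uniform-cost (Dijkstra-style) search.
import networkx as nx

def plan_symbols(graph: nx.DiGraph, z_start: int, z_goal: int) -> list[int]:
    # heuristic defaults to zero; edges without a 'weight' attribute cost 1.
    return nx.astar_path(graph, z_start, z_goal)

# Toy graph: nodes are symbolic states, edges are observed transitions.
graph = nx.DiGraph([(0, 1), (1, 2), (0, 3), (3, 2), (2, 4)])
print(plan_symbols(graph, 0, 4))   # a shortest path, e.g. [0, 1, 2, 4]
```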

Step 2: Visual Grounding (The Retrieval). Symbols are abstract. The low-level controller needs pixels. For every symbol in the plan (e.g., “pot held above stove”), the system retrieves a set of real images from the dataset that match that symbol.

Step 3: Reachability Filtering (The Validation). The system now has a sequence of sets of candidate images. It needs to pick the specific sequence \(O_1, O_2, \dots, O_n\) that creates the smoothest, most feasible path.

It solves an optimization problem:

Optimization objective for visual planning.

It uses the Reachability Estimator (\(R_{\psi}\)) to score the transitions. It selects the sequence of images that maximizes reachability (ensuring physical consistency).
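
Spelled out, the selection problem has roughly the following form (a plausible rendering of the description above, not necessarily the paper’s exact notation):

\[
O_1^{*}, \dots, O_n^{*} \;=\; \operatorname*{arg\,max}_{O_i \in \mathcal{D}(z_i)} \;\sum_{i=0}^{n-1} R_{\psi}\!\left(O_i, O_{i+1}\right),
\]

where \(\mathcal{D}(z_i)\) is the set of dataset images retrieved for symbol \(z_i\) and \(O_0 = O_t\) is the robot’s current observation.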

Optimization Trick: To make this fast, Vis2Plan pre-computes the reachability embeddings for every candidate image. Instead of passing full images through a network at runtime, it simply takes dot products of pre-computed vectors.
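
A sketch of that trick: every dataset image is embedded once offline by both towers of the reachability network, and planning-time scoring is just dot products over cached vectors. A greedy step-by-step selection is shown for brevity; optimizing over the whole sequence at once (as in the objective above) works the same way, only over more combinations.

```python
# Sketch: fast subgoal selection from cached reachability embeddings.
import numpy as np

def select_subgoals(current_obs_emb, steps):
    """current_obs_emb: (D,) obs-side embedding of the live observation.
    steps: one entry per symbolic plan step; each entry is a pair
    (goal_embs, obs_embs), both (K, D), cached offline for the K images
    retrieved for that symbol. Greedily picks the most reachable image
    at each step using nothing but dot products."""
    chosen, anchor = [], current_obs_emb
    for goal_embs, obs_embs in steps:
        scores = goal_embs @ anchor        # reachability of each candidate
        best = int(np.argmax(scores))
        chosen.append(best)
        anchor = obs_embs[best]            # the next step is scored from here
    return chosen

# Toy usage with random cached embeddings (D = 128).
rng = np.random.default_rng(0)
current = rng.standard_normal(128)
steps = [(rng.standard_normal((5, 128)), rng.standard_normal((5, 128)))
         for _ in range(3)]
print(select_subgoals(current, steps))     # index of the chosen image per step
```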

Figure 14: Optimized visual subgoal sampling.

Figure 14 compares the standard approach (slow) with Vis2Plan’s optimized approach (fast). This optimization allows Vis2Plan to run orders of magnitude faster than video generation models.

Experiments and Results

The researchers evaluated Vis2Plan in two environments:

  1. LIBERO Simulation: A standard benchmark for robot manipulation.
  2. Real World: A robotic arm interacting with a toy kitchen (manipulating pots, bowls, and vegetables).

Quantitative Performance

The results were decisive. Vis2Plan significantly outperformed baselines, particularly on long-horizon (multi-stage) tasks.

Table 1: Results in simulation.

In Table 1 (Simulation), look at the “Long Horizon” columns. End-to-end methods (like GC-Transformer) fail completely (0% success). Diffusion-based video planners (AVDC) struggle. Vis2Plan achieves success rates of 72-82% on the hardest tasks.

The real-world results mirrored this success:

Table of real-world robot experiments.

Vis2Plan achieved an average success rate of 0.71 across tasks, while the diffusion-based video planner (AVDC) only managed 0.18.

Speed Comparison

One of the most impressive claims is efficiency. Because Vis2Plan retrieves existing images rather than generating new ones, it is incredibly fast.

Table 7: Inference speed comparison.

Table 7 shows that while the diffusion planner (AVDC) takes 1.42 seconds to think of a plan, Vis2Plan takes just 0.03 seconds. In robotics, this difference is the gap between a fluid motion and a stuttering, pausing robot.

Qualitative Analysis: Why do others fail?

The paper provides a fascinating look at why generative video planners fail.

Figure 5: Failure cases of baseline methods.

Figure 5 highlights “Hallucinations.”

  • AVDC (Diffusion): In the red box, the generative model simply “deletes” the bowl. It vanishes from existence. The low-level controller, confused by the missing object, fails.
  • GSR (Graph Search Retrieval): This baseline connects states based on naive visual similarity. It jumps from holding a pan to a state where the pan is suddenly somewhere else, causing the robot to flail.

In contrast, Vis2Plan produces physically consistent plans because it retrieves real images that actually happened.

Figure 15: Successful Vis2Plan examples.

Figure 15 shows Vis2Plan in action. The Symbolic sub-goals (top rows) provide the high-level logic, while the Visual sub-goals (bottom rows) provide the pixel-perfect targets for the controller. Whether it is putting a bowl on a stove in simulation or putting a green onion in a pot in the real world, the plans are coherent.

Conclusion

Vis2Plan represents a shift in how we think about robot planning from unlabeled data. Rather than relying on the “black magic” of end-to-end learning or the expensive hallucinations of video generation models, it takes a structured approach.

Key Takeaways:

  1. Symbolic Guidance is Powerful: By extracting discrete states, we gain the reliability of classical planning (A* search).
  2. Retrieval > Generation: For robotic planning, finding a real image that guarantees physical feasibility is often better (and safer) than generating a synthetic one.
  3. Efficiency Matters: Robots need to act in the real world. A planner that runs in 0.03 seconds enables real-time responsiveness that 1-second planners cannot match.

By effectively bridging the gap between symbolic reasoning and visual imitation, Vis2Plan offers a robust blueprint for robots that can learn complex, multi-step tasks simply by watching us play.