Introduction
How does a human learn to interact with a new environment? If you place a toddler in front of a table with blocks and cups, they don’t just randomly twitch their muscles until something interesting happens. They look at the objects, form a mini-goal (e.g., “I want to put the blue block inside the cup”), and then try to execute it. If it doesn’t fit, they learn. If it works, they remember the result and try something new.
For a long time, robots haven’t learned like this. The dominant paradigm, Reinforcement Learning (RL), often relies on “random” exploration—essentially trying random actions to see what yields a reward or changes the pixel values of the camera feed. In complex, real-world environments, this is inefficient and potentially dangerous. A robot arm flailing randomly is more likely to break a cup than stack it.
A new paper, “Imagine, Verify, Execute” (IVE), proposes a shift towards Agentic Exploration. By leveraging the common sense and reasoning capabilities of Vision-Language Models (VLMs), the researchers have created a system that explores like a human: it imagines a goal, verifies if it’s possible, and then executes it.

As shown in Figure 1 above, while RL agents (center) focus on maximizing mathematical coverage often without semantic meaning, IVE (right) mimics the human process (left) of understanding, imagining, and verifying.
The Problem with “Twitching” to Learn
To understand why IVE is necessary, we first need to look at why traditional methods struggle in the real world.
In simulated video games, RL agents are fantastic. They can fail millions of times to learn a strategy. But in robotics, we face two massive hurdles:
- Semantic Blindness: Traditional intrinsic rewards (curiosity) are often based on pixel novelty. To a standard RL agent, a flickering light might be more “interesting” than stacking a block because the pixels change more drastically. It doesn’t understand objects.
- Physical Feasibility: A VLM might have semantic knowledge (it knows a cup is for holding things), but it often lacks physical grounding. It might “imagine” putting a large box inside a small cup—a semantically interesting idea, but physically impossible.
The IVE framework bridges this gap. It uses the semantic intelligence of models like GPT-4 to propose interesting goals, but wraps them in a rigorous system of memory and verification to ensure the robot does things that are both novel and physically possible.
The IVE Framework: A Deep Dive
IVE stands for Imagine, Verify, Execute. It is a closed-loop system that operates without human intervention or pre-defined rewards. Let’s break down the architecture step-by-step.
1. The Scene Describer: Seeing the World
Before a robot can plan, it must understand what it is looking at. Raw pixels are too noisy for high-level planning. IVE solves this by converting the RGB-D observation into a Scene Graph.
A scene graph is a structured representation of the world. Instead of pixels, the computer sees nodes (objects) and edges (relationships).

Formally, a scene graph is a pair \(\mathcal{G} = (V, E)\), where:
- \(V\) represents the set of objects (e.g., Red cup, Blue block).
- \(E\) represents the relationships between them (e.g., Stacked on, Near).
The Scene Describer uses a VLM to analyze the camera feed and generate this graph. This abstraction is crucial because it allows the robot to reason symbolically. It doesn’t worry about lighting conditions or texture; it just knows “The Blue block is on the Tray.”
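To make this concrete, here is a minimal Python sketch of a scene graph as a set of object nodes and relationship edges. The class and field names are illustrative, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Relation:
    subject: str    # e.g., "blue_block"
    predicate: str  # e.g., "stacked_on"
    target: str     # e.g., "tray"

@dataclass
class SceneGraph:
    objects: set[str] = field(default_factory=set)        # V: object nodes
    relations: set[Relation] = field(default_factory=set) # E: relationship edges

# Illustrative graph for "the blue block is on the tray, near the red cup"
g = SceneGraph(
    objects={"blue_block", "red_cup", "tray"},
    relations={
        Relation("blue_block", "stacked_on", "tray"),
        Relation("blue_block", "near", "red_cup"),
    },
)
```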
2. The Explorer: Imagination and Memory
Once the current state is understood, the Explorer module takes over. This is the “brain” of the operation. Its job is to imagine a future scene graph—a configuration that doesn’t exist yet but could.
However, we don’t want the robot to do the same thing over and over. To prevent this, IVE utilizes a Memory module. The system stores a history of all previously visited scene graphs.
When the Explorer plans a move, it queries this memory to find similar past situations.

This retrieval step selects past graphs \(\mathcal{G}_j\) whose distance to the current state \(\mathcal{G}_t\) falls within a threshold \(\tau\). By looking at what it has already done in similar situations, the Explorer can deliberately generate a novel goal, something it has not tried before.
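A minimal sketch of this retrieval step, assuming each scene graph is flattened into a set of (subject, relation, object) triples and using a toy set-difference distance (the paper's actual similarity measure may differ):

```python
def graph_distance(g1: frozenset, g2: frozenset) -> int:
    """Toy distance: number of triples present in one graph but not the other."""
    return len(g1 ^ g2)

def retrieve_similar(memory: list[frozenset], current: frozenset, tau: int = 2) -> list[frozenset]:
    """Return previously visited graphs within distance tau of the current state."""
    return [g for g in memory if graph_distance(g, current) <= tau]

current = frozenset({("blue_block", "stacked_on", "tray"), ("red_cup", "near", "tray")})
memory = [
    frozenset({("blue_block", "stacked_on", "tray")}),                        # distance 1
    frozenset({("red_cup", "inside", "blue_block"), ("tray", "near", "wall")}),  # distance 4
]
print(retrieve_similar(memory, current, tau=2))  # only the first graph is returned
```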
Once a goal is set (e.g., “Place the Red cup on the Tray”), the Explorer generates a sequence of high-level skills to achieve it.

The resulting plan \(\mu\) is simply a sequence of interpretable commands like "move(Red cup, Stacked on, Tray)."
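In code, such a plan can be as simple as an ordered list of skill calls; the tuple format below is purely illustrative:

```python
# A plan is an ordered list of interpretable skill calls the robot will attempt.
plan = [
    ("move", "Red cup", "Stacked on", "Tray"),
    ("move", "Blue block", "Inside", "Red cup"),
]

for skill, obj, relation, target in plan:
    print(f"{skill}({obj}, {relation}, {target})")
```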
3. The Verifier: The Voice of Reason
This is where IVE distinguishes itself from standard VLM approaches. VLMs act like confident improvisers—they will happily suggest actions that are dangerous or impossible.
The Verifier module acts as the system’s “conscience.” It looks at the plan proposed by the Explorer and evaluates it against recent history.
- Is the tray already full?
- Did we try this 5 minutes ago and fail because the stack toppled?
- Is this placement stable?
If the Verifier says “No,” the plan is rejected, and the Explorer must imagine something else. If it says “Yes,” the plan moves to execution. This step is critical for safe, effective exploration in the real world.
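In the paper, the Verifier is itself a VLM reasoning over the scene and recent history; the rule-based stand-in below only illustrates where such a check sits in the loop (the function and its inputs are hypothetical):

```python
def verify(plan, scene_objects: set, recent_failures: set) -> bool:
    """Reject plans that reference missing objects or repeat a recent failure."""
    for step in plan:
        skill, obj, relation, target = step
        if obj not in scene_objects or target not in scene_objects:
            return False  # the object or target isn't actually on the table
        if step in recent_failures:
            return False  # this exact move was tried recently and failed
    return True

plan = [("move", "red_cup", "stacked_on", "tray")]
print(verify(plan, {"red_cup", "tray", "blue_block"}, recent_failures=set()))           # True
print(verify(plan, {"red_cup", "tray"}, {("move", "red_cup", "stacked_on", "tray")}))   # False
```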
4. Action Tools: Making it Real
Finally, the high-level plan needs to become robot movement. The Action Tools module translates semantic commands into low-level motor controls.

As shown in Figure 4 above, a command like “Stack the white box” is broken down into a pipeline: finding the object (a), finding the target location (b), calculating the grasp pose (c), and executing the motion (d, e).
This modularity is powerful. The “thinking” part of the robot (the VLM) doesn’t need to know how to calculate joint angles; it just needs to know which tool to call.
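A hedged sketch of that dispatch, with a hypothetical ActionTools interface standing in for the real perception and motion primitives:

```python
class ActionTools:
    """Hypothetical wrappers around perception and motion primitives."""
    def locate(self, name): return {"object": name, "xyz": (0.4, 0.0, 0.05)}
    def place_pose(self, target, relation): return {"target": target, "relation": relation}
    def plan_grasp(self, obj_pose): return {"grasp_at": obj_pose["xyz"]}
    def pick(self, grasp): print("picking at", grasp["grasp_at"])
    def place(self, pose): print("placing at", pose); return True

def execute_move(obj, relation, target, tools):
    obj_pose = tools.locate(obj)                     # (a) find the object
    place_pose = tools.place_pose(target, relation)  # (b) find the target location
    grasp = tools.plan_grasp(obj_pose)               # (c) compute a grasp pose
    tools.pick(grasp)                                # (d) execute the pick
    return tools.place(place_pose)                   # (e) execute the place

execute_move("white_box", "stacked_on", "tray", ActionTools())
```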
Does it Work? Experiments and Results
The researchers tested IVE in both simulation (VimaBench) and on a real-world UR5e robot arm. They compared it against standard RL exploration methods (like Random Network Distillation - RND) and human operators.
Exploration Diversity
The primary goal of exploration is to see as many diverse states as possible. The metric used here is Entropy—in this context, a higher entropy means the robot visited a wider variety of unique scene configurations.

State entropy is computed as \(H = -\sum_{s} p(s) \log p(s)\), where \(p(s)\) is the probability of visiting state \(s\). This is how the researchers measured how "curious" each agent was.
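For illustration, here is a small Python snippet that computes this state entropy from a list of (discretized) visited states:

```python
from collections import Counter
from math import log

def state_entropy(visited_states: list[str]) -> float:
    """Shannon entropy over the distribution of visited scene states."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Two unique states visited equally often -> entropy log(2) ≈ 0.69
print(state_entropy(["A", "B", "A", "B"]))
```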

The results (Figure 5) are striking:
- Massive Improvement over RL: IVE (blue line) achieved a 4.1 to 7.8 times increase in state entropy compared to RL baselines. The RL agents tended to get stuck or explore irrelevant variations.
- Competitiveness with Humans: IVE discovered 82% to 122% of the scene diversity exhibited by expert humans. In some cases, it even outperformed humans because humans tend to forget what they did 20 minutes ago, whereas IVE’s memory module ensures it constantly seeks new configurations.
The Importance of Each Component
Is the complex architecture really necessary? The researchers performed an ablation study, removing parts of the system to see what would happen.

Figure 6 illustrates the necessity of the “Imagine, Verify, Execute” loop:
- w/o Memory (Green): Without memory, the robot kept repeating the same actions, leading to low diversity.
- w/o Verifier (Not shown in graph but discussed): Without the verifier, the robot attempted impossible tasks, leading to execution failures.
- w/o Explorer (Red): Using simple rules instead of VLM imagination resulted in rigid, uninteresting behavior.
Downstream Utility: Learning from the Data
The ultimate test of exploration is whether the data collected is actually useful. Can other robots learn from the experiences IVE gathered?
The researchers used the data collected by IVE to train policies for specific tasks (Behavior Cloning) and to train World Models (predicting physics).
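As a rough illustration of how exploration data becomes supervised training data for behavior cloning, here is a toy example that fits a linear policy on synthetic (observation, action) pairs; real pipelines use neural networks and the actual robot trajectories collected by IVE:

```python
import numpy as np

# Toy behavior cloning: fit a linear policy on (observation, action) pairs.
rng = np.random.default_rng(0)
observations = rng.normal(size=(500, 16))           # stand-in state features
actions = observations @ rng.normal(size=(16, 4))   # stand-in demonstrated actions

# Least-squares fit: policy weights mapping observations to actions
W, *_ = np.linalg.lstsq(observations, actions, rcond=None)
pred = observations @ W
print("mean squared error:", float(np.mean((pred - actions) ** 2)))
```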

Table 1 shows the success rate of a Behavior Cloning policy trained on data from different sources.
- Policies trained on RL exploration data (RND/RE3) failed almost completely (0% - 8% success). The data just wasn’t meaningful enough.
- Policies trained on IVE data achieved up to 58% success, matching or even slightly exceeding policies trained on Human data.
This is a significant finding. It suggests that we can unleash IVE robots overnight to “play” with objects, and the resulting data is high-quality enough to train robots to perform specific tasks later.
Conclusion and Implications
The “Imagine, Verify, Execute” paper presents a compelling argument for the future of robotic learning. By moving away from random motor babbling and toward structured, semantic curiosity, robots can explore more quickly, safely, and effectively.
Key takeaways include:
- Semantics Matter: Understanding objects and relationships (via Scene Graphs) allows for much richer exploration than just looking at pixels.
- Memory is Key: Knowing what you’ve already tried is essential for discovering what you haven’t.
- Verification Bridges the Gap: A “sanity check” module allows us to use powerful but hallucination-prone VLMs in the physical world.
This approach brings us a step closer to general-purpose robots that can enter a new environment, look around, and autonomously teach themselves how the world works—just like us.