Introduction

Imagine you are looking for your keys. You don’t scan every millimeter of the ceiling or the blank wall; you look on the table, check the couch cushions, or look near the door. Your search is active, exploratory, and guided by a “mental model” of where things usually are.

Now, imagine asking a robot to “find the banana.” For a robot, this is an incredibly complex task known as Open-Vocabulary Object Localization. The robot must understand what a “banana” is (semantics), navigate a potentially cluttered and unseen environment (exploration), and understand how its physical movements change what it sees (dynamics).

Current approaches usually fall into two camps:

  1. Imitation Learning: Teaching robots by showing them thousands of human demonstrations. This is expensive and hard to scale.
  2. Vision-Language Models (VLMs): Asking a model like GPT-4 what to do. While VLMs are smart, they lack “physical grounding”—they might tell a robot to move through a table because they don’t understand the physical constraints of the scene.

Enter WoMAP (World Models for Active Perception). In a recent paper, researchers from Princeton University introduced a novel “recipe” for training robots to find any object without needing a single human demonstration. By combining photorealistic simulation, latent world models, and language understanding, WoMAP achieves success rates significantly higher than current state-of-the-art baselines.

WoMAP uses a world model to ground high-level action proposals and maximize predicted rewards. In this example, given three high-level VLM proposals, WoMAP selects “look behind the bowl” as the optimal choice after evaluating the outcome of each action rollout in latent space.

In this post, we will break down the architecture of WoMAP, exploring how it uses “dreamed” trajectories to plan real-world actions.

The Core Challenge: Active Perception

Perception isn’t passive. When we look at a scene, we move our heads to understand depth or verify an object’s identity. This is Active Perception.

For a robot, the problem is formulated as a Partially Observable Markov Decision Process (POMDP). The robot has a limited field of view, occlusions hide objects, and sensors are noisy. Given a language instruction \(l\) (e.g., “find the mug”), the robot must choose a sequence of actions (moving its camera/arm) to maximize the probability of seeing the target object.
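To make this concrete, one hedged way to write the objective (the notation here is ours, not lifted from the paper) is: find the action sequence that maximizes the expected detection reward over a horizon \(T\),

\[
\max_{a_{0:T}} \; \mathbb{E}\!\left[ \sum_{t=0}^{T} r(o_t, l) \right],
\]

where \(r(o_t, l)\) scores how confidently the object named by \(l\) is detected in observation \(o_t\). The catch, and what makes this a POMDP rather than an ordinary planning problem, is that the expectation runs over states the robot can never fully observe.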

The difficulty lies in the intersection of three requirements:

  1. Semantic Understanding: Knowing what the target looks like.
  2. Dynamics Modeling: Knowing how the environment changes when the robot moves.
  3. Scalability: Learning this without requiring thousands of hours of humans driving robots around.

The WoMAP Recipe

WoMAP addresses these challenges with a three-part pipeline. It effectively creates a “gym” for the robot to learn in, trains a brain (the World Model) inside that gym, and then uses that brain to validate suggestions from a VLM during deployment.

Left: Core components of WoMAP. Scalable data generation with Gaussian Splats (Section 3.2), world modeling with object detection reward supervision (Section 3.3), and latent space action planning (Section 3.4). Right: The action optimization process. Given the task and current observation, a VLM generates high-level proposals, which we transform into coarse actions (green arrows); each action is further optimized within WoMAP’s reward gradient field (red arrows), and the action sequence with the highest predicted reward is executed.

Let’s break down each component shown in the figure above.

Ingredient 1: Scalable Data Generation via Gaussian Splats

One of the biggest bottlenecks in robotics is data. Collecting real-world data is slow and dangerous for hardware. Simulations are fast but often look “cartoonish,” creating a “sim-to-real gap” where a robot trained in sim fails in the real world.

WoMAP bridges this gap using Gaussian Splatting.

Gaussian Splatting is a technique that creates a highly photorealistic 3D scene representation from a simple video recording. The researchers take a one-minute video of a scene (like a messy kitchen table), and the algorithm reconstructs it as a cloud of 3D Gaussians. This allows the system to render novel views—images from angles the camera never actually visited during the recording.

Data Generation with Gaussian Splats. We train Gaussian Splats for each scene and obtain ground truth object locations through semantic labeling [25] for informative view sampling. Each observation is labeled with GroundingDINO [26] to get confidence scores for all training targets.

As shown in the figure above, the pipeline works as follows:

  1. Record Video: Capture a real-world scene.
  2. Generate Splat: Create the photorealistic 3D environment.
  3. Semantic Labeling: Use models like CLIP and GroundingDINO to automatically label objects (e.g., identifying the “banana” or “scissors”) within the 3D space.
  4. Trajectory Sampling: Automatically generate thousands of random robot paths through this virtual scene.

This effectively allows the robot to “hallucinate” training data. It can practice looking at the banana from 10,000 different angles in simulation, generating a dataset \(\mathcal{D} = \{(o_i, r_i, P_i)\}\) containing observations (images), rewards (did I see the object?), and poses (where was the camera?).
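As a rough sketch of what this generation loop might look like (`splat.render_view` and `detector.confidence` are hypothetical interfaces standing in for the Gaussian Splat renderer and GroundingDINO, not the authors' actual code):

```python
import numpy as np

def sample_random_pose(bounds, rng=np.random.default_rng()):
    """Toy camera-pose sampler: a uniform position inside the scene bounds
    plus a yaw angle. The paper biases sampling toward informative views."""
    lo, hi = bounds
    return rng.uniform(lo, hi), rng.uniform(-np.pi, np.pi)

def generate_dataset(splat, detector, query="banana", n_samples=10_000):
    """Render novel views from the Gaussian Splat and label each one with a
    detector confidence score, yielding the (o_i, r_i, P_i) triples above."""
    dataset = []
    for _ in range(n_samples):
        pose = sample_random_pose(splat.bounds)
        obs = splat.render_view(pose)             # image o_i from a never-visited angle
        reward = detector.confidence(obs, query)  # GroundingDINO-style score r_i
        dataset.append((obs, reward, pose))       # (o_i, r_i, P_i)
    return dataset
```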

Visualization of training trajectories generated in PyBullet and Gaussian Splat.

The result is a training set that is both physically realistic and massive in scale, all derived from a few minutes of video.

Ingredient 2: The Latent World Model

With data in hand, the next step is training the robot’s internal model of the environment. A World Model allows an agent to predict the consequences of its actions. “If I move left, what will I see?”

WoMAP’s world model operates in a latent space. Instead of trying to predict every single pixel of the next image (which is computationally expensive and error-prone), it compresses the image into a compact vector representation (\(z_t\)) and predicts how that vector changes.

World Model Architecture for simultaneous dynamics and rewards prediction.

The architecture consists of three parts:

  1. Observation Encoder (\(h_\theta\)): Compresses the raw image into a latent state \(z_t\). The researchers found that using a pre-trained DINOv2 encoder (frozen, not fine-tuned) provided the most robust visual features.
  2. Dynamics Predictor (\(q_\psi\)): Predicts the next latent state \(z_{t+1}\) given the current state and an action.
  3. Rewards Predictor (\(v_\phi\)): Predicts the utility of the state. “How much of the target object is visible here?”
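Putting those three pieces together, here is a minimal PyTorch sketch of the architecture (layer sizes and the action/language dimensions are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, z_dim=768, a_dim=7, lang_dim=512):
        super().__init__()
        # h_theta: frozen, pre-trained DINOv2 features (ViT-B/14 outputs 768-d)
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.encoder.parameters():
            p.requires_grad = False
        # q_psi: predicts the next latent state from the current latent and action
        self.dynamics = nn.Sequential(
            nn.Linear(z_dim + a_dim, 1024), nn.ReLU(), nn.Linear(1024, z_dim))
        # v_phi: predicts the detector-confidence reward, conditioned on a
        # language embedding e_g of the target query
        self.reward = nn.Sequential(
            nn.Linear(z_dim + lang_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def encode(self, image):
        with torch.no_grad():                     # the encoder stays frozen
            return self.encoder(image)            # z_t

    def step(self, z, action):
        return self.dynamics(torch.cat([z, action], dim=-1))   # z_{t+1}

    def predict_reward(self, z, e_g):
        return self.reward(torch.cat([z, e_g], dim=-1))        # predicted reward
```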

The Secret Sauce: Reconstruction-Free Training

Most traditional world models (like Dreamer) use image reconstruction as a training signal. They try to decode the latent state back into pixels and check if it matches the original image.

The WoMAP authors argue that this is unnecessary and even detrimental for this task. Reconstruction forces the model to care about irrelevant details (like the texture of the wall) rather than the task at hand (finding the object). Furthermore, it leads to training instability.

Instead, WoMAP uses Dense Reward Distillation. They supervise the model using the confidence scores from an object detector. If the robot moves to a position where the object detector says “I am 90% sure this is a banana,” the model learns that this transition yields a high reward. This effectively distills the knowledge of a massive vision model (GroundingDINO) into the lightweight world model.
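In training terms, this means the loss has no pixel decoder at all: the model regresses detector confidences and matches the encoded next frame. A hedged sketch of one training step, reusing the `LatentWorldModel` above (the paper's exact loss weighting may differ):

```python
import torch.nn.functional as F

def training_step(model, batch, e_g):
    """`batch` holds consecutive observations, the action between them, and the
    GroundingDINO confidence score used as the dense reward label."""
    z_t = model.encode(batch["obs_t"])
    z_next_target = model.encode(batch["obs_t1"])        # no pixels reconstructed
    z_next_pred = model.step(z_t, batch["action"])

    dynamics_loss = F.mse_loss(z_next_pred, z_next_target)          # latent consistency
    reward_loss = F.mse_loss(model.predict_reward(z_next_pred, e_g),
                             batch["detector_conf"])                # reward distillation
    return dynamics_loss + reward_loss
```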

Frozen vs. Finetuned DINOv2 Encoder. Finetuning the DINOv2 encoder generally leads to training instability, negatively impacting performance.

The ablation study above highlights why they froze the DINOv2 encoder. Attempting to fine-tune the encoder (the lines dropping to zero or fluctuating wildly) led to instability, whereas using the frozen, pre-trained features (the smoother top curves) resulted in stable learning.

Ingredient 3: Planning with VLM Guidance

A world model can predict the consequences of actions, but it doesn’t inherently know which actions are worth trying in the first place.

In a complex room, searching randomly is inefficient. Humans use common sense: “The apple is likely in the fruit bowl, not under the stapler.” Large Language Models (LLMs) and VLMs possess this semantic common sense.

WoMAP combines the two:

  1. Proposal: The robot sends its current view to a VLM (like GPT-4) and asks for suggestions. The VLM might suggest: “Look behind the bowl” or “Check the left side of the table.”
  2. Grounding (The WoMAP Step): These suggestions are high-level and clumsy. WoMAP takes these candidates and uses its World Model to simulate the specific trajectories. It runs an optimization loop (Model Predictive Control) to fine-tune the exact camera angles and movements that maximize the predicted reward.

\[
\begin{aligned}
\max_{a_{t:t+T}} \quad & \sum_{\tau=1}^{T} \Big( \mathbb{E}_{v_\phi}\big[\, r_{t+\tau} \mid z_{t+\tau}, e_g \,\big] - \gamma \big\lVert a_{t+\tau-1} - a_{t+\tau-2} \big\rVert_1 \Big), \\
\text{subject to} \quad & z_{t+\tau} \sim q_\psi\big( z_{t+\tau} \mid z_{t+\tau-1}, a_{t+\tau-1} \big) \quad \forall \tau \in [T]
\end{aligned}
\]

Mathematically, the system solves the optimization problem above: it searches for a sequence of actions \(a_{t:t+T}\) that maximizes the expected reward of seeing the object, conditioned on the language embedding \(e_g\) of the target query, while penalizing jerky movements (the smoothness term weighted by \(\gamma\)).
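A hedged sketch of that grounding loop, reusing the `LatentWorldModel` from earlier (hyperparameters and the gradient-based optimizer choice are illustrative):

```python
import torch

def ground_proposal(model, z0, e_g, a_init, gamma=0.1, iters=50, lr=0.05):
    """Refine one coarse, VLM-derived action sequence by gradient ascent
    on the world model's predicted reward, per the objective above."""
    actions = a_init.clone().requires_grad_(True)    # (T, a_dim) coarse trajectory
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        z, total_reward = z0, 0.0
        for a in actions:                            # imagined rollout, no rendering
            z = model.step(z, a)
            total_reward = total_reward + model.predict_reward(z, e_g)
        smoothness = (actions[1:] - actions[:-1]).abs().sum()   # L1 action-change penalty
        loss = -(total_reward - gamma * smoothness)  # negate: Adam minimizes
        opt.zero_grad(); loss.backward(); opt.step()
    return actions.detach()
```

WoMAP runs this refinement for each VLM proposal and executes the action sequence with the highest predicted reward.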

Experiments and Results

The researchers evaluated WoMAP in both simulation (PyBullet) and real-world environments (using the TidyBot) across tasks of varying difficulty.

Qualitative Performance

The visual difference in behavior is striking. In the figure below, the robot is tasked with finding a banana hidden behind a mug.

  • WM-Grad (Blue): A basic world model planner without VLM guidance gets stuck in local minima, taking a circuitous path.
  • VLM (Yellow): The VLM suggests directions but lacks the precision to actually get a good view.
  • WoMAP (Orange): It moves decisively around the occlusion to view the target.

Visualization of the TidyBot’s trajectories for all planners. When asked to find an object, e.g., a banana occluded by a mug, WoMAP finds the target object (banana) more efficiently than the other planners. As illustrated, the WM-Grad computes inefficient, circuitous paths, while the DP does not look behind occlusions. See Section 4.3 and the paper’s video for more details. Further, we show images from the scene and wrist cameras at different timesteps when planning with WoMAP (right).

Quantitative Success

The data backs up the visuals. The team tested across different scene difficulties (Easy, Medium, Hard) and initialization difficulties (how bad the starting view is).

PyBullet evaluation tasks and results. Success rates (translucent bars) and efficiency scores (solid bars) in active object localization across PyBullet scenes (presented in the order of increasing difficulty) and initial-pose conditions: easy (E), medium (M), and hard (H). WoMAP outperforms all baseline methods in all scenes and initial-pose conditions.

In the PyBullet evaluation above, WoMAP (Orange bars) consistently dominates.

  • vs. VLM: The VLM often fails completely (success rates near 0.1-0.2 in hard tasks) because it hallucinates actions that are physically impossible or ineffective.
  • vs. Diffusion Policy (DP): The imitation learning baseline (Pink) struggles to generalize. It acts “habitually,” moving in directions it saw in training data even if that doesn’t make sense for the new scene layout.

The results hold true in the Gaussian Splat simulation environments as well:

Gaussian Splat evaluation tasks and results. Success rates (translucent bars) and efficiency scores (solid bars) in active object localization across Gaussian Splat scenes and initial-pose conditions: easy (E), medium (M), and hard (H). As in the PyBullet scenes, WoMAP outperforms all baseline methods via effective action grounding and optimization.

Sim-to-Real Transfer

Perhaps the most impressive result is the Zero-Shot Sim-to-Real Transfer. The model was trained entirely on data generated from Gaussian Splats. It was never trained on the physical robot hardware.

Yet, when deployed on the real TidyBot:

Table 1: Success rates (%) for zero-shot sim-to-real transfer for VLM and WoMAP.

WoMAP achieved a 63% success rate in random real-world scenes (“GS-Random”), compared to 0% for the VLM directly. The VLM failed in the real world largely because it kept suggesting actions that violated the robot’s joint limits or ignored physical constraints. WoMAP’s world model acted as a reality check, grounding those suggestions into feasible motions.

Generalization Capabilities

Finally, a robust robot must handle variations in lighting and phrasing. The researchers tested WoMAP under extreme lighting conditions (red light, blue light) and background changes.

Visual generalization setup: lighting and background conditions.

While performance dropped slightly (as expected), WoMAP maintained a 50% success rate even under drastically different lighting, demonstrating that the latent features learned by DINOv2 and the World Model are robust to visual noise.

Visual generalization results for various background and lighting conditions.

It also generalized semantically. If trained to find a “banana,” it could still succeed if asked to find a “sweet thing” or “tropical food,” thanks to the language embeddings used in the reward predictor.
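This works because the reward predictor is conditioned on a language embedding \(e_g\) rather than a fixed class label, so semantically related queries land near each other in embedding space. A small illustration of that idea with an off-the-shelf CLIP text encoder (the paper's exact choice of embedding model may differ):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")
queries = ["banana", "sweet thing", "tropical food", "stapler"]
with torch.no_grad():
    e = model.encode_text(clip.tokenize(queries)).float()
    e = e / e.norm(dim=-1, keepdim=True)     # unit-normalize the embeddings
print((e @ e[0]).tolist())                   # cosine similarity to "banana"
```

Queries like “sweet thing” should land closer to “banana” than an unrelated object like “stapler”, which is exactly the structure the reward predictor exploits.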

Generalization plots for unseen queries and objects in the same category: (left) banana, (center) scissors, (right) mug. We see a positive correlation between the semantic similarity (cosine distance) of the objects/queries to the most similar object among our training objects and the efficiency score, which reflects the model’s performance.

Conclusion

WoMAP represents a significant step forward in robotic active perception. By stepping away from the need for expensive human demonstrations and unstable image reconstruction objectives, it offers a scalable “recipe” for teaching robots to see.

The key takeaways are:

  1. Simulation is data: Gaussian Splatting allows us to turn brief videos into infinite training playgrounds.
  2. Don’t reconstruct, distill: Predicting rewards (from foundation models) is more efficient than reconstructing pixels.
  3. Trust, but verify: VLMs have great common sense but poor physical sense. World Models bridge this gap by grounding language proposals in physical reality.

As robots move from structured factories into our messy, unpredictable homes, architectures like WoMAP—which combine semantic reasoning with physical foresight—will be essential for their success.