The dream of general-purpose robotics is a machine that can walk into a messy kitchen, identify a teapot and a cup, and pour you a drink without having been explicitly programmed for that specific teapot or that specific cup.
In recent years, we have seen massive leaps in Vision-Language Models (VLMs). These models (like GPT-4V) have incredible “common sense.” They can look at an image and tell you, “That is a teapot, you hold it by the handle, and you pour liquid from the spout.” However, knowing what to do is very different from knowing exactly how to do it in 3D space.
A VLM might say “pour the tea,” but a robot needs to know: “Move the end-effector to coordinates \((x, y, z)\), rotate by \(\theta\) degrees, and ensure the spout aligns with the cup’s center.” This is the gap between semantic reasoning and spatial precision.
Today, we are diving deep into a research paper that proposes a solution to this problem: OmniManip. This system introduces a way to translate the high-level common sense of VLMs into precise, actionable 3D spatial constraints, creating a robot that can handle open-ended tasks in unstructured environments.

1. The Context: Why is this hard?
Before we get into the mechanics of OmniManip, we need to understand the current landscape of robotic manipulation.
The Limitation of VLMs
VLMs are trained primarily on 2D internet data (images and text). While they understand semantics remarkably well, they lack a fine-grained understanding of 3D geometry. If you ask a VLM to control a robot arm directly, it often hallucinates or gives vague instructions that result in the robot grasping the air or knocking the object over.
The Cost of Vision-Language-Action (VLA) Models
One solution is to fine-tune these models on robotic datasets, creating Vision-Language-Action (VLA) models. The problem? It’s incredibly expensive. Collecting high-quality robotic data takes a long time, and the resulting models are often “agent-specific”—meaning if you train on one robot arm, the model might not work on a different one.
The Representation Problem
To control a robot, we need an intermediate representation—a way to describe the object to the robot.
- Keypoints: Some methods try to identify specific points on an object (e.g., “handle center”). However, extracting these from 2D images is unstable, especially if the object is partially blocked from view.
- 6D Poses: Others estimate the full position and orientation of the object. While robust, this usually requires pre-existing 3D models of the specific objects you want to manipulate, which doesn’t work in an “open world” where the robot encounters new things.
2. The OmniManip Solution
The researchers behind OmniManip propose a different approach. Instead of fine-tuning the VLM, they use it for what it’s best at: reasoning. They then bridge the gap to control using a novel concept called Object-Centric Interaction Primitives.
The core insight is simple but powerful: An object’s function defines its geometry.
A teapot has a handle for grasping and a spout for pouring. These features define a “Canonical Space”—a standardized coordinate system based on the object’s function. By mapping objects to this space, OmniManip can generate precise instructions like “grasp at this point” and “align with this vector.”
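To make this concrete, here is a minimal sketch (my own naming, not the authors' code) of what an object-centric interaction primitive in canonical space might look like as a data structure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionPrimitive:
    """A functional point and direction expressed in the object's canonical frame."""
    name: str              # e.g. "spout", "handle", "opening center"
    point: np.ndarray      # (3,) interaction point in canonical coordinates
    direction: np.ndarray  # (3,) unit vector giving the interaction direction

    def to_world(self, R: np.ndarray, t: np.ndarray) -> "InteractionPrimitive":
        """Map the primitive into the world frame using the object's 6D pose (R, t)."""
        return InteractionPrimitive(
            name=self.name,
            point=R @ self.point + t,
            direction=R @ self.direction,
        )
```

Because the point and direction live in the canonical frame, the same primitive can be reused for any teapot-like object once its pose is estimated.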
The Framework Overview
The OmniManip system operates in a dual closed-loop process. It doesn’t just guess a plan and hope for the best; it plans, checks its work, executes, and continuously corrects itself.

As shown in the figure above, the workflow follows these steps (a minimal code-style sketch follows the list):
- Instruction & Observation: The robot receives a command (“Pour tea”) and looks at the scene (RGB-D observation).
- Task Partitioning: The VLM identifies the relevant objects (Active: Teapot, Passive: Cup) and breaks the task into stages (Stage 1: Grasp Teapot, Stage 2: Pour).
- Primitive Extraction: For each stage, the system extracts “Interaction Primitives” (points and directions) in the object’s canonical space.
- Spatial Constraints: These primitives are converted into mathematical constraints (e.g., distance and angle rules).
- Closed-Loop Execution: The robot moves, using a tracker to update the object’s position in real-time.
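Here is a pseudocode-style sketch of that dual closed-loop flow; every function name is a hypothetical placeholder for illustration, not the authors' API:

```python
def run_task(instruction, rgbd_observation):
    # 1. Task partitioning: the VLM names active/passive objects and stages.
    stages = vlm_partition_task(instruction, rgbd_observation)

    for stage in stages:
        # 2-3. Extract interaction primitives and turn them into spatial constraints.
        primitives = extract_primitives(stage, rgbd_observation)
        constraints = build_constraints(stage, primitives)

        # Loop 1: closed-loop planning -- resample until the VLM approves the plan.
        while not vlm_check_rendered_plan(render(primitives, constraints)):
            primitives = resample_primitives(stage, rgbd_observation)
            constraints = build_constraints(stage, primitives)

        # Loop 2: closed-loop execution -- track poses and re-solve at every step.
        while not stage_completed(constraints):
            poses = track_object_poses()                      # 6D pose tracker
            action = minimize_constraint_loss(constraints, poses)
            execute(action)
```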
3. Core Method: From Pixels to Primitives
This is the heart of the paper. How do we turn a picture of a teapot into a mathematical constraint?
Step 1: Mesh Generation and Canonicalization
First, the system needs a 3D understanding of the objects. It uses a 6D Pose Estimator to understand where the object is and a 3D generation network to create a temporary 3D mesh of the object. This allows the system to establish a Canonical Space—essentially a standard reference frame for that object type.
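As a rough illustration (not the authors' implementation), canonicalization boils down to expressing observed geometry in the object's own functional frame once its 6D pose is known:

```python
import numpy as np

def to_canonical(points_world: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Transform world-frame points (N, 3) into the object's canonical frame.

    R (3x3) and t (3,) are the estimated object-to-world rotation and translation,
    so the inverse transform maps observations back into the standardized frame.
    """
    return (points_world - t) @ R  # each row becomes R.T @ (p - t)
```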
Step 2: Grounding Interaction Points
The system needs to know where to interact with the object. The authors categorize interaction points into two types:
- Visible/Tangible: Points you can see, like a handle.
- Invisible/Intangible: Abstract points, like the center of a cup’s opening (which is technically empty space).
To find these, OmniManip uses a visual prompting mechanism. It overlays a grid on the image and asks the VLM to identify the coordinates. For “invisible” points, it uses multi-view reasoning within the canonical space to infer the location accurately.
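Here is a hedged sketch of the visual-prompting idea; the grid size and the `query_vlm` / `image_with_grid_overlay` helpers are hypothetical stand-ins, not the paper's exact interface:

```python
import numpy as np

def ground_interaction_point(image: np.ndarray, object_name: str, part: str,
                             grid: int = 10) -> tuple[int, int]:
    """Ask a VLM to pick a grid cell for an object part, then return that cell's pixel center."""
    h, w = image.shape[:2]
    prompt = (f"The image is overlaid with a {grid}x{grid} grid labeled (row, col). "
              f"Which cell contains the {part} of the {object_name}?")
    row, col = query_vlm(image_with_grid_overlay(image, grid), prompt)  # hypothetical helpers
    # Convert the chosen cell back to a pixel coordinate (the cell center).
    return int((row + 0.5) * h / grid), int((col + 0.5) * w / grid)
```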

Step 3: Sampling Interaction Directions
Knowing where to touch isn’t enough; the robot needs to know the orientation. This is where the Canonical Space shines. Objects usually function along “Principal Axes”—the main geometric lines of the object (like the vertical axis of a bottle or the spout axis of a teapot).
OmniManip samples these principal axes as candidate directions. It then uses the VLM to describe these axes semantically and a Large Language Model (LLM) to score them based on the task.

For example, if the task is “pour,” the system identifies the axis coming out of the spout as the correct interaction direction.
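A minimal sketch of this sampling-and-scoring step, using PCA via SVD as a stand-in for extracting principal axes (the VLM/LLM query helpers are hypothetical):

```python
import numpy as np

def propose_interaction_direction(mesh_vertices: np.ndarray, task: str) -> np.ndarray:
    """Sample candidate axes from the canonical mesh and let the LLM pick one for the task."""
    centered = mesh_vertices - mesh_vertices.mean(axis=0)
    # Principal directions of the object geometry (rows of Vt from the SVD).
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    candidates = list(axes) + [-a for a in axes]  # consider both signs of each axis

    descriptions = [describe_axis_with_vlm(a) for a in candidates]  # hypothetical
    scores = [score_axis_with_llm(task, d) for d in descriptions]   # hypothetical
    return candidates[int(np.argmax(scores))]
```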
Step 4: Defining Spatial Constraints
Once the system has the Point (\(\mathbf{p}\)) and the Direction (\(\mathbf{v}\)), it combines them into an Interaction Primitive (\(\mathcal{O}\)).
The robot then defines the task as a set of Spatial Constraints (\(C_i\)). These constraints describe the required relationship between the Active Object (e.g., teapot) and the Passive Object (e.g., cup).
These constraints usually include:
- Distance Constraints (\(d_i\)): How close should the interaction points be?
- Angular Constraints (\(\theta_i\)): How should the interaction directions align?
Formally, the task at each stage is expressed as a set of spatial constraints \(\{C_i\}\), each pairing a target distance and a target angle between the interaction primitives of the active and passive objects.
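As a rough sketch (notation mine, which may differ from the paper's exact formulation), such a constraint set could be written as:

\[
C_i = \Big( d_i\big(\mathbf{p}_{\text{act}}, \mathbf{p}_{\text{pas}}\big) \le \epsilon_d,\;\; \theta_i\big(\mathbf{v}_{\text{act}}, \mathbf{v}_{\text{pas}}\big) \le \epsilon_\theta \Big), \qquad i = 1, \dots, N
\]

where \(d_i\) measures the distance between the interaction points, \(\theta_i\) the angle between the interaction directions, and \(\epsilon_d, \epsilon_\theta\) are task-dependent tolerances.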
4. The Dual Closed-Loop System
A major contribution of this paper is recognizing that “open-loop” systems (plan once, execute blindly) rarely work in the real world. VLM hallucinations or slight physical bumps can cause failure. OmniManip introduces two loops to solve this.
Loop 1: Closed-Loop Planning (Self-Correction)
VLMs can hallucinate. They might think the handle is on the wrong side. To fix this, OmniManip uses a mechanism called RRC (Resampling, Rendering, Checking).
- Render: The system simulates the interaction based on its current plan.
- Check: It shows this simulation to the VLM and asks, “Does this look right for the task?”
- Resample: If the VLM says “No,” the system resamples the interaction primitives (tries a different axis or point) and loops again.

This “imagination” step allows the robot to catch errors before it even moves a muscle.
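A hedged sketch of the RRC loop; the rendering and VLM query functions below are placeholders for illustration, not the authors' code:

```python
def rrc_plan(stage, max_attempts: int = 5):
    """Resample-Render-Check: keep proposing primitives until the VLM approves."""
    primitives = sample_primitives(stage)                  # initial proposal
    for _ in range(max_attempts):
        preview = render_interaction(stage, primitives)    # simulate the planned interaction
        if vlm_approves(preview, stage.description):       # "Does this look right for the task?"
            return primitives
        primitives = sample_primitives(stage)              # resample and try again
    raise RuntimeError("No valid plan found after resampling")
```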
Loop 2: Closed-Loop Execution (Real-Time Tracking)
Once the plan is verified, the robot begins to move. However, the real world keeps changing: a grasp might slip, or the object might get bumped.
OmniManip uses an off-the-shelf 6D object pose tracker to continuously update the poses of the active and passive objects. Motion planning is then formulated as a real-time optimization problem: at each step, the robot minimizes a loss function to find the best next pose for its end-effector (gripper).
This objective combines three loss terms that the robot balances simultaneously (a rough sketch of the formulation follows these descriptions):
Constraint Loss (\(\mathcal{L}_C\)): This ensures the robot adheres to the spatial rules we defined earlier (aligning axes, maintaining distance).

Collision Loss (\(\mathcal{L}_{\text{collision}}\)): This ensures the robot doesn’t hit obstacles. It penalizes the robot if it gets too close to anything other than the target.

Path Loss (\(\mathcal{L}_{\text{path}}\)): This ensures the movement is smooth, balancing translation (moving) and rotation.

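As a hedged sketch (notation mine; the paper's exact terms and weights may differ), the per-step optimization might take the form:

\[
T_t^{*} \;=\; \arg\min_{T_t}\; \mathcal{L}_C(T_t) \;+\; \lambda_{\text{collision}}\,\mathcal{L}_{\text{collision}}(T_t) \;+\; \lambda_{\text{path}}\,\mathcal{L}_{\text{path}}(T_t),
\]

with, for example,

\[
\mathcal{L}_C = \sum_i \Big( \big\| d_i(T_t) - d_i^{*} \big\|^2 + \big\| \theta_i(T_t) - \theta_i^{*} \big\|^2 \Big), \qquad
\mathcal{L}_{\text{path}} = \alpha \,\| \Delta \mathbf{t} \| + \beta \,\| \Delta \mathbf{R} \|,
\]

where \(T_t\) is the end-effector pose at time \(t\), \(d_i^{*}\) and \(\theta_i^{*}\) are the target distance and angle from constraint \(C_i\), and \(\Delta \mathbf{t}, \Delta \mathbf{R}\) are the translation and rotation increments between consecutive steps.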
5. Experiments and Results
The researchers tested OmniManip on a real Franka Emika Panda robotic arm across 12 diverse tasks. These ranged from rigid object interactions (Pouring tea, Recycling a battery) to articulated objects (Opening a drawer, Closing a laptop).
Quantitative Success
OmniManip significantly outperformed baseline methods like VoxPoser, CoPa, and ReKep.

Note: In the table above, OmniManip (Closed-loop) achieves the highest success rates, particularly in complex tasks like “Recycle the battery” (80%) and “Pick up the cup on the dish” (80%).
The Importance of Stability
One of the biggest advantages of OmniManip is its stability. Because it relies on the object’s canonical space (function-based geometry) rather than just surface pixels, it is much more consistent.
In the comparison below, look at the “ReKep” row vs. the “Ours” (OmniManip) row. ReKep relies on keypoints that can be unstable or clustered incorrectly. OmniManip’s bounding boxes and axes (derived from canonical space) provide a much cleaner plan.

Viewpoint Invariance
A common failure mode in robotics is camera angle. If you move the camera, pixel-based methods often fail because the object “looks” different.
OmniManip, however, maps the object to a 3D canonical space first. This means it doesn’t matter if the camera is at \(0^\circ\), \(45^\circ\), or \(90^\circ\)—the “up” vector of the teapot is always “up” in canonical space.
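A tiny numerical illustration of this invariance, assuming a perfect pose estimate (purely for intuition, not part of the paper):

```python
import numpy as np

def canonical_axis(axis_in_camera: np.ndarray, R_obj_to_cam: np.ndarray) -> np.ndarray:
    """Recover an object axis in canonical coordinates from a camera-frame observation."""
    return R_obj_to_cam.T @ axis_in_camera

up_canonical = np.array([0.0, 0.0, 1.0])          # the teapot's canonical "up" axis
for angle in np.deg2rad([0, 45, 90]):             # three different camera viewpoints
    R_cam = np.array([[1, 0,              0             ],
                      [0, np.cos(angle), -np.sin(angle)],
                      [0, np.sin(angle),  np.cos(angle)]])
    observed = R_cam @ up_canonical               # how the axis appears in that camera's frame
    print(canonical_axis(observed, R_cam))        # always recovers [0, 0, 1]
```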

The quantitative impact of viewpoint changes is stark. As shown below, OmniManip maintains high success rates across all angles, whereas baseline methods struggle significantly when the view changes.

Why Closed-Loop Matters
The researchers also ran ablation studies to prove the necessity of the closed-loop execution. Without real-time pose tracking (Open-loop), the robot cannot adjust if the object shifts slightly during the grasp or if the grasp wasn’t perfect.

As seen above, without feedback, the robot might execute the correct motion, but because the object slipped or moved, the action fails (e.g., pouring tea onto the table instead of the cup).
Automating Data Generation
Finally, the authors highlighted a powerful application for OmniManip: Data Generation. Training robots via Behavior Cloning (mimicking demonstrations) is popular, but collecting human demonstrations is tedious.
OmniManip works well enough “out of the box” (zero-shot) that it can be used to generate synthetic demonstrations automatically. The researchers collected 150 trajectories using OmniManip and successfully trained a separate policy to perform the tasks.

6. Conclusion and Key Takeaways
OmniManip represents a significant step forward in general-purpose robotics. It tackles the “VLM Gap”—the disconnect between high-level reasoning and low-level control—not by expensive training, but by structured representation.
Key Takeaways:
- Don’t Re-train, Translate: Instead of fine-tuning VLMs, OmniManip translates their output into mathematical spatial constraints.
- Canonical Space is King: Understanding objects via their functional 3D axes is far more robust than relying on 2D image pixels.
- Check Your Work: The dual closed-loop system (Plan-Check-Refine and Track-Execute) is essential for robustness in the real world.
By combining the semantic “brains” of VLMs with the geometric “logic” of canonical spaces, OmniManip brings us closer to robots that can truly handle the unpredictable nature of our everyday world.
This blog post summarizes the research paper “OmniManip” by Pan et al. (2025). All images and data are credited to the original authors.