Introduction

The dream of the “generalist robot”—a machine capable of folding laundry, cooking dinner, and tidying a workshop without explicit reprogramming—has long captivated roboticists. Recently, Vision-Language Models (VLMs) like GPT-4 and Gemini have brought us closer to this reality. These models possess immense “common sense” knowledge about the world. They can look at a messy room and tell you what needs to be cleaned up.

However, knowing what to do is very different from knowing how to do it.

In robotics, this is the distinction between high-level planning (deciding to pick up a spatula) and low-level reasoning (determining the exact millimeter-precise coordinate to grasp the handle without knocking over the pot). While VLMs excel at the former, their ability to handle the latter—the precise, low-level spatial reasoning required for physical interaction—remains an open question.

This brings us to ManipBench, a novel benchmark proposed by researchers at the University of Southern California. ManipBench is designed to rigorously evaluate how well VLMs understand the physical consequences of robot actions. By converting complex manipulation tasks into over 12,000 visual multiple-choice questions, the authors provide the first comprehensive look at whether foundation models are ready to act as low-level robotic agents.

ManipBench Overview showing real robot data, fabric manipulation, and simulation data examples.

The Problem: The High-Level vs. Low-Level Gap

Traditionally, robots are controlled by task-specific policies trained on massive datasets of physical movements. The recent trend, however, attempts to bypass this data bottleneck by using pre-trained VLMs. If a model has seen millions of images of mugs, surely it knows how to hold one?

Not necessarily. A VLM might describe a “mug” perfectly but fail to identify the specific “keypoint” (coordinate) a robot gripper must target to lift it safely. Previous benchmarks have tested high-level logic or physics knowledge (e.g., “Is this object brittle?”), but they rarely test the generation of executable robotic trajectories.

ManipBench fills this gap by focusing on affordance prediction. It asks the model to look at a scene and identify specific interaction points—picking points, placing locations, and motion vectors—that constitute a valid robot action.

Core Method: Constructing ManipBench

The researchers curated a massive dataset of 12,617 questions spanning five categories of manipulation: pick-and-place, articulated objects (e.g., drawers), deformable objects (e.g., fabric), tool use, and dynamic manipulation.

The Pipeline: From Raw Data to VLM Questions

To evaluate a VLM’s ability to control a robot without actually running a physical robot for every single test (which is slow and expensive), the authors devised a clever visual prompting pipeline.

As illustrated in the figure below, the process begins with raw data from real robots or simulations.

  1. Scene Capture: An RGB-D image of the workspace is captured.
  2. Preprocessing (MOKA-style): The system uses tools like GroundedSAM (Grounding DINO paired with the Segment Anything Model) to detect and segment the relevant objects.
  3. Annotation: A grid overlay and specific keypoints (labeled \(p_1, p_2\), etc.) are drawn on the image (see the sketch after this list).
  4. Question Generation: The VLM is presented with the annotated image and a prompt (e.g., “Which point should the robot grasp to open the drawer?”).
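
As a simplified illustration of the annotation step (step 3), the sketch below overlays a coordinate grid and labeled keypoints on an RGB image with OpenCV. The grid spacing, colors, and keypoint coordinates are placeholders of my own; the benchmark’s actual annotation code may differ.

```python
# Minimal sketch of the annotation step: draw a grid and labeled candidate
# keypoints on an RGB image. All coordinates and styling are placeholders.
import cv2
import numpy as np

def annotate_image(img, keypoints, grid_step=80):
    """Overlay a grid and keypoints labeled p1, p2, ... on a copy of `img`."""
    out = img.copy()
    h, w = out.shape[:2]

    # Light grid so the VLM can reason about coarse image coordinates.
    for x in range(0, w, grid_step):
        cv2.line(out, (x, 0), (x, h), (200, 200, 200), 1)
    for y in range(0, h, grid_step):
        cv2.line(out, (0, y), (w, y), (200, 200, 200), 1)

    # Mark each candidate keypoint with a dot and a label like "p1".
    for i, (x, y) in enumerate(keypoints, start=1):
        cv2.circle(out, (x, y), 6, (0, 0, 255), -1)
        cv2.putText(out, f"p{i}", (x + 8, y - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    return out

# Example usage with a blank placeholder image and three made-up keypoints.
image = np.full((480, 640, 3), 255, dtype=np.uint8)
annotated = annotate_image(image, [(120, 300), (320, 240), (500, 150)])
cv2.imwrite("annotated_scene.png", annotated)
```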

The ManipBench pipeline showing data processing from real/sim environments to question generation.

This method effectively turns a robotics problem into a Visual Question Answering (VQA) problem, allowing for rapid, large-scale evaluation of dozens of models.
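
To make the VQA framing concrete, here is a minimal sketch of how an annotated image and its candidate keypoints could be packaged into a multiple-choice question and scored. The `ManipQuestion` structure and the `query_vlm` client are hypothetical stand-ins of mine, not the benchmark’s actual prompts or evaluation harness.

```python
# Minimal sketch of turning an annotated scene into a multiple-choice VQA item.
# `query_vlm(image_path, prompt) -> str` is a hypothetical stand-in for the
# VLM API under evaluation.
from dataclasses import dataclass

@dataclass
class ManipQuestion:
    image_path: str      # annotated RGB image with grid + labeled keypoints
    task: str            # task instruction, e.g. "open the drawer"
    options: list[str]   # candidate keypoint labels, e.g. ["p1", "p2", "p3", "p4"]
    answer: str          # ground-truth option label

def build_prompt(q: ManipQuestion) -> str:
    opts = ", ".join(q.options)
    return (
        f"The image shows a robot workspace with labeled keypoints ({opts}). "
        f"Task: {q.task}. Which single keypoint should the gripper target? "
        f"Answer with one label only."
    )

def evaluate(questions, query_vlm) -> float:
    """Return multiple-choice accuracy over a list of ManipQuestion items."""
    correct = sum(
        query_vlm(q.image_path, build_prompt(q)).strip() == q.answer
        for q in questions
    )
    return correct / len(questions)
```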

Data Sources and Task Diversity

To ensure the benchmark is robust, the data comes from three distinct sources:

1. Public Robotic Manipulation Datasets (Real World)

The authors utilized datasets like DROID and Bridge, which contain thousands of hours of real robot operations. They extracted successful trajectories and asked VLMs to predict the correct “pick” and “place” points used by the expert demonstrators.

A sample Type 2 question from the robotic dataset requiring the model to identify picking and placing points.
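
As a rough sketch of how such a multiple-choice question could be derived from an expert demonstration (this is my own assumption; the paper’s actual distractor-selection procedure may differ), one could take the demonstrated pick point as the correct option and sample distractor points far away from it:

```python
# Hypothetical sketch: turn one expert pick point into a 4-way multiple-choice
# question by sampling distractor keypoints elsewhere in the image.
import random

def make_pick_question(expert_pick, image_size, n_distractors=3, min_dist=60):
    """expert_pick: (x, y) pixel of the demonstrated grasp.
    Returns shuffled candidate points and the index of the correct one."""
    w, h = image_size
    distractors = []
    while len(distractors) < n_distractors:
        cand = (random.randrange(w), random.randrange(h))
        # Keep only distractors that are clearly far from the true grasp point.
        dx, dy = cand[0] - expert_pick[0], cand[1] - expert_pick[1]
        if dx * dx + dy * dy >= min_dist * min_dist:
            distractors.append(cand)
    options = distractors + [expert_pick]
    random.shuffle(options)
    return options, options.index(expert_pick)

options, answer_idx = make_pick_question((310, 220), image_size=(640, 480))
print(options, "correct option index:", answer_idx)
```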

2. Fabric Manipulation (Deformable Objects)

One of the most challenging areas in robotics is handling deformable objects like cloth. Fabrics change shape unpredictably. ManipBench dedicates a significant portion of its questions to this, testing specific dimensions of understanding such as:

  • Fabric State: Is the cloth crumpled or flat?
  • Folding Logic: Which corner must be moved to fold the cloth diagonally?
  • Inverse Dynamics: If I pull point A to point B, what does the final shape look like?

A sample question testing the VLM’s understanding of temporal sequences in fabric folding.

3. Simulation (Articulated and Dynamic Tasks)

For tasks that are difficult to set up repeatedly in the real world, the authors turned to simulation. This includes interacting with articulated objects (like drawers and cabinets) and dynamic tasks that require physics intuition, such as shooting a ball into a hoop.

Simulated tasks including placing carrots, closing drawers, straightening ropes, and ball shooting.

In the simulation tasks, the VLM might be asked to select a specific vector (arrow) that represents the correct force and direction to shoot a basketball, or the correct contact point to close a drawer without jamming it.
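
To give a feel for the physics intuition these questions probe, here is a toy projectile check of my own (not part of the benchmark) that scores candidate launch vectors for the ball-shooting task by where each ballistic arc ends up relative to the hoop:

```python
# Toy illustration of the reasoning behind the ball-shooting questions: given
# candidate launch vectors, pick the one whose arc passes closest to the hoop.
# All numbers are made up for illustration.
G = 9.81  # gravitational acceleration, m/s^2

def height_at(vx, vy, x):
    """Height of a projectile launched from the origin when it reaches
    horizontal distance x (assumes vx > 0 and no air resistance)."""
    t = x / vx
    return vy * t - 0.5 * G * t * t

def best_shot(candidates, hoop_x=2.0, hoop_y=1.2):
    """candidates: list of (vx, vy) launch velocities in m/s.
    Returns the index of the vector whose arc comes closest to hoop height."""
    errors = [abs(height_at(vx, vy, hoop_x) - hoop_y) for vx, vy in candidates]
    return errors.index(min(errors))

# Four candidate "arrows"; only one puts the ball near the hoop.
candidates = [(1.0, 1.0), (2.0, 4.0), (3.0, 2.0), (4.0, 6.0)]
print("best candidate index:", best_shot(candidates))  # -> 3
```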

Annotated images showing keypoints for closing a drawer.

Experiments and Results

1. Overall Model Performance

The researchers evaluated 33 representative VLMs across 10 model families, including closed-source giants (GPT-4o, Gemini-1.5/2.5) and open-source contenders (InternVL, Qwen-VL, LLaVA).

The results highlight a significant variance in performance. Unsurprisingly, the most capable closed-source models led the pack.

  • Gemini-2.5-pro emerged as the top performer across most categories, demonstrating strong spatial reasoning.
  • GPT-4o and o1 also performed well but generally trailed slightly behind the top Gemini model in specific low-level spatial tasks.
  • Open-Source Models: Larger open-source models like InternVL2.5-78B and Qwen2.5-VL-72B showed impressive results, often rivaling the closed-source models. However, smaller models (<7B parameters) struggled significantly, often performing near random chance.

The table below details the performance on simulation tasks. Note the variability; for example, while Gemini-2.5-pro dominates generally, Gemini-2.0-flash actually scored highest on the “Place Carrot” task.

Performance table of various VLMs on simulation tasks.

2. The Scaling Law in Spatial Reasoning

A key insight from the paper is the relationship between model size and low-level reasoning capability. The authors analyzed open-source families (InternVL and Qwen) and found a strong correlation: bigger is better.

As shown in the scaling curves below, accuracy increases roughly linearly with the logarithm of model size up to a certain point (the “knee”), after which gains continue but with diminishing returns. This suggests that low-level physical reasoning is an emergent capability that benefits from scale in both training data and parameters.
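
As a quick illustration of what fitting such a log-linear trend looks like (the size and accuracy values below are invented placeholders, not numbers from the paper), one can regress accuracy on the logarithm of parameter count:

```python
# Illustrative fit of a log-linear scaling trend. The (size, accuracy) pairs
# are invented placeholders, NOT measurements from the paper.
import numpy as np

params_b = np.array([1, 2, 8, 26, 38, 78])                  # model size, billions of parameters
accuracy = np.array([28.0, 33.0, 41.0, 48.0, 50.0, 54.0])   # hypothetical benchmark accuracy (%)

# Fit accuracy ≈ a * log10(params) + b.
a, b = np.polyfit(np.log10(params_b), accuracy, deg=1)
print(f"slope: {a:.1f} accuracy points per 10x parameters, intercept: {b:.1f}")

# Extrapolated accuracy for a hypothetical 150B-parameter model on this trend.
print(f"projected 150B accuracy: {a * np.log10(150) + b:.1f}%")
```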

Log-scaling curves showing accuracy vs. model size for InternVL families.

3. Fabric Manipulation Deep Dive

Fabric manipulation proved to be a nuanced challenge. The radar chart below breaks down performance across different reasoning dimensions.

  • Models are decent at Spatial Reasoning (identifying corners).
  • They struggle more with Fabric-Fabric Interaction (how one cloth affects another) and Inverse Dynamics (predicting the outcome of a move).
  • Crucially, there is still a massive gap between the best models (blue/purple lines) and Human performance (the outer red line), indicating that while VLMs are promising, they do not yet possess human-level physical intuition.

Radar chart comparing VLM performance across fabric manipulation dimensions.

The “Real” Test: Does ManipBench Predict Robot Success?

The most critical question for any benchmark is: Does a high score on the test translate to the real world?

To verify this, the authors conducted physical experiments using a UR5 robot arm. They set up 7 distinct manipulation tasks that were not part of the question bank. They then used the VLMs to generate the robot’s control actions in real time.

The results were striking: the authors found a strong positive correlation (Pearson correlation coefficient of 0.889) between a model’s score on ManipBench and its success rate in controlling the physical robot.
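
For readers who want to run the same kind of check on their own models, the sketch below computes a Pearson correlation between benchmark scores and real-robot success rates. The numbers are hypothetical placeholders, not the paper’s data.

```python
# Sketch of the benchmark-vs-robot correlation check. The score/success pairs
# are hypothetical placeholders, not the paper's measurements.
import numpy as np

benchmark_score = np.array([35.0, 42.0, 48.0, 55.0, 61.0, 68.0])  # ManipBench accuracy (%)
robot_success   = np.array([0.10, 0.20, 0.30, 0.45, 0.55, 0.70])  # real-robot success rate

r = np.corrcoef(benchmark_score, robot_success)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```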

Diagram of the real-world robot experiment pipeline validating the benchmark.

This validation is crucial. It means that researchers can confidently use ManipBench as a proxy to evaluate and improve their models before committing to the expensive and time-consuming process of real-world robot testing.

Conclusion & Future Implications

ManipBench represents a pivotal step in embodied AI. It moves the evaluation of Vision-Language Models from abstract chat interfaces to concrete, actionable spatial reasoning.

Key Takeaways:

  1. VLMs have potential: The best models perform significantly better than random chance at low-level manipulation reasoning.
  2. Size matters: There is a clear scaling law; larger models possess better physical intuition.
  3. The Human Gap: Current SOTA models still lag behind human performance, particularly in complex interactions like fabric manipulation and dynamic physics.
  4. Valid Proxy: Performance on ManipBench strongly predicts real-world robotic capability.

For students and researchers entering the field, this paper highlights that while “Foundation Models” are powerful generalists, specialized benchmarking is required to unlock their potential in robotics. The future of general-purpose robots relies not just on better planning, but on closing the gap in low-level, pixel-precise reasoning—a gap that ManipBench now helps us measure.