Imagine you are looking at a photograph of a room you’ve never visited. Someone asks, “Will that couch fit through the doorway?” Even though you don’t have a tape measure, you can probably make a very good guess. You intuitively know that a standard door is about 80 inches high, and using that mental “ruler,” you estimate the size of the couch. This ability to use context clues to measure the world is second nature to humans.

For Artificial Intelligence, however, this is a monumental struggle.

While state-of-the-art Vision-Language Models (VLMs) like GPT-4 and Gemini can describe a sunset in poetic detail or explain the complex relationships between people in a crowd, they famously stumble when asked simple quantitative questions like, “How many centimeters wide is that laptop?” or “How far away is that chair?”

In a fascinating new paper, Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models, researchers uncover why these models fail and propose a surprisingly simple, zero-shot prompting technique to fix it. By teaching models to “think” like humans—specifically, by identifying and using reference objects—they achieved massive performance improvements without fine-tuning the models or feeding them extra data.

In this deep dive, we will explore the new benchmark they created, the statistical discovery that led to their breakthrough, and the “SpatialPrompt” method that you can use to make VLMs smarter about space.

The Challenge: The Ill-Posed Problem of Monocular Vision

To understand why this research is significant, we first need to understand the difficulty of the task. The researchers focused on quantitative spatial reasoning from a single monocular image.

In computer vision, estimating depth or size from a single 2D image is technically an “ill-posed problem.” Mathematically, a small object close to the camera looks identical to a giant object far away. There is no inherent depth information in a standard JPEG.

Humans solve this ambiguity using semantic priors and contextual cues. We know a coffee mug isn’t the size of a building. We know floor tiles usually come in standard sizes. We use these known quantities to triangulate unknown ones.
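To make that intuition concrete, here is a toy sketch (not from the paper) of how a single known reference turns pixel measurements into metric ones, under the simplifying assumption that the reference and the target sit at a comparable depth:

```python
# Toy illustration (not from the paper): a known reference object converts
# pixel extents into metric units, assuming the reference and the target
# lie at a comparable depth so one pixels-per-meter scale applies to both.
def estimate_size_m(ref_real_m: float, ref_pixels: float, target_pixels: float) -> float:
    """Scale the target's pixel extent by the reference object's known real size."""
    meters_per_pixel = ref_real_m / ref_pixels
    return target_pixels * meters_per_pixel

# A door (roughly 2.0 m tall) spans 400 px; a cabinet beside it spans 180 px.
print(estimate_size_m(2.0, 400, 180))  # -> 0.9, i.e. about 0.9 m tall
```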

Prior to this work, benchmarking how well AI could do this was difficult. Existing benchmarks were either heavily reliant on automatic data generation (which can be noisy) or focused only on qualitative concepts (e.g., “is the cup left or right of the spoon?”). There was a gap in high-precision, human-annotated data for quantitative measurements (e.g., “the cup is 20cm from the spoon”).

Introducing Q-Spatial Bench

To rigorously test VLMs, the authors introduced Q-Spatial Bench, a dataset consisting of 271 high-quality, human-annotated questions.

Figure 1: We introduce a human expert-annotated benchmark dedicated to quantitative spatial reasoning: Q-Spatial Bench.

As shown in Figure 1, the benchmark is divided into two distinct splits to ensure a fair evaluation:

  1. Q-Spatial-ScanNet: This subset repurposes high-quality RGB-D (Red, Green, Blue + Depth) scans from the ScanNet dataset. It includes indoor environments where distances can be verified against 3D point cloud data. The questions cover five categories: object width, object height, horizontal distance, vertical distance, and direct distance (the three distance types are illustrated in the sketch after this list).
  2. Q-Spatial++: To ensure models aren’t just memorizing training data (since ScanNet is public and likely in the training sets of models like GPT-4), the researchers collected a fresh set of images using iPhones in diverse environments (indoor, outdoor, day, night). They physically measured distances in the real world to create ground truth labels.
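The three distance categories reduce to simple geometry between two 3D points (for example, object centers recovered from the ScanNet point clouds). The sketch below is purely illustrative and treats z as the vertical axis:

```python
# Illustrative geometry for the distance categories between two 3D points;
# z is treated as the vertical axis here purely for illustration.
import math

def distances(p, q):
    dx, dy, dz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    return {
        "direct": math.sqrt(dx**2 + dy**2 + dz**2),  # straight-line distance
        "horizontal": math.sqrt(dx**2 + dy**2),      # distance in the ground plane
        "vertical": abs(dz),                         # difference in height
    }

print(distances((0.0, 0.0, 0.5), (3.0, 4.0, 2.5)))
# {'direct': 5.385..., 'horizontal': 5.0, 'vertical': 2.0}
```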

The creation of a human-annotated benchmark is a critical contribution because it offers a “gold standard” compared to previous automated attempts.

Table 1: Comparison of quantitative spatial reasoning benchmarks. Q-Spatial Bench is a human expert-annotated benchmark, specifically designed for quantitative spatial questions.

The “Aha!” Moment: Why is GPT-4o Winning?

With the benchmark in place, the researchers tested four commercial giants: Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-4V, and GPT-4o.

The initial results showed a clear winner.

Table 2: GPT-4o outperforms other commercial VLMs in quantitative spatial reasoning. We evaluate the success rate on each split of Q-Spatial Bench.
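A note on the metric: judging from the \(p(\delta_{\leq 2})\) notation used later in the paper, a response appears to count as a success when the estimate falls within a factor of two of the ground truth. A minimal scorer under that assumption might look like this:

```python
# Sketch of a success-rate scorer, assuming (from the paper's delta <= 2
# notation) that an estimate counts as correct when it is within a factor
# of two of the ground-truth value. The threshold is an interpretation,
# not code quoted from the paper.
def is_success(estimate: float, ground_truth: float, delta: float = 2.0) -> bool:
    ratio = max(estimate / ground_truth, ground_truth / estimate)
    return ratio <= delta

def success_rate(pairs) -> float:
    return sum(is_success(est, gt) for est, gt in pairs) / len(pairs)

# success_rate([(1.8, 1.0), (0.4, 1.0), (95.0, 100.0)]) -> 0.666...
```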

GPT-4o achieved a success rate of nearly 70% on the ScanNet split, while Gemini 1.5 Pro struggled significantly, often refusing to answer the questions entirely. But the raw numbers weren’t the most interesting part of the study. The researchers wanted to know why GPT-4o was so much better.

The Role of Reference Objects

The team performed a qualitative analysis of GPT-4o’s textual responses. They noticed a pattern: in many of the instances where GPT-4o got the answer right, it spontaneously generated a reasoning path that involved a reference object.

For example, when asked to estimate the height of a stack of towels, GPT-4o might output reasoning like: “The towels are sitting on a standard bathroom counter, which is typically 36 inches high. Based on this, the stack appears to be…”

This mirrors human reasoning. To prove this statistically, the authors annotated whether the model’s response used a reference object and correlated it with accuracy.

Table 4: Contingency table of whether GPT-4o’s responses use any reference objects as guidance and the success rate of the responses.

The data in the table above is striking. When GPT-4o used a reference object, its success rate jumped to 83%. When it didn’t, the success rate was only 64%.
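As a back-of-the-envelope check, the unadjusted odds ratio implied by those two rates already comes out to roughly 2.7, which happens to match the adjusted factor reported by the regression described next:

```python
# Unadjusted odds ratio implied by the 83% vs. 64% success rates quoted above.
odds = lambda p: p / (1 - p)
print(odds(0.83) / odds(0.64))  # ~2.7
```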

To ensure this wasn’t a fluke (perhaps the questions where references are available are just easier?), they ran a logistic regression model. They controlled for variables like the difficulty of the dataset split and the magnitude of the distance being measured.

Table 5: Logistic regression to analyze the effectiveness of GPT-4o. Using a reference object in reasoning statistically significantly increases the likelihood of generating a response with relative error less than 2.

The regression analysis confirmed their hypothesis. As indicated by the equation below, the presence of a reference object (\(X_r\)) was a statistically significant predictor of success (\(p(\delta_{\leq 2})\)), increasing the odds of an accurate estimate by a factor of roughly 2.7.

Logistic Regression Equation
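For readers who want the functional form, the description above implies a standard logistic model along the following lines, where \(X_r\) indicates whether a reference object was used and \(X_s\) and \(X_d\) stand in for the controls the authors mention (dataset split and distance magnitude). This is a hedged reconstruction; the exact parameterization and coefficients are in the paper.

\[
\log \frac{p(\delta_{\leq 2})}{1 - p(\delta_{\leq 2})} = \beta_0 + \beta_r X_r + \beta_s X_s + \beta_d X_d, \qquad e^{\beta_r} \approx 2.7 .
\]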

The Solution: SpatialPrompt

Inspired by the observation that “reference objects = accuracy,” the authors developed a prompting strategy to force this behavior in all models. They call it SpatialPrompt.

This is a zero-shot technique, meaning it requires no model training. Instead of simply asking “How tall is the chair?”, SpatialPrompt instructs the VLM to follow a specific reasoning structure:

  1. Identify potential reference objects in the image.
  2. Estimate the size of those reference objects based on common knowledge.
  3. Use those references as a scale to measure the target object.

Figure 3: We propose SpatialPrompt, a specialized text prompt designed to improve quantitative spatial reasoning capabilities in VLMs.

The prompt comes in two flavors: SpatialPrompt-Single (a concise instruction) and SpatialPrompt-Steps (a detailed, multi-step breakdown). The “Steps” version explicitly tells the model to propose a plan and perform a coarse-to-fine estimation.
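To give a feel for how this looks in practice, here is a minimal sketch in the spirit of SpatialPrompt-Single, sent through the OpenAI Python client. The instruction text paraphrases the three steps above rather than quoting the paper's exact prompt, and the model name and image path are placeholders:

```python
# A SpatialPrompt-style instruction (paraphrased, not the paper's exact wording)
# sent with the OpenAI Python client. Requires an API key in the environment.
import base64
from openai import OpenAI

SPATIAL_PROMPT = (
    "Answer the question about the image. Before giving a number:\n"
    "1. Identify a few reference objects whose real-world size you know "
    "from common knowledge (e.g., doors, outlets, countertops).\n"
    "2. State your size estimate for each reference object.\n"
    "3. Use those references as a scale to estimate the quantity asked for, "
    "then report a final answer with a unit.\n\n"
    "Question: {question}"
)

def ask_spatial(image_path: str, question: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SPATIAL_PROMPT.format(question=question)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: ask_spatial("kitchen.jpg", "How far is the chair from the fridge, in cm?")
```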

Experiments and Results

The results of applying SpatialPrompt were nothing short of transformative.

Closing the Gap

When applied to underperforming models, SpatialPrompt unlocked capabilities that seemed non-existent before.

  • Gemini 1.5 Pro: Improved its success rate by over 47 points.
  • GPT-4V: Improved by 30 points.
  • Gemini 1.5 Flash: Improved by 20 points.

Even GPT-4o, which was already using this logic spontaneously in some cases, saw improvements in consistency.

Table 8: Full table of the success rate of Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-4V, and GPT-4o.

Table 8 provides the granular details. Notice how Gemini 1.5 Pro with the standard prompt has a success rate of nearly zero (0.59) on the ScanNet split—likely due to safety refusals or inability to ground the request. With SpatialPrompt-Steps, that score rockets up to 53.65.

The Mechanism of Action

The authors verified that the prompt works for the intended reason: it increases the usage of reference objects. They plotted the frequency of reference object usage against the success rate across different models and prompting strategies.

Figure 4: Success rates versus the frequencies of using reference objects. Green corresponds to SpatialPrompt, Red corresponds to Zero-shot CoT, and Blue corresponds to standard prompt.

The scatter plot above reveals a strong positive correlation. The Green triangles (SpatialPrompt) cluster in the top right, indicating high reference usage and high accuracy. The Blue squares (Standard Prompt) cluster in the bottom left. This empirically confirms the paper’s core thesis: Reasoning via reference objects directly improves accuracy.

Does it work on Open Source Models?

The researchers also tested LLaVA, a popular open-source VLM. The results were mixed.

Table 7: Success rate of LLaVA in Q-Spatial-ScanNet and Q-Spatial++.

While LLaVA performed surprisingly well on the ScanNet split, its performance dropped significantly on the diverse Q-Spatial++ split, suggesting it might have memorized ScanNet data during pre-training. Furthermore, SpatialPrompt did not consistently help LLaVA. The authors hypothesize that smaller open-source models may lack the “instruction following” capability or the vast internal knowledge base required to estimate the size of reference objects effectively.

Qualitative Analysis and Failures

While SpatialPrompt is powerful, it isn’t magic. The method relies on the model correctly identifying a reference object and knowing its size. If the model hallucinates the size of the reference object, the final calculation will be wrong.

One common failure mode observed in GPT-4o involved floor tiles.

Figure 13: Common failure cases of GPT-4o.

In the example above, the model attempts to use floor tiles as a reference. However, floor tiles vary wildly in size (unlike, say, a standard electrical outlet or a door handle). Misjudging the tile size leads to a cascading error in the distance estimation.
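The propagation is easy to see with hypothetical numbers: because the reference size enters the estimate multiplicatively, the final answer ends up off by exactly the same factor as the tile-size guess.

```python
# Hypothetical numbers showing how a wrong reference size propagates.
tiles_counted = 8
assumed_tile_m = 0.30   # model's guess at the tile size
actual_tile_m = 0.60    # the tiles' real size

estimated_distance = tiles_counted * assumed_tile_m  # 2.4 m
true_distance = tiles_counted * actual_tile_m        # 4.8 m
print(true_distance / estimated_distance)            # 2.0: same factor as the size error
```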

This highlights a limitation: the method works best when the scene contains standardized objects (microwaves, doors, outlets) and struggles in barren environments or scenes with non-standard architecture.

Conclusion: Unlocking Capabilities Without Training

The paper Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning offers a compelling lesson for the AI community. Often, we assume that to make a model better at a specific task (like measuring distance), we need to train it on massive amounts of new data or change its architecture.

However, this research demonstrates that the capability to reason spatially is already present in large models. It just needs to be “elicited” correctly. By mimicking the human cognitive process of using reference objects—contextualizing the unknown with the known—we can unlock quantitative reasoning in VLMs today.

For students and practitioners, the takeaways are clear:

  1. Context is King: When working with VLMs, don’t just ask for an answer; ask the model to ground its answer in the visual context.
  2. Prompt Engineering Matters: A well-structured prompt that enforces a logical reasoning path (Chain of Thought) can outperform a raw model significantly.
  3. Benchmarks Drive Progress: The creation of Q-Spatial Bench provides a necessary measuring stick to ensure we aren’t just relying on anecdotal evidence of AI capabilities.

As VLMs continue to be integrated into robotics and real-world assistants, these “reference-based” reasoning paths will be essential for agents that need to navigate and understand the physical dimensions of our world.