Introduction

We are currently witnessing a golden age of Large Multimodal Models (LMMs). Systems like GPT-4o and Gemini have demonstrated an uncanny ability to interpret visual scenes, describe objects in poetic detail, and answer questions about images with human-like fluency. If you show these models a picture of a busy street, they can list the cars, the pedestrians, and the color of the traffic light.

But there is a subtle, yet critical, difference between identifying what is in an image and understanding where it is in physical space.

Most current benchmarks evaluate models on semantic understanding (e.g., “Is there a cat?”) or 2D spatial relationships (e.g., “Is the cat to the left of the dog?”). However, the real world is three-dimensional. To navigate environments, operate robots, or drive autonomously, an AI must master 6D spatial reasoning. This involves understanding an object’s 3D location (X, Y, Z coordinates) and its 3D orientation (rotation and facing direction).
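Concretely, a 6D pose bundles three translation components with three rotation components. Below is a minimal sketch of such a record in Python; the class name, the Euler-angle convention, and the example values are illustrative assumptions, not anything defined by the benchmark.

```python
from dataclasses import dataclass
import math

@dataclass
class Pose6D:
    """A minimal 6D pose: 3D location plus 3D orientation."""
    x: float      # location (e.g., metres in the camera or world frame)
    y: float
    z: float
    yaw: float    # orientation as Euler angles, in radians
    pitch: float  # (real systems often prefer quaternions or rotation matrices)
    roll: float

    def heading(self) -> tuple[float, float]:
        """Unit vector in the ground plane that the object is facing."""
        return (math.cos(self.yaw), math.sin(self.yaw))

# A car 10 m ahead and 2 m to the left of the camera, rotated 90 degrees
car = Pose6D(x=-2.0, y=0.0, z=10.0, yaw=math.pi / 2, pitch=0.0, roll=0.0)
print(car.heading())  # approximately (0.0, 1.0)
```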

This brings us to a significant gap in current AI evaluation. Do these powerful models actually understand 3D space, or are they just relying on 2D patterns? To answer this, researchers have introduced Spatial457, a new, scalable, and unbiased synthetic benchmark designed to diagnose the limits of LMMs in 6D spatial reasoning.

The Problem: The “2D Bias” in Visual AI

Before diving into the solution, we must understand why measuring spatial reasoning is so difficult.

The primary hurdle is the data itself. Real-world datasets are inherently biased. For example, in datasets like nuScenes (used for autonomous driving), over 70% of objects are clustered into a single predominant orientation. Cars are usually photographed from the front or rear; chairs are usually facing the table. If an AI correctly guesses that a car is facing “forward,” is it reasoning about the geometry, or is it just betting on the statistical probability?

Furthermore, existing benchmarks often stop at 2.5D features (like depth) or simple 2D relationships. There has been no comprehensive framework to evaluate the full spectrum of spatial intelligence—from recognizing an object to predicting if it will crash into something else in 3D space.

Spatial457: A New Diagnostic Framework

To address these limitations, the authors developed Spatial457. This is a synthetic dataset, meaning the images are computer-generated. While “synthetic” might sound less robust, it is actually a superpower in this context. It allows for the generation of unbiased scenes where objects can be placed anywhere and rotated in any direction, forcing the AI to look at the visual evidence rather than relying on training biases.

As illustrated in the figure below, the benchmark is structured around a cascading evaluation system. It starts with simple recognition and culminates in complex collision predictions.

Overview of the Spatial457 benchmark and the distribution of 3D poses. The left side shows the cascading difficulty levels, while the right side compares biased real-world datasets vs. the balanced Spatial457 dataset.

The Four Core Capabilities

The researchers identified four fundamental pillars of spatial reasoning:

  1. Multiple Object Recognition: Can the model identify and distinguish between different items in a cluttered scene?
  2. 2D Locations: Can the model understand relative positions in the flat image plane (left, right, top, bottom)?
  3. 3D Locations: Can the model perceive depth and distance? This is crucial for understanding occlusion (what is in front of what).
  4. 3D Orientation: Can the model determine the precise pose of an object? For example, is a bus facing left, or is it facing the camera?

The 5 Levels of Difficulty

Spatial457 arranges these capabilities into a “roadmap” of increasing complexity, creating seven distinct question types across five difficulty levels. This structure allows researchers to pinpoint exactly where a model’s reasoning breaks down.

Level 1 & 2: The Basics

  • L1 (Single Object): Simple questions like “What color is the double bus?” to establish a baseline for visual recognition.
  • L2 (Multiple Objects): Questions requiring comparison, such as “Is there another object of the same color as the double bus?” This tests the model’s ability to parse a scene with multiple entities.

Level 3: 2D Spatial Reasoning

Here, the benchmark introduces spatial relationships from the camera’s perspective. A typical question might be, “What is the shape of the object to the left of the chopper?” This is the standard for most current Visual Question Answering (VQA) benchmarks.

Level 4: The 3D Leap

This is where Spatial457 diverges from traditional tests.

  • 3D Pose: The model is asked about the orientation of objects. E.g., “What shape is the object parallel to the brown object?”
  • Occlusion: The model must understand depth. E.g., “Is the red wagon occluded by the yellow bus?”

Level 5: 6D Reasoning and Collision

This is the ultimate test. It requires integrating 3D location and 3D orientation to predict interactions.

  • 6D Spatial Relationships: Describing positions in an object’s own frame of reference (e.g., “to the left of the car” meaning the car’s left from the driver’s seat, not the camera’s left).
  • Collision Prediction: Questions like “Will the red wagon collide with a yellow bus if it moves forward?”

Examples of question types ranging from 2D spatial queries to complex collision predictions involving future state estimation.

As shown in the examples above, answering a collision question requires the model to know exactly where the object is (3D Location), which way it is pointing (3D Orientation), and project a trajectory (Collision Logic).
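To make that chain of reasoning concrete, here is a toy sketch of the collision logic: step the object forward along its facing direction and test whether it ever comes close to the other object. The function name, coordinates, and thresholds are invented for illustration; the benchmark defines its own collision criterion inside its scene generator.

```python
import math

def will_collide(pos_a, yaw_a, pos_b, radius=1.0, max_dist=20.0, step=0.5):
    """Toy check: advance object A along its heading and test whether it
    ever passes within `radius` of object B.

    pos_a, pos_b: (x, z) ground-plane coordinates.
    yaw_a: A's heading in radians (0 = facing the +x direction).
    """
    dx, dz = math.cos(yaw_a), math.sin(yaw_a)
    travelled = 0.0
    while travelled <= max_dist:
        ax = pos_a[0] + dx * travelled
        az = pos_a[1] + dz * travelled
        if math.hypot(ax - pos_b[0], az - pos_b[1]) < radius:
            return True
        travelled += step
    return False

# Red wagon at the origin facing +x; yellow bus 5 m ahead on the same line.
print(will_collide(pos_a=(0.0, 0.0), yaw_a=0.0, pos_b=(5.0, 0.0)))          # True
# Rotate the wagon 90 degrees and its path no longer crosses the bus.
print(will_collide(pos_a=(0.0, 0.0), yaw_a=math.pi / 2, pos_b=(5.0, 0.0)))  # False
```

The 6D relationship questions lean on the same ingredients: the second object’s camera-frame coordinates must be rotated into the first object’s own frame before “left” or “behind” has a well-defined meaning.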

Experiments and Results

The researchers evaluated a suite of state-of-the-art models, including proprietary API models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) and open-source models (LLaVA, Qwen2-VL). The results revealed a stark reality: while these models are linguistic geniuses, they are spatially challenged.

The Performance Drop

The table below summarizes the accuracy of various models across the difficulty levels.

Performance comparison table across all 7 question types and 5 difficulty levels.

Notice the trend:

  1. High Baseline: Humans achieve roughly 80-90% accuracy across the board.
  2. Model Decline: Top models like GPT-4o start strong at Level 1 (~74%) but degrade significantly as complexity increases. By Level 5 (Collision), accuracy drops to roughly 38%.
  3. The Gap: There is a massive chasm between human performance and AI performance in the higher reasoning tiers (L4 and L5).

This confirms the hypothesis: accurate object recognition does not translate to accurate spatial reasoning.

Qualitative Analysis: Why Do They Fail?

To understand why the numbers drop, we can look at specific examples of model outputs. In the figure below, we see GPT-4o attempting different tasks on the same image.

Qualitative example of GPT-4o’s performance. It succeeds at recognition and 2D tasks (green) but fails at 3D pose and 6D spatial reasoning (red).

In the green block, the model correctly identifies sizes and counts objects. However, in the red block (Level 4 and 5), it fails. It cannot correctly identify which object is facing the opposite direction of the road bike, nor can it determine spatial relationships from the perspective of the “fighter plane.” The model “sees” the objects but cannot construct a coherent 3D mental map of their orientations.

Quantifying the Weakness: RPDR

To scientifically measure how much a specific capability (like 3D orientation) hurts performance, the authors introduced a metric called the Relative Performance Dropping Rate (RPDR).

Equations defining the Relative Performance Dropping Rate (RPDR) for different spatial capabilities.

Essentially, this metric calculates the percentage drop in accuracy when a new variable (like depth or rotation) is introduced. The analysis showed that 3D reasoning (both location and orientation) caused the steepest drops in performance for almost all models. GPT-4o and Gemini particularly struggled with 3D orientation, suggesting that while they recognize objects well, they barely understand how those objects are oriented in space.
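The precise formulation is given in the equations above; the description “percentage drop in accuracy when a new variable is introduced” suggests a simple ratio along the lines of the sketch below. The function and the numbers here are illustrative assumptions, not the paper’s exact definition.

```python
def rpdr(acc_without: float, acc_with: float) -> float:
    """Relative drop in accuracy once an extra capability is required.

    acc_without: accuracy on questions that do not need the capability
                 (e.g., 2D-location questions).
    acc_with:    accuracy once the capability is required
                 (e.g., the same relationships posed in 3D).
    """
    return (acc_without - acc_with) / acc_without

# Illustrative (made-up) numbers: accuracy falls from 70% to 45%
# when 3D orientation enters the question.
print(f"{rpdr(0.70, 0.45):.1%}")  # 35.7%
```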

The Bias Problem: Guessing vs. Knowing

One of the most fascinating findings of the Spatial457 study is the revelation of prediction bias. Because the synthetic dataset is perfectly balanced (objects appear in all colors and orientations equally), any skew in the model’s answers represents an internal bias.

The researchers found that models have a strong tendency to predict certain attributes over others. For example, when unsure about a car’s orientation, models frequently guessed “Front” or “Left,” likely because their training data is full of car photos taken from those angles.

Heatmaps showing the distribution of color and pose attributes. Even with balanced ground truth, models exhibit biased predictions.

The heatmaps above illustrate this clearly. Ideally, the prediction grid should look diagonal (high accuracy) or uniform (random guessing). Instead, we see vertical bands, indicating that regardless of the actual pose, the model prefers to output specific labels.

To quantify this, the authors used the Coefficient of Variation (CV). A lower CV indicates less bias.

Equation for Coefficient of Variation (CV) used to measure prediction bias.

Table showing CV values for different attributes. Higher values in Pose indicate significant bias in orientation prediction.

As shown in the table, the CV for “Pose” is consistently high across models compared to “Size” or “Shape.” This indicates that current LMMs rely heavily on priors (statistical guesses) rather than visual evidence when it comes to 3D orientation.
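As a concrete illustration, the sketch below computes a CV over a model’s prediction histogram, assuming the standard definition (standard deviation divided by mean). The label counts are invented, so the output illustrates the behaviour of the metric rather than any number from the paper.

```python
import statistics

def coefficient_of_variation(pred_counts: dict[str, int]) -> float:
    """CV of a prediction histogram: standard deviation over mean.

    A perfectly uniform (unbiased) distribution gives CV = 0; the more a
    model collapses onto a few favourite labels, the larger the CV.
    """
    counts = list(pred_counts.values())
    return statistics.pstdev(counts) / statistics.mean(counts)

# Invented pose predictions over a balanced test set:
balanced = {"front": 25, "back": 25, "left": 25, "right": 25}
skewed   = {"front": 70, "back": 15, "left": 10, "right": 5}

print(coefficient_of_variation(balanced))            # 0.0
print(round(coefficient_of_variation(skewed), 2))    # 1.05
```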

Real-World Implications

Does this matter outside of synthetic benchmarks? The researchers extended their evaluation to real-world images using the SUN-RGBD dataset. They found that the issues persisted.

Example of real-world 3D pose reasoning. Models often use common sense (chairs face tables) rather than visual geometry to guess orientations.

In the example above, GPT-4o guesses correctly, but Gemini fails. More interestingly, the reasoning provided by the models often relies on “common sense” (e.g., “chairs usually face the counter”) rather than visual geometry. While common sense is useful, it is dangerous for an autonomous system to rely on it over actual visual data—imagine a self-driving car assuming another car is moving forward just because it’s in a lane, ignoring that it has spun out and is facing sideways.

The real-world distribution analysis confirmed the bias:

Distribution of 3D pose predictions in real-world tasks. GPT-4o shows a strong bias toward predicting ‘Front’ and ‘Back’.

Conclusion

The Spatial457 benchmark serves as a reality check for the AI community. It demonstrates that while Large Multimodal Models have made tremendous strides in semantic understanding, their grasp of 6D spatial reasoning remains rudimentary.

The key takeaways are:

  1. Complexity matters: Performance drops sharply as tasks move from recognition to 3D spatial reasoning and collision prediction.
  2. Orientation is a blind spot: Models struggle significantly with 3D poses, often relying on biased guessing rather than visual analysis.
  3. Diagnostic value: By decomposing spatial reasoning into specific capabilities, Spatial457 provides a roadmap for what needs to be fixed.

For AI to truly interact with the physical world—whether through robotics or augmented reality—it needs to move beyond describing pixels and start understanding space. Work like Spatial457 is the first step in bridging that gap.