Introduction

Imagine you are crossing a busy street. You see a white van and a cyclist. Your brain instantly processes not just what these objects are, but where they are in three-dimensional space and where they are going. You instinctively know that the van is facing you (potentially dangerous) while the cyclist is moving parallel to you. This is 3D spatial reasoning, a capability so fundamental to human cognition that we rarely think about it.

However, for Large Multimodal Models (LMMs) like GPT-4o or LLaVA, this is an incredibly difficult task. These models are trained primarily on vast amounts of 2D images and captions. While they can describe a “white van” and a “cyclist” with poetic detail, they often fail to answer simple questions about the physical reality of the scene: Is the van closer than the cyclist? Are they on a collision course?

The inability to reason in 3D limits the potential of AI in robotics, autonomous driving, and physical world interaction. To bridge this gap, researchers have introduced SpatialLLM, a new model designed with a “compound” approach that systematically injects 3D awareness into the data, architecture, and training process.

This infographic illustrates concepts related to ‘Spatial Reasoning’ using a street scene with a white van, pedestrian, cyclist, and buildings as visual examples. It highlights three key areas: Orientation Relationships, Spatial Reasoning Relationships, and Distance Relationships.

In this article, we will deconstruct the research behind SpatialLLM. We will explore why current models struggle with space, how the researchers engineered a 3D-informed design, and the results that show this method outperforming state-of-the-art proprietary models.

The Problem: The 2D Bias in AI

Current LMMs possess impressive visual capabilities, but they are fundamentally biased toward 2D representations. This bias stems from their training data. Most internet-scale datasets consist of images paired with captions like “a dog sitting on a mat.” These captions rarely describe 3D properties such as “the dog is 1.5 meters away from the camera and facing 45 degrees to the left.”

Consequently, while models can identify objects, they struggle with three specific types of spatial relationships:

  1. Distance Relationships: Judging depth and relative distance (e.g., “Which car is closer?”).
  2. Orientation Relationships: Understanding the 3D pose of an object (e.g., “Is the person facing the camera?”).
  3. Complex Spatial Reasoning: Combining location and orientation to understand interactions (e.g., “Is the car driving towards the pedestrian?”).

Previous attempts to fix this, such as SpatialVLM, focused primarily on distance and largely ignored 3D orientation. Without understanding orientation, a model cannot distinguish a car that is parking from one that is speeding toward an intersection.
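To make this concrete, here is a minimal geometric sketch (not from the paper) of the check this kind of reasoning implies: given an object's ground-plane position and yaw angle, decide whether it is heading towards another object. The function name, coordinate convention, and angular tolerance are illustrative assumptions.

```python
import numpy as np

def is_heading_towards(obj_pos, obj_yaw, target_pos, tol_deg=30.0):
    """Return True if an object at obj_pos, facing yaw angle obj_yaw
    (radians, measured on the ground plane), is heading roughly towards
    target_pos. Positions are 2D ground-plane coordinates in meters."""
    heading = np.array([np.cos(obj_yaw), np.sin(obj_yaw)])
    to_target = np.asarray(target_pos, dtype=float) - np.asarray(obj_pos, dtype=float)
    to_target /= np.linalg.norm(to_target) + 1e-8
    angle = np.degrees(np.arccos(np.clip(heading @ to_target, -1.0, 1.0)))
    return angle < tol_deg

# A van 10 m away and facing the pedestrian at the origin -> True
print(is_heading_towards(obj_pos=(10.0, 0.0), obj_yaw=np.pi, target_pos=(0.0, 0.0)))
# The same van facing away from the pedestrian -> False
print(is_heading_towards(obj_pos=(10.0, 0.0), obj_yaw=0.0, target_pos=(0.0, 0.0)))
```

Distance alone cannot answer this question; the yaw term is what separates a parked van from one bearing down on a pedestrian.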

Benchmarking the Gap: SpatialVQA

To solve a problem, you must first measure it. Because existing benchmarks focused mostly on 2D relationships (like “left of” or “right of”), the researchers created SpatialVQA. This dataset consists of 1,323 questions derived from the Omni3D dataset, covering urban and indoor scenes.

The image displays four distinct panels arranged in a two-by-two grid, each illustrating examples of spatial reasoning tasks involving 3D location and orientation relationships between objects marked as ‘Region [n]’.

As shown above, SpatialVQA challenges models to reason about:

  • Distance: Which bookshelf is closer to the chair?
  • Orientation: Are the two cars facing the same direction?
  • Spatial Reasoning: If you were in the driver’s seat, where would the bus be relative to you?

When a standard model like LLaVA-v1.5 attempts these questions, it achieves just 47.7% accuracy, little better than guessing on many of them. The goal of SpatialLLM is to push this number significantly higher.
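Scoring such a benchmark is simple in principle. The sketch below assumes a JSON file of {image, question, answer} records and a hypothetical model.answer() method; neither the schema nor the interface comes from the released benchmark, they are placeholders for illustration.

```python
import json

def evaluate(model, path="spatialvqa.json"):
    """Compute exact-match accuracy on a SpatialVQA-style question file.
    The record schema and model.answer() are illustrative assumptions."""
    with open(path) as f:
        records = json.load(f)
    correct = sum(
        model.answer(r["image"], r["question"]).strip().lower()
        == r["answer"].strip().lower()
        for r in records
    )
    return correct / len(records)
```

Since many of the questions are effectively two-way choices (closer or farther, same direction or not), an accuracy near 50% is roughly what guessing would produce.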

Core Method: A Compound 3D-Informed Design

The core contribution of this paper is not just a single new algorithm, but a compound design strategy. The authors argue that fixing spatial reasoning requires intervention across the entire lifecycle of the model: Data, Architecture, and Training Setup.

This diagram illustrates a multi-stage framework for training a vision-language model (VLM). It consists of four main sections: Data sources, Training stage components, Architecture flow, and Vision Encoder options.

Let’s break down each pillar of this design space.

1. Data: The Foundation of 3D Awareness

The most critical component is the data. Standard LMM training uses noisy image-text pairs or detailed captions that lack spatial precision. The researchers developed a pipeline to generate 3D-informed data by leveraging existing 3D datasets (like ImageNet3D) and using auxiliary tools to estimate depth and pose in other images.

They created two distinct types of data:

A. 3D-Informed Probing Data (3DI-Pb)

This data focuses on fundamental, objective properties of objects. It teaches the model “what” 3D attributes look like.

  • Content: Questions about depth, azimuth (horizontal angle), elevation (vertical angle), and distances between objects.
  • Scale: They curated datasets from OpenImages (1 million samples) and ImageNet3D (166k samples). The ImageNet3D data is particularly valuable because it contains human-annotated 3D orientations, which are cleaner than machine-generated labels.
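As a rough illustration of how probing pairs might be generated from 3D annotations, the sketch below turns a single annotated object into templated depth and orientation questions. The record schema, question templates, and azimuth-binning convention are all assumptions made for this example, not the paper's actual pipeline.

```python
# A hypothetical annotation record; the field names are illustrative.
# Azimuth is the object's facing angle in degrees, depth is the distance
# from the camera in meters.
ann = {"category": "car", "depth_m": 7.4, "azimuth_deg": 135.0}

def make_probing_qa(ann):
    """Turn one 3D annotation into templated probing question-answer pairs."""
    qa = [(
        f"How far is the {ann['category']} from the camera?",
        f"The {ann['category']} is roughly {ann['depth_m']:.1f} meters away.",
    )]
    # Discretize azimuth into four coarse bins; the mapping from angle to
    # phrase is an arbitrary convention chosen for this sketch.
    bins = ["facing the camera", "facing left", "facing away", "facing right"]
    idx = int(((ann["azimuth_deg"] + 45.0) % 360.0) // 90.0)
    qa.append((
        f"Which direction is the {ann['category']} facing?",
        f"The {ann['category']} is {bins[idx]}.",
    ))
    return qa

for q, a in make_probing_qa(ann):
    print(q, "->", a)
```

Answers generated from human-annotated ImageNet3D orientations inherit that label quality, which is why the authors single it out over machine-estimated poses.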

B. 3D-Informed Instruction Tuning Data (3DI-Ft)

This data focuses on high-level reasoning. It teaches the model “how” to think about the 3D attributes it perceives.

  • Content: Complex conversations about spatial relationships (e.g., “Describe the spatial arrangement of the vehicles”).
  • Scale: 1 million samples.
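The sketch below shows how a scene's geometry could be converted into a reasoning-oriented conversation in the LLaVA-style format. The field names and the single distance-ordering template are placeholders; the paper's generated conversations are considerably richer.

```python
def make_spatial_conversation(objs):
    """Compose a simple instruction-tuning sample describing the distance
    ordering of objects in a scene. `objs` is a list of dicts with
    illustrative fields (name, depth_m)."""
    ordered = sorted(objs, key=lambda o: o["depth_m"])
    names = ", ".join(o["name"] for o in ordered)
    return {"conversations": [
        {"from": "human",
         "value": "Describe the spatial arrangement of the objects by distance from the camera."},
        {"from": "gpt",
         "value": f"From nearest to farthest: {names}."},
    ]}

sample = make_spatial_conversation([
    {"name": "cyclist", "depth_m": 4.2},
    {"name": "white van", "depth_m": 9.8},
    {"name": "pedestrian", "depth_m": 2.1},
])
print(sample["conversations"][1]["value"])
# From nearest to farthest: pedestrian, cyclist, white van.
```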

The diagram compares two approaches to training Large Language Models (LLMs) using visual data. Left: Standard LMM Training Data. Right: 3D-Informed Training Data which includes orientation, depth, and spatial reasoning conversations.

Figure 4 above highlights the stark difference. Where standard data produces a caption like “A car on the road,” the 3D-informed pipeline produces data cards containing metric depth, camera calibration, and specific 3D orientation tags.

Below are qualitative examples of what this training data looks like. Notice how the questions explicitly demand reasoning about “real-world 3D orientations” rather than just 2D pixel positions.

The image displays examples of the training data. For instance, a question asks if a Corvette and another car are facing the same direction based on real-world 3D orientations.

2. Architecture: Enhancing the Vision Encoder

Most LMMs use CLIP as their vision encoder. CLIP is excellent at matching images to text, but it is trained on 2D internet images and often discards precise geometric information.

To improve 3D perception, the researchers experimented with Mixed Visual Encoders, combining the standard CLIP encoder with DINOv2.

  • Why DINOv2? DINOv2 is trained with self-supervision rather than text captions, so it tends to preserve local geometric structure and spatial detail better than CLIP.
  • The Result: Merging features from both encoders allows the model to retain semantic understanding (from CLIP) while gaining geometric precision (from DINOv2).
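A common way to implement such a mixed encoder is to concatenate the two models' patch features channel-wise and project the result back to the connector's input width. The sketch below shows that pattern with dummy tensors standing in for CLIP and DINOv2 outputs; the dimensions and the concatenate-then-project fusion are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class MixedVisionFeatures(nn.Module):
    """Fuse patch features from two vision encoders (e.g. CLIP and DINOv2)
    before they are passed to the vision-language connector."""

    def __init__(self, clip_dim=1024, dino_dim=1024, out_dim=1024):
        super().__init__()
        # Project the concatenated features back to the connector's width.
        self.proj = nn.Linear(clip_dim + dino_dim, out_dim)

    def forward(self, clip_feats, dino_feats):
        # clip_feats: (B, N, clip_dim), dino_feats: (B, N, dino_dim),
        # assumed to be spatially aligned (same number of patches N).
        fused = torch.cat([clip_feats, dino_feats], dim=-1)
        return self.proj(fused)

# Dummy patch features standing in for real encoder outputs:
fuse = MixedVisionFeatures()
tokens = fuse(torch.randn(1, 576, 1024), torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 1024])
```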

3. Training: The SpatialLLM Roadmap

Having the data and architecture is not enough; you must introduce them at the right time. The training pipeline for SpatialLLM involves multiple stages.

The authors propose a specific “roadmap” that upgrades a standard model (like LLaVA) into SpatialLLM step-by-step.

  1. Stage 1: 3D-Informed Alignment. In this stage, the vision encoder is connected to the Language Model (LLM) through a connector module. Instead of using only standard captions, the authors mix in the 3D-Informed Probing Data, which forces the connector to pass 3D-relevant information (such as orientation and depth) to the LLM.
  2. Stage 2: 3D-Informed Instruction Tuning. The model is then fine-tuned on the 3D-Informed Instruction Tuning Data, which teaches the LLM to use the 3D information it receives to answer complex user questions; a schematic version of this two-stage recipe is sketched below.
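The roadmap can be summarized as a simple two-stage recipe. The schematic config below restates the text above; the dataset names and the frozen/trainable split follow the common LLaVA-style convention and are placeholders rather than the paper's released configuration.

```python
# A schematic two-stage recipe, written as a plain config dictionary.
TRAINING_ROADMAP = {
    "stage1_alignment": {
        "trainable": ["connector"],  # vision encoder and LLM kept frozen (LLaVA-style assumption)
        "data": ["standard_captions", "3d_probing_openimages", "3d_probing_imagenet3d"],
        "goal": "pass depth and orientation information through the connector",
    },
    "stage2_instruction_tuning": {
        "trainable": ["connector", "llm"],
        "data": ["standard_instructions", "3d_spatial_conversations"],
        "goal": "teach the LLM to reason over the 3D signals it now receives",
    },
}

for stage, cfg in TRAINING_ROADMAP.items():
    print(f"{stage}: train {cfg['trainable']} on {cfg['data']}")
```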

The image contains two figures. Figure 5 compares the design instantiation of SpatialLLM against LLaVA and SpatialVLM. Figure 6 shows a bar chart of performance gains, illustrating the roadmap from LLaVA-v1.5 (47.7%) to the final SpatialLLM model (62.7%).

The visualization above (Figure 5 & 6) summarizes this progression.

  • Standard LLaVA (Blue): Uses standard data and a single encoder.
  • SpatialVLM (Green): Introduces spatial data only at the very end (Instruction Tuning).
  • SpatialLLM (Orange): Injects 3D data at the Alignment stage and the Tuning stage, and utilizes a mixed encoder.

The bar chart (Figure 6) reveals the impact of these choices. Starting from a baseline of 47.7%, adding the mixed encoder gives a small boost. Switching to a better LLM (Llama-3) helps further. But the massive jump comes from the 3D-informed data, pushing the score over 60%.

Experiments and Results

Does this compound design actually work in practice? The results on the SpatialVQA benchmark are decisive.

State-of-the-Art Comparison

SpatialLLM was compared against top-tier proprietary models (like GPT-4o and Claude 3.5 Sonnet) and open-source spatial models (like SpatialVLM).

Table 1. Comparison with state-of-the-art proprietary and open-source models. SpatialLLM achieves 62.7% average accuracy, beating GPT-4o (54.0%) and SpatialVLM (52.2%).

Key takeaways from the results:

  • Beating the Giants: SpatialLLM achieves an average accuracy of 62.7%, surpassing GPT-4o (54.0%) by 8.7 percentage points.
  • Orientation Mastery: The biggest gap is in 3D Orientation. SpatialLLM scores 86.3% on orientation tasks, while GPT-4o manages only 59.4% and the standard LLaVA model sits at 50.6%. This validates the inclusion of orientation-specific training data.
  • Consistent Gains: The model outperforms competitors across all three categories: distance, orientation, and spatial relationships.

Analyzing the “Why”: Ablation Studies

The researchers performed an ablation study to understand which parts of their design contributed most to the success.

Table 2 shows a thorough exploration of the design space. Figure 7 shows a qualitative comparison where GPT-4o fails to judge distance and orientation, while the 3D-LMM (SpatialLLM) succeeds.

Referring to the table in the image above:

  1. Architecture helps slightly: Moving from CLIP to Mixed Encoders (CLIP+DINOv2) provided a modest gain (~0.3 percentage points). Upgrading the LLM from Vicuna to Llama-3 added roughly another 1.0 point.
  2. Data is the game-changer:
     • Adding 3D data to Stage 2 (Instruction Tuning) resulted in a massive +10.7-point jump.
     • Adding 3D data to Stage 1 (Alignment) provided an additional +3.0 points.

The qualitative example in Figure 7 (bottom of the image above) illustrates this capability difference. When asked which car is closer or if chairs are facing similar directions, GPT-4o often hedges its bets (“cannot provide a definitive answer without visually assessing”). In contrast, SpatialLLM confidently and correctly identifies that the car in Region 1 is closer and that the chairs are facing different directions.

Conclusion and Implications

The development of SpatialLLM highlights a critical insight for the future of computer vision: Data curation is as important as model architecture.

Standard LMMs have hit a ceiling in spatial reasoning because they are trained on data that ignores the 3D nature of the world. By systematically constructing datasets that explicitly label 3D orientation and distance—and injecting this data at both the alignment and instruction tuning stages—the researchers were able to drastically outperform significantly larger, proprietary models.

Key Takeaways:

  • Compound Design: Success came from optimizing Data, Architecture, and Training together, not in isolation.
  • Orientation Matters: Previous works focused on depth. Adding 3D orientation was the key to unlocking complex spatial reasoning.
  • Open Source Wins: With the right specialized data, open-source models can beat general-purpose giants like GPT-4o on specific, complex tasks.

For students and researchers, SpatialLLM provides a roadmap for building “embodied” AI—models that don’t just look at the world, but understand their place within it.