Introduction

Imagine trying to pour water into a cup while closing one eye and looking through a paper towel tube. You lose depth perception, and your field of view is restricted. This is essentially how many modern robots operate when powered by standard Vision-Language Models (VLMs).

In recent years, the field of robotics has been revolutionized by Vision-Language-Action (VLA) models. These models leverage the massive knowledge base of internet-scale 2D data to help robots understand commands and recognize objects. However, there is a fundamental mismatch: these models are trained on flat, 2D images, but robots live and work in a complex, geometric 3D world. When a robot tries to grasp a bottle or open a drawer based solely on 2D inputs, it often struggles to reason about spatial depth and the precise geometry required for manipulation.

While 3D-specific imitation learning methods exist, they face a catch-22: we don’t have enough large-scale 3D datasets to train them to be as “smart” as their 2D counterparts.

Enter 3DS-VLA, a new approach that bridges this divide. This research proposes a method to equip powerful, pre-trained 2D models with comprehensive 3D spatial awareness without requiring massive new datasets or losing the semantic intelligence of the original model.

Figure 1: 3DS-VLA achieves comprehensive 3D spatial awareness by encoding 3D spatial observations with a pretrained 2D vision-language model and establishing 3D spatial constraints to facilitate spatial-temporal reasoning. It demonstrates generalizable capabilities across tasks, embodiments, and environmental settings.

As illustrated in Figure 1, the 3DS-VLA framework introduces a way to encode 3D point clouds using 2D encoders and utilizes “3D spatial constraints” to guide the robot’s actions. The result is a robust system capable of handling multi-task manipulation, different robot types (embodiments), and diverse environmental settings.

Background: The 2D vs. 3D Dilemma

To understand why 3DS-VLA is necessary, we must first look at the current landscape of robotic learning.

The Rise of VLA Models

Models like RT-2 have shown that if you discretize robot actions (turning movement into tokens, just like words) and feed them into a large Transformer alongside images and text, the robot can learn to perform tasks. These models are excellent at semantic understanding (e.g., knowing what a “soda can” looks like vs. a “soup can”). However, they often treat the world as a flat image. They map perception directly to action, often missing the “where” and “when” of spatial interactions.
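To make the token analogy concrete, here is a minimal sketch of how continuous robot actions can be discretized into bins that act like vocabulary tokens; the 256-bin scheme and action ranges are illustrative assumptions, not RT-2's exact configuration.

```python
import numpy as np

def discretize_action(action, low, high, num_bins=256):
    """Map each continuous action dimension to an integer bin (token id)."""
    action = np.clip(action, low, high)
    # Normalize to [0, 1], then quantize into num_bins discrete levels.
    normalized = (action - low) / (high - low)
    return np.minimum((normalized * num_bins).astype(int), num_bins - 1)

def undiscretize_action(tokens, low, high, num_bins=256):
    """Invert the quantization, recovering the bin-center value per dimension."""
    normalized = (tokens + 0.5) / num_bins
    return low + normalized * (high - low)

# Example: a 7-DoF action (xyz delta, rotation delta, gripper) becomes 7 tokens.
low, high = -np.ones(7), np.ones(7)
tokens = discretize_action(np.array([0.1, -0.3, 0.0, 0.5, 0.2, -0.9, 1.0]), low, high)
print(tokens)  # [140  89 128 192 153  12 255]
```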

The Limits of Native 3D Learning

On the other side of the spectrum, we have 3D imitation learning. These methods take point clouds (data points in X, Y, Z space) as input and use architectures like PointNet or voxel-based grids. While they capture scene geometry directly, they lack the “common sense” reasoning that comes from pre-training on billions of internet images. Furthermore, collecting 3D robotic demonstrations is expensive and slow, leading to a scarcity of data that limits generalization.

The Middle Ground Attempts

Previous attempts to combine these worlds usually involved projecting 3D data into 2D images (multi-view) or trying to “lift” 2D features into 3D space. Both approaches are lossy; you either lose geometric precision during projection or fail to capture the raw geometric details when lifting features.

The researchers behind 3DS-VLA asked a critical question: How can we directly inject raw 3D geometric information into a pre-trained 2D model so that the model understands it natively?

Core Method: 3DS-VLA Architecture

The proposed solution is an autoregressive generative model. This means the robot looks at the current state and predicts the next action step-by-step. The architecture is built upon a pre-trained Vision-Language Model (specifically using LLaMA and CLIP components), adapted efficiently using LoRA (Low-Rank Adaptation) to keep training computational costs manageable.
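As a rough sketch of this adaptation step, the snippet below attaches LoRA adapters to a pretrained LLaMA backbone using the Hugging Face peft library; the checkpoint name, rank, and target modules are assumptions for illustration rather than the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pretrained language backbone (checkpoint name is illustrative).
backbone = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach low-rank adapters to the attention projections; only these small
# matrices are trained, while the original weights stay frozen.
lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # assumed target layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Because only the low-rank adapter matrices are updated, the frozen backbone keeps its internet-scale semantic knowledge while fine-tuning stays computationally cheap.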

The innovation lies in two main pillars:

  1. 3D Spatial Observation: A clever mechanism to force a 2D encoder to understand 3D point clouds.
  2. 3D Spatial Constraint: A guidance system that uses text and keypoints to tell the robot where and when to interact with the world.

Figure 2: Model Architecture. Given the current observation, task instruction, and keypoint constraints, 3DS-VLA predicts the next-frame pose. It incorporates 3D spatial observations and 3D spatial constraints to enhance 3D spatial awareness. The first component uses a 2D visual encoder to encode both the 2D image and the 3D point cloud. The second component guides the model to follow 3D constraint priors between the robot and the environment.

Let’s break down the architecture shown in Figure 2.

1. 3D Spatial Observation via Alignment

The left side of Figure 2 details how the model “sees.” The system takes two visual inputs: a standard RGB image and a Point Cloud (derived from depth cameras).

The Tokenizer Challenge

Standard 2D models break images into square patches (tokens). 3D data doesn’t fit this grid. To solve this, the researchers implement a non-parametric 3D tokenizer. They use Farthest Point Sampling (FPS) to select representative points from the cloud and k-Nearest Neighbors (kNN) to group local geometric features. This converts a raw point cloud into a sequence of high-dimensional tokens that look, structurally, like the tokens the VLM expects.
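A minimal NumPy sketch of this non-parametric tokenization is shown below; the token count and group size are assumed hyperparameters, not values reported in the paper.

```python
import numpy as np

def farthest_point_sampling(points, num_tokens):
    """Greedily pick num_tokens points that are maximally spread out."""
    n = points.shape[0]
    selected = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(num_tokens - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

def tokenize_point_cloud(points, num_tokens=64, k=32):
    """Turn a raw (N, 3) point cloud into num_tokens local groups.

    Each token is the k nearest neighbors around an FPS center, expressed
    relative to that center, ready to be embedded as one token vector.
    """
    centers = points[farthest_point_sampling(points, num_tokens)]   # (T, 3)
    d = np.linalg.norm(points[None] - centers[:, None], axis=-1)    # (T, N)
    knn_idx = np.argsort(d, axis=1)[:, :k]                          # k nearest per center
    groups = points[knn_idx] - centers[:, None]                     # local coordinates
    return centers, groups                                          # (T, 3), (T, k, 3)

# Example on a synthetic cloud of 2048 points.
cloud = np.random.rand(2048, 3)
centers, groups = tokenize_point_cloud(cloud)
print(centers.shape, groups.shape)  # (64, 3) (64, 32, 3)
```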

2D-to-3D Positional Alignment

Here is the most ingenious part of the method. If you just feed these 3D tokens into a 2D model, the model won’t know where those points are in space. Standard Transformers use “Positional Embeddings” (PEs) to understand the order and location of patches in an image.

The researchers realized that since the point cloud and the image capture the same scene, they should share positional context. They developed a 2D-to-3D Positional Alignment mechanism:

  1. Take a 3D token (which represents a cluster of points).
  2. Project its center point back onto the 2D image plane using camera parameters.
  3. Identify which 2D image patch corresponds to that location.
  4. Assign the pre-trained 2D Positional Embedding of that image patch to the 3D token.

By doing this, the model receives a 3D token but “thinks” it is looking at a specific part of the 2D image it already knows how to process. This allows the 3D geometric data to piggyback on the spatial reasoning capabilities the 2D model already possesses. The 2D and 3D tokens are concatenated and passed through a shared visual encoder (CLIP).
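The sketch below illustrates this alignment step under a pinhole camera model, assuming camera-frame token centers, known intrinsics, and a ViT-style grid of pretrained 2D patch positional embeddings; the patch size and embedding dimensions are illustrative.

```python
import numpy as np

def aligned_positional_embeddings(centers_cam, K, patch_pe, image_hw, patch_size=14):
    """Assign each 3D token the 2D positional embedding of the patch it projects to.

    centers_cam : (T, 3) token centers in the camera frame (z > 0).
    K           : (3, 3) camera intrinsics.
    patch_pe    : (H_p * W_p, D) pretrained 2D patch positional embeddings.
    image_hw    : (H, W) image size in pixels.
    """
    H, W = image_hw
    W_p = W // patch_size

    # Project 3D centers onto the image plane: u = fx*x/z + cx, v = fy*y/z + cy.
    uv = (K @ centers_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Clamp to the image and find the patch index each center falls into.
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    patch_idx = (v // patch_size).astype(int) * W_p + (u // patch_size).astype(int)

    # Reuse the pretrained 2D positional embedding of that patch for the 3D token.
    return patch_pe[patch_idx]  # (T, D)

# Example with made-up intrinsics and a 224x224 image (16x16 patch grid).
K = np.array([[300.0, 0, 112.0], [0, 300.0, 112.0], [0, 0, 1.0]])
patch_pe = np.random.randn((224 // 14) ** 2, 768)
centers = np.stack([np.random.uniform(-0.3, 0.3, 64),
                    np.random.uniform(-0.3, 0.3, 64),
                    np.random.uniform(0.5, 1.5, 64)], axis=1)
pe_3d = aligned_positional_embeddings(centers, K, patch_pe, (224, 224))
print(pe_3d.shape)  # (64, 768)
```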

2. 3D Spatial Constraints

Perception is only half the battle. The robot also needs to understand the relationship between itself and the environment over time. This is addressed in the right side of Figure 2.

Extracting Keypoints

Rather than just giving the robot a command like “pour water,” the system breaks the world down into actionable keypoints. It uses an external model (Grounded SAM) to identify objects mentioned in the instruction (e.g., a bottle or a cup). It extracts the 3D center points of these objects.
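One plausible way to turn a 2D object mask into such a 3D keypoint is to back-project the masked pixels through the depth map and take their centroid; the sketch below assumes a Grounded SAM mask, an aligned depth image, and known intrinsics, and may differ from the paper's exact pipeline.

```python
import numpy as np

def mask_to_3d_keypoint(mask, depth, K):
    """Back-project the masked pixels into 3D and return their centroid.

    mask  : (H, W) boolean object mask (e.g. from Grounded SAM).
    depth : (H, W) depth map in meters, aligned with the mask.
    K     : (3, 3) camera intrinsics.
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    valid = z > 0                      # drop missing depth readings
    u, v, z = u[valid], v[valid], z[valid]

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)   # (M, 3) camera-frame points
    return points.mean(axis=0)             # 3D center of the object
```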

Text-Based Constraints

Instead of feeding these keypoints as raw numbers into a separate motion planner, 3DS-VLA integrates them directly into the language prompt. The system formulates the task as a question-answer pair.

For example, the input prompt might look like:

  • Instruction: “Pour water into the cup.”
  • Condition: “The target is close to [Keypoint 1 - Bottle].”
  • Next Step Condition: “The target is close to [Keypoint 2 - Cup].”

This explicitly encodes Where (the coordinate of the object) and When (the sequence: grasp bottle first, then move to cup) directly into the language model’s reasoning process. This turns the physical constraints of the world into a language structure the LLM can understand and predict.
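As an illustration, such a constraint-augmented prompt could be assembled like this; the template wording, helper name, and coordinate precision are assumptions, not the paper's exact format.

```python
def build_constrained_prompt(instruction, keypoints):
    """Serialize 3D keypoints into the language prompt as ordered constraints.

    keypoints: list of (name, (x, y, z)) tuples in the order they should be reached.
    """
    lines = [f"Instruction: {instruction}"]
    for step, (name, (x, y, z)) in enumerate(keypoints, start=1):
        lines.append(
            f"Constraint {step}: the target is close to {name} at "
            f"({x:.2f}, {y:.2f}, {z:.2f})."
        )
    lines.append("Predict the next end-effector pose.")
    return "\n".join(lines)

prompt = build_constrained_prompt(
    "Pour water into the cup.",
    [("the bottle", (0.42, -0.10, 0.15)), ("the cup", (0.55, 0.08, 0.12))],
)
print(prompt)
```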

Experiments and Results

The researchers rigorously tested 3DS-VLA in both simulated environments (RLBench) and the real world using a Franka Emika robot.

Simulation Performance

In the RLBench simulator, the model was tested against state-of-the-art baselines, including:

  • 3D approaches like 3D Diffusion Policy (DP3) and 3D Diffuser Actor (3DA).
  • 2D VLA approaches like OpenVLA and CogACT.

Single-Arm Manipulation

As seen in Table 1 below, 3DS-VLA (Ours) achieves a significantly higher average success rate (0.66) compared to OpenVLA (0.43) and even purely 3D methods like DP3 (0.64).

Table 1: Single-Arm Multi-Task Performance on RLBench across 21 tasks.

The method shines in tasks requiring precise geometric interaction, such as “Insert USB” or “Stack Blocks.” 2D-only methods frequently failed during the final moments of these tasks—the contact phase—because they couldn’t perceive the precise depth needed to insert or stack an object without knocking it over.

Dual-Arm Manipulation

The flexibility of the autoregressive architecture means 3DS-VLA can handle dual-arm setups without architectural changes—it simply predicts poses for two arms instead of one.

Table 2: Dual-Arm Multi-Task Performance on RLBench2 across 5 tasks.

Table 2 shows a massive performance gap. In complex bimanual tasks like “Straighten Rope” or “Lift Tray,” 3DS-VLA dominates the baselines, more than doubling the success rate of the next best method in some cases. This confirms that the spatial constraints effectively coordinate the two arms.

Ablation Studies: What Makes It Work?

Is the complex alignment mechanism necessary? Or is it just the extra data? The ablation study in Table 3 provides the answer.

Table 3: The effectiveness of each proposed component.

  • Row 1 (Baseline): Removing both constraints and 3D tokens results in poor performance (0.18 success).
  • Row 3 (Misalignment): Interestingly, if you include 3D tokens but don’t align the positional embeddings (using the original 2D order instead), performance drops compared to the full model. This shows that the positional alignment is crucial: the model gets confused if the 3D geometry doesn’t spatially match the 2D visual features.

Real-World Generalization

Robots often fail when the real world looks slightly different from their training data. 3DS-VLA was tested on generalization across four axes: Instance (new objects), Position (moved objects), Background (clutter), and View (camera angle).

Figure 3: Demonstrations of the execution process and the four types of generalization settings.

Figure 3 visualizes these variations. The quantitative results for real-world tasks are detailed in Table 4 below.

Table 4: We compare 3DS-VLA with baselines on 10 real-world tasks and evaluate its robustness across test settings that vary from the training dataset domain. * denotes long-horizon tasks.

The model achieves a 54% average success rate in the real world, outperforming the strong baselines DP3 (49%) and CogACT (50%). It showed remarkable robustness to background variation (cluttered tables), likely because the spatial constraints help the model focus strictly on the relevant objects regardless of the noise around them.

Visualizing Success

Figures 5 and 6 provide a visual “flow” of how the model executes these tasks.

Figure 5: Visualization of simulation tasks. We evaluate on both single-arm and dual-arm simulation tasks.

In the simulation (Figure 5), we see the diversity of tasks, from turning on lamps to intricate dual-arm lifting.

Figure 6: Visualization of real-world tasks. The tasks are shown as key-frame sequences.

In the real world (Figure 6), the “Stack Cup” and “Pour Water” tasks highlight the precision. The robot isn’t just waving its arm in the general direction of the cup; it is identifying the handle, orienting the gripper, and executing a pour at the correct height. This level of fine-grained interaction is a direct result of the 3D spatial awareness injected into the model.

Conclusion

3DS-VLA represents a significant step forward in robotic manipulation. It successfully navigates the trade-off between the semantic richness of 2D foundation models and the geometric necessity of 3D data.

By aligning 3D point clouds with 2D positional embeddings, the authors have found a way to “trick” a 2D brain into thinking in 3D. Furthermore, by translating spatial coordinates into text-based constraints, they leverage the logical reasoning capabilities of LLMs to solve physical planning problems.

The implications for this are broad. It suggests that we may not need to train massive 3D foundation models from scratch to achieve high-level robotic intelligence. Instead, we can build better bridges between the geometric world of robots and the semantic world of existing AI models. As we look toward future developments, improving the speed of inference and integrating more complex planning agents could make systems like 3DS-VLA the standard for general-purpose household robots.