Imagine asking a robot to “pick up the small coffee table between the sofa and the lamp.” For a human, this is trivial. We instantly parse the scene, identify the sofa, the lamp, and the specific table that sits between them. For an AI, however, this task—known as 3D Referential Grounding (or 3D-REFEXP)—is notoriously difficult.

The AI must understand natural language, perceive 3D geometry, handle the noise inherent in real-world sensors, and reason about spatial relationships. Historically, models attempting this have relied on “cheats,” such as pre-processed, perfect 3D meshes or human-annotated segmentation maps. But the real world doesn’t come with pre-labeled meshes; it comes as a messy stream of RGB-D (color and depth) sensor data.

Enter Locate 3D, a new model from Meta’s FAIR team. This paper presents a significant leap forward by performing object localization directly on sensor observation streams. At the heart of this system lies 3D-JEPA, a novel Self-Supervised Learning (SSL) algorithm that brings the power of “contextualized embeddings”—the secret sauce behind Large Language Models (LLMs)—to 3D point clouds.

In this deep dive, we will unpack how Locate 3D works, exploring how it lifts 2D foundation models into 3D space, how it learns context through self-supervision, and how it achieves state-of-the-art results on real robots.

The Core Problem: Why is 3D Grounding Hard?

To understand the contribution of Locate 3D, we must first look at the limitations of prior work.

  1. Dependency on Perfect Data: Many existing 3D grounding methods require detailed 3D meshes or “ground-truth” object proposals during inference. This makes deployment on a robot (which only sees what its cameras see right now) impossible.
  2. The “Local” vs. “Global” Trap: Simply projecting 2D features (like those from CLIP) onto 3D points gives you a “bag of words” understanding of the scene. The model might know a point belongs to a “chair,” but it lacks the context to know it’s “the chair next to the door.”
  3. Data Scarcity: Compared to the billions of images available for 2D training, annotated 3D data is scarce and expensive to collect.

Locate 3D addresses these issues with a three-phase pipeline designed to work on raw, noisy sensor data while leveraging the massive knowledge embedded in 2D foundation models.

The Architecture of Locate 3D

The model architecture is elegant in its progression from raw data to semantic understanding. As illustrated below, the pipeline consists of Preprocessing, Contextualized Representation, and 3D Localization.

Figure 1: Overall Architecture of Locate 3D. The pipeline moves from raw RGB-D data to “lifted” features, refines them with the 3D-JEPA encoder, and finally decodes them into specific object masks and bounding boxes based on a text query.

Let’s break down these three phases.

Phase 1: Preprocessing and “Lifting” Features

The researchers do not train a vision model from scratch. Instead, they stand on the shoulders of giants. They utilize 2D Foundation Models—specifically CLIP (for language-image alignment), DINOv2 (for robust visual features), and SAM (Segment Anything Model, for object masks).

The process works as follows:

  1. RGB-D Input: The system takes a stream of posed RGB-D images (video frames with depth and camera position).
  2. Feature Extraction: For every 2D frame, features are extracted using CLIP and DINOv2. Because CLIP produces a single global vector per image, the authors use SAM to generate instance masks and compute a CLIP feature for each mask.
  3. Lifting: These 2D features are “lifted” into 3D space. The RGB-D data is unprojected into a 3D point cloud. The system then voxelizes this cloud (creates a 3D grid). For every voxel, the features from the 2D pixels that fall into that space are averaged.

The result is a Featurized Point Cloud, denoted as \(\mathbf{PtC}_{\text{lift}}\). Each point in this cloud contains rich semantic information from the 2D models. However, this information is local. A point on a table knows it looks like a table, but it doesn’t inherently know about the chair next to it.
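To make the lifting step concrete, here is a minimal numpy sketch of the geometric part: unprojecting a posed depth frame into world coordinates and averaging per-pixel features into voxels. The function names, the 5 cm voxel size, and the per-pixel `feats` array (which would hold the CLIP/DINOv2/SAM-derived features) are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Unproject a depth map (H, W) into world-space 3D points using the
    camera intrinsics K (3x3) and a camera-to-world pose (4x4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous, (H*W, 4)
    return (cam_to_world @ pts_cam.T).T[:, :3]               # world coords, (H*W, 3)

def voxelize_features(points, feats, voxel_size=0.05):
    """Average per-point 2D features (e.g. lifted CLIP/DINOv2 vectors) into a
    sparse voxel grid, one feature vector per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)    # integer voxel indices
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)
    summed = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(summed, inv, feats)                            # scatter-add features per voxel
    counts = np.bincount(inv, minlength=len(uniq))[:, None]
    return uniq * voxel_size, summed / counts                # voxel origins, mean features
```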

Phase 2: 3D-JEPA and Contextualized Representations

This is arguably the most significant contribution of the paper. To transform local features into global scene understanding, the authors introduce 3D-JEPA (Joint Embedding Predictive Architecture for 3D).

In Natural Language Processing (NLP), we moved from simple word embeddings (Word2Vec) to contextualized embeddings (BERT). The word “bank” means something different depending on whether it’s near “river” or “money.” 3D-JEPA attempts to do the exact same thing for 3D points.

The Self-Supervised Learning (SSL) Task

3D-JEPA uses a masked prediction strategy. The idea is simple: if you hide a part of the scene, a model that truly “understands” the scene should be able to predict what’s missing based on the context of what remains.

Figure 2: The 3D-JEPA training framework. A context encoder processes a masked point cloud. A predictor then tries to guess the latent features of the masked regions. The target features come from a “Target Encoder” which is an Exponential Moving Average (EMA) of the context encoder.

The training process involves two pathways:

  1. Target Encoder: Processes the full point cloud to generate “ground truth” feature representations.
  2. Context Encoder: Processes a masked version of the point cloud (where chunks of the scene are removed).

A Predictor network then takes the output of the Context Encoder and tries to predict the feature representations of the missing regions.
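Put together, a single training step might look roughly like the PyTorch sketch below. The encoder/predictor interfaces, the smooth-L1 distance, and the EMA decay value are my assumptions to illustrate the two-pathway design, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, points, feats,
              mask, optimizer, ema_decay=0.998):
    """One illustrative 3D-JEPA update. `mask` is a boolean tensor marking the
    points hidden from the context encoder; `optimizer` covers only the
    context encoder and the predictor."""
    # Target pathway: encode the FULL point cloud with no gradients (stop-gradient).
    with torch.no_grad():
        target_feats = target_encoder(points, feats)              # (N, D)

    # Context pathway: encode only the visible points.
    ctx = context_encoder(points[~mask], feats[~mask])            # (N_visible, D)

    # Predictor guesses the latent features at the masked locations.
    pred = predictor(ctx, points[~mask], points[mask])            # (N_masked, D)

    loss = F.smooth_l1_loss(pred, target_feats[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Target encoder tracks the context encoder via an exponential moving average.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_c, alpha=1.0 - ema_decay)
    return loss.item()
```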

The mathematical objective minimizes the distance between the predicted features and the target features:

Equation 1: The loss function for 3D-JEPA. It minimizes the distance between the predictor’s output (based on masked input) and the target encoder’s output (based on full input). Note the stop-gradient (sg) on the target encoder, which is crucial for stability.
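Written out, the objective has roughly the following form (notation mine, reconstructed from the description above; the exact distance function is a detail of the paper):

\[
\mathcal{L}_{\text{3D-JEPA}} \;=\; \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \Big\lVert \, g_\phi\big(f_\theta(\widetilde{\mathbf{PtC}}_{\text{lift}})\big)_i \;-\; \mathrm{sg}\big(\bar{f}_{\bar{\theta}}(\mathbf{PtC}_{\text{lift}})\big)_i \Big\rVert
\]

where \(\mathcal{M}\) is the set of masked regions, \(\widetilde{\mathbf{PtC}}_{\text{lift}}\) the masked point cloud, \(f_\theta\) the context encoder, \(g_\phi\) the predictor, \(\bar{f}_{\bar{\theta}}\) the EMA target encoder, and \(\mathrm{sg}\) the stop-gradient.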

Why Masking Matters

The authors found that how you mask the scene is critical. If you just mask random individual points, the model can “cheat” by interpolating from immediate neighbors. To force the model to learn high-level geometry and semantics, they use block-wise masking (specifically, serialized percent masking), which removes contiguous chunks of the scene.

Figure 4: Different masking strategies. The “Serialized Masking” (bottom row) was found to be most effective, forcing the model to infer missing structures rather than just interpolating local noise.
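To give a flavor of block-wise masking in code, here is a toy numpy sketch that orders points by a coarse grid serialization and hides contiguous runs of that ordering. The paper’s serialized percent masking likely builds on a proper point-cloud serialization (e.g. a space-filling curve); the ratio, block count, and grid size below are arbitrary and only illustrate the idea of removing whole chunks.

```python
import numpy as np

def serialized_block_mask(points, mask_ratio=0.5, n_blocks=8, grid=0.1):
    """Toy block-wise mask: order points by a coarse grid serialization, then
    hide contiguous runs of that ordering so whole chunks of the scene vanish."""
    cells = np.floor((points - points.min(0)) / grid).astype(np.int64)
    # Simple lexicographic serialization (a space-filling curve would be better).
    order = np.lexsort((cells[:, 2], cells[:, 1], cells[:, 0]))

    n = len(points)
    per_block = max(1, int(mask_ratio * n) // n_blocks)
    mask = np.zeros(n, dtype=bool)
    starts = np.random.choice(n - per_block, size=n_blocks, replace=False)
    for s in starts:
        mask[order[s:s + per_block]] = True                  # one contiguous chunk
    return mask
```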

By training on this task, the 3D-JEPA encoder learns to look at a point and understand its relationship to the rest of the room. It produces contextualized features.

Phase 3: The Language-Conditioned Decoder

Once the encoder is pre-trained via 3D-JEPA, it is fine-tuned alongside a Language-Conditioned Decoder. This component is responsible for taking the rich 3D features and the user’s text query (e.g., “the chair near the window”) and finding the object.

Figure 3: The Language-Conditioned Decoder. It uses a series of attention blocks. Queries (derived from text) attend to the 3D point features (cross-attention) to refine their representation.

The decoder uses a Transformer architecture with \(N=8\) blocks. It employs:

  1. Self-Attention: Allowing the model to reason about relationships between different parts of the query/objects.
  2. Cross-Attention: Integrating the text query with the 3D point cloud features.
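A single decoder block, in spirit, looks something like this PyTorch sketch. The hidden size, head count, normalization placement, and the plain feed-forward layer are assumptions for illustration; only the self-attention plus cross-attention structure and the \(N=8\) depth come from the paper.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One illustrative language-conditioned decoder block: queries attend to
    each other, then to the 3D point features, then pass through a feed-forward
    layer."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, point_feats):
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])  # self-attention
        q = self.n2(q + self.cross_attn(q, point_feats, point_feats)[0])     # cross-attention to points
        return self.n3(q + self.ffn(q))

# The paper stacks N = 8 such blocks.
decoder = nn.ModuleList(DecoderBlock() for _ in range(8))
```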

Prediction Heads

The decoder doesn’t just output a generic “location.” It has specialized heads that predict three things jointly:

  1. 3D Mask: A dense segmentation of the object points.
  2. 3D Bounding Box: The spatial extent of the object.
  3. Text Alignment: Which part of the text corresponds to this object.

Figure 7: The prediction heads. Top Left: Token Prediction (matching objects to words). Top Right: Mask Head (predicting point-wise probabilities). Bottom: Bounding Box Head (predicting coordinates).

The authors developed a novel bounding box head that uses cross-attention between the refined point features and the queries, significantly outperforming standard MLP (Multi-Layer Perceptron) approaches.
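A rough sketch of how such joint heads could be wired up is shown below. The dimensions, the dot-product mask head, and the 6-number box encoding are illustrative assumptions; the cross-attention box head mirrors the idea described above rather than reproducing the paper’s exact module.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative joint heads on top of the decoder: each refined query
    predicts a point-wise mask, a 3D box, and its alignment to text tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.mask_proj = nn.Linear(dim, dim)
        self.box_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_mlp = nn.Linear(dim, 6)                 # (cx, cy, cz, w, h, d)
        self.align_proj = nn.Linear(dim, dim)

    def forward(self, queries, point_feats, text_feats):
        # Mask head: point-wise logits via query / point-feature dot products.
        masks = torch.einsum('bqd,bnd->bqn', self.mask_proj(queries), point_feats)
        # Box head: queries gather geometric evidence from points via cross-attention.
        box_ctx, _ = self.box_attn(queries, point_feats, point_feats)
        boxes = self.box_mlp(box_ctx)
        # Alignment head: similarity between object queries and text tokens.
        align = torch.einsum('bqd,btd->bqt', self.align_proj(queries), text_feats)
        return masks, boxes, align
```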

Training Strategy

Training a complex multimodal system like this requires a careful balancing act. You have a powerful pre-trained encoder (3D-JEPA) and a randomly initialized decoder. If you train them together at full learning rate from the start, gradients from the untrained decoder can wash out the pre-trained knowledge in the encoder.

To solve this, the authors utilize a Composite Loss Function and a Stage-Wise Learning Rate Schedule.

The Loss Function

The model optimizes for multiple goals simultaneously: accurate segmentation (Dice/Cross-Entropy), accurate boxes (L1/GIoU), and text alignment.

Equation 2: The composite loss function. It combines Dice and Cross-Entropy for masks, L1 and GIoU for boxes, and an alignment loss for text grounding.
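In symbols, it is a weighted sum of roughly this shape (the \(\lambda\) weights are my shorthand for the paper’s loss coefficients):

\[
\mathcal{L} \;=\; \lambda_{\text{dice}}\,\mathcal{L}_{\text{dice}} \;+\; \lambda_{\text{ce}}\,\mathcal{L}_{\text{ce}} \;+\; \lambda_{\ell_1}\,\mathcal{L}_{\ell_1} \;+\; \lambda_{\text{giou}}\,\mathcal{L}_{\text{giou}} \;+\; \lambda_{\text{align}}\,\mathcal{L}_{\text{align}}
\]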

The Learning Rate Schedule

The schedule is designed to “warm up” the system. Initially, the encoder is frozen or learns very slowly while the decoder learns the basics. Gradually, the encoder is “unlocked” to fine-tune its representations for the specific task of grounding.

Figure 8: The learning rate schedule. Note how the encoder (left) keeps a much lower learning rate compared to the decoder (right) and ramps up later in the training process to preserve pre-trained features.
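In PyTorch terms, the idea can be expressed with two parameter groups and per-group schedules, roughly as follows. The peak learning rates, delay fraction, and warmup lengths are invented for illustration; only the pattern (decoder first, encoder later and lower) reflects the paper.

```python
import torch

def build_optimizer(encoder, decoder, total_steps, enc_delay=0.3):
    """Two parameter groups with per-group schedules: the decoder warms up
    immediately, while the encoder stays frozen (lr = 0) for the first
    `enc_delay` fraction of training and then ramps to a much lower peak."""
    opt = torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": 1e-5},   # low peak LR for the 3D-JEPA encoder
        {"params": decoder.parameters(), "lr": 1e-4},   # higher peak LR for the decoder
    ])

    def enc_lambda(step):
        start = int(enc_delay * total_steps)
        if step < start:
            return 0.0                                   # keep pre-trained features intact
        return min(1.0, (step - start) / (0.1 * total_steps))

    def dec_lambda(step):
        return min(1.0, step / (0.05 * total_steps))     # short warmup, then full rate

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=[enc_lambda, dec_lambda])
    return opt, sched
```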

The Dataset: L3DD

To systematically study generalization, the researchers introduced the Locate 3D Dataset (L3DD). Existing datasets were often limited to a single capture setup (ScanNet). L3DD aggregates annotations across ScanNet, ScanNet++, and ARKitScenes, totaling over 130,000 language annotations.

This diversity is crucial. It ensures the model isn’t just memorizing the quirks of one specific camera type or room layout style.

Figure 10: A sample from the L3DD dataset (ScanNet++ split). The model sees the RGB scene (left) and must predict the mask (right) for the query “indoor plant on the table.”

Experiments and Results

So, how well does it work? The short answer: It sets a new State-of-the-Art (SoTA).

Benchmark Performance

The authors evaluated Locate 3D on standard benchmarks (SR3D, NR3D, ScanRefer). Crucially, they evaluated on sensor point clouds—raw data derived from RGB-D frames—rather than the clean meshes used by previous “state-of-the-art” methods.

Despite this harder setting, Locate 3D outperforms prior methods, including those that relied on ground-truth proposals or meshes.

Table 2: Comparison with prior methods. Locate 3D achieves significantly higher accuracy (Acc@25 and Acc@50) across all benchmarks compared to baselines like 3D-VisTA and VLM-based approaches.

When trained with the additional L3DD data (referred to as Locate 3D+), the performance jumps even further, demonstrating the scalability of the approach.

Does 3D-JEPA Actually Help?

A key question is whether the complex SSL pre-training is necessary. Could you just use the “lifted” CLIP features and train the decoder?

The ablation studies (Table 3) give a clear “Yes, it helps.”

Table 3: Ablation study. Using raw RGB is poor (28.9%). Using lifted features (ConceptFusion/CF) is better (53.9%). But initializing with 3D-JEPA yields the best results (61.7%), proving the value of contextualized representation learning.

Visualizing the “Context”

To visualize what 3D-JEPA is doing, the authors used PCA to project the high-dimensional features into colors.

In the figure below, look at the middle column (“Fine-tuned 3D-JEPA Features”). You can see smooth, semantic groupings—the encoder has grouped the “bed” and “floor” into distinct semantic regions. The decoder (right column) then sharpens these features, focusing intensely on the specific object referenced in the query.

Figure 9: Feature visualization. Left: RGB. Middle: 3D-JEPA features (smooth, semantic). Right: Decoder features (sharp, localized). This illustrates the transition from general scene understanding to specific object localization.
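Producing such a visualization is straightforward; a minimal numpy version (not the authors’ code) projects each point’s feature onto the top three principal components and rescales them to RGB:

```python
import numpy as np

def features_to_rgb(feats):
    """Map high-dimensional per-point features to colors by projecting onto the
    top three principal components and rescaling to [0, 1]."""
    centered = feats - feats.mean(0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # principal directions
    proj = centered @ vt[:3].T                                 # (N, 3)
    lo, hi = proj.min(0), proj.max(0)
    return (proj - lo) / (hi - lo + 1e-8)                      # per-channel normalization
```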

Real-World Robot Deployment

Perhaps the most exciting result is the deployment on a Boston Dynamics Spot robot. In a test apartment (unseen during training), the robot was tasked with navigating to objects described by natural language (e.g., “coffee table in front of the sofa”) and picking them up.

Locate 3D+ achieved an 8/10 success rate, significantly outperforming VLM-based baselines (which use GPT-4o or Llama-3 to propose 2D bounding boxes and project them). This suggests that learning 3D representations directly is more robust than chaining together 2D models for physical tasks.

Figure 12: Robot experiments. The Spot robot (left images) successfully localizes objects like “coffee table in front of the sofa” (top) and “dresser in the hallway” (bottom), visualized by the bounding boxes in the point clouds (right).

Conclusion

Locate 3D represents a maturity point for 3D AI. We are moving away from systems that rely on perfect, pre-processed maps and toward systems that can interpret the messy, noisy reality of raw sensor data.

By combining the “knowledge” of 2D foundation models (via lifting) with the “wisdom” of 3D contextual understanding (via 3D-JEPA), Locate 3D bridges the gap between language and geometry. The introduction of the L3DD dataset further paves the way for generalist 3D agents that can operate in diverse environments.

For students and researchers, the takeaways are clear:

  1. Lifting works: You don’t need to retrain vision models from scratch for 3D; you can project 2D intelligence into 3D space.
  2. Context is King: Local features aren’t enough. SSL methods like JEPA are essential for teaching models how scene elements relate to one another.
  3. Sensor-First Design: Designing models that work on raw sensor data is harder but necessary for real-world robotics.

As we look toward a future of helpful household robots and advanced AR assistants, techniques like Locate 3D will be the foundational blocks that allow machines to understand not just what an object is, but where it is in the context of our lives.