Imagine you are working with a colleague to assemble a piece of furniture. You realize you need the screwdriver on the table behind them. You don’t say, “Please retrieve the object located at coordinates X: 1.2, Y: -0.5, Z: 0.8.” You simply point and say, “Pass me that.” Your colleague immediately understands the gesture, estimates the direction, identifies the specific object you mean, and hands it over.
This interaction is seamless for humans, but it is incredibly difficult for robots.
While we have robots that can navigate maps and recognize objects, the bridge between human gesture and 3D spatial understanding is still under construction. Most current systems rely on awkward interfaces—like clicking on a screen or giving precise coordinate descriptions—which defeats the purpose of natural interaction.
In this post, we are diving deep into Pointing3D, a recent research paper that proposes a new way to solve this problem. The researchers introduce a method that takes a simple image of a human pointing and a 3D scan of the room to accurately segment exactly what is being pointed at.
The Problem: Direction vs. Extent
Why hasn’t this been solved yet? The challenge lies in the difference between direction and extent.
Previous research has become quite good at estimating the pointing direction (a vector shooting out from your finger). However, a vector is just a line. In a cluttered room, that line passes through empty space, grazes a chair, crosses a table, and eventually hits a wall. A robot needs to know that when you point, you aren’t pointing at the wall or the empty space; you are referring to the specific volume of the object (the “extent”) that the gesture singles out.
To bridge this gap, the authors of Pointing3D introduce a new task: Pointing-based 3D Object Segmentation.
The Foundation: The POINTR3D Dataset
Deep learning models are only as good as the data they are fed. The researchers quickly realized that no existing dataset combined all the necessary ingredients:
- Images of humans pointing.
- 3D point clouds of the environment.
- 3D segmentation masks (labels) of the specific objects being referred to.
To fix this, they built POINTR3D.

As shown in Figure 1, the dataset captures diverse indoor environments. The team recorded scenes using RGB cameras and captured the 3D geometry (point clouds) of the rooms. They then had human participants stand in these rooms and point at various objects—everything from robotic arms and trash cans to potted plants and sofas.
The resulting dataset is massive, containing roughly 65,000 frames. Crucially, it provides the ground truth needed for supervised training: the image of the person pointing paired with the exact 3D mask of the object they are pointing at.

Figure 2 illustrates the data pairing. On the left, we see the robot’s perspective (a camera view of a person). On the right, we see the “answer key”—the segmented 3D object (like a quadruped robot or a stool) highlighted in the spatial map.
With this data in hand, the authors developed a two-stage architecture to solve the task.
The Pointing3D Method
The proposed model, appropriately named Pointing3D, mimics the human thought process:
- Look at the person: Where is their hand, and where are they pointing?
- Look at the world: Based on that direction, which object in the 3D environment fits the description?
Let’s break down the architecture.

The system takes two inputs: a 2D RGB image of the user and a 3D point cloud of the scene.
Stage 1: The Pointing Direction Module
The first step is to figure out the geometry of the human pose. The model uses a state-of-the-art human pose estimator (based on the SMPL body model) to extract a 3D skeleton from the 2D image. This gives the system the precise location of the skeletal joints, specifically the active hand.
But how does the model learn where the user is pointing? The researchers define the “ground truth” pointing direction (\(\mathbf{d}_{gt}\)) mathematically: it is the normalized vector connecting the hand joint (\(\mathbf{j}_{hand}\)) to the target object’s location (\(\mathbf{m}\)).
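Written out, that ground-truth direction is simply the difference of the two positions, normalized to unit length:

\[
\mathbf{d}_{gt} = \frac{\mathbf{m} - \mathbf{j}_{hand}}{\lVert \mathbf{m} - \mathbf{j}_{hand} \rVert}
\]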

The model uses a Multi-Layer Perceptron (MLP), a small feed-forward neural network, to predict this vector from the body pose. To train it, the authors use a loss function that combines two goals: minimizing the angular error (a cosine-similarity term) and minimizing the absolute difference between the predicted and true vectors (an L1 term).
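As a rough sketch, such a combined loss could be implemented as follows. The relative weighting of the two terms here is an assumption; the paper’s exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pointing_direction_loss(d_pred, d_gt, l1_weight=1.0):
    """Combined loss for predicted pointing vectors.

    d_pred, d_gt: (batch, 3) tensors.
    l1_weight is an assumed balance factor between the two terms.
    """
    # Angular term: 1 - cosine similarity penalizes misaligned directions.
    angular = 1.0 - F.cosine_similarity(d_pred, d_gt, dim=-1)
    # Absolute term: L1 distance between predicted and ground-truth vectors.
    absolute = (d_pred - d_gt).abs().sum(dim=-1)
    return (angular + l1_weight * absolute).mean()
```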

The cosine term keeps the predicted vector aligned with the true direction, while the L1 term penalizes large deviations, making the estimate more robust. Once the model has the predicted pointing direction (\(\mathbf{d}_{pred}\)) and the hand position, it passes this information to the second, more complex stage.
Stage 2: The 3D Instance Segmentation Module
This is the heart of the paper. We have a 3D point cloud of the room and a ray (a line) shooting out from a hand. How do we turn that into a precise object mask?
The authors use a Transformer-based architecture. If you are familiar with Natural Language Processing or modern Computer Vision, you know Transformers are excellent at paying attention to specific parts of data.
1. Query Initialization (The “Flashlight” Technique)
In standard segmentation transformers (like Mask3D), the model starts from generic queries and tries to discover every object in the scene. Here, we want to guide the model toward a single object, so the researchers use the pointing ray as a prompt.
They cast the ray into the 3D point cloud. They look for points that fall within a narrow cone around this ray (imagine shining a flashlight). The point closest to the hand within this cone is selected as the “target point” (\(\mathbf{p}_{target}\)).
This point is used to initialize a Segmentation Query (\(Q^0\)). This query is a vector that tells the Transformer: “Focus on the object that contains this point.”

As seen in the equation above, the initial query combines the target point’s positional encoding (\(\mathrm{PE}_{target}\)) with its visual features (\(\mathcal{F}^0_{target}\)) extracted from the scene backbone.
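To make the “flashlight” idea concrete, here is a minimal NumPy sketch of how the target point and the initial query might be computed. The cone half-angle, the helper names, and the choice of a simple sum for combining the positional encoding with the features are assumptions for illustration, not the paper’s exact formulation.

```python
import numpy as np

def select_target_point(points, hand, direction, max_angle_deg=5.0):
    """Return the index of the scene point closest to the hand that lies
    inside a narrow cone around the predicted pointing ray."""
    d = direction / np.linalg.norm(direction)
    offsets = points - hand                          # vectors hand -> point, (N, 3)
    dists = np.linalg.norm(offsets, axis=1)
    cos_angles = offsets @ d / np.maximum(dists, 1e-8)
    in_cone = cos_angles > np.cos(np.radians(max_angle_deg))
    if not in_cone.any():
        return None                                  # the ray misses the cloud
    candidates = np.flatnonzero(in_cone)
    return candidates[np.argmin(dists[candidates])]  # closest point to the hand

def init_query(target_idx, point_features, positional_encodings):
    """Initial segmentation query Q^0: the target point's positional encoding
    combined (here, summed) with its backbone features."""
    return positional_encodings[target_idx] + point_features[target_idx]
```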
2. The Transformer Decoder
Now the Transformer takes over. It has a “query” representing the user’s intent. It iterates through several layers to refine this understanding.
At each layer, the model compares the query against the features of the 3D point cloud. It calculates a heatmap (\(H^l\)) that assigns a score to every point in the cloud, indicating how likely that point is to belong to the referred object.

The query itself is then updated: through cross attention, it “looks” at the points it currently believes belong to the object and gathers more context to refine its estimate of the object’s extent.

This loop repeats (as shown in the architecture diagram), with the query getting smarter and the segmentation mask getting sharper at every step. Finally, the model outputs a binary mask: 1 for points belonging to the object, 0 for everything else.
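A schematic sketch of this refinement loop is shown below. It assumes per-point features from the backbone and a single query vector; the real model uses learned projections, positional encodings, and more elaborate attention blocks, so treat this as an illustration of the loop structure rather than the actual architecture.

```python
import torch
import torch.nn as nn

class DecoderLoopSketch(nn.Module):
    """Illustrative refinement loop: heatmap -> cross-attention update -> repeat."""

    def __init__(self, dim=128, num_layers=3):
        super().__init__()
        # One cross-attention block per refinement layer
        # (dim must be divisible by num_heads).
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, query, point_feats):
        # query:       (1, 1, C)  pointing-initialized query Q^0
        # point_feats: (1, N, C)  per-point scene features from the 3D backbone
        heatmap = None
        for attn in self.layers:
            # Heatmap H^l: per-point similarity to the current query.
            heatmap = torch.sigmoid(point_feats @ query.transpose(1, 2)).squeeze(-1)
            # Cross attention: the query gathers context from the scene points.
            query, _ = attn(query, point_feats, point_feats)
        # Final binary mask: 1 for points assigned to the referred object.
        return (heatmap > 0.5).long()
```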
Experimental Results
How well does it work? The researchers tested Pointing3D against several strong baselines.
1. Can it estimate direction?
First, they checked if the first module could simply figure out where the user was pointing. They compared their method against heuristics (like drawing a line from the person’s eyes to their hand, or shoulder to hand) and “DeePoint,” a leading video-based pointing method.

The results (Table 1) show that Pointing3D achieves the highest accuracy and the lowest angular deviation. Simple heuristics like “Shoulder to Hand” performed surprisingly well, but the learned Pointing3D model was still superior, likely because it can account for subtle variations in body language.
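For reference, the geometric heuristics in this comparison amount to a single vector computation; a minimal sketch (with illustrative joint names) is:

```python
import numpy as np

def heuristic_direction(origin_joint, hand_joint):
    """Baseline pointing direction: unit vector from an origin joint
    (eye for 'Eye to Hand', shoulder for 'Shoulder to Hand') to the hand."""
    v = hand_joint - origin_joint
    return v / np.linalg.norm(v)

# e.g. (joint names are illustrative, not from the paper):
# direction = heuristic_direction(joints["right_shoulder"], joints["right_hand"])
```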
2. Can it segment the object?
This is the main event. They compared Pointing3D against:
- Region Growing: A classic algorithm that starts at the pointed spot and expands until it hits edges.
- Interactive 2D (SAM): Using the “Segment Anything Model” on the 2D image and projecting it to 3D.
- Non-interactive 3D (Mask3D): Finding all objects in 3D first, then picking the one closest to the pointing ray.

Table 2 reveals that Pointing3D significantly outperforms the baselines. The metric used is Intersection-over-Union (IoU), where higher is better.
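Before looking at why each baseline falls short, here is the metric itself; this is the standard IoU definition for binary per-point masks, not anything specific to this paper.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two binary per-point masks of shape (N,)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union
```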
- Region growing fails because it has no notion of semantic objects (it might stop too early at a shadow or texture change, or merge two adjacent objects into one).
- 2D Interactive methods struggle with occlusion—if an object is partially blocked in the 2D image, the 3D projection will be wrong.
- Non-interactive 3D fails if the underlying segmentation model misses the object entirely. If Mask3D doesn’t “see” the trash can as an object, pointing at it won’t help.
Pointing3D succeeds because it is promptable. It uses the pointing information during the segmentation process, not just as a filter afterwards.
Visual Proof
Visual comparisons often tell the best story. In the figure below, the user is pointing at a gray box.

- Interactive 2D (left) over-segments, grabbing the box and the items inside/near it as one blob.
- Non-interactive 3D (middle) completely underestimates the object, selecting only a tiny fragment.
- Pointing3D (right) correctly identifies the full extent of the box, matching the Ground Truth closely.
Why This Matters
This paper represents a significant step toward “natural” robotics. We are moving away from forcing humans to speak the robot’s language (coordinates) and toward teaching robots to understand ours (gestures).
There are still limitations. The current system assumes a static 3D scan of the room (it doesn’t account for furniture moving around in real-time) and focuses on single users. However, the introduction of the POINTR3D dataset and the Pointing3D architecture lays the groundwork for future systems where you can simply point at a dirty plate and tell your robot, “Clean that up,” and have it actually understand what “that” means.
Key Takeaways:
- New Task: Pointing-based 3D object segmentation combines gesture analysis with spatial understanding.
- Data is King: The 65,000-frame POINTR3D dataset enables supervised learning for this specific task.
- Prompting works: Using a pointing vector to initialize a Transformer query allows for more accurate segmentation than just filtering pre-detected objects.