Teaching Robots to Move: Mining 3D Trajectories from First-Person Video

Imagine asking a robot to “pick up the knife on the counter.” To a human, this is trivial. To a robot, it requires a complex understanding of 3D space, object affordance (where to grab), and the specific motion trajectory required to execute the action safely.

For years, the gold standard for teaching robots these skills has been Imitation Learning—showing the robot examples of humans performing the task. However, this method has a massive bottleneck: data scarcity. Collecting high-quality 3D data usually requires expensive motion capture (MoCap) labs, instrumented gloves, and tedious setups. We simply cannot scale this up to cover every object and action in the real world.

But what if the data we need is already out there? What if we could teach robots by having them “watch” YouTube videos of people doing chores?

In a recent paper titled “Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision,” researchers from Kyoto University and partners propose a novel framework to do just that. They present a method to automatically extract 3D object manipulation trajectories from 2D egocentric (first-person) videos and use that data to train models that can generate complex motion paths from text descriptions.

In this post, we will dive into how they turn flat video into 3D robotic knowledge and what this means for the future of embodied AI.


The Challenge: From Pixels to Poses

The core problem the researchers address is 6DoF Object Manipulation Trajectory Generation.

Let’s break that down:

  • 6DoF (Six Degrees of Freedom): To fully describe an object’s pose in space, you need 3 coordinates for position (\(x, y, z\)) and 3 parameters for rotation (roll, pitch, yaw).
  • Trajectory: It’s not enough to know where the object starts and ends; the robot needs the path of movement in between.
  • Action Description: The input is a natural language command, like “Pour the water into the cup.”
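
Concretely, a trajectory in this setting is just a time-ordered list of poses paired with a language command. Here is a minimal sketch of that representation; the exact field layout is illustrative, not the paper’s data format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pose6DoF:
    """One 6DoF object pose: translation plus rotation (roll/pitch/yaw)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

@dataclass
class Trajectory:
    """A manipulation trajectory: the action description and one pose per time step."""
    description: str          # e.g. "pour the water into the cup"
    poses: List[Pose6DoF]     # ordered sequence of object poses
```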

The goal is depicted in the image below. The model receives an image of the initial state and a text description, and it must “imagine” the sequence of 3D poses the object should take.

Figure 1. 6DoF object manipulation trajectory. This task aims to generate a sequence of 6DoF object poses from an action description and an initial state comprising the visual input and the object’s initial pose.

The primary hurdle is that we don’t have enough labeled data where every frame of a video is paired with perfect 3D coordinates. The researchers’ solution? Automated annotation. They leveraged the massive Ego-Exo4D dataset—a collection of first-person videos of people cooking, repairing bikes, and doing daily activities—and built an AI pipeline to “extract” the 3D truth from the 2D footage.


The Core Method: Mining Trajectories from Video

The researchers developed a sophisticated four-stage pipeline to turn a raw video clip into a clean, training-ready 3D trajectory. This pipeline is fully automated, relying on a chain of state-of-the-art vision models.

Step 1: Temporal Action Localization (Finding the “What” and “When”)

First, the system needs to know what is happening. The researchers use GPT-4 as a reasoning engine. They feed the video frames and the associated textual description (e.g., “cutting the red pepper”) into GPT-4.

  • Action Span: GPT-4 identifies exactly which frames contain the start and end of the action (\(t_{start}\) to \(t_{end}\)).
  • Active Object: It identifies the object being manipulated (e.g., “knife”) and checks if it is rigid (deformable objects like cloth are excluded for now).

Once the object is identified, they use Grounded SAM (an open-vocabulary detector paired with the Segment Anything Model) to create a segmentation mask, isolating the object’s pixels from the background.
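
As a rough sketch, the output of this step can be thought of as one small record per clip, which is then filtered before the later stages. The fields and the frame-count threshold below are illustrative assumptions, not the paper’s exact schema:

```python
from dataclasses import dataclass

@dataclass
class ActionLocalization:
    """Result of the GPT-4 reasoning step for one clip (fields are illustrative)."""
    t_start: int         # first frame of the action
    t_end: int           # last frame of the action
    active_object: str   # e.g. "knife"
    is_rigid: bool       # deformable objects (cloth, dough, ...) are discarded

def keep_clip(loc: ActionLocalization, min_frames: int = 8) -> bool:
    """Keep only rigid objects with a non-trivial action span (threshold assumed)."""
    return loc.is_rigid and (loc.t_end - loc.t_start) >= min_frames
```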

Step 2: Position Sequence Extraction (Finding the “Where”)

Knowing what pixels correspond to the object is great, but we need its 3D location.

  • Depth Estimation: They use a model called Depth Anything to estimate how far away each pixel is from the camera.
  • 3D Tracking: They employ SpaTracker, a dense point tracker. This tracks specific points on the object across frames, combining the 2D pixel movement with the depth map to create a moving 3D point cloud of the object.

Step 3: Trajectory Projection (Stabilizing the View)

This is a critical step for egocentric video. In first-person footage, the camera (the person’s head) is constantly moving. If the user turns their head left, the object appears to move right, even if it’s stationary.

To fix this, the system calculates the camera’s motion between frames using point cloud registration. It essentially “freezes” the world coordinate system to the first frame of the action (at \(t_{start}\)). By subtracting the camera’s movement, they isolate the true motion of the object relative to the world.

The relationship between 2D pixels \((i, j)\) and 3D points \((x, y, z)\) is defined by the camera intrinsic matrix \(K\) and the depth \(d_{ij}\):

\[ \begin{bmatrix} x \\ y \\ z \end{bmatrix} = d_{ij}\, K^{-1} \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}, \]
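
A minimal NumPy sketch of this lifting step, covering both the back-projection above and the camera-motion compensation. The registration that produces `R_cam` and `t_cam` is assumed to come from a separate point-cloud alignment step:

```python
import numpy as np

def back_project(pixels: np.ndarray, depths: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift 2D pixels (N, 2) with per-pixel depth (N,) into 3D camera coordinates,
    implementing [x, y, z]^T = d_ij * K^{-1} [i, j, 1]^T."""
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])          # (N, 3) homogeneous pixel coordinates
    rays = (np.linalg.inv(K) @ homog.T).T      # (N, 3) rays at unit depth
    return rays * depths[:, None]              # scale each ray by its depth

def to_world_frame(points_cam: np.ndarray, R_cam: np.ndarray, t_cam: np.ndarray) -> np.ndarray:
    """Express camera-frame points in the world frame anchored to the first frame,
    cancelling out the wearer's head motion (R_cam, t_cam from registration)."""
    return points_cam @ R_cam.T + t_cam
```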

Step 4: Rotation Sequence Extraction (Finding the Orientation)

Finally, they need the rotation. They take the point cloud of the object at the start and compare it to the point cloud at each subsequent step. Using Singular Value Decomposition (SVD), they mathematically calculate the rotation matrix that best aligns the object’s shape from one frame to the next.
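
This is the classic SVD-based (Kabsch) alignment. A compact NumPy sketch, assuming the tracked points stay in correspondence across frames:

```python
import numpy as np

def rotation_from_point_clouds(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Estimate the rotation that best aligns point cloud P (start frame) onto
    point cloud Q (current frame). P, Q: (N, 3) arrays of corresponding points."""
    P_c = P - P.mean(axis=0)                    # center both clouds
    Q_c = Q - Q.mean(axis=0)
    H = P_c.T @ Q_c                             # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # R such that R @ p_i ≈ q_i
```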

The result is a clean sequence of 6DoF poses—a dataset they call EgoTraj, containing over 28,000 trajectories.

Figure 2. Trajectory extraction from egocentric videos. Four steps of (1) temporal action localization, (2) position sequence extraction, (3) trajectory projection, and (4) rotation sequence extraction.


The Generation Model: Motion as Language

With the EgoTraj dataset created, the researchers moved to the second phase: training a model to generate these trajectories from scratch.

They adopted a Vision-Language Model (VLM) approach. The insight here is to treat 3D motion just like a language.

  1. Tokenization: They discretize the continuous values of the trajectory (x, y, z, roll, pitch, yaw) into 256 distinct “bins.” Each bin becomes a token, just like a word in a vocabulary (a minimal sketch follows after this list).
  2. Next-Token Prediction: The task becomes predicting the next pose token given the image features and the text prompt.
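
Here is a minimal sketch of that discretization, assuming each dimension is first normalized to a known value range; the bin count of 256 comes from the paper, but the `lo`/`hi` ranges are placeholders you would estimate from the data:

```python
import numpy as np

NUM_BINS = 256  # size of the trajectory vocabulary per dimension

def pose_to_tokens(pose: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Discretize a 6D pose (x, y, z, roll, pitch, yaw) into integer tokens.
    lo, hi: assumed per-dimension value ranges used for normalization."""
    normalized = (pose - lo) / (hi - lo)                           # map into [0, 1]
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def tokens_to_pose(tokens: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Invert the discretization by mapping each token back to its bin center."""
    return lo + (tokens + 0.5) / NUM_BINS * (hi - lo)
```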

Architecture

They tested several backbone architectures, including BLIP-2 and PointLLM.

  • Inputs: The model takes an RGB image, a Depth map, or a Point Cloud of the scene, along with the text prompt (e.g., “Open the drawer”).
  • Processing: A visual encoder processes the scene. The large language model (LLM) then attends to these visual features and the text instruction.
  • Output: The LLM autoregressively generates the sequence of tokens representing the object’s path.
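
In pseudocode, generation is a standard autoregressive decoding loop over the extended trajectory vocabulary. The model interface below is hypothetical and only illustrates the data flow, with greedy decoding standing in for whatever sampling strategy is used:

```python
def generate_trajectory(model, visual_features, prompt_tokens,
                        horizon: int, tokens_per_pose: int = 6):
    """Decode trajectory tokens conditioned on the scene and the text prompt.
    `model` is a hypothetical callable returning next-token logits (a NumPy array);
    six consecutive tokens (x, y, z, roll, pitch, yaw) form one pose."""
    generated = []
    for _ in range(horizon * tokens_per_pose):
        logits = model(visual_features, prompt_tokens, generated)
        generated.append(int(logits.argmax()))   # pick the most likely bin
    # Regroup the flat token stream into one pose per time step.
    return [generated[i:i + tokens_per_pose]
            for i in range(0, len(generated), tokens_per_pose)]
```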

Figure 3. Overview of model architecture. Our model architecture utilizes visual and point cloud-based language models as backbones and extends them by incorporating extended vocabularies for trajectory tokenization.

This architecture is powerful because it leverages the “common sense” reasoning capabilities of Large Language Models and applies them to physical motion.


Experiments & Key Results

The researchers evaluated their models on the HOT3D dataset, a high-quality dataset with ground-truth 3D tracking. They compared their VLM-based approach against traditional methods like Seq2Seq models and the “Uncertainty-aware State Space Transformer” (USST).

1. VLMs Outperform Traditional Baselines

The results showed that VLM-based models (specifically PointLLM and BLIP-2) significantly outperformed standard Seq2Seq models.

Table 2. Comparison of 3DoF and 6DoF object manipulation trajectory generation.

Notably, PointLLM (which uses point clouds as input) achieved the best results in 3D position accuracy (ADE/FDE, i.e., Average and Final Displacement Error). This highlights that for 3D manipulation tasks, giving the model explicit 3D geometric data (point clouds) is far superior to giving it flat 2D images alone.

2. The Power of “Probabilistic Sampling”

One difficulty in robotics is that there is often more than one right way to do something. You can pick up a cup by the handle or by the rim. Because the model is based on an LLM, it is probabilistic. By using techniques like nucleus sampling, the model can generate multiple different valid trajectories for the same prompt. The researchers found that generating just 10 samples and keeping the best one reduced the displacement error significantly.
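
A sketch of this best-of-N evaluation protocol, assuming a sampling-based `generate` function and ground-truth poses to score against (ADE/FDE as defined above):

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean position error over all time steps."""
    return float(np.linalg.norm(pred[:, :3] - gt[:, :3], axis=1).mean())

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: position error at the last time step."""
    return float(np.linalg.norm(pred[-1, :3] - gt[-1, :3]))

def best_of_n(generate, gt: np.ndarray, n: int = 10) -> float:
    """Sample n trajectories (e.g. via nucleus sampling) and keep the best ADE."""
    return min(ade(generate(), gt) for _ in range(n))
```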

3. Qualitative Understanding

Perhaps the most interesting result is qualitative. The model doesn’t just memorize paths; it seems to understand the semantics of the verbs.

  • When asked to “Transfer” a spoon, it generates a lifting and moving trajectory.
  • When asked to “Stir” with the spoon, the trajectory stays within the bowl and moves in a circular/agitated motion.

Figure 6. PointLLM [97] results with different action descriptions. The generations are visualized with 3D bounding boxes.

4. Scale Matters

The researchers also analyzed how the size of their mined dataset affected performance. As expected, there is a clear trend: more training data yields more accurate trajectories.

Figure 5. Comparison of performance across different dataset scales for PointLLM [97].

The error rates (ADE/FDE) drop consistently as they train on more of the EgoTraj dataset. This validates their automated extraction pipeline—even if the automatically mined data is slightly noisy, the sheer scale of it provides a strong learning signal.


Conclusion and Future Implications

This paper represents a significant step toward “General Purpose” robots. By successfully automating the extraction of 6DoF trajectories from ubiquitous video data, the researchers have unlocked a massive source of training material that was previously inaccessible to roboticists.

Key Takeaways:

  1. Video is a Goldmine: We don’t always need expensive labs to teach robots. We can mine knowledge from existing video archives using clever AI pipelines.
  2. Motion is Language: Treating physical trajectories as sequences of tokens allows us to use powerful Transformers and LLMs to solve robotic control problems.
  3. Geometry Matters: While 2D Vision-Language models are good, models that explicitly ingest 3D data (like point clouds) are superior for manipulation tasks.

While the current framework is limited to rigid objects (sorry, laundry-folding robots!), it sets a new baseline. As computer vision models like SAM and Depth Anything improve, the quality of this “mined” data will only get better, bringing us closer to robots that can watch a cooking tutorial and immediately start chopping vegetables alongside us.