Imagine teaching someone how to repair a bicycle. You rarely hand them a list of joint coordinates or rotation matrices. Instead, you show them. You demonstrate how the hand should grip the wrench, the specific twisting motion required, and exactly where the fingers need to apply pressure.
In the world of robotics and computer vision, this natural form of instruction—demonstrating the “how”—is incredibly difficult to replicate. Most current systems rely on precise 3D models of objects to plan interactions. But what happens when we want an agent to interact with everyday objects—things that are thin, transparent, deformable, or simply don’t have a pre-existing 3D scan?
This is the challenge addressed in the paper “How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions.” The researchers introduce LatentAct, a novel framework that predicts 3D hand motions and contact maps from a single image. By learning a “vocabulary” of stylized hand movements, LatentAct can forecast how a hand should interact with an object, even without a perfect 3D model of that object.

As shown in Figure 1 above, the system takes an input image, a text description (like “push graphic card”), and a contact point, and synthesizes a realistic sequence of hand motions.
In this post, we will dissect how LatentAct achieves this, from its clever use of latent codebooks to its massive semi-automatic data engine.
The Core Problem: The Gap in 3D Interaction
To understand why LatentAct is necessary, we first need to look at the limitations of current Human-Object Interaction (HOI) research.
Existing methods usually fall into two camps:
- Constrained Setups: They work well with pick-and-place tasks involving rigid objects where the 3D geometry is known perfectly.
- 2D Limitations: They work on natural videos but only predict 2D bounding boxes or segmentation masks, lacking the 3D spatial understanding required for a robot or AR system to actually perform the task.
The real world is messy. Objects like bags are deformable; glasses are transparent; tools are occluded by the hands holding them. Obtaining a clean 3D mesh for every object in a kitchen is impossible.
The authors of LatentAct propose a shift in perspective. Instead of obsessing over the exact geometry of the object, they focus on the hand. They observe that while objects are infinitely diverse, the way hands interact with them is “stylized.” There are only a few prototypical ways to screw something in, whether it’s a tripod leg or a jar lid. If a model can learn these prototypical motions, it can generalize to new objects.
The Method: Tokenizing Interaction Trajectories
The heart of LatentAct is the idea of tokenizing interactions. Just as Large Language Models (LLMs) break down text into tokens, LatentAct breaks down physical interactions into a learned codebook of motions.
1. Representing the Hand and Contact
Before building the model, the researchers needed a robust way to represent the interaction. They use the MANO hand model, which provides a parametric mesh of the hand.
However, the position of the hand isn’t enough. We need to know how it touches the object.

As illustrated in Figure 2, the team defines a Contact Map. This is a binary mask over the 778 vertices of the MANO hand mesh. For every timestep in a trajectory, the model tracks:
- Hand Pose: The shape and joint angles.
- Global Trajectory: The movement in 3D space relative to the camera.
- Contact Map: Which specific parts of the hand are touching the object (highlighted in red in the figure).
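To make this representation concrete, here is a minimal sketch of what one timestep of such an interaction trajectory might look like in code. It assumes the common MANO parameterization (48 axis-angle pose parameters including global rotation, 10 shape coefficients, 778 mesh vertices); the field names and class layout are illustrative, not the paper's actual data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionFrame:
    """One timestep of a hand-object interaction trajectory (illustrative layout)."""
    hand_pose: np.ndarray      # (48,)  MANO axis-angle joint rotations (incl. global rotation)
    hand_shape: np.ndarray     # (10,)  MANO shape coefficients
    global_transl: np.ndarray  # (3,)   wrist translation in the camera frame
    contact_map: np.ndarray    # (778,) binary mask: 1 if that mesh vertex touches the object

@dataclass
class InteractionTrajectory:
    frames: list[InteractionFrame]  # T timesteps

    def contact_ratio(self) -> float:
        """Fraction of hand vertices in contact, averaged over the trajectory."""
        return float(np.mean([f.contact_map.mean() for f in self.frames]))
```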
2. The Interaction Codebook (InterCode)
The first stage of the framework is training the Interaction Codebook. This is a Vector Quantized Variational AutoEncoder (VQ-VAE).
The goal of this module is to learn a latent “dictionary” of affordances.
- Input: A sequence of 3D hand poses and contact maps (the ground truth trajectory).
- Encoder: Compresses this trajectory into a lower-dimensional feature space.
- Quantization: The features are mapped to the nearest entry in a learned “codebook.” This effectively snaps the continuous motion to a discrete “token” representing that specific type of movement (e.g., a specific type of grasp or twist).
- Decoder: Reconstructs the original trajectory from the codebook entry.

This process forces the model to learn the fundamental, reusable patterns of human hand interaction. It acts as a prior—a memory bank of “legal” and “natural” ways hands move.
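To see what "snapping to the nearest codebook entry" means in practice, here is a minimal VQ-VAE-style quantization layer in PyTorch. The codebook size and feature dimension are placeholders, and the paper's actual encoder, decoder, and training losses are not reproduced here; this is only a sketch of the quantization step itself.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: snap each encoder feature to its nearest codebook entry."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learned "dictionary" of motion tokens

    def forward(self, z_e: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z_e: (B, dim) continuous features from the trajectory encoder
        distances = torch.cdist(z_e, self.codebook.weight)  # (B, num_codes)
        indices = distances.argmin(dim=-1)                  # discrete "token" per trajectory
        z_q = self.codebook(indices)                        # quantized embedding
        # Straight-through estimator so gradients still flow back to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```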
3. The Indexer and Predictor
Once the Codebook is trained to understand movements, the second stage is training the model to predict these movements from visual inputs. This is where the Learned Indexer and Interaction Predictor come in.
At test time, the model doesn’t have the full trajectory (that’s what we are trying to predict!). It only has:
- A single RGB image of the object.
- A text description (e.g., “open bottle”).
- A 3D contact point (a starting location on the object).

Here is the flow:
- The Indexer: This module takes the image, text, and contact point features and acts as a retrieval system. It predicts which “token” (index) from the Codebook best fits the situation. It effectively asks: “Given this picture of a bottle and the command ‘open’, which motion from my memory bank should I use?”
- The Interaction Predictor: Once the Indexer retrieves the latent embedding from the Codebook, the Predictor refines it. It takes the retrieved code and aligns it with the specific 3D contact point and visual context of the scene to generate the final sequence of MANO meshes and contact maps.
This two-stage approach—first learning the “language” of motion (Codebook), then learning how to “speak” it based on visual cues (Indexer)—allows the model to generalize much better than trying to predict raw coordinates directly from pixels.
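The test-time flow can be summarized in a few lines of pseudocode. All module names and signatures below are hypothetical stand-ins for the paper's components; the sketch only illustrates the order of operations: encode the conditioning inputs, retrieve a motion token, then decode it into a trajectory.

```python
import torch

@torch.no_grad()
def predict_interaction(image, text, contact_point,
                        feature_encoder, indexer, codebook, predictor):
    """Illustrative inference flow (module names are hypothetical).

    1. Encode the conditioning inputs (image, text prompt, 3D contact point).
    2. The Indexer scores codebook entries and retrieves the best motion token.
    3. The Predictor refines that token, conditioned on the scene, into a
       sequence of MANO parameters and per-vertex contact maps.
    """
    cond = feature_encoder(image=image, text=text, contact=contact_point)  # (1, D)
    logits = indexer(cond)                        # (1, num_codes) scores over codebook entries
    token = logits.argmax(dim=-1)                 # retrieved motion "token"
    z_q = codebook(token)                         # (1, dim) latent motion embedding
    hand_poses, contact_maps = predictor(z_q, cond)  # (T, pose_dim), (T, 778)
    return hand_poses, contact_maps
```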
The Data Engine: Scaling Up 3D Annotations
Deep learning models are only as good as their data. The researchers faced a significant hurdle: there were no large-scale datasets of everyday egocentric videos with accurate 3D hand and contact annotations. Existing datasets were either too small, too artificial, or only had 2D labels.
To solve this, they built a semi-automatic data engine using the HoloAssist dataset.

As shown in Figure 4, the pipeline combines several state-of-the-art tools:
- Input: Egocentric videos of people performing tasks (repairing, cooking, etc.).
- Segmentation: They use SAMv2 (Segment Anything Model 2) to track object masks across video frames.
- Hand Reconstruction: They use HaMeR, a transformer-based model, to predict 3D hand meshes from the images.
- Contact Calculation: By combining the 3D hand mesh with the 2D object mask, they computationally derive the 3D contact points.
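One simple way to combine a 3D hand mesh with a 2D object mask is to project each hand vertex into the image and test whether it lands inside the mask. The sketch below shows that heuristic; the camera intrinsics and the exact contact rule are assumptions for illustration, not necessarily the paper's procedure.

```python
import numpy as np

def estimate_contact_map(hand_vertices: np.ndarray,  # (778, 3) HaMeR mesh, camera coords
                         object_mask: np.ndarray,    # (H, W) binary SAMv2 object mask
                         K: np.ndarray) -> np.ndarray:
    """Rough heuristic: a hand vertex counts as 'in contact' if its 2D
    projection lands inside the object's segmentation mask."""
    # Pinhole projection: u = fx*X/Z + cx, v = fy*Y/Z + cy
    proj = (K @ hand_vertices.T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)

    H, W = object_mask.shape
    contact = np.zeros(len(hand_vertices), dtype=bool)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    contact[inside] = object_mask[uv[inside, 1], uv[inside, 0]].astype(bool)
    return contact
```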
The result is a massive dataset containing 800 tasks across 120 object categories and 24 action categories. This is 2.5x to 10x larger than previous datasets like GRAB or ARCTIC, providing the diversity needed to learn robust interaction priors.
Experiments and Results
The researchers evaluated LatentAct on two primary tasks:
- Forecasting: Predicting the future hand motion given only the start frame and text.
- Interpolation: Predicting the motion given the start frame, end frame, and text.
They tested the model’s ability to generalize to novel objects (things it hasn’t seen before), novel actions, and novel scenes.
Quantitative Success
LatentAct was compared against adapted versions of strong baselines, including HCTFormer (a transformer approach) and HCTDiff (a diffusion-based approach).
The results were decisive. LatentAct achieved lower error rates in Mean Per Joint Position Error (MPJPE) and better F1 scores for contact map accuracy across the board.
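For readers unfamiliar with these metrics, here is a minimal sketch of how both are typically computed, assuming predictions and ground truth are given as 3D joint positions and binary per-vertex contact labels. The exact evaluation protocol (alignment, units, thresholds) follows the paper, not this sketch.

```python
import numpy as np

def mpjpe(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints, over all joints and timesteps."""
    # pred_joints, gt_joints: (T, J, 3)
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())

def contact_f1(pred_contact: np.ndarray, gt_contact: np.ndarray) -> float:
    """F1 score between predicted and ground-truth binary contact maps, e.g. (T, 778)."""
    pred, gt = pred_contact.astype(bool), gt_contact.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return float(2 * precision * recall / max(precision + recall, 1e-8))
```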

Figure 5 highlights the importance of the Data Engine. Both LatentAct and the baseline improve as the amount of training data increases (from 20% to 100%), but LatentAct maintains a consistent lead. This proves that the architecture is capable of ingesting large-scale data to refine its priors.
Qualitative Visualization
Numbers are great, but in computer vision, seeing is believing.

Figure 6 showcases the generated trajectories.
- Columns: The left shows the input. The “Camera View” and “Another View” columns show the predicted 3D hand (white mesh). The “Contact Map” column shows the predicted touch zones (red).
- Performance: Look at the “rotate lens” or “mix/stir coffee” examples. The baseline often produces hands that are jittery or oriented incorrectly. LatentAct, however, produces hand poses that align naturally with the object’s geometry. The contact maps are sharp and localized, indicating the model understands not just that the hand is touching the object, but precisely how to manipulate it.
Why does it work better?
The authors performed ablation studies (removing parts of the model to see what breaks) and found two key insights:
- The Codebook is crucial: Directly predicting motion from images (without the intermediate codebook) performs significantly worse. The codebook acts as a stabilizing “prior,” preventing the model from generating physically impossible or unnatural hand contortions.
- Contact Maps help Hand Pose: Interestingly, training the model to predict the contact map (the red zones on the hand) actually improved the accuracy of the hand’s skeletal pose. Knowing where to touch helps the model figure out how to position the fingers.
Conclusion and Implications
LatentAct represents a significant step forward in synthesizing human-object interactions. By decoupling the “what” (the object) from the “how” (the interaction style), and by leveraging a massive new dataset of everyday activities, the model can hallucinate plausible 3D interactions even for objects it has never seen in 3D before.
Key takeaways for students and researchers:
- Representation Matters: Moving from simple coordinates to "Interaction Trajectories" (Pose + Contact Map) provides a richer supervision signal.
- Priors are Powerful: The VQ-VAE Codebook allows the model to "remember" prototypical movements, making it robust to noisy inputs.
- Data Engines Enable Scale: Using foundational models (like SAM and HaMeR) to label data semi-automatically is a viable strategy to overcome data scarcity in 3D tasks.
This work paves the way for future applications where robots or virtual assistants need to understand how to manipulate the messy, unmodeled world around them—simply by looking at a picture and knowing “how to do that.”