Beyond 2D: Predicting “Where” and “How” Humans Interact in 3D Space
Imagine a robot assistant observing you in the kitchen. You are making tea. You’ve just boiled the water. A truly helpful assistant shouldn’t just recognize that you are currently “standing.” It should anticipate that in the next few seconds, you will walk to the cabinet, reach your arm upward to grab a mug, and then move to the fridge to get milk.
To do this effectively, the AI needs to understand the world not as a flat sequence of video frames, but as a persistent 3D environment. It needs to answer two critical questions: Where will you go next? and How will you move your body to interact with the objects there?
This is the core challenge addressed in the research paper “FICTION: 4D Future Interaction Prediction from Video”. The researchers propose a novel approach that moves beyond traditional 2D video analysis, predicting long-term human behavior in the full context of a 3D scene.

The Problem: The Flat World of Computer Vision
For years, computer vision has excelled at “What.” Models can look at a video frame and tell you “this is a person,” or “they are holding a cup.” More recently, predictive models have attempted to guess what happens next. However, these methods typically suffer from two major limitations:
- The 2D Trap: Most models treat the world as a series of 2D images. They ignore the persistent 3D spatial layout of the room. If the camera looks away from the fridge, a 2D model essentially forgets the fridge exists.
- Short-Term Memory: Existing pose forecasting models usually predict motion only a few seconds into the future (e.g., continuing a walking stride). They struggle to predict complex, multi-step activities that unfold over minutes, like cooking a meal or repairing a bike.
The authors of this paper argue that human movement is inextricably linked to the environment. You don’t just “reach”; you reach for something specific, located at a specific coordinate in the room. Therefore, to predict the future, an AI must understand the 4D context—the 3D space plus the dimension of time.
Enter FICTION: A 4D Approach
The researchers introduce FICTION (Future Interaction prediCTION), a model designed to forecast physically grounded interactions.
The goal is ambitious. Given a video of a person performing an activity, along with a 3D representation of the scene, the model must predict:
- Where: Which objects the person will interact with in the next time period (up to 3 minutes).
- How: The specific 3D body poses (skeletal configuration) the person will adopt during those interactions.
The Insight: Context is King
The driving hypothesis is that you cannot predict the interaction without the environment. If a person is making tea, and the water dispenser is wall-mounted, they will extend their arm. If it’s a pitcher on a low shelf, they will stoop. The “how” is dictated by the “where.”
The Methodology
The FICTION architecture is a study in multimodal fusion. It doesn’t rely on just one type of data; it synthesizes three distinct streams of information to build a comprehensive picture of the past before predicting the future.

1. The Inputs (The “Observation”)
As shown in the paper’s architecture diagram (Figure 2), the model takes in three inputs covering the observation period up to time \(\tau_o\):
- Visual Stream: Egocentric video (first-person view) features extracted using a large pretrained encoder (EgoVLPv2). This provides the semantic context of what is happening (e.g., “cooking”).
- Pose Stream: The sequence of the person’s body movements, represented by SMPL parameters. This captures the biomechanics of the actor.
- Spatial Stream: A voxelized (3D grid) representation of the scene. This tells the model where objects are located relative to the person.
These inputs are processed by specific “Mappers” to translate them into a common language, which is then fed into a Multimodal Transformer Encoder. This encoder learns a rich representation (\(\bar{\mathbf{r}}\)) of the current state of the world and the activity.
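
To make the fusion concrete, here is a minimal PyTorch sketch of the idea: three modality-specific mappers project video, pose, and voxel features into a shared token space, and a transformer encoder attends across all of them to produce a summary representation \(\bar{\mathbf{r}}\). The class names, dimensions, and layer counts are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Sketch of the fusion idea: three modality-specific 'mappers' project
    video, pose, and voxel features into a shared token space, and a
    transformer encoder fuses them into one representation r_bar.
    All dimensions and layer counts here are illustrative assumptions."""

    def __init__(self, d_video=768, d_pose=72, d_voxel=512, d_model=256):
        super().__init__()
        # Modality-specific mappers (hypothetical linear projections).
        self.video_mapper = nn.Linear(d_video, d_model)   # EgoVLPv2-style clip features
        self.pose_mapper = nn.Linear(d_pose, d_model)     # SMPL pose parameters per frame (24 joints x 3)
        self.voxel_mapper = nn.Linear(d_voxel, d_model)   # flattened voxel-grid features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_feats, pose_feats, voxel_feats):
        # Each input: (batch, num_tokens_for_that_modality, feature_dim).
        tokens = torch.cat([
            self.video_mapper(video_feats),
            self.pose_mapper(pose_feats),
            self.voxel_mapper(voxel_feats),
        ], dim=1)                      # concatenate along the token axis
        fused = self.fusion(tokens)    # cross-modal attention over all tokens
        return fused.mean(dim=1)       # pooled summary representation r_bar


enc = MultimodalEncoder()
r_bar = enc(torch.randn(1, 16, 768), torch.randn(1, 16, 72), torch.randn(1, 32, 512))
print(r_bar.shape)  # torch.Size([1, 256])
```

Concatenating tokens from all three modalities before self-attention is what lets the encoder relate, say, a reaching motion in the pose stream to the voxel cell containing the cabinet.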
2. Predicting “Where” (Interaction Location)
The first output of the model is the location prediction. The researchers treat this as a classification problem over the 3D voxel grid. The model learns a function \(\mathcal{F}_o\) that outputs a set of 3D points where interactions will occur.
\[
\mathcal{F}_o(\mathcal{V}, \mathcal{P}) \;=\; \{\, \mathbf{x}_{\tau_k} \,\}
\]
In this equation:
- \(\mathcal{V}\) is the video observation.
- \(\mathcal{P}\) is the pose/location data.
- The output is a set of coordinates \(\mathbf{x}_{\tau_k}\) where the person interacts with an object \(\mathcal{O}\) at a future time.
Essentially, the model looks at the encoded history and “lights up” the voxels in the 3D grid where it believes the person will touch an object in the next 3 minutes.
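
Read this way, the location prediction is just a classifier over voxels. The sketch below scores every cell of a 3D grid from the fused representation; the grid resolution, layer sizes, and 0.5 threshold are assumptions for illustration, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class InteractionLocationHead(nn.Module):
    """Sketch of the 'where' prediction as per-voxel classification:
    given the fused history representation r_bar, score every cell of the
    3D voxel grid with the probability that an interaction happens there
    within the prediction horizon."""

    def __init__(self, d_model=256, grid=(16, 16, 16)):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.Linear(d_model, 512),
            nn.ReLU(),
            nn.Linear(512, grid[0] * grid[1] * grid[2]),
        )

    def forward(self, r_bar):
        logits = self.head(r_bar)           # (batch, X*Y*Z)
        return logits.view(-1, *self.grid)  # (batch, X, Y, Z)


head = InteractionLocationHead()
logits = head(torch.randn(2, 256))
probs = torch.sigmoid(logits)               # per-voxel interaction probability
# "Light up" the most likely voxels, e.g. keep those above a threshold:
predicted_cells = (probs > 0.5).nonzero()   # (n, 4): batch index + voxel coordinates
```

A head like this would typically be trained with a binary cross-entropy loss against the ground-truth interaction voxels, though the paper’s exact objective may differ.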
3. Predicting “How” (Pose Distribution)
Predicting body pose is trickier because it is stochastic. If you are going to open a cabinet, you might use your left hand or your right hand; you might lean in or stand back. There isn’t just one “correct” future pose.
To handle this, FICTION uses a Conditional Variational Autoencoder (CVAE). Instead of predicting a single deterministic pose, it predicts a distribution of likely poses.
\[
\mathcal{F}_p(\mathcal{V}, \mathcal{P}, \mathbf{x}_{\tau_k}) \;=\; \mathbb{P}(\theta, t)
\]
Here, the function \(\mathcal{F}_p\) takes the history (\(\mathcal{V}, \mathcal{P}\)) and a specific future location (\(\mathbf{x}_{\tau_k}\)) predicted in the previous step. It outputs a probability distribution \(\mathbb{P}(\theta, t)\).
During inference (testing), the model can sample from this distribution to generate multiple plausible body poses for interacting with that specific object.
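
Here is a minimal sketch of that idea, assuming a standard CVAE in which the condition is the fused history plus the target interaction location, and the output is a 72-dimensional SMPL pose vector. All layer sizes and the latent dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseCVAE(nn.Module):
    """Sketch of the 'how' prediction as a conditional VAE: the condition is
    the fused history r_bar plus the predicted interaction location x_k, and
    the decoder outputs SMPL pose parameters."""

    def __init__(self, d_cond=256 + 3, d_pose=72, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_pose + d_cond, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, d_latent)
        self.to_logvar = nn.Linear(256, d_latent)
        self.dec = nn.Sequential(
            nn.Linear(d_latent + d_cond, 256), nn.ReLU(), nn.Linear(256, d_pose)
        )

    def forward(self, future_pose, cond):
        # Training: encode the ground-truth future pose given the condition.
        h = self.enc(torch.cat([future_pose, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

    @torch.no_grad()
    def sample(self, cond, n=5):
        # Inference: draw several latents to get multiple plausible poses.
        z = torch.randn(n, self.to_mu.out_features)
        cond = cond.expand(n, -1)
        return self.dec(torch.cat([z, cond], dim=-1))


model = PoseCVAE()
cond = torch.cat([torch.randn(1, 256), torch.tensor([[1.2, 0.3, 0.9]])], dim=-1)
poses = model.sample(cond)  # (5, 72): five plausible SMPL pose vectors
```

The key point is the `sample` method: because the latent is drawn from the prior at test time, the same condition can yield several distinct but plausible poses, capturing the left-hand/right-hand, lean-in/stand-back ambiguity.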
Training the Model
The model is trained using a combination of losses to ensure accuracy in both shape and position. The training objective minimizes the difference between the predicted body mesh and the actual human body recorded in the dataset.

The loss function includes the following terms (a code sketch of the combined objective follows this list):
- Surface Loss (\(P - \hat{P}\)): Measuring the error in the SMPL parameters (body shape/rotation).
- Joint Loss (\(J - \hat{J}\)): Measuring the physical distance error of the 3D skeleton joints.
- KL Divergence: Ensuring the learned distribution is well-formed (standard in CVAE training).
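
Here is a minimal sketch of how these three terms could be combined into a single training objective. The relative weights are assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def cvae_training_loss(pred_params, gt_params, pred_joints, gt_joints,
                       mu, logvar, w_joint=1.0, w_kl=1e-3):
    """Illustrative combination of the three terms described above;
    the weights w_joint and w_kl are placeholder assumptions."""
    surface_loss = F.mse_loss(pred_params, gt_params)   # error in SMPL parameters (P vs P_hat)
    joint_loss = F.mse_loss(pred_joints, gt_joints)     # error in 3D joint positions (J vs J_hat)
    # Standard CVAE KL term pulling q(z | x, condition) toward the unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return surface_loss + w_joint * joint_loss + w_kl * kl
```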
Building the Dataset
One of the significant contributions of this paper is the creation of the training data itself. There wasn’t a pre-existing dataset that combined long-term video, accurate 3D object locations, and human pose interactions.
The researchers built upon the Ego-Exo4D dataset, which provides skilled human activities captured with Aria glasses (offering high-quality SLAM/3D data). However, raw video doesn’t tell you when an interaction happens.
To solve this, they used a clever pipeline involving Large Language Models (LLMs) and geometry:
- 3D Object Detection: They used Detic (an object detector) and mapped pixels to 3D point clouds to find objects like “stove” or “fridge.”
- Pose Extraction: They used WHAM (a state-of-the-art pose estimator) to get the 3D skeleton of the person.
- Interaction Spotting: They used Llama-3, an LLM, to read the video narrations (e.g., “person picks up the cup”) and determine if a physical touch occurred.
- Geometric Verification: They cross-referenced the LLM’s output with geometry. An interaction is confirmed only if the person’s hand is physically inside the 3D bounding box of the mentioned object.

This rigorous process resulted in a dataset of over 100,000 interaction instances across cooking, bike repair, and health scenarios.
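
As a concrete illustration of the geometric-verification step, here is a small sketch of the kind of check involved: an interaction proposed by the LLM is kept only if a hand joint lies inside (or within a small margin of) the object’s 3D bounding box. The axis-aligned box and the 5 cm margin are assumptions; the paper’s pipeline may use different thresholds.

```python
import numpy as np

def hand_inside_bbox(hand_xyz, bbox_min, bbox_max, margin=0.05):
    """Confirm an interaction only if the hand joint lies inside the object's
    axis-aligned 3D bounding box, expanded by an assumed 5 cm tolerance."""
    hand_xyz = np.asarray(hand_xyz)
    return bool(np.all(hand_xyz >= np.asarray(bbox_min) - margin) and
                np.all(hand_xyz <= np.asarray(bbox_max) + margin))

# Example: the narration says "picks up the cup"; the LLM flags a touch,
# and geometry confirms it because the wrist joint falls inside the cup's box.
print(hand_inside_bbox([0.42, 0.91, 1.10],
                       bbox_min=[0.35, 0.85, 1.00],
                       bbox_max=[0.50, 1.00, 1.20]))  # True
```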
Experiments and Results
The researchers compared FICTION against two main types of baselines:
- Autoregressive Models: Transformers that predict the “next token” (like GPT but for motion). These are good at short sequences but tend to drift over long durations.
- Video-to-3D Models: Models that try to infer 3D depth directly from video without explicit 3D inputs.
Quantitative Success
The results were decisive. As reported in Table 1 of the paper, FICTION outperformed all baselines across all metrics.

Key Takeaways from the Data:
- Location Prediction (Left side): In the “Cooking” scenario, FICTION achieved a Precision-Recall AUC of 21.0, compared to 11.2 for the hierarchical baseline (HierVL). That is nearly double the performance.
- Pose Prediction (Right side): The error metric (MPJPE, measured in millimeters) dropped significantly. For cooking, the error went from ~473mm (4D-Humans) down to 229mm with FICTION. (A short sketch of how MPJPE is computed follows this list.)
- Ablation Studies: The rows labeled “w/o video,” “w/o pose,” and “w/o env” reveal something interesting. Removing the environment (“w/o env”) caused the biggest drop in performance. This proves the paper’s core hypothesis: you cannot predict human interaction without knowing the layout of the room.
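
For readers unfamiliar with the metric, MPJPE (Mean Per Joint Position Error) is simply the average Euclidean distance between corresponding predicted and ground-truth 3D joints. A quick sketch, with illustrative tensor shapes:

```python
import torch

def mpjpe(pred_joints, gt_joints):
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints. If joints are in meters,
    multiply by 1000 to report millimeters as in the paper's table."""
    return torch.linalg.norm(pred_joints - gt_joints, dim=-1).mean()

# pred / gt: (num_frames, num_joints, 3) tensors in meters.
pred = torch.randn(10, 24, 3)
gt = torch.randn(10, 24, 3)
print(f"MPJPE: {(mpjpe(pred, gt) * 1000).item():.1f} mm")
```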
Qualitative Visualizations
The numbers are impressive, but the visualizations clarify why the model works better.
In one example (Figure 4 of the paper), we see the model predicting interactions in a kitchen.

Look at the Bike Repair example (bottom row of the image). The model observes the person working on a wheel. It correctly predicts that the person will eventually move to the frame of the bike to reattach the wheel. An autoregressive model might just predict that the person keeps holding the wheel in place. FICTION understands the procedure implies moving to a different location in the 3D space.
Comparing directly with baselines (Figure 5 of the paper), we can see how competitors fail. The autoregressive models (Green) often predict the person floating in empty space or drifting away. The Video-to-3D models (Brown) keep the person somewhat grounded but fail to identify the correct future object. FICTION (Pink) correctly places the ghost-avatar at the stove or the bike rack.

Conclusion and Implications
The FICTION paper represents a significant step forward in embodied AI. By bridging the gap between video understanding and 3D spatial reasoning, it allows machines to anticipate human needs in a much more helpful way.
The implications are vast:
- Assistive Robotics: A robot can open the fridge for you because it knows you are heading there before you arrive.
- Smart Homes: Systems can optimize lighting or appliances based on anticipated movement.
- Collaborative AI: In scenarios like bike repair or surgery, an AI agent can prepare tools for the next step of the procedure, knowing exactly where that step will physically take place.
By treating the world as a 4D space—rich with objects, depth, and time—FICTION brings us closer to AI that truly understands the human dance of daily life.