Solving the Invisible Hand: How ManiVideo Masters 3D Occlusion in Video Generation
If you have ever tried to draw a hand, you know the struggle. Getting the proportions right is hard enough, but the real nightmare begins when the fingers start curling, overlapping, and gripping objects. Suddenly, parts of the hand disappear behind the object, or behind other fingers.
Now, imagine teaching an AI to not just draw this interaction, but to generate a temporally consistent video of it.
This is one of the “final frontiers” in computer vision and generative AI. While we have models that can generate beautiful landscapes or static portraits, generating dexterous hand-object manipulation remains incredibly difficult. Why? Because of occlusion and data scarcity.
In this post, we are diving deep into ManiVideo, a new research paper that proposes a novel framework for generating realistic bimanual (two-handed) manipulation videos. We will explore how the researchers solved the “disappearing finger” problem using a clever Multi-Layer Occlusion representation and how they taught the model to handle objects it has never seen before.
The Twin Challenges: Occlusion and Generalization
Before we get into the solution, let’s concretely define the problem. Current diffusion-based methods for generating hand-object interactions (HOI) usually rely on 2D conditions—like depth maps or segmentation masks—to guide the generation.
This approach has two fatal flaws:
- The Occlusion Blind Spot: A 2D depth map only shows you the surface closest to the camera. If a finger is wrapped around the back of a cup, the 2D map doesn’t know it exists. Consequently, the AI often forgets to generate that finger when it reappears, or it “melts” the hand into the object.
- The Generalization Gap: To learn how hands interact with the world, models need video data. But HOI video datasets are small and contain very few object categories. If you train a model on mugs and bowls, it will have no idea what to do when asked to generate a video of a hand holding a stapler.
Enter ManiVideo
The researchers behind ManiVideo address these issues by moving away from simple 2D conditions. Instead, they introduce a 3D-aware pipeline that “understands” the layers of the scene.
Let’s look at the high-level architecture:

As shown in Figure 2, the system takes raw hand-object signals (pose parameters and 3D meshes) and splits them into two powerful streams of information before feeding them into a Denoising UNet (the core of a diffusion model):
- Multi-Layer Occlusion (MLO) Representation: To handle the geometry and overlaps.
- Object Representation: To handle the appearance and identity of the object.
Let’s break these down.
1. Multi-Layer Occlusion (MLO) Representation
The core innovation of this paper is the MLO representation. Instead of treating the image as a flat surface, the researchers treat the hand-object interaction as a series of layers.
Seeing Through the Obstructions
In a standard depth map, hidden pixels are lost. The MLO strategy, however, renders Occlusion-free Normal Maps (\(H\)). This involves rendering the scene multiple times, layer by layer—from far to near. For example, the system renders the object, then the palm, then the thumb, then the index finger, and so on, independently.
This ensures that even if the index finger is hiding the thumb from the camera’s view, the model still receives a signal that the thumb exists and where it is located in 3D space.
The Confidence Map
Just knowing where the hidden parts are isn’t enough; the model also needs to know what is hidden and what is visible. This is where Occlusion Confidence Maps (\(D\)) come in. Derived from depth, these maps encode the degree of occlusion: darker regions indicate severe occlusion, while lighter regions indicate visibility.
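To make the idea concrete, here is a minimal NumPy sketch of the two conditioning signals on toy depth buffers. It is not the paper’s implementation: the layer geometry, the stand-in “renderer,” and the exact confidence formula are illustrative assumptions. It only shows why per-layer maps preserve occluded parts and how a visibility-based confidence map could be derived.

```python
import numpy as np

H, W = 64, 64

def toy_layer(depth_value, x0, x1):
    """Synthetic depth/normal buffers for one scene layer (object, palm, a finger, ...)."""
    depth = np.full((H, W), np.inf)
    normal = np.zeros((H, W, 3))
    depth[:, x0:x1] = depth_value       # the layer covers a vertical strip
    normal[:, x0:x1] = [0.0, 0.0, 1.0]  # toy normals: facing the camera
    return depth, normal

layers = {
    "object": toy_layer(2.0, 10, 50),
    "palm":   toy_layer(3.0, 20, 60),   # partly behind the object
    "index":  toy_layer(1.5, 30, 40),   # partly in front of the object
}

# A single composited normal map (standard z-buffering) keeps only the closest
# surface per pixel, so anything hidden behind the object simply disappears.
front_depth = np.full((H, W), np.inf)
composited_normal = np.zeros((H, W, 3))
for depth, normal in layers.values():
    closer = depth < front_depth
    front_depth[closer] = depth[closer]
    composited_normal[closer] = normal[closer]

# MLO-style occlusion-free normal maps: every layer keeps its full normal map,
# so the model still receives a signal for surfaces hidden behind other layers.
mlo_normals = np.stack([n for _, n in layers.values()])      # (L, H, W, 3)

# One plausible occlusion confidence map per layer (the paper's exact formula
# may differ): pixels where the layer is the front-most surface are marked
# visible (bright), pixels covered by something closer are marked occluded (dark).
confidence_maps = []
for depth, _ in layers.values():
    covered = np.isfinite(depth)
    visible = covered & (depth <= front_depth + 1e-6)
    conf = np.zeros((H, W))
    conf[visible] = 1.0
    conf[covered & ~visible] = 0.2
    confidence_maps.append(conf)
confidence_maps = np.stack(confidence_maps)                  # (L, H, W)

print(mlo_normals.shape, confidence_maps.shape)
```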

As visualized on the right side of Figure 1, this representation allows the model to “reason” about the scene. It learns that although it can’t see the pinky finger right now, it is located behind the cup, preventing the AI from hallucinating a new pinky or merging the finger into the cup’s surface.
Why MLO Matters: An Ablation
Does this extra complexity actually help? The visual evidence is striking. Below in Figure 5, look at the difference between the baseline (w/o MLO) and the ManiVideo method (Ours).

In the “w/o MLO” results, notice the bounding-box drift: the model struggles to define where the object ends and the hand begins. In the “Ours” results, the grasp is tight, and the boundary between hand and object is crisp. This is the power of explicitly modeling occlusion.
2. Solving Data Scarcity with Objaverse
The second major hurdle is the lack of diverse video data. If you only train on videos of people holding apples, your model won’t generate a convincing video of a person holding a drill.
To solve this, the researchers integrated Objaverse, a massive dataset of 3D objects.
Bridging Static and Dynamic
The clever trick here is how they use static 3D models to improve video generation. They take a 3D object from Objaverse, simulate motion trajectories (rotations and translations), and render this synthetic movement to create training data; a minimal trajectory-sampling sketch follows the list below.
This teaches the model two things:
- Appearance Consistency: By rendering the object from multiple views (front, back, top, bottom), the model learns that a cup looks different from the top than from the side, but it is still the same cup.
- Geometry Consistency: It reinforces the structural integrity of objects during motion.
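Below is a sketch of the kind of motion simulation described above: sample a smooth rigid-motion trajectory for a static mesh and produce the per-frame poses you would hand to a renderer. The sampling scheme and helper names are assumptions for illustration, not the paper’s code.

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix for a unit axis and an angle (radians)."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def sample_object_trajectory(num_frames=16, seed=0):
    """Sample a smooth rigid-motion trajectory (R_t, t_t) for a static Objaverse mesh.

    Toy scheme: pick one random rotation axis and an end translation, then
    interpolate linearly across frames so the motion stays temporally smooth.
    """
    rng = np.random.default_rng(seed)
    axis = rng.normal(size=3)
    total_angle = rng.uniform(0.5 * np.pi, np.pi)      # how far the object turns
    end_translation = rng.uniform(-0.1, 0.1, size=3)   # small positional drift

    poses = []
    for t in np.linspace(0.0, 1.0, num_frames):
        R = rotation_about_axis(axis, t * total_angle)
        trans = t * end_translation
        poses.append((R, trans))
    return poses

# Each (R, t) pose would be applied to the mesh vertices (v -> R @ v + t) and
# rendered to produce one frame of a synthetic "manipulation" clip, giving the
# model multi-view appearance and geometry supervision for unseen objects.
poses = sample_object_trajectory()
print(len(poses), poses[0][0].shape)
```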

Figure 6 shows the impact of this strategy. Without the Objaverse data (“w/o Obja”), the model struggles with the texture and geometry of complex objects (look at the blurred spatula). With the extra data (“Ours”), the object maintains its shape and texture fidelity.
Furthermore, Figure 7 demonstrates that the model can generalize to completely random objects from the Objaverse dataset, maintaining their structure over time.
![Figure 7. Results on objaverse [6]. For each example, the first row shows the generated results, while the second row is the ground truth. The results show that our model learns the consistency of objects from Objaverse.](/en/paper/2412.16212/images/013.jpg#center)
Inside the Architecture: How It Works
We’ve discussed the inputs (MLO and Object data), but how are they actually processed? ManiVideo uses a modified UNet architecture typical of diffusion models, but with specific injection points for these new signals.
Injecting 3D Geometry (\(H\) and \(D\))
The MLO representation is injected into the network in two distinct ways to maximize its impact (a toy sketch of both paths follows the list):
1. Noise Injection (The Pose Guider): The normal maps (\(H\)) are processed by a lightweight “Pose Guider” network (\(G\)). The features extracted are added directly to the noisy latent code (\(z_t\)) at the very beginning of the process. This helps align the generated image spatially right from the start.

2. Cross-Attention Injection: For deeper understanding, both the normal maps (\(H\)) and the confidence maps (\(D\)) are concatenated and processed into embeddings (\(E_F\)). These are injected into the transformer blocks of the UNet via cross-attention. This allows the network to query 3D occlusion information at various stages of the generation process.

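Here is a compact PyTorch sketch of both injection paths: a small convolutional “pose guider” whose output is added to the noisy latent, and an encoder that turns the concatenated normal/confidence maps into token embeddings for cross-attention. Module names, channel counts, and the attention wiring are illustrative assumptions, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    """Downsamples the MLO normal maps H to the latent resolution (a toy G)."""
    def __init__(self, in_ch, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, normal_maps):              # (B, in_ch, H, W)
        return self.net(normal_maps)             # (B, latent_ch, H/8, W/8)

class MLOEmbedder(nn.Module):
    """Turns concatenated normal (H) and confidence (D) maps into tokens E_F."""
    def __init__(self, in_ch, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=8, stride=8)  # patchify

    def forward(self, h_and_d):                  # (B, in_ch, H, W)
        tokens = self.proj(h_and_d)              # (B, dim, H/8, W/8)
        return tokens.flatten(2).transpose(1, 2) # (B, num_tokens, dim)

# Toy shapes: 5 layers x 3 normal channels = 15, plus 5 confidence channels.
B, img = 2, 256
normals = torch.randn(B, 15, img, img)
confidence = torch.randn(B, 5, img, img)
z_t = torch.randn(B, 4, img // 8, img // 8)      # noisy latent in VAE space

# (1) Noise injection: pose-guider features are added to the noisy latent z_t.
z_t = z_t + PoseGuider(in_ch=15)(normals)

# (2) Cross-attention injection: UNet queries attend to the MLO embeddings E_F.
E_F = MLOEmbedder(in_ch=20)(torch.cat([normals, confidence], dim=1))
unet_queries = torch.randn(B, (img // 16) ** 2, 768)   # stand-in UNet features
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
fused, _ = attn(unet_queries, E_F, E_F)
print(z_t.shape, fused.shape)
```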
Injecting Object Appearance
To ensure the object looks like the reference image, the researchers use a separate network called AppearanceNet (\(R\)). This network extracts features from reference images of the object (taken from different angles) and the background.
These features are fused with the main UNet features (\(f_0\)) using convolution:

Additionally, the geometry of the object (represented by point clouds \(P\) and normals \(H_o\)) is injected via cross-attention, ensuring the model respects the rigid shape of the object.
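A rough sketch of this fusion step, under stated assumptions: a reference encoder produces multi-view object/background features, which are concatenated with the UNet features \(f_0\) and fused through a convolution, while the object point cloud is embedded into tokens for cross-attention. Channel counts and module names here are illustrative, not the paper’s.

```python
import torch
import torch.nn as nn

# Stand-in shapes: batch B, UNet features f0 at 32x32 with 320 channels.
B, C, Hf, Wf = 2, 320, 32, 32
f0 = torch.randn(B, C, Hf, Wf)                       # main UNet features

# AppearanceNet (R): encodes multi-view reference images of the object plus the
# background into a feature map aligned with f0 (a toy encoder here).
ref_images = torch.randn(B, 4, 3, 256, 256)          # e.g. 4 reference views
appearance_net = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(128, C, 3, stride=2, padding=1),
)
views = ref_images.flatten(0, 1)                     # (B*4, 3, 256, 256)
ref_feats = appearance_net(views).view(B, 4, C, Hf, Wf).mean(dim=1)

# Fuse the appearance features into the UNet stream with a convolution.
fuse = nn.Conv2d(2 * C, C, kernel_size=3, padding=1)
f0 = fuse(torch.cat([f0, ref_feats], dim=1))

# Object geometry: embed the point cloud P (with per-point normals) into tokens
# that the UNet attends to via cross-attention, so the rigid shape is respected.
P = torch.randn(B, 1024, 6)                          # xyz + normal per point
point_embed = nn.Linear(6, 768)
geo_tokens = point_embed(P)                          # (B, 1024, 768)
print(f0.shape, geo_tokens.shape)
```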

Training Strategy: The Two-Stage Approach
Training a model on such disparate data sources (video, 3D objects, human images) requires a careful strategy. The authors use a two-stage training process (a schematic of the schedule follows the list):
- Image Stage: The model is trained to generate single frames. Here, they mix real HOI video frames with synthetic renderings from Objaverse. This teaches the model spatial structure and object diversity.
- Temporal Stage: The spatial layers are frozen, and “temporal layers” (motion modules) are added. The model is then trained on video sequences to learn how hands and objects move smoothly over time.
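The key mechanic of the second stage is parameter freezing: the spatial weights learned in stage one stay fixed while only the motion modules update. The helper below is a schematic sketch of that setup (module names, optimizer, and learning rate are assumptions), not the authors’ training code.

```python
import torch

def configure_stage(spatial_unet, motion_modules, stage):
    """Return an optimizer over the parameters trained in each stage.

    'image':    all spatial UNet layers train on single frames mixing real HOI
                frames with Objaverse renderings.
    'temporal': spatial layers are frozen; only the inserted motion modules
                train, on video clips, so motion becomes smooth over time.
    """
    if stage == "image":
        for p in spatial_unet.parameters():
            p.requires_grad = True
        trainable = spatial_unet.parameters()
    elif stage == "temporal":
        for p in spatial_unet.parameters():
            p.requires_grad = False          # keep the learned spatial structure
        for p in motion_modules.parameters():
            p.requires_grad = True
        trainable = motion_modules.parameters()
    else:
        raise ValueError(f"unknown stage: {stage}")
    return torch.optim.AdamW(trainable, lr=1e-5)

# Usage sketch:
#   opt = configure_stage(unet, motion, "image")     # stage 1: single frames
#   opt = configure_stage(unet, motion, "temporal")  # stage 2: video clips
```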
Experimental Results
So, how does it compare to the state-of-the-art? The researchers compared ManiVideo against leading methods like HOGAN, Affordance Diffusion (ADiff), and ControlNet-based Diffusion (CDiff).
Qualitative Comparison
Figure 3 (below) shows results on the DexYCB dataset. Pay attention to the hands. In the HOGAN and CDiff results, fingers often look garbled or detached. ManiVideo (Ours) maintains a coherent hand structure, even when fingers are interlocking or partially hidden.
![Figure 3. Qualitative comparison of different methods on DexYCB dataset [4]. Our results perform best in cases of hand-object mutual occlusion and finger self-occlusion.](/en/paper/2412.16212/images/008.jpg#center)
Similarly, on their collected dataset (Figure 4), ManiVideo shows superior stability. The contact points between the fingers and the objects are much more realistic compared to ADiff, where the fingers sometimes fail to touch the object properly.

Quantitative Comparison
The numbers back up the visuals. In Table 1, ManiVideo achieves the best scores across the board.
- FID (Fréchet Inception Distance): Measures image quality (lower is better). ManiVideo scores 49.96 vs ADiff’s 53.95.
- MPJPE (Mean Per-Joint Position Error): Measures how accurately the hand pose in the generated frames matches the ground truth (lower is better). ManiVideo achieves the lowest error, indicating high geometric accuracy; the metric’s computation is sketched after this list.
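As a refresher, MPJPE is the standard pose metric: average the Euclidean distance between each predicted joint and its ground-truth counterpart. The toy example below assumes 21 MANO-style hand joints and millimetre units; how the joints are extracted from generated videos is an evaluation detail described in the paper.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth joints. Shapes: (num_frames, num_joints, 3)."""
    assert pred_joints.shape == gt_joints.shape
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

# Toy example: 16 frames, 21 hand joints, millimetre units assumed.
gt = np.random.rand(16, 21, 3) * 100
pred = gt + np.random.normal(scale=5.0, size=gt.shape)   # ~5 mm of joint error
print(f"MPJPE: {mpjpe(pred, gt):.2f} mm")
```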

Going Further: Human-Centric Video Generation
One of the most exciting implications of this work is its application to full-body human video generation. Because the framework is flexible, the researchers could fine-tune it on human-centric datasets (like Human4DiT).
By using a human reference image as the “background” condition and optionally injecting a skeleton pose (\(S\)) via a pose guider:

…the model can generate videos of specific people manipulating objects.

As seen in Figure 8, the model preserves the identity of the person (the reference) while animating their hands to interact with objects, opening up possibilities for virtual avatars and advanced video editing.
Conclusion
ManiVideo represents a significant step forward in generating dynamic, interacting worlds. By explicitly modeling what we can’t see (occlusions) and leveraging massive 3D datasets to understand object geometry, the researchers have created a system that can handle the complex dance of fingers and objects.
For students and researchers in the field, the key takeaway is the power of structured representations. Relying on raw 2D pixel data or simple depth maps often isn’t enough for complex 3D tasks. Sometimes, you need to break the scene down into layers—normals, confidence maps, and separate object priors—to give the AI the context it needs to create a convincing reality.