Solving the Invisible Hand: How ManiVideo Masters 3D Occlusion in Video Generation

If you have ever tried to draw a hand, you know the struggle. Getting the proportions right is hard enough, but the real nightmare begins when the fingers start curling, overlapping, and gripping objects. Suddenly, parts of the hand disappear behind the object, or behind other fingers.

Now, imagine teaching an AI to not just draw this interaction, but to generate a temporally consistent video of it.

This is one of the “final frontiers” in computer vision and generative AI. While we have models that can generate beautiful landscapes or static portraits, generating dexterous hand-object manipulation remains incredibly difficult. Why? Because of occlusion and data scarcity.

In this post, we are diving deep into ManiVideo, a new research paper that proposes a novel framework for generating realistic bimanual (two-handed) manipulation videos. We will explore how the researchers solved the “disappearing finger” problem using a clever Multi-Layer Occlusion representation and how they taught the model to handle objects it has never seen before.

The Twin Challenges: Occlusion and Generalization

Before we get into the solution, let’s concretely define the problem. Current diffusion-based methods for generating hand-object interactions (HOI) usually rely on 2D conditions—like depth maps or segmentation masks—to guide the generation.

This approach has two fatal flaws:

  1. The Occlusion Blind Spot: A 2D depth map only shows you the surface closest to the camera. If a finger is wrapped around the back of a cup, the 2D map doesn’t know it exists. Consequently, the AI often forgets to generate that finger when it reappears, or it “melts” the hand into the object.
  2. The Generalization Gap: To learn how hands interact with the world, models need video data. But HOI video datasets are small and contain very few object categories. If you train a model on mugs and bowls, it will have no idea what to do when asked to generate a video of a hand holding a stapler.

Enter ManiVideo

The researchers behind ManiVideo address these issues by moving away from simple 2D conditions. Instead, they introduce a 3D-aware pipeline that “understands” the layers of the scene.

Let’s look at the high-level architecture:

Figure 2. The overall framework of ManiVideo. Given raw hand-object signals, we first transform them into multi-layer occlusion (MLO) representation and object representation. MLO structure is designed to enforce the 3D consistency of HOI, which includes occlusion-free normal maps H and occlusion confidence maps D. Object representation contains the appearance and geometry information, ensuring the dynamic consistency of objects. Then, we inject MLO representation and object representation into the denoising UNet and AppearanceNet.

As shown in Figure 2, the system takes raw hand-object signals (pose parameters and 3D meshes) and splits them into two powerful streams of information before feeding them into a Denoising UNet (the core of a diffusion model):

  1. Multi-Layer Occlusion (MLO) Representation: To handle the geometry and overlaps.
  2. Object Representation: To handle the appearance and identity of the object.

Let’s break these down.

1. Multi-Layer Occlusion (MLO) Representation

The core innovation of this paper is the MLO representation. Instead of treating the image as a flat surface, the researchers treat the hand-object interaction as a series of layers.

Seeing Through the Obstructions

In a standard depth map, hidden pixels are lost. The MLO strategy, however, renders Occlusion-free Normal Maps (\(H\)). This involves rendering the scene multiple times, layer by layer—from far to near. For example, the system renders the object, then the palm, then the thumb, then the index finger, and so on, independently.

This ensures that even if the index finger is hiding the thumb from the camera’s view, the model still receives a signal that the thumb exists and where it is located in 3D space.
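To make the layered-rendering idea concrete, here is a minimal sketch (not the authors' code): each part gets its own normal-map buffer, so nothing is lost to occlusion. The `render_normal_map` stub and the layer ordering are assumptions for illustration; a real pipeline would use a mesh renderer such as PyTorch3D or nvdiffrast.

```python
import numpy as np

IMG_H, IMG_W = 256, 256

def render_normal_map(mesh, camera):
    # Placeholder stub: a real implementation would rasterize the mesh's
    # surface normals from the given camera. Zeros keep the sketch runnable.
    return np.zeros((IMG_H, IMG_W, 3), dtype=np.float32)

def build_occlusion_free_normals(layer_meshes, camera):
    """Render each layer (object, palm, individual fingers, ...) on its own,
    so parts hidden in the final composite are still present in H."""
    layers = [render_normal_map(mesh, camera) for mesh in layer_meshes]
    # Stack into a multi-layer tensor: (num_layers, H, W, 3).
    return np.stack(layers, axis=0)

# Hypothetical ordering, far to near, as described above:
# object -> palm -> thumb -> index -> ...
layer_meshes = ["object_mesh", "palm_mesh", "thumb_mesh", "index_mesh"]
H = build_occlusion_free_normals(layer_meshes, camera=None)
print(H.shape)  # (4, 256, 256, 3)
```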

The Confidence Map

Just knowing where the hidden parts are isn’t enough; the model also needs to know which parts are hidden and which are visible. This is where Occlusion Confidence Maps (\(D\)) come in. These depth-based maps encode the degree of occlusion for each layer: darker regions in the confidence map indicate severe occlusion, while lighter regions indicate visibility.
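The post does not spell out exactly how \(D\) is computed, so the following NumPy sketch is just one plausible interpretation: a layer’s pixel gets high confidence where it is the front-most surface and progressively lower confidence the farther it sits behind whatever occludes it. The `scale` constant and the depth convention (`-1` = no geometry) are my own choices for illustration.

```python
import numpy as np

def occlusion_confidence(layer_depths, scale=0.05):
    """One possible reading of the occlusion confidence map D (an assumption,
    not the paper's exact formula). layer_depths: (L, H, W) per-layer depth
    maps, with -1 marking pixels where a layer has no geometry."""
    valid = layer_depths > 0
    big = layer_depths.max() + 1.0
    # Depth of the closest valid surface at every pixel, across all layers.
    nearest = np.where(valid, layer_depths, big).min(axis=0)
    # How far this layer sits behind the front-most surface (0 where visible).
    gap = np.where(valid, layer_depths - nearest[None], 0.0)
    # Confidence ~1 where the layer is visible, decaying with occlusion depth.
    return np.where(valid, np.exp(-gap / scale), 0.0)

# Toy example: a front layer (index finger) partially hiding a back layer (thumb).
d_finger = np.array([[0.5, 0.5], [-1.0, -1.0]])
d_thumb  = np.array([[0.6, -1.0], [0.7, -1.0]])
D = occlusion_confidence(np.stack([d_finger, d_thumb]))
print(D.round(2))  # the hidden thumb pixel gets low confidence (~0.14)
```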

Figure 1. ManiVideo: We propose a novel framework for generalizable and dexterous hand-object manipulation video generation. Left: Given several reference images of unseen objects, our method generates realistic and plausible manipulation video driven by hand-object signals. By integrating multiple datasets, ManiVideo supports applications such as human-centered manipulation video generation. Right: To ensure hand-object consistency, we introduce a multi-layer occlusion representation capable of learning 3D occlusion relationships from occlusion-free normal maps and occlusion confidence maps.

As visualized on the right side of Figure 1, this representation allows the model to “reason” about the scene. It learns that although it can’t see the pinky finger right now, it is located behind the cup, preventing the AI from hallucinating a new pinky or merging the finger into the cup’s surface.

Why MLO Matters: An Ablation

Does this extra complexity actually help? The visual evidence is striking. Below in Figure 5, look at the difference between the baseline (w/o MLO) and the ManiVideo method (Ours).

Figure 5. Ablation study of the multi-layer occlusion (MLO) representation. Without MLO structure, basic 2D conditions fail to ensure accurate structure and occlusion relationships between objects and fingers. Incomplete embedding (w/o MLO*) diminishes the effectiveness of the MLO representation.

In the top row (w/o MLO), notice the bounding box drift—the model struggles to define where the object ends and the hand begins. In the “Ours” column, the grasp is tight, and the boundary between hand and object is crisp. This is the power of explicitly modeling occlusion.

2. Solving Data Scarcity with Objaverse

The second major hurdle is the lack of diverse video data. If you only train on videos of people holding apples, your model won’t generate a convincing video of a person holding a drill.

To solve this, the researchers integrated Objaverse, a massive dataset of 3D objects.

Bridging Static and Dynamic

The clever trick here is how they use static 3D models to improve video generation. They take a 3D object from Objaverse and simulate motion trajectories (rotations and translations). They then render this “fake” movement to create training data.

This teaches the model two things:

  1. Appearance Consistency: By rendering the object from multiple views (front, back, top, bottom), the model learns that a cup looks different from the top than from the side, but it is still the same cup.
  2. Geometry Consistency: It reinforces the structural integrity of objects during motion.
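As a rough illustration of what “simulating motion trajectories” for a static Objaverse asset might look like, here is a small NumPy sketch (my own, not the paper’s pipeline): the object’s vertices are rotated and translated a little more each frame, and each posed frame would then be rendered alongside the hand model. The axis, angle, and translation values are arbitrary.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (radians) about unit `axis`."""
    axis = np.asarray(axis, dtype=np.float64)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def simulate_trajectory(vertices, num_frames=16, axis=(0, 1, 0),
                        total_angle=np.pi / 2, translation=(0.0, 0.0, 0.1)):
    """Turn a static mesh into a short 'video' of poses: rotate and translate
    the vertices a little more each frame. Parameter choices are illustrative."""
    translation = np.asarray(translation)
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        R = rotation_matrix(axis, t * total_angle)
        frames.append(vertices @ R.T + t * translation)
    return np.stack(frames)  # (num_frames, num_vertices, 3)

# Toy "object": a unit cube's corners standing in for an Objaverse mesh.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=np.float64)
poses = simulate_trajectory(cube)
print(poses.shape)  # (16, 8, 3)
```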

Figure 6. Ablation study of object augmentation training. Utilizing Objaverse helps the model learn dynamic consistency from large object datasets.

Figure 6 shows the impact of this strategy. Without the Objaverse data (“w/o Obja”), the model struggles with the texture and geometry of complex objects (look at the blurred spatula). With the extra data (“Ours”), the object maintains its shape and texture fidelity.

Furthermore, Figure 7 demonstrates that the model can generalize to completely random objects from the Objaverse dataset, maintaining their structure over time.

Figure 7. Results on Objaverse [6]. For each example, the first row shows the generated results, while the second row is the ground truth. The results show that our model learns the consistency of objects from Objaverse.

Inside the Architecture: How It Works

We’ve discussed the inputs (MLO and Object data), but how are they actually processed? ManiVideo uses a modified UNet architecture typical of diffusion models, but with specific injection points for these new signals.

Injecting 3D Geometry (\(H\) and \(D\))

The MLO representation is injected into the network in two distinct ways to maximize its impact:

1. Noise Injection (The Pose Guider): The normal maps (\(H\)) are processed by a lightweight “Pose Guider” network (\(G\)). The extracted features are added directly to the noisy latent code (\(z_t\)) at the very beginning of the process, which aligns the generated image spatially right from the start.

Equation 1
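The paper’s Equation 1 isn’t reproduced in this post; based on the description above, a plausible form (my reconstruction, not the verbatim equation) is

\[
\tilde{z}_t = z_t + G(H),
\]

where \(G\) is the Pose Guider and \(H\) the stacked occlusion-free normal maps, so the geometric guidance is baked into the latent before any denoising step runs.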

2. Cross-Attention Injection: For deeper understanding, both the normal maps (\(H\)) and the confidence maps (\(D\)) are concatenated and processed into embeddings (\(E_F\)). These are injected into the transformer blocks of the UNet via cross-attention. This allows the network to query 3D occlusion information at various stages of the generation process.

Equation 2
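Again, a sketch rather than the paper’s exact notation: the MLO maps are embedded and then attended to, roughly

\[
E_F = \phi\big(\mathrm{concat}(H, D)\big), \qquad f' = f + \mathrm{CrossAttn}\big(Q{=}f,\; K{=}V{=}E_F\big),
\]

where \(\phi\) is an embedding network (my label) and \(f\) denotes intermediate UNet features, so the 3D occlusion information can be queried at various stages of generation rather than only at the input.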

Injecting Object Appearance

To ensure the object looks like the reference image, the researchers use a separate network called AppearanceNet (\(R\)). This network extracts features from reference images of the object (taken from different angles) and the background.

These features are fused with the main UNet features (\(f_0\)) using convolution:

Equation 3
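Equation 3 isn’t shown here either; a plausible reading of the fusion step (my reconstruction) is

\[
f_0' = \mathrm{Conv}\big(\mathrm{concat}\big(f_0,\; R(I_{\mathrm{ref}}, I_{\mathrm{bg}})\big)\big),
\]

where \(R(\cdot)\) are the AppearanceNet features of the multi-view object references \(I_{\mathrm{ref}}\) and the background image \(I_{\mathrm{bg}}\) (the symbols \(I_{\mathrm{ref}}\) and \(I_{\mathrm{bg}}\) are my own shorthand).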

Additionally, the geometry of the object (represented by point clouds \(P\) and normals \(H_o\)) is injected via cross-attention, ensuring the model respects the rigid shape of the object.

Equation 4
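For Equation 4, one plausible form of the geometry injection (again my sketch) is

\[
E_O = \psi\big(\mathrm{concat}(P, H_o)\big), \qquad f' = f + \mathrm{CrossAttn}\big(Q{=}f,\; K{=}V{=}E_O\big),
\]

with \(\psi\) an object-geometry encoder (my label), \(P\) the object point cloud, and \(H_o\) its normals.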

Training Strategy: The Two-Stage Approach

Training a model on such disparate data sources (video, 3D objects, human images) requires a careful strategy. The authors use a two-stage training process:

  1. Image Stage: The model is trained to generate single frames. Here, they mix real HOI video frames with synthetic renderings from Objaverse. This teaches the model spatial structure and object diversity.
  2. Temporal Stage: The spatial layers are frozen, and “temporal layers” (motion modules) are added. The model is then trained on video sequences to learn how hands and objects move smoothly over time (see the sketch after this list).
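Here is a minimal PyTorch-style sketch of the temporal stage’s freeze-then-train idea, under the assumption that the motion modules can be identified by name; the tiny model below is a stand-in, not the actual ManiVideo architecture.

```python
import torch
import torch.nn as nn

class TinyVideoUNet(nn.Module):
    """Stand-in for a diffusion UNet: 'spatial' blocks plus added 'temporal' blocks."""
    def __init__(self):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.SiLU(),
                                     nn.Conv2d(32, 4, 3, padding=1))
        # Motion module: mixes features across the frame axis (here a simple 1D conv).
        self.temporal = nn.Conv1d(4, 4, kernel_size=3, padding=1)

model = TinyVideoUNet()

# Temporal stage: freeze the spatial layers trained in the image stage,
# and optimize only the newly added temporal (motion) layers.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("temporal")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the temporal parameters remain trainable
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```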

Experimental Results

So, how does it compare to the state-of-the-art? The researchers compared ManiVideo against leading methods like HOGAN, Affordance Diffusion (ADiff), and ControlNet-based Diffusion (CDiff).

Qualitative Comparison

Figure 3 (below) shows results on the DexYCB dataset. Pay attention to the hands. In the HOGAN and CDiff results, fingers often look garbled or detached. ManiVideo (Ours) maintains a coherent hand structure, even when fingers are interlocking or partially hidden.

Figure 3. Qualitative comparison of different methods on DexYCB dataset [4]. Our results perform best in cases of hand-object mutual occlusion and finger self-occlusion.

Similarly, on their collected dataset (Figure 4), ManiVideo shows superior stability. The contact points between the fingers and the objects are much more realistic compared to ADiff, where the fingers sometimes fail to touch the object properly.

Figure 4. Qualitative comparison of different methods on videos we collect. Our approach achieves the best results.

Quantitative Comparison

The numbers back up the visuals. In Table 1, ManiVideo achieves the best scores across the board.

  • FID (Fréchet Inception Distance): Measures image quality (lower is better). ManiVideo scores 49.96 vs ADiff’s 53.95.
  • MPJPE (Mean Per-Joint Position Error): Measures how accurately the hand pose is reconstructed. ManiVideo achieves the lowest error, indicating high geometric accuracy (the metric itself is sketched below).
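For readers new to the metric, here is the standard MPJPE computation in NumPy; the exact evaluation protocol (for instance, which pose estimator is run on the generated frames) isn’t covered in this post, so treat this as the generic definition rather than the paper’s exact setup.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints, over all joints (and frames)."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

# Toy example: 2 frames x 21 hand joints x 3 coordinates.
gt = np.zeros((2, 21, 3))
pred = gt + 0.01          # every joint off by 1 cm along each axis
print(mpjpe(pred, gt))    # ~0.0173, in the same units as the joints
```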

Table 1. Quantitative comparison on DexYCB and our dataset. Our ManiVideo outperforms other methods.

Going Further: Human-Centric Video Generation

One of the most exciting implications of this work is its application to full-body human video generation. Because the framework is flexible, the researchers could fine-tune it on human-centric datasets (like Human4DiT).

By using a human reference image as the “background” condition and optionally injecting a skeleton pose (\(S\)) via a pose guider:

Equation 5

…the model can generate videos of specific people manipulating objects.
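Equation 5 isn’t reproduced above; one plausible form, extending the Equation 1 sketch with the skeleton condition (my reconstruction), is

\[
\tilde{z}_t = z_t + G(H) + G_S(S),
\]

where \(G_S\) is a pose guider for the skeleton \(S\) (the subscript is my labeling; a single guider over concatenated conditions would serve the same purpose).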

Figure 8. Human-based hand-object manipulation video generation. Using human reference images as input and training on human-centered datasets, our model is capable of generating human-centric hand-object manipulation videos.

As seen in Figure 8, the model preserves the identity of the person (the reference) while animating their hands to interact with objects, opening up possibilities for virtual avatars and advanced video editing.

Conclusion

ManiVideo represents a significant step forward in generating dynamic, interacting worlds. By explicitly modeling what we can’t see (occlusions) and leveraging massive 3D datasets to understand object geometry, the researchers have created a system that can handle the complex dance of fingers and objects.

For students and researchers in the field, the key takeaway is the power of structured representations. Relying on raw 2D pixel data or simple depth maps often isn’t enough for complex 3D tasks. Sometimes, you need to break the scene down into layers—normals, confidence maps, and separate object priors—to give the AI the context it needs to create a convincing reality.