Solved: Why AI Struggles with Hands and How FoundHand Fixes It
If you have ever played with generative AI tools like Midjourney or Stable Diffusion, you have likely encountered the “hand problem.” You prompt for a photorealistic image of a person, and the face looks perfect, the lighting is cinematic, but the hands are a disaster. Extra fingers, impossible joints, or what looks like a bowl of spaghetti made of flesh.
Why is this specific body part the Achilles’ heel of modern AI? The answer lies in complexity. Hands are highly articulated, capable of intricate self-occlusion (fingers hiding behind other fingers), and appear in endless orientations. Furthermore, in the massive datasets used to train models like Stable Diffusion, hands often occupy a very small number of pixels relative to the whole image, meaning the model rarely gets a “good look” at them during training.
In this post, we are diving deep into FoundHand, a new research paper that tackles this problem head-on. The researchers propose a domain-specific model trained on a massive new dataset that doesn’t just generate hands—it controls them with surgical precision.

As shown above, FoundHand is not just a generator; it is a comprehensive system capable of gesture transfer, fixing malformed images, and even synthesizing video, all driven by a new dataset called FoundHand-10M.
The Data Bottleneck: Introducing FoundHand-10M
Before we can fix the model, we must fix the data. General-purpose datasets (like LAION-5B) are broad but shallow regarding specific anatomical structures. Existing hand-specific datasets are often too small, captured in sterile lab environments, or lack diverse lighting and textures.
The researchers’ first contribution is FoundHand-10M, a massive dataset comprising 10 million hand images. Instead of capturing new data from scratch, they employed a clever aggregation strategy. They combined 12 existing datasets—ranging from egocentric views (Ego4D) to sign language datasets and motion capture libraries (DexYCB, ARCTIC).
The Unification Challenge
Merging datasets isn’t as simple as copying files into a folder. Each dataset has different annotation formats. Some use 3D meshes, others use 2D bounding boxes. To create a unified training ground, the authors re-annotated the entire collection using a standardized pipeline:
- MediaPipe was used to extract 2D keypoints (the skeletal structure of the hand).
- Segment Anything Model (SAM) was used to generate precise segmentation masks (separating the hand from the background).
This resulted in a dataset rich in diversity—containing hands holding objects, interacting with other hands, and performing gestures—all unified by a common “language” of 2D keypoints and masks.
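The paper's exact annotation code isn't reproduced here, but a minimal sketch of such a MediaPipe + SAM labeling pass might look like the following. The function name, the prompting strategy (using the detected keypoints as positive point prompts for SAM), and the assumption that a `SamPredictor` has already been constructed are illustrative choices, not the authors' pipeline.

```python
import cv2
import mediapipe as mp
import numpy as np

def annotate_hand_image(image_bgr, sam_predictor):
    """Sketch: extract 21 2D keypoints per hand with MediaPipe, then prompt
    SAM with those keypoints to get a segmentation mask for each hand."""
    h, w = image_bgr.shape[:2]
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None  # no hand detected; skip this image

    # 21 (x, y) pixel coordinates per detected hand
    keypoints = [
        np.array([[lm.x * w, lm.y * h] for lm in hand.landmark])
        for hand in result.multi_hand_landmarks
    ]

    # Use the keypoints as positive point prompts for SAM
    sam_predictor.set_image(rgb)
    masks = []
    for kps in keypoints:
        mask, _, _ = sam_predictor.predict(
            point_coords=kps,
            point_labels=np.ones(len(kps)),
            multimask_output=False,
        )
        masks.append(mask[0])
    return keypoints, masks
```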
The FoundHand Architecture
The core of this research is the FoundHand model itself. It is a pose-conditioned diffusion model. Unlike text-to-image models where you type “a hand waving,” here you provide a visual condition: the skeletal pose you want the hand to take.
Why 2D Keypoints?
Previous attempts at controllable hand generation often relied on 3D meshes (like the MANO model). While accurate, 3D meshes are computationally expensive and difficult to obtain for “in-the-wild” images.
The authors’ key insight is that 2D keypoints are a universal representation. A 2D projection of a hand naturally encodes both the articulation (how the fingers are bent) and the camera viewpoint. If you know where the 2D joints are, you implicitly know how the hand is oriented relative to the camera.
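To see concretely why a 2D projection carries viewpoint information, consider a toy pinhole-camera projection (not code from the paper; the focal length and principal point below are arbitrary):

```python
import numpy as np

def project_to_2d(joints_3d, R, t, focal=500.0, principal=128.0):
    """Toy pinhole projection of 3D hand joints into 2D pixel coordinates.

    joints_3d: (21, 3) joint positions in world coordinates.
    R, t:      camera rotation (3, 3) and translation (3,).
    """
    cam = joints_3d @ R.T + t           # world -> camera coordinates
    uv = cam[:, :2] / cam[:, 2:3]       # perspective divide by depth
    return focal * uv + principal       # scale to pixel coordinates
```

Rotate the camera (change `R`) and the same finger articulation produces a different 2D keypoint layout, so a model that reads 2D keypoints is implicitly reading the viewpoint as well.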
The Model Pipeline
Let’s look at how the model actually processes data. FoundHand treats generation as an image-to-image translation task.

As illustrated in Figure 2, the workflow involves two main inputs:
- Reference: This provides the visual style (skin tone, lighting, texture).
- Target: This provides the structure (the desired pose).
The architecture builds upon a Vision Transformer (ViT) backbone, specifically a Diffusion Transformer (DiT). Here is the step-by-step flow:
- Input Encoding: The model takes the reference image, reference keypoints, and reference mask. It also takes the target keypoints (the pose we want).
- Heatmap Representation: Instead of feeding raw coordinate values, the 2D keypoints are converted into Gaussian heatmaps, one channel per joint: 21 channels for the right hand and 21 for the left (a minimal sketch of this conversion follows the list). This channel separation helps the model handle occlusion; if one finger hides behind another, their heatmaps still live on separate channels, preventing the "merging" artifacts common in other models.
- Shared Embedder: The image features (from a VAE), heatmaps, and masks are spatially aligned and passed through a shared embedder. This grounds the visual features in the physical structure of the hand.
- 3D Self-Attention: This is the "magic" component. The transformer uses self-attention not just within a single image, but across the reference and target frames (see the attention sketch after this list). This allows the model to "copy" the texture of a ring or a tattoo from the reference image and "paste" it accurately onto the geometry of the target pose.
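To make the heatmap step concrete, here is a minimal sketch of converting keypoints into per-joint Gaussian heatmaps. The channel ordering, image size, and Gaussian width are illustrative assumptions, not values from the paper.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, image_size=256, sigma=4.0):
    """Render 2D keypoints as per-joint Gaussian heatmaps.

    keypoints: (42, 2) pixel coordinates, 21 joints per hand
               (assumed ordering: right hand first, then left);
               NaN marks a joint that is absent or undetected.
    Returns:   (42, image_size, image_size) float array, one channel per joint.
    """
    ys, xs = np.mgrid[0:image_size, 0:image_size]
    heatmaps = np.zeros((len(keypoints), image_size, image_size), dtype=np.float32)
    for i, (x, y) in enumerate(keypoints):
        if np.isnan(x) or np.isnan(y):
            continue  # leave the channel empty for missing joints
        heatmaps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps
```

And here is a rough sketch of the cross-frame attention idea: reference and target tokens are concatenated so that target patches can attend to, and borrow appearance from, the reference. The dimensions and module layout are illustrative, not the paper's exact block.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Joint self-attention over reference and target tokens (illustrative)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ref_tokens, target_tokens):
        # ref_tokens, target_tokens: (batch, num_patches, dim)
        tokens = torch.cat([ref_tokens, target_tokens], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)   # every token sees both frames
        return out[:, ref_tokens.shape[1]:]          # return the target half
```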
Training Strategy
To make the model robust, the researchers trained it on pairs of images.
- Motion Pairs: Two frames from the same video (same hand, different pose). This teaches the model how hands move.
- Multi-view Pairs: Two cameras looking at the same frozen hand moment. This teaches the model about 3D structure and viewpoints.
They also employed stochastic conditioning, randomly dropping out the reference image during training. This forces the model to learn a strong prior—it learns what a hand should look like even without a reference, which is vital for fixing malformed hands from scratch.
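A minimal sketch of what this pairing strategy could look like in a data loader is shown below. The 50/50 split between pair types and the 10% reference-dropout rate are illustrative guesses, not the paper's hyperparameters.

```python
import random

def sample_training_pair(video_clips, multiview_sets, drop_prob=0.1):
    """Sketch of the pairing strategy: motion pairs vs. multi-view pairs,
    with the reference occasionally dropped so the model learns a strong
    unconditional hand prior."""
    if random.random() < 0.5:
        # Motion pair: two frames of the same hand from one video clip
        clip = random.choice(video_clips)
        ref, target = random.sample(clip, 2)
    else:
        # Multi-view pair: two synchronized cameras viewing the same pose
        views = random.choice(multiview_sets)
        ref, target = random.sample(views, 2)

    if random.random() < drop_prob:
        ref = None  # drop the reference image; keep only the target keypoints
    return ref, target
```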
Core Capabilities
Once trained, FoundHand unlocks several powerful capabilities that go far beyond simple image generation.
1. Gesture Transfer
This is the most direct application. You have a photo of a hand (Reference) and a stick figure of a pose (Target). The model generates the hand in the new pose while preserving the identity.

In the figure above, look at the bottom row (Ours). Notice how FoundHand preserves fine-grained details like fingernail polish, skin texture, and lighting conditions.
Comparing this quantitatively reveals a significant gap between FoundHand and previous state-of-the-art methods:

The table highlights that FoundHand achieves superior scores in PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), metrics that measure pixel-level and structural fidelity. It also scores lowest on LPIPS (perceptual distance) and FID (distribution-level realism), where lower is better, meaning the generated images look more natural and stay closer to the reference.
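For reference, PSNR and SSIM are standard reconstruction metrics and can be computed with scikit-image as sketched below (generic evaluation code, not the authors'). LPIPS and FID require learned networks (e.g. the lpips and pytorch-fid packages) and are omitted here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fidelity_scores(generated, ground_truth):
    """PSNR and SSIM between a generated image and its ground truth,
    both as uint8 HxWx3 arrays (higher is better for both metrics)."""
    psnr = peak_signal_noise_ratio(ground_truth, generated, data_range=255)
    ssim = structural_similarity(
        ground_truth, generated, channel_axis=-1, data_range=255
    )
    return psnr, ssim
```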
2. Domain Transfer (Sim-to-Real)
Synthetic data (from 3D renders) is easy to generate but looks fake. FoundHand can take a synthetic render and “style transfer” it into a realistic image, effectively bridging the “sim-to-real” gap.

This is incredibly useful for training computer vision models. By converting synthetic datasets into realistic ones, researchers can create better training data for 3D hand pose estimation.

As shown in the table above, when the 3D mesh recovery model (HaMeR) was fine-tuned on data generated by FoundHand, its error rate (PA-MPJPE, the mean per-joint position error after Procrustes alignment) dropped significantly. This shows the generated images aren't just pretty; they are anatomically accurate enough to train other AI models.
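PA-MPJPE is a standard 3D pose metric: align the predicted joints to the ground truth with the best similarity transform (scale, rotation, translation), then average the per-joint Euclidean error. A self-contained sketch of how it is typically computed (generic, not the paper's evaluation code):

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    pred, gt: (J, 3) arrays of predicted and ground-truth 3D joint positions.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g

    # Optimal rotation via SVD (orthogonal Procrustes / Kabsch)
    H = P.T @ G
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = np.trace(D @ np.diag(S)) / (P ** 2).sum()

    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```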
3. Novel View Synthesis (NVS)
Perhaps the most surprising capability is Novel View Synthesis. Given a single image of a hand, FoundHand can generate what that hand looks like from the side, top, or back.

It achieves this without being explicitly trained with 3D camera parameters. Because the model learned from multi-view pairs in the FoundHand-10M dataset, it implicitly understands 3D geometry. It “knows” that if a hand rotates 90 degrees, the thumb usually disappears behind the palm.
The performance here is staggering when compared to models designed specifically for 3D synthesis:

In Figure 6, notice how competitors like ZeroNVS often create blurry or distorted artifacts. FoundHand (Ours) maintains crisp edges and correct lighting. The quantitative data backs this up:

FoundHand achieves a PSNR of 27.72, significantly higher than ImageDream (19.97). It also performs competitively in video generation tasks, despite not being a native video model.
Zero-Shot Applications: The Emergent Behaviors
The true test of a generative model is its ability to perform tasks it wasn’t explicitly trained for (zero-shot learning). FoundHand excels here in two fascinating ways.
Hand Fixing (In-painting)
This is the “killer app” for many users. If you have an AI-generated image with a mangled hand, FoundHand can fix it. You mask out the bad hand, provide a target skeleton (or let the model infer one), and it regenerates a physically plausible hand that matches the lighting and style of the original image.

Unlike other “hand fixers” that paste a generic hand over the image, FoundHand respects the context. If the character is holding a glowing orb or a game controller, FoundHand generates a hand interacting with that object.
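Mechanically, mask-based fixing of this kind usually boils down to regenerating only the masked region and compositing it back into the original image. The sketch below is illustrative; `model.generate` is a hypothetical interface, not FoundHand's released API.

```python
import numpy as np

def fix_hand(image, hand_mask, target_keypoints, model):
    """Sketch of mask-based hand fixing: regenerate the masked hand region
    and composite it back, keeping the rest of the image untouched.
    `model.generate` is a hypothetical interface (assumption)."""
    # The model sees the image with the bad hand masked out, plus the
    # target keypoints describing the hand we want instead.
    masked_input = image * (1 - hand_mask[..., None])
    generated = model.generate(masked_input, target_keypoints)

    # Original pixels outside the mask, generated pixels inside it
    return np.where(hand_mask[..., None].astype(bool), generated, image)
```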
Implicit Object Understanding
This leads to the most intriguing discovery: Object Permanence and Physics.
The model was trained on hands, not objects. However, because the training data contained millions of hands holding things, the model learned about objects by association.

In the figure above, the model is given a reference image of a hand holding a pink sponge. As the target pose closes (simulating a squeeze), the model automatically deforms the pink sponge (squishing it). It was never explicitly taught the physics of sponges; it learned that when a hand closes around a pink object, the pink object gets smaller.
Similarly, in the top row, it understands rigid motion—moving the cup along with the hand rather than leaving it floating in mid-air.
Video Synthesis
Finally, FoundHand can generate coherent video sequences. By using a technique called stochastic conditioning—where each new frame is conditioned on the first frame (for consistency) and the previous frame (for smooth motion)—it creates videos that are temporally stable.
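The frame-by-frame loop the post describes might look roughly like this, where `model.sample` is a hypothetical interface standing in for one full sampling pass of the diffusion model:

```python
def generate_video(model, first_frame, keypoint_sequence):
    """Sketch of autoregressive video sampling: each new frame is conditioned
    on the first frame (appearance anchor) and the previous frame (motion
    continuity). `model.sample` is a hypothetical interface (assumption)."""
    frames = [first_frame]
    for target_keypoints in keypoint_sequence[1:]:
        next_frame = model.sample(
            anchor=first_frame,       # keeps identity and lighting consistent
            previous=frames[-1],      # keeps motion temporally smooth
            target_pose=target_keypoints,
        )
        frames.append(next_frame)
    return frames
```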

Figure 9 compares FoundHand against video-specific diffusion models. FoundHand produces motion that respects the anatomical limits of the human skeleton, whereas other models often allow fingers to bend backward or “noodle” unnaturally during motion.
Conclusion and Future Implications
FoundHand represents a significant step forward in generative AI. By shifting the focus from “bigger models” to “better, domain-specific data,” the researchers have solved one of the most persistent visual artifacts in AI generation.
The success of this approach relies on three pillars:
- Scale: A 10-million image dataset (FoundHand-10M) provides the necessary volume.
- Representation: Using 2D keypoint heatmaps provides a lightweight yet information-dense signal that encodes both pose and view.
- Architecture: The image-to-image translation approach with 3D self-attention ensures that style and structure are perfectly aligned.
The implications extend beyond just pretty pictures. The ability to perform Domain Transfer means we can generate infinite training data for robotics, helping robot hands learn to grasp objects by practicing on AI-generated images. The implicit understanding of object physics suggests that scaling up visual data can teach models a surprising amount about how the physical world works.
While the model is currently limited to a resolution of 256x256 (requiring upscaling for high-res outputs), the methodology proves that we don’t necessarily need complex 3D engines to achieve 3D-consistent results. Sometimes, 2D data—if you have enough of it and look at it the right way—is all you need.