If you have experimented with recent video generation models like Sora, Stable Video Diffusion, or MovieGen, you have likely noticed a recurring pattern. These models can generate breathtaking landscapes, cyberpunk cities, and surreal abstractions with ease. But the moment you ask for a video of a human speaking or performing a complex action, the cracks begin to show.
Faces distort, hands morph into eldritch horrors, and movements defy the laws of physics.
The reason for this isn’t necessarily a flaw in the model architecture (like the Diffusion Transformer); rather, it is a data problem. Existing large-scale video datasets are often low-resolution, watermarked, or lack the specific “human-centric” metadata required to teach a model how a person actually moves and looks.
Enter OpenHumanVid.
In a recent paper from researchers at Fudan University and Baidu Inc., the team introduced a massive, high-quality dataset designed specifically to bridge this gap. In this post, we will tear down the paper to understand how they curated over 13 million high-quality clips, the pipeline they built to filter “bad” data, and the specific training strategies that allow for photorealistic human video generation.
The Bottleneck: Why “Big Data” Isn’t Enough
To train a video generation model, you generally need two things: a massive amount of video and text descriptions (captions) that match those videos.
Previous datasets like WebVid-10M or Panda-70M provided millions of clips. However, they suffer from significant limitations when applied to human-centric tasks:
- Low Resolution: Many are capped at 360p or are riddled with watermarks.
- Generic Captions: A caption might say “Man walking,” which is insufficient for a model to learn facial micro-expressions or complex gestures.
- Lack of Motion Data: They provide video pixels but rarely include structural data like skeletal poses or depth maps.
When models are trained on this data, they learn “average” human motion, which results in the “uncanny valley” effect we often see. OpenHumanVid was created to solve this by focusing exclusively on high-quality, diverse human data.
Introducing OpenHumanVid
OpenHumanVid is not just a collection of random videos; it is a curated library derived from high-production-value sources like films, TV series, and documentaries. This ensures that the lighting, camera work, and aesthetic quality are already professional-grade before the computer even looks at them.
As illustrated in Figure 1 below, the dataset is massive. It starts with raw footage and filters down to 13.2 million high-quality clips. Crucially, it doesn’t just offer video-text pairs; it includes skeletal sequences (for pose control) and speech audio (for lip-syncing).

How It Compares to the Competition
To understand the scale of this contribution, we can look at the comparison in Table 1. While datasets like WebVid-10M have high volume, they lack the “human” specificity. OpenHumanVid combines the scale of general datasets with the detailed annotations (skeletons, audio) usually reserved for small, niche datasets like UCF-101.
![Table 1. The comparative analysis of our dataset against previous general and human video datasets. We enhance the textual captions by incorporating short, long, and structured formats that reflect human characteristics. Additionally, we integrate skeleton sequences derived from DWPose [64] and corresponding speech audio filtered through SyncNet [41] to enrich the dataset with contextual human motion data.](/en/paper/2412.00115/images/003.jpg#center)
The Core Method: Building the Pipeline
The most educational aspect of this paper for students of data science and AI is the processing pipeline. You cannot simply scrape 100,000 hours of video and feed it into a GPU. The noise would destroy the model’s convergence.
The researchers engineered a four-step pipeline to distill raw footage into gold-standard training data.

Step 1: Video Preprocessing
Before quality analysis, basic cleanup is required:
- Codec Standardization: Everything is converted to H.264.
- Subtitle Removal: Using a method called CRAFT, they crop out subtitles (text overlay is terrible for training generative models unless you want the model to randomly generate gibberish text).
- Scene Splitting: They use SceneDetect to chop videos into 2-20 second clips based on cuts or transitions.
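To make Step 1 concrete, here is a minimal Python sketch of the transcoding and scene-splitting stages, assuming `ffmpeg` is on the PATH and the PySceneDetect package is installed. The CRAFT subtitle-removal step is omitted, and the file names, flags, and duration bounds are illustrative rather than the paper’s exact settings.

```python
import subprocess
from scenedetect import detect, ContentDetector

def preprocess(video_path: str, out_path: str = "video_h264.mp4"):
    # Codec standardization: re-encode the source to H.264, copying the audio stream.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-c:v", "libx264", "-c:a", "copy", out_path],
        check=True,
    )

    # Scene splitting: detect cuts, then keep only clips in the 2-20 second range.
    scenes = detect(out_path, ContentDetector())
    clips = []
    for start, end in scenes:
        duration = end.get_seconds() - start.get_seconds()
        if 2.0 <= duration <= 20.0:
            clips.append((start.get_timecode(), end.get_timecode()))
    return clips
```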
Step 2: Video Quality Filtering
This is where the magic happens. The team employed a “survival of the fittest” approach using five key metrics:
- Luminance: Too dark or too bright? Deleted.
- Blur: Detected via edge analysis. Blurry footage is discarded.
- Aesthetic Quality: Using a CLIP-based predictor to score artistic composition.
- Motion: Using Optical Flow to ensure the video actually moves (static shots are bad for video training).
- Technical Quality: A general score for compression artifacts and noise.
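As a rough illustration, the luminance, blur, and motion checks can be approximated with OpenCV as sketched below. The aesthetic and technical-quality scores require learned predictors (e.g. a CLIP-based scorer) and are omitted, and every threshold here is a placeholder rather than a value from the paper.

```python
import cv2
import numpy as np

def frame_metrics(prev_bgr: np.ndarray, curr_bgr: np.ndarray):
    """Proxies for three of the five filters, computed on one pair of frames."""
    gray_prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    gray_curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)

    luminance = gray_curr.mean()                          # too dark / too bright check
    blur = cv2.Laplacian(gray_curr, cv2.CV_64F).var()     # low variance => blurry edges
    flow = cv2.calcOpticalFlowFarneback(gray_prev, gray_curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = np.linalg.norm(flow, axis=2).mean()          # average optical-flow magnitude
    return luminance, blur, motion

def keep_clip(per_frame_metrics, lum_range=(40, 220), blur_min=100.0, motion_min=0.5):
    """Average the per-frame metrics and apply illustrative keep/discard thresholds."""
    lum, blur, motion = (float(np.mean(m)) for m in zip(*per_frame_metrics))
    return lum_range[0] < lum < lum_range[1] and blur > blur_min and motion > motion_min
```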
The result of this filtering is stark. Figure 6 shows examples of kept (white numbers) vs. deleted (red numbers) videos. Notice how the deleted videos are often dark, blurry, or lack clear subjects.

The impact of this filtering is measurable. As shown in Figure 4 below, the distribution of quality in OpenHumanVid (the blue “Filtered” areas) is consistently higher and tighter than datasets like Panda-70M (the green areas), particularly in aesthetic quality and motion smoothness.

Step 3: Human-Centric Annotation
Once the video quality is secured, the pipeline focuses on the content.
- Captions: They didn’t rely on a single model. They used MiniCPM and CogVLM to generate descriptions, then used a voting strategy with BLIP2 to pick the best one. Finally, Llama 3.1 rewrote them into “Structured,” “Short,” and “Long” formats.
- Skeletons: DWPose was used to extract wireframe skeletons of the actors.
- Audio: SyncNet was used to verify that the lip movements in the video actually matched the audio track, enabling high-quality lip-sync training.
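Below is a minimal sketch of what a caption-voting step could look like. It uses CLIP image-text similarity as a stand-in for the paper’s BLIP2-based scorer, so the model checkpoint and scoring details are assumptions, not the authors’ implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# CLIP similarity used here as a stand-in for the paper's BLIP2-based scoring.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vote_best_caption(frames, candidate_captions):
    """Score each candidate caption against sampled frames and keep the winner.

    `frames` is a list of PIL images sampled from the clip; `candidate_captions`
    come from different captioners (e.g. MiniCPM and CogVLM).
    """
    inputs = processor(text=candidate_captions, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (num_frames, num_captions)
    scores = logits.mean(dim=0)                     # average agreement across frames
    return candidate_captions[int(scores.argmax())]
```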
Step 4: Human Quality Filter
The final boss of the pipeline is the Human Quality Filter. It’s not enough to have a high-quality video; the text must align with the human in the video.
- Appearance Alignment: Does the text “woman in a red dress” actually match the pixels?
- Motion Alignment: Does “waving hand” match the action?
- If the alignment score (calculated via BLIP2) is low, the clip is tossed. This ensures the model doesn’t learn incorrect associations.
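Conceptually, the filter reduces to a gate over the two alignment scores, as in this hypothetical sketch (the score sources and the 0.3 threshold are assumptions, not values from the paper):

```python
def passes_human_quality_filter(appearance_score: float,
                                motion_score: float,
                                threshold: float = 0.3) -> bool:
    """Keep a clip only if both text-video alignment scores clear the bar.

    The scores are assumed to come from a BLIP2-style matcher for appearance
    and a caption-vs-motion scorer; the threshold is purely illustrative.
    """
    return appearance_score >= threshold and motion_score >= threshold
```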
The Validation Model: Extended Diffusion Transformer
To prove the dataset works, the researchers needed to train a model. They chose a baseline Diffusion Transformer (DiT), similar to the architecture used by Sora and CogVideoX.
However, training a massive DiT from scratch is computationally expensive. Instead, they utilized Low-Rank Adaptation (LoRA).
How it Works (Simplified)
- 3D Causal VAE: This compresses the video into a latent space (a smaller mathematical representation) to make processing manageable.
- Expert Transformer: The core brain that predicts the noise.
- LoRA Integration: Instead of retraining every weight in the network (billions of parameters), LoRA injects small, trainable low-rank matrices into the attention layers. This allows the model to learn the new “OpenHumanVid” style without forgetting its original training, and it does so with a fraction of the compute.
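To see what “injecting low-rank matrices” means in practice, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer, the kind of module that would wrap the attention projections inside the DiT. It illustrates the mechanism only; the class name and hyperparameters are not from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # original weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank residual; only lora_A and lora_B receive gradients.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Because `lora_B` starts at zero, the wrapped layer initially behaves exactly like the pretrained one, and fine-tuning only has to learn the small residual.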

Experiments and Results
The researchers conducted rigorous experiments to see exactly which parts of their pipeline contributed to the improved results. They focused on metrics like Face Consistency (does the face stay the same or morph?) and VBench scores (a standard video generation benchmark).
Insight 1: Frame Rate Matters
One of the most critical findings was the impact of the video sampling rate (FPS). Training on 24 FPS data yielded significantly better results than the standard 8 FPS used in many previous works.
Why? Human motion—especially facial expressions and hand gestures—contains subtle, fast-paced information. At 8 FPS, a smile might look like a sudden glitch. At 24 FPS, the model learns the transition.
Table 3 confirms that increasing FPS improves both Face and Body Consistency scores.
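In a dataloader, the sampling rate amounts to the stride at which frames are drawn from the source video. The helper below is a hypothetical sketch; the clip length of 49 frames is an arbitrary choice, not taken from the paper.

```python
def sample_frame_indices(num_source_frames: int, source_fps: float,
                         target_fps: float = 24.0, clip_len: int = 49):
    """Pick frame indices so the training clip plays back at roughly `target_fps`."""
    # e.g. a 24 fps source sampled at 8 fps keeps every 3rd frame (stride = 3)
    stride = max(1, round(source_fps / target_fps))
    return list(range(0, num_source_frames, stride))[:clip_len]
```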

Insight 2: Alignment is Everything
The experiments also highlighted the value of the text-video alignment filters. By filtering out data where the text didn’t perfectly match the human’s appearance or motion, the generated results improved drastically.
We can see the visual proof in Figure 8.
- Row (a): Shows that high sampling rates (24 FPS) prevent the “melting” look during fast motion.
- Row (b): The “Human Appearance Filter” ensures faces remain well-formed and natural, rather than distorted.
- Row (d): The “Face Motion” alignment allows the model to accurately render specific emotions like “sorrowful” or “smile,” which the baseline model fails to capture.

The Final Verdict
When compared against the baseline CogVideoX model, the version trained on OpenHumanVid (Ours) demonstrated superior performance across almost all metrics. Table 5 highlights improvements in I2V Consistency (keeping the identity stable) and Motion Smoothness.

Conclusion and Implications
OpenHumanVid represents a significant step forward in generative video. It shifts the focus from model architecture wars to a perhaps more important frontier: Data Curation.
By treating data processing as a first-class citizen—implementing rigorous aesthetic filters, ensuring high frame rates, and enforcing strict text-video alignment—the researchers demonstrated that we can generate human videos that are not only high-resolution but also coherent and emotionally expressive.
For students and researchers entering the field, the takeaway is clear: A model is only as good as the data it sees. If you want to solve complex problems like human motion, you cannot rely on noisy, web-scraped data. You need structured, high-quality, and semantically aligned datasets. OpenHumanVid provides exactly that foundation for the next generation of video AI.