Imagine being able to direct a short film entirely from your laptop. You provide a photo of an actor, a script for their lines, and a description of the scene — and an AI model generates a high-quality video that brings your vision to life. This is the promise of Human-Centric Video Generation (HCVG), a rapidly evolving field that’s reshaping content creation.

Traditionally, producing even a short video is a complex, expensive endeavor involving casting, location scouting, filming, and post-production. Generative AI aims to democratize this process by allowing creators to craft videos from simple multimodal inputs: text for describing scenes and actions, images for defining character identity, and audio for speech.

However, getting these three modalities — text, images, and audio — to work together harmoniously is a major challenge. Existing models often struggle to balance them. For example, a model might excel at matching a person’s identity from a photo but fail to follow the text prompt accurately. Another might achieve perfect lip-sync with an audio file but lose the subject’s original appearance. This trade-off has long hindered progress.

A new paper from researchers at Tsinghua University and ByteDance introduces HuMo, a unified framework designed to solve this problem. HuMo enables collaborative multimodal control across text, image, and audio, achieving state-of-the-art results in all aspects simultaneously — a leap forward in creating realistic, controllable, and versatile human videos.

A collage of videos generated by HuMo, showcasing its ability to handle text-image, text-audio, and text-image-audio inputs for diverse subjects like humans, stylized characters, and animations.

Figure 1: HuMo is a versatile framework that can generate videos from various combinations of text, image, and audio inputs. It works for realistic humans, stylized art, and even animations.

In this article, we’ll explore the innovations behind HuMo, examining how the researchers tackled the core challenges of data scarcity and collaborative control to build a truly multimodal video generation powerhouse.


The Balancing Act: Why Previous Methods Fall Short

To understand HuMo’s contributions, it’s useful to examine the limitations of prior approaches. Most HCVG methods fall into one of two categories:

1. The “Generate-then-Animate” Pipeline
Methods like OmniHuman-1 first use a text-to-image (T2I) model to create a “start frame” that contains the subject and background. Then, an image-to-video (I2V) model animates this frame based on an audio track.

Drawback: This pipeline is rigid. Once the start frame is generated, the scene is locked. If your text prompt says “a man playing with his dog” but the T2I model forgets the toy, you can’t add it later. The quality of the final video is heavily dependent on that first frame.

2. The “Subject-Consistent” Pipeline
Methods like Phantom focus on subject-to-video (S2V) generation: you provide a reference image and a text prompt, and the model generates a video whose subject consistently matches the reference. These models maintain identity well and allow more textual control over the scene.

Drawback: They usually can’t handle audio, so you can make a video of a specific person walking, but you can’t make them talk.

Recent attempts to combine subject preservation with audio-visual sync often fall short. As shown below, emphasizing the reference image can degrade lip-sync, while focusing on audio sync can cause identity drift or poor text adherence.

A comparison of different video generation models. HuMo successfully generates a man in a suit playing with a black lab, following the text prompt and reference images, and syncing with spoken words, while other models fail in at least one of these aspects.

Figure 2: Models like OmniHuman-1 are constrained by a start frame. Phantom can’t integrate audio. HunyuanCustom struggles to balance all modalities. HuMo excels at text control, subject consistency, and audio-visual sync.

The researchers identified two root causes for these issues:

  1. Data Scarcity: Few large, high-quality datasets contain perfectly paired triplets — text descriptions, reference images, and synchronized audio.
  2. Collaborative Control Difficulty: Training a single model to master text following, subject preservation, and audio sync simultaneously is challenging because the objectives often conflict.

HuMo was built to tackle both challenges head-on.


The HuMo Framework: Data, Training, and Inference

HuMo’s success is built on:

  • a novel data processing pipeline
  • a progressive training paradigm
  • a smart inference strategy

An overview of the HuMo framework, showing the model architecture on the left and the three-stage progressive data processing and training pipeline on the right.

Figure 3: HuMo is built on a DiT-based video generation backbone, trained progressively — first learning subject preservation (Stage 1), then adding audio-visual sync (Stage 2) — all powered by a meticulously curated dataset.

Part 1: Building a Better Dataset

Since no suitable dataset existed, the team built one using a multi-stage process:

  • Stage 0 (Text): Start with a large video pool and use powerful vision-language models (VLMs) to create detailed descriptions, forming text-video pairs.
  • Stage 1 (Text + Image): Retrieve cross-attribute reference images — i.e., images of the same subject but with different clothing, poses, and backgrounds — from a billion-scale image corpus. This forces the model to learn core identity features, not just copy pixels, improving text editability.
  • Stage 2 (Text + Image + Audio): Filter videos for clean, synchronized speech using lip-sync analysis, resulting in a high-quality subset of triplets: text, reference images, and audio.

This pipeline produces rich, well-aligned multimodal data, essential for balanced training.
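
To make the final filtering step concrete, here is a minimal Python sketch of how such a Stage 2 filter could look. The `Clip` structure, the `sync_score` callable (imagine a SyncNet-style lip-sync confidence), and the threshold are illustrative assumptions, not the paper's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    video_path: str
    caption: str                # detailed description from the VLM captioning stage
    reference_paths: List[str]  # cross-attribute reference images of the subject
    audio_path: str

def filter_speech_triplets(
    clips: List[Clip],
    sync_score: Callable[[Clip], float],  # hypothetical lip-sync confidence scorer
    threshold: float = 0.75,              # illustrative cutoff, not the paper's value
) -> List[Clip]:
    """Keep only clips whose speech is cleanly synchronized with the video,
    yielding the final (text, reference images, audio) triplets."""
    return [clip for clip in clips if sync_score(clip) >= threshold]
```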

Part 2: Progressive Multimodal Training

The HuMo model extends a Diffusion Transformer (DiT) video generation backbone trained with a flow-matching objective:

\[ \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,z_0,z_1} \| v_{\theta}(z_t,t,c) - (z_1 - z_0) \|_2^2 \]

Here, \(z_t\) is an interpolation between random noise \(z_0\) and the target video latent \(z_1\), and the network \(v_{\theta}\) learns to predict the velocity \(z_1 - z_0\) that transports the noise toward the target, conditioned on multimodal inputs \(c\).
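
To make the objective concrete, here is a minimal PyTorch sketch of one flow-matching training step. The linear interpolation path and the `v_theta(z_t, t, cond)` call signature are standard conventions assumed for illustration, not HuMo's actual code.

```python
import torch

def flow_matching_loss(v_theta, z1, cond):
    """One flow-matching training step for a batch of target latents z1."""
    z0 = torch.randn_like(z1)                      # random noise sample z_0
    t = torch.rand(z1.shape[0], device=z1.device)  # one timestep per sample
    t_exp = t.view(-1, *([1] * (z1.dim() - 1)))    # broadcast t over latent dims
    z_t = (1.0 - t_exp) * z0 + t_exp * z1          # point on the noise-to-data path
    target = z1 - z0                               # ground-truth velocity
    pred = v_theta(z_t, t, cond)                   # model's predicted velocity
    return torch.mean((pred - target) ** 2)        # MSE, as in the loss above
```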

Stage 1: Subject Preservation (Text + Image)

A minimally invasive image-injection strategy lets the model integrate reference identities without losing its text-following ability (see the sketch after this list):

  • No Architectural Changes: Concatenate the VAE latents of the reference image at the end of the video latent sequence to encourage attention-based identity extraction across frames.
  • Limited Fine-Tuning: Freeze most parameters, updating only the self-attention layers to preserve pre-trained synthesis and alignment abilities.
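
In code, these two ideas boil down to a concatenation and a parameter filter. The sketch below is a simplified assumption of how this could look; the tensor layout and the `self_attn` naming are placeholders, not HuMo's actual implementation.

```python
import torch

def inject_reference_latents(video_latents, ref_latents):
    """Append reference-image VAE latents to the end of the video latent sequence
    so self-attention can carry identity cues into every frame.
    Illustrative shapes: (B, T, C, H, W) video, (B, R, C, H, W) references."""
    return torch.cat([video_latents, ref_latents], dim=1)

def freeze_all_but_self_attention(dit_model):
    """Limited fine-tuning: only self-attention parameters stay trainable.
    The 'self_attn' substring is an assumed naming convention."""
    for name, param in dit_model.named_parameters():
        param.requires_grad = "self_attn" in name
```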

Stage 2: Audio-Visual Sync (Text + Image + Audio)

Once the text-image skills are solid, audio is introduced (see the sketch after this list):

  • Audio Cross-Attention: Add audio cross-attention layers to process speech features, aligning them with video frames via: \[ \operatorname{Attention}\left(h_z, c_a\right) = \operatorname{softmax}\left(\frac{\mathbf{Q}_z \mathbf{K}_a^{\top}}{\sqrt{d}}\right) \mathbf{V}_a \]
  • Focus-by-Predicting: Instead of hard-coding face regions, use a mask predictor in later DiT blocks to estimate likely facial regions from internal features, supervised by ground-truth masks with binary cross-entropy: \[ \mathcal{L}_{\text{mask}} = \frac{hw}{\sum_{i=1}^{h} \sum_{j=1}^{w} \mathbf{M}_{\text{gt}}^{(i,j)}} \cdot \mathrm{BCE}(\mathbf{M}_{\text{pred}}, \mathbf{M}_{\text{gt}}) \] This softly guides attention without restricting motion modeling.
  • Progressive Curriculum: Start Stage 2 with 80% text-image and 20% text-image-audio tasks, gradually moving to 50/50 to preserve existing skills.
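
The sketch below shows what the two new pieces could look like in PyTorch: an audio cross-attention layer matching the formula above, and the rescaled BCE mask loss. The module structure, residual connection, and head count are assumptions for illustration, not HuMo's actual code.

```python
import torch.nn as nn
import torch.nn.functional as F

class AudioCrossAttention(nn.Module):
    """Video hidden states (queries) attend to audio features (keys/values)."""
    def __init__(self, dim, audio_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )

    def forward(self, h_z, c_a):
        # h_z: (B, N_video_tokens, dim); c_a: (B, N_audio_tokens, audio_dim)
        out, _ = self.attn(query=h_z, key=c_a, value=c_a)
        return h_z + out  # residual connection (an assumption)

def focus_by_predicting_loss(mask_pred, mask_gt, eps=1e-6):
    """Rescaled BCE: the hw / sum(M_gt) factor keeps sparse face regions from
    being drowned out. mask_pred is expected in [0, 1] (post-sigmoid); the sum
    is taken over the whole batch here for simplicity."""
    h, w = mask_gt.shape[-2:]
    scale = (h * w) / (mask_gt.sum() + eps)
    return scale * F.binary_cross_entropy(mask_pred, mask_gt)
```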

Part 3: Flexible and Fine-Grained Inference

During generation, HuMo uses two key strategies:

1. Flexible Multimodal Control (CFG)

HuMo extends classifier-free guidance to all three modalities, with separate scales \((\lambda_{\text{txt}}, \lambda_{\text{img}}, \lambda_{\text{a}})\) that give fine-grained control over each modality's influence. Any missing condition is replaced with a null token.
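
The standard additive CFG combination, extended to three guidance terms, could be sketched as follows; HuMo's exact formulation may differ, so treat this as an assumption-laden illustration.

```python
def multimodal_cfg(v_null, v_txt, v_img, v_aud, lam_txt, lam_img, lam_aud):
    """Combine one unconditional and three single-condition velocity predictions.
    v_null uses null tokens for all modalities; each lambda scales how strongly
    its modality pulls the sample away from the unconditional prediction."""
    return (v_null
            + lam_txt * (v_txt - v_null)
            + lam_img * (v_img - v_null)
            + lam_aud * (v_aud - v_null))
```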

2. Time-Adaptive CFG

Different modalities matter more at different denoising stages:

  • Early: Text dominates for scene layout.
  • Late: Image and audio refine identity and lip movement.

HuMo dynamically switches CFG weights mid-generation to maximize both structure and detail.
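
A time-adaptive schedule can be as simple as switching scale sets partway through denoising. The values and switch point below are hypothetical, chosen only to illustrate the idea, not HuMo's tuned settings.

```python
def cfg_scales(step, num_steps, switch_frac=0.5):
    """Return per-modality CFG scales for the current denoising step."""
    if step < switch_frac * num_steps:
        return {"txt": 7.5, "img": 1.5, "aud": 1.5}  # early: text sets the layout
    return {"txt": 3.0, "img": 5.0, "aud": 5.0}      # late: refine identity and lip-sync
```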

An illustration of time-adaptive CFG, showing how different CFG weights can produce different results, and how combining them adaptively leads to a balanced output that has both strong text adherence and high identity preservation.

Figure 4: Time-adaptive CFG adjusts guidance priorities over time, balancing text adherence and identity preservation.


Experiments and Results

The team tested HuMo against leading models in subject preservation and audio-visual sync.

Subject Preservation Task (Text + Image)

Qualitative:
Qualitative comparison for the subject preservation task. HuMo generates more coherent and text-aligned videos, correctly depicting subjects wearing gloves, entering a temple, or flying on a broomstick, outperforming other models.
Figure 5: Only HuMo correctly depicts “step into a temple” while maintaining the identities of all subjects.

Quantitative:
Table showing quantitative results for the subject preservation task. HuMo-17B achieves the highest scores in nearly all metrics, including video quality, text following, and subject consistency.
Table 1: HuMo-17B tops video quality, structure plausibility, text following, and identity metrics.

Audio-Visual Sync Task (Text + Image + Audio)

Qualitative:
Qualitative comparison for the audio-visual sync task. HuMo generates videos that follow the text prompt (e.g., adding a guitar or specific lighting) while maintaining identity and lip-sync, unlike start-frame-based methods.
Figure 6: HuMo adds prompt-specified elements (guitar, golden light) and keeps identities intact, outperforming start-frame-based methods.

Quantitative:
Table showing quantitative results for the audio-visual sync task. HuMo-17B leads in video quality and text following, while achieving competitive scores in subject consistency and audio-visual sync.
Table 2: HuMo achieves best aesthetics and text following, with competitive lip-sync.


Why Components Matter: Ablation Studies

Ablation studies confirm that removing any of these key components hurts performance.

Qualitative ablation study showing degraded results when removing progressive training or face location predictor.
Figure 7: Without components like progressive training or focus-by-predicting, output quality drops.

Table showing quantitative ablation study results. The full model outperforms reduced variants.
Table 3: Quantitative ablations show measurable drops when each strategy is removed.


Showcasing Controllability

Text Controllability:
A grid of images demonstrating text controllability. The same reference person is placed in different scenes (near an airplane, in an office, in nature) with different clothing based on varying text prompts.
Figure 8: Same reference, different prompts — identity stays consistent while clothing and setting change.

Image Controllability:
A collage recreating scenes from Game of Thrones. HuMo generates new video clips based on audio and captions from the show, both with and without a new reference actor’s face.
Figure 9: “Re-casts” Game of Thrones by integrating a different actor’s face into original scenes using text-audio-image inputs.


Conclusion and Future Implications

HuMo represents a decisive advance in human-centric video generation. By co-designing a multimodal dataset pipeline and progressive training, the framework resolves the long-standing conflict between text, image, and audio control. Its ability to produce high-quality, consistent, and richly controllable videos from multimodal inputs opens up new creative frontiers.

With innovations like cross-attribute reference retrieval, focus-by-predicting, and time-adaptive CFG, HuMo sets a blueprint for future multimodal generation research. While ethical considerations — including risks of deepfakes and non-consensual content — must be addressed, the potential for democratizing filmmaking and storytelling is immense. Tools like HuMo could empower anyone, anywhere, to become a creator.