The dream of “Harry Potter”-style moving photographs has been a driving force in computer vision for decades. We want to take a single static photo of a person and animate it using a driving video—making the subject dance, speak, or walk while preserving their identity.
While recent advances in diffusion models have made this possible, there is a lingering “uncanny valley” effect in current state-of-the-art methods. You might see a person dancing perfectly, but their hair behaves like a solid helmet, their dress moves like rigid cardboard, and the background remains frozen in time. The person moves, but the dynamics—the physics of wind, gravity, and momentum—are missing.
Enter X-Dyna, a new framework presented by researchers from ByteDance, USC, Stanford, and UCLA. This paper introduces a method to generate animations that aren’t just accurate in pose, but vivid in detail. It allows for flowing hair, rippling clothes, and even dynamic backgrounds (like waterfalls or fireworks), all from a single image.
In this deep dive, we will unpack how X-Dyna solves the “statue effect” in video generation. We will explore the architecture of diffusion models, the mathematics of attention mechanisms, and the clever engineering tricks used to separate identity from motion.

1. The Core Problem: The Trade-off Between Identity and Dynamics
To understand why X-Dyna is necessary, we first need to understand the current landscape of Human Image Animation.
The task is straightforward:
- Input: A Reference Image (\(I_R\)) of a person.
- Input: A Driving Video (providing pose and expression).
- Output: A video of the person in the Reference Image performing the actions from the Driving Video.
The challenge lies in the “Reference Image.” You need to extract the person’s appearance (identity, clothes, hair color) and paint it onto every frame of the new video.
The Two Failed Extremes
Prior to X-Dyna, methods generally fell into two categories, each with a fatal flaw:
- The CLIP/IP-Adapter Approach: These methods encode the reference image into a high-level vector (using tools like CLIP). They inject this general “concept” of the person into the video generator.
- Pro: Great dynamics. The model is free to hallucinate flowing hair and lighting changes.
- Con: It loses identity. The shirt might change color, or the face might look different. It captures the vibe but not the pixels.
- The ReferenceNet Approach (The Status Quo): This is currently the most popular method (used in models like Animate Anyone). It uses a copy of the neural network (a parallel UNet) to extract detailed features from the reference image and forces the video generator to use them via strict connections.
- Pro: Perfect identity preservation.
- Con: The “Statue” effect. Because the model is forced to look so closely at the static reference image, it becomes terrified of changing anything. Backgrounds freeze. Hair freezes. The motion looks rigid.

As shown in the figure above:
- IP-Adapter (a) creates a vivid fire scene but turns the person into a generic shadow (identity loss).
- ReferenceNet (b) keeps the person perfect but the fireflies and background are frozen dead (dynamics loss).
- Dynamics-Adapter (c)—the X-Dyna method—gets the best of both worlds.
2. Background: Diffusion and Control
Before dissecting the solution, let’s establish the technical foundation. X-Dyna is built on top of Stable Diffusion (SD), a Latent Diffusion Model.
The Backbone
The core of Stable Diffusion is a UNet. This is a neural network shaped like the letter ‘U’. It takes a noisy image, compresses it down to capture high-level concepts (the bottom of the U), and then expands it back up to pixel-level details, removing noise at every step.
Inside the UNet are Attention Blocks. These are the decision-makers. They look at the image data and ask: “Which parts of this image relate to which other parts?”
- Self-Attention: Pixels looking at other pixels in the same frame (e.g., “This blue pixel is part of the sky”).
- Cross-Attention: Pixels looking at external prompts (e.g., “This shape corresponds to the word ‘dog’ in the text prompt”). Both flavors are sketched in code right after this list.
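Here is a minimal PyTorch sketch of both, using plain scaled dot-product attention. The tensor shapes and the omission of the learned Q/K/V projection matrices are simplifications for illustration, not details from Stable Diffusion’s actual implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    weights = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return weights @ v

# Self-attention: queries, keys, and values all come from the same set of
# image tokens (pixels attending to other pixels in the same frame).
image_tokens = torch.randn(1, 64, 320)      # (batch, num_image_tokens, channels)
self_out = attention(image_tokens, image_tokens, image_tokens)

# Cross-attention: queries come from the image tokens, but keys and values
# come from an external condition (e.g. text-prompt embeddings).
text_tokens = torch.randn(1, 77, 320)       # (batch, num_text_tokens, channels)
cross_out = attention(image_tokens, text_tokens, text_tokens)

print(self_out.shape, cross_out.shape)      # both: torch.Size([1, 64, 320])
```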
Controlling the Motion
To make the image move, the researchers use ControlNet. This is a neural network structure that allows you to plug spatial conditions (like a skeleton stick figure) into a pre-trained diffusion model. X-Dyna uses a specific Pose ControlNet (\(C_P\)) to tell the model where the arms and legs should be in every frame.
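The snippet below is a toy stand-in for the ControlNet idea, assuming only the design detail ControlNet is known for: the control branch feeds its features back into the frozen model through zero-initialized layers, so it has no effect at the start of training. The module names and shapes are illustrative, not the real ControlNet or X-Dyna code.

```python
import torch
import torch.nn as nn

class ZeroConv(nn.Module):
    """1x1 convolution initialized to zero, so the control branch contributes
    nothing at the start of training (ControlNet's zero-convolution trick)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

class ToyPoseControlNet(nn.Module):
    """Toy stand-in for a pose ControlNet branch: encode a rendered skeleton
    map and add the result to the UNet's intermediate features."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        self.zero_out = ZeroConv(channels)

    def forward(self, unet_features, pose_map):
        return unet_features + self.zero_out(self.encoder(pose_map))

control  = ToyPoseControlNet()
features = torch.randn(1, 64, 32, 32)   # intermediate UNet features
pose_map = torch.randn(1, 3, 32, 32)    # rendered skeleton image
out = control(features, pose_map)       # identical to `features` at init
```

In the real ControlNet, this pattern repeats at every encoder resolution, and the control branch is a trainable copy of the UNet encoder rather than a tiny convolutional stack.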
3. The X-Dyna Method
The researchers propose a pipeline that fundamentally changes how the reference image is “fed” to the generation process.

The architecture consists of three main innovations:
- The Dynamics-Adapter: A new way to inject appearance without killing motion.
- S-Face ControlNet: A clever way to handle facial expressions.
- Harmonic Data Fusion: A training strategy involving nature videos.
Let’s break these down, dedicating the most time to the Dynamics-Adapter, as it is the primary contribution.
3.1 The Dynamics-Adapter
The goal is to transfer the appearance of the reference image (\(I_R\)) to the generated video frame (\(I_i\)) while allowing for changes in shape (cloth folding, hair blowing).
Why ReferenceNet Failed
To understand the solution, look at how the previous state-of-the-art (ReferenceNet) worked. ReferenceNet ran a parallel network and concatenated features directly. It essentially told the model: “Copy these features exactly.” This creates a strong spatial constraint. If the reference image has hair hanging straight down, the model struggles to generate hair flying sideways because the “spatial guidance” says it should be straight.
The X-Dyna Solution: Residual Cross-Frame Attention
X-Dyna takes a softer approach. Instead of a separate heavy network, it uses a lightweight module called the Dynamics-Adapter.
Here is the conceptual comparison of the architectures:

Notice in (c) how X-Dyna uses a “Partially Shared Weight UNet.” It doesn’t run a totally separate heavy process. It piggybacks on the main process.
The Mathematics of Dynamics
Let’s look at the math inside the Attention layers.
Standard Self-Attention in a Diffusion UNet calculates relationships between pixels in the current frame being generated (\(I_i\)). It computes a Query (\(Q\)), Key (\(K\)), and Value (\(V\)) from the input noise.
\[
A_i = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i
\]
This equation essentially says: “Check how similar my current pixel (\(Q_i\)) is to every other pixel (\(K_i\)), and update my value based on that weighted sum (\(V_i\)).”
The Innovation: X-Dyna injects the reference image here. It computes a second attention map. It takes the Query from the current frame (\(Q'_i\)) but looks at the Key and Value from the reference image (\(K_R, V_R\)).
\[
A'_i = \mathrm{Softmax}\!\left(\frac{Q'_i K_R^{\top}}{\sqrt{d}}\right) V_R
\]
This equation asks: “Which parts of the Reference Image correspond to this pixel in the Current Frame?”
Finally, X-Dyna combines these two worlds. It doesn’t replace the original attention (which would destroy the model’s knowledge of physics/motion); it adds the reference info as a residual (a helper).
\[
\mathrm{out}_i = A_i W_O + A'_i W'_O
\]
In this equation:
- The first term \((A_i W_O)\) is the standard generation (physics, lighting, composition).
- The second term \((A'_i W'_O)\) is the Dynamics-Adapter injection (identity, texture, color).
- The \(+\) sign is crucial. It means the model balances “what makes sense physically” with “what the person looks like.”
By initializing the weights of the output projector (\(W'_O\)) to zero, the training starts with the standard diffusion model behavior and slowly “fades in” the influence of the reference image. This prevents the shock that usually freezes dynamics.
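Putting the three equations together, here is a minimal PyTorch sketch of the residual cross-frame attention idea. The projection names and dimensions are assumptions for illustration; the real Dynamics-Adapter partially shares weights with the denoising UNet and is trained inside the full diffusion pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCrossFrameAttention(nn.Module):
    """Sketch of the Dynamics-Adapter idea: keep the standard self-attention
    of the denoising UNet, and add a zero-initialized residual branch whose
    keys/values come from the reference image features."""

    def __init__(self, dim):
        super().__init__()
        # Base branch: standard self-attention of the pretrained model.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)
        # Adapter branch: query from the current frame, key/value from the
        # reference image. The output projector W'_O starts at zero so the
        # adapter "fades in" during training.
        self.to_q_ref = nn.Linear(dim, dim, bias=False)
        self.to_k_ref = nn.Linear(dim, dim, bias=False)
        self.to_v_ref = nn.Linear(dim, dim, bias=False)
        self.w_o_ref = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.w_o_ref.weight)

    @staticmethod
    def attend(q, k, v):
        d = q.shape[-1]
        return F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v

    def forward(self, frame_tokens, ref_tokens):
        # A_i: self-attention within the frame currently being denoised.
        a = self.attend(self.to_q(frame_tokens),
                        self.to_k(frame_tokens),
                        self.to_v(frame_tokens))
        # A'_i: cross-frame attention from the frame into the reference image.
        a_ref = self.attend(self.to_q_ref(frame_tokens),
                            self.to_k_ref(ref_tokens),
                            self.to_v_ref(ref_tokens))
        # Residual fusion: A_i W_O + A'_i W'_O.
        return self.w_o(a) + self.w_o_ref(a_ref)

layer = ResidualCrossFrameAttention(dim=320)
frame = torch.randn(1, 64, 320)  # tokens of the frame being generated
ref = torch.randn(1, 64, 320)    # tokens of the reference image
out = layer(frame, ref)          # equals the base branch output at init
```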
3.2 Implicit Local Face Expression Control (\(C_F\))
Getting the body to move is one thing; getting the face to emote is another. Previous methods used simple facial landmarks (dots on eyes and mouth).
- Problem: Landmarks are sparse. They don’t capture the subtlety of a smirk or a furrowed brow.
- Problem: If you just feed the image of the person from the driving video, the model might copy their identity (face shape) onto your reference character.
X-Dyna introduces S-Face ControlNet.
The “Cross-Identity” Trick
To teach the network to look at expressions but ignore identity, the researchers use a clever training strategy:
- They take the driving video frame.
- They use a pre-existing face-swapping network to swap the driver’s face with a random identity, but keep the expression.
- They feed this “Frankenstein” face into the ControlNet.
Because the face in the ControlNet doesn’t look like the person in the Reference Image, the X-Dyna model cannot rely on it for identity. It is forced to learn only the motion (the expression) from the control signal, while pulling the identity strictly from the Reference Image. This results in “Identity-Disentangled” control.
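A rough sketch of that data-preparation step is below. Both `face_swapper` and `face_bank` are hypothetical placeholders for whatever off-the-shelf face-swapping model and identity pool one plugs in; the paper does not prescribe this exact interface.

```python
import random

def make_expression_control(driving_frame, reference_identity, face_bank, face_swapper):
    """Sketch of the cross-identity training trick.

    `face_swapper(frame, identity)` and `face_bank` are hypothetical
    placeholders for an off-the-shelf face-swapping model and a pool of
    unrelated identities; they are not part of the paper's released code.
    """
    # 1. Pick a random identity that is different from the reference person.
    candidates = [i for i in face_bank if i != reference_identity]
    random_identity = random.choice(candidates)
    # 2. Swap the driver's face to that identity while keeping the expression.
    swapped_frame = face_swapper(driving_frame, random_identity)
    # 3. This "Frankenstein" face goes into S-Face ControlNet: since its
    #    identity never matches the reference image, the model can only use
    #    it as an expression signal, never as an appearance shortcut.
    return swapped_frame
```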
3.3 Harmonic Data Fusion Training
Neural networks are only as good as their data. If you only train on videos of people dancing in studios with white walls, the model will never learn how wind moves trees or how water splashes.
X-Dyna employs Harmonic Data Fusion. They train the model simultaneously on:
- Human Motion Videos: (Dancing, walking).
- Natural Scene Videos: (Waterfalls, time-lapses of clouds, fireworks).
When training on nature videos, they simply leave the skeleton/pose inputs blank. This teaches the backbone (the UNet) and the Dynamics-Adapter how to hallucinate realistic environmental physics independent of human movement.
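A minimal sketch of how such mixed batching could look is below; the function and field names are illustrative, not the paper’s training code.

```python
import torch

def prepare_conditioning(frames, pose_maps=None):
    """Minimal sketch of harmonic data fusion batching (names illustrative).

    frames:    (B, T, 3, H, W) clip, either human motion or a natural scene
    pose_maps: (B, T, 3, H, W) rendered skeletons for human clips, or None
               for nature clips that have no pose annotation.
    """
    if pose_maps is None:
        # Nature videos get a blank (all-zero) pose map, so the UNet and the
        # Dynamics-Adapter must model scene dynamics without a skeleton.
        pose_maps = torch.zeros_like(frames)
    return frames, pose_maps

human_clip  = torch.randn(2, 16, 3, 64, 64)
nature_clip = torch.randn(2, 16, 3, 64, 64)
_, human_pose = prepare_conditioning(human_clip, torch.randn(2, 16, 3, 64, 64))
_, blank_pose = prepare_conditioning(nature_clip)   # all zeros
```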
4. Experiments and Results
Does this complex architecture actually work? The researchers evaluated X-Dyna against top competitors like MagicAnimate, Animate-Anyone, MagicPose, and MimicMotion.
4.1 Quantitative Analysis: The DTFVD Metric
Measuring “how good” a video looks is hard. The researchers utilized a metric called DTFVD (Dynamic Texture Fréchet Video Distance).
- Lower is better.
- It specifically measures the quality of dynamic textures (water, fire, hair); the core Fréchet computation behind it is sketched below.
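For intuition, here is the standard Fréchet distance computation that FVD-style metrics (including DTFVD) are built on. The part that makes it DTFVD, a video feature extractor trained on dynamic-texture data, is assumed to exist and is not shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets, the same
    computation behind FVD-style metrics. For DTFVD, the features would come
    from a video backbone trained on dynamic-texture data (not shown here)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny
        covmean = covmean.real            # imaginary parts; drop them
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random "video features"; lower means the two distributions
# (real vs. generated dynamics) are closer.
real = np.random.randn(256, 128)
gen  = np.random.randn(256, 128)
print(frechet_distance(real, gen))
```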

As shown in Table 1, X-Dyna achieves a DTFVD score of 1.518, significantly lower (better) than MimicMotion (3.590) or MagicAnimate (2.601). This numerically proves that the textures in X-Dyna videos are more realistic and temporally consistent.
They also measured standard image quality metrics (PSNR, SSIM) and Identity Preservation (Face-Cos).

In Table 2, we see X-Dyna holds the highest Face-Cos (Face Cosine Similarity) score of 0.497, meaning it preserves the person’s identity better than any other method while also having better dynamics.
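Face-Cos itself is simple to sketch: a cosine similarity between face embeddings of the reference image and of each generated frame, averaged over the video. The face-recognition model producing the embeddings is assumed and not shown.

```python
import torch
import torch.nn.functional as F

def face_cos(ref_embedding, frame_embeddings):
    """Sketch of a Face-Cos style score: average cosine similarity between the
    face embedding of the reference image and that of each generated frame."""
    sims = F.cosine_similarity(frame_embeddings, ref_embedding.unsqueeze(0), dim=-1)
    return sims.mean().item()

ref_emb    = torch.randn(512)       # embedding of the reference face
frame_embs = torch.randn(16, 512)   # embeddings of 16 generated frames
print(face_cos(ref_emb, frame_embs))  # closer to 1.0 = identity better preserved
```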
4.2 Qualitative Comparisons
Numbers are great, but visual results matter most in graphics.
Human in Dynamic Scenes
The image below compares X-Dyna against MagicPose and MimicMotion. Look at the bottom row (X-Dyna). The particles/sparks around the subject’s hand are generated with fluid continuity. In contrast, other methods often blur these details or keep them static.

Face and Pose Control
Here we see the results of the S-Face ControlNet. In the column “Ours,” the facial expression follows the driving video while the identity stays consistent with the reference, adapting naturally to the pose. Other methods often produce blurry faces when the head turns (Animate Anyone) or visible distortions (MagicPose).

4.3 User Study
The researchers didn’t just trust algorithms; they asked 100 human participants to rate the videos on three criteria: Foreground Dynamics, Background Dynamics, and Identity.

The results were overwhelmingly in favor of X-Dyna, particularly in Background Dynamics (BG-Dyn), where it scored 4.26/5, compared to the next best at 2.78. This huge gap confirms the success of the Harmonic Data Fusion training.
5. Conclusion and Future Implications
X-Dyna represents a significant step forward in generative video. By identifying the specific architectural bottleneck—the “strict teacher” nature of ReferenceNet—and replacing it with the “helpful guide” of the Dynamics-Adapter, the researchers have unlocked a new level of realism.
Key Takeaways:
- Separation of Concerns: X-Dyna successfully separates the what (Appearance/Reference) from the how (Dynamics/Motion).
- Architecture Matters: Simply adding more layers (like ReferenceNet) isn’t always better. The parallel residual design of the Dynamics-Adapter allows the underlying diffusion physics to shine through.
- Data Diversity: Training on non-human data (nature scenes) improved the animation of human videos, proving that general dynamic understanding helps specific tasks.
While the model still faces challenges—extreme camera zooms or complex hand poses remain difficult—this work paves the way for highly realistic virtual avatars, movie production tools, and immersive digital experiences where the wind blows and the water flows just as it does in the real world.
This blog post explains the paper “X-Dyna: Expressive Dynamic Human Image Animation” by Xie et al.