Introduction
In the rapidly evolving world of Generative AI, animating human characters has become a frontier of intense research. We have seen impressive results where a single photograph of a person can be brought to life, driven by a video of a dancer or a speaker. Models like AnimateAnyone and MagicAnimate have set the standard for this “reference-based” animation. However, these models share a significant limitation: they are generally bound to the viewpoint of the original reference image.
Imagine you have a close-up portrait of a character, but you want to generate a video of them walking away from the camera, revealing their back and legs. Or conversely, you have a wide shot, but you want to generate a dramatic close-up of their face. Current models struggle immensely with this. They tend to hallucinate details, distort anatomy, or simply fail to generate high-frequency details when the camera distance or angle changes drastically between the reference and the target video.
In a recent paper titled “Free-viewpoint Human Animation with Pose-correlated Reference Selection,” researchers from HKUST and Adobe Research propose a robust solution to this problem. They introduce a method that moves beyond the single-image paradigm, utilizing multiple reference images to create a cohesive, “free-viewpoint” video.

As shown in Figure 1 above, their approach can take reference images of a subject (left) and synthesize a video where the camera pans, zooms, and rotates around the character (right), maintaining identity and appearance consistency in ways previous methods could not.
The Core Problem: The Limits of a Single View
To understand the innovation here, we must first understand the bottleneck. Most state-of-the-art human animation models use a specific architecture called a ReferenceNet coupled with a Denoising UNet.
- ReferenceNet extracts features from a single source image (the reference).
- Denoising UNet generates the video frame-by-frame, using a skeleton (pose) as a guide.
The issue arises when the driving pose asks for information that the reference image simply doesn’t have. If your reference is a side profile, it contains zero data about the person’s other eye. If the reference is a long shot, the facial features occupy only a handful of pixels. When the model tries to generate a close-up from that long shot, it has to “guess” the details, leading to blurry or uncanny results.
The intuitive solution is to give the model more reference images—a front view, a side view, a close-up, etc. However, simply dumping multiple images into a diffusion model creates a new problem: computational explosion and feature confusion. The model gets overwhelmed by redundant or irrelevant data.
The Solution: Adaptive Reference Selection
The researchers propose a new framework that allows the use of multiple reference images efficiently. The core of their approach is to selectively use information. Instead of processing every pixel of every reference image for every generated frame, the model intelligently determines which parts of which reference image are relevant to the current target pose.
System Architecture
The overall framework is built on the popular “Double UNet” backbone but with significant modifications to handle multiple inputs.

As illustrated in Figure 2, the process works as follows:
- Input: A set of \(N\) reference images \(\{\mathbf{I}_{ref}^i\}\) and a sequence of target poses (skeletons).
- Reference Feature Extraction: A Reference UNet extracts feature maps from all reference images.
- Pose Correlation: A specialized module compares the target pose with the reference poses to see which reference image matches the current action best.
- Selection: The system filters out redundant information, keeping only the “Top-K” most useful features.
- Generation: These selected features guide the Denoising UNet to create the final video frame.
Let’s break down the mathematical machinery that makes this selection possible.
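Before unpacking each equation, it helps to see the whole loop in code. The snippet below is a self-contained toy sketch of the data flow only: it uses random tensors in place of real images and networks, and a cosine-similarity stand-in for the learned Pose Correlation Module. None of the names, shapes, or operators come from the authors’ implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, T, C, H, W = 4, 2, 8, 16, 16            # references, target frames, channels, spatial size

ref_images   = torch.randn(N, 3, H, W)     # reference photos (random stand-ins)
ref_poses    = torch.randn(N, 3, H, W)     # skeleton renderings of the references
target_poses = torch.randn(T, 3, H, W)     # skeleton renderings of the frames to generate

feat_proj = torch.nn.Conv2d(3, C, 3, padding=1)   # stand-in for the Reference UNet
pose_proj = torch.nn.Conv2d(3, C, 3, padding=1)   # stand-in for the pose encoder

ref_feats  = feat_proj(ref_images)         # appearance features, (N, C, H, W)
ref_pose_f = pose_proj(ref_poses)          # pose features,       (N, C, H, W)

K = 64                                     # token budget per generated frame
for t in range(T):
    tgt_pose_f = pose_proj(target_poses[t:t + 1])                                     # (1, C, H, W)

    # Pose correlation (toy version): how similar is each reference region to the target pose?
    corr = F.cosine_similarity(ref_pose_f, tgt_pose_f.expand_as(ref_pose_f), dim=1)   # (N, H, W)

    # Weight appearance features by relevance, then flatten all references into one token pool.
    weighted = ref_feats * corr.unsqueeze(1)                                           # (N, C, H, W)
    tokens   = weighted.permute(0, 2, 3, 1).reshape(-1, C)                             # (N*H*W, C)
    scores   = corr.reshape(-1)                                                        # (N*H*W,)

    # Adaptive selection: keep only the Top-K most pose-relevant tokens.
    top     = torch.topk(scores, k=K).indices
    context = tokens[top]                  # (K, C) conditioning handed to the Denoising UNet
    print(f"frame {t}: {tuple(context.shape)} context tokens selected")
```

The real system replaces the cosine similarity with a learned transformer-based correlation module and feeds the selected tokens into the Denoising UNet’s attention layers; those pieces are examined in the sections below.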
1. Pose Correlation Learning
The heart of this paper is the Pose Correlation Module (PCM). Its job is to answer the question: “Given that I want to generate a person raising their left hand (Target Pose), which of my 10 reference images contains the best information about the left hand?”

The PCM takes the skeleton of a reference image and the skeleton of the target frame. It uses a lightweight encoder to turn these stick figures into feature vectors.
\[
\mathbf{F}_{ref} = \mathcal{E}^P\big(P_{ref}\big), \qquad \mathbf{F}_{tgt} = \mathcal{E}^P\big(P_{tgt}\big)
\]
Here, \(\mathcal{E}^P\) represents the pose encoder. It extracts features \(\mathbf{F}\) from the reference pose (\(P_{ref}\)) and the target pose (\(P_{tgt}\)).
Next, the model uses a Transformer block with Cross-Attention layers to find the relationship between these poses. Interestingly, the authors design the attention mechanism such that the Target Pose acts as the Key (\(K\)) and Value (\(V\)), while the Reference Pose acts as the Query (\(Q\)).
\[
\mathbf{R}^{i,j} = f_{zero}\Big(\mathcal{T}\big(\mathbf{F}^{i}_{ref},\ \mathbf{F}^{j}_{tgt}\big)\Big)
\]
In this equation:
- \(\mathcal{T}\) is the transformer block.
- \(f_{zero}\) is a zero-initialized convolution layer (ensuring the training starts neutrally).
- \(\mathbf{R}^{i,j}\) is the resulting Correlation Map.
This map, \(\mathbf{R}\), is a heatmap. It highlights specific regions in the reference image that are structurally correlated with the target pose. If the target pose involves a specific head tilt, the map will light up the head region of the reference image that has a similar tilt.
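A minimal PyTorch sketch of this module is shown below: the reference pose features act as the queries, the target pose features act as the keys and values, and a zero-initialized \(1 \times 1\) convolution produces the correlation map. The layer sizes, the sigmoid squashing, and the class name are illustrative assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class PoseCorrelationSketch(nn.Module):
    """Toy version of the Pose Correlation Module: Q = reference pose, K/V = target pose."""

    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        # Lightweight pose encoder (stand-in for E^P): two strided convs over the skeleton image.
        self.pose_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.SiLU(),
        )
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Zero-initialized projection (f_zero): the map contributes nothing at the start of training.
        self.f_zero = nn.Conv2d(channels, 1, kernel_size=1)
        nn.init.zeros_(self.f_zero.weight)
        nn.init.zeros_(self.f_zero.bias)

    def forward(self, ref_pose: torch.Tensor, tgt_pose: torch.Tensor) -> torch.Tensor:
        f_ref = self.pose_encoder(ref_pose)            # (B, C, h, w)
        f_tgt = self.pose_encoder(tgt_pose)            # (B, C, h, w)
        b, c, h, w = f_ref.shape

        q  = f_ref.flatten(2).transpose(1, 2)          # (B, h*w, C) reference queries
        kv = f_tgt.flatten(2).transpose(1, 2)          # (B, h*w, C) target keys/values
        attended, _ = self.cross_attn(q, kv, kv)

        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        # Sigmoid is an illustrative choice to keep the map in [0, 1].
        return torch.sigmoid(self.f_zero(attended))    # correlation map R, (B, 1, h, w)


# Usage: one reference skeleton vs. one target skeleton (assumed 128x128 renderings).
pcm = PoseCorrelationSketch()
corr_map = pcm(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
print(corr_map.shape)   # torch.Size([1, 1, 32, 32])
```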
Once the correlation map is generated, it is used to enhance the reference features. The features extracted from the reference image (\(\mathbf{F}^i_l\)) are multiplied by the interpolated correlation map:
\[
\hat{\mathbf{F}}^{i}_{l} = \mathbf{F}^{i}_{l} \odot \mathrm{Interp}\big(\mathbf{R}^{i,j}\big)
\]
This step effectively “turns up the volume” on the useful parts of the reference image and “mutes” the irrelevant parts.
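In code, this weighting is little more than a resize and a broadcasted multiply. The sketch below assumes the correlation map is bilinearly interpolated to each Reference UNet layer’s resolution; the feature shapes are made-up examples, not the model’s actual dimensions.

```python
import torch
import torch.nn.functional as F

def enhance_reference_features(ref_feats, corr_map):
    """Scale per-layer reference features by the pose-correlation map (illustrative only).

    ref_feats: list of feature maps from the Reference UNet, e.g. [(1, 320, 64, 64), (1, 640, 32, 32)]
    corr_map:  (1, 1, h, w) correlation map produced by the Pose Correlation Module
    """
    enhanced = []
    for feat in ref_feats:
        # Resize the map to this layer's spatial resolution.
        r = F.interpolate(corr_map, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        # "Turn up the volume" on pose-relevant regions and mute everything else.
        enhanced.append(feat * r)
    return enhanced


feats = [torch.randn(1, 320, 64, 64), torch.randn(1, 640, 32, 32)]
corr  = torch.rand(1, 1, 64, 64)
print([tuple(f.shape) for f in enhance_reference_features(feats, corr)])
# [(1, 320, 64, 64), (1, 640, 32, 32)]
```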
2. The Selection Strategy: Filtering the Noise
Now that the system has enhanced features based on pose correlation, it faces the computational bottleneck. If we have 10 reference images, we have 10 times the data to process. To solve this, the authors implement an Adaptive Reference Selection Strategy.
First, they flatten and concatenate all the correlation values (\(\mathbf{r}\)) and reference features (\(\mathbf{f}\)) from all \(N\) reference images.

The system then sorts these features based on their correlation scores. It looks for the features with the highest “relevance” score.

Using argsort, the model identifies the indices of the top \(K_l\) features (where \(K_l\) is the fixed feature budget allowed for layer \(l\)). Essentially, the model says, “I only have a budget for 1,000 feature tokens; I will take the 1,000 that have the highest correlation score with my target pose.”
It then fuses these top features with their correlation scores.

This filtered set of features, \(\mathbf{f}_{cor}\), is what is finally sent to the Denoising UNet. This keeps the computational cost manageable regardless of how many reference images are available, while ensuring that the most critical visual data is preserved.
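A minimal sketch of that selection step for one UNet layer might look like the following; the token layout and the score-weighted fusion at the end are assumptions for illustration, not the authors’ exact operators.

```python
import torch

def select_top_k(enhanced_feats, corr_maps, k):
    """Adaptive reference selection for a single layer (illustrative sketch).

    enhanced_feats: (N, C, h, w) correlation-weighted features from N reference images
    corr_maps:      (N, 1, h, w) correlation maps at the same resolution
    k:              token budget K_l for this layer
    """
    n, c, h, w = enhanced_feats.shape

    # Flatten and concatenate across all references into one long token sequence.
    f = enhanced_feats.permute(0, 2, 3, 1).reshape(n * h * w, c)   # (N*h*w, C)
    r = corr_maps.reshape(n * h * w)                               # (N*h*w,)

    # Keep only the K_l tokens with the highest correlation scores.
    order   = torch.argsort(r, descending=True)
    top_idx = order[: min(k, r.numel())]

    # Fuse the surviving tokens with their scores (a simple re-weighting here).
    return f[top_idx] * r[top_idx].unsqueeze(-1)                   # (K_l, C)


tokens = select_top_k(torch.randn(10, 320, 32, 32), torch.rand(10, 1, 32, 32), k=1000)
print(tokens.shape)   # torch.Size([1000, 320])
```

Because the budget is fixed per layer, the cost of the attention in the Denoising UNet stays roughly constant no matter how many reference images are supplied.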
A Clever Training Trick: Random Sampling
One interesting detail in the paper is a specific training strategy. Relying solely on “Top-K” selection during training is risky because the argsort operation is non-differentiable—gradients can’t flow through it easily to update the model weights. Furthermore, if the model always picks the “best” features, it might get stuck in a local minimum and fail to explore other useful contexts.
To counter this, the researchers mix the “Top-K” features with a set of randomly sampled features during the training phase.

By forcing the model to occasionally look at random parts of the reference images (via \(S_{uni}\), uniform sampling), the training stabilizes, and the pose correlation module learns more robustly.
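A sketch of how such a mix could be implemented is shown below; the split between Top-K and uniformly sampled tokens, and the function name, are assumptions rather than the paper’s exact recipe.

```python
import torch

def training_token_indices(scores, k_top, k_rand, generator=None):
    """During training, mix Top-K indices with uniformly sampled ones (illustrative sketch).

    scores: (M,) correlation scores for all flattened reference tokens
    k_top:  number of highest-scoring tokens to keep
    k_rand: number of extra tokens drawn uniformly at random (the S_uni part)
    """
    top_idx = torch.argsort(scores, descending=True)[:k_top]

    # Uniform sampling over the remaining tokens lets training reach regions
    # that a hard Top-K would never select.
    mask = torch.ones_like(scores, dtype=torch.bool)
    mask[top_idx] = False
    remaining = mask.nonzero(as_tuple=True)[0]
    rand_idx  = remaining[torch.randperm(remaining.numel(), generator=generator)[:k_rand]]

    return torch.cat([top_idx, rand_idx])


idx = training_token_indices(torch.rand(10_000), k_top=800, k_rand=200)
print(idx.shape)   # torch.Size([1000])
```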
The MSTed Dataset
To train a model for “Free-viewpoint” animation, you need data that actually contains free viewpoints. Existing datasets (like DyMVHumans) are often captured in studios with fixed camera rings. These provide multiple angles, but the distance to the subject is usually constant. They lack the dynamic “zoom-in/zoom-out” variance found in real cinematic footage.
The authors introduced the Multi-Shot TED Video Dataset (MSTed). They curated over 15,000 video clips from TED talks.

Why TED talks? Because professional videography covers speakers with multiple cameras: wide shots of the stage, medium shots of the torso, and tight close-ups of the face. This natural variance in shot type and angle makes it the perfect training ground for a model designed to handle viewpoint shifts. MSTed contains over 1,000 unique identities, offering a massive diversity of appearances compared to the 33 identities in DyMVHumans.
Experiments and Results
The researchers compared their method against leading single-reference models: MagicAnimate, AnimateAnyone, and Champ.
Quantitative Performance
The results were measured using metrics like L1 (pixel error), LPIPS (perceptual similarity), and FVD (Fréchet Video Distance—a measure of how realistic the video movement looks).

On the MSTed dataset (Table 2), the proposed method outperforms the competitors significantly. Notably, even when restricted to a single reference image (R=1), the proposed model performs better than the others. This suggests that the training process itself—learning to look for pose correlations—makes the model more robust even when data is limited. When a second reference image is added (R=2), the FVD score drops dramatically (from 20.88 to 7.044), indicating much smoother and more realistic video generation.

Table 3 shows that the model scales well. Using 10 reference images (R=10) on the DyMVHumans dataset yields the best results across almost all metrics. This validates the “Selection Strategy”—the model effectively utilizes the extra data without getting confused.
Qualitative Performance
The numbers are impressive, but the visual comparison is where the difference becomes obvious.

In Figure 4, look at the rows for MagicAnimate and AnimateAnyone. You can often see artifacts or a loss of identity—the face might look slightly generic, or the clothes lose their texture. The “Ours” results maintain sharp facial features and consistent clothing patterns, closely matching the ground truth.
Ablation Study: Does the “Correlation” really work?
To prove that their specific modules (Pose Correlation and Reference Selection) were doing the heavy lifting, the authors performed an ablation study. They started with a baseline model and added features one by one.

As shown in Table 4, adding a second reference (baseline+2ref) improves the FVD score from 26.32 to 9.82. Adding the Pose Correlation Module (+H) further refines the quality to 7.60.
The researchers also visualized the Correlation Maps to prove the model wasn’t just guessing.

On the right side of the image above, you can see the Correlation Maps. Notice the thermal-like heatmaps overlaid on the reference poses. When the target pose (far right) shows a specific hand gesture or body orientation, the correlation map “lights up” the corresponding limbs in the reference images. This confirms the PCM is successfully identifying the most informative regions of the source data.
Conclusion
The paper “Free-viewpoint Human Animation with Pose-correlated Reference Selection” represents a significant step forward in generative video. By acknowledging that a single image is rarely enough to fully describe a 3D human in motion, the authors moved to a multi-reference paradigm.
The cleverness of their approach lies not just in using more data, but in using it selectively. The Pose Correlation Module acts as an intelligent director, spotlighting exactly which reference image offers the best angle for the current frame, while the Adaptive Reference Selection ensures the computational budget remains manageable.
Coupled with the release of the MSTed dataset, this work paves the way for more cinematic AI video generation, where camera angles are no longer constraints, but creative choices. As this technology matures, we can expect to see AI-generated characters that can withstand close scrutiny from any angle, effectively bridging the gap between 2D image generation and 3D-aware video synthesis.