Introduction
We have all seen them: AI-generated portraits that look almost right, but something is off. Perhaps the skin texture is too plastic, the eyes lack a certain spark, or the anatomy twists in ways human bones simply shouldn’t. Despite the massive leaps in diffusion models like Stable Diffusion, generating truly photorealistic humans remains one of the hardest challenges in computer vision.
The core issue often lies in how these models are fine-tuned. Typically, researchers use methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods train the model by showing it two generated images—one “good” and one “bad”—and telling it to prefer the good one. But there is a ceiling to this approach: if the model’s “good” image is still artificial and flawed, the model is only learning to be the “best of a bad bunch.” It isn’t learning what real looks like.
Enter HG-DPO (Human image Generation through Direct Preference Optimization). In a new paper, researchers at Kakao propose an approach that fundamentally shifts the goalposts. Instead of asking the model to prefer a slightly better generated image, HG-DPO asks the model to prefer actual real images.

As shown in Figure 1, the results are striking. By anchoring the training process to reality, the model learns to correct anatomical distortions and capture fine-grained details like lighting and texture. However, telling a model to "just mimic reality" is easier said than done: the statistical gap between what the model generates and real photographs makes naive training unstable.
In this post, we will deconstruct how HG-DPO solves this problem using a clever three-stage curriculum learning strategy, bridging the gap between artificial noise and photorealism.
The Background: DPO and the Reality Gap
To understand why this paper is significant, we first need to look at Direct Preference Optimization (DPO).
In standard DPO for diffusion models, the training data consists of triplets: a prompt, a “winning” image (\(x_w\)), and a “losing” image (\(x_l\)). The objective is to adjust the model so that it becomes more likely to generate \(x_w\) and less likely to generate \(x_l\).
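For reference, the standard DPO objective looks roughly like this (the diffusion-model variant used by Diffusion-DPO replaces the image likelihoods with per-timestep noise-prediction errors, but the structure is the same):

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(c,\, x_w,\, x_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x_w \mid c)}{\pi_{\mathrm{ref}}(x_w \mid c)} \;-\; \beta \log \frac{\pi_\theta(x_l \mid c)}{\pi_{\mathrm{ref}}(x_l \mid c)}\right)\right]
\]

Here \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) is the frozen base model, and \(\beta\) controls how far the trained model is allowed to drift from the reference.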
Existing methods rely on datasets where both the winner and loser are generated by the AI. This works for alignment (following instructions), but it doesn’t help much with realism. The researchers argue that to achieve true realism, the “winning” image should be a real photograph.
However, simply swapping in real photos breaks the training. The statistical distribution of a real photo is vastly different from what a diffusion model generates. If you try to force the model to jump straight to reality, the training becomes unstable. This is known as the domain gap.
The Solution: A Three-Stage Curriculum
The authors propose a solution inspired by human education: Curriculum Learning. You don’t teach a child calculus before they learn addition. Similarly, HG-DPO teaches the model realism in three distinct stages, progressively increasing the difficulty.

As illustrated in Figure 2, the pipeline moves from the Generative Domain (Easy) to an Intermediate Domain (Normal) and finally to the Real Domain (Hard). Let’s break down each stage.
Stage 1: The Easy Stage (Anatomy & Alignment)
The goal of the first stage is basic quality control. The model, denoted as \(\epsilon_{base}\), often generates distorted limbs or ignores parts of the prompt.
In this stage, the researchers stick to the standard DPO approach but refine how the data is selected. They generate a pool of images for a single prompt and use an AI scorer (PickScore) to rank them.

As shown in Figure 3, the “winner” is simply a generated image that got lucky—it has correct anatomy and follows the prompt. The “loser” is a generated image with distortions. By training on these pairs, the model learns to stop generating six fingers or twisted torsos.
The Image Pool Strategy
Instead of generating just two images, the authors generate a pool of \(N\) images for each prompt. They then score every image in the pool (\(S_{gen}\)), take the highest-scoring image as the winner (\(x^{\mathbf{w}}\)) and the lowest-scoring image as the loser (\(x^{\mathbf{l}}\)).
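A minimal sketch of this selection step, assuming a generic text-to-image callable and a PickScore-style scorer (function names are illustrative, not the paper's code):

```python
import torch

def build_preference_pair(pipe, scorer, prompt, pool_size=8):
    """Generate a pool of images for one prompt and pick the best/worst by score.

    `pipe` is any text-to-image callable returning an image for a prompt;
    `scorer` is a preference model such as PickScore (higher = better).
    """
    images = [pipe(prompt) for _ in range(pool_size)]            # candidate pool
    scores = torch.tensor([scorer(prompt, img) for img in images])
    winner = images[scores.argmax().item()]                       # x^w: highest-scoring sample
    loser = images[scores.argmin().item()]                        # x^l: lowest-scoring sample
    return prompt, winner, loser
```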

Solving the Color Shift
There was a catch during the Easy-stage experiments: the model started producing images with odd color casts (e.g., oversaturated or shifted hues). This happens because the statistics of the fine-tuned model's latent outputs drift away from those of the base model.
To fix this, the authors introduced a Statistics Matching Loss (\(\mathcal{L}_{stat}\)).

This loss function forces the channel-wise mean of the model’s latent features to stay close to the base model’s statistics.
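A rough sketch of that idea in code, matching channel-wise latent means against the base model (this is my reading of the description above, not the authors' implementation):

```python
import torch

def statistics_matching_loss(pred_latents: torch.Tensor,
                             base_latents: torch.Tensor) -> torch.Tensor:
    """Penalize drift in the channel-wise mean of the fine-tuned model's latents.

    Both tensors are shaped (batch, channels, height, width); `base_latents`
    come from the frozen base model and act as the statistical anchor.
    """
    pred_mean = pred_latents.mean(dim=(0, 2, 3))   # per-channel mean over batch and space
    base_mean = base_latents.mean(dim=(0, 2, 3))
    return torch.mean((pred_mean - base_mean) ** 2)
```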

Figure 15 demonstrates the impact of this loss. The images on the left (without \(\mathcal{L}_{stat}\)) look washed out or tint-shifted, while the images on the right retain natural lighting and color balance.
Stage 2: The Normal Stage (Bridging the Gap)
Once the model (\(\epsilon_{\mathbb{E}}\)) can generate anatomically correct humans, it’s time to tackle realism. However, we still can’t jump straight to real photos. The gap is too wide.
The Normal stage introduces an Intermediate Domain. The researchers create synthetic “winning” images that act as a bridge. They use a technique called Stochastic Differential Reconstruction (SDRecon).
How SDRecon Works
They take a real photograph, add a specific amount of noise to it (diffuse it forward in time), and then use the base model to “reconstruct” or denoise it back to an image.
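In diffusion terms, SDRecon is just the forward noising process followed by ordinary denoising from an intermediate timestep. A minimal sketch, assuming the frozen base model is wrapped as a single-step `denoise_fn` (a placeholder, not the paper's code):

```python
import torch

def sdrecon(x0_latent, denoise_fn, alphas_cumprod, t_start):
    """Reconstruct a real image's latent through the generative model.

    x0_latent       : latent of a real photo, shape (1, C, H, W)
    denoise_fn(x, t): one reverse-diffusion step of the frozen base model (placeholder)
    alphas_cumprod  : cumulative noise schedule, shape (T,)
    t_start         : how much noise to inject; larger -> more "generated"-looking result
    """
    # Forward process: noise the real latent up to timestep t_start
    noise = torch.randn_like(x0_latent)
    a_bar = alphas_cumprod[t_start]
    x_t = a_bar.sqrt() * x0_latent + (1.0 - a_bar).sqrt() * noise

    # Reverse process: denoise back to t = 0 with the base model
    for t in range(t_start, -1, -1):
        x_t = denoise_fn(x_t, t)
    return x_t  # keeps the photo's pose/composition, gains generative texture
```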

Figure 17 visualizes this spectrum.
- \(t_1\): Very little noise added. The reconstructed image looks almost exactly like the real photo.
- \(t_T\): A lot of noise added. The reconstruction looks like a purely AI-generated image.
For the Normal stage, they select images from the middle of this spectrum (\(t_4\) to \(t_7\)). These images have the composition and pose of the real photograph but the texture and noise patterns of a generated image.
In this stage:
- Winner: An intermediate image (Realism + Generative Texture).
- Loser: The winner from the Easy stage (Purely Generative).
This teaches the model to prefer realistic poses and compositions without shocking it with pixel-perfect real textures yet.

Stage 3: The Hard Stage (Photorealism)
Now the model (\(\epsilon_{\mathbb{N}}\)) understands anatomy and realistic composition. It is ready for the final exam: Real Images.
In the Hard stage, the “winning” images are drawn from the \(t_1\) domain—images so close to the real photos that they are indistinguishable to the human eye.
- Winner: Real image (technically \(t_1\) reconstruction).
- Loser: The winner from the Normal stage.
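Putting the curriculum together, the preference pairs evolve roughly as follows (a schematic summary, not code from the paper):

```python
# Schematic: how the winner/loser pair changes across the three stages.
curriculum = {
    "easy":   {"winner": "best generated image (PickScore)",
               "loser":  "worst generated image (PickScore)"},
    "normal": {"winner": "SDRecon image from mid noise levels (t4-t7)",
               "loser":  "winner of the Easy stage"},
    "hard":   {"winner": "near-real t1 reconstruction of a real photo",
               "loser":  "winner of the Normal stage"},
}
```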

This final step forces the model to refine fine details: the texture of skin, the reflection in eyes, and the subtle shading that makes a photo look “real.”

As seen in Figure 21, the Hard stage removes the "plastic" look often associated with AI art, introducing vivid shading and sharpness.
Improving Text Alignment
While the U-Net (the image generator) is training, the researchers noticed that image-text alignment could degrade slightly as the model focused intensely on visual quality. To counter this, they separately trained the Text Encoder during the Easy stage.

By combining the Hard-stage U-Net with this enhanced Text Encoder, the final HG-DPO model achieves the best of both worlds: photorealism and high prompt adherence.
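Conceptually, the final model pairs the Hard-stage U-Net with the separately tuned text encoder. With a diffusers-style Stable Diffusion pipeline, assembling such a combination might look like this (the checkpoint paths and base model ID are placeholders, not from the paper):

```python
from diffusers import StableDiffusionPipeline, UNet2DConditionModel
from transformers import CLIPTextModel

# Placeholder checkpoints: the curriculum-trained U-Net and the
# Easy-stage text encoder, saved wherever training put them.
unet = UNet2DConditionModel.from_pretrained("ckpts/hg-dpo-hard", subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained("ckpts/hg-dpo-easy", subfolder="text_encoder")

# Swap both components into an otherwise standard pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed base; the paper may use a different backbone
    unet=unet,
    text_encoder=text_encoder,
)
image = pipe("a candid photo of a woman laughing in a cafe").images[0]
```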
Experiments and Results
The researchers compared HG-DPO against several state-of-the-art baselines, including Diffusion-DPO, Pick-a-Pic, and AlignProp. The evaluation used standard metrics like FID (Fréchet Inception Distance, measuring realism) and PickScore (measuring human preference).
Quantitative Superiority

Table 1 shows a clear dominance. HG-DPO achieves the lowest FID (29.41), significantly lower than the base model (37.34) and competitors like Diffusion-DPO (112.67). This confirms that HG-DPO's outputs are statistically much closer to real images.
Qualitative Comparison
Numbers are great, but visual inspection is crucial for generative models.

In Figure 4, look at the second row (the selfie comparison). Many baselines struggle with the lighting or the facial structures. HG-DPO produces a natural, cohesive image that looks like a genuine photograph.
Does the Curriculum Matter? (Ablation Study)
You might wonder, “Can we skip the Easy or Normal stages?” The researchers asked this too and ran ablation studies.

Figure 7 proves the curriculum is essential.
- Hard w/o Easy: The model collapses or produces artifacts because it wasn’t ready for the hard data.
- Hard w/o Normal: The model improves but lacks the refined realism of the full pipeline.
Personalization Applications
One of the most practical applications of this technology is Personalized Text-to-Image (PT2I)—generating images of a specific person (like yourself) in different styles. HG-DPO can be plugged into existing personalization frameworks like InstantBooth without additional training.

Conclusion and Implications
The HG-DPO paper presents a compelling argument: if we want AI to generate real-looking images, we must find a way to train it on real images. The hurdle has always been the mathematical gap between what the model knows (noise) and what reality looks like.
By using a curriculum learning approach that graduates from anatomy to composition to texture, HG-DPO successfully bridges this gap. It guides the model step by step, ensuring it doesn't just memorize good generations but actively pushes toward the distribution of real photography.
While the model still has occasional limitations (fingers remain the nemesis of all AI, as shown in Figure 24 below), HG-DPO represents a significant step forward in crossing the “uncanny valley.”

For students and researchers in generative AI, this paper highlights the importance of data selection strategy. It’s not just about the architecture or the loss function; it’s about what you show the model, and when you show it.