Introduction

We have all seen them: AI-generated portraits that look almost right, but something is off. Perhaps the skin texture is too plastic, the eyes lack a certain spark, or the anatomy twists in ways human bones simply shouldn’t. Despite the massive leaps in diffusion models like Stable Diffusion, generating truly photorealistic humans remains one of the hardest challenges in computer vision.

The core issue often lies in how these models are fine-tuned. Typically, researchers use methods like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods train the model by showing it two generated images—one “good” and one “bad”—and telling it to prefer the good one. But there is a ceiling to this approach: if the model’s “good” image is still artificial and flawed, the model is only learning to be the “best of a bad bunch.” It isn’t learning what real looks like.

Enter HG-DPO (Human image Generation through Direct Preference Optimization). In a new paper, researchers at Kakao propose a novel approach that fundamentally shifts the target. Instead of asking the model to prefer a slightly better generated image, HG-DPO asks the model to prefer actual real images.

Figure 1: HG-DPO generates high-quality human images across diverse settings. The bottom row highlights how it fixes anatomical issues and adapts to personalized tasks.

As shown in Figure 1, the results are striking. By anchoring the training process to reality, the model learns to correct anatomical distortions and capture fine-grained details like lighting and texture. However, telling a model to “just mimic reality” makes training unstable, because the distribution of images the model can currently generate is far removed from the distribution of real photographs.

In this post, we will deconstruct how HG-DPO solves this problem using a clever three-stage curriculum learning strategy, bridging the gap between artificial noise and photorealism.

The Background: DPO and the Reality Gap

To understand why this paper is significant, we first need to look at Direct Preference Optimization (DPO).

In standard DPO for diffusion models, the training data consists of triplets: a prompt, a “winning” image (\(x_w\)), and a “losing” image (\(x_l\)). The objective is to adjust the model so that it becomes more likely to generate \(x_w\) and less likely to generate \(x_l\).
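For reference, the original (non-diffusion) DPO objective takes roughly the following form; Diffusion-DPO applies the same idea to per-timestep denoising errors, and HG-DPO’s exact weighting may differ:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(c,\,x_w,\,x_l)}\left[\log \sigma\!\left(\beta \log \frac{p_\theta(x_w \mid c)}{p_{\mathrm{ref}}(x_w \mid c)} - \beta \log \frac{p_\theta(x_l \mid c)}{p_{\mathrm{ref}}(x_l \mid c)}\right)\right]
\]

Here \(\sigma\) is the sigmoid function, \(p_\theta\) is the model being fine-tuned, \(p_{\mathrm{ref}}\) is a frozen reference copy of the model, and \(\beta\) controls how far the fine-tuned model may drift from that reference.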

Existing methods rely on datasets where both the winner and loser are generated by the AI. This works for alignment (following instructions), but it doesn’t help much with realism. The researchers argue that to achieve true realism, the “winning” image should be a real photograph.

However, simply swapping in real photos breaks the training. The statistical distribution of a real photo is vastly different from what a diffusion model generates. If you try to force the model to jump straight to reality, the training becomes unstable. This is known as the domain gap.

The Solution: A Three-Stage Curriculum

The authors propose a solution inspired by human education: Curriculum Learning. You don’t teach a child calculus before they learn addition. Similarly, HG-DPO teaches the model realism in three distinct stages, progressively increasing the difficulty.

Figure 2: The three-stage training pipeline of HG-DPO. The model progresses from an Easy stage (fixing anatomy) to a Normal stage (fixing composition) and finally a Hard stage (perfecting details).

As illustrated in Figure 2, the pipeline moves from the Generative Domain (Easy) to an Intermediate Domain (Normal) and finally to the Real Domain (Hard). Let’s break down each stage.

Stage 1: The Easy Stage (Anatomy & Alignment)

The goal of the first stage is basic quality control. The model, denoted as \(\epsilon_{base}\), often generates distorted limbs or ignores parts of the prompt.

In this stage, the researchers stick to the standard DPO approach but refine how the data is selected. They generate a pool of images for a single prompt and use an AI scorer (PickScore) to rank them.

Figure 3: The DPO Dataset for the easy stage. Winners are chosen based on undistorted anatomy and better text alignment compared to losers.

As shown in Figure 3, the “winner” is simply a generated image that got lucky—it has correct anatomy and follows the prompt. The “loser” is a generated image with distortions. By training on these pairs, the model learns to stop generating six fingers or twisted torsos.

The Image Pool Strategy

Instead of generating just two images, the authors generate a pool of \(N\) images for each prompt. Every image in the pool is then scored (\(S_{gen}\)), and the highest-scoring image becomes the winner (\(x^{\mathbf{w}}\)) while the lowest-scoring image becomes the loser (\(x^{\mathbf{l}}\)).
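A minimal sketch of this selection step, assuming a hypothetical `generate` callable for the base model and a `score` callable wrapping a preference model such as PickScore (the names and data format are illustrative, not the authors’ code):

```python
import numpy as np

def build_easy_stage_pair(prompt, generate, score, n_images=16):
    """Generate a pool of N images for one prompt and pick the best and
    worst according to a preference scorer (e.g., PickScore).

    `generate` and `score` are placeholder callables, not the paper's API.
    """
    pool = [generate(prompt) for _ in range(n_images)]        # pool of N candidates
    scores = np.array([score(prompt, img) for img in pool])   # S_gen
    winner = pool[int(scores.argmax())]                       # x^w: highest-scoring image
    loser = pool[int(scores.argmin())]                        # x^l: lowest-scoring image
    return {"prompt": prompt, "winner": winner, "loser": loser}
```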

Solving the Color Shift

There was a catch during the Easy stage experiments. The model started producing images with odd color casts (e.g., oversaturated or shifted hues). This happens because the statistics of the model’s latent space drift away from those of the original base model.

To fix this, the authors introduced a Statistics Matching Loss (\(\mathcal{L}_{stat}\)).

Equation 4: The Statistics Matching Loss formula.

This loss function forces the channel-wise mean of the model’s latent features to stay close to the base model’s statistics.
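The paper’s exact formulation is not reproduced here, but based on that description (keeping channel-wise latent statistics close to the frozen base model), a sketch of such a loss might look like this:

```python
import torch
import torch.nn.functional as F

def statistics_matching_loss(latents_finetuned, latents_base):
    """Penalize drift of the channel-wise mean of the fine-tuned model's
    latents away from the frozen base model's latents.

    Both tensors have shape (B, C, H, W). This is a hedged sketch of the
    idea described in the post, not the paper's exact loss.
    """
    mean_finetuned = latents_finetuned.mean(dim=(0, 2, 3))  # per-channel mean
    mean_base = latents_base.mean(dim=(0, 2, 3))
    return F.mse_loss(mean_finetuned, mean_base)
```

In training, this term would be added to the DPO objective with a small weight so that preference learning does not pull the latent statistics off balance.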

Figure 15: Comparison showing the effect of the statistics matching loss. Without it (left), images have unnatural color tones. With it (right), colors are natural.

Figure 15 demonstrates the impact of this loss. The images on the left (without \(\mathcal{L}_{stat}\)) look washed out or tint-shifted, while the images on the right retain natural lighting and color balance.

Stage 2: The Normal Stage (Bridging the Gap)

Once the model (\(\epsilon_{\mathbb{E}}\)) can generate anatomically correct humans, it’s time to tackle realism. However, we still can’t jump straight to real photos. The gap is too wide.

The Normal stage introduces an Intermediate Domain. The researchers create synthetic “winning” images that act as a bridge. They use a technique called Stochastic Differential Reconstruction (SDRecon).

How SDRecon Works

They take a real photograph, add a specific amount of noise to it (diffuse it forward in time), and then use the base model to “reconstruct” or denoise it back to an image.
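A rough sketch of this procedure with a diffusers-style scheduler and U-Net; the helper name, the `strength` parameterization, and the loop details are assumptions for illustration, not the authors’ implementation:

```python
import torch

@torch.no_grad()
def sdrecon(real_latent, text_emb, unet, scheduler, strength=0.5, num_steps=50):
    """Noise a real image's latent part-way along the forward diffusion
    process, then denoise it back with the base model.

    Higher strength adds more noise, so the reconstruction drifts further
    toward the generative domain; lower strength stays close to the photo.
    """
    scheduler.set_timesteps(num_steps)
    start = int(num_steps * (1 - strength))          # how many denoising steps to skip
    t_start = scheduler.timesteps[start]
    noise = torch.randn_like(real_latent)
    latents = scheduler.add_noise(real_latent, noise, t_start)       # forward diffusion
    for t in scheduler.timesteps[start:]:                            # reverse diffusion
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```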

Figure 17: Visualization of intermediate domains. Images range from \(t_1\) (close to real) to \(t_T\) (close to generated).

Figure 17 visualizes this spectrum.

  • \(t_1\): Very little noise added. The reconstructed image looks almost exactly like the real photo.
  • \(t_T\): A lot of noise added. The reconstruction looks like a purely AI-generated image.

For the Normal stage, they select images from the middle of this spectrum (\(t_4\) to \(t_7\)). These images have the composition and pose of the real photograph but the texture and noise patterns of a generated image.

In this stage:

  • Winner: An intermediate image (Realism + Generative Texture).
  • Loser: The winner from the Easy stage (Purely Generative).

This teaches the model to prefer realistic poses and compositions without shocking it with pixel-perfect real textures yet.
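Using the `sdrecon` sketch from the previous section, a Normal-stage preference pair could be assembled roughly like this; the strength range is only a guess at the paper’s middle band (\(t_4\) to \(t_7\)), and the data format is illustrative:

```python
import random

def build_normal_stage_pair(prompt, real_latent, text_emb, easy_stage_winner,
                            unet_base, scheduler):
    """Winner: an intermediate-domain reconstruction of a real photo
    (realistic pose and composition, generative texture).
    Loser: the winning image produced during the Easy stage."""
    # Moderate noise strength keeps the photo's composition but pushes
    # texture toward the generative domain (illustrative range).
    winner = sdrecon(real_latent, text_emb, unet_base, scheduler,
                     strength=random.uniform(0.4, 0.7))
    return {"prompt": prompt, "winner": winner, "loser": easy_stage_winner}
```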

Figure 16: Improvements from Easy to Normal stage. The Normal stage (right) produces more natural lighting and less “stiff” poses.

Stage 3: The Hard Stage (Photorealism)

Now the model (\(\epsilon_{\mathbb{N}}\)) understands anatomy and realistic composition. It is ready for the final exam: Real Images.

In the Hard stage, the “winning” images are drawn from the \(t_1\) domain—images so close to the real photos that they are indistinguishable to the human eye.

  • Winner: Real image (technically \(t_1\) reconstruction).
  • Loser: The winner from the Normal stage.

Equation 9: Selection of winner and loser for the Hard stage.
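The exact equation is not reproduced here, but in the notation used throughout this post it can be paraphrased as:

\[
x^{\mathbf{w}} = \mathrm{SDRecon}(x_{\mathrm{real}},\, t_1), \qquad x^{\mathbf{l}} = x^{\mathbf{w}}_{\mathrm{Normal}}
\]

where \(x^{\mathbf{w}}_{\mathrm{Normal}}\) denotes the winning image from the Normal stage.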

This final step forces the model to refine fine details: the texture of skin, the reflection in eyes, and the subtle shading that makes a photo look “real.”

Figure 21: The Hard stage introduces realistic shading and fine details compared to the Normal stage.

As seen in Figure 21, the Hard stage lifts the “plastic” look often associated with AI art, introducing vivid shading and sharpness.

Improving Text Alignment

While fine-tuning the U-Net (the image generator), the researchers noticed that image-text alignment could degrade slightly as the model focused on visual quality. To counter this, they separately trained the Text Encoder during the Easy stage.

Figure 23: The enhanced text encoder (right) captures specific details like “a tattoo on her left shoulder” that the standard model might miss.

By combining the Hard-stage U-Net with this enhanced Text Encoder, the final HG-DPO model achieves the best of both worlds: photorealism and high prompt adherence.
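In a diffusers-style pipeline, combining the two components amounts to swapping modules. The checkpoint paths below are placeholders (HG-DPO weights are not assumed to be public), and the base model is just an example of the pattern:

```python
from diffusers import StableDiffusionPipeline, UNet2DConditionModel
from transformers import CLIPTextModel

# Load a base pipeline, then swap in the HG-DPO-trained components.
pipe = StableDiffusionPipeline.from_pretrained("path/to/base-model")
pipe.unet = UNet2DConditionModel.from_pretrained("path/to/hgdpo-hard-stage-unet")
pipe.text_encoder = CLIPTextModel.from_pretrained("path/to/hgdpo-text-encoder")

image = pipe("a woman with a tattoo on her left shoulder, candid photo").images[0]
```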

Experiments and Results

The researchers compared HG-DPO against several state-of-the-art baselines, including Diffusion-DPO, Pick-a-Pic, and AlignProp. The evaluation used standard metrics like FID (Fréchet Inception Distance, measuring realism) and PickScore (measuring human preference).

Quantitative Superiority

Table 1: Quantitative comparison. HG-DPO outperforms other methods in almost every metric, particularly FID and PickScore.

Table 1 shows a clear dominance. HG-DPO achieves the lowest FID (29.41), significantly lower than the base model (37.34) and competitors like Diffusion-DPO (112.67). This confirms that HG-DPO’s outputs are statistically much closer to the distribution of real images.
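For readers who want to compute FID on their own samples, the torchmetrics implementation is one common option (a generic example, not the paper’s evaluation code):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048)

# By default, inputs are uint8 tensors of shape (N, 3, H, W) in [0, 255].
# Random tensors stand in for actual image batches here.
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # lower is better (closer to the real distribution)
```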

Qualitative Comparison

Numbers are great, but visual inspection is crucial for generative models.

Figure 4: Qualitative comparison. Note how HG-DPO (far right) handles complex lighting and group shots better than baselines.

In Figure 4, look at the second row (the selfie comparison). Many baselines struggle with the lighting or the facial structures. HG-DPO produces a natural, cohesive image that looks like a genuine photograph.

Does the Curriculum Matter? (Ablation Study)

You might wonder, “Can we skip the Easy or Normal stages?” The researchers asked this too and ran ablation studies.

Figure 7: Ablation study showing the necessity of each stage. Skipping the Normal stage (Hard w/o Normal) results in lower quality than the full pipeline.

Figure 7 proves the curriculum is essential.

  • Hard w/o Easy: The model collapses or produces artifacts because it wasn’t ready for the hard data.
  • Hard w/o Normal: The model improves but lacks the refined realism of the full pipeline.

Personalization Applications

One of the most practical applications of this technology is Personalized Text-to-Image (PT2I)—generating images of a specific person (like yourself) in different styles. HG-DPO can be plugged into existing personalization frameworks like InstantBooth without additional training.

Figure 12: HG-DPO improves personalized generation, keeping the subject’s identity while enhancing realism.

Conclusion and Implications

The HG-DPO paper presents a compelling argument: if we want AI to generate real-looking images, we must find a way to train it on real images. The hurdle has always been the distributional gap between what the model can currently generate and what real photographs look like.

By using a curriculum learning approach, graduating from Anatomy to Composition to Texture, HG-DPO successfully bridges this gap. It effectively coaches the model, ensuring it doesn’t just memorize good generations but actively pushes toward the distribution of real photography.

While the model still has occasional limitations (fingers remain the nemesis of all AI, as shown in Figure 24 below), HG-DPO represents a significant step forward in crossing the “uncanny valley.”

Figure 24: Limitations. Despite high realism, fingers can still be tricky for the model.

For students and researchers in generative AI, this paper highlights the importance of data selection strategy. It’s not just about the architecture or the loss function; it’s about what you show the model, and when you show it.