Imagine video calling a friend, but instead of staring at a flat 2D rectangle on your phone, you are looking at a photo-realistic 3D hologram of them. You can walk around them, see the back of their shirt, or watch them dance from any angle. This is the “Holy Grail” of telepresence and the metaverse.
For years, achieving this required Hollywood-style motion capture studios with 50+ cameras and hours of processing time. But a new paper titled “Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures” is changing the game.
The researchers propose a method called DUT (Double Unprojected Textures). It allows for photo-realistic, 4K resolution, real-time rendering of humans using only a few cameras (sparse-view).

In this post, we’ll break down how they achieved this, focusing on their unique “double” approach that solves one of the hardest chicken-and-egg problems in computer vision.
The Core Problem: Geometry vs. Appearance
To render a digital twin of a human, you generally need two things:
- Geometry: The 3D shape of the person (the mesh).
- Appearance: The colors and textures painted onto that shape.
In a “sparse-view” setup (e.g., just 3 or 4 cameras), you have very little data. Existing real-time methods usually try to cut corners. Some try to learn geometry and appearance simultaneously, which often leads to “ghosting” artifacts where the texture slides around on the body. Others ignore the image data when estimating the body shape, relying only on skeletal motion, which results in smooth, plastic-looking avatars that lack realistic clothing wrinkles.
The researchers realized that to get high fidelity, you need to decouple these two tasks, but you need to use the image data for both.
The Solution: Double Unprojected Textures (DUT)
The intuition behind DUT is surprisingly simple: If you try to paint a texture on a bad shape, the painting looks messy. If you fix the shape first, the painting looks great.
The method runs in two distinct stages, performing a “texture unprojection” step in each.
What is Texture Unprojection?
Imagine you have a 3D model of a person and a photo of them. “Unprojection” is the process of taking the pixels from the photo and projecting them back onto the 3D model, then unwrapping that model into a flat 2D image (a texture map).
If your 3D model matches the person in the photo perfectly, the resulting texture map looks clean. If your 3D model is slightly off (e.g., the digital arm is lower than the real arm), the texture map looks distorted and “smeared.”
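To make this concrete, here is a minimal sketch of single-camera, per-vertex unprojection in PyTorch. The function name, texture resolution, and the per-vertex simplification are assumptions for illustration; the paper's actual pipeline works per texel across multiple views and handles visibility, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def unproject_texture(verts, uvs, K, R, t, image, tex_res=256):
    """verts: (V, 3) world-space vertices, uvs: (V, 2) in [0, 1],
    K: (3, 3) intrinsics, R: (3, 3) / t: (3,) extrinsics, image: (3, H, W)."""
    _, H, W = image.shape

    # 1) Project each vertex into the camera image plane.
    cam = verts @ R.T + t                               # world -> camera coordinates
    pix = cam @ K.T                                     # homogeneous pixel coordinates
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)

    # 2) Sample the image color at each projected location (bilinear).
    grid = torch.stack([pix[:, 0] / (W - 1), pix[:, 1] / (H - 1)], dim=-1) * 2 - 1
    colors = F.grid_sample(image[None], grid[None, None], align_corners=True)
    colors = colors[0, :, 0].T                          # (V, 3)

    # 3) Scatter the sampled colors into a flat UV texture map.
    texture = torch.zeros(tex_res, tex_res, 3)
    tex_xy = (uvs * (tex_res - 1)).long().clamp(0, tex_res - 1)
    texture[tex_xy[:, 1], tex_xy[:, 0]] = colors
    return texture  # clean if the mesh matches the photo, smeared if it does not
```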
The Pipeline Overview
The DUT architecture operates entirely using efficient 2D Convolutional Neural Networks (CNNs), which is the secret to its real-time speed.

As shown in Figure 2 above, the system is split into two main pipelines:
- Blue Pipeline (GeoNet): Image-Conditioned Template Deformation.
- Green Pipeline (GauNet): Texel Gaussian Prediction.
Let’s walk through them.
Stage 1: Fixing the Geometry (GeoNet)
The process starts with a standard “body template” (a generic 3D human mesh) that is posed to match the actor’s skeleton. However, a generic template doesn’t capture the specific body shape or the way clothing folds and moves.
In previous works, researchers would often ignore the video feed at this stage. DUT, however, performs the First Unprojection. They project the video pixels onto this generic template.
Because the template isn’t perfect, the resulting texture map is distorted. But here is the clever part: The distortion itself contains information.

As visualized in Figure 15, the network can look at the distorted texture map (center) and figure out how to fix the mesh. The researchers train a network called GeoNet that takes this messy texture map and predicts a “deformation map” (Equation 5).

This map tells the system exactly how to push and pull the vertices of the mesh so that the geometry aligns with the actual person in the video.
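Below is a toy sketch of that idea, assuming a small 2D CNN stands in for GeoNet: it maps the distorted texture to a per-texel deformation map, and each template vertex is then displaced by sampling that map at its UV coordinate. The architecture, channel counts, and helper names here are illustrative, not the paper's actual design.

```python
import torch.nn as nn
import torch.nn.functional as F

class ToyGeoNet(nn.Module):
    """Stand-in for GeoNet: distorted texture in, per-texel (dx, dy, dz) map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, distorted_texture):        # (1, 3, H, W)
        return self.net(distorted_texture)       # (1, 3, H, W) deformation map

def deform_template(verts, uvs, deformation_map):
    """Displace template vertices by sampling the deformation map at their UVs.
    verts: (V, 3), uvs: (V, 2) in [0, 1], deformation_map: (1, 3, H, W)."""
    grid = (uvs * 2 - 1)[None, None]                                    # (1, 1, V, 2)
    offsets = F.grid_sample(deformation_map, grid, align_corners=True)  # (1, 3, 1, V)
    return verts + offsets[0, :, 0].T                                   # (V, 3)

# deformed_verts = deform_template(template_verts, template_uvs, ToyGeoNet()(tex_first))
```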
Stage 2: Painting with Gaussians (GauNet)
With the geometry corrected by Stage 1, the system applies the predicted deformation to the mesh and performs the Second Unprojection.
Because the geometry is now accurate, this second texture map is much cleaner, with far fewer artifacts and ghosting issues.

In Figure 3, you can see the difference. The top row shows the first unprojection (messy seams, distortions on the arm). The bottom row shows the second unprojection after the geometry correction—it is significantly sharper and more consistent.
This clean texture map is fed into the second network, GauNet.
Instead of just predicting RGB colors (which can look flat), GauNet predicts parameters for 3D Gaussian Splatting.

For every pixel in the texture map (texel), the network predicts a 3D Gaussian (a fuzzy 3D blob with position, rotation, scale, opacity, and color). This allows the system to represent fine details like hair strands and cloth wrinkles that a flat triangle mesh simply cannot capture.
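A toy sketch of that prediction step is below, again with a stand-in 2D CNN. The channel split and activations (quaternion rotation, exponential scales, sigmoid opacity and color) are common choices in Gaussian Splatting pipelines and are assumptions here, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGauNet(nn.Module):
    """Stand-in for GauNet: clean texture in, per-texel Gaussian parameters out.
    Per texel: 3 offset + 4 rotation (quaternion) + 3 scale + 1 opacity + 3 color = 14."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 14, 3, padding=1),
        )

    def forward(self, clean_texture):                         # (1, 3, H, W)
        raw = self.net(clean_texture)                         # (1, 14, H, W)
        offset, rot, scale, opacity, color = torch.split(raw, [3, 4, 3, 1, 3], dim=1)
        return {
            "offset":   offset,                   # displacement from the texel's surface point
            "rotation": F.normalize(rot, dim=1),  # unit quaternion per texel
            "scale":    torch.exp(scale),         # positive, anisotropic scales
            "opacity":  torch.sigmoid(opacity),
            "color":    torch.sigmoid(color),
        }
```

Each texel's Gaussian is anchored to the corresponding point on the deformed mesh, which is why fine structures like hair and wrinkles can be represented even though the underlying mesh is relatively coarse.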

Refinement: Gaussian Scale Refinement
One issue with deforming meshes is that polygons can get stretched (like the triangles on a bent elbow). This can confuse the Gaussian sizing. The authors introduced a refinement step where they check how much the mesh stretched and adjust the scale of the 3D Gaussians accordingly.

This simple geometric check ensures that the rendering doesn’t break down during fast motions or extreme poses.
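The sketch below illustrates one simple way to express such a check, using the square root of each triangle's area ratio (deformed vs. canonical) as the stretch factor. The paper's exact formulation may differ; this is only a stand-in to show the mechanism.

```python
import torch

def face_stretch(verts_canonical, verts_deformed, faces):
    """Per-face isotropic stretch factor. verts_*: (V, 3), faces: (F, 3) long tensor."""
    def face_areas(verts):
        a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        return 0.5 * torch.linalg.cross(b - a, c - a, dim=-1).norm(dim=-1)
    ratio = face_areas(verts_deformed) / face_areas(verts_canonical).clamp(min=1e-8)
    return ratio.sqrt()                  # area grows quadratically with edge length

def refine_gaussian_scales(scales, face_ids, verts_canonical, verts_deformed, faces):
    """scales: (N, 3) per-Gaussian scales, face_ids: (N,) triangle each Gaussian sits on."""
    stretch = face_stretch(verts_canonical, verts_deformed, faces)       # (F,)
    return scales * stretch[face_ids, None]  # grow Gaussians where the mesh stretched
```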
Experiments and Results
The results of this two-stage approach are impressive. The researchers tested DUT against state-of-the-art methods like HoloChar, DVA, and ENeRF.
Visual Quality
The visual improvement is stark. DUT captures facial expressions, finger positions, and clothing wrinkles at 4K resolution.

In the comparison below (Figure 5), look at the sharpness of the face and the “completeness” of the limbs compared to other methods like HoloChar or DVA, which often produce blurry or fragmented results.

The “Standing Long Jump” Test (Robustness)
A major weakness of previous systems is “Out-of-Distribution” (OOD) motion. If a system is trained on walking and waving, and the actor suddenly performs a standing long jump, most systems break because they rely too heavily on the skeleton prior.
Because DUT uses the live image data to calculate geometry (via the first unprojection), it adapts much better to unseen motions.

The charts above show that DUT maintains high PSNR (image quality) and Structural Similarity (SSIM) even during complex, unseen motions where competitor methods fail.
Quantitative Data
For the data-minded, Table 1 details the performance metrics. DUT outperforms competitors consistently across different subjects (S3, S22, S2618).

Note: LPIPS measures perceptual similarity (lower is better), while PSNR measures signal-to-noise ratio (higher is better).
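For reference, PSNR can be computed in a few lines (images assumed to share the [0, 1] range), whereas LPIPS compares deep features from a pretrained network (e.g. via the `lpips` package) rather than raw pixels:

```python
import torch

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means the render is closer to ground truth."""
    mse = torch.mean((rendered - ground_truth) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```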
Speed (The “Real-Time” Promise)
Perhaps the most critical achievement is speed. High-quality rendering is useless for telepresence if it runs at 5 frames per second.

As shown in Table 4, DUT runs at over 42 FPS on a single Nvidia RTX 3090 GPU, and over 53 FPS on an H100. This is possible because the heavy lifting is done by efficient 2D CNNs operating in texture space, rather than costly 3D volumetric convolutions.
Conclusion
The “Double Unprojected Textures” paper offers a clever engineering solution to the problem of sparse-view rendering. By explicitly separating the problem into Geometry Correction and Appearance Synthesis, and linking them via two passes of unprojection, the authors manage to get the best of both worlds: the stability of template-based models and the flexibility of image-based rendering.
Key Takeaways:
- Decoupling is Key: Separating geometry and appearance prevents them from negatively influencing each other.
- Use the Image Twice: The first pass tells you where the geometry is wrong; the second pass gives you the clean texture to paint with.
- 2D is Efficient: Mapping 3D problems into 2D texture space allows for real-time performance on consumer hardware.
This research brings us one step closer to a future where our video calls feel less like watching a screen and more like sharing a room.
This blog post explains the research paper “Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures” by Guoxing Sun et al. of the Max Planck Institute for Informatics.