Introduction
One of the most persistent challenges in 3D computer vision is the assumption of a static world. Traditional 3D reconstruction techniques, such as photogrammetry or Neural Radiance Fields (NeRFs), generally assume that while the camera moves, the scene itself remains frozen.
But in the real world, this is rarely true. If you are scraping a collection of photos of a famous landmark from the internet, those photos were taken at different times of day, under different weather conditions, and with different cameras. Even in a casual capture session where you walk around an object, the sun might go behind a cloud, or your own shadow might fall across the subject.
When a standard NeRF tries to reconstruct a 3D object from these inconsistent images, it gets confused. Is that dark patch a black painted surface, or is it just a shadow? Is that white spot a sticker, or a specular highlight from the sun? The result is often a “foggy” or blurry reconstruction where the model averages out these contradictions.
In a recent paper, researchers from Google and the University of Maryland propose a novel solution to this problem. Their method, Generative Multiview Relighting, effectively “harmonizes” the lighting across all input photos before attempting to build the 3D model.

As shown in Figure 1 above, this approach allows for the recovery of high-fidelity details—specifically “shiny” specular highlights—that prior state-of-the-art methods fail to capture.
The Core Problem: Illumination Ambiguity
To understand why this new method is significant, we first need to look at how previous approaches handled variable lighting.
The “Latent Code” Approach
Early solutions like NeRF-W (NeRF in the Wild) introduced the idea of a per-image “appearance embedding” or latent code. This is essentially a vector of numbers unique to each photo that tells the network, “This image is a bit darker,” or “This image is warmer.”
While this works well for changing the overall mood or exposure of a scene, it often fails for complex materials. The model tends to “explain away” view-dependent effects. For example, if a metallic object gleams as you move around it, a standard model with appearance embeddings might incorrectly learn that the gleam is actually a white patch of paint that only exists in that one photo. The resulting 3D model looks diffuse (matte) and lacks realistic glossiness.
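To make the idea concrete, here is a minimal PyTorch sketch of how a per-image appearance embedding is typically injected into a NeRF's color branch. The class name and dimensions are illustrative, not the exact NeRF-W implementation.

```python
import torch
import torch.nn as nn

class AppearanceConditionedColorHead(nn.Module):
    """Illustrative NeRF-W-style color head: a learnable per-image code
    modulates the predicted RGB so each photo can have its own exposure/tint."""

    def __init__(self, feat_dim=128, embed_dim=32, num_images=100):
        super().__init__()
        # One appearance code per training image, optimized jointly with the NeRF.
        self.appearance_codes = nn.Embedding(num_images, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3 + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, point_features, view_dirs, image_ids):
        codes = self.appearance_codes(image_ids)               # (B, embed_dim)
        x = torch.cat([point_features, view_dirs, codes], dim=-1)
        return self.mlp(x)                                     # (B, 3) radiance
```

Because the per-image code feeds the color head directly, the cheapest way for the network to explain a moving highlight is to treat it as per-image "paint," which is exactly the baked-in, matte failure mode described above.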
The “Inverse Rendering” Approach
Another strategy is inverse rendering, which attempts to physically decompose an image into its constituent parts: geometry, material properties (albedo, roughness), and lighting. While physically grounded, this problem is mathematically “ill-posed.” There are infinite combinations of lights and materials that could produce the same pixel color. Without strong priors, these models often fail to separate the lighting from the texture.
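To see why, recall that every observed pixel is (approximately) the result of the rendering equation, which integrates incoming light against the surface's material response:

\[
L_o(\mathbf{x}, \boldsymbol{\omega}_o) = \int_{\Omega} f_r(\mathbf{x}, \boldsymbol{\omega}_i, \boldsymbol{\omega}_o)\, L_i(\mathbf{x}, \boldsymbol{\omega}_i)\, (\mathbf{n} \cdot \boldsymbol{\omega}_i)\, d\boldsymbol{\omega}_i
\]

Given only the left-hand side (the pixel value \(L_o\)), a brighter material \(f_r\) under dimmer lighting \(L_i\) can produce exactly the same observation as a darker material under stronger lighting, which is why strong priors are needed to pull the two apart.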
The “Generative” Approach
More recently, researchers have turned to diffusion models (the technology behind DALL-E and Midjourney) to “relight” images. If you can use AI to edit all your input photos so they look like they were taken under the exact same lighting, the 3D reconstruction problem becomes easy again.
However, applying a diffusion model to images one by one creates a new problem: consistency. If you independently relight a photo of a car from the front and a photo from the side, the diffusion model might “hallucinate” slightly different lighting directions or reflection patterns for each. When you feed these inconsistent images into a 3D pipeline, the reconstruction fails.
The Solution: A Two-Stage Pipeline
The authors propose a robust two-stage pipeline that tackles these issues head-on.
- Multiview Relighting: Use a diffusion model that looks at all images simultaneously to harmonize the lighting.
- Robust 3D Reconstruction: Use a modified NeRF architecture that can handle the small imperfections that remain after relighting.

Figure 2 provides a high-level overview. The system takes \(N\) images with varying illumination. It selects one image as the “reference.” It then processes the other images to match the lighting conditions of that reference. Finally, these “harmonized” images are fed into a Neural Radiance Field to build the 3D asset.
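As a rough outline, the whole pipeline can be summarized in a few lines of Python. The two callables here stand in for the multiview relighting model and the NeRF trainer; they are placeholders, not the authors' actual API.

```python
def reconstruct_from_wild_photos(images, poses, relight_multiview, train_nerf,
                                 reference_idx=0):
    """Sketch of the two-stage pipeline (illustrative, not the official code).

    relight_multiview: joint diffusion-based relighting of all views at once.
    train_nerf: robust NeRF training with per-image shading embeddings.
    """
    reference = images[reference_idx]

    # Stage 1: harmonize illumination. Every view is relit to match the
    # reference image's lighting; attention is shared across all N views so
    # the generated lighting stays consistent from view to view.
    relit_images = relight_multiview(images, poses, target=reference)

    # Stage 2: reconstruct. The relit set is nearly consistent, and per-image
    # shading embeddings absorb the small residual errors during training.
    return train_nerf(relit_images, poses)
```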
Stage 1: The Multiview Relighting Model
The first innovation is in how the images are relit. Instead of processing each image in isolation, the authors use a multiview diffusion model.
The model takes in a set of noisy latents, one per input image. Crucially, it uses 3D self-attention mechanisms. This means that when the model is denoising Image A, it is “aware” of the content and geometry in Image B and Image C.
By processing the images jointly, the model creates a unified understanding of the object’s shape and the target illumination. This drastically reduces the “hallucinations” common in single-image relighting.
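A minimal sketch of what this joint attention might look like inside the denoiser is shown below, assuming image tokens arranged as (batch, views, tokens_per_view, channels). The real architecture is more involved, but the key move, flattening all views into a single attention sequence, is the same.

```python
import torch
import torch.nn as nn

class MultiviewSelfAttention(nn.Module):
    """Illustrative multiview self-attention block: tokens from all N views
    are flattened into one sequence, so denoising view A can attend to the
    content of views B, C, and so on."""

    def __init__(self, channels=320, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, num_views, tokens_per_view, channels)
        b, n, t, c = tokens.shape
        x = tokens.reshape(b, n * t, c)        # one joint sequence over all views
        h = self.norm(x)
        h, _ = self.attn(h, h, h)              # every view attends to every view
        return (x + h).reshape(b, n, t, c)     # residual, back to per-view layout
```

If the views were kept as separate sequences instead, the same block would reduce to independent single-image attention, which is precisely the setting that produces inconsistent relighting.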

Figure 3 illustrates the power of this multiview approach. In column (c), a single-image relighting method (IllumiNeRF) struggles; it produces a plausible image, but the lighting cues don’t match the geometry perfectly. In column (d), the proposed multiview method produces a result almost identical to the ground truth, because it leverages information from all available angles to resolve ambiguities.
Stage 2: 3D Reconstruction with Shading Embeddings
Once the images are relit, they are technically “consistent.” However, diffusion models are not perfect physics simulators. They might generate the correct lighting style, but the exact placement of a specular highlight (the shiny reflection of a light source) might be off by a few pixels compared to where it should be geometrically.
If you train a standard NeRF on these slightly imperfect images, the model gets confused by the “wobbling” reflections and produces blurry artifacts.
To solve this, the authors introduce a clever modification to the NeRF architecture: Shading Embeddings.
The Concept
In traditional NeRF-W, an appearance embedding modifies the color output. The authors argue this is the wrong approach for this problem. Instead, they use a per-image vector to perturb the surface normals.
In 3D graphics, the “normal” is a vector perpendicular to the surface that determines how light bounces off it. By allowing the network to slightly tweak the surface normal for each individual image, the model can compensate for the small geometric errors introduced by the diffusion relighting process.
The equation for this operation is:

\[
\mathbf{n}_i(\mathbf{x}) = \mathrm{MLP}\big(\mathbf{f}(\mathbf{x}), \mathbf{v}_i\big)
\]

Here, \(\mathbf{n}_i(\mathbf{x})\) is the normal for image \(i\) at position \(\mathbf{x}\). It is derived from the base geometry features \(\mathbf{f}(\mathbf{x})\) combined with a learnable per-image vector \(\mathbf{v}_i\).
This allows the 3D model to say, “I know the reflection in this specific photo is slightly to the left of where it should be. I will tilt the surface normal just for this frame to catch the reflection correctly, without altering the actual 3D shape or color of the object.”
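A minimal sketch of that idea in PyTorch is shown below, assuming the per-image code \(\mathbf{v}_i\) is simply concatenated with the geometry features and decoded into a unit normal; the authors' exact parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShadingEmbeddingNormals(nn.Module):
    """Illustrative per-image normal perturbation: a learnable code v_i lets
    each training image tilt the shading normal slightly, rather than
    changing the color or the shared geometry."""

    def __init__(self, feat_dim=128, embed_dim=32, num_images=100):
        super().__init__()
        self.shading_codes = nn.Embedding(num_images, embed_dim)
        self.normal_mlp = nn.Sequential(
            nn.Linear(feat_dim + embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 3),
        )

    def forward(self, geometry_features, image_ids):
        v_i = self.shading_codes(image_ids)                   # (B, embed_dim)
        raw = self.normal_mlp(torch.cat([geometry_features, v_i], dim=-1))
        return F.normalize(raw, dim=-1)                       # unit normal n_i(x)
```

Note that only the normal used for shading is perturbed per image; the density field, and therefore the reconstructed shape, stays shared across all views.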

Figure S2 demonstrates why this is necessary. These spherical maps show the lighting environment extracted from relit images. While the content is consistent (the house and trees are there), the geometry is slightly warped between samples. The Shading Embeddings absorb this warping so the 3D geometry doesn’t have to.
Experimental Results
The researchers validated their method on both synthetic datasets (Objaverse) and real-world captures (NAVI).
Synthetic Benchmarks
The Objaverse dataset provided a controlled environment to measure accuracy. The authors tested on standard objects and a specific “Shiny” subset to push the limits of view-dependent rendering.

In Figure 4, we see a visual comparison. Notice the apple (top row) and the fire extinguisher (middle row).
- NeRFCast + AE (c): Fails to capture the sharp reflections.
- NeROIC (d): Produces a very diffuse, matte look.
- IllumiNeRF (e): Results in blurry textures due to inconsistent relighting.
- Ours (f): Captures the sharp, realistic specular highlights that match the Ground Truth (g).
The quantitative data backs this up:

As seen in Table 1, the proposed method achieves significantly higher PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) scores compared to baselines, particularly on the difficult “Shiny Assets.”
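For readers unfamiliar with the metric, PSNR is a simple function of the mean squared error between the rendered image and the ground truth (higher is better); a minimal implementation looks like this:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak Signal-to-Noise Ratio in decibels; higher means a closer match."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```

SSIM instead compares local structure (luminance, contrast, and patch correlation), which is why the two metrics are usually reported together.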
Real-World Performance
The method also excels with real-world photographs, where camera poses might be imperfect and lighting is uncontrolled.

Figure 5 shows results from the NAVI dataset. The method successfully preserves complex shadows (like the bunny’s ear shadow) and distinct reflections (like the glossy car hood).

Table 2 confirms that this visual quality translates to better numerical performance on real-world data as well.
Why It Works: Ablation Studies
The researchers performed ablation studies to verify which parts of their pipeline were doing the heavy lifting.
The Importance of Multiview Context
Does the model really need to see 64 frames at once? Yes. The breakdown in Table 3 shows that as the number of simultaneously processed frames (\(N\)) increases, the quality of the reconstruction improves.

Going from 1 frame (standard single-image processing) to 64 frames provides a massive boost in PSNR, proving that attention across views is vital for consistency.
Shading vs. Appearance Embeddings
Is the “Shading Embedding” (normal perturbation) actually better than the standard “Appearance Embedding” (color modification)?

Table 5 provides the answer. Using standard appearance embeddings actually performed worse than using no embeddings at all in some metrics. This is likely because appearance embeddings encourage the model to “cheat” by baking reflections into the texture. The shading embeddings, however, provide the flexibility needed to align the specular highlights without sacrificing geometric integrity.
Conclusion
This research represents a significant step forward in bringing 3D reconstruction out of the lab and into the wild. By combining the creative power of generative diffusion models with the geometric rigor of neural radiance fields, the authors have found a way to utilize inconsistent data that was previously discarded or mishandled.
The key takeaways from this work are:
- Don’t process images alone: When relighting for 3D, context is everything. Multiview attention ensures that the “hallucinated” lighting is consistent across all angles.
- Generative AI isn’t perfect: Even the best diffusion models introduce small geometric errors.
- Adapt the geometry, not the color: When dealing with these small errors, it is better to tweak surface normals (Shading Embeddings) than to blend colors (Appearance Embeddings), especially for shiny objects.
This method opens the door for high-quality 3D asset generation from casual photo collections, enabling more realistic digital twins and virtual experiences from everyday photography.