Introduction
In the rapidly evolving world of Generative AI, creating a 3D object from a single 2D image is something of a “Holy Grail.” We have seen tremendous progress with models that can turn a picture of a cat into a 3D mesh in seconds. However, if you look closely at the results of most current state-of-the-art models, you will notice a flaw: they look great from the original camera angle, but they often fail to react realistically to light.
This is because most models “bake” the lighting into the texture. If the input image shows a shadow on the left side of the object, the 3D model paints that shadow permanently onto the surface. If you try to use that asset in a game or a movie and place a light source on the left, the shadow remains, breaking the illusion of realism. Furthermore, the textures often suffer from blurriness, lacking the crisp details found in the input image.
Enter ARM (Appearance Reconstruction Model), a new framework presented by researchers from the University of Utah, Zhejiang University, UCLA, and Amazon. ARM proposes a fundamental shift in how we generate 3D assets. Instead of trying to do everything at once, it intelligently decouples geometry from appearance and moves the texturing process into the UV space—the same workflow used by professional 3D artists.

As shown in Figure 1 above, ARM is capable of reconstructing diverse objects—from gaming peripherals to fantasy armor—with sharp textures and, crucially, physically based materials that react to light naturally. In this post, we will tear down the architecture of ARM to understand how it achieves this leap in fidelity.
The Problem with Current 3D Generation
To appreciate ARM, we first need to understand the limitations of the current landscape.
The Trade-off: Speed vs. Quality
Broadly, there are two ways to generate 3D from 2D:
- Optimization-based methods (e.g., DreamFusion): These “distill” 3D shapes from 2D diffusion models. They produce good results but are incredibly slow, taking hours per object.
- Feed-forward models (e.g., LRM, LGM): These train a massive neural network to output a 3D representation directly. They are fast (seconds) but often produce blurry textures.
The “Baked-in” Lighting Issue
Most feed-forward models predict per-vertex colors: they look at the input image and simply project its colors onto the 3D shape. In effect, the object is treated as a self-illuminated surface rather than a material that responds to light. These models ignore Physically-Based Rendering (PBR) properties, such as:
- Albedo: The base color of the object without any shadows or highlights.
- Roughness: How microscopic irregularities on the surface scatter light (e.g., matte rubber vs. polished chrome).
- Metalness: Whether the material behaves like a metal (a conductor that tints its reflections with its own color) or a dielectric (plastic, wood).
Without these properties, you cannot “relight” an object. This is where ARM changes the game.
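To make the role of these three maps concrete, here is a toy single-light shading function in the metallic workflow. It is a deliberately crude stand-in for the Cook-Torrance-style models real engines use (and has nothing to do with ARM's own renderer), but it shows why albedo, roughness, and metalness must be stored separately if an asset is to be relit:

```python
import numpy as np

def shade(albedo, roughness, metalness, n, l, v, light_color):
    """Toy single-light PBR-style shading (illustration only).

    albedo, light_color: RGB arrays in [0, 1]
    roughness, metalness: scalars in [0, 1]
    n, l, v: unit-length normal, light, and view directions
    """
    h = (l + v) / np.linalg.norm(l + v)                 # half vector
    n_dot_l = max(np.dot(n, l), 0.0)
    # Metals tint reflections with their albedo; dielectrics reflect ~4% white.
    f0 = metalness * albedo + (1 - metalness) * 0.04
    # Rougher surfaces spread the highlight out (crude Blinn-Phong stand-in).
    shininess = 2.0 / max(roughness ** 2, 1e-4)
    specular = f0 * max(np.dot(n, h), 0.0) ** shininess
    diffuse = (1 - metalness) * albedo / np.pi          # metals have no diffuse term
    return light_color * n_dot_l * (diffuse + specular)

n = np.array([0.0, 0.0, 1.0]); l = v = n
red, white = np.array([0.8, 0.1, 0.1]), np.ones(3)
print(shade(red, 0.3, 0.0, n, l, v, white))  # red plastic: red base + whitish highlight
print(shade(red, 0.3, 1.0, n, l, v, white))  # red metal: highlight tinted red, no diffuse
```

Change any of the three inputs and the same geometry lights differently, which is exactly the flexibility a baked texture throws away.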
The ARM Framework: A Decoupled Approach
The core philosophy of ARM is decoupling. Trying to teach a single neural network to understand shape, color, lighting, and material reflectivity all at once creates a bottleneck. Instead, ARM breaks the pipeline into specialized stages.

As illustrated in Figure 2, the pipeline is split into two main phases:
- Geometry Stage (GeoRM): Focuses exclusively on building the 3D shape.
- Appearance Stage (GlossyRM & InstantAlbedo): Focuses exclusively on how the surface looks and interacts with light.
Let’s break down these components.
1. GeoRM: Building the Shape
The first step uses GeoRM, which stands for Geometry Reconstruction Model. It is built on the Large Reconstruction Model (LRM) architecture using Triplanes.
Think of a “Triplane” as three orthogonal 2D feature maps (xy, xz, yz) that together represent a 3D volume. A neural network predicts these planes from the input images. To recover the 3D shape, the model takes any point in 3D space, projects it onto each of the three planes, samples and combines the resulting features, and decodes a density value.
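A minimal sketch of that lookup in PyTorch might look like this; the plane resolution, channel count, and the small decoder MLP below are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, points, mlp):
    """Sample triplane features at 3D points and decode them.

    planes: dict with keys "xy", "xz", "yz", each a (1, C, H, W) feature map
    points: (N, 3) coordinates normalized to [-1, 1]
    mlp:    small decoder mapping combined features (N, C) to an output (N, D)
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Project each 3D point onto the three orthogonal planes.
    coords = {"xy": torch.stack([x, y], -1),
              "xz": torch.stack([x, z], -1),
              "yz": torch.stack([y, z], -1)}
    feats = 0.0
    for key, uv in coords.items():
        grid = uv.view(1, -1, 1, 2)                               # (1, N, 1, 2)
        sampled = F.grid_sample(planes[key], grid,
                                mode="bilinear", align_corners=True)
        feats = feats + sampled.view(planes[key].shape[1], -1).T  # (N, C)
    return mlp(feats)

# Example: decode a density value for 1,000 random query points.
planes = {k: torch.randn(1, 32, 64, 64) for k in ("xy", "xz", "yz")}
decoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 1))
density = query_triplane(planes, torch.rand(1000, 3) * 2 - 1, decoder)  # (1000, 1)
```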
GeoRM is trained solely to predict density. It doesn’t care about color. By focusing only on geometry, it produces a clean, high-resolution mesh using an algorithm called Differentiable Marching Cubes (DiffMC).
2. GlossyRM: Defining Reflectivity
Once the mesh is created, the appearance stage begins. The researchers discovered that predicting all material properties in one go reduced quality. So, they created GlossyRM to handle the “glossy” components: Roughness and Metalness.
GlossyRM uses a similar architecture to GeoRM (Triplanes) but is trained to predict material properties per vertex. It takes the mesh generated by GeoRM and “paints” the roughness and metalness values onto the vertices.
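Reusing the `query_triplane` sketch from the GeoRM section, per-vertex material painting could look roughly like this (the sigmoid squashing and two-channel decoder are assumptions for illustration, not the paper's exact heads):

```python
import torch

def paint_glossy(vertices, glossy_planes, glossy_mlp):
    """Hypothetical GlossyRM-style per-vertex material prediction.

    vertices:      (V, 3) GeoRM mesh vertices, normalized to [-1, 1]
    glossy_planes: triplane feature maps predicted from the input views
    glossy_mlp:    small decoder mapping features to 2 channels
    Returns per-vertex roughness and metalness, each in [0, 1].
    """
    feats = query_triplane(glossy_planes, vertices, glossy_mlp)  # (V, 2)
    props = torch.sigmoid(feats)
    return props[:, 0], props[:, 1]  # roughness, metalness
```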
Why separate this? The researchers found that when they tried to predict albedo (color) and glossy materials together, the network struggled to output extreme values (like perfectly smooth or fully metallic surfaces), leading to washed-out, “average” looking materials.
Figure 8 shows that separating the tasks (Ours) produces distinct material properties, whereas a unified approach (InstantAlbedo only) results in grey, muddled predictions.
3. InstantAlbedo: The Texture Specialist
This is perhaps the most innovative part of ARM. The previous modules (GeoRM and GlossyRM) used Triplanes. While triplanes are great for coarse 3D structure, they are poorly suited to fine texture detail: much like a voxel grid, their effective resolution is fixed, so sharp texturing would require impractically large feature planes that consume far too much memory.
InstantAlbedo ditches triplanes entirely for the texturing phase. Instead, it works in UV Texture Space.
The UV Space Advantage
In 3D modeling, “UV unwrapping” is the process of peeling the skin of a 3D object and laying it flat on a 2D image (the texture map). This allows you to paint details at the pixel level of an image, rather than the vertex level of a mesh.
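As a concrete example, a GeoRM-style mesh could be unwrapped with the open-source xatlas library (a common choice for automatic UV atlasing; the post does not specify the exact tool used, and the file name below is hypothetical):

```python
import trimesh   # mesh I/O
import xatlas    # automatic UV unwrapping

mesh = trimesh.load("georm_output.obj")  # hypothetical mesh exported from GeoRM
# parametrize() cuts the surface into charts and packs them into a 2D atlas.
vmapping, indices, uvs = xatlas.parametrize(mesh.vertices, mesh.faces)
# vmapping: original-vertex index for each new vertex (seams duplicate vertices)
# indices:  (F, 3) triangle indices into the new vertex list
# uvs:      (V', 2) per-vertex coordinates in [0, 1] texture space
```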
InstantAlbedo performs the following steps (visualized in Figure 3 below):
- Unwrapping: It takes the mesh from GeoRM and unwraps it into UV atlas charts.
- Back-projection: It takes the input images and projects them directly onto the UV map. If a part of the object was visible in a photo, that pixel lands on the texture map (a simplified sketch of this step appears below).
- Inpainting: Since the input images (usually 6 views generated by a diffusion model) don’t cover every millimeter of the object, there will be holes in the texture. InstantAlbedo uses a U-Net combined with a Fast Fourier Convolution (FFC) Net to intelligently fill in these missing gaps.

This approach creates textures that are significantly sharper than what triplanes can produce because the resolution is determined by the 2D image size, not the 3D grid size.
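The back-projection step can be sketched as follows. Assume we already know, for every texel, which 3D surface point it covers (obtainable by rasterizing the mesh in UV space); the camera intrinsics K and pose R, t are illustrative inputs, and the visibility test is simplified (a real pipeline also resolves occlusion with a depth check):

```python
import torch
import torch.nn.functional as F

def backproject_view(texel_xyz, texel_valid, image, K, R, t):
    """Project one input view onto a UV atlas (simplified sketch).

    texel_xyz:   (H, W, 3) world position of the surface point under each texel
    texel_valid: (H, W) bool mask of texels actually covered by the mesh
    image:       (3, Hi, Wi) input view; K (3, 3), R (3, 3), t (3,) its camera
    Returns a partial color map (3, H, W) and the mask of texels this view sees.
    """
    cam = texel_xyz @ R.T + t                          # world -> camera space
    pix = cam @ K.T                                    # apply intrinsics
    uv = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)   # perspective divide
    Hi, Wi = image.shape[1:]
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[..., 0] / (Wi - 1) * 2 - 1,
                        uv[..., 1] / (Hi - 1) * 2 - 1], dim=-1)
    colors = F.grid_sample(image[None], grid[None], mode="bilinear",
                           align_corners=True)[0]      # (3, H, W)
    seen = texel_valid & (cam[..., 2] > 0) & (grid.abs() <= 1).all(dim=-1)
    return colors, seen
```

Running this for each of the (typically six) input views and merging the results leaves exactly the holes that the FFC-based inpainting network is trained to fill.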
Solving the Ambiguity: The Material Prior
There is a fundamental scientific problem in extracting materials from images: Ambiguity.
Imagine looking at a picture of a dark grey sphere.
- Is it a white ball sitting in a dark room?
- Or is it a black ball sitting in a bright room?
Mathematically, many combinations of Lighting + Albedo can result in the same pixel color. This is an “ill-posed problem.” Traditional “Inverse Rendering” tries to solve this mathematically but often fails with sparse data, resulting in lighting artifacts baked into the albedo.
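For a purely Lambertian (matte) surface the ambiguity is easy to state: the observed pixel color is the product of albedo and incoming irradiance, so any rescaling of one can be absorbed by the other:

$$
c = \rho \, E
\quad\Longrightarrow\quad
\left(\tfrac{\rho}{k},\ kE\right)\ \text{produces the same}\ c\ \text{for any}\ k > 0,
$$

where $\rho$ is the albedo and $E$ the irradiance. A single observation therefore cannot separate the two without extra assumptions, which is exactly what the material prior below supplies.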
ARM solves this by introducing a Material Prior.
The researchers utilize a pre-trained image encoder (based on DINO features) that acts as a “material expert.” This encoder looks at the semantic context of the image. It “knows” that a golden trophy should look yellow and metallic, or that a tire should look like dark, matte rubber.
By feeding these semantic features into InstantAlbedo, the model can make an educated guess to separate the actual color of the object from the shadows cast upon it.
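This post does not detail the exact conditioning mechanism, but one hypothetical way to inject such semantic features into the inpainting network is FiLM-style modulation; the class name, dimensions, and wiring below are invented for illustration:

```python
import torch
import torch.nn as nn

class MaterialPriorFiLM(nn.Module):
    """Hypothetical conditioning of a U-Net feature map on a frozen semantic
    embedding (e.g. a DINO feature of the input view). Illustration only;
    not ARM's exact architecture."""

    def __init__(self, feat_dim=384, channels=64):
        super().__init__()
        self.to_scale_shift = nn.Linear(feat_dim, channels * 2)

    def forward(self, activations, prior_feat):
        # activations: (B, C, H, W) intermediate map of the inpainting U-Net
        # prior_feat:  (B, feat_dim) embedding from the frozen "material expert"
        scale, shift = self.to_scale_shift(prior_feat).chunk(2, dim=-1)
        return activations * (1 + scale[..., None, None]) + shift[..., None, None]
```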

The ablation study in Figure 6 (above) demonstrates this critical contribution.
- Top row: Without the FFC-Net (the inpainting module), the unseen areas of the handbag are messy.
- Bottom row: Without the Material Prior, the mallet’s head retains dark shadows in the albedo map (baked lighting). With the prior (Ours), the model understands the wood is a solid color and the darkness was just a shadow, removing it cleanly.
Experimental Results
The ARM team trained their models on a subset of the Objaverse dataset using 8 NVIDIA H100 GPUs for about 5 days. But how does it stack up against the competition?
Qualitative Comparison
Visual inspection reveals a stark difference in sharpness. Because ARM operates in UV space for texturing, it avoids the blur associated with voxel/triplane methods like LGM or MeshFormer.

In Figure 4, look closely at the “Supreme” logo on the helmet or the text on the burger sign. In competing methods (InstantMesh, MeshFormer), these details are illegible blobs. In ARM, the text and intricate patterns are crisp.
The Relighting Test
The ultimate test of PBR reconstruction is changing the environment. If the lighting was “baked in,” the object would look wrong in a new environment.

Figure 5 compares ARM against SF3D, another recent method. Notice the “Diffuse” column. SF3D’s diffuse map for the trophy still has bright highlights and dark shadows painted on it. ARM’s diffuse map is flat and uniform—which is exactly what you want.
When placed in a new lighting environment (“Relit Image”), ARM’s trophy reflects the new environment accurately. SF3D’s trophy looks like it’s glowing with the old lighting, creating a confusing visual.
Quantitative Metrics
The team evaluated ARM on standard datasets (GSO and OmniObject3D). They measured geometry accuracy (Chamfer Distance) and appearance quality (PSNR/SSIM).
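For reference, the standard definitions of these metrics (up to minor normalization choices that vary between papers) are:

$$
\mathrm{CD}(X, Y) = \frac{1}{|X|}\sum_{x \in X}\min_{y \in Y}\lVert x - y\rVert_2^2
\;+\; \frac{1}{|Y|}\sum_{y \in Y}\min_{x \in X}\lVert x - y\rVert_2^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}},
$$

where $X$ and $Y$ are point sets sampled from the predicted and ground-truth surfaces, $\mathrm{MAX}$ is the maximum pixel value, and $\mathrm{MSE}$ is the mean squared error between rendered and ground-truth images. Lower Chamfer Distance and higher PSNR/SSIM are better.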

As shown in Table 1, ARM achieves state-of-the-art results across the board. It scores highest in PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), indicating that its rendered views match the ground truth far better than previous feed-forward models.
Conclusion
ARM represents a maturing of AI 3D generation. We are moving past the “wow factor” of simply getting a shape, and towards the practical requirements of production pipelines: fidelity, editability, and physical realism.
By strategically separating the reconstruction of shape (GeoRM), material properties (GlossyRM), and surface color (InstantAlbedo), ARM bypasses the bottlenecks of previous architectures. The shift to processing texture in UV space allows for high-frequency details that triplanes simply cannot capture efficiently. Furthermore, the use of a semantic material prior helps solve the age-old computer vision problem of distinguishing paint from shadow.
While challenges remain, notably the consistency of the multi-view images generated by the upstream diffusion models, ARM provides a robust framework for creating relightable, game-ready assets from a single image. For students and researchers in computer vision, ARM serves as a perfect example of how domain-specific knowledge (like UV mapping and PBR theory) can be combined with deep learning to tackle ill-posed problems such as lifting a single 2D image into a full 3D asset.