The demand for high-quality 3D assets is exploding. From the immersive worlds of video games and virtual reality to the practical applications of architectural visualization and film production, the need for detailed, realistic 3D models is higher than ever. Traditionally, creating these assets has been a labor-intensive bottleneck, requiring skilled artists to sculpt geometry, paint textures, and tune material properties manually.
In recent years, Generative AI has promised to automate this pipeline. We’ve seen models that can turn text into 3D shapes or turn a single image into a rotating mesh. However, a significant gap remains between what AI generates and what professional graphics engines actually need. Most current AI models produce “baked” assets—meshes with color painted directly onto the vertices. They often look like plastic toys or clay models, lacking the complex material properties (like how shiny metal is versus how matte rubber is) required for photorealistic rendering.
Enter 3DTopia-XL, a new framework presented at CVPR that aims to bridge this gap. This paper introduces a scalable, native 3D generative model capable of producing high-quality “Physically Based Rendering” (PBR) assets. By leveraging a novel data representation called PrimX and a powerful Diffusion Transformer (DiT), 3DTopia-XL isn’t just generating shapes; it’s generating materials, textures, and geometry in a unified, efficient pipeline.
In this post, we will deconstruct how 3DTopia-XL works, why its unique representation is a game-changer, and look at the impressive results it achieves.

The Problem with Current 3D Generation
To understand the innovation of 3DTopia-XL, we first need to look at the limitations of the current state-of-the-art. Existing methods generally fall into three buckets:
- Score Distillation Sampling (SDS): These methods (like DreamFusion) use 2D diffusion models (like Stable Diffusion) to “carve” a 3D shape by optimizing it until it looks right from every angle. While innovative, this process is slow, often results in “cartoony” geometry, and struggles with lighting artifacts.
- Sparse-view Reconstruction: Methods like LRM (Large Reconstruction Model) take an image and try to regress a 3D shape directly, often using a “Triplane” representation. While fast, Triplanes spend their parameter budget inefficiently: capacity is spread across dense 2D feature planes regardless of where the object’s surface actually is, so high-resolution details get lost. Furthermore, these models are deterministic: they reconstruct a single result rather than generating diverse variations.
- Native 3D Diffusion: These models train directly on 3D data. However, 3D data is hard to represent efficiently. Voxels (3D pixels) are memory hogs (cubic complexity). Point clouds lack surface connectivity.
Crucially, very few of these methods tackle PBR (Physically Based Rendering). A professional 3D asset isn’t just shape and color; it requires maps for Roughness (micro-surface details), Metallic (reflectivity), and Normal (surface bumps). Without these, an asset looks flat and unrealistic in a game engine.
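To make “Roughness” and “Metallic” concrete: modern engines shade surfaces with a microfacet model such as Cook-Torrance, the basis of glTF’s metallic-roughness workflow. A simplified form, given here as general rendering background rather than anything specific to the paper, is

\[
f(\mathbf{l}, \mathbf{v}) \;=\; (1 - m)\,\frac{\mathbf{c}_{\text{albedo}}}{\pi} \;+\; \frac{D(\mathbf{h}, r)\, F(\mathbf{v}, \mathbf{h}, m)\, G(\mathbf{l}, \mathbf{v}, r)}{4\,(\mathbf{n}\cdot\mathbf{l})\,(\mathbf{n}\cdot\mathbf{v})}
\]

where the roughness value \(r\) widens or tightens the specular highlight through the distribution term \(D\) and the shadowing term \(G\), and the metallic value \(m\) shifts energy from the diffuse term toward the Fresnel reflectance \(F\). Without per-texel maps for \(r\) and \(m\), every surface of an asset ends up shaded like the same uniform plastic.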
The Core Innovation: PrimX
The researchers realized that to scale 3D generation, they needed a better way to represent 3D data—something efficient, expressive, and compatible with deep learning. Their solution is PrimX.
What is a Primitive?
In standard voxel approaches, you divide 3D space into a massive grid. Most of that grid is empty air, which is a waste of computation. In PrimX, the researchers instead represent an object as a set of \(N\) primitives.
Think of each primitive as a tiny local voxel grid, a small building block. Instead of a fixed global grid, these primitives are anchored directly to the surface of the mesh.

As shown in Figure 2 above, the process works like this:
- Input: A textured mesh (shape + albedo + material).
- Rapid Tensorization: The mesh is converted into \(N\) primitives (represented as colorful cubes in the diagram).
- Primitive Payload: Each primitive \(\mathcal{V}_k\) contains rich information packed into a tensor:
  - Position (\(\mathbf{t}_k\)): Where is it in 3D space?
  - Scale (\(s_k\)): How big is this block?
  - Features (\(\mathbf{X}_k\)): A payload grid containing the SDF (Signed Distance Function, for shape), RGB (color), and Material (roughness/metallic) values.
By summing up these primitives, the model can reconstruct the full 3D object. This representation is sparse (it only exists where the object exists) and tensorial (it can be processed easily by neural networks).
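To make the layout concrete, here is a minimal sketch (in PyTorch) of how such a primitive set could be stored and queried. The tensor shapes, the distance-based blending weights, and all names here are illustrative assumptions for exposition, not the paper’s actual implementation.

```python
import torch
import torch.nn.functional as F

class PrimXSketch:
    """Illustrative container for N surface-anchored primitives.

    Each primitive k carries a position t_k (3,), a scalar scale s_k,
    and a local payload grid X_k of shape (C, a, a, a) whose channels
    hold SDF (1) + RGB (3) + material (2: roughness, metallic).
    """

    def __init__(self, positions, scales, payloads):
        self.positions = positions   # (N, 3)
        self.scales = scales         # (N, 1)
        self.payloads = payloads     # (N, C, a, a, a), C = 6

    def query(self, points):
        """Aggregate the primitive payloads at query points of shape (P, 3)."""
        N, P = self.positions.shape[0], points.shape[0]

        # Express every query point in every primitive's local frame, roughly in [-1, 1].
        local = (points[None] - self.positions[:, None]) / self.scales[:, None]  # (N, P, 3)

        # Trilinearly sample each primitive's payload grid at those local coordinates.
        grid = local.view(N, 1, 1, P, 3)                                  # grid_sample expects (N, D, H, W, 3)
        sampled = F.grid_sample(self.payloads, grid, align_corners=True)  # (N, C, 1, 1, P)
        sampled = sampled.squeeze(2).squeeze(2)                           # (N, C, P)

        # Blend contributions: a point only "sees" primitives it falls inside
        # (a stand-in for the paper's exact aggregation rule).
        dist = local.norm(dim=-1)                                         # (N, P)
        weights = torch.relu(1.0 - dist)
        weights = weights / weights.sum(dim=0, keepdim=True).clamp(min=1e-8)

        return (weights.unsqueeze(1) * sampled).sum(dim=0).T              # (P, C): SDF, RGB, material
```

Flattened, each primitive becomes one row of an \(N \times D\) tensor (position, scale, payload), which is exactly the kind of shape the later stages of the pipeline consume.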
Why PrimX is Superior
The researchers compared PrimX against other popular representations like Triplanes, Dense Voxels, and MLPs. The results, visualized in Figure 4 below, are striking.

Notice how the Triplane method (middle) results in a blurry face and blocky artifacts. The MLP methods struggle with high-frequency details. PrimX (green box), however, captures the sharp contours of the monster’s face and the specific texture details, closely matching the Ground Truth.
Table 1 in the paper (referenced in the image above) highlights that PrimX achieves this quality while being 7x faster to fit than the next best method. It fits a high-quality asset in about 1.5 minutes.
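The “Rapid Tensorization” step boils down to an optimization: place primitives on the surface, then regress their payloads so the aggregated field matches the ground-truth asset. A toy version of that loop, reusing the PrimXSketch.query helper from the earlier sketch and assuming a sample_gt(points) function that returns target SDF/color/material values from the source mesh, might look like this (the sampling strategy and loss are simplifications, not the paper’s recipe).

```python
import torch

def fit_primx(primx, sample_gt, iters=500, lr=1e-2):
    """Toy fitting loop: make the aggregated PrimX field match a ground-truth asset.

    primx      -- a PrimXSketch whose payloads we optimize
    sample_gt  -- assumed helper: (P, 3) points -> (P, C) target SDF/RGB/material values
    """
    primx.payloads.requires_grad_(True)
    opt = torch.optim.Adam([primx.payloads], lr=lr)

    for _ in range(iters):
        pts = torch.rand(4096, 3) * 2.0 - 1.0                 # random points in [-1, 1]^3
        loss = torch.nn.functional.mse_loss(primx.query(pts), sample_gt(pts))
        opt.zero_grad()
        loss.backward()
        opt.step()

    return primx
```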
The Engine: 3DTopia-XL Framework
With PrimX providing a compact way to store 3D data, the researchers built a generative framework to create this data from scratch. The architecture, 3DTopia-XL, consists of two main stages: Primitive Patch Compression and Latent Primitive Diffusion.

1. Primitive Patch Compression (VAE)
Even though PrimX is efficient, generating raw high-resolution 3D data is computationally expensive. To solve this, the authors employ a 3D Variational Autoencoder (VAE).
Looking at the left side of Figure 3, the VAE takes the PrimX data (\(N \times D\) tensor) and compresses the local features of each primitive into a smaller, “latent” vector. This is similar to how Stable Diffusion compresses pixels into latents. This step significantly reduces the dimensionality of the data, making the training of the diffusion model feasible.
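As a rough illustration, such a per-primitive compressor could be a small 3D-convolutional VAE that maps each payload grid to a short latent vector and back. The class name, layer sizes, latent dimension, and grid resolution below are placeholder assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class PrimitiveVAE(nn.Module):
    """Compresses each primitive's payload grid (C, a, a, a) into a latent vector.

    Illustrative only: a tiny 3D-conv encoder/decoder standing in for the
    paper's primitive patch compression; channel counts are assumptions.
    """

    def __init__(self, in_channels=6, latent_dim=32, grid_size=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=2, padding=1),  # a -> a/2
            nn.SiLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),           # a/2 -> a/4
            nn.SiLU(),
            nn.Flatten(),
        )
        flat = 64 * (grid_size // 4) ** 3
        self.to_mu = nn.Linear(flat, latent_dim)
        self.to_logvar = nn.Linear(flat, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, flat),
            nn.Unflatten(1, (64, grid_size // 4, grid_size // 4, grid_size // 4)),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, in_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, payload):                                    # payload: (B*N, C, a, a, a)
        h = self.encoder(payload)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        return self.decoder(z), mu, logvar
```

Because the same encoder is applied to every primitive independently, the \(N \times D\) tensor shrinks to an \(N \times d\) tensor of latent tokens with \(d \ll D\).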
2. Latent Primitive Diffusion (DiT)
The heart of the generation process is the Latent Primitive Diffusion model (center of Figure 3). The authors chose a Diffusion Transformer (DiT) architecture rather than the standard U-Net used in image generation.
Why a Transformer?
- Set-based Data: PrimX is essentially a set of primitives. Transformers are excellent at processing sequences or sets of tokens.
- Scalability: Transformers scale remarkably well with more data and parameters.
The DiT treats each compressed primitive as a token. It uses self-attention to understand how different parts of the 3D object relate to each other (e.g., “if there is a leg here, there should be a body there”) and cross-attention to incorporate the Condition (your text prompt or input image).
The model learns to start with random noise and iteratively “denoise” it into a structured set of primitives representing a 3D object.
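At inference time, generation is a standard diffusion sampling loop over those latent tokens. The sketch below assumes a `denoiser(x_t, t, condition)` transformer that predicts the injected noise, and uses a toy DDIM-style schedule; neither the signature nor the schedule reflects the released code.

```python
import torch

@torch.no_grad()
def sample_primitives(denoiser, condition, num_primitives, latent_dim, steps=50):
    """Illustrative DDIM-style sampling over latent primitive tokens.

    denoiser(x_t, t, condition) is assumed to predict the noise added to the
    clean tokens; each of the N tokens is one compressed primitive.
    """
    x = torch.randn(1, num_primitives, latent_dim)           # start from pure noise
    alphas = torch.linspace(0.999, 0.001, steps)              # toy cumulative noise schedule (assumption)

    for i in range(steps):
        t_idx = steps - i - 1
        t = torch.full((1,), t_idx)
        eps = denoiser(x, t, condition)                       # predicted noise (self- and cross-attention inside)
        alpha = alphas[t_idx]
        x0_pred = (x - (1 - alpha).sqrt() * eps) / alpha.sqrt()          # estimate of the clean tokens
        if i < steps - 1:
            alpha_prev = alphas[t_idx - 1]
            x = alpha_prev.sqrt() * x0_pred + (1 - alpha_prev).sqrt() * eps  # deterministic DDIM step
        else:
            x = x0_pred
    return x                                                   # (1, N, latent_dim): decode with the VAE into PrimX
```

The output is a set of \(N\) latent tokens, which the VAE decoder expands back into PrimX primitives ready for mesh extraction.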
From Tensor to Game Asset: PBR Extraction
One of the most practical contributions of this paper is the pipeline for converting the generated PrimX tensor back into a usable 3D file (GLB).
Many AI models stop at “Vertex Coloring”, painting colors directly onto the mesh vertices. This can look acceptable in a simple viewer, but it breaks down in a game engine because texture detail is capped by how dense the geometry is.
3DTopia-XL takes a different approach:
- Geometry Extraction: It uses the “Marching Cubes” algorithm on the SDF field to extract a clean mesh.
- UV Unwrapping: It unwraps the mesh into UV space and allocates a high-resolution (1024x1024) texture.
- Texture Sampling: It samples the RGB and Material values from the PrimX field onto this UV map.
- Inpainting: It intelligently fills in gaps to prevent aliasing artifacts.
This results in a standard GLB file containing distinct textures for Albedo (color), Roughness, and Metallic properties.
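A minimal sketch of that export path using common open-source tools: scikit-image for Marching Cubes, xatlas for UV unwrapping, and trimesh for the GLB export. The `query_primx` helper (returning SDF, RGB, and material values at arbitrary points) and the per-vertex sampling shortcut are assumptions for brevity; the actual pipeline bakes 1024x1024 texture images and inpaints the UV seams.

```python
import numpy as np
import trimesh
import xatlas
from skimage import measure

def extract_glb(query_primx, resolution=128, out_path="asset.glb"):
    """Illustrative PrimX -> GLB export.

    query_primx -- assumed helper: (P, 3) points -> (P, C) array whose
                   columns are SDF, R, G, B, roughness, metallic.
    """
    # 1. Geometry: evaluate the SDF on a dense grid, then run Marching Cubes at the zero level set.
    lin = np.linspace(-1.0, 1.0, resolution)
    pts = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    sdf = query_primx(pts)[:, 0].reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)
    verts = verts / (resolution - 1) * 2.0 - 1.0          # back to [-1, 1] object space

    # 2. UV unwrapping: xatlas re-indexes the mesh and assigns a UV coordinate per vertex.
    #    (The uvs would index into the baked texture images in a full pipeline.)
    vmapping, faces, uvs = xatlas.parametrize(verts, faces)
    verts = verts[vmapping]

    # 3. Texture sampling: query albedo (and, in a full bake, roughness/metallic) from the
    #    PrimX field. A production pipeline would rasterize these values into 1024x1024
    #    UV-space images and inpaint the seams; sampling per vertex keeps this sketch short.
    fields = query_primx(verts)
    albedo = np.clip(fields[:, 1:4], 0.0, 1.0)

    # 4. Export as GLB (vertex colors here; a full PBR export would attach the baked
    #    textures as a glTF metallic-roughness material).
    mesh = trimesh.Trimesh(vertices=verts, faces=faces,
                           vertex_colors=(albedo * 255).astype(np.uint8))
    mesh.export(out_path)
    return out_path
```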
Experiments and Results
Does it work? The results suggest a resounding yes.
Image-to-3D Generation
When given a single image, 3DTopia-XL can generate a full 3D model that remains faithful to the input while hallucinating the unseen sides plausibly.

In Figure 5, compare the Ours column (3DTopia-XL) with competitors like LGM or CRM.
- Look at the orange elephant (middle row). Other methods produce a flat orange shape. 3DTopia-XL captures the bumpy texture of the orange’s peel.
- Look at the rendering. Because 3DTopia-XL generates roughness and metallic maps (shown on the far right), the elephant reflects light realistically. The other models look matte and fake.
Text-to-3D Generation
The model is also capable of generating assets from pure text descriptions.

In Figure 9, prompts like “A donut with pink icing” generate detailed geometry with distinct material properties: the icing looks glossy (low roughness), while the dough looks matte. The ability to separate these materials comes directly from the PBR-centric approach.
Generative Diversity
Unlike reconstruction models (like LGM), which try to find the “one true shape” for an image, 3DTopia-XL is a probabilistic generative model. This means that for a single input, it can generate multiple valid variations.

In Figure 14, given a picture of a unicorn (bottom row), the model generates several variations. They all look like the input image from the front, but the specific details of the geometry and texture vary slightly, giving artists choices.
Conclusion and Implications
3DTopia-XL represents a significant step forward in automated 3D content creation. By moving away from inefficient representations and embracing a primitive-based approach (PrimX), the researchers have enabled the generation of high-resolution, PBR-ready assets.
Key Takeaways:
- PrimX is powerful: It combines the benefits of voxels and explicit meshes, enabling efficient learning of Shape, Color, and Material.
- Physics matters: Integrating PBR properties (Roughness/Metallic) directly into the generation pipeline is essential for creating assets that look “real” in modern engines.
- Scalability: The use of Diffusion Transformers (DiT) proves that the “scaling laws” we see in text and image generation also apply to 3D, provided we have the right data representation.
For students and researchers in computer graphics, this work highlights the importance of data representation. The architecture (DiT) is standard, but the way the data is presented to the network (PrimX) unlocked the performance. As we move toward the Metaverse and increasingly complex digital twins, techniques like 3DTopia-XL will likely become the standard for populating virtual worlds.