Figure 1: High-quality 3D assets generated by TRELLIS in various formats from text or image prompts. Demonstrates versatile generation, vivid appearances with 3D Gaussians or Radiance Fields, detailed geometries with meshes, and flexible editing.
The world of AI-generated content has been dominated by stunning 2D imagery. Models like DALL-E and Midjourney can conjure photorealistic scenes and fantastical art from a simple text prompt.
But what about the third dimension?
While 3D generation has made impressive strides, it’s long felt a step behind its 2D counterparts. Why is that?
One of the biggest hurdles is the representation problem. Unlike 2D images, neatly stored as a pixel grid, 3D objects come in many forms: meshes for clean geometry, voxels for volumetrics, Radiance Fields (NeRFs) and 3D Gaussians for photorealistic rendering. Each format has distinct strengths and limitations, and most generative models commit to just one, constraining their versatility. A model that excels at NeRFs might struggle to produce a clean mesh ready for a game engine.
This fragmentation makes it tough to build a unified, all-purpose 3D generation system. What if there were a common language—a representation that could fluidly translate into any format?
That’s the challenge addressed by the paper “Structured 3D Latents for Scalable and Versatile 3D Generation”. The authors introduce a unified latent representation called Structured Latents (SLAT) and a family of models built on it, dubbed TRELLIS. This system generates highly detailed 3D assets from either text or images, and crucially, can output them in multiple formats—meshes, 3D Gaussians, or Radiance Fields—all from the same underlying data.
The 3D Representation Zoo: Why We Need a Unified Framework
Before appreciating SLAT’s novelty, let’s look at the representation landscape:
- Meshes: The staple for games, animation, and CAD. They define precise, crisp geometry through vertices, edges, and faces, and shine for structure, but generating detailed materials and textures for them is challenging.
- Radiance Fields (NeRFs): Continuous functions mapping 3D coordinates + view direction to color/density. Fantastic for photorealistic view synthesis, but difficult to extract clean, editable geometry from.
- 3D Gaussians: Represent scenes as clouds of “blobs” with color, opacity, and shape. Allow real-time, high-quality rendering but suffer from the same clean surface extraction challenges as NeRFs.
Because of these differences, methods specialize—mesh models excel at geometry but need additional texturing; NeRF or Gaussian models produce rich visuals but can’t yield clean meshes.
TRELLIS argues the answer isn’t perfecting one format, but creating a foundational representation that easily converts into any of them.
The Core Idea: Structured Latents (SLAT)
SLAT elegantly captures both geometry and appearance using two components:
- Sparse Structure: A 3D grid marking the active voxels \(p_i\) that intersect the object's surface. This scaffold outlines coarse geometry, and it stays efficient even at high resolution because most voxels are empty.
- Local Latents: For each active voxel, a high-dimensional feature vector \(z_i\) encodes fine geometric and textural detail for its local region.
Mathematically, a SLAT is the set of active voxel positions paired with their local latents:

Equation 1: SLAT

\(\boldsymbol{z} = \{(\boldsymbol{z}_i, \boldsymbol{p}_i)\}_{i=1}^{L}\)

Because most of the 3D grid is empty, the number of active voxels \(L\) is far smaller than \(N^3\). This sparse-yet-rich structure gives SLAT efficiency with high fidelity: the scaffold defines the form, while the latents, extracted with the help of a powerful vision model, provide the detail.
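To make this concrete, here is a minimal sketch of a SLAT container in PyTorch. The class name `SLat`, the \(64^3\) resolution, and the 8-dimensional latents are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the SLAT data structure (assumes PyTorch).
from dataclasses import dataclass
import torch

@dataclass
class SLat:
    positions: torch.Tensor  # (L, 3) integer indices of active voxels in [0, N)
    latents: torch.Tensor    # (L, C) local feature vectors z_i
    resolution: int          # grid size N

    def occupancy_grid(self) -> torch.Tensor:
        """Densify the sparse structure into an N^3 boolean grid."""
        grid = torch.zeros(3 * (self.resolution,), dtype=torch.bool)
        x, y, z = self.positions.unbind(dim=-1)
        grid[x, y, z] = True
        return grid

# Example: 1,200 active voxels on a 64^3 grid with 8-dim latents, far
# fewer than the 64^3 = 262,144 cells a dense grid would store.
slat = SLat(
    positions=torch.randint(0, 64, (1200, 3)),
    latents=torch.randn(1200, 8),
    resolution=64,
)
print(slat.occupancy_grid().sum())  # <= 1200 (random duplicates collapse)
```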
Learning and Generating with SLAT: The TRELLIS Pipeline
Figure 2: TRELLIS overview — Encoding & Decoding: SLAT encodes geometry and appearance via multiview features from DINOv2. Generation: Two rectified flow transformers generate SLAT in two steps — structure then latents.
1. Encoding 3D Assets into SLATs
To train its models, TRELLIS transforms existing 3D data into SLAT form:
- Render Multi-view Images: Hundreds of views per object.
- Extract Visual Features: Use a pretrained DINOv2 encoder for strong feature representation and 3D awareness.
- Aggregate per Voxel: Project each active voxel onto every rendered view's feature map and average the sampled features (sketched below).
- Sparse VAE Compression: Feed the aggregated features into a sparse Variational Autoencoder, compressing them into normalized local latents \(z_i\).
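Here is a hedged sketch of the per-voxel aggregation step, assuming a generic pinhole camera model. The projection details and function names are illustrative rather than the paper's exact code, and visibility/occlusion handling is omitted.

```python
# Aggregate multiview DINOv2 features into per-voxel vectors (assumes PyTorch).
import torch
import torch.nn.functional as F

def aggregate_voxel_features(
    voxel_centers: torch.Tensor,  # (L, 3) world-space voxel centers
    feature_maps: torch.Tensor,   # (V, C, H, W) DINOv2 features, one per view
    world_to_cam: torch.Tensor,   # (V, 4, 4) camera extrinsics
    intrinsics: torch.Tensor,     # (V, 3, 3) pinhole intrinsics
) -> torch.Tensor:                # (L, C) averaged feature per voxel
    V, C, H, W = feature_maps.shape
    L = voxel_centers.shape[0]
    homog = torch.cat([voxel_centers, torch.ones(L, 1)], dim=-1)  # (L, 4)
    feats = torch.zeros(L, C)
    for v in range(V):
        cam = (world_to_cam[v] @ homog.T).T[:, :3]       # camera-space points
        pix = (intrinsics[v] @ cam.T).T                  # projected pixels
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / W * 2 - 1, uv[:, 1] / H * 2 - 1], dim=-1)
        sampled = F.grid_sample(
            feature_maps[v : v + 1], grid.view(1, L, 1, 2), align_corners=False
        )                                                # (1, C, L, 1)
        feats += sampled.squeeze(0).squeeze(-1).T        # accumulate (L, C)
    return feats / V                                     # average over views
```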
2. Decoding SLATs into Multiple Formats
Once encoded, SLAT can be decoded into various standard 3D formats via dedicated decoders:
- 3D Gaussians: \(\mathcal{D}_{GS}\) outputs Gaussian properties (position offsets, scale, opacity, rotation, color).
- Radiance Fields: \(\mathcal{D}_{RF}\) produces CP-decomposed local volumes that assemble into a global radiance field.
- Meshes: \(\mathcal{D}_{M}\) maps each voxel to a local Signed Distance Field, from which a mesh is extracted via FlexiCubes.
All decoders share a transformer backbone optimized for sparse input. Only the final output layer adapts to the target format, confirming SLAT’s versatility.
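To illustrate the shared-backbone design, here is a toy decoder factory in PyTorch: every format gets the same backbone recipe, and only the output width changes. A dense `nn.TransformerEncoder` stands in for the paper's sparse transformer, and the per-voxel output sizes are illustrative guesses, not the paper's values.

```python
import torch
import torch.nn as nn

def make_decoder(out_dim: int, model_dim: int = 768) -> nn.Module:
    """Same backbone for every format; only the final head's out_dim differs."""
    return nn.Sequential(
        nn.Linear(8, model_dim),  # lift 8-dim SLAT latents to model width
        nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=model_dim, nhead=12, batch_first=True
            ),
            num_layers=12,
        ),
        nn.Linear(model_dim, out_dim),  # the only format-specific layer
    )

# Illustrative per-voxel output sizes (assumptions, not the paper's numbers):
gaussian_dec = make_decoder(out_dim=32 * 14)  # K Gaussians x properties each
rf_dec = make_decoder(out_dim=4 * 8 * 3)      # CP factors of a local volume
mesh_dec = make_decoder(out_dim=1 + 21)       # SDF value + FlexiCubes params

z = torch.randn(1, 1200, 8)   # L = 1200 active-voxel latents
print(gaussian_dec(z).shape)  # torch.Size([1, 1200, 448])
```

Only the head is format-specific, which is what lets one latent feed three output formats.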
Figure 3: Architectures — Sparse VAE (encoding/decoding), Flow Transformer (\(\mathcal{G}_S\)), and Sparse Flow Transformer (\(\mathcal{G}_L\)).
3. Generating New 3D Assets
TRELLIS uses a Rectified Flow generative approach, training models to steadily transform noise into SLATs. Generation mirrors SLAT’s structure:
- Stage 1: A Flow Transformer \(\mathcal{G}_{S}\) generates the sparse structure \(p_i\) from the text or image prompt.
- Stage 2: A Sparse Flow Transformer \(\mathcal{G}_{L}\) generates the local latents \(z_i\) that populate the structure with detail.
Equation 5: Conditional Flow Matching objective guides rectified flow training.
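In standard rectified-flow form, with linear interpolation \(\boldsymbol{x}_t = (1 - t)\,\boldsymbol{x}_0 + t\,\boldsymbol{\epsilon}\) between data \(\boldsymbol{x}_0\) and noise \(\boldsymbol{\epsilon}\), the model \(\boldsymbol{v}_\theta\) regresses the constant velocity pointing from data to noise; this should match the paper's Equation 5 up to notation:

\(\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,\boldsymbol{x}_0,\,\boldsymbol{\epsilon}}\,\big\| \boldsymbol{v}_\theta(\boldsymbol{x}_t, t) - (\boldsymbol{\epsilon} - \boldsymbol{x}_0) \big\|_2^2\)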
Once a SLAT is generated, it’s decoded into the desired format—high-resolution mesh, detailed Gaussian splat, or photorealistic radiance field.
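A minimal Euler sampler makes the two-stage procedure concrete. Here `g_s`, `g_l`, and `cond` are assumed interfaces rather than the released API, and the sketch omits the small VAE the paper uses to compress the dense structure grid before flow matching.

```python
# Two-stage SLAT generation with a plain Euler integrator (assumes PyTorch).
import torch

@torch.no_grad()
def sample_rectified_flow(model, x, cond, steps: int = 50) -> torch.Tensor:
    """Euler-integrate from t=1 (noise) back to t=0 (data) along v_theta."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(1.0 - i * dt)
        x = x - dt * model(x, t, cond)  # step against the learned velocity
    return x

def generate_slat(g_s, g_l, cond, N: int = 64, C: int = 8):
    # Stage 1: denoise a dense occupancy grid, then threshold to get p_i.
    occ = sample_rectified_flow(g_s, torch.randn(1, 1, N, N, N), cond)
    positions = (occ[0, 0] > 0).nonzero()  # (L, 3) active voxel indices
    # Stage 2: denoise per-voxel latents z_i on that sparse structure.
    z = sample_rectified_flow(g_l, torch.randn(len(positions), C), cond)
    return positions, z
```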
Results: How TRELLIS Performs
TRELLIS was trained on 500K curated 3D assets, captioned with GPT-4o for text conditioning, and its models scale up to 2B parameters.
Reconstruction Fidelity
Figure 4: TRELLIS produces vivid textures, accurate geometry, and coherent details from diverse prompts.
Table 1: SLAT achieves state-of-the-art reconstruction in both appearance (PSNR↑, LPIPS↓) and geometry (CD↓, F-score↑), outperforming alternate latent representations.
Generation Quality vs. State-of-the-Art
Figure 5: Qualitative comparison — TRELLIS shows sharper, more coherent geometry and vivid textures. Other methods suffer from distortions or bland detail.
Table 2: Quantitative performance on Toys4k — TRELLIS leads in CLIP (prompt alignment), FD, and KD metrics for both text-to-3D and image-to-3D tasks.
User Study
Figure 6: User preference — TRELLIS was chosen by over 100 participants for 67.1% of text prompts and 94.5% of image prompts.
Beyond Generation: Powerful Editing & Variation
SLAT’s decoupled design enables tuning-free creative control:
- Detail Variation: Keep the structure \(p_i\) fixed and re-run Stage 2 with a new prompt to produce novel textures and materials.
- Region-Specific Editing: Regenerate the latents within a targeted voxel region, leaving the rest unchanged (see the sketch after this list).
Figure 7: Above — varied styles for same structure (robot, house). Below — sequential edits to replace/remove/add parts while maintaining coherence.
Conclusion & Future Outlook
The Structured 3D Latents framework and TRELLIS architecture mark a significant leap in versatile, high-quality 3D generation:
- Unified Representation: Sparse scaffold + rich local detail features.
- Scalable Generative Model: From text or image input, flexible decoding into meshes, Gaussians, or NeRFs.
- Interactive Creativity: Supports intuitive, tuning-free editing workflows.
TRELLIS points towards a standardized 3D generative paradigm akin to latent diffusion in 2D, offering potent tools for games, animation, digital twins, and metaverse experiences. While limitations remain—like two-stage generation overhead and baked-in lighting from image prompts—the foundation is set for scalable, format-agnostic 3D creation.