Figure 1: High-quality 3D assets generated by TRELLIS in various formats from text or image prompts. Demonstrates versatile generation, vivid appearances with 3D Gaussians or Radiance Fields, detailed geometries with meshes, and flexible editing.
The world of AI-generated content has been dominated by stunning 2D imagery. Models like DALL-E and Midjourney can conjure photorealistic scenes and fantastical art from a simple text prompt.
But what about the third dimension?
While 3D generation has made impressive strides, it’s long felt a step behind its 2D counterparts. Why is that?
One of the biggest hurdles is the representation problem. Unlike 2D images, neatly stored as a pixel grid, 3D objects come in many forms: meshes for clean geometry, voxels for volumetrics, Radiance Fields (NeRFs) and 3D Gaussians for photorealistic rendering. Each format has distinct strengths and limitations, and most generative models commit to just one, constraining their versatility. A model that excels at NeRFs might struggle to produce a clean mesh ready for a game engine.
This fragmentation makes it tough to build a unified, all-purpose 3D generation system. What if there were a common language—a representation that could fluidly translate into any format?
That’s the challenge addressed by the paper “Structured 3D Latents for Scalable and Versatile 3D Generation”. The authors introduce a unified latent representation called Structured Latents (SLAT) and a family of models built on it, dubbed TRELLIS. This system generates highly detailed 3D assets from either text or images, and crucially, can output them in multiple formats—meshes, 3D Gaussians, or Radiance Fields—all from the same underlying data.
The 3D Representation Zoo: Why We Need a Unified Framework
Before appreciating SLAT’s novelty, let’s look at the representation landscape:
- Meshes: The staple for games, animation, and CAD. They define precise, crisp geometry through vertices, edges, and faces, and shine for structure, but generating detailed materials and textures for them is challenging.
- Radiance Fields (NeRFs): Continuous functions mapping 3D coordinates + view direction to color/density. Fantastic for photorealistic view synthesis, but difficult to extract clean, editable geometry from.
- 3D Gaussians: Represent scenes as clouds of “blobs” with color, opacity, and shape. Allow real-time, high-quality rendering but suffer from the same clean surface extraction challenges as NeRFs.
Because of these differences, methods specialize—mesh models excel at geometry but need additional texturing; NeRF or Gaussian models produce rich visuals but can’t yield clean meshes.
TRELLIS argues the answer isn’t perfecting one format, but creating a foundational representation that easily converts into any of them.
The Core Idea: Structured Latents (SLAT)
SLAT elegantly captures both geometry and appearance using two components:
- Sparse Structure: A 3D grid marking the active voxels \(p_i\) that intersect the object's surface. This scaffold outlines coarse geometry, and it stays efficient even at high resolution because most voxels are empty.
- Local Latents: For each active voxel, a high-dimensional feature vector \(z_i\) encodes fine geometric and textural detail for its local region.
Mathematically, a SLAT is the set of active voxel positions paired with their local latents:

Equation 1: SLAT

\(\boldsymbol{z} = \{(\boldsymbol{z}_i, \boldsymbol{p}_i)\}_{i=1}^{L}\)

Because most of the 3D grid is empty, the number of active voxels \(L\) is far smaller than \(N^3\). This sparse-yet-rich structure gives SLAT efficiency with high fidelity: the scaffold defines the form, while the latents, extracted with the help of a powerful vision model, provide the detail.
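To make this concrete, here is a minimal sketch of a SLAT container in PyTorch. The class name `SLat`, the \(64^3\) resolution, and the 8-dimensional latents are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the SLAT data structure (assumes PyTorch).
from dataclasses import dataclass
import torch

@dataclass
class SLat:
    positions: torch.Tensor  # (L, 3) integer indices of active voxels in [0, N)
    latents: torch.Tensor    # (L, C) local feature vectors z_i
    resolution: int          # grid size N

    def occupancy_grid(self) -> torch.Tensor:
        """Densify the sparse structure into an N^3 boolean grid."""
        grid = torch.zeros(3 * (self.resolution,), dtype=torch.bool)
        x, y, z = self.positions.unbind(dim=-1)
        grid[x, y, z] = True
        return grid

# Example: 1,200 active voxels on a 64^3 grid with 8-dim latents, far
# fewer than the 64^3 = 262,144 cells a dense grid would store.
slat = SLat(
    positions=torch.randint(0, 64, (1200, 3)),
    latents=torch.randn(1200, 8),
    resolution=64,
)
print(slat.occupancy_grid().sum())  # <= 1200 (random duplicates collapse)
```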
Learning and Generating with SLAT: The TRELLIS Pipeline
Figure 2: TRELLIS overview — Encoding & Decoding: SLAT encodes geometry and appearance via multiview features from DINOv2. Generation: Two rectified flow transformers generate SLAT in two steps — structure then latents.
1. Encoding 3D Assets into SLATs
To train its models, TRELLIS transforms existing 3D data into SLAT form:
- Render Multi-view Images: Hundreds of views per object.
- Extract Visual Features: Use a pretrained DINOv2 encoder for strong feature representation and 3D awareness.
- Aggregate per Voxel: Project each active voxel onto every rendered view's feature map and average the sampled features (sketched below).
- Sparse VAE Compression: Feed the aggregated features into a sparse Variational Autoencoder, compressing them into normalized local latents \(z_i\).
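Here is a hedged sketch of the per-voxel aggregation step, assuming a generic pinhole camera model. The projection details and function names are illustrative rather than the paper's exact code, and visibility/occlusion handling is omitted.

```python
# Aggregate multiview DINOv2 features into per-voxel vectors (assumes PyTorch).
import torch
import torch.nn.functional as F

def aggregate_voxel_features(
    voxel_centers: torch.Tensor,  # (L, 3) world-space voxel centers
    feature_maps: torch.Tensor,   # (V, C, H, W) DINOv2 features, one per view
    world_to_cam: torch.Tensor,   # (V, 4, 4) camera extrinsics
    intrinsics: torch.Tensor,     # (V, 3, 3) pinhole intrinsics
) -> torch.Tensor:                # (L, C) averaged feature per voxel
    V, C, H, W = feature_maps.shape
    L = voxel_centers.shape[0]
    homog = torch.cat([voxel_centers, torch.ones(L, 1)], dim=-1)  # (L, 4)
    feats = torch.zeros(L, C)
    for v in range(V):
        cam = (world_to_cam[v] @ homog.T).T[:, :3]       # camera-space points
        pix = (intrinsics[v] @ cam.T).T                  # projected pixels
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / W * 2 - 1, uv[:, 1] / H * 2 - 1], dim=-1)
        sampled = F.grid_sample(
            feature_maps[v : v + 1], grid.view(1, L, 1, 2), align_corners=False
        )                                                # (1, C, L, 1)
        feats += sampled.squeeze(0).squeeze(-1).T        # accumulate (L, C)
    return feats / V                                     # average over views
```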
2. Decoding SLATs into Multiple Formats
Once encoded, SLAT can be decoded into various standard 3D formats via dedicated decoders:
- 3D Gaussians: \(\mathcal{D}_{GS}\) outputs Gaussian properties (position offsets, scale, opacity, rotation, color).
- Radiance Fields: \(\mathcal{D}_{RF}\) produces CP-decomposed local volumes that assemble into a global radiance field.
- Meshes: \(\mathcal{D}_{M}\) maps each voxel to a local Signed Distance Field, from which a mesh is extracted via FlexiCubes.
All decoders share a transformer backbone optimized for sparse input. Only the final output layer adapts to the target format, confirming SLAT’s versatility.
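To illustrate the shared-backbone design, here is a toy decoder factory in PyTorch: every format gets the same backbone recipe, and only the output width changes. A dense `nn.TransformerEncoder` stands in for the paper's sparse transformer, and the per-voxel output sizes are illustrative guesses, not the paper's values.

```python
import torch
import torch.nn as nn

def make_decoder(out_dim: int, model_dim: int = 768) -> nn.Module:
    """Same backbone for every format; only the final head's out_dim differs."""
    return nn.Sequential(
        nn.Linear(8, model_dim),  # lift 8-dim SLAT latents to model width
        nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=model_dim, nhead=12, batch_first=True
            ),
            num_layers=12,
        ),
        nn.Linear(model_dim, out_dim),  # the only format-specific layer
    )

# Illustrative per-voxel output sizes (assumptions, not the paper's numbers):
gaussian_dec = make_decoder(out_dim=32 * 14)  # K Gaussians x properties each
rf_dec = make_decoder(out_dim=4 * 8 * 3)      # CP factors of a local volume
mesh_dec = make_decoder(out_dim=1 + 21)       # SDF value + FlexiCubes params

z = torch.randn(1, 1200, 8)   # L = 1200 active-voxel latents
print(gaussian_dec(z).shape)  # torch.Size([1, 1200, 448])
```

Only the head is format-specific, which is what lets one latent feed three output formats.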
Figure 3: Architectures — Sparse VAE (encoding/decoding), Flow Transformer (\(\mathcal{G}_S\)), and Sparse Flow Transformer (\(\mathcal{G}_L\)).
3. Generating New 3D Assets
TRELLIS uses a Rectified Flow generative approach, training models to steadily transform noise into SLATs. Generation mirrors SLAT’s structure:
- Stage 1: A Flow Transformer \(\mathcal{G}_{S}\) generates the sparse structure \(p_i\) from the text or image prompt.
- Stage 2: A Sparse Flow Transformer \(\mathcal{G}_{L}\) generates the local latents \(z_i\) that populate the structure with detail.
Equation 5: Conditional Flow Matching objective guides rectified flow training.
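In standard rectified-flow form, with linear interpolation \(\boldsymbol{x}_t = (1 - t)\,\boldsymbol{x}_0 + t\,\boldsymbol{\epsilon}\) between data \(\boldsymbol{x}_0\) and noise \(\boldsymbol{\epsilon}\), the model \(\boldsymbol{v}_\theta\) regresses the constant velocity pointing from data to noise; this should match the paper's Equation 5 up to notation:

\(\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,\boldsymbol{x}_0,\,\boldsymbol{\epsilon}}\,\big\| \boldsymbol{v}_\theta(\boldsymbol{x}_t, t) - (\boldsymbol{\epsilon} - \boldsymbol{x}_0) \big\|_2^2\)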
Once a SLAT is generated, it’s decoded into the desired format—high-resolution mesh, detailed Gaussian splat, or photorealistic radiance field.
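A minimal Euler sampler makes the two-stage procedure concrete. Here `g_s`, `g_l`, and `cond` are assumed interfaces rather than the released API, and the sketch omits the small VAE the paper uses to compress the dense structure grid before flow matching.

```python
# Two-stage SLAT generation with a plain Euler integrator (assumes PyTorch).
import torch

@torch.no_grad()
def sample_rectified_flow(model, x, cond, steps: int = 50) -> torch.Tensor:
    """Euler-integrate from t=1 (noise) back to t=0 (data) along v_theta."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(1.0 - i * dt)
        x = x - dt * model(x, t, cond)  # step against the learned velocity
    return x

def generate_slat(g_s, g_l, cond, N: int = 64, C: int = 8):
    # Stage 1: denoise a dense occupancy grid, then threshold to get p_i.
    occ = sample_rectified_flow(g_s, torch.randn(1, 1, N, N, N), cond)
    positions = (occ[0, 0] > 0).nonzero()  # (L, 3) active voxel indices
    # Stage 2: denoise per-voxel latents z_i on that sparse structure.
    z = sample_rectified_flow(g_l, torch.randn(len(positions), C), cond)
    return positions, z
```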
Results: How TRELLIS Performs
TRELLIS was trained on 500K curated 3D assets, captioned with GPT-4o for text conditioning, and its models scale up to 2B parameters.
Reconstruction Fidelity
Figure 4: TRELLIS produces vivid textures, accurate geometry, and coherent details from diverse prompts.
Table 1: SLAT achieves state-of-the-art reconstruction in both appearance (PSNR↑, LPIPS↓) and geometry (CD↓, F-score↑), outperforming alternate latent representations.
Generation Quality vs. State-of-the-Art
Figure 5: Qualitative comparison — TRELLIS shows sharper, more coherent geometry and vivid textures. Other methods suffer from distortions or bland detail.
Table 2: Quantitative performance on Toys4k — TRELLIS leads in CLIP (prompt alignment), FD, and KD metrics for both text-to-3D and image-to-3D tasks.
User Study
Figure 6: User preference — TRELLIS was chosen by over 100 participants for 67.1% of text prompts and 94.5% of image prompts.
Beyond Generation: Powerful Editing & Variation
SLAT’s decoupled design enables tuning-free creative control:
- Detail Variation: Keep the structure \(p_i\) fixed and re-run Stage 2 with a new prompt to produce novel textures and materials.
- Region-Specific Editing: Regenerate the latents within a targeted voxel region, leaving the rest unchanged (see the sketch after this list).
Figure 7: Above — varied styles for same structure (robot, house). Below — sequential edits to replace/remove/add parts while maintaining coherence.
Conclusion & Future Outlook
The Structured 3D Latents framework and TRELLIS architecture mark a significant leap in versatile, high-quality 3D generation:
- Unified Representation: Sparse scaffold + rich local detail features.
- Scalable Generative Model: From text or image input, flexible decoding into meshes, Gaussians, or NeRFs.
- Interactive Creativity: Supports intuitive, tuning-free editing workflows.
TRELLIS points towards a standardized 3D generative paradigm akin to latent diffusion in 2D, offering potent tools for games, animation, digital twins, and metaverse experiences. While limitations remain—like two-stage generation overhead and baked-in lighting from image prompts—the foundation is set for scalable, format-agnostic 3D creation.