Introduction
In the rapidly evolving landscape of AR/VR and the metaverse, the demand for personalized, photorealistic 3D avatars is skyrocketing. We all want a digital twin that not only looks like us but can also change outfits as easily as we do in the real world.
While recent advances in 3D Gaussian Splatting (3DGS) have allowed for incredible real-time rendering of static scenes, editing these representations remains a massive headache. If you have ever tried to “edit” a point cloud, you know the struggle: it lacks structure. On the other hand, traditional meshes are easy to edit but often struggle to capture the fuzzy, intricate details of real-world clothing and hair.
Enter TetGS (Tetrahedron-constrained Gaussian Splatting), a new method proposed by researchers from Peking University and Alibaba Group. This framework attempts to bridge the gap between structural control and photorealistic rendering.

As shown in Figure 1, this method takes a simple monocular video and allows users to perform text-guided or image-guided edits—changing a sweater to a varsity jacket or a trench coat—while maintaining high fidelity. In this post, we will break down how TetGS works, why it solves the “uncontrollable Gaussian” problem, and how it achieves such clean results.
The Problem: The Chaos of Unstructured Splats
To understand why this paper is significant, we first need to look at the limitations of standard 3D Gaussian Splatting. 3DGS represents a scene as millions of discrete 3D ellipsoids (Gaussians). It’s excellent for reconstruction (taking photos and making a 3D model) but terrible for editing.
Why? Because 3DGS is essentially an unstructured point cloud. There is no underlying mesh connecting the points. When you try to edit a 3DGS model using generative AI (like a diffusion model), the gradients—the signals telling the points where to move—become noisy. This often results in:
- Needle-like artifacts: Gaussians shooting off into space.
- Blurriness: The texture loses its sharpness.
- Lack of geometry: The model looks like a colored fog rather than a solid object.
The researchers hypothesize that to get good edits, you need to decouple geometry (shape) from appearance (texture).
The Solution: TetGS (Tetrahedron-constrained Gaussian Splatting)
The core innovation of this paper is the TetGS representation. Instead of letting Gaussian splats float freely in space, the researchers embed them inside a structured Tetrahedral Grid.
Think of a Tetrahedral Grid as a 3D mesh built from tetrahedra (triangular pyramids) rather than surface triangles: instead of describing only the outer surface, the grid fills the entire 3D volume.

As illustrated in Figure 3, every Gaussian is assigned to a specific tetrahedron. Its position (\(\mu\)) is calculated based on the vertices of that tetrahedron.
- Why does this matter? It binds the rendering primitives (Gaussians) to a deformable geometry (Tetrahedrons). If you deform the grid (change the shape of the shirt), the Gaussians automatically move with it. This provides the structure of a mesh with the rendering quality of splatting.
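To make this binding concrete, here is a minimal PyTorch sketch (my own illustration, not the authors’ implementation; the tensor layout for grid_vertices, tets, tet_id, and bary is assumed) of how a Gaussian’s center \(\mu\) can be recomputed from the tetrahedron it is assigned to, using fixed barycentric weights:

```python
import torch

def gaussian_centers(grid_vertices, tets, tet_id, bary):
    """Recompute Gaussian centers (mu) from the deformable tetrahedral grid.

    grid_vertices: (V, 3) positions of the grid vertices (these get deformed)
    tets:          (T, 4) vertex indices of every tetrahedron
    tet_id:        (G,)   index of the tetrahedron each Gaussian is bound to
    bary:          (G, 4) fixed barycentric weights per Gaussian (rows sum to 1)
    Returns:       (G, 3) Gaussian centers that follow any grid deformation.
    """
    corners = grid_vertices[tets[tet_id]]             # (G, 4, 3) corner positions
    return (bary.unsqueeze(-1) * corners).sum(dim=1)  # barycentric interpolation
```

Because the barycentric weights stay fixed, deforming the grid (say, turning a t-shirt into a jacket) automatically drags every bound Gaussian along with it.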
The mathematical relationship binding a mesh (surface) vertex \(v^{M}\) to the tetrahedron vertices \(v^{T}\) is defined through a Signed Distance Function (SDF): the surface is extracted exactly where the SDF crosses zero inside the grid. This makes the surface explicit, giving the Gaussians a precise surface to inhabit.
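The paper’s exact equation is not reproduced here, but this zero-crossing rule follows the standard Marching Tetrahedra interpolation (as used in DMTet-style pipelines): a surface vertex is a linear blend of the two tetrahedron vertices whose SDF values have opposite signs,

\[
v^{M} = \frac{v_{a}^{T}\, s(v_{b}^{T}) - v_{b}^{T}\, s(v_{a}^{T})}{s(v_{b}^{T}) - s(v_{a}^{T})},
\]

where \(s(\cdot)\) is the SDF value at a vertex and \(s(v_{a}^{T})\), \(s(v_{b}^{T})\) have opposite signs.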
The Pipeline: From Video to Editable Avatar
The overall workflow is a three-stage process: Instantiation, Localized Spatial Adaptation, and Texture Generation.

Stage 1: High-Quality 3D Avatar Instantiation
Before editing can happen, the system needs to understand the “base” avatar. The input is a simple 360-degree video of a person.
The system first reconstructs the person using an Implicit SDF field—a neural network that learns the 3D shape of the person. This ensures the surface is smooth and accurate. Once the geometry is known, it initializes the TetGS by converting that geometry into a tetrahedral grid and filling it with Gaussians.
The architecture for this initial reconstruction uses a combination of geometry and appearance networks:

To handle the imperfections of real-world video (like uneven lighting or holes in the scan), the authors use specific loss functions to regularize the normals (surface directions), ensuring the avatar doesn’t end up lumpy or with flipped, inward-facing normals.
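As a hedged illustration of what such regularization typically looks like for an SDF field (not the paper’s exact terms; the weights w_eik and w_smooth are placeholders), one common recipe combines an Eikonal term, which keeps the SDF gradient at unit length, with a smoothness term that penalizes normals changing abruptly between nearby points:

```python
import torch

def normal_regularization(sdf_grad, sdf_grad_perturbed, w_eik=0.1, w_smooth=0.05):
    """Common SDF/normal regularizers (illustrative, not the paper's exact losses).

    sdf_grad:           (N, 3) SDF gradients at sampled points
    sdf_grad_perturbed: (N, 3) SDF gradients at slightly jittered copies of those points
    """
    # Eikonal loss: a valid SDF has gradients of unit norm everywhere.
    eikonal = ((sdf_grad.norm(dim=-1) - 1.0) ** 2).mean()

    # Normal smoothness: normals of nearby points should agree.
    n1 = torch.nn.functional.normalize(sdf_grad, dim=-1)
    n2 = torch.nn.functional.normalize(sdf_grad_perturbed, dim=-1)
    smoothness = (n1 - n2).norm(dim=-1).mean()

    return w_eik * eikonal + w_smooth * smoothness
```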

Stage 2: Localized Spatial Adaptation (The Geometry Edit)
This is where the magic happens. Let’s say you want to change the avatar’s t-shirt to a “puffy jacket.” This requires changing the 3D shape (geometry).
Standard methods would try to move everything at once, often ruining the face or hands. TetGS uses Localized Tetrahedron Partitioning.
- Masking: The system identifies which parts of the body are “clothing” (editable) and which are “body” (keep fixed).
- Partitioning: The tetrahedrons corresponding to the “keep” regions are frozen. Their vertices cannot move.
- Deformation: The tetrahedrons in the “edit” region are allowed to deform based on the text prompt.

Figure 4 visualizes this perfectly. The red dots represent frozen vertices (the face/neck), while the green dots are free to move. This ensures the person’s identity is preserved while their clothes change shape.
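A minimal way to implement this kind of partitioning (a sketch under my own assumptions about the data layout, not the authors’ code) is to optimize a per-vertex offset and zero it out for frozen vertices before applying it:

```python
import torch

def apply_localized_deformation(grid_vertices, vertex_offsets, editable_mask):
    """Deform only the editable part of the tetrahedral grid.

    grid_vertices:  (V, 3) rest positions of the grid vertices
    vertex_offsets: (V, 3) learnable displacements driven by the diffusion guidance
    editable_mask:  (V,)   True for vertices in the edit region, False for frozen ones
    """
    # Frozen vertices (face, hands, etc.) receive zero displacement,
    # so identity-preserving regions never move during optimization.
    masked_offsets = vertex_offsets * editable_mask.float().unsqueeze(-1)
    return grid_vertices + masked_offsets
```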
Dual Spatial Constraint
To ensure the new shape looks realistic, the optimization is guided by a diffusion model (SDS loss). However, applying this globally isn’t enough. The authors introduce a Dual Constraint:
- Global SDS: Ensures the overall avatar looks coherent.
- Local SDS: Focuses specifically on the edited region to capture fine details of the new clothing.

Additionally, a surface-aware regularization ensures that the new clothing geometry doesn’t accidentally intersect or occlude the parts we want to keep.

The total loss combines these elements to drive the geometry transformation.
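The paper’s exact weighting is not reproduced here, but schematically the geometry objective takes a form like

\[
\mathcal{L}_{\text{geo}} = \mathcal{L}_{\text{SDS}}^{\text{global}} + \lambda_{\text{local}}\,\mathcal{L}_{\text{SDS}}^{\text{local}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{surface}},
\]

where the \(\lambda\) weights (placeholders here) balance local clothing detail against preserving the untouched regions.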

Stage 3: Coarse-to-Fine Texture Generation
Once the geometry has morphed into a jacket, it still looks like the old t-shirt stretched out. It needs a new texture. The authors argue that doing this in one shot leads to artifacts. Instead, they decouple it into two substeps.
Step A: Restricted TetGS (The Coarse Pass)
Initially, they “restrict” the Gaussians: the 3D Gaussians are forced to behave like flat 2D disks (surfels) lying on the surface, and view-dependent effects (like shininess) are turned off.
They then use a Normal-guided Inpainter: a rendering of the new shape is taken, uncolored areas are identified, and a generative model (like ControlNet) “paints” the new texture, guided by the surface normals.

This process blends the inpainted “hallucinated” texture with the existing rendering to create a stable base.
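A simplified sketch of that blending step (with assumed interfaces; inpainter stands in for a ControlNet-style, normal-conditioned inpainting model) might look like this:

```python
import torch

def coarse_texture_pass(render_rgb, uncolored_mask, normal_map, inpainter):
    """Blend inpainted texture into the uncolored regions of the edited avatar.

    render_rgb:     (3, H, W) current rendering of the deformed avatar
    uncolored_mask: (1, H, W) 1 where the new geometry has no valid texture yet
    normal_map:     (3, H, W) rendered surface normals that guide the inpainter
    inpainter:      a normal-conditioned inpainting model (assumed callable)
    """
    # Hallucinate texture only where it is missing, guided by surface normals.
    inpainted = inpainter(render_rgb, mask=uncolored_mask, condition=normal_map)

    # Keep original pixels where texture already exists, take the inpainted
    # result elsewhere; this forms the stable coarse texture target.
    return render_rgb * (1 - uncolored_mask) + inpainted * uncolored_mask
```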

Step B: Attribute Activation (The Fine Pass)
The coarse pass gives a stable texture but lacks the photorealism of 3DGS. In the final step, the system “releases” the restrictions. The Gaussians are allowed to become 3D ellipsoids again, and their view-dependent color attributes (Spherical Harmonics) are activated.
The system refines these attributes using augmented multi-view guidance, bringing back high-frequency details like fabric texture and lighting interactions.
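In a typical 3DGS codebase, “activating” these attributes in the fine stage simply means making them learnable again (the attribute names scales and sh_rest below are assumptions, not the authors’ API):

```python
def activate_fine_attributes(gaussians):
    """Switch from the restricted (surfel-like) pass to full 3D Gaussians.

    `gaussians` is assumed to expose PyTorch parameters for per-Gaussian scales
    and higher-order spherical-harmonic (SH) color coefficients.
    """
    # Let the flattened axis grow again, restoring full 3D ellipsoids.
    gaussians.scales.requires_grad_(True)

    # Enable higher-order SH coefficients so view-dependent color
    # (shininess, fabric sheen) can be learned in the fine pass.
    gaussians.sh_rest.requires_grad_(True)
```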
Experimental Results
Does it actually work? The results suggest a significant leap over previous methods.
Qualitative Comparison
In Figure 5, we see the system’s versatility. It handles skirts, jackets, and shorts, maintaining the person’s pose and identity perfectly.

Comparing TetGS to state-of-the-art baselines like GaussianEditor and DGE (Figure 6) reveals the benefits of the structural constraints. Notice how the baselines often produce “spiky” artifacts or blurry textures, whereas TetGS produces clean, sharp clothing.

Quantitative Analysis
The visual improvements are backed by numbers. Table 1 shows that TetGS achieves a significantly lower FID (Fréchet Inception Distance), a metric that measures how close the distribution of rendered images is to that of real photographs; a lower score indicates better photorealism.
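For readers who want to run this kind of evaluation themselves, FID can be computed with off-the-shelf tooling such as torchmetrics (a generic example, not the paper’s evaluation script; the image batches below are random placeholders):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception features of real and generated image sets;
# a lower value means the generated distribution is closer to the real one.
fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # in practice, use hundreds of images per set
```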

Ablation Studies
The authors also proved the necessity of their decoupled pipeline.
- Without TetGS: Directly optimizing points leads to noise (Figure 18).
- Without Localized Adaptation: The geometry doesn’t deform enough to match the prompt (e.g., a jacket looks like a painted t-shirt).
- Without Attribute Activation: The texture looks flat and cartoonish.

Image-Guided Editing (Virtual Try-On)
A particularly cool application is Virtual Try-On. Instead of a text prompt, you can feed the system a reference image of a garment. By using an image-based virtual try-on model (IDM-VTON) to guide the texture generation, TetGS can wrap a specific real-world item onto the 3D avatar.

To achieve this, they add a specific loss term that aligns the normals of the generated avatar with the normals expected from the reference image, ensuring the drape and folds match the target clothing.
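Conceptually, such a normal-alignment term penalizes the angular difference between the rendered normals and the normals predicted from the try-on reference; an illustrative form (not necessarily the paper’s exact loss) is

\[
\mathcal{L}_{\text{normal}} = \frac{1}{|P|} \sum_{p \in P} \left( 1 - \hat{n}_{\text{render}}(p) \cdot \hat{n}_{\text{ref}}(p) \right),
\]

where \(P\) is the set of pixels in the edited region and \(\hat{n}\) denotes unit normals.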

Conclusion
The TetGS framework represents a smart step forward in generative 3D media. By acknowledging that 3D Gaussian Splatting is great for rendering but needs help with structure, the authors combined the best of both worlds: the explicit editability of meshes (via tetrahedrons) and the visual fidelity of Gaussians.
Key takeaways for students and researchers:
- Structure matters: Unconstrained optimization in 3D space usually leads to chaos. Constraining Gaussians to a grid stabilizes the learning process.
- Decouple your problems: Trying to learn shape and color simultaneously is hard. Solving for geometry first, then coarse texture, then fine texture yields much better results.
- Hybrid representations: The future of 3D graphics likely isn’t just meshes or just NeRFs/Gaussians, but hybrid systems that leverage the strengths of each.
This technology paves the way for accessible 3D content creation, allowing users to create and customize high-fidelity avatars from simple phone videos.