Introduction
In the rapidly evolving landscape of AR/VR and the metaverse, the demand for personalized, photorealistic 3D avatars is skyrocketing. We all want a digital twin that not only looks like us but can also change outfits as easily as we do in the real world.
While recent advances in 3D Gaussian Splatting (3DGS) have allowed for incredible real-time rendering of static scenes, editing these representations remains a massive headache. If you have ever tried to “edit” a point cloud, you know the struggle: it lacks structure. On the other hand, traditional meshes are easy to edit but often struggle to capture the fuzzy, intricate details of real-world clothing and hair.
Enter TetGS (Tetrahedron-constrained Gaussian Splatting), a new method proposed by researchers from Peking University and Alibaba Group. This framework attempts to bridge the gap between structural control and photorealistic rendering.

As shown in Figure 1, this method takes a simple monocular video and allows users to perform text-guided or image-guided edits—changing a sweater to a varsity jacket or a trench coat—while maintaining high fidelity. In this post, we will break down how TetGS works, why it solves the “uncontrollable Gaussian” problem, and how it achieves such clean results.
The Problem: The Chaos of Unstructured Splats
To understand why this paper is significant, we first need to look at the limitations of standard 3D Gaussian Splatting. 3DGS represents a scene as millions of discrete 3D ellipsoids (Gaussians). It’s excellent for reconstruction (taking photos and making a 3D model) but terrible for editing.
Why? Because 3DGS is essentially an unstructured point cloud. There is no underlying mesh connecting the points. When you try to edit a 3DGS model using generative AI (like a diffusion model), the gradients—the signals telling the points where to move—become noisy. This often results in:
- Needle-like artifacts: Gaussians shooting off into space.
- Blurriness: The texture loses its sharpness.
- Lack of geometry: The model looks like a colored fog rather than a solid object.
The researchers hypothesize that to get good edits, you need to decouple geometry (shape) from appearance (texture).
The Solution: TetGS (Tetrahedron-constrained Gaussian Splatting)
The core innovation of this paper is the TetGS representation. Instead of letting Gaussian splats float freely in space, the researchers embed them inside a structured Tetrahedral Grid.
Think of a Tetrahedral Grid as a 3D mesh built from tetrahedra (triangular pyramids) rather than surface triangles: instead of describing only the outer surface, the grid fills the entire 3D volume.

As illustrated in Figure 3, every Gaussian is assigned to a specific tetrahedron. Its position (\(\mu\)) is calculated based on the vertices of that tetrahedron.
- Why does this matter? It binds the rendering primitives (Gaussians) to a deformable geometry (Tetrahedrons). If you deform the grid (change the shape of the shirt), the Gaussians automatically move with it. This provides the structure of a mesh with the rendering quality of splatting.
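To make this binding concrete, here is a minimal PyTorch sketch (my own illustration, not the authors’ implementation; the tensor layout for grid_vertices, tets, tet_id, and bary is assumed) of how a Gaussian’s center \(\mu\) can be recomputed from the tetrahedron it is assigned to, using fixed barycentric weights:

```python
import torch

def gaussian_centers(grid_vertices, tets, tet_id, bary):
    """Recompute Gaussian centers (mu) from the deformable tetrahedral grid.

    grid_vertices: (V, 3) positions of the grid vertices (these get deformed)
    tets:          (T, 4) vertex indices of every tetrahedron
    tet_id:        (G,)   index of the tetrahedron each Gaussian is bound to
    bary:          (G, 4) fixed barycentric weights per Gaussian (rows sum to 1)
    Returns:       (G, 3) Gaussian centers that follow any grid deformation.
    """
    corners = grid_vertices[tets[tet_id]]             # (G, 4, 3) corner positions
    return (bary.unsqueeze(-1) * corners).sum(dim=1)  # barycentric interpolation
```

Because the barycentric weights stay fixed, deforming the grid (say, turning a t-shirt into a jacket) automatically drags every bound Gaussian along with it.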
The mathematical relationship binding a mesh (surface) vertex \(v^{M}\) to the tetrahedron vertices \(v^{T}\) is defined through a Signed Distance Function (SDF): the surface is extracted exactly where the SDF crosses zero inside the grid. This makes the surface explicit, giving the Gaussians a precise surface to inhabit.
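The paper’s exact equation is not reproduced here, but this zero-crossing rule follows the standard Marching Tetrahedra interpolation (as used in DMTet-style pipelines): a surface vertex is a linear blend of the two tetrahedron vertices whose SDF values have opposite signs,

\[
v^{M} = \frac{v_{a}^{T}\, s(v_{b}^{T}) - v_{b}^{T}\, s(v_{a}^{T})}{s(v_{b}^{T}) - s(v_{a}^{T})},
\]

where \(s(\cdot)\) is the SDF value at a vertex and \(s(v_{a}^{T})\), \(s(v_{b}^{T})\) have opposite signs.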
The Pipeline: From Video to Editable Avatar
The overall workflow is a three-stage process: Instantiation, Localized Spatial Adaptation, and Texture Generation.

Stage 1: High-Quality 3D Avatar Instantiation
Before editing can happen, the system needs to understand the “base” avatar. The input is a simple 360-degree video of a person.
The system first reconstructs the person using an Implicit SDF field—a neural network that learns the 3D shape of the person. This ensures the surface is smooth and accurate. Once the geometry is known, it initializes the TetGS by converting that geometry into a tetrahedral grid and filling it with Gaussians.
The architecture for this initial reconstruction uses a combination of geometry and appearance networks:

To handle the imperfections of real-world video (like uneven lighting or holes in the scan), the authors use specific loss functions to regularize the normals (surface directions), ensuring the avatar doesn’t end up lumpy or with flipped, inward-facing normals.
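As a hedged illustration of what such regularization typically looks like for an SDF field (not the paper’s exact terms; the weights w_eik and w_smooth are placeholders), one common recipe combines an Eikonal term, which keeps the SDF gradient at unit length, with a smoothness term that penalizes normals changing abruptly between nearby points:

```python
import torch

def normal_regularization(sdf_grad, sdf_grad_perturbed, w_eik=0.1, w_smooth=0.05):
    """Common SDF/normal regularizers (illustrative, not the paper's exact losses).

    sdf_grad:           (N, 3) SDF gradients at sampled points
    sdf_grad_perturbed: (N, 3) SDF gradients at slightly jittered copies of those points
    """
    # Eikonal loss: a valid SDF has gradients of unit norm everywhere.
    eikonal = ((sdf_grad.norm(dim=-1) - 1.0) ** 2).mean()

    # Normal smoothness: normals of nearby points should agree.
    n1 = torch.nn.functional.normalize(sdf_grad, dim=-1)
    n2 = torch.nn.functional.normalize(sdf_grad_perturbed, dim=-1)
    smoothness = (n1 - n2).norm(dim=-1).mean()

    return w_eik * eikonal + w_smooth * smoothness
```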

Stage 2: Localized Spatial Adaptation (The Geometry Edit)
This is where the magic happens. Let’s say you want to change the avatar’s t-shirt to a “puffy jacket.” This requires changing the 3D shape (geometry).
Standard methods would try to move everything at once, often ruining the face or hands. TetGS uses Localized Tetrahedron Partitioning.
- Masking: The system identifies which parts of the body are “clothing” (editable) and which are “body” (keep fixed).
- Partitioning: The tetrahedrons corresponding to the “keep” regions are frozen. Their vertices cannot move.
- Deformation: The tetrahedrons in the “edit” region are allowed to deform based on the text prompt.

Figure 4 visualizes this perfectly. The red dots represent frozen vertices (the face/neck), while the green dots are free to move. This ensures the person’s identity is preserved while their clothes change shape.
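A minimal way to implement this kind of partitioning (a sketch under my own assumptions about the data layout, not the authors’ code) is to optimize a per-vertex offset and zero it out for frozen vertices before applying it:

```python
import torch

def apply_localized_deformation(grid_vertices, vertex_offsets, editable_mask):
    """Deform only the editable part of the tetrahedral grid.

    grid_vertices:  (V, 3) rest positions of the grid vertices
    vertex_offsets: (V, 3) learnable displacements driven by the diffusion guidance
    editable_mask:  (V,)   True for vertices in the edit region, False for frozen ones
    """
    # Frozen vertices (face, hands, etc.) receive zero displacement,
    # so identity-preserving regions never move during optimization.
    masked_offsets = vertex_offsets * editable_mask.float().unsqueeze(-1)
    return grid_vertices + masked_offsets
```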
Dual Spatial Constraint
To ensure the new shape looks realistic, the optimization is guided by a diffusion model (SDS loss). However, applying this globally isn’t enough. The authors introduce a Dual Constraint:
- Global SDS: Ensures the overall avatar looks coherent.
- Local SDS: Focuses specifically on the edited region to capture fine details of the new clothing.

Additionally, a surface-aware regularization ensures that the new clothing geometry doesn’t accidentally intersect or occlude the parts we want to keep.

The total loss combines these elements to drive the geometry transformation.
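The paper’s exact weighting is not reproduced here, but schematically the geometry objective takes a form like

\[
\mathcal{L}_{\text{geo}} = \mathcal{L}_{\text{SDS}}^{\text{global}} + \lambda_{\text{local}}\,\mathcal{L}_{\text{SDS}}^{\text{local}} + \lambda_{\text{reg}}\,\mathcal{L}_{\text{surface}},
\]

where the \(\lambda\) weights (placeholders here) balance local clothing detail against preserving the untouched regions.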

Stage 3: Coarse-to-Fine Texture Generation
Once the geometry has morphed into a jacket, it still looks like the old t-shirt stretched out. It needs a new texture. The authors argue that doing this in one shot leads to artifacts. Instead, they decouple it into two substeps.
Step A: Restricted TetGS (The Coarse Pass)
Initially, they “restrict” the Gaussians: the 3D Gaussians are forced to behave like flat 2D disks (surfels) lying on the surface, and view-dependent effects (like shininess) are turned off.
They then use a Normal-guided Inpainter: a rendering of the new shape is taken, uncolored areas are identified, and a generative model (like ControlNet) “paints” the new texture, guided by the surface normals.

This process blends the inpainted “hallucinated” texture with the existing rendering to create a stable base.
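A simplified sketch of that blending step (with assumed interfaces; inpainter stands in for a ControlNet-style, normal-conditioned inpainting model) might look like this:

```python
import torch

def coarse_texture_pass(render_rgb, uncolored_mask, normal_map, inpainter):
    """Blend inpainted texture into the uncolored regions of the edited avatar.

    render_rgb:     (3, H, W) current rendering of the deformed avatar
    uncolored_mask: (1, H, W) 1 where the new geometry has no valid texture yet
    normal_map:     (3, H, W) rendered surface normals that guide the inpainter
    inpainter:      a normal-conditioned inpainting model (assumed callable)
    """
    # Hallucinate texture only where it is missing, guided by surface normals.
    inpainted = inpainter(render_rgb, mask=uncolored_mask, condition=normal_map)

    # Keep original pixels where texture already exists, take the inpainted
    # result elsewhere; this forms the stable coarse texture target.
    return render_rgb * (1 - uncolored_mask) + inpainted * uncolored_mask
```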

Step B: Attribute Activation (The Fine Pass)
The coarse pass gives a stable texture but lacks the photorealism of 3DGS. In the final step, the system “releases” the restrictions. The Gaussians are allowed to become 3D ellipsoids again, and their view-dependent color attributes (Spherical Harmonics) are activated.
The system refines these attributes using augmented multi-view guidance, bringing back high-frequency details like fabric texture and lighting interactions.
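In a typical 3DGS codebase, “activating” these attributes in the fine stage simply means making them learnable again (the attribute names scales and sh_rest below are assumptions, not the authors’ API):

```python
def activate_fine_attributes(gaussians):
    """Switch from the restricted (surfel-like) pass to full 3D Gaussians.

    `gaussians` is assumed to expose PyTorch parameters for per-Gaussian scales
    and higher-order spherical-harmonic (SH) color coefficients.
    """
    # Let the flattened axis grow again, restoring full 3D ellipsoids.
    gaussians.scales.requires_grad_(True)

    # Enable higher-order SH coefficients so view-dependent color
    # (shininess, fabric sheen) can be learned in the fine pass.
    gaussians.sh_rest.requires_grad_(True)
```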
Experimental Results
Does it actually work? The results suggest a significant leap over previous methods.
Qualitative Comparison
In Figure 5, we see the system’s versatility. It handles skirts, jackets, and shorts, maintaining the person’s pose and identity perfectly.

Comparing TetGS to state-of-the-art baselines like GaussianEditor and DGE (Figure 6) reveals the benefits of the structural constraints. Notice how the baselines often produce “spiky” artifacts or blurry textures, whereas TetGS produces clean, sharp clothing.

Quantitative Analysis
The visual improvements are backed by numbers. Table 1 shows that TetGS achieves a significantly lower FID (Fréchet Inception Distance), a metric that measures how close the distribution of rendered images is to that of real photographs; a lower score indicates better photorealism.
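For readers who want to run this kind of evaluation themselves, FID can be computed with off-the-shelf tooling such as torchmetrics (a generic example, not the paper’s evaluation script; the image batches below are random placeholders):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception features of real and generated image sets;
# a lower value means the generated distribution is closer to the real one.
fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # in practice, use hundreds of images per set
```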

Ablation Studies
The authors also proved the necessity of their decoupled pipeline.
- Without TetGS: Directly optimizing points leads to noise (Figure 18).
- Without Localized Adaptation: The geometry doesn’t deform enough to match the prompt (e.g., a jacket looks like a painted t-shirt).
- Without Attribute Activation: The texture looks flat and cartoonish.

Image-Guided Editing (Virtual Try-On)
A particularly cool application is Virtual Try-On. Instead of a text prompt, you can feed the system a reference image of a garment. By using an image-based virtual try-on model (IDM-VTON) to guide the texture generation, TetGS can wrap a specific real-world item onto the 3D avatar.

To achieve this, they add a specific loss term that aligns the normals of the generated avatar with the normals expected from the reference image, ensuring the drape and folds match the target clothing.
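Conceptually, such a normal-alignment term penalizes the angular difference between the rendered normals and the normals predicted from the try-on reference; an illustrative form (not necessarily the paper’s exact loss) is

\[
\mathcal{L}_{\text{normal}} = \frac{1}{|P|} \sum_{p \in P} \left( 1 - \hat{n}_{\text{render}}(p) \cdot \hat{n}_{\text{ref}}(p) \right),
\]

where \(P\) is the set of pixels in the edited region and \(\hat{n}\) denotes unit normals.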

Conclusion
The TetGS framework represents a smart step forward in generative 3D media. By acknowledging that 3D Gaussian Splatting is great for rendering but needs help with structure, the authors combined the best of both worlds: the explicit editability of meshes (via tetrahedrons) and the visual fidelity of Gaussians.
Key takeaways for students and researchers:
- Structure matters: Unconstrained optimization in 3D space usually leads to chaos. Constraining Gaussians to a grid stabilizes the learning process.
- Decouple your problems: Trying to learn shape and color simultaneously is hard. Solving for geometry first, then coarse texture, then fine texture yields much better results.
- Hybrid representations: The future of 3D graphics likely isn’t just meshes or just NeRFs/Gaussians, but hybrid systems that leverage the strengths of each.
This technology paves the way for accessible 3D content creation, allowing users to create and customize high-fidelity avatars from simple phone videos.