Creating photorealistic digital humans that can move and react in real-time is one of the “holy grail” challenges in computer graphics. Whether for the Metaverse, video games, or virtual reality telepresence, we want avatars that look real—down to the wrinkles on a shirt—and render at high frame rates.
Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful alternative to meshes and NeRFs (Neural Radiance Fields). However, current methods face a frustrating trade-off: you can either have a fast avatar with blurry details or a high-fidelity avatar that runs at a sluggish 10 frames per second (FPS).
In this deep dive, we will explore a new method presented in the paper “Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs.” The researchers propose a novel architecture that abandons the idea of a single neural network controlling the whole body. Instead, they distribute the workload across many small networks anchored to the body surface. The result? Avatars that render at 166 FPS with sharper details than previous state-of-the-art methods.

The Problem: The Speed vs. Quality Trade-off
To understand why this paper is significant, we first need to look at how neural avatars are typically built. Most modern approaches use Linear Blend Skinning (LBS) to handle the rough movement of the skeleton, combined with a neural network to predict the fine details (like cloth folding or muscle bulging) based on the current pose.
In the context of 3D Gaussian Splatting, the avatar is made up of thousands of 3D Gaussians—ellipsoids with position, rotation, scale, color, and opacity. As the human moves, these properties need to change.
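To make that concrete, here is a minimal sketch of the per-Gaussian state that has to be updated as the body moves. The field names follow the standard 3DGS parameterization rather than this paper’s code:

```python
import torch

# Standard 3DGS per-primitive parameters (illustrative layout, not the paper's code).
num_gaussians = 100_000
gaussians = {
    "position": torch.zeros(num_gaussians, 3),  # ellipsoid center
    "rotation": torch.zeros(num_gaussians, 4),  # orientation as a quaternion
    "scale":    torch.ones(num_gaussians, 3),   # per-axis extent
    "color":    torch.zeros(num_gaussians, 3),  # RGB (full 3DGS stores SH coefficients)
    "opacity":  torch.zeros(num_gaussians, 1),  # transparency
}
```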
The Single MLP Bottleneck
Previous methods usually employed a single Multilayer Perceptron (MLP) or a large Convolutional Neural Network (CNN) to predict these changes for the entire body.
- Small Single MLP: If you use a small, fast network, it doesn’t have enough “brain power” (capacity) to memorize high-frequency details like logo text or fine wrinkles across the whole body. The result is real-time performance, but a blurry look.
- Large CNN (e.g., AnimatableGaussians): If you use a massive network (like a StyleUNet), you get beautiful details, but the computation is so heavy that you can’t render it in real-time.
The authors of this paper ask a pivotal question: Why force one network to learn the entire body’s appearance?
The Solution: Spatially Distributed MLPs
The core innovation of this work is Spatially Distributed MLPs. Instead of feeding the pose into one central network, the researchers place distinct “Anchor Points” all over the surface of the human body model. Each anchor point has its own tiny MLP.
How It Works
Imagine covering a mannequin in 300 small stickers. Each sticker contains a specialist neural network that only cares about the appearance of the skin or clothing in its immediate vicinity.

As shown in Figure 2 above, the pipeline works as follows:
- Input: The system takes the current pose vector \(\boldsymbol{\theta}\) (derived from the skeleton).
- Distributed Processing: This pose is fed into the MLPs located at the anchor points.
- Local Prediction: Each MLP outputs a set of coefficients specific to that region.
- Interpolation: Because a Gaussian (an individual primitive of the avatar) rarely sits exactly on top of an anchor point, its values are interpolated from the outputs of its nearest anchor points.
This “divide and conquer” strategy significantly reduces the learning burden. Each MLP only needs to master a small patch of local appearance, allowing the system to capture high-frequency details without a massive computational cost.
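Conceptually, the distributed network is just a bank of tiny MLPs that all receive the same pose vector. Here is a minimal PyTorch sketch of that idea (layer sizes, the pose dimension, and the number of coefficients are illustrative choices, not the paper’s exact configuration):

```python
import torch
import torch.nn as nn

class AnchorMLPs(nn.Module):
    """A bank of tiny MLPs, one per surface anchor point (illustrative sketch)."""
    def __init__(self, num_anchors=300, pose_dim=69, hidden=32, num_coeffs=8):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_coeffs))
            for _ in range(num_anchors)
        ])

    def forward(self, pose):
        # pose: (pose_dim,) -> per-anchor coefficients: (num_anchors, num_coeffs)
        return torch.stack([mlp(pose) for mlp in self.mlps])

anchors = AnchorMLPs()
pose = torch.randn(69)            # current skeleton pose
anchor_coeffs = anchors(pose)     # evaluated once per frame -> (300, 8)
```

In practice the per-anchor MLPs would be fused into batched matrix multiplications for speed, but the logic is the same: every anchor produces its own small set of coefficients from the shared pose.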
The Math of Interpolation
Let’s break down the mathematical architecture. We have \(F\) anchor points. The MLP at the \(j\)-th anchor point, denoted as \(\mathcal{E}^j\), takes the pose \(\boldsymbol{\theta}\) and outputs anchor coefficients \(\mathbf{w}_a^j\):

\[
\mathbf{w}_a^j = \mathcal{E}^j(\boldsymbol{\theta})
\]
Now, for a specific Gaussian at position \(\mathbf{x}_0\) (its neutral position), we need to determine its coefficients. We can’t just pick one anchor; that would create seams. Instead, we use Inverse Distance Weighting to interpolate the coefficients from the three nearest anchor points:

\[
\mathbf{w}_a(\mathbf{x}_0) = \frac{\sum_{j \in \mathcal{N}(\mathbf{x}_0)} \gamma(\mathbf{x}_0, \mathbf{x}^j)\,\mathbf{w}_a^j}{\sum_{j \in \mathcal{N}(\mathbf{x}_0)} \gamma(\mathbf{x}_0, \mathbf{x}^j)}
\]

Here, \(\mathcal{N}(\mathbf{x}_0)\) is the set of the three anchor points nearest to \(\mathbf{x}_0\), \(\mathbf{x}^j\) is the position of the \(j\)-th anchor, and \(\gamma(\mathbf{x}, \mathbf{y})\) is the reciprocal of the distance between the two points. This ensures that the influence of an MLP fades smoothly as you move away from its anchor.
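In code, this step is just a nearest-neighbour lookup followed by a normalized weighted sum. A small PyTorch sketch (function and argument names are ours):

```python
import torch

def interpolate_coeffs(gauss_pos, anchor_pos, anchor_coeffs, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation of anchor outputs (illustrative sketch)."""
    # gauss_pos: (N, 3) neutral Gaussian positions; anchor_pos: (A, 3); anchor_coeffs: (A, C)
    dists = torch.cdist(gauss_pos, anchor_pos)              # (N, A) pairwise distances
    near_d, near_i = dists.topk(k, dim=1, largest=False)    # k nearest anchors per Gaussian
    w = 1.0 / (near_d + eps)                                 # gamma = reciprocal distance
    w = w / w.sum(dim=1, keepdim=True)                       # normalize the weights
    return (w.unsqueeze(-1) * anchor_coeffs[near_i]).sum(dim=1)  # (N, C)
```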
The Basis Strategy: Preventing Blurriness
There is a subtle but critical design choice here. You might think, “Why not just let the MLPs output the color and opacity directly?”
If the MLPs output raw properties (like “Red” or “Transparency”), interpolating them would smooth out the data, effectively killing the sharp details the researchers are trying to preserve. Smooth interpolation equals blur.
To solve this, the authors use a Basis-Coefficient approach:
- Learnable Basis: Each Gaussian stores a set of “Basis” values (\(\delta\Lambda^k\)). Think of these as a dictionary of possible fundamental changes this specific Gaussian can undergo. These are learned freely and can be very sharp/distinct.
- Smooth Coefficients: The MLPs output the coefficients (weights), which are smoothly interpolated.
The final property offset \(\delta\Lambda\) is a linear combination of the sharp basis definitions weighted by the smooth coefficients:

\[
\delta\Lambda = \sum_{k} w^k\,\delta\Lambda^k
\]
Finally, these offsets are added to the neutral Gaussian properties (\(\Lambda_0\)) to get the final appearance for the current pose:

\[
\Lambda = \Lambda_0 + \delta\Lambda
\]
By interpolating the instructions (coefficients) rather than the results (colors), the system maintains smooth transitions across the body while allowing individual Gaussians to express sharp, high-frequency details defined in their basis.
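A minimal sketch of the basis-coefficient combination, where only the coefficients come from the (smooth) interpolation step and the basis is free to be sharp (tensor names and sizes are ours):

```python
import torch

N, K, D = 100_000, 8, 11          # Gaussians, basis size, property dims (illustrative)

basis  = torch.randn(N, K, D, requires_grad=True)  # delta-Lambda^k: sharp, learned per Gaussian
coeffs = torch.rand(N, K)                           # w^k: smoothly interpolated from anchor MLPs
neutral_properties = torch.zeros(N, D)              # Lambda_0: canonical-pose properties (placeholder)

delta = torch.einsum('nk,nkd->nd', coeffs, basis)   # delta-Lambda: per-Gaussian property offsets
final_properties = neutral_properties + delta       # Lambda = Lambda_0 + delta-Lambda
```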
Geometry Control: Pinning the Gaussians Down
While the method above solves the appearance (color/texture) problem, there is a geometric challenge. 3D Gaussians are unstructured points. During training, if left unconstrained, they tend to drift inside the body volume or clump together irregularly. This leads to artifacts when the avatar performs a novel pose that stretches the skin.
To fix this, the authors introduce Control Points.

Similar to the appearance anchors, Control Points are sampled on the mesh surface. The position offset of a Gaussian is not learned freely; it is interpolated from the position offsets of nearby Control Points.
The position offset for a control point, \(\delta \mathbf{x}_c\), is calculated using a similar basis-coefficient method, i.e. as a linear combination of a per-control-point basis of offsets weighted by pose-dependent coefficients:

\[
\delta \mathbf{x}_c = \sum_{k} w_c^k\,\delta \mathbf{x}_c^k
\]
Then, the Gaussian’s physical shift \(\delta \mathbf{x}\) is derived by inverse-distance interpolation of the offsets of its nearest control points:

\[
\delta \mathbf{x}(\mathbf{x}_0) = \frac{\sum_{c \in \mathcal{N}_c(\mathbf{x}_0)} \gamma(\mathbf{x}_0, \mathbf{x}_c)\,\delta \mathbf{x}_c}{\sum_{c \in \mathcal{N}_c(\mathbf{x}_0)} \gamma(\mathbf{x}_0, \mathbf{x}_c)}
\]

where \(\mathcal{N}_c(\mathbf{x}_0)\) is the set of control points nearest to the Gaussian and \(\mathbf{x}_c\) denotes a control point’s position.
This constraint forces the Gaussians to move coherently as a surface layer, rather than a cloud of independent particles. It acts as a regularization that drastically improves generalization to new poses.
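The same recipe, applied to geometry: control points get their own offset basis, and each Gaussian’s position shift is interpolated from the nearest control points. A short sketch reusing the interpolation helper from above (all names and sizes are illustrative):

```python
import torch

C, K = 1_000, 8                               # control points and basis size (illustrative)
ctrl_pos    = torch.rand(C, 3)                # control points sampled on the mesh surface
ctrl_basis  = torch.randn(C, K, 3)            # per-control-point basis of position offsets
ctrl_coeffs = torch.rand(C, K)                # pose-dependent coefficients (placeholder values)

# Offset of each control point: linear combination of its basis.
ctrl_offset = torch.einsum('ck,ckd->cd', ctrl_coeffs, ctrl_basis)        # (C, 3)

# Each Gaussian then inherits a shift by inverse-distance weighting its nearest control points:
# delta_x = interpolate_coeffs(gauss_pos, ctrl_pos, ctrl_offset, k=3)    # (N, 3)
```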
Training Objectives
To train this complex system, the researchers use a combination of loss functions.
They explicitly enforce that neighboring control points should have similar movement vectors to prevent tearing or erratic geometry:

They also limit the size of Gaussians to prevent them from becoming huge “blobs” that obscure details:

The total loss combines L1 reconstruction loss, LPIPS (perceptual loss), and the constraints mentioned above:

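To make the structure of the objective concrete, here is a hedged sketch of how such a combined loss could be assembled. The individual loss forms and the weights are illustrative assumptions, not the paper’s exact definitions:

```python
import torch
import torch.nn.functional as F

def total_loss(render, gt, lpips_fn, ctrl_offset, neighbor_idx, scales,
               w_lpips=0.1, w_nbr=1.0, w_scale=1.0, max_scale=0.01):
    # Photometric reconstruction terms.
    l1    = (render - gt).abs().mean()          # L1 loss
    lpips = lpips_fn(render, gt)                # perceptual (LPIPS) loss, any standard implementation
    # Neighboring control points should move similarly (anti-tearing regularizer).
    nbr   = (ctrl_offset.unsqueeze(1) - ctrl_offset[neighbor_idx]).pow(2).mean()
    # Penalize Gaussians whose scale grows beyond a threshold ("blob" prevention).
    big   = F.relu(scales - max_scale).mean()
    return l1 + w_lpips * lpips + w_nbr * nbr + w_scale * big
```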
Experiments and Results
The results of this architecture are impressive, particularly when comparing the “Speed vs. Quality” ratio against competitors.
Visual Quality
The researchers compared their method against state-of-the-art approaches like 3DGS-Avatar, MeshAvatar, and AnimatableGaussians.

In Figure 4, pay attention to the text on the shirt (Top Row) and the wrinkles in the pants (Bottom Row).
- 3DGS-Avatar & MeshAvatar: Often fail to capture fine wrinkles or text, resulting in a “washed out” look.
- AnimatableGaussians: Produces high-quality details similar to the Ground Truth (GT), but requires heavy computation.
- Ours: Matches or exceeds the visual fidelity of AnimatableGaussians, capturing the text “LIFE WITHOUT LIMITS” clearly.
Quantitative Metrics
The visual evidence is backed by numbers. In terms of LPIPS and FID (Fréchet Inception Distance), two perceptual metrics where lower is better, the proposed method consistently scores top marks.

Even more importantly, the method generalizes well. When the avatar is put into Novel Poses (poses it never saw during training), it maintains stability and quality.

The Speed Factor
This is the clincher. High quality is great, but not if it takes 100ms to render a single frame.
- AnimatableGaussians: ~10 FPS (Not real-time).
- Ours: 166 FPS.
This massive speedup is possible because the distributed MLPs are small and efficient. Furthermore, the position-based interpolation means the system doesn’t have to query a massive network for every single pixel or Gaussian; the anchor MLPs are evaluated only once per frame, and each Gaussian then performs a cheap interpolation of their outputs.
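A rough sketch of the per-frame cost structure, reusing the shapes from the earlier sketches with placeholder values: the anchor MLPs run once, and the per-Gaussian work reduces to a gather and a small weighted sum, with no network queries at all.

```python
import torch

num_anchors, num_coeffs, num_gaussians = 300, 8, 100_000
anchor_coeffs = torch.randn(num_anchors, num_coeffs)        # 300 tiny MLPs, evaluated once per frame
near_idx = torch.randint(num_anchors, (num_gaussians, 3))   # 3 nearest anchors (precomputable from
                                                            # the fixed neutral positions)
near_w = torch.rand(num_gaussians, 3)
near_w = near_w / near_w.sum(dim=1, keepdim=True)           # normalized inverse-distance weights

# Per-Gaussian work: a gather plus a weighted sum over 3 anchors.
coeffs = (near_w.unsqueeze(-1) * anchor_coeffs[near_idx]).sum(dim=1)  # (num_gaussians, num_coeffs)
```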
Ablation Studies: Proving the Design
The researchers performed ablation studies to prove that every part of their complex pipeline is necessary.
1. Why use the Basis? In Figure 6 below, you can see what happens if you remove the Basis-Coefficient strategy and just try to output properties directly. The folds in the shirt (center panel) become unrecognizable blurs. The Basis allows the sharpness seen on the right.

2. Why use Control Points? In Figure 7, the “w/o control point” examples show severe artifacts, particularly around the logo on the back. Without the control points anchoring the geometry to the surface, the Gaussians drift, destroying the texture coherence.

3. Number of MLPs: They also tested how many distributed MLPs are optimal. Too few (50), and you lose detail. Too many (800), and you waste computational power without gaining much quality. They settled on 300 MLPs as the sweet spot between speed and fidelity.

Conclusion
The paper “Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs” presents a smart architectural shift for digital humans. By moving away from monolithic neural networks and embracing a distributed, position-based approach, the authors achieved the best of both worlds: photorealistic detail and high frame rates.
Key Takeaways:
- Distributed is better: Breaking the body into local regions managed by small MLPs is more efficient than one large network.
- Interpolate coefficients, not colors: To keep details sharp while ensuring smooth transitions, learn a basis set and interpolate the weights.
- Geometry matters: Using Control Points to constrain Gaussian positions is essential for preventing artifacts during animation.
This work paves the way for highly realistic avatars that can actually be used in consumer VR hardware, where every millisecond of rendering time counts. While the current method doesn’t simulate complex cloth physics (like a skirt blowing in the wind), it sets a new standard for articulated human avatars in real-time applications.
