In the rapidly evolving world of Computer Vision, Multimodal Large Language Models (MLLMs) have achieved something formerly thought impossible: they can look at an image and describe it with near-human fluency. Models like GPT-4V or LLaVA can identify a person in a photo, tell you they are smiling, and perhaps describe their clothing.
However, if you ask these general models to identify specific, fine-grained details—like the precise location of “crow’s feet” wrinkles, the style of eyeliner applied, or the exact boundaries of a skin blemish—they often fail. They lack fine-grained grounding, the ability to link specific textual concepts to precise pixels on a high-resolution face.
This is the problem addressed in GroundingFace, a new framework presented at CVPR. The researchers introduce a method to bridge the gap between general scene understanding and the micro-level perception required for facial analysis.

As shown in Figure 1 above, GroundingFace doesn’t just “see” a face; it parses it into granular components like “nose bridge,” “cheekbone brown spot,” and “jawline,” while providing detailed captions about makeup and skin conditions.
In this post, we will break down how this model works, the massive dataset engineered to train it, and the novel architectural choices that allow it to maintain general knowledge while becoming a facial expert.
1. The Data Problem: Why General Models Fail at Faces
To understand the solution, we must first understand the bottleneck. Current MLLM datasets are designed for general objects. They contain captions like “a dog running on grass” or “a man holding a cup.”
When these datasets do include faces, the annotations are usually coarse (e.g., “face,” “eye,” “nose”). They lack the vocabulary for:
- Skin attributes: Acne, rosacea, moles, wrinkles.
- Makeup: Eyeliner style, lipstick texture, contouring.
- Hierarchical semantics: Understanding that “looking old” implies the presence of “wrinkles” and “grey hair.”
Existing facial datasets (like CelebA or FFHQ) are good for classification but lack the pixel-grounded text-to-image alignment needed for modern MLLMs.
The Solution: FacePlayGround-240K
To train a model to be a facial expert, the authors constructed a new dataset called FacePlayGround-240K. This is the first large-scale, pixel-grounded face caption and Question-Answer (QA) dataset. It includes 240,000 images with 5.4 million mask annotations covering 47 detailed categories.
The construction of this dataset is a masterclass in automated data engineering. As illustrated in the figure below, the pipeline involves four distinct stages:

- Comprehensive Caption Generation: Instead of relying on simple templates, the researchers used a commercial API to extract raw facial attributes (pose, skin type, emotion). They then fed this structured data into a large language model (InternVL) to generate rich, descriptive captions at three semantic levels: short, detailed, and overall.
- Fine-Grained Part Mask Annotation: This is where the pixel precision comes in. The team combined automatic parsing (using MediaPipe for structure) with commercial APIs (for skin defects) and manual annotation (specifically for hard-to-segment items like makeup).
- Text-Mask Alignment: A caption is useless for grounding if the model doesn’t know which word corresponds to which pixel. The pipeline aligns specific noun phrases (e.g., “dark eyeliner”) with their corresponding segmentation masks; a minimal sketch of this idea follows the list.
- Grounded Hierarchical Semantic QA: Finally, an LLM generates Question-Answer pairs. This transforms the data from static descriptions into interactive instructions, such as “Where is the blemish?” or “Why does the person look angry?”
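To make the text-mask alignment stage concrete, here is a minimal sketch of the core idea: matching noun phrases from a generated caption to named part masks. The mask names, the phrase list, and the keyword-matching heuristic are illustrative assumptions, not the authors’ actual pipeline, which is only described at a high level.

```python
import numpy as np

# Hypothetical masks produced by the annotation stage, keyed by part name.
# In the real pipeline these come from face parsing, commercial APIs, or manual labels.
masks = {
    "left_eye_eyeliner": np.zeros((512, 512), dtype=bool),
    "right_cheek_mole": np.zeros((512, 512), dtype=bool),
    "nasolabial_fold": np.zeros((512, 512), dtype=bool),
}

# Noun phrases extracted from a generated caption (e.g., with an off-the-shelf parser).
caption_phrases = ["dark eyeliner", "a small mole on the right cheek", "deep nasolabial folds"]

# Toy keyword table mapping caption vocabulary to mask categories.
KEYWORDS = {
    "eyeliner": "left_eye_eyeliner",
    "mole": "right_cheek_mole",
    "nasolabial": "nasolabial_fold",
}

def align_phrases_to_masks(phrases, masks):
    """Return (phrase, mask_name) pairs by simple keyword matching."""
    aligned = []
    for phrase in phrases:
        for keyword, mask_name in KEYWORDS.items():
            if keyword in phrase.lower() and mask_name in masks:
                aligned.append((phrase, mask_name))
    return aligned

print(align_phrases_to_masks(caption_phrases, masks))
# [('dark eyeliner', 'left_eye_eyeliner'), ('a small mole on the right cheek', 'right_cheek_mole'), ...]
```

Once every grounded phrase carries a mask ID, the caption can serve as a training target for both the language head (the words) and the segmentation decoder (the pixels).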
The statistical breakdown of this dataset reveals its depth. Unlike previous datasets that might only label “eyes” and “nose,” FacePlayGround-240K includes specific categories like “Nasolabial fold,” “Crow’s feet,” and various makeup types.

2. The Architecture: Inside GroundingFace
With the data ready, the authors needed a model architecture capable of utilizing it. They built GroundingFace upon GLaMM (Grounding Large Multimodal Model), a state-of-the-art generalist model.
However, standard GLaMM has limitations when applied to faces:
- Resolution: Global visual encoders often shrink images, losing the tiny details of a pore or an eyelash.
- Decoder Sensitivity: Standard segmentation decoders (like the vanilla SAM decoder) are trained for objects, not fine facial lines.
The GroundingFace architecture introduces three specific components to solve these issues: the Face Prior Sampler, the HQ-SAM Adapter, and a Multi-Stage Training Recipe.

Component A: Fine-Grained Face Part Segmentation (The HQ-SAM Adapter)
One of the key insights of this paper is related to how Vision Transformers (ViTs) “see” images at different depths.
- Deep layers capture high-level semantics (e.g., “this is a human face”).
- Shallow layers capture low-level details (e.g., edges, textures, lines).
Standard models often rely heavily on deep features. For facial analysis, this is a mistake because features like wrinkles or skin texture are low-level details that get smoothed out in deep layers.
The visualization below demonstrates this phenomenon. Notice how the shallow features (left columns) retain sharp edges and textures, while deep features (right columns) become abstract blobs.

To address this, GroundingFace reuses the shallow features from the SAM (Segment Anything Model) visual encoder. They introduce a Shallow-Deep Fusion module that combines these high-frequency details with the semantic understanding of the deeper layers, allowing the mask decoder to draw precise boundaries around tiny facial features.
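The paper’s fusion module has its own specific design, but the underlying idea, projecting shallow high-frequency feature maps and deep semantic ones to a common width and mixing them before the mask decoder, can be sketched in a few lines of PyTorch. The channel sizes, 1x1 projections, and additive fusion below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ShallowDeepFusion(nn.Module):
    """Toy fusion block: project shallow and deep ViT feature maps to a common
    width and add them, so the mask decoder sees both textures and semantics."""

    def __init__(self, shallow_dim=256, deep_dim=1024, out_dim=256):
        super().__init__()
        self.shallow_proj = nn.Conv2d(shallow_dim, out_dim, kernel_size=1)
        self.deep_proj = nn.Conv2d(deep_dim, out_dim, kernel_size=1)

    def forward(self, shallow_feat, deep_feat):
        # Upsample the deep map to the shallow map's spatial size before mixing,
        # so the higher resolution of the shallow features is preserved.
        deep_feat = nn.functional.interpolate(
            deep_feat, size=shallow_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        return self.shallow_proj(shallow_feat) + self.deep_proj(deep_feat)

# Example: a 64x64 shallow map and a 16x16 deep map from a ViT-style encoder.
fusion = ShallowDeepFusion()
shallow = torch.randn(1, 256, 64, 64)
deep = torch.randn(1, 1024, 16, 16)
print(fusion(shallow, deep).shape)  # torch.Size([1, 256, 64, 64])
```

The key design point is that the output keeps the shallow map’s spatial resolution, so edges, fine lines, and skin texture survive into the mask decoder instead of being averaged away.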
Component B: The Face Prior Sampler
Processing high-resolution images is computationally expensive. MLLMs usually resize images to a standard square (e.g., 336x336), which destroys facial details.
Instead of processing the whole high-res image, GroundingFace uses a Face Prior Sampler. Since facial landmarks are easy to detect, the model uses them to crop and align the face from the high-resolution input. These “face tokens” are then compressed and injected into the model.
This allows the model to “zoom in” on the face without bearing the computational cost of processing the entire high-resolution background.
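A rough sketch of the crop-and-resize idea is below. The landmark source, the margin, and the output size are illustrative assumptions; the actual sampler also aligns the face and compresses the resulting face tokens before injecting them into the language model.

```python
import numpy as np
import cv2  # OpenCV, used here only to resize the crop

def crop_face(image: np.ndarray, landmarks: np.ndarray, margin: float = 0.25,
              out_size: int = 336) -> np.ndarray:
    """Crop the face region from a high-resolution image using detected landmarks.

    image:     H x W x 3 array (the original high-resolution frame).
    landmarks: N x 2 array of (x, y) facial keypoints from any landmark detector.
    """
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)

    # Expand the landmark bounding box by a margin so the hairline and jaw are included.
    w, h = x_max - x_min, y_max - y_min
    x0 = max(int(x_min - margin * w), 0)
    y0 = max(int(y_min - margin * h), 0)
    x1 = min(int(x_max + margin * w), image.shape[1])
    y1 = min(int(y_max + margin * h), image.shape[0])

    face = image[y0:y1, x0:x1]
    # Only the face crop is resized to the vision encoder's input size, so facial
    # detail is preserved instead of being diluted by the full-frame background.
    return cv2.resize(face, (out_size, out_size))

# Usage: face_crop = crop_face(hi_res_image, detected_landmarks)
# The crop is encoded into "face tokens" and fed to the MLLM alongside the global image tokens.
```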
3. The Training Recipe: Mixture of Experts (MoE)
A common problem in AI fine-tuning is catastrophic forgetting. If you take a general model that knows about cars, trees, and dogs, and fine-tune it exclusively on faces, it will become a facial expert but “forget” how to segment a car.
GroundingFace employs a clever two-stage training strategy equipped with LoRA (Low-Rank Adaptation) and a Mixture of Experts (MoE) router to solve this.

- Stage 1: The model is trained on a mix of general data and the new facial data.
- Stage 2: Most parameters are frozen, and only a specific “High-Quality (HQ) Adapter” and a specific LoRA module (LoRA3) are trained on high-quality manual annotations.
Crucially, they implement a Router. When the model receives an input, the router decides whether the token requires “High-Quality” expert processing (for fine facial details) or “Low-Quality” processing (for general scene understanding).
The routing logic acts as a traffic controller, ensuring that the specialized facial training doesn’t overwrite the model’s general capabilities. The mathematical formulation for this routing is straightforward:

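The formula appears as a figure in the paper, so the version below is a plausible reconstruction from the surrounding description rather than the authors’ exact notation. For a layer \(l\) with frozen base weight \(W^l\), a routed LoRA update would look like:

\[
h^l \;=\; W^l x \;+\; S^l_{HQ}(x)\, A^l_{HQ} B^l_{HQ}\, x \;+\; S^l_{LQ}(x)\, A^l_{LQ} B^l_{LQ}\, x,
\qquad S^l(x) \;=\; \mathrm{softmax}\!\left(W^l_{r}\, x\right),
\]

where \(W^l_{r}\) is a lightweight router that produces the per-token expert weights.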
Here, \(S^l(x)\) determines the probability of a token \(x\) being routed to a specific expert. If it’s a facial detail, it goes to the HQ expert (\(A_{HQ}^l B_{HQ}^l\)); otherwise, it takes the standard path.
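A minimal PyTorch sketch of this routed-LoRA layer is shown below. The two-expert setup, the softmax router, and the rank are assumptions chosen for clarity, not the exact GroundingFace configuration.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """A single low-rank adapter: x -> A(B(x)), following the post's A·B notation."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.B = nn.Linear(dim, rank, bias=False)  # down-projection
        self.A = nn.Linear(rank, dim, bias=False)  # up-projection

    def forward(self, x):
        return self.A(self.B(x))

class RoutedLoRALayer(nn.Module):
    """Frozen base layer plus a router that mixes an HQ and an LQ LoRA expert."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)   # general knowledge stays frozen
        self.hq = LoRAExpert(dim, rank)   # fine facial-detail expert
        self.lq = LoRAExpert(dim, rank)   # general-purpose expert
        self.router = nn.Linear(dim, 2)   # produces S^l(x) per token

    def forward(self, x):
        scores = torch.softmax(self.router(x), dim=-1)          # [..., 2]
        return (self.base(x)
                + scores[..., 0:1] * self.hq(x)
                + scores[..., 1:2] * self.lq(x))

# Per-token routing: each token independently decides how much HQ vs. LQ processing it gets.
layer = RoutedLoRALayer(dim=64)
tokens = torch.randn(4, 16, 64)   # 4 sequences of 16 tokens
print(layer(tokens).shape)        # torch.Size([4, 16, 64])
```

During Stage 2, only the HQ expert (and the HQ adapter in the decoder) would receive gradients, which is what lets the model specialize in faces without overwriting its general pathway.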
4. Experiments and Results
The researchers evaluated GroundingFace across four tasks: Pixel Grounded Face Captioning, Face Referring Segmentation, Grounded VQA, and Zero-shot Face Attribute Recognition.
Qualitative Results
The model’s ability to ground specific attributes is visually impressive. In the examples below, you can see the model accurately identifying complex concepts like “severe crow’s feet” or interpreting emotional states based on facial geometry.

Quantitative Analysis
The quantitative results highlight the gap between general models and GroundingFace. The authors compared their model against the GLaMM baseline.
Face Captioning & Segmentation: In Table 2, we see the ablation study. Row 1 represents the baseline GLaMM. Row 7 represents the full GroundingFace model.
- METEOR (Caption Quality): Improves from 1.1 to 23.1. This is a massive jump, indicating the baseline barely understood the fine-grained facial prompt.
- gIoU (Segmentation Quality): Improvements in the Generalized Intersection over Union score show much tighter, more accurate masks.
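For readers unfamiliar with the segmentation metrics, the sketch below computes mask IoU and aggregates it in the two ways commonly reported by referring-segmentation benchmarks: the average of per-image IoUs and the cumulative intersection over cumulative union. This follows the convention in related grounding work; it is an assumption that GroundingFace’s gIoU is computed the same way.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def aggregate_iou(preds, gts):
    """Return (mean per-image IoU, cumulative IoU) over a set of mask pairs."""
    per_image = [mask_iou(p, g) for p, g in zip(preds, gts)]
    total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(np.mean(per_image)), float(total_inter / max(total_union, 1))

# Toy usage with two random 64x64 masks.
rng = np.random.default_rng(0)
preds = [rng.random((64, 64)) > 0.5 for _ in range(2)]
gts = [rng.random((64, 64)) > 0.5 for _ in range(2)]
print(aggregate_iou(preds, gts))
```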

Pixel Grounded VQA: Table 3 shows the performance on Visual Question Answering. Again, the jump from GLaMM to the proposed method is substantial, with the Grounded Caption METEOR score rising from 0.9 to 21.9.

Zero-Shot Attribute Recognition: Perhaps most interestingly, the model excels at recognizing attributes (like age, gender, and emotion) on standard datasets (RAF-DB, LFWA) even without being explicitly trained on them. It outperforms larger models like InternVL-v1.5 (26B parameters) despite being a smaller 7B parameter model, simply because its alignment training is so effective.
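Zero-shot attribute recognition with an MLLM is, in practice, careful prompting plus answer parsing. The snippet below shows the general pattern; `ask_mllm` is a placeholder for any chat-style inference call, and the label list mirrors the basic-emotion classes used by datasets like RAF-DB. None of this is GroundingFace’s actual evaluation code.

```python
# Generic zero-shot attribute recognition pattern for any chat-style MLLM.
EMOTIONS = ["surprise", "fear", "disgust", "happiness", "sadness", "anger", "neutral"]

def classify_emotion(image, ask_mllm):
    """ask_mllm(image, prompt) -> str is a placeholder for the model's inference call."""
    prompt = (
        "Which of the following best describes the person's facial expression? "
        + ", ".join(EMOTIONS) + ". Answer with a single word."
    )
    answer = ask_mllm(image, prompt).strip().lower()
    # Map the free-form answer back onto the benchmark's label set.
    for label in EMOTIONS:
        if label in answer:
            return label
    return "neutral"  # fallback when the answer cannot be parsed
```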

5. Conclusion
GroundingFace represents a significant step forward in fine-grained visual understanding. By acknowledging that faces require a different set of visual features (shallow, high-frequency) compared to general objects (deep, semantic), the authors successfully adapted a general MLLM into a facial specialist.
The key takeaways for students and practitioners are:
- Data is King: The creation of FacePlayGround-240K with its hierarchical, pixel-aligned annotations was the prerequisite for success.
- Feature Hierarchy Matters: You cannot rely solely on deep features for segmentation tasks involving texture and fine lines.
- MoE for Specialization: Using Mixture of Experts allows models to learn new, specific domains without suffering from catastrophic forgetting of their general pre-training.
This work paves the way for advanced applications in digital makeup, automated dermatology, and more nuanced human-computer interaction where the computer understands not just that you are smiling, but how you are smiling.