Imagine walking into a crowded room. Almost instantly, you can tell who is talking to whom, who is looking at the clock waiting to leave, and who is staring at the delicious cake on the table. This ability—gaze following—is a fundamental building block of human social interaction. It allows us to infer intent, attention, and social dynamics in a split second.
For computers, however, this task is surprisingly difficult. To understand where a person is looking, a model must understand two distinct things: the person’s physical orientation (head pose, eye position) and the semantic context of the scene (where are the objects? how far away are they?).
Historically, solving this problem involved building complex, heavy deep learning architectures that acted like “Frankenstein” models—stitching together separate neural networks for head analysis, scene analysis, depth estimation, and pose estimation.
In this post, we are doing a deep dive into Gaze-LLE, a paper that radically simplifies this process. Instead of training massive, complex pipelines from scratch, the authors ask a simple question: Can we just ask a general-purpose foundation model where someone is looking?
The answer is yes, but as we will see, it requires a clever architectural trick called “Head Prompting.”
The Problem with Prior Approaches
To appreciate the elegance of Gaze-LLE, we first need to look at how gaze target estimation has traditionally been solved. The standard approach is known as the Multi-Branch Fusion architecture.
In this paradigm, the model is split into two (or more) parallel streams:
- Head Branch: Takes a high-resolution crop of the person’s head to analyze gaze direction.
- Scene Branch: Takes the full image to understand the environment.
- Auxiliary Branches (Optional): Many recent state-of-the-art (SotA) models also add separate branches for depth maps or 3D pose estimation.
These features are then mashed together in a “Fusion Module” to predict a heatmap of where the person is looking.
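To make the paradigm concrete, here is a schematic PyTorch sketch of what such a multi-branch model looks like. The class name, layer sizes, and branch choices are illustrative placeholders for the general pattern, not any specific paper's architecture.

```python
import torch
import torch.nn as nn

class MultiBranchGazeNet(nn.Module):
    """Schematic of the classic multi-branch fusion paradigm (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Head branch: looks at a high-res head crop to estimate gaze direction.
        self.head_branch = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Scene branch: full image plus a head-position channel.
        self.scene_branch = nn.Sequential(
            nn.Conv2d(4, 64, 7, stride=2, padding=3), nn.AdaptiveAvgPool2d(7))
        # Optional auxiliary branch, e.g. for a depth map.
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.AdaptiveAvgPool2d(7))
        # Fusion module mixes all branch features and decodes a gaze heatmap.
        self.fusion = nn.Sequential(
            nn.Conv2d(64 + 64 + 64, 64, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1))

    def forward(self, scene_rgb, head_crop, head_pos_map, depth_map):
        h = self.head_branch(head_crop)                             # (B, 64) gaze-direction features
        s = self.scene_branch(torch.cat([scene_rgb, head_pos_map], dim=1))
        d = self.depth_branch(depth_map)
        h = h[:, :, None, None].expand(-1, -1, s.shape[2], s.shape[3])
        return self.fusion(torch.cat([s, h, d], dim=1))             # coarse gaze heatmap
```

Every one of these branches has to be trained jointly, which is exactly the complexity Gaze-LLE sets out to remove.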

As shown in the top half of Figure 1 above, this approach gets messy. It requires training multiple encoders simultaneously. It requires complex loss functions to balance the different branches. And importantly, it results in models with massive parameter counts (often 50M to 100M+ parameters) that are slow to train.
The Gaze-LLE Solution
The researchers propose a method called Gaze-LLE (Gaze target estimation via Large-scale Learned Encoders). The core philosophy is streamlining.
Rather than training a scene encoder from scratch on a small gaze dataset, why not use a Foundation Model? Models like DINOv2 (a large Vision Transformer) have been trained on millions of images to understand visual semantics. They already “know” what a scene looks like.
Gaze-LLE freezes this massive backbone and builds a lightweight decoder on top of it. The result is a model that is:
- Lightweight: It has ~2.8M learnable parameters (vs. 50M+ in prior work).
- Fast: It trains in less than 1.5 GPU hours.
- Accurate: It achieves state-of-the-art performance.
Let’s break down the architecture step-by-step.

1. The Frozen Backbone
The process starts with an input image. This image is fed into a frozen DINOv2 encoder. The authors chose DINOv2 because it is excellent at dense prediction tasks (like segmentation or depth estimation) right out of the box.
This encoder outputs a set of feature tokens (think of them as a compressed, smart representation of the image). Crucially, the weights of this backbone are never updated. This saves massive amounts of compute and prevents the model from “forgetting” the rich general visual knowledge it learned during pre-training.
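As a concrete sketch, here is how one might pull frozen patch features from DINOv2 with PyTorch. The torch.hub entry point and the `forward_features` output keys follow the public facebookresearch/dinov2 repository, but exact names and shapes may differ across versions, so treat this as an assumption-laden illustration rather than the paper's code.

```python
import torch

# Load DINOv2 ViT-B/14 from torch.hub; this is the kind of backbone Gaze-LLE keeps frozen.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # weights are never updated during training

img = torch.randn(1, 3, 448, 448)                 # H and W must be multiples of the 14px patch size
with torch.no_grad():
    feats = backbone.forward_features(img)

patch_tokens = feats["x_norm_patchtokens"]        # (1, 32*32, 768): one token per 14x14 patch
```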
2. The Innovation: Head Prompting
Here lies the main challenge. If you feed an image into DINOv2, it gives you a representation of the whole scene. But gaze estimation is person-specific. If there are five people in a room, the model needs to know which person we are interested in.
Prior works solved this by cropping the head and running a separate neural network (the “Head Branch”). Gaze-LLE eliminates the head branch entirely.
Instead, they introduce Head Prompting. They take the bounding box of the target person’s head and create a binary mask \(M\). They then learn a specific vector, called a position embedding (\(p_{\text{head}}\)). They inject this information directly into the scene features.
Mathematically, if \(x_{\mathcal{F}}\) is the feature map from the backbone, the new representation \(S\) is:
\[ S = x_{\mathcal{F}} + \left( M * p_{\mathrm{head}} \right) \]

This is a subtle but powerful shift. Instead of processing the head separately, the model effectively “highlights” the region of the feature map corresponding to the person and asks the Transformer: “Focus on the person at this location.”
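Here is a minimal sketch of this prompting step, assuming the patch tokens come from the ViT-B/14 grid above; the function and variable names are mine, not the paper's, and the embedding is zero-initialized purely for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, grid_h, grid_w = 768, 32, 32           # matches the ViT-B/14 patch grid above
p_head = nn.Parameter(torch.zeros(d_model))     # the single learned head-position embedding

def head_prompt(scene_feats, head_bbox_mask):
    """
    scene_feats:    (B, grid_h*grid_w, d_model) patch tokens from the frozen backbone
    head_bbox_mask: (B, 1, H, W) binary mask, 1 inside the target person's head box
    Returns S = x_F + M * p_head, with M downsampled to the token grid.
    """
    M = F.interpolate(head_bbox_mask, size=(grid_h, grid_w), mode="nearest")
    M = M.flatten(2).transpose(1, 2)            # (B, grid_h*grid_w, 1)
    return scene_feats + M * p_head             # broadcast adds p_head only at head tokens
```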
3. The Transformer Decoder
Once the features are prompted with the head location, they are passed to a lightweight module consisting of 3 Transformer layers.
Why Transformers? Gaze is often a long-range dependency problem. A person on the far left of the image might be looking at an object on the far right. Convolutional Neural Networks (CNNs), which look at local neighborhoods of pixels, often struggle to connect these distant points. Transformers, with their self-attention mechanism, are designed exactly for this—allowing the model to relate the “head” tokens to the “target” tokens anywhere in the image.
For tasks that require knowing if the person is looking at something outside the camera frame, the authors append a special learnable token:
\[ [\underbrace{t_{\mathrm{in/out}}}_{\text{task token}}, \underbrace{s_1, s_2, \ldots, s_{H \times W}}_{\text{scene tokens}}] \]

This task token \(t_{\text{in/out}}\) gathers information from the whole scene and is eventually used to classify whether the gaze target is visible or off-screen.
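Here is a rough sketch of that token sequence and the 3-layer decoder, built from PyTorch's stock transformer layers. The real Gaze-LLE decoder adds positional embeddings and a small convolutional upsampler for the heatmap, so treat this as a simplified approximation with invented names.

```python
import torch
import torch.nn as nn

d_model, n_layers = 768, 3

# Learnable in/out task token, prepended to the prompted scene tokens.
t_inout = nn.Parameter(torch.zeros(1, 1, d_model))

# A lightweight 3-layer transformer over [t_inout, s_1, ..., s_HW].
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=n_layers)

heatmap_head = nn.Linear(d_model, 1)            # per-token heatmap logit (upsampled later)
inout_head = nn.Linear(d_model, 1)              # binary in-frame / out-of-frame logit

def decode(prompted_tokens):                    # (B, H*W, d_model) from head_prompt(...)
    B = prompted_tokens.shape[0]
    tokens = torch.cat([t_inout.expand(B, -1, -1), prompted_tokens], dim=1)
    tokens = decoder(tokens)                    # self-attention relates head and target regions
    heatmap = heatmap_head(tokens[:, 1:]).view(B, 1, 32, 32)   # scene tokens back to the patch grid
    inout_logit = inout_head(tokens[:, 0])      # the task token carries the in/out decision
    return heatmap, inout_logit
```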
4. The Loss Function
Because the architecture is so streamlined, the training objective is remarkably simple. While prior works use complex multi-task losses to supervise depth, pose, and saliency branches, Gaze-LLE uses a straightforward combination of heatmap loss (where is the point?) and binary classification (is it in-frame?):
\[ \mathcal{L} = \mathcal{L}_{\mathrm{hm}} + \lambda \mathcal{L}_{\mathrm{in/out}} \]
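A hedged sketch of this objective is below; the specific form of the heatmap term (MSE against a Gaussian target here), the use of binary cross-entropy for the in/out term, and the value of \(\lambda\) are assumptions for illustration rather than the paper's exact choices.

```python
import torch.nn.functional as F

def gaze_lle_loss(pred_heatmap, gt_heatmap, pred_inout_logit, gt_inout, lam=1.0):
    """L = L_hm + lambda * L_in/out (loss variants and lambda are illustrative assumptions)."""
    l_hm = F.mse_loss(pred_heatmap, gt_heatmap)                               # where is the point?
    l_inout = F.binary_cross_entropy_with_logits(pred_inout_logit, gt_inout)  # is it in-frame?
    return l_hm + lam * l_inout
```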
Why Naive Swapping Didn’t Work
You might be wondering: “If DINOv2 is so good, couldn’t we just take an old model (like the Chong et al. model from 2020) and swap out its ResNet encoder for DINOv2?”
The authors actually tried this, and the results were fascinating.

As shown in Table 1 above, simply plugging DINOv2 into existing architectures (like Chong et al., Miao et al., or Gupta et al.) actually made performance worse, even if the backbone was fine-tuned.
This reveals a critical insight: Architecture matters. Old architectures relied on Convolutional decoders and specific fusion techniques designed for ResNet features. DINOv2 features are different—they are patch-based and semantically dense. They require a decoder that speaks their language (i.e., Transformers) and a prompting mechanism that integrates seamlessly.
Design Analysis: The Recipe for Success
The authors performed a rigorous ablation study to prove that their simplified design was actually the optimal one. They tested three axes of design:
- Head Integration: Should we merge head info early (in the encoder) or late (in the decoder)?
- Decoder Type: Should we use Convolutions (CNN) or Transformers?
- Branches: Do we really need a separate Head Branch?

Table 2 (above) tells the story of the paper in numbers:
- Row (a): This mimics prior work (Early integration, Conv decoder, Separate head branch). Performance is decent (AUC 0.854).
- Row (c): Switching to “Late” integration helps significantly (AUC 0.932).
- Row (d) vs (f): This is the clincher. Row (d) uses a separate Head Branch. Row (f) does not. The performance is almost identical (AUC 0.954 vs 0.953).
Conclusion: If you use a Transformer decoder and a strong backbone, you do not need a separate head branch. The backbone already captures enough detail about the head orientation to solve the task, provided you tell the model where the head is.
Experimental Results
So, how does Gaze-LLE stack up against the heavyweights of the field? The results are undeniable.

In Table 3, look at the Learnable Params column. Most competitive models have between 30M and 100M parameters. Gaze-LLE (ViT-B) has just 2.8M.
Despite being an order of magnitude smaller, it achieves higher AUC (Area Under Curve) and lower distance error (L2) on both the GazeFollow and VideoAttentionTarget benchmarks. It beats complex models that use auxiliary depth and pose estimators.
Training Efficiency
One of the biggest practical benefits of this reduction in complexity is training speed. Because the heavy lifting is done by the frozen backbone, whose features never need gradients and can even be pre-computed, gradient updates happen only in the small decoder.

Figure 4 illustrates this dramatic difference. While other methods take 6+ GPU hours to converge to a lower accuracy, Gaze-LLE shoots up to state-of-the-art performance in under 1.5 hours.
Qualitative Examples
The model isn’t just good at metrics; it passes the eye test. It generalizes remarkably well to different domains, such as the ChildPlay dataset (interacting children) and GOO-Real (retail environments), even without fine-tuning.

In the visualization above, you can see the model correctly identifying the gaze target (the heatmap) based on the input person (the bounding box). It handles complex scenes—like a crowded room or a grocery aisle—where simply guessing “center bias” wouldn’t work.
Conclusion
Gaze-LLE represents a shift in how we approach computer vision tasks in the era of Foundation Models. For years, the trend was to add more: more branches, more modalities (depth, pose), and more complex fusion modules.
This paper proves that less is often more. By leveraging the massive knowledge embedded in frozen models like DINOv2 and designing a decoder that properly utilizes that knowledge (via head prompting and transformers), we can outperform complex, purpose-built architectures.
The implications for students and researchers are clear:
- Don’t reinvent the wheel: Use pre-trained foundation models.
- Focus on the interface: The magic often lies in how you query the model (prompting) rather than the model structure itself.
- Simplicity scales: A smaller, simpler model is easier to train, debug, and deploy.
Gaze-LLE effectively turns gaze estimation from a complex multi-modal fusion problem into a streamlined “prompt-and-decode” task, setting a new standard for efficient visual understanding.