If you have ever strapped a GoPro to your head while cooking or working, you know the resulting footage is chaotic. The camera shakes, your hands obscure the view, objects move, and the environment changes state (an onion becomes chopped onions). For a computer vision system, making sense of this “egocentric” (first-person) view is a nightmare.
Traditional 3D reconstruction methods, like Neural Radiance Fields (NeRFs), usually assume the world is a statue—rigid and unchanging. On the other hand, video understanding models might grasp the action “cutting,” but they have no concept of the 3D space where it’s happening.
In this post, we are diving deep into DIV-FF (Dynamic Image-Video Feature Fields), a novel framework presented by researchers from the University of Zaragoza. This paper proposes a way to bridge the gap between 3D geometry and semantic video understanding. It is a system that doesn’t just see a “kitchen”; it understands where the “cutting” happens, tracks moving objects, and separates the actor from the environment.

The Core Problem: The Static Assumption vs. Reality
To understand why DIV-FF is necessary, we first need to look at the limitations of current technology.
Neural Radiance Fields (NeRFs) have revolutionized 3D computer vision. By training a neural network to predict the color and density of points in space, NeRFs can generate photorealistic views of a scene from new angles. Recently, methods like LERF (Language Embedded Radiance Fields) have added a semantic layer to this. They embed language features (from models like CLIP) into the 3D field, allowing you to query the 3D scene with text (e.g., “Find the toaster”).
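To make this concrete, here is a minimal sketch of the interface such a language-embedded field exposes. The layer sizes and the CLIP dimension are assumptions for illustration; this is not LERF's actual code.

```python
# Minimal sketch of a NeRF-style field with a LERF-style language head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageRadianceField(nn.Module):
    """Maps a 3D point and view direction to color, density, and a CLIP-aligned feature."""
    def __init__(self, hidden=256, clip_dim=512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.rgb_sigma = nn.Linear(hidden, 4)        # (r, g, b, density) as in NeRF
        self.language = nn.Linear(hidden, clip_dim)  # language head, queryable with CLIP text embeddings

    def forward(self, xyz, view_dir):
        h = self.trunk(torch.cat([xyz, view_dir], dim=-1))
        rgb_sigma = self.rgb_sigma(h)
        lang = F.normalize(self.language(h), dim=-1)
        return torch.sigmoid(rgb_sigma[..., :3]), torch.relu(rgb_sigma[..., 3]), lang
```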
However, LERF and similar methods rely on a rigid scene assumption. They assume that between frame 1 and frame 100, the only thing moving is the camera.
In an egocentric video, this assumption breaks immediately.
- The Actor: The camera wearer’s hands and body are constantly entering and leaving the frame.
- Dynamic Objects: Tools are picked up, food is moved, and containers are opened.
- Interaction: The semantics aren’t just about nouns (objects); they are about verbs (affordances). A cutting board is defined by its ability to support “cutting.”
If you run a standard NeRF on a cooking video, the moving hands and objects create “ghosting” artifacts, and the semantic understanding falls apart because the model cannot distinguish between the permanent counter and the temporary vegetables.
The Solution: DIV-FF
The researchers propose DIV-FF, a framework that decomposes the scene into three distinct components: the Persistent Environment, the Dynamic Environment, and the Actor. Furthermore, it integrates two different types of “language” to understand the world: Image-Language (for detailed object recognition) and Video-Language (for action and affordance understanding).
Let’s break down the architecture.

1. The Triple-Stream Geometry
As shown in Figure 2 above, DIV-FF doesn’t use a single neural network to represent the space. Instead, it uses three parallel streams, each handling a specific part of reality:
- Persistent Environment Network (Static): This stream models the background—the walls, the table, the fridge. It takes the viewing position and direction as input and outputs color and density (\(c^p, \sigma^p\)).
- Dynamic Environment Network (Objects): This models objects that move independently, like a bowl being stirred or a knife being lifted. Crucially, this network takes a frame-specific code (\(z_t^d\)) as input. This code acts like a timestamp, telling the network, “This is the state of the object at time \(t\).” It predicts density and color with uncertainty (\(\beta\)), allowing the model to be less confident about blurry moving parts.
- Actor Network (Hands/Body): This models the camera wearer. Since the actor moves continuously with the camera, this network uses a different frame-specific code (\(z_t^a\)) to capture the complex, non-rigid deformation of hands and arms.
By explicitly separating these three streams, the system can reconstruct a clean background even if hands are constantly waving in front of it.
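A rough sketch of this split might look like the following. The layer sizes, latent-code dimension, and the density-weighted blending rule are assumptions made for illustration, not the authors' implementation.

```python
# Sketch of the three-stream decomposition (assumed shapes and blending, not the paper's code).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class DivFFStreams(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        self.persistent = mlp(6, 4)             # (xyz, dir)          -> (rgb, sigma)
        self.dynamic    = mlp(6 + code_dim, 5)  # (xyz, dir, z_t^d)   -> (rgb, sigma, beta)
        self.actor      = mlp(6 + code_dim, 4)  # (xyz, dir, z_t^a)   -> (rgb, sigma)

    def forward(self, xyz, view_dir, z_dyn, z_act):
        x = torch.cat([xyz, view_dir], dim=-1)
        p = self.persistent(x)
        d = self.dynamic(torch.cat([x, z_dyn], dim=-1))
        a = self.actor(torch.cat([x, z_act], dim=-1))
        # Composite: densities add, colors are blended by each stream's density share.
        sigmas = torch.stack([p[..., 3], d[..., 3], a[..., 3]], dim=-1).relu()
        colors = torch.stack([p[..., :3], d[..., :3], a[..., :3]], dim=-2).sigmoid()
        weights = sigmas / (sigmas.sum(-1, keepdim=True) + 1e-8)
        color = (weights.unsqueeze(-1) * colors).sum(-2)
        return color, sigmas.sum(-1), d[..., 4]  # blended rgb, total density, raw uncertainty head (beta)
```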
2. The Image-Language Stream (CLIP + SAM)
The second innovation lies in how DIV-FF understands what it is looking at. Previous methods like LERF used CLIP (Contrastive Language-Image Pre-training) features extracted from image patches. The problem with patches is that they are coarse. A patch might contain half a hand and half an apple, confusing the semantic embedding.
DIV-FF improves this by integrating the Segment Anything Model (SAM).
- SAM generates precise masks for objects in the training images.
- The model extracts a CLIP descriptor for the masked region and the bounding box.
- It assigns a weighted average of these descriptors to all pixels within the mask.
This results in pixel-aligned features. Instead of a fuzzy cloud of “apple-ness,” the model learns that the pixels strictly inside the apple mask correspond to the word “apple.” This allows for much sharper semantic queries in 3D.
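In code, the pixel-aligned feature construction might look roughly like this. Here `clip_encode_image` and the mask/box weights (0.7/0.3) are placeholders chosen for illustration, and `masks` stands in for SAM's output:

```python
# Sketch of building pixel-aligned language features from SAM masks (placeholder APIs).
import numpy as np

def pixel_aligned_features(image, masks, clip_encode_image, w_mask=0.7, w_box=0.3):
    """For each SAM mask: embed the masked crop and its bounding-box crop with CLIP,
    then assign a weighted average of the two descriptors to every pixel in the mask."""
    H, W = image.shape[:2]
    feat_dim = 512                                   # CLIP embedding size (assumed)
    features = np.zeros((H, W, feat_dim), dtype=np.float32)
    for mask in masks:                               # mask: boolean (H, W) array from SAM
        ys, xs = np.nonzero(mask)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        box_crop = image[y0:y1, x0:x1]
        masked_crop = box_crop * mask[y0:y1, x0:x1, None]   # zero out background pixels
        f = w_mask * clip_encode_image(masked_crop) + w_box * clip_encode_image(box_crop)
        features[mask] = f / (np.linalg.norm(f) + 1e-8)
    return features
```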
3. The Video-Language Stream (EgoVideo)
This is perhaps the most critical contribution for robotics and interaction understanding. A static image can tell you “this is a knife.” But a video can tell you “this knife is being used to cut.”
To capture these “affordances” (potential for action), DIV-FF distills features from EgoVideo, a video-language pre-trained model. Unlike CLIP, which looks at single images, EgoVideo looks at short clips and understands temporal dynamics.
However, extracting local features from video transformers is tricky. The researchers employ a clever loss function to supervise this training:

As seen in the equation above:
- \(\psi^{GT}(V_p)\): Represents patch-level video features (medium-sized regions).
- \(\psi^{GT}(V)\): Represents a global video embedding (the meaning of the whole clip).
- \(\mathcal{M}_{IH}\): The interaction hotspot mask, marking the region where the hand contacts the object.
The model forces the learned features (\(\hat{\psi}\)) to match the local patch features everywhere. But for the specific area where the hand interacts with the object (the interaction hotspot), it also forces the features to match the global video context. This teaches the model that the specific zone where the knife touches the onion is the “cutting” zone.
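Putting those pieces together, the loss plausibly takes a form like the following. This is a reconstruction from the description above rather than the paper's exact notation, with \(\lambda\) as an assumed weighting term:

\[
\mathcal{L}_{video} = \sum_{p} \big\| \hat{\psi}(p) - \psi^{GT}(V_p) \big\|^2 \;+\; \lambda \sum_{p \in \mathcal{M}_{IH}} \big\| \hat{\psi}(p) - \psi^{GT}(V) \big\|^2
\]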
Experiments and Results
The researchers tested DIV-FF on the EPIC-Kitchens dataset, specifically on sequences with heavy object manipulation. They compared it against LERF and other object detection baselines.
Dynamic Object Segmentation
The first test was simple: Can the model find and segment moving objects in novel views?

The quantitative results are striking. As shown in Table 1, the full DIV-FF model achieves a 30.5 mIoU (mean Intersection over Union), which is a 40.5% improvement over the best baseline.
Why such a big jump?
- LERF fails because it assumes the scene is static. It tries to average the moving object into the background, resulting in a ghostly blur.
- CLIP (Patches) helps, but the boundaries are fuzzy.
- CLIP (SAM) sharpens the boundaries significantly.
We can see this visual improvement clearly in the ablation study below. Notice how the heatmaps (red/yellow areas) become progressively tighter around the objects (like the colander and the cutting board) as we move from LERF to the full DIV-FF model.

Querying the 3D World
Because the model learns a feature field, we can query it with text. If we type “countertop” or “banana,” the model highlights the relevant 3D regions.

In Figure 4, we see that the model handles scale variations well. It can segment a large surface like a countertop just as well as a small object like a banana. This is due to the robust masking strategy during training.
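Once per-pixel features are rendered from the field, a text query reduces to a cosine-similarity lookup. A small sketch, with `clip_encode_text` as a stand-in for a real CLIP text encoder:

```python
# Sketch of querying a rendered feature map with free-form text (placeholder encoder).
import numpy as np

def query_heatmap(rendered_features, text, clip_encode_text):
    """rendered_features: (H, W, D) pixel-aligned features rendered from the field.
    Returns an (H, W) relevancy map for a query such as 'countertop' or 'banana'."""
    t = clip_encode_text(text)
    t = t / (np.linalg.norm(t) + 1e-8)
    f = rendered_features / (np.linalg.norm(rendered_features, axis=-1, keepdims=True) + 1e-8)
    return f @ t                                   # cosine similarity per pixel
```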
Temporal Consistency
One of the hardest things in video segmentation is consistency. Often, a model will detect an object in frame 10, lose it in frame 11, and find it again in frame 12.

DIV-FF maintains consistency remarkably well. In Figure 5, watch the heatmaps under the “spatula” and “blue cutting board.” Even as the actor moves and the perspective shifts, the model “remembers” the object’s identity. This is because the frame-specific codes (\(z_t\)) provide a continuous trajectory for the semantic features in the latent space.
Understanding What’s Not There (Surrounding Understanding)
Because DIV-FF builds a 3D representation, it knows about the environment outside the current camera frame.

Figure 6 illustrates this capability. The model can segment the “pot” and the “sink” even when they are barely visible at the edge of the image. This “amodal” capability is crucial for robots that need to plan movements without constantly turning their cameras to verify object locations.
Affordance Segmentation: The Power of Video Features
This is where the Video-Language stream shines. The researchers queried the model with action-based phrases like “cut the onion” or “wash a kitchen utensil.”

In Figure 7 (and Table 2), we see a comparison between using Image-Language features versus Video-Language features.
- Image Features (Top rows): They struggle with verbs. They might find the “utensil,” but they don’t understand the concept of “washing” it.
- Video Features (Bottom rows): DIV-FF accurately highlights the interaction zones.
The table confirms this with a 69.7% improvement in affordance segmentation when using the video-language stream. The model understands that “toasting bread” isn’t just about the bread; it’s about the toaster and the bread interacting.
The importance of the global supervision (mentioned in the core method) is highlighted in the ablation below. Without adding the global video context to the interaction hotspot, the model produces diffuse, blurry maps (middle column). With global supervision, it pinpoints the action (right column).

Decomposing the Scene
Finally, because the architecture is split into three streams, DIV-FF allows for “Amodal Scene Understanding.” This means you can virtually turn off layers of reality.

In Figure 9, the researchers visualize the Principal Component Analysis (PCA) of the features.
- PCA (all): Shows the full scene with hands.
- PCA (w/o actor): The hands disappear, revealing the objects behind them.
- PCA (w/o actor & dynamic): Everything temporary is gone, leaving a clean map of the static kitchen.
This capability is incredibly useful for long-term mapping. A robot could enter a messy room, filter out the mess (dynamic objects), and navigate based on the permanent furniture.
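Building on the three-stream sketch from earlier, “turning off” a layer of reality amounts to dropping a stream’s density before compositing. A hypothetical helper:

```python
# Sketch of amodal rendering: exclude the actor and/or dynamic streams before compositing
# (builds on the illustrative DivFFStreams sketch above, not the authors' API).
import torch

def decomposed_density_color(streams, xyz, view_dir, z_dyn, z_act,
                             use_dynamic=True, use_actor=True):
    x = torch.cat([xyz, view_dir], dim=-1)
    outs = [streams.persistent(x)]                 # the static scene is always kept
    if use_dynamic:
        outs.append(streams.dynamic(torch.cat([x, z_dyn], dim=-1))[..., :4])
    if use_actor:
        outs.append(streams.actor(torch.cat([x, z_act], dim=-1)))
    sigmas = torch.stack([o[..., 3] for o in outs], dim=-1).relu()
    colors = torch.stack([o[..., :3] for o in outs], dim=-2).sigmoid()
    weights = sigmas / (sigmas.sum(-1, keepdim=True) + 1e-8)
    color = (weights.unsqueeze(-1) * colors).sum(-2)
    return color, sigmas.sum(-1)

# Example: render the clean, persistent kitchen only.
# color, sigma = decomposed_density_color(streams, xyz, dirs, z_d, z_a,
#                                         use_dynamic=False, use_actor=False)
```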
Conclusion
DIV-FF represents a significant step forward in how machines understand egocentric video. By moving away from the “static scene” assumption and embracing the chaos of real-world interaction, it builds a representation that is both geometrically accurate and semantically rich.
The key takeaways are:
- Decomposition is key: Separating the world into static, dynamic, and actor streams allows for cleaner reconstruction and understanding.
- Pixels over patches: Using SAM masks to align language features to pixels drastically improves object boundaries.
- Video implies Action: You cannot understand “doing” with just images. Incorporating video-language features unlocks the ability to map “affordances”—not just what things are, but what they allow us to do.
For students and researchers in robotics and AR, this paper highlights that the future of environment understanding lies in hybrid models—those that blend explicit geometry (NeRFs/3D) with the high-level reasoning of large language and video models.