Introduction
In the world of computer vision, estimating depth from a single image—determining how far away every pixel is—has seen revolutionary progress. Models like Depth Anything V2 can look at a flat photograph and intuitively understand the 3D geometry of the scene with remarkable accuracy. However, a massive gap remains between understanding a static image and understanding a video.
If you simply run a standard image depth model on a video, frame by frame, you encounter a phenomenon known as “flickering.” Because the model processes each frame in isolation, slight changes in lighting or camera angle cause the predicted depth to jump erratically. The result is a jittery, inconsistent mess that is unusable for robotics, augmented reality, or video editing.
Existing solutions to this problem usually fall into two camps: they are either computationally expensive diffusion models that take nearly a second to process a single frame, or they rely on “optical flow”—tracking pixel movement—which is prone to errors when objects move quickly or occlude one another. Furthermore, most of these methods struggle with long videos, accumulating errors that cause the depth scale to drift over time.
Enter Video Depth Anything (VDA).

As illustrated in Figure 1, the researchers behind Video Depth Anything have proposed a solution that hits the “sweet spot” of the trade-off triangle: it achieves high accuracy, high temporal consistency, and low latency. Perhaps most impressively, it introduces a strategy to handle “super-long” videos (over several minutes) without losing track of the scene’s geometry.
In this deep dive, we will explore how this new architecture works, the clever mathematical loss function that stabilizes predictions without relying on optical flow, and the inference strategy that allows it to scale to arbitrarily long sequences.
Background: The Challenge of Consistency
To understand why Video Depth Anything is significant, we first need to look at the foundations it builds upon.
The Foundation: Depth Anything V2
The core of this new model is Depth Anything V2, a powerful “foundation model” for Monocular Depth Estimation (MDE). Foundation models are trained on massive datasets, giving them a robust understanding of general scenes. They handle complex lighting, transparent surfaces, and intricate geometries far better than older models trained on small, specific datasets (like just indoor rooms or just driving scenes).
However, Depth Anything V2 is a spatial expert, not a temporal one. It doesn’t know that Frame 2 follows Frame 1.
The Problem with Previous Video Approaches
Researchers have tried to force temporal consistency onto MDE models in several ways:
- Test-Time Optimization: This involves fine-tuning the model on the specific video you want to process. While accurate, it is excruciatingly slow and impractical for real-time applications.
- Post-Processing with Optical Flow: Some methods try to smooth out the jitter by calculating “optical flow” (how pixels move between frames) and warping the depth map to match. However, if the optical flow calculation fails (which happens often in complex scenes), the depth map breaks.
- Video Diffusion Models: Models like DepthCrafter use generative AI to “dream” up the depth. These provide great detail but are extremely heavy computationally. As we saw in Figure 1, some of these models take nearly a second (910 ms) to process a single frame, whereas Video Depth Anything needs only tens of milliseconds, and under 10 ms in its smallest variant.
The researchers identified a need for a feedforward model—one that runs directly without iterative steps—that inherits the generalization of foundation models but adds temporal stability natively.
Core Method: Architecture and Design
The philosophy behind Video Depth Anything is to keep the powerful visual understanding of the image model but replace the part of the network responsible for making the final prediction with something that understands time.
The Architecture: Encoder and Spatiotemporal Head
The model architecture is elegant in its reuse of existing technology. It consists of two main parts: the Encoder and the Spatiotemporal Head.

1. The Encoder (Frozen)
The researchers use the pre-trained encoder from Depth Anything V2. This component extracts rich feature maps from input images. Crucially, they freeze this encoder during training. This decision serves two purposes:
- Efficiency: It reduces the computational cost of training.
- Preservation: It ensures the model retains the robust generalization capabilities learned from millions of images, rather than “forgetting” them to overfit on a smaller video dataset.
However, to process video, the input is reshaped. Instead of a single image (\(B \times C \times H \times W\)), the input is a batch of video clips (\(B \times N \times C \times H \times W\)), where \(N\) is the number of frames (e.g., 32 frames).
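To make the data flow concrete, here is a minimal PyTorch sketch of freezing an encoder and folding the frame dimension into the batch so an image backbone can process a video clip. The tiny `nn.Conv2d` is only a stand-in for the real Depth Anything V2 ViT encoder, and the shapes and clip length are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for the pre-trained Depth Anything V2 encoder; a real pipeline
# would load the actual ViT backbone and its weights instead.
encoder = nn.Conv2d(3, 64, kernel_size=14, stride=14)

# Freeze the encoder so that only the new spatiotemporal head is trained.
for p in encoder.parameters():
    p.requires_grad = False

# A batch of video clips: B clips x N frames x C channels x H x W pixels.
B, N, C, H, W = 2, 8, 3, 224, 224
clips = torch.randn(B, N, C, H, W)

# Fold the frame dimension into the batch so the image encoder sees
# ordinary (B*N, C, H, W) inputs, one frame at a time.
features = encoder(clips.reshape(B * N, C, H, W))    # (B*N, 64, 16, 16)

# Restore the temporal dimension for the spatiotemporal head.
features = features.reshape(B, N, *features.shape[1:])
print(features.shape)    # torch.Size([2, 8, 64, 16, 16])
```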
2. The Spatiotemporal Head (STH)
This is where the innovation happens. The standard “head” of an image model (usually a DPT head) only looks at spatial features. The researchers replaced this with a Spatiotemporal Head (STH).
As shown on the right side of Figure 2, the STH takes features from different stages of the encoder (\(F_1\) through \(F_4\)). It processes them through “Reassemble” and “Fusion” blocks, similar to standard depth networks. The key difference is the insertion of Temporal Layers.
The Temporal Layer
The temporal layer is a mechanism that allows the model to compare features across the \(N\) frames of the video clip.

Figure 10 details this operation. The system takes the feature map and reshapes it to isolate the temporal dimension (\(N\)). It then applies Multi-Head Self-Attention solely along the temporal axis.
This means that for a specific pixel location (say, the top-left corner), the model looks at how the features at that location change across all 32 frames. This allows the network to smooth out inconsistencies and understand the motion dynamics before predicting the final depth map.
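To make the mechanism concrete, here is a rough PyTorch sketch of such a temporal layer. It is not the authors' exact implementation; positional encodings, normalization placement, and head counts are illustrative choices.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention applied along the temporal axis at each spatial location.

    Illustrative sketch; the paper's temporal layers may differ in details
    such as positional encodings, normalization, or residual structure.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C, H, W) -- features for the N frames of a clip.
        B, N, C, H, W = x.shape
        # Treat every spatial location independently: the tokens are the N frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, N, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)          # residual connection
        # Restore the original (B, N, C, H, W) layout.
        return tokens.reshape(B, H, W, N, C).permute(0, 3, 4, 1, 2)

# Example: smooth per-frame features across an 8-frame clip.
layer = TemporalAttention(dim=64)
out = layer(torch.randn(2, 8, 64, 16, 16))
print(out.shape)  # torch.Size([2, 8, 64, 16, 16])
```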
The Loss Function: Temporal Gradient Matching
Designing the architecture is only half the battle. You also need to tell the model what “success” looks like mathematically. This is defined by the loss function.
Why “Optical Flow Warping” is Flawed
Traditional video depth methods often use an Optical Flow Warping (OPW) loss. The logic is: “If I know pixel A in Frame 1 moves to position B in Frame 2, the depth at A (in Frame 1) should be very similar to the depth at B (in Frame 2).”
\[ \mathcal{L}_{\mathrm{OPW}} = \frac{1}{N-1} \sum_{i=2}^{N} \left\| p_i - \hat{p}_i \right\|_1 , \]
The problem with this assumption (Equation 1 above) is that depth isn’t invariant. If a car is driving towards the camera, the depth of the car decreases between frames. Forcing the depth to stay the same actually confuses the model during training. Furthermore, calculating optical flow adds extra computational overhead and potential errors.
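For reference, here is a minimal sketch of what this baseline objective looks like in code, assuming the previous frame's depth has already been warped into the current frame by an external optical-flow module (the warping itself is omitted):

```python
import torch

def opw_loss(depth: torch.Tensor, warped_prev_depth: torch.Tensor,
             valid: torch.Tensor) -> torch.Tensor:
    """Optical Flow Warping loss sketch (Equation 1).

    depth:             (N-1, H, W) predicted depth for frames 2..N
    warped_prev_depth: (N-1, H, W) each previous frame's depth warped into the
                       current frame by optical flow (computed externally)
    valid:             (N-1, H, W) mask of pixels where the flow is reliable
    """
    diff = (depth - warped_prev_depth).abs() * valid
    return diff.sum() / valid.sum().clamp(min=1)
```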
The Solution: Conserving the Gradient
Instead of forcing depth values to match, the researchers propose Temporal Gradient Matching (TGM).
The insight is subtle but powerful: The rate at which depth changes in the prediction should match the rate at which depth changes in the ground truth.
If the ground truth says an object got 1 meter closer, the prediction should also get 1 meter closer. This allows for dynamic scenes (moving cars, walking people) where depth naturally changes.
\[ \mathcal{L}_{\mathrm{TGM}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left\| \, |d_{i+1} - d_i| - |g_{i+1} - g_i| \, \right\|_1 . \]
In Equation 5 (above):
- \(|d_{i+1} - d_i|\) is the magnitude of the change in predicted depth between consecutive frames.
- \(|g_{i+1} - g_i|\) is the magnitude of the change in ground-truth depth between the same frames.
- The loss minimizes the difference between these two changes.
This method removes the need for optical flow entirely. It purely looks at the temporal gradient of the depth values.
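In code, the idea is only a few lines. The sketch below ignores practical details such as masking invalid ground-truth pixels and any scale alignment applied to the predictions before comparison:

```python
import torch

def tgm_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Temporal Gradient Matching loss sketch.

    pred, gt: (N, H, W) predicted and ground-truth depth for N frames of a clip,
    assumed already aligned to a common scale. Penalizes the mismatch between
    the frame-to-frame change in the prediction and in the ground truth.
    """
    pred_change = (pred[1:] - pred[:-1]).abs()   # |d_{i+1} - d_i|
    gt_change = (gt[1:] - gt[:-1]).abs()         # |g_{i+1} - g_i|
    return (pred_change - gt_change).abs().mean()

# Example: a clip whose depth changes identically in prediction and ground
# truth incurs (essentially) zero loss, even though the values differ.
gt = torch.rand(8, 4, 4)
print(tgm_loss(gt + 0.3, gt))   # ~0, up to floating-point error
```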
The final loss function combines this new TGM loss with a standard spatial loss (Scale-and-Shift Invariant loss, or SSI) to ensure each individual frame still looks correct:
\[ \mathcal{L}_{\mathrm{all}} = \alpha \mathcal{L}_{\mathrm{TGM}} + \beta \mathcal{L}_{\mathrm{ssi}} , \]
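Here \(\alpha\) and \(\beta\) simply weight the two terms. As a hedged sketch of how this objective could be assembled, the snippet below pairs the `tgm_loss` sketch above with a simple per-frame least-squares scale-and-shift alignment standing in for the SSI term; the paper's exact SSI formulation and weighting may differ.

```python
import torch

def ssi_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift-invariant loss sketch: per frame, align the prediction
    to the ground truth with a closed-form least-squares scale s and shift t,
    then take the mean absolute error of the aligned prediction."""
    p = pred.flatten(1)                                   # (N, H*W)
    g = gt.flatten(1)
    p_mean = p.mean(dim=1, keepdim=True)
    g_mean = g.mean(dim=1, keepdim=True)
    var = ((p - p_mean) ** 2).mean(dim=1, keepdim=True).clamp(min=1e-8)
    s = ((p - p_mean) * (g - g_mean)).mean(dim=1, keepdim=True) / var   # scale
    t = g_mean - s * p_mean                                             # shift
    return (s * p + t - g).abs().mean()

def total_loss(l_tgm: torch.Tensor, l_ssi: torch.Tensor,
               alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # Weighted combination of the temporal (TGM) and per-frame spatial (SSI)
    # terms; alpha and beta here are placeholders, not the paper's values.
    return alpha * l_tgm + beta * l_ssi
```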
Inference Strategy for Super-Long Videos
Training on video clips is one thing, but running inference on a 5-minute video is another. GPU memory is limited; you can’t just feed 10,000 frames into the model at once. You have to process the video in chunks (windows).
The danger of processing chunks is that the model might interpret the scale of Chunk 1 differently from Chunk 2. This leads to “scale drift,” where an object might appear 5 meters away in one second and 10 meters away the next, simply because the window shifted.
To solve this, the authors designed a sophisticated Key-Frame Referencing strategy.

As shown in Figure 3, constructing the input for the next inference window involves three components:
- Future Frames: The new content we want to estimate (\(N - T_o - T_k\)).
- Overlapping Frames (\(T_o\)): Frames from the immediate end of the previous window. These provide immediate continuity.
- Key Frames (\(T_k\)): Frames sampled from much earlier in the video (using interval \(\Delta_k\)).
Why Key Frames? By including frames from the distant past in the current window’s input, the model is “reminded” of the global scale it established earlier. This anchors the prediction and prevents the scale from drifting over time.
Finally, the overlapping regions are stitched together using a linear interpolation (blending) to ensure there are no visible seams between windows.
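Below is a sketch of how such a window could be assembled and how the overlapping predictions might be blended. The clip length, overlap size, number of key frames, and key-frame interval \(\Delta_k\) are illustrative placeholders rather than the paper's settings.

```python
import numpy as np

def build_window(next_start: int, clip_len: int = 32,
                 overlap: int = 8, num_key: int = 2, key_interval: int = 12):
    """Select the frame indices fed to the model for the next inference window.

    next_start: index of the first frame that has not been estimated yet.
    Returns key frames (distant past), overlapping frames (immediate past),
    and the new future frames, totalling clip_len indices.
    """
    overlap_frames = list(range(max(0, next_start - overlap), next_start))
    # Key frames sampled further back in time with stride key_interval,
    # reminding the model of the scale it established earlier.
    key_frames = [max(0, next_start - overlap - i * key_interval)
                  for i in range(num_key, 0, -1)]
    num_future = clip_len - len(overlap_frames) - len(key_frames)
    future_frames = list(range(next_start, next_start + num_future))
    return key_frames + overlap_frames + future_frames

def blend_overlap(prev_depth: np.ndarray, new_depth: np.ndarray) -> np.ndarray:
    """Linearly blend the two windows' depth over the T_o overlapping frames,
    fading from the previous window's prediction to the new one to hide seams."""
    t_o = prev_depth.shape[0]
    w = np.linspace(0.0, 1.0, t_o).reshape(t_o, 1, 1)
    return (1.0 - w) * prev_depth + w * new_depth

# Example: the window starting at frame 40 mixes 2 key frames, 8 overlapping
# frames, and 22 new frames.
print(build_window(next_start=40))
```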
Experiments and Results
The researchers evaluated Video Depth Anything (VDA) against top competitors, including diffusion-based models like DepthCrafter and DepthAnyVideo.
Zero-Shot Performance
The primary test is “zero-shot,” meaning the model is tested on datasets it never saw during training. This tests true generalization.

Table 1 highlights the results.
- Accuracy (\(\delta_1\)): VDA (specifically the Large model, VDA-L) achieves the highest scores across almost all datasets (KITTI, Scannet, Bonn, NYUv2). A short sketch of how \(\delta_1\) is computed follows this list.
- Consistency (TAE): The Temporal Alignment Error is significantly lower for VDA (0.570) compared to competitors like NVDS (2.176) or DepthCrafter (0.639). Lower TAE means less flickering.
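For context, \(\delta_1\) is the standard depth-accuracy metric: the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth, typically computed after aligning the prediction's scale (and shift) to the ground truth. A minimal implementation:

```python
import numpy as np

def delta1(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25.

    pred, gt: depth maps already aligned to a common scale.
    valid: boolean mask of pixels with usable ground truth.
    """
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < 1.25).mean())
```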
Long Video Stability
One of the paper’s boldest claims is the ability to handle super-long videos. To test this, they evaluated performance on video lengths up to 500 frames.

Figure 4 plots accuracy against video length.
- DepthCrafter (Blue line): As the video gets longer, accuracy drops significantly. This indicates scale drift or accumulated errors.
- VDA-L (Red line): The line is nearly flat. The performance remains stable regardless of whether the video is 100 frames or 500 frames long. This validates the success of the key-frame inference strategy.
Visual Quality
Numbers are useful, but depth estimation is a visual task. Let’s look at the qualitative comparisons.
Long Video Scenarios

In Figure 5, look at the timeline slices (the vertical strips).
- DAv2: Shows jagged stripes, indicating flickering over time.
- DepthCrafter (DC): Shows “drift.” The depth colors shift gradually even if the object distance doesn’t change.
- Ours: The timeline is smooth and consistent, closely matching the Ground Truth (GT).
Short Video Scenarios

In Figure 6, we see difficult “in-the-wild” scenarios.
- Row 2 (Rally Car): Notice the smoke. DepthCrafter gets confused by the smoke, creating artifacts (Red box). VDA handles the semi-transparent volume much better.
- Row 5 (Bear): The complex texture of the bear’s fur and the rocks causes other models to hallucinate inconsistent geometries. VDA maintains the shape of the bear distinctly from the background.
Computational Efficiency
Perhaps the most practical advantage of VDA is speed.

Table 3 reveals a stark difference in latency (processing time per frame):
- DepthCrafter: 910ms (nearly 1 second per frame).
- VDA-L (Large): 67ms.
- VDA-S (Small): 9.1ms.
The small version of VDA is capable of running at over 100 FPS on an A100 GPU, making it the only viable option among these high-quality models for real-time applications like robotics or autonomous driving.
Application: 3D Point Clouds
Consistent depth allows for the creation of 3D point clouds. If the depth flickers, the 3D reconstruction will look like a noisy cloud of dust. If it drifts, the geometry will stretch and warp.
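To see why consistency matters geometrically, here is a minimal sketch of back-projecting one depth frame into a 3D point cloud with the pinhole camera model. The intrinsics below are made-up values, and the depth is assumed to already be in metric units; per-frame clouds would then be accumulated across the video using camera poses.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) metric depth map into an (H*W, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Illustrative intrinsics; real values come from camera calibration.
points = depth_to_point_cloud(np.full((480, 640), 5.0),
                              fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3)
```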

Figure 12 shows a point cloud generated from 5 seconds of driving footage.
- DepthCrafter: Creates “layers” or visible slices in the road, a result of inconsistent depth steps.
- VDA: Generates a smooth, continuous road surface and distinct vertical structures for the trees and signs.
Conclusion
Video Depth Anything marks a significant step in the maturation of video depth estimation. By rejecting the complexity of optical flow and the high computational cost of diffusion models, the authors have returned to a cleaner, more efficient feedforward transformer-based approach.
The key takeaways are:
- Temporal Attention is Efficient: You don’t need to retrain a massive encoder. A lightweight spatiotemporal head is enough to teach an image model about time.
- Gradients > Values: Matching the change in depth (Temporal Gradient Matching) is a more robust training signal than warping pixels based on optical flow.
- Context is King: For long videos, looking back at distant key frames prevents the model from “forgetting” the scale of the world.
With the release of the VDA-Small model running at 9ms, we are likely to see this technology integrated quickly into downstream applications, giving robots and software a consistent, reliable pair of eyes to understand the moving world.