Imagine trying to reconstruct a 3D world from a standard 2D video. For humans, this is intuitive; we understand that as a car moves forward, it gets closer, or that a tree passing by is distinct from the mountain in the background. For computers, however, this task—known as monocular video depth estimation—is notoriously difficult.

While AI has made massive strides in estimating depth from single images (thanks to models like Depth Anything), applying these frame-by-frame methods to video results in a jarring “flickering” effect. The depth jumps around wildly because the model doesn’t understand that Frame 1 and Frame 2 are part of the same continuous reality. Existing solutions often rely on complex camera pose estimation or optical flow, which frequently break down in “open-world” videos where camera movements are unpredictable or dynamic objects (like running animals) dominate the scene.

Enter DepthCrafter.

In a recent paper, researchers from Tencent AI Lab and HKUST introduced a novel approach that repurposes powerful video diffusion models to solve this consistency problem. The result? Smooth, highly detailed, and temporally consistent depth maps for long videos without needing any camera tracking data.

Figure 1: Comparison of DepthCrafter against existing methods. Note the smooth temporal consistency in the bottom row compared to the noisy output of Depth-Anything-V2.

In this post, we will tear down the architecture of DepthCrafter, exploring how it leverages the generative power of Stable Video Diffusion (SVD), its unique three-stage training process, and a clever inference strategy that allows it to handle videos of virtually any length.


1. The Challenge of the Open World

To understand why DepthCrafter is significant, we first need to define the problem.

Monocular Depth Estimation is the task of predicting the distance of every pixel in an image relative to the camera. When scaling this to video, two main challenges arise:

  1. Temporal Consistency: If you run a single-image depth estimator on a video, the estimated depth scale often shifts randomly between frames. A car might look 10 meters away in one frame and 15 meters away in the next, purely due to per-frame estimation noise. This causes “flickering.”
  2. Open-World Diversity: Traditional video depth methods often solve flickering by calculating the camera’s movement (pose). However, in “open-world” scenarios—like a handheld video of a pet running—calculating accurate camera pose is extremely difficult due to shaking, moving objects, and complex lighting.

The researchers hypothesized that generative video models, which have learned to create realistic moving scenes, implicitly understand 3D geometry and temporal continuity. DepthCrafter is designed to extract that understanding.


2. Background: Diffusion Models and Latent Spaces

DepthCrafter is built upon Stable Video Diffusion (SVD). To grasp how it works, we need a quick primer on the underlying concepts.

The Diffusion Process

Diffusion models are generative models that learn to create data by reversing a noise-adding process. During training, Gaussian noise is gradually added to an image or video until it is unrecognizable. The model’s job is to learn how to remove that noise (denoise) to recover the original signal.

Mathematically, the forward process adds noise \(\epsilon\) to data \(\mathbf{x}_0\) to create a noisy version \(\mathbf{x}_t\) at time step \(t\):

Equation 1: The forward diffusion process adding Gaussian noise.
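Since SVD is built on the EDM diffusion formulation, the forward process is typically written in the following standard form (notation matched to this post, not copied verbatim from the paper):

\[
\mathbf{x}_t = \mathbf{x}_0 + \sigma_t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\]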

A denoiser \(D_\theta\) is trained to predict the clean data from the noisy input. The training objective minimizes the difference between the denoiser’s prediction and the actual clean data, weighted by the noise level \(\sigma_t\):

Equation 2: The denoising score matching loss function.
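A standard statement of this denoising score matching objective, again in EDM-style notation with a noise-level-dependent weighting \(\lambda(\sigma_t)\), is:

\[
\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}_0,\, \epsilon,\, \sigma_t}\Big[\, \lambda(\sigma_t)\, \big\| D_\theta(\mathbf{x}_0 + \sigma_t\,\epsilon;\, \sigma_t) - \mathbf{x}_0 \big\|_2^2 \,\Big]
\]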

Latent Diffusion and VAEs

Working directly with high-resolution video pixels is computationally expensive. Therefore, DepthCrafter operates as a Latent Diffusion Model (LDM). This means it uses a Variational Autoencoder (VAE) to compress the video into a lower-dimensional “latent space.”

Equation 4: The VAE encoder and decoder transformation.
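Written out, this is simply an encode/decode round trip in the post’s notation, applied to either the video or the depth sequence:

\[
\mathbf{z} = \mathcal{E}(\mathbf{x}), \qquad \hat{\mathbf{x}} = \mathcal{D}(\mathbf{z}), \qquad \mathbf{x} \in \{\mathbf{v}, \mathbf{d}\}
\]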

Here, \(\mathcal{E}\) is the encoder that turns the video \(\mathbf{v}\) (or depth \(\mathbf{d}\)) into a latent representation \(\mathbf{z}\). The diffusion process happens in this compressed \(\mathbf{z}\) space, and the decoder \(\mathcal{D}\) expands the result back into pixels. This makes the generation of long, high-resolution depth sequences feasible on modern GPUs.


3. The Core Method: From Video Generation to Depth Estimation

The core idea of DepthCrafter is to treat depth estimation as a conditional generation problem: \(p(\mathbf{d} | \mathbf{v})\). In simple terms, the model generates a “video” of depth maps \(\mathbf{d}\) conditioned on the input color video \(\mathbf{v}\).

Architecture Overview

The architecture retains the U-Net structure of Stable Video Diffusion but modifies how it receives information.

Figure 2: Overview of the DepthCrafter architecture and the three-stage training strategy.

As shown in Figure 2, the process involves two main pathways:

  1. Visual Embedding: The input video \(\mathbf{v}\) is encoded into a latent representation \(\mathbf{z}^{(v)}\).
  2. Noisy Input: The process starts with Gaussian noise (or a noisy depth map during training) \(\mathbf{z}^{(d)}_t\).

Crucially, the researchers adapted the conditioning mechanism. In the original SVD, the model is usually conditioned on a single image to generate a video. For DepthCrafter, the model must condition on the entire sequence of input video frames to produce a corresponding sequence of depth frames. This is achieved by:

  • Concatenation: The latent representation of the input video is concatenated frame-by-frame with the noisy depth latent.
  • Cross-Attention: CLIP embeddings of the video frames are injected into the U-Net via cross-attention layers, providing high-level semantic understanding (e.g., “this is a dog,” “this is a wall”).
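To make these two mechanisms concrete, here is a minimal PyTorch-style sketch of how the U-Net inputs could be assembled. The tensor shapes, the CLIP embedding width, and the commented U-Net call are illustrative assumptions, not the authors’ actual code:

```python
import torch

# Illustrative shapes: batch B, frames T, latent channels C, latent H x W.
B, T, C, H, W = 1, 25, 4, 40, 72

# Latent of the input RGB video, z^(v), produced by the (frozen) VAE encoder.
video_latent = torch.randn(B, T, C, H, W)

# Noisy depth latent z^(d)_t at the current diffusion step.
noisy_depth_latent = torch.randn(B, T, C, H, W)

# 1) Concatenation: video and depth latents are stacked frame-by-frame
#    along the channel axis, doubling the U-Net's input channels.
unet_input = torch.cat([noisy_depth_latent, video_latent], dim=2)  # (B, T, 2C, H, W)

# 2) Cross-attention: per-frame CLIP image embeddings act as the
#    semantic conditioning context (1024 is the usual CLIP ViT-H width).
clip_context = torch.randn(B, T, 1024)

# The SVD-style U-Net would then predict the denoised depth latent, e.g.:
# denoised_depth_latent = unet(unet_input, noise_level, context=clip_context)
```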

The denoiser function uses preconditioning (scaling the inputs and outputs) to stabilize training, formulated as:

Equation 3: The preconditioning function for the denoiser.
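SVD inherits EDM-style preconditioning, whose standard form is shown below. The \(c\)-coefficients follow the usual EDM naming, and the conditioning on the video latent is omitted for readability; this is the generic template rather than a verbatim copy from the paper:

\[
D_\theta(\mathbf{x};\, \sigma) = c_{\text{skip}}(\sigma)\,\mathbf{x} + c_{\text{out}}(\sigma)\, F_\theta\big(c_{\text{in}}(\sigma)\,\mathbf{x};\, c_{\text{noise}}(\sigma)\big)
\]

Here, \(F_\theta\) is the raw U-Net, and the scalings keep its inputs and outputs at a consistent magnitude across noise levels.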

The Three-Stage Training Strategy

One of the paper’s most significant contributions is its training recipe. Training a video model on long sequences is incredibly memory-intensive. To solve this, the authors devised a progressive three-stage strategy that balances content diversity, temporal length, and fine-grained details.

Stage 1: Alignment (Realistic Data, Short Clips)

  • Goal: Adapt the pre-trained image-to-video model to the task of depth estimation.
  • Data: Large-scale realistic datasets (roughly 200K paired video-depth sequences).
  • Configuration: The model is trained on short clips (randomly sampled 1 to 25 frames).
  • Why: This stage teaches the model the fundamental correlation between RGB pixels and depth values without overwhelming the memory with long sequences.

Stage 2: Temporal Consistency (Realistic Data, Long Clips)

  • Goal: Enable the model to handle long-term dependencies and fix flickering.
  • Data: The same realistic datasets.
  • Configuration: The sequence length is increased significantly (up to 110 frames).
  • Optimization: Crucially, they only fine-tune the temporal layers (the parts of the U-Net that mix information across time). The spatial layers are frozen.
  • Why: Since the spatial layers already learned “what objects look like” in Stage 1, Stage 2 focuses entirely on “how objects move and stay consistent.” Freezing spatial layers saves massive amounts of memory, allowing for longer sequence training.
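Here is a rough PyTorch sketch of what freezing the spatial layers while training only the temporal ones could look like. Identifying temporal blocks by the substring "temporal" in parameter names is an assumption about the U-Net’s naming scheme, not the authors’ released code:

```python
import torch

def freeze_all_but_temporal(unet: torch.nn.Module) -> list:
    """Freeze every parameter except those in temporal layers.

    Assumes temporal mixing modules can be recognized by "temporal"
    appearing in their parameter names (common in SVD-style U-Net
    implementations, but an assumption here).
    """
    trainable = []
    for name, param in unet.named_parameters():
        if "temporal" in name:
            param.requires_grad = True   # fine-tuned in Stage 2
            trainable.append(param)
        else:
            param.requires_grad = False  # spatial layers stay frozen
    return trainable

# Only the temporal parameters would be handed to the optimizer, which is
# what keeps 110-frame training within memory limits, e.g.:
# optimizer = torch.optim.AdamW(freeze_all_but_temporal(unet), lr=1e-5)
```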

Stage 3: Fine-Grained Details (Synthetic Data, Medium Clips)

  • Goal: Sharpen the edges and improve precise depth details.
  • Data: Synthetic datasets (like DynamicReplica and MatrixCity).
  • Configuration: Fixed length of 45 frames.
  • Optimization: They fine-tune the spatial layers.
  • Why: Realistic datasets often have noisy or “fat” depth edges because they are generated by stereo matching algorithms. Synthetic data has pixel-perfect ground truth. By refining the spatial layers on this data, DepthCrafter learns to produce sharp, high-fidelity edges while retaining the temporal consistency learned in Stage 2.

Inference for Extremely Long Videos

Even with Stage 2 training, the model is capped at generating about 110 frames at a time due to GPU memory limits. Real-world videos can be thousands of frames long. Simply chopping the video into blocks and processing them individually would re-introduce flickering at the boundaries (the “scale shift” problem).

To solve this, the authors propose a Segment-wise Inference with Stitching strategy.

Figure 3: The inference strategy for extremely long videos using overlap and stitching.

The process works as follows:

  1. Overlapping Segments: The video is divided into segments that overlap (e.g., frames 1-110, then frames 90-200).
  2. Noise Initialization: For the second segment, the latent noise is not purely random. The researchers initialize the noise for the overlapping section using the denoised result from the previous segment. This acts as an anchor, forcing the new segment to adopt the same depth scale and shift as the previous one.
  3. Latent Stitching: In the overlapping region (frames 90-110 in our example), the model produces two predictions: the tail of segment 1 and the head of segment 2. These are fused in the latent space using a linear weighting (\(w_i\)).
  • Frame 90 relies mostly on Segment 1.
  • Frame 110 relies mostly on Segment 2.
  • Frames in between are smoothly interpolated.
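Here is a minimal sketch of the latent-space blending in step 3. The linear ramp for the weights \(w_i\) follows the description above; the overlap length, tensor shapes, and function name are illustrative:

```python
import torch

def stitch_overlap(tail_prev: torch.Tensor, head_next: torch.Tensor) -> torch.Tensor:
    """Blend the overlapping latents of two segments with linear weights.

    tail_prev, head_next: (O, C, H, W) latents for the O overlapping frames,
    predicted by the previous and the next segment respectively.
    """
    overlap = tail_prev.shape[0]
    # w ramps from 1 (trust the previous segment) down to 0 (trust the next).
    w = torch.linspace(1.0, 0.0, overlap).view(-1, 1, 1, 1)
    return w * tail_prev + (1.0 - w) * head_next

# Example: 21 overlapping frames (frames 90-110 in the running example).
tail = torch.randn(21, 4, 40, 72)
head = torch.randn(21, 4, 40, 72)
blended = stitch_overlap(tail, head)  # frame 90 ~ segment 1, frame 110 ~ segment 2
```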

This “mortise-and-tenon” approach ensures that the transition between processing blocks is seamless, allowing DepthCrafter to process videos of arbitrary length.


4. Experiments and Results

The researchers evaluated DepthCrafter against state-of-the-art competitors, including Depth Anything V2 (a strong single-image model) and NVDS (a video depth stabilization method).

Quantitative Performance

The evaluation was performed in a “zero-shot” setting, meaning the model was tested on datasets it never saw during training (Sintel, ScanNet, KITTI, Bonn).

Table 1: Zero-shot quantitative comparison on multiple datasets.

The table above shows two key metrics:

  • AbsRel (Absolute Relative Error): Lower is better. It measures the average relative difference between the predicted depth and the ground truth.
  • \(\delta_1\) (Accuracy): Higher is better. It measures the percentage of pixels whose predicted depth falls within a factor of 1.25 of the ground truth.
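Both metrics have standard definitions in the depth-estimation literature. A small NumPy sketch, computed over valid ground-truth pixels (in practice the prediction’s scale and shift are aligned to the ground truth first, since these models predict relative depth), looks like this:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """Standard AbsRel and delta_1 metrics over valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    # AbsRel: mean relative error, lower is better.
    abs_rel = np.mean(np.abs(pred - gt) / gt)

    # delta_1: fraction of pixels whose ratio to ground truth is within 1.25.
    ratio = np.maximum(pred / gt, gt / pred)
    delta_1 = np.mean(ratio < 1.25)
    return abs_rel, delta_1
```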

Key Takeaway: DepthCrafter achieves the best performance across almost all metrics. Notably, on the KITTI dataset (outdoor driving), it outperforms Depth-Anything-V2 by a massive margin (AbsRel 0.104 vs 0.140), proving the value of its temporal understanding.

Qualitative Analysis: The Temporal Profile

Numbers are great, but in video, visual consistency is king. To visualize this, the authors used “temporal profiles.” By slicing the video along the time axis (taking a single vertical line of pixels from every frame and stacking them), you can see stability.

  • Jagged/Zig-zag lines: Indicate flickering (the depth of a static object is changing frame-to-frame).
  • Smooth/Straight lines: Indicate temporal consistency.
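Building such a temporal profile is a one-line slicing operation once the predicted depth video is in memory. The sketch below assumes a depth array of shape (T, H, W) and an arbitrary column index:

```python
import numpy as np

def temporal_profile(depth_video: np.ndarray, column: int) -> np.ndarray:
    """Slice one vertical line of pixels from every frame and stack over time.

    depth_video: (T, H, W) array of per-frame depth maps.
    Returns an (H, T) image: rows are image height, columns are frames.
    Straight horizontal streaks mean temporally stable depth; jagged
    streaks mean flickering.
    """
    return depth_video[:, :, column].T

# Example with random data standing in for predicted depth.
profile = temporal_profile(np.random.rand(110, 320, 576), column=288)
```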

Figure 4: Qualitative comparison showing temporal profiles (green boxes).

In Figure 4, look at the green boxes. The competitors (Depth-Anything-V2 and NVDS) show vertical jittering in the temporal slices. DepthCrafter’s profiles are smooth and continuous, effectively reconstructing the motion of the object rather than just independent frames.

Why the Inference Strategy Matters

The authors also performed an ablation study to prove the necessity of their stitching strategy.

Figure 6: Ablation study of the inference strategy.

  • Baseline: Independent segments. Result: Visible seams and jumps in depth.
  • + Initialization: Better consistency, but still some artifacts at boundaries.
  • + Stitching (Full Method): Smooth transitions in both static (yellow arrow) and dynamic (blue arrow) regions.

Bonus: Single Image Performance

Interestingly, because DepthCrafter was trained to handle variable lengths (starting from 1 frame), it can also function as a single-image depth estimator.

Figure 5: Single-image depth estimation comparison.

Despite being a video model, it produces sharper details than Depth-Anything-V2 in complex scenes (like the truss bridge in the top row), likely due to the high-resolution synthetic data training in Stage 3.


5. Applications and Impact

Precise, consistent video depth opens the door to advanced visual effects that were previously only possible with expensive hardware or manual rotoscoping.

Figure 7: Applications in visual effects, including fog addition and relighting.

  • Atmospheric Effects: You can insert realistic fog that respects the actual distance of objects (thicker fog further away).
  • Relighting: Changing the time of day requires knowing the 3D geometry of the scene to cast shadows correctly.
  • Conditional Video Generation: Using the depth map as a structural guide to generate new videos (e.g., turning a horse rider into a sci-fi character) using tools like ControlNet.

6. Conclusion

DepthCrafter represents a significant leap forward in computer vision. By treating depth estimation not just as a regression task, but as a generative one, it leverages the massive priors locked inside foundation models like Stable Video Diffusion.

Its success relies on three pillars:

  1. Generative Prior: Using SVD to understand open-world content.
  2. Smart Training: A three-stage curriculum that balances realism, consistency, and detail.
  3. Flexible Inference: A stitching mechanism that decouples video length from memory constraints.

For students and researchers, DepthCrafter serves as an excellent case study in how to repurpose generative AI for discriminative tasks (like depth estimation) and how to engineer training pipelines to overcome hardware limitations.


Note: This blog post explains the research presented in the paper “DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos” by Hu et al. All images are sourced from the original paper.