Depth estimation—the ability to look at a 2D image and understand the 3D geometry within it—is a cornerstone of computer vision. It is the prerequisite for autonomous driving, robot navigation, mixed reality, and content generation. However, building an “ideal” depth estimation model has historically been a game of trade-offs.
You usually have to pick two of the following three:
- Meticulous Detail: Can the model see the fine edges of leaves or the texture of a distant building?
- Temporal Consistency: If applied to a video, does the depth map flicker, or is it stable over time?
- Efficiency: Can it run in real-time on a robot, or does it take several seconds per frame?
Recent foundation models like Marigold or Depth Anything have pushed the boundaries of detail, but often at the cost of speed or video stability. In this post, we will explore CH3Depth, a new research paper that proposes a unified framework to solve this trilemma. Using a technique called Flow Matching combined with a novel sampling strategy, CH3Depth achieves state-of-the-art results in both image and video depth estimation while being significantly faster than its predecessors.

Background: From Diffusion to Flow Matching
To understand CH3Depth, we need to briefly look at the generative models that power it. Recently, Diffusion Models (like Stable Diffusion) have been repurposed for depth estimation. Instead of generating a random image from noise, they generate a depth map conditioned on an input image. While accurate, diffusion models are inherently slow because they require many iterative steps to “denoise” the result.
Flow Matching is a newer alternative. Like diffusion, it transforms a simple distribution (noise) into a complex data distribution (depth maps). However, whereas diffusion models simulate a stochastic (random) process, flow matching learns a deterministic Vector Field. It attempts to construct a straight line—an “optimal transport” path—between the noise and the target data. This generally allows for faster inference because the path from “noisy” to “clean” is more direct.
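To make this concrete, here is a minimal sketch of a generic flow matching training step (my own illustration, not the paper's code); `model` and `z_cond` are hypothetical stand-ins for the velocity network and the conditioning latent:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, z_data, z_cond):
    """One generic flow matching training step (illustrative only).

    The network is trained to predict the constant velocity of a straight
    path from Gaussian noise to the data latent: v_target = z_data - noise.
    """
    noise = torch.randn_like(z_data)                        # source distribution
    t = torch.rand(z_data.shape[0], device=z_data.device)   # random time in [0, 1]
    t_ = t.view(-1, 1, 1, 1)                                # broadcast over (C, H, W)
    z_t = (1 - t_) * noise + t_ * z_data                    # point on the straight path
    v_target = z_data - noise                               # "optimal transport" velocity
    v_pred = model(z_t, t, z_cond)                          # predicted vector field
    return F.mse_loss(v_pred, v_target)
```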
The researchers behind CH3Depth identified that while previous flow-matching approaches (like DepthFM) were a step in the right direction, they still wasted computational capacity and used inefficient sampling strategies.
The CH3Depth Framework
The core philosophy of CH3Depth is to build a “foundation model” that is flexible enough for both single images and videos. The architecture processes images in a Latent Space. Instead of working on raw pixels (which is computationally expensive), it uses a Variational Autoencoder (VAE) to compress the image and depth map into compact representations (latent codes).

As shown in the pipeline above, the RGB image is first encoded into a latent code (\(z_c\)). The model then recovers the depth latent code (\(z_d\)) from Gaussian noise, conditioned on \(z_c\); decoding \(z_d\) with the VAE yields the final depth map.
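In code, the high-level inference loop might look roughly like the sketch below; `vae` and `flow_model` are hypothetical placeholders, and the real CH3Depth components will differ in detail:

```python
import torch

@torch.no_grad()
def predict_depth(vae, flow_model, rgb, num_steps=2):
    """Illustrative latent-space inference loop (module names are hypothetical).

    Encode the RGB image, start from Gaussian noise in latent space,
    integrate the learned vector field toward the depth latent, then decode.
    """
    z_c = vae.encode(rgb)                         # conditioning latent z_c
    z = torch.randn_like(z_c)                     # start from pure noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)  # uniform schedule for simplicity
    for t0, t1 in zip(ts[:-1], ts[1:]):
        t_batch = torch.full((z.shape[0],), float(t0), device=z.device)
        v = flow_model(z, t_batch, z_c)           # predicted velocity at time t0
        z = z + (t1 - t0) * v                     # Euler step along the flow
    return vae.decode(z)                          # depth latent z_d -> depth map
```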
There are three major technical innovations that make this specific framework efficient and accurate:
- InDI Flow Matching: A better optimization objective.
- Non-uniform Sampling: A smarter way to take steps during inference.
- Latent Temporal Stabilizer (LTS): A module for consistent video depth.
Let’s break these down.
1. InDI: Focusing on the “Now”
Standard flow matching trains a network to predict a globally fixed velocity field—essentially asking, “From this noise, where is the final target?” However, this can be inefficient because the network has to predict the entire global path at every step.
CH3Depth reframes the objective using Inversion by Direct Iteration (InDI). Instead of predicting the global path from the source noise to the target, InDI focuses on the transport from the current distribution to the target distribution.
The standard Flow Matching loss, written here with the common linear path \(z_t = (1-t)\,\epsilon + t\,z_d\) from Gaussian noise \(\epsilon\) to the depth latent \(z_d\), regresses the network onto a single global target velocity:

\[
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, z_d,\, \epsilon}\,\big\| v_\theta(z_t, t, z_c) - (z_d - \epsilon) \big\|^2 .
\]
In contrast, the InDI loss replaces that fixed global target with the transport from the current, partially denoised latent toward the target.
The difference lies entirely in the target term. The InDI formulation is equivalent to assigning noise-level-dependent weights (coefficients) to the training inputs, forcing the model to focus on the optimal transport from wherever the denoising process currently is. Empirically, this “InDI Flow Matching” places more weight on the difficult early denoising stages, which yields higher accuracy and much finer detail in the resulting depth maps.
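One way to make the weighting claim concrete (this is a reading of the idea, not necessarily the paper's exact derivation): suppose the network is instead asked to regress the clean latent obtained by extrapolating along its predicted velocity, \(\hat{z}_d = z_t + (1-t)\,v_\theta(z_t, t, z_c)\). On the linear path above,

\[
\big\| \hat{z}_d - z_d \big\|^2 = (1-t)^2 \,\big\| v_\theta(z_t, t, z_c) - (z_d - \epsilon) \big\|^2 ,
\]

which is exactly the standard Flow Matching loss re-weighted by \((1-t)^2\): relatively more weight on the noisy early part of the trajectory (small \(t\)) and less on the nearly clean end.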
2. Non-Uniform Sampling: Speeding Up Inference
In generative models, “inference” involves taking multiple steps to clean up the noisy data. Most previous methods use Uniform Sampling, meaning the step size between each update is identical.
The authors of CH3Depth argue that this runs counter to standard optimization practice: you want rapid convergence at the beginning (large steps that capture the overall structure) and fine adjustments in the later stages (small steps that refine the details).
To implement this, they remap the uniform time steps through a concave mapping function. The update rule for the inference process becomes:

Here, \(f(t)\) acts as a mapping function. By using a concave function (specifically \(f(x) = x^{1/2}\)), the model takes larger steps early in the denoising process and smaller steps towards the end.
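A tiny sketch of such a schedule (illustrative; the paper's exact update rule may differ) shows how a concave remapping front-loads the step sizes:

```python
import torch

def nonuniform_times(num_steps, f=lambda x: x ** 0.5):
    """Remap a uniform time grid through a concave function f, giving a
    front-loaded schedule: big steps early, small steps late."""
    u = torch.linspace(0.0, 1.0, num_steps + 1)
    return f(u)

print(nonuniform_times(2, f=lambda x: x))  # uniform:  tensor([0.0000, 0.5000, 1.0000])
print(nonuniform_times(2))                 # concave:  tensor([0.0000, 0.7071, 1.0000])
```

Dropping this schedule into the illustrative `predict_depth` loop above, in place of the uniform `torch.linspace`, would be all the change that sketch needs at inference time.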
The result? CH3Depth can produce accurate, detailed depth maps in just 1 or 2 steps, whereas competitive models like Marigold often require 10 to 50 steps.
3. Latent Temporal Stabilizer (LTS) for Video
One of the biggest issues with applying image-based depth models to video is flickering. If you process each frame of a video independently, slight changes in lighting or noise can cause the estimated depth of a stationary object to jump around.
CH3Depth solves this with the Latent Temporal Stabilizer (LTS). Since the main model already operates in the latent space (thanks to the VAE), the LTS is designed as a lightweight module that aggregates latent codes from adjacent frames.
It uses a sliding window approach. For a specific frame, it looks at the predicted depth latents of previous frames and the current frame, fuses them, and outputs a temporally consistent result.
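As a rough sketch of the idea (the actual LTS architecture is not described here, so the module below is only illustrative), a sliding-window fuser over depth latents could look like this:

```python
import torch
import torch.nn as nn

class LatentTemporalStabilizer(nn.Module):
    """A minimal sketch of a sliding-window latent fuser.

    It aggregates the depth latents of the current frame and its
    predecessors into one temporally smoothed latent for the current frame.
    """
    def __init__(self, latent_channels=4, window=3):
        super().__init__()
        self.window = window
        self.fuse = nn.Conv2d(latent_channels * window, latent_channels,
                              kernel_size=3, padding=1)

    def forward(self, latents):
        # latents: list of per-frame depth latents, each (B, C, H, W), oldest first
        frames = list(latents[-self.window:])
        while len(frames) < self.window:      # pad at the start of the video
            frames.insert(0, frames[0])
        return self.fuse(torch.cat(frames, dim=1))
```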
To train this module effectively, the researchers faced a data problem: synthetic video data is perfect but lacks diversity, while real-world video data (natural scenes) has diverse content but imperfect ground truth labels. They introduced a Temporal Consistent Deviation Loss:

This loss function allows the network to learn consistency without strictly overfitting to potentially inaccurate sensor data in natural videos. It essentially asks the network to predict a consistent deviation from the ground truth, rather than the exact raw values, smoothing out the jitter.
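The paper's exact loss is not reproduced here, but one illustrative way to write such a “consistent deviation” objective is to penalize how the prediction error changes between neighboring frames rather than the error itself:

\[
\mathcal{L}_{\mathrm{dev}} = \frac{1}{K-1}\sum_{k=1}^{K-1}\Big\| \big(\hat{d}_{k+1} - d_{k+1}\big) - \big(\hat{d}_k - d_k\big) \Big\|_1 ,
\]

where \(\hat{d}_k\) is the predicted depth of frame \(k\) and \(d_k\) its (possibly noisy) ground truth. A constant offset caused by imperfect labels is not penalized; frame-to-frame jitter is.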
Experiments and Results
The researchers conducted extensive zero-shot evaluations (testing on datasets the model was not trained on) to validate CH3Depth.
Image Depth Accuracy
The model was tested against top-tier baselines like Marigold, DepthFM, and Depth Anything.

As seen in the table above, CH3Depth (bottom rows) achieves state-of-the-art performance. On the NYUv2 dataset, it reduces the error metric (AbsRel) by nearly 20% compared to DepthFM. Visually (Figure 3 of the paper), the depth maps produced by CH3Depth show sharper edges and better preservation of small structures than the “cloudier” results of other methods.
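For reference, AbsRel (absolute relative error) is the mean relative deviation of the predicted depth \(\hat{d}_i\) from the ground truth \(d_i\) over all valid pixels, so lower is better:

\[
\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert \hat{d}_i - d_i \rvert}{d_i} .
\]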
Efficiency Comparison
Perhaps the most striking result is the efficiency. High-quality generative depth estimation is usually slow.

In the comparison above:
- Marigold takes 4.45 seconds for 50 steps.
- DepthFM takes 0.69 seconds for 3 steps.
- CH3Depth takes only 0.36 seconds for 2 steps.
This speedup of roughly 12× over Marigold (and nearly 2× over DepthFM) brings high-quality generative depth estimation into the realm of real-time applications.
Video Consistency
Finally, the temporal stability was evaluated using “temporal slices.” This involves taking a vertical slice of pixels from a video and stacking them over time. If the depth estimation is stable, the horizontal lines in this stack should be smooth. If the estimation flickers, the lines will look jagged.
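For anyone who wants to reproduce this kind of visualization, a tiny sketch (assuming a `(T, H, W)` array of per-frame depth maps) is:

```python
import numpy as np

def temporal_slice(depth_video, column):
    """Build a temporal slice: one pixel column from every frame, stacked
    side by side into an (H, T) image. depth_video has shape (T, H, W).

    Smooth horizontal streaks indicate temporally stable depth;
    jagged streaks indicate flicker.
    """
    return np.stack([frame[:, column] for frame in depth_video], axis=1)
```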

The qualitative results in Figure 4 demonstrate that CH3Depth maintains meticulous detail while avoiding the global flickering seen in other methods. Quantitatively (shown in Table 2 below), the inclusion of the LTS module significantly reduces the OPW (Optical Flow Weighted) error, a metric for temporal inconsistency.

Conclusion
CH3Depth represents a significant step forward for 3D vision foundation models. By rethinking the flow matching objective with InDI and optimizing inference with non-uniform sampling, the authors managed to break the trade-off between speed and accuracy. Furthermore, the Latent Temporal Stabilizer proves that we don’t need heavy, complex architectures to achieve stable video depth—sometimes, a lightweight correction in the latent space is all that is needed.
For students and researchers, this paper highlights the importance of looking beyond just network architecture (like Transformers vs. CNNs) and focusing on the underlying mathematical frameworks (Flow Matching vs. Diffusion) and sampling strategies to achieve performance gains.
The LTS module, interestingly, is designed to be transferable. The authors showed it could even improve other models, suggesting that this “plug-and-play” consistency module could become a standard tool in video processing pipelines.