Introduction
In the world of computer vision, perceiving depth is everything. Whether it’s an autonomous vehicle navigating a busy street or a robot arm reaching for a cup, the machine needs to know exactly how far away objects are.
For years, Stereo Matching has been the gold standard for this. Mimicking human binocular vision, it uses two cameras and triangulates distance from the disparity between the left and right images. But there is a catch: stereo matching relies on finding the exact same feature in both images. When a car drives into a tunnel with textureless white walls, or faces a highly reflective glass building, those matching cues disappear, and stereo vision essentially goes blind.
Enter Monocular Depth Estimation. This technique uses deep learning to predict depth from a single image based on context (e.g., “cars are usually on the ground,” “sky is far away”). It doesn’t need matching pixels, but it has its own flaw: it’s terrible at guessing absolute distance. It might tell you a building is behind a tree, but it can’t tell you if the building is 10 meters or 100 meters away.
What if we could combine the geometric precision of stereo matching with the contextual understanding of monocular depth?
That is exactly what the paper “MonSter: Marry Monodepth to Stereo Unleashes Power” proposes. The researchers introduce a novel framework that fuses these two approaches, allowing them to correct each other’s mistakes iteratively.

As shown in the figure above, standard stereo methods (like IGEV) often fail in reflective or textureless regions, creating artifacts. MonSter, however, produces smooth, accurate depth maps even in these “ill-posed” scenarios. Let’s dive into how this marriage of techniques works.
The Core Problem: Why Stereo Matching Isn’t Enough
To understand why MonSter is necessary, we first need to look at the limitations of traditional stereo matching.
Stereo matching algorithms generally follow a pipeline:
- Feature Extraction: Find unique patterns in both images.
- Cost Volume Construction: Compare the left-image features against the right-image features at every candidate disparity, producing a volume of matching costs.
- Disparity Regression: Regress the disparity (the horizontal pixel offset between matched features) from that volume; metric depth then follows from disparity by triangulation.
This works well for richly textured objects like trees or brick walls. However, in ill-posed regions—such as occlusions (where an object is visible in one eye but not the other), textureless surfaces (blank walls), or repetitive patterns (fences)—the algorithm struggles to find a unique match. This leads to noise and “holes” in the depth map.
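To make the last two steps concrete, here is a minimal PyTorch sketch of a correlation-based cost volume and soft disparity regression. It is an illustrative simplification, not the exact operators used by networks like IGEV.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat_left, feat_right, max_disp):
    """Correlation cost volume: for each candidate disparity d, shift the
    right-view features d pixels and correlate them with the left-view
    features. Pixels with no valid counterpart keep a score of zero."""
    B, C, H, W = feat_left.shape
    volume = feat_left.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[..., d:] * feat_right[..., :-d]).mean(dim=1)
    return volume

def regress_disparity(cost_volume):
    """Disparity regression: softmax the similarity scores along the disparity
    axis and take the expected disparity. Differentiable, unlike a hard argmax."""
    B, D, H, W = cost_volume.shape
    prob = F.softmax(cost_volume, dim=1)
    disp_values = torch.arange(D, device=cost_volume.device, dtype=prob.dtype).view(1, D, 1, 1)
    return (prob * disp_values).sum(dim=1)  # (B, H, W) disparity in pixels
```

For a calibrated rig, metric depth then follows directly from disparity (depth = focal length × baseline / disparity), which is exactly why a clean disparity map is so valuable.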
The Monocular Dilemma
On the other side, we have Monocular Depth Estimation. Modern transformer-based models (like DepthAnything, which builds on DINOv2 features) are incredibly good at understanding scene structure from a single image. They don’t care about textureless walls because they recognize the “wall” as an object.
However, monocular depth suffers from scale and shift ambiguity. A photo of a toy car close up looks geometrically identical to a real car far away. A monocular model produces “relative” depth, not “metric” depth.
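In equation form, the ambiguity is an unknown affine transform between the relative prediction and the metric value. The symbols below are illustrative shorthand, not the paper's notation:

```latex
% A single image cannot resolve the positive scale s and shift t relating
% the relative prediction to metric (inverse) depth:
\mathcal{D}_{\text{metric}}(i) = s \cdot \mathcal{D}_{\text{rel}}(i) + t, \qquad s > 0,\ t \in \mathbb{R}
```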

The figure above illustrates this problem perfectly.
- Plot (a) shows raw monocular output vs. ground truth (GT). It’s a mess.
- Plot (b) shows what happens if you align the global scale and shift. It gets better, but there is still significant noise because the error isn’t consistent across the pixels.
- Plot (c) shows the result of MonSter. By refining the monocular depth pixel-by-pixel using stereo cues, the alignment becomes nearly perfect.
The MonSter Method
The brilliance of MonSter lies in its dual-branch architecture. Instead of just adding a monocular channel to a stereo network, the authors built a system where the two branches actively talk to and refine one another.
Architecture Overview

The architecture, illustrated above, consists of three main components:
- Monocular Branch: Uses a pre-trained model (DepthAnythingV2) to get a structural “guess” of the scene.
- Stereo Branch: Uses a standard stereo matching network (similar to IGEV) to calculate geometric disparity.
- Mutual Refinement Module: This is the engine room where the fusion happens.
The process is iterative. The model generates an initial estimate, and then loops through a refinement stage where the Stereo branch fixes the Mono branch’s scale, and the Mono branch fixes the Stereo branch’s missing details.
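Conceptually, the loop looks like the sketch below. The two update callables are stand-ins for the SGA and MGR modules described in the next sections; the structure of the loop, not the exact API, is what matters here.

```python
from typing import Callable
import torch

Tensor = torch.Tensor

def monster_refine(
    mono_disp: Tensor,      # monocular prediction after global scale-shift alignment
    stereo_disp: Tensor,    # initial stereo disparity
    sga_update: Callable[[Tensor, Tensor], Tensor],  # stereo-guided alignment (Step 2)
    mgr_update: Callable[[Tensor, Tensor], Tensor],  # mono-guided refinement (Step 3)
    num_iters: int = 4,
) -> Tensor:
    """Mutual refinement sketch: each iteration, high-confidence stereo pixels
    correct the mono branch's scale/shift, then the metric-aligned mono prior
    fills in the stereo branch's ill-posed regions."""
    for _ in range(num_iters):
        mono_disp = sga_update(mono_disp, stereo_disp)
        stereo_disp = mgr_update(stereo_disp, mono_disp)
    return stereo_disp
```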
Step 1: Global Scale-Shift Alignment
Before the two branches can cooperate, they need to speak the same language. The monocular output is relative inverse depth, while the stereo output is disparity.
The first step is a Global Scale-Shift Alignment. The model solves a least-squares optimization problem to find a global scale (\(s_G\)) and shift (\(t_G\)) that roughly aligns the monocular depth map (\(\mathcal{D}_M\)) with the current stereo disparity (\(\mathcal{D}_S^0\)).

This creates a coarse “Monocular Disparity” (\(\mathcal{D}_M^0\)) that is roughly in the same metric space as the stereo output.
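A closed-form least-squares fit is enough for this step. The sketch below fits \(s_G\) and \(t_G\) over pixels where the stereo disparity is valid; the masking heuristic and the plain normal-equation solve are my simplifications, not necessarily the paper's exact procedure.

```python
import torch

def global_scale_shift_align(mono_inv_depth, stereo_disp, valid_mask=None):
    """Fit stereo_disp ≈ s * mono_inv_depth + t in the least-squares sense and
    map the monocular prediction into the stereo disparity space."""
    if valid_mask is None:
        valid_mask = stereo_disp > 0          # assume non-positive disparities are invalid
    x = mono_inv_depth[valid_mask]
    y = stereo_disp[valid_mask]

    # Affine fit y ≈ s*x + t via the design matrix [x, 1].
    A = torch.stack([x, torch.ones_like(x)], dim=1)
    solution = torch.linalg.lstsq(A, y.unsqueeze(1)).solution
    s_g, t_g = solution[0, 0], solution[1, 0]

    mono_disp0 = s_g * mono_inv_depth + t_g   # coarse "monocular disparity" D_M^0
    return mono_disp0, (s_g, t_g)
```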
Step 2: Stereo Guided Alignment (SGA)
A global alignment isn’t enough. The scale or shift error might vary across different parts of the image. This is where the Stereo branch helps the Mono branch.
The system uses a Confidence-Based Guidance. It calculates a “flow residual map” (\(F_S^j\)) by warping the right image features to the left using the current stereo disparity.

If the residual is low (\(F_S^j\) is small), it means the stereo match is good (high confidence). If the residual is high, the stereo match is likely wrong.
Using this confidence, the model constructs a “stereo condition” vector (\(x_S^j\)) containing geometric features and the flow residual.

This condition is fed into a Gated Recurrent Unit (GRU). The GRU decides, pixel by pixel, how much to trust the stereo information to update the hidden state of the monocular branch.

Finally, the network predicts a residual shift (\(\Delta t\)) to fine-tune the monocular disparity.

This step effectively “anchors” the monocular depth to real-world metrics using the high-confidence regions of the stereo output.
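Here is a minimal sketch of the warping that produces this confidence signal. The L1 feature difference stands in for the paper's exact residual \(F_S^j\), and the GRU that turns the residual and the geometric features into the per-pixel shift \(\Delta t\) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(feat_right, disparity):
    """Sample right-view features at (x - d, y) so they line up with the left view.
    feat_right: (B, C, H, W), disparity: (B, H, W) in pixels."""
    B, C, H, W = feat_right.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=disparity.device, dtype=torch.float32),
        torch.arange(W, device=disparity.device, dtype=torch.float32),
        indexing="ij",
    )
    xs = xs.unsqueeze(0) - disparity                      # shift x by the current disparity
    ys = ys.unsqueeze(0).expand_as(xs)
    grid = torch.stack([2 * xs / (W - 1) - 1,             # normalize to [-1, 1] for grid_sample
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(feat_right, grid, align_corners=True)

def flow_residual(feat_left, feat_right, disparity):
    """Per-pixel matching error: a small residual marks a trustworthy stereo pixel."""
    warped = warp_right_to_left(feat_right, disparity)
    return (feat_left - warped).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
```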
Step 3: Mono Guided Refinement (MGR)
Now that we have a high-quality, metric-aligned monocular depth map, it’s time to return the favor. The Mono branch now acts as a teacher to the Stereo branch.
In regions where stereo matching fails (ill-posed regions), the confidence is low. However, the aligned monocular depth provides a strong structural prior. The system uses a symmetric process where the monocular features and the aligned monocular disparity guide the refinement of the stereo disparity.

This allows the stereo branch to “hallucinate” the correct depth in textureless or reflective areas by relying on the semantic understanding of the monocular branch.
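The real MGR module is learned with a GRU, just like SGA. Purely to illustrate the direction of information flow, it can be collapsed into a confidence-weighted blend, where confidence comes from the flow residual above; the exponential mapping and the beta constant are my assumptions, not the paper's.

```python
import torch

def mono_guided_refinement(stereo_disp, mono_disp_aligned, residual, beta=10.0):
    """Illustrative simplification of MGR: where the stereo residual is high
    (low confidence), lean on the aligned monocular disparity; where it is low,
    keep the stereo estimate. residual: (B, 1, H, W), disparities: (B, H, W)."""
    confidence = torch.exp(-beta * residual.squeeze(1))   # in (0, 1], 1 means "trust stereo"
    return confidence * stereo_disp + (1.0 - confidence) * mono_disp_aligned
```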
The Loss Function
To train this beast, the authors use a loss function that supervises both branches. The total loss combines the errors from the stereo branch and the monocular branch, exponentially increasing the weight of later iterations (since the final output matters most).
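As a sketch, an exponentially weighted multi-iteration loss could look like the snippet below. The L1 terms and the gamma weighting mirror common practice in iterative (RAFT-style) stereo networks, so treat the exact form as an assumption rather than the paper's precise equation.

```python
import torch

def iterative_loss(stereo_preds, mono_preds, gt_disp, gamma=0.9):
    """Supervise every iteration of both branches; gamma**(n-1-i) grows toward 1
    for later iterations, so the final outputs dominate the loss."""
    n = len(stereo_preds)
    loss = gt_disp.new_zeros(())
    for i, (d_s, d_m) in enumerate(zip(stereo_preds, mono_preds)):
        weight = gamma ** (n - 1 - i)
        loss = loss + weight * ((d_s - gt_disp).abs().mean() +
                                (d_m - gt_disp).abs().mean())
    return loss
```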

Experiments and Results
The researchers evaluated MonSter on five major benchmarks: Scene Flow, KITTI 2012 & 2015, Middlebury, and ETH3D. The results were nothing short of dominant.
Leaderboard Performance
MonSter ranked 1st across all five leaderboards.

As seen in the radar chart, MonSter (the red line) pushes the boundary of performance (higher is better on this normalized chart) significantly beyond competitors like IGEV and CREStereo.
Detailed quantitative results on the Scene Flow dataset show a massive reduction in End-Point Error (EPE).

MonSter achieves an EPE of 0.37, which is a 15.91% improvement over the previous state-of-the-art.
Conquering Ill-Posed Regions
The true test of MonSter is in the difficult areas: reflections and textureless zones. The table below highlights performance specifically in “Reflective Regions” on the KITTI 2012 benchmark.

MonSter reduces the error rate (Out-4 All) from roughly 3% (competitors) down to 2.13%, proving that the monocular prior effectively stabilizes the stereo matching when reflections confuse the geometric solver.
Visual Quality
The numbers are impressive, but the visual results tell the story better. In the ETH3D dataset comparison below, look at the areas pointed out by the white arrows.

Standard methods (IGEV) often produce “bleeding” colors or holes where the algorithm gets confused by lighting or fine structures. MonSter preserves sharp boundaries and consistent depth gradients.
Zero-Shot Generalization
Perhaps the most exciting result is Zero-Shot Generalization. This tests how well a model trained only on synthetic data (Scene Flow) performs on real-world data (KITTI, Middlebury) without any fine-tuning.

MonSter outperforms existing methods by huge margins in this setting. This suggests that by combining two different modes of perception (geometric and semantic), the model learns a more robust representation of “depth” that isn’t as easily fooled by domain shifts.

In the real-world captures above (Figure 6), notice how MonSter (right column) handles the flat walls and the complex outdoor scene much better than the baseline, which introduces significant noise.
Efficiency
You might think that running two branches makes the model incredibly slow. However, because the monocular branch provides such strong priors, the stereo branch needs fewer iterations to converge.

As shown in Table 7, the full MonSter model with just 4 iterations outperforms the baseline IGEV with 32 iterations. While the total runtime is higher (0.64s vs 0.37s) due to the heavy Monocular backbone (DepthAnything), the accuracy-per-compute trade-off is compelling for high-precision applications.
Conclusion
MonSter represents a paradigm shift in depth estimation. Instead of trying to squeeze more performance out of stereo matching alone, it acknowledges the inherent limitations of the technology and brings in a partner—Monocular Depth—to fill the gaps.
By decomposing the problem into “Monocular Depth Estimation” and “Scale-Shift Recovery,” MonSter unleashes the full potential of both fields. The stereo branch provides the metric accuracy that monocular models lack, and the monocular branch provides the structural completeness that stereo models lack.
The results—ranking 1st on five major leaderboards—speak for themselves. This work paves the way for future “Stereo Foundation Models” that could act as the visual cortex for the next generation of autonomous systems.