Why Your Inpainting Model is Secretly a Correspondence Expert: Unveiling ZeroCo
If you have been following Computer Vision research lately, you know that “Masked Image Modeling” (like MAE) has revolutionized how models learn representations. The idea is simple: hide parts of an image and ask the model to fill in the blanks.
But what happens when you extend this to two images? This is called Cross-View Completion (CVC). In this setup, a model looks at a source image to reconstruct a masked target image. To do this effectively, the model must implicitly understand the 3D geometry of the scene—it needs to know which pixel in the source corresponds to the missing pixel in the target.
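To picture the training setup, here is a minimal sketch of the target-masking step, assuming a MAE-style random patch mask (the masking ratio, shapes, and function name are illustrative, not CroCo's exact implementation):

```python
import torch

def mask_target_patches(target_tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Randomly hide most target patch tokens, MAE/CroCo-style.

    target_tokens: (N, D) patch embeddings of the target image.
    Returns the visible tokens plus the indices of the masked patches that
    the decoder must reconstruct while looking at the fully visible source.
    """
    n = target_tokens.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]
    return target_tokens[keep_idx], mask_idx

visible, masked = mask_target_patches(torch.randn(196, 768))  # 14x14 patches
```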
In a fascinating new paper, researchers from KAIST, Korea University, and Samsung Electronics have revealed that these CVC models (specifically CroCo) are actually Zero-shot Correspondence Estimators. Without ever being explicitly trained to match points, these models develop internal cross-attention maps that align images better than the features of dedicated representation models like DINOv2.
In this post, we will break down how this works, why cross-attention is the secret sauce, and how the proposed method, ZeroCo, achieves state-of-the-art results in geometric matching and depth estimation.
The Problem: Finding Matches in a Complex World
Dense correspondence—finding the matching pixel for every point between two images—is a cornerstone of computer vision. It powers Stereo Vision, Optical Flow, and 3D Reconstruction.
Traditionally, we solve this in two ways:
- Supervised Learning: Training on expensive datasets with ground-truth flow (like FlyingChairs or other synthetic scenes).
- Self-Supervised Learning: Using geometric constraints (like epipolar geometry) to train models without labels.
Recently, foundation models like DINOv2 or Stable Diffusion have been used for this. We extract “features” (vectors representing image patches) from their encoders and calculate the similarity. While good, these features are often “semantic”—they know that a dog is a dog, but they might not know exactly which pixel is the dog’s left eye in a rotated view.
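Concretely, the feature-based recipe looks something like the sketch below: extract patch features from a frozen encoder, normalise them, and take nearest neighbours by cosine similarity (the shapes and the random tensors standing in for real DINOv2 features are purely illustrative):

```python
import torch
import torch.nn.functional as F

def feature_matching(feat_tgt: torch.Tensor, feat_src: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour matching between two grids of patch features.

    feat_tgt, feat_src: (H*W, D) patch embeddings from a frozen encoder
    (e.g. DINOv2). Returns, for every target patch, the index of the most
    similar source patch.
    """
    tgt = F.normalize(feat_tgt, dim=-1)
    src = F.normalize(feat_src, dim=-1)
    similarity = tgt @ src.T            # cosine similarity, (N_tgt, N_src)
    return similarity.argmax(dim=-1)    # best source patch per target patch

# Random tensors stand in for real encoder outputs (14x14 patches, dim 768).
matches = feature_matching(torch.randn(196, 768), torch.randn(196, 768))
```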
This is where ZeroCo comes in.

As shown above, the authors discovered that CVC models can pinpoint exact matches (the red dot) for a query point (the blue dot) with incredible precision, purely zero-shot.
The Insight: CVC is Correspondence Learning in Disguise
To understand why this works, we need to look at the architecture of a Cross-View Completion model.
In CVC, the model takes a Target Image (mostly masked out) and a Source Image (fully visible). It encodes both. Then, a Decoder tries to reconstruct the target.
Here is the “Aha!” moment: To reconstruct a missing patch in the target, the decoder must look at the source image. It does this via a Cross-Attention Layer. This layer calculates a similarity score between the target query (\(Q\)) and the source keys (\(K\)).
The researchers noticed a striking structural similarity between CVC and traditional matching methods.

In traditional matching (Left side, a), we compute a “Cost Volume” to see how well pixels align. In CVC (Right side, b), the Cross-Attention Map acts exactly like a Cost Volume. It tells the network: “To fix this pixel in the Target, copy information from this pixel in the Source.”
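The analogy is easy to see in code. A minimal sketch, with random tensors standing in for real decoder queries, keys, and values:

```python
import torch

# A cost volume and a cross-attention map are the same object: a
# (target-patches x source-patches) similarity table.
d = 64
tgt_q = torch.randn(196, d)   # queries from the masked target tokens
src_k = torch.randn(196, d)   # keys from the visible source tokens
src_v = torch.randn(196, d)   # values: the information that gets copied

# (a) classical matching: a correlation-style cost volume
cost_volume = tgt_q @ src_k.T                        # (196, 196)

# (b) CVC decoder: the same table, scaled and softmax-normalised per row
attn_map = (cost_volume / d ** 0.5).softmax(dim=-1)

# Reconstruction literally copies information from the attended source patches.
reconstructed_tgt = attn_map @ src_v                 # (196, d)
```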
Why Cross-Attention Beats Features
Most previous methods (like DUSt3R or MASt3R) used the features from the encoder or decoder to find matches. The authors of ZeroCo argue that the features are just the ingredients; the Cross-Attention Map is the actual recipe for alignment.
Let’s look at a visual comparison of the “Attention” (or correlation) derived from three different parts of the same model:

Look at the difference in sharpness:
- (d) Encoder Correlation: Very blurry. It knows the general area, but it’s noisy.
- (e) Decoder Correlation: Still vague.
- (f) Cross-Attention: Extremely sharp. It pinpoints the exact geometric location of the match.
The math behind this cross-attention map (\(C_{att}\)) is a softmax over the (scaled) dot product of the target queries and the source keys:

\[
C_{att} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right),
\]

where \(d\) is the key dimension.
This map is what the model uses to “warp” information from source to target. Because the training objective is reconstruction, the model is forced to make this map geometrically accurate. If the map is wrong, the reconstruction fails.
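In practice, you can read this map straight out of a decoder block. The snippet below is a hypothetical stand-in built on `torch.nn.MultiheadAttention`; CroCo's real decoder modules are named and structured differently, so treat this only as a sketch of where \(C_{att}\) lives:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative stand-in for one CVC decoder block (not CroCo's actual code)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tgt_tokens: torch.Tensor, src_tokens: torch.Tensor):
        # need_weights=True returns the attention map itself; averaging over
        # heads gives a single (N_tgt, N_src) table, i.e. C_att.
        out, attn_map = self.cross_attn(
            tgt_tokens, src_tokens, src_tokens,
            need_weights=True, average_attn_weights=True,
        )
        return out, attn_map

block = DecoderBlock()
tgt = torch.randn(1, 196, 64)   # (masked) target tokens
src = torch.randn(1, 196, 64)   # source tokens
_, c_att = block(tgt, src)
print(c_att.shape)              # torch.Size([1, 196, 196]): one row per target patch
```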
Introducing ZeroCo: The Method
The authors propose ZeroCo, a method to extract and refine these cross-attention maps for high-quality correspondence without any fine-tuning.
1. Handling the “Register Token” Artifacts
Transformers often use “register tokens”—extra tokens that act as a workspace for global information. The researchers found that the raw attention maps often incorrectly focused on these tokens (a phenomenon known as shortcut learning), creating artifacts.

By simply suppressing these specific tokens (clamping their values to the minimum), the attention map cleans up instantly, as seen in the transition from (c) to (d) above.
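A minimal sketch of this clean-up step, assuming the register tokens sit at known indices at the start of the source token sequence (the index layout and the count of four registers are assumptions for illustration):

```python
import torch

def suppress_register_tokens(attn_map: torch.Tensor, num_registers: int) -> torch.Tensor:
    """Clamp the attention paid to register tokens down to the map's minimum.

    attn_map: (N_tgt, N_src) cross-attention map whose first `num_registers`
    source columns are register tokens (assumed layout for this sketch).
    """
    cleaned = attn_map.clone()
    cleaned[:, :num_registers] = attn_map.min()   # registers can no longer dominate
    return cleaned

attn = torch.rand(196, 196 + 4)                   # 4 hypothetical register tokens
cleaned = suppress_register_tokens(attn, num_registers=4)
```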
2. Enforcing Reciprocity
In geometric matching, if point A matches to point B, then point B should match back to point A. To enforce this consistency and improve robustness, ZeroCo runs the model twice: once as usual (\(I_t, I_s\)) and once with the inputs swapped (\(I_s, I_t\)).
They average the forward map with the transpose of the swapped-input map to create a “reciprocal” cost volume:

\[
C_{recip} = \frac{1}{2}\left( C_{att}(I_t, I_s) + C_{att}(I_s, I_t)^{\top} \right)
\]
This simple averaging trick significantly cleans up the noise and makes the matching invariant to input swapping.
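A sketch of this symmetrisation, assuming the swapped-input map is transposed into the same (target, source) space before averaging (the exact combination used in the paper may differ in detail):

```python
import torch

def reciprocal_cost_volume(attn_t2s: torch.Tensor, attn_s2t: torch.Tensor) -> torch.Tensor:
    """Average the forward and (transposed) swapped-input attention maps.

    attn_t2s: (N_tgt, N_src) map from the usual (I_t, I_s) pass
    attn_s2t: (N_src, N_tgt) map from the swapped (I_s, I_t) pass
    """
    return 0.5 * (attn_t2s + attn_s2t.T)

c_recip = reciprocal_cost_volume(torch.rand(196, 196), torch.rand(196, 196))
```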
Beyond Zero-Shot: Learning-Based Extensions
While the zero-shot performance is impressive, the authors also show that you can train lightweight “heads” on top of these frozen attention maps to achieve state-of-the-art results in specific tasks.
The Architecture
The researchers designed two variants:
- ZeroCo-flow: For optical flow and geometric matching.
- ZeroCo-depth: For multi-frame depth estimation.
Both architectures follow a similar pattern: take the frozen cross-attention maps, feed them into an aggregation module (to refine the costs), and then use an upsampling module to get high-resolution predictions.

This setup is highly efficient because the heavy lifting—understanding the geometry—is already done by the pretrained CVC model. The learnable head just polishes the result.
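The pattern can be sketched in a few lines. The head below is purely illustrative (the channel counts, the flow regression, and the bilinear upsampling are my assumptions, not ZeroCo's exact modules), but it shows the shape of the pipeline: frozen attention map in, refined high-resolution prediction out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceHead(nn.Module):
    """Illustrative lightweight head on top of a frozen cross-attention map."""

    def __init__(self, src_hw: int = 14):
        super().__init__()
        # "Aggregation": refine the cost volume with a few convolutions,
        # then regress a 2-channel flow field at patch resolution.
        self.aggregate = nn.Sequential(
            nn.Conv2d(src_hw * src_hw, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 2, 3, padding=1),
        )

    def forward(self, attn_map: torch.Tensor, out_size=(224, 224)) -> torch.Tensor:
        # attn_map: (H_t*W_t, H_s*W_s), frozen output of the CVC decoder.
        n_tgt = attn_map.shape[0]
        h_t = w_t = int(n_tgt ** 0.5)
        # Every source position becomes a channel at each target grid cell.
        cost = attn_map.T.reshape(1, -1, h_t, w_t)   # (1, H_s*W_s, H_t, W_t)
        flow = self.aggregate(cost)                  # (1, 2, H_t, W_t)
        # "Upsampling": bring the coarse prediction to full image resolution.
        return F.interpolate(flow, size=out_size, mode="bilinear", align_corners=False)

head = CorrespondenceHead()
flow = head(torch.rand(196, 196))
print(flow.shape)   # torch.Size([1, 2, 224, 224])
```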
Experimental Results
The paper validates ZeroCo across three major areas: Zero-shot matching, Supervised matching, and Depth estimation.
1. Zero-Shot Dense Matching
Tested on the HPatches benchmark, ZeroCo completely dominates other foundation models.

- DINOv2 achieves an error (AEPE) of 28.08.
- ZeroCo achieves an error of 9.41.
This is a massive reduction in error, proving that the geometric cues in CVC are much stronger than the semantic cues in DINOv2 or Stable Diffusion (DIFT).
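For reference, AEPE (Average End-Point Error) is the mean Euclidean distance, in pixels, between the predicted and ground-truth flow vectors, so lower is better. A quick sketch of the metric:

```python
import torch

def aepe(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Average End-Point Error between two (2, H, W) flow fields, in pixels."""
    return (flow_pred - flow_gt).norm(dim=0).mean()

# A flow that is off by (3, 3) pixels everywhere has AEPE = sqrt(18) ≈ 4.24.
print(aepe(torch.zeros(2, 8, 8), torch.full((2, 8, 8), 3.0)))
```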
Visually, the difference is stark. In the heatmap comparison below, look at how ZeroCo (column g) provides a focused, single red dot (high confidence match) compared to the diffuse clouds of DIFT or SD-DINO.

2. Multi-Frame Depth Estimation
This is perhaps the most practical application. Standard multi-frame depth estimation relies on epipolar geometry: for a static point, its match in the other frame must lie along a known epipolar line determined by the camera motion.
The Flaw: Epipolar geometry fails on moving objects. If a car is driving alongside you, it violates the epipolar constraint, and traditional models estimate its depth incorrectly (frequently pushing it toward infinity).
The ZeroCo Fix: Because ZeroCo uses the cross-attention map, which acts as a full cost volume, it does not rely on the static-scene assumption. It can match the car simply because the car looks the same in both frames, as the sketch below illustrates.
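A minimal sketch of matching from a full cost volume, with no epipolar line imposed (the shapes and the row-major grid layout are assumptions for illustration):

```python
import torch

def dense_matches(cost: torch.Tensor, src_hw: tuple) -> torch.Tensor:
    """Pick the best source position for every target position.

    cost:   (N_tgt, H_s*W_s) cost volume / cross-attention map
    src_hw: (H_s, W_s) spatial size of the source patch grid
    Because the argmax runs over the *entire* source grid, a moving car can
    be matched wherever it actually appears, epipolar constraint or not.
    """
    h_s, w_s = src_hw
    best = cost.argmax(dim=-1)                             # (N_tgt,)
    return torch.stack((best // w_s, best % w_s), dim=-1)  # (N_tgt, 2) row/col

coords = dense_matches(torch.rand(196, 196), (14, 14))
```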

In the figure above (Cityscapes dataset), look at the dynamic cars (the moving vehicles).
- ManyDepth (b) struggles to define them clearly.
- ZeroCo-depth (d) yields crisp, accurate depth maps for the moving vehicles, handling the “dynamic object” problem elegantly without needing semantic segmentation masks.
Quantitative results on the KITTI dataset confirm this performance, showing ZeroCo-depth achieves competitive metrics with state-of-the-art specialized networks.

Conclusion
The paper “Cross-View Completion Models are Zero-shot Correspondence Estimators” offers a fresh perspective on what neural networks are actually learning. By forcing a model to reconstruct a scene from a different view (Cross-View Completion), we inadvertently train it to become a master of geometry and correspondence.
Key Takeaways:
- Don’t ignore the attention maps: The internal cross-attention weights of a decoder often hold more precise geometric information than the feature vectors themselves.
- CVC \(\approx\) Correspondence: The task of cross-view completion is mathematically and structurally analogous to self-supervised correspondence learning.
- ZeroCo works out of the box: You can extract these maps for high-quality, zero-shot matching.
- Robustness: Because it doesn’t rely on strict epipolar constraints, ZeroCo offers significant advantages in handling dynamic scenes and noise for depth estimation.
For students and researchers, this suggests that the next time you need to align images or find dense matches, you might not need to train a new model from scratch—the answer might already be hidden in the attention layers of a pre-trained inpainting model.