When we look at the world, we don’t just see pixels; we see structure. We instinctively recognize the floor we walk on, the walls that surround us, and the roads we drive on. In computer vision, these structures are known as 3D planes. Recovering these planes from a single 2D image is a cornerstone capability for Augmented Reality (AR), robotics navigation, and 3D mapping.

However, the field has long been fragmented. Current state-of-the-art (SOTA) methods are typically “specialists”: they are trained on indoor datasets to reconstruct rooms, or on outdoor datasets to reconstruct streets. If you take a model trained on a cozy living room and ask it to interpret a city street, it usually fails. This lack of generalizability is known as the domain gap.

In this post, we will dive into ZeroPlane, a novel framework proposed by Liu, Yu, et al., that aims to solve this problem. ZeroPlane is designed for zero-shot 3D plane reconstruction, meaning it can accurately reconstruct planes in environments it has never seen during training, handling both indoor and “in-the-wild” outdoor scenes with a single unified model.

Figure 1. Our plane reconstruction framework, ZeroPlane, demonstrates superior zero-shot generalizability on unseen and even in-the-wild data across diverse indoor and outdoor environments.

As shown in Figure 1, ZeroPlane effectively reconstructs 3D geometry across vastly different datasets, from the clutter of an ARKitScenes living room to the expansive roads of NuScenes.

The Challenge: Why is “In-the-Wild” So Hard?

To understand why ZeroPlane is significant, we first need to understand the limitations of previous approaches.

  1. Data Scarcity: While there are decent datasets for indoor scenes (like ScanNet), there is a distinct lack of large-scale, densely annotated plane datasets for outdoor environments.
  2. Geometric Scale: The geometry of a room is fundamentally different from a street. In a room, a plane might be 2 meters away. On a street, a road plane might stretch for 50 meters.
  3. Parameter Coupling: Most models represent a plane using a coupled vector that combines orientation (normal) and distance (offset). This coupling makes it hard for a neural network to learn a single representation that works for both small-scale indoor and large-scale outdoor environments.
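To make the coupling problem concrete, here is a minimal sketch (the numbers are illustrative, not taken from the paper) of how the magnitude of the coupled \(n/d\) vector swings between a nearby indoor wall and a distant road, which is exactly the kind of target spread a regressor struggles with:

```python
import numpy as np

# A plane satisfies n^T x = d, with unit normal n and offset d in meters.
n_wall, d_wall = np.array([0.0, 0.0, 1.0]), 2.5    # indoor wall, ~2.5 m away
n_road, d_road = np.array([0.0, 1.0, 0.0]), 50.0   # outdoor road, ~50 m away

# Coupled representation n/d used by many prior methods.
coupled_wall = n_wall / d_wall   # magnitude 0.4
coupled_road = n_road / d_road   # magnitude 0.02

# Roughly a 20x spread in target magnitude across the two domains.
print(np.linalg.norm(coupled_wall) / np.linalg.norm(coupled_road))
```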

Building the Foundation: A Massive Mixed-Domain Benchmark

Before building the model, the researchers had to build the data. A robust “generalist” model requires a training set that encompasses the diversity of the real world.

The authors curated a massive benchmark comprising over 14 datasets and 560,000 high-resolution images. Since manual annotation of 3D planes is prohibitively expensive, they developed an automated pipeline. They utilized a state-of-the-art panoptic segmentation model (Mask2Former) to identify object instances (like “road,” “wall,” or “table”) and then applied RANSAC (a fitting algorithm) to depth maps to mathematically fit planes to those objects.
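As a rough sketch of what one labeling step might look like (the paper does not give code at this level, so the function names and thresholds below are illustrative assumptions), one can back-project the depth map into 3D points and run RANSAC inside each panoptic mask:

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) into camera-space points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def ransac_plane(points, iters=200, thresh=0.02):
    """Fit n^T x = d to an (N, 3) point set; returns (normal, offset, inlier mask)."""
    best_inliers, best_plane = None, None
    for _ in range(iters):
        sample = points[np.random.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-8:
            continue                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = n @ sample[0]
        inliers = np.abs(points @ n - d) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane[0], best_plane[1], best_inliers

# Usage sketch: for each panoptic mask (e.g. "road", "wall"), fit a plane to its points.
# points = backproject(depth, K)[mask]        # (N, 3)
# normal, offset, inliers = ransac_plane(points)
```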

Table 1. Statistics of the datasets used in our work. Top: datasets used for training and validation. Bottom: datasets used for zero-shot evaluation.

Table 1 illustrates the sheer scale of this undertaking. By combining indoor giants like ScanNet and Matterport3D with outdoor synthetic datasets like Synthia and Virtual KITTI, they created a playground diverse enough to train a robust model.

![Figure 4. From top to bottom: annotated ground-truth planes on several datasets, including 7-Scenes [54], ParallelDomain [46], ApolloScape [29], Synthia [49] and Sanpo [62].](images/013.jpg#center)

Figure 4 visualizes the quality of these generated ground truths. Notice how the pipeline successfully segments and reconstructs planes in complex outdoor environments (bottom rows) just as well as indoor scenes (top rows).

The ZeroPlane Architecture

At its core, ZeroPlane is a Transformer-based framework. If you are familiar with DETR (Detection Transformer) or Mask2Former, the general flow will feel familiar: an image goes in, features are extracted, and a set of “queries” (learnable vectors) scour those features to find objects—in this case, planes.

However, standard detection transformers aren’t enough for high-fidelity 3D geometry. ZeroPlane introduces several critical innovations.

Figure 2. Our proposed ZeroPlane framework. Taking a single image as input, our model first extracts image features from encoder and decoder networks. The plane queries and the predicted pixel-level depth and normal maps serve as inputs to module (a) to obtain geometry-enhanced plane embeddings. These embeddings are then fed into plane detection heads for mask and classification predictions, and plane geometry heads for disentangled plane normal and offset predictions. Notably, the normal and offset are both learned via a classification-then-regression scheme, with every plane predicted by the same plane prediction heads.

1. Advanced Backbone and Pixel Decoder

Instead of standard ResNets or Swin Transformers, the authors utilize DINOv2, a powerful vision transformer trained via self-supervision on millions of images. This provides a robust feature representation that handles diverse lighting and textures better than standard supervised backbones. A pixel decoder (DPT) then processes these features to create high-resolution maps.
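The paper does not spell out the integration code, but a minimal sketch of pulling patch features out of a pretrained DINOv2 backbone via torch.hub (assuming network access; the hub names follow the public facebookresearch/dinov2 repo) might look like this:

```python
import torch

# Load a pretrained DINOv2 ViT-B/14 backbone (self-supervised features).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()

image = torch.randn(1, 3, 518, 518)          # input resized to a multiple of the 14-px patch size
with torch.no_grad():
    feats = backbone.forward_features(image)  # dict of class/patch tokens
patch_tokens = feats["x_norm_patchtokens"]    # (1, 37*37, 768) patch features

# A DPT-style pixel decoder would then fuse and upsample these patch features
# into the high-resolution maps used for depth, normals, and plane masks.
```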

2. Pixel-Geometry-Enhanced Plane Embedding

This is a clever “auxiliary task” strategy. The model doesn’t just look for planes immediately. First, it tries to predict the pixel-level depth and surface normals for the whole image.

Why do this? Because pixel-level predictions are rich in local geometric cues (like edges and corners). The architecture projects these depth and normal maps into embeddings (\(F_D\) and \(F_N\)) and fuses them with the plane queries (\(Q\)) using an attention mechanism:

\[
\mathbf{X}_D = \mathrm{Attn}(\mathbf{Q}, \mathbf{F}_D), \quad \mathbf{X}_N = \mathrm{Attn}(\mathbf{Q}, \mathbf{F}_N)
\]

By attending to these pixel-level geometric features, the plane queries become “geometry-aware” before they even attempt to predict the final 3D planes.
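A minimal PyTorch sketch of this fusion step is shown below; the dimensions, the use of two separate cross-attention layers, and the residual sum are assumptions for illustration rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GeometryEnhancedQueries(nn.Module):
    """Fuse plane queries Q with depth/normal embeddings via cross-attention."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_normal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, f_depth, f_normal):
        # queries: (B, num_queries, dim); f_depth / f_normal: (B, H*W, dim)
        x_d, _ = self.attn_depth(queries, f_depth, f_depth)     # X_D = Attn(Q, F_D)
        x_n, _ = self.attn_normal(queries, f_normal, f_normal)  # X_N = Attn(Q, F_N)
        return queries + x_d + x_n   # geometry-aware plane embeddings (one simple way to combine)

# fuse = GeometryEnhancedQueries()
# q, fd, fn = torch.randn(2, 20, 256), torch.randn(2, 60*80, 256), torch.randn(2, 60*80, 256)
# enhanced = fuse(q, fd, fn)   # (2, 20, 256)
```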

The Core Innovation: Disentangled Geometry Learning

The most significant contribution of ZeroPlane is how it handles the mathematics of 3D planes.

Conventionally, a plane is defined by the equation \(n^T x = d\), where \(n\) is the normal (orientation) and \(d\) is the offset (distance from origin). Previous methods (like PlaneRecTR) often predict a coupled vector \(n/d\).

The Problem: In an indoor scene, \(d\) is small (e.g., 2.5m). In an outdoor scene, \(d\) is large (e.g., 50m). When you mix these datasets, the distribution of \(n/d\) becomes chaotic and difficult for a network to regress directly.

The Solution: ZeroPlane disentangles the prediction. It uses separate heads to predict the Normal (\(n\)) and the Offset (\(d\)).

Classification-then-Regression (Cls-Reg)

Even after separating them, directly regressing the exact values is hard due to the variance. The authors adopt a “Classification-then-Regression” paradigm.

They cluster all the planes in their training set to find common “Exemplars” (prototypical normals and offsets).

  1. Classification: The network predicts which “Exemplar” (cluster center) the plane belongs to.
  2. Regression: The network predicts a small “Residual” (correction) to adjust the exemplar to the exact value.

The final prediction is the sum of the chosen exemplar and the predicted residual:

\[
\mathbf{n} = \hat{\mathbf{n}}^{(i)} + \mathbf{r}_n^{(i)}, \quad d = \hat{d}^{(j)} + r_d^{(j)}
\]

Here, \(\hat{n}\) and \(\hat{d}\) are the predicted exemplars (class), while \(r_n\) and \(r_d\) are the learnt residuals. This makes the learning process much more stable across diverse domains.
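Here is a compact sketch of the Cls-Reg idea, using k-means to build the exemplars; the cluster counts, the stand-in training data, and the decoding helpers are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

# 1) Offline: cluster training-set plane normals and offsets into exemplars.
#    (Random arrays stand in for the normals/offsets collected from the benchmark.)
train_normals = np.random.randn(10000, 3)
train_normals /= np.linalg.norm(train_normals, axis=1, keepdims=True)
train_offsets = np.abs(np.random.randn(10000, 1)) * 10.0

normal_exemplars = KMeans(n_clusters=7, n_init=10).fit(train_normals).cluster_centers_
offset_exemplars = KMeans(n_clusters=20, n_init=10).fit(train_offsets).cluster_centers_

# 2) At inference: the network outputs a class (exemplar index) plus a residual.
def decode_normal(class_logits, residual):
    i = np.argmax(class_logits)            # classification: pick exemplar n_hat^(i)
    n = normal_exemplars[i] + residual     # regression: add the learned residual r_n^(i)
    return n / np.linalg.norm(n)           # re-normalize to a unit normal

def decode_offset(class_logits, residual):
    j = np.argmax(class_logits)
    return float(offset_exemplars[j] + residual)   # d = d_hat^(j) + r_d^(j)
```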

Training the Model

The model is trained end-to-end using a bipartite matching loss (similar to DETR), ensuring that predicted planes are matched one-to-one with ground truth planes. The total loss function is a weighted sum of classification, mask segmentation, and the specific geometry losses for normals and offsets:

\[
L = \lambda_c L_c + \lambda_m L_m + \lambda_{n_c} L_{n_c} + \lambda_{n_r} L_{n_r} + \lambda_{d_c} L_{d_c} + \lambda_{d_r} L_{d_r} + \lambda_{p_d} L_{p_d} + \lambda_{p_n} L_{p_n}
\]

This comprehensive loss ensures that the model optimizes for semantic correctness (is it a plane?), segmentation accuracy (where is the plane?), and geometric precision (what is its 3D position?) simultaneously.
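Schematically, the training step boils down to Hungarian matching between predictions and ground truth followed by a weighted sum of per-term losses. The sketch below shows that skeleton; the loss names, values, and weights are placeholders, not the paper's settings:

```python
from scipy.optimize import linear_sum_assignment

def match_planes(cost_matrix):
    """Hungarian matching: cost_matrix[i, j] = cost of assigning prediction i to GT plane j."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return list(zip(pred_idx, gt_idx))

def total_loss(losses, weights):
    """Weighted sum of per-term losses, mirroring the combined objective above."""
    return sum(weights[k] * losses[k] for k in losses)

# Example: per-term losses computed over the matched prediction/GT pairs.
losses = {"cls": 0.3, "mask": 0.8, "normal_cls": 0.5, "normal_res": 0.1,
          "offset_cls": 0.4, "offset_res": 0.2, "plane_depth": 0.6, "plane_normal": 0.3}
weights = {k: 1.0 for k in losses}   # placeholder lambdas
print(total_loss(losses, weights))
```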

Experimental Results

The researchers put ZeroPlane to the test using a rigorous Zero-Shot Evaluation. This means they trained the model on a mix of datasets but tested it on completely different datasets that the model had never seen (specifically NYUv2 and 7-Scenes for indoor, ParallelDomain and ApolloScape for outdoor).

Quantitative Performance

Table 2 below compares ZeroPlane against the previous state-of-the-art, PlaneRecTR.

![Table 2. Zero-shot evaluation of different methods or settings on indoor datasets (NYUv2 [55], 7-Scenes [54]) and outdoor datasets (ParallelDomain [5, 46], ApolloScape [29]). (S: trained on ScanNet [16]; S-2: trained on ScanNetv2, whose training set is larger than ScanNetv1; M: trained on mixed datasets.)](images/007.jpg#center)

The results are striking. Look at the Plane Recall metrics (higher is better). ZeroPlane (specifically the Ours-DINO-B (M) variant) significantly outperforms PlaneRecTR (M) across the board.

  • On ParallelDomain (Outdoor), the Depth Recall @ 1m jumped from 19.11 (PlaneRecTR) to 25.96 (ZeroPlane).
  • On NYUv2 (Indoor), the Normal Recall @ 5° improved from 24.97 to 37.29.

This proves that the architectural changes and the disentangled learning strategy allow the model to generalize much better than previous attempts.
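For readers unfamiliar with these metrics: my reading of the standard protocol is that a ground-truth plane counts as recalled when its matched prediction (typically also subject to a minimum mask overlap) has a per-plane depth or normal error under the threshold. A tiny sketch, under that assumption:

```python
import numpy as np

def plane_recall(matched_errors, threshold):
    """matched_errors[i] is the error of the prediction matched to ground-truth plane i
    (depth error in meters, or normal angle in degrees); use np.inf for ground-truth
    planes with no matched prediction."""
    errs = np.asarray(matched_errors, dtype=float)
    return float((errs < threshold).mean())

# Depth recall @ 1 m and normal recall @ 5 degrees (error arrays are hypothetical):
# plane_recall(per_plane_depth_errors, 1.0)
# plane_recall(per_plane_normal_errors, 5.0)
```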

Qualitative Performance

Ideally, we want to see clean, flat meshes that represent the scene structure. Figure 3 provides a visual comparison.

Figure 3. Qualitative results from our mix-trained model for zero-shot plane segmentation and mesh reconstruction on NYUv2, 7-Scenes, ParallelDomain, and ApolloScape, from top to bottom, respectively. Noticeable differences are highlighted.

In the bottom rows (outdoor scenes), look at the mesh column. The “PlaneRecTR” results often fail to capture the road surface correctly or miss building facades. ZeroPlane (Ours-Mesh) produces a much cleaner, continuous surface for the road and distinct vertical planes for buildings.

Why does it work? (Ablation Studies)

Is the complex architecture necessary? The authors broke down their contributions in Table 3.

Table 3. Ablation studies on the contributed components under both single-dataset (ScanNet) training and mixed-dataset training schemes, evaluated on the NYUv2 dataset.

  • Cls-Reg: Switching to the Classification-then-Regression strategy provided a massive boost in recall (comparing row 5 to row 7).
  • Geo-Attn: Adding the Geometry-Enhanced Attention (bottom row) provided the final push in performance, verifying that pixel-level cues are essential for high-level plane reasoning.

Conclusion

ZeroPlane represents a significant step forward in 3D computer vision. By acknowledging the vast geometric differences between indoor and outdoor environments and designing a “disentangled” architecture to handle them, the authors have created a unified model that works “in the wild.”

Key takeaways for students and practitioners:

  1. Data is King: The creation of a unified, 560k-image benchmark was a prerequisite for success.
  2. Disentanglement Matters: When variables in your problem scale differently (like indoor vs. outdoor distances), separating them is often better than trying to learn a coupled representation.
  3. Hybrid Approaches: Combining classification (for coarse estimation) and regression (for fine-tuning) is a powerful tool for geometric prediction.

As robotics and AR continue to merge with our daily lives, systems like ZeroPlane will be crucial for helping machines understand the geometry of the world around them, whether it’s a small bedroom or a busy city intersection.