Introduction

Imagine you are standing in the middle of a room, holding a camera, and taking a \(360^{\circ}\) panoramic photo. Now, walk into the next room and take another one. Can you reconstruct the entire floor plan of the house—accurate to the centimeter—just from those two photos?

This is the problem of Wide-Baseline Floor Plan Reconstruction, and it is notoriously difficult. Unlike a video feed where camera frames are millimeters apart, wide-baseline images are taken far apart (often in different rooms). The visual overlap is small, and traditional computer vision algorithms struggle to stitch these “islands” of data together into a coherent map.

Traditionally, we had two ways to solve this. We could use geometric optimization (like Bundle Adjustment), which is mathematically precise but brittle when data is scarce. Or we could use Deep Learning, which is great at pattern recognition but often hallucinates structures that don’t exist.

Enter BADGR (Bundle Adjustment Diffusion Conditioned by GRadients).

In this post, we are going to break down a fascinating paper that proposes a hybrid approach. BADGR doesn’t choose between geometry and AI; it fuses them. It uses a Diffusion Model—the same tech behind DALL-E and Stable Diffusion—but conditions it with “gradients” derived from classical geometric optimization. The result? A system that can “dream” a floor plan that is mathematically consistent with what the cameras actually saw.

Figure 1: Overview of BADGR showing the progression from coarse input to refined output.

As shown in Figure 1, the system takes noisy, disconnected observations and iteratively refines them into a clean, connected floor plan. Let’s explore how this works.


Background: The Two Pillars

To understand BADGR, we need to quickly review the two pillars it stands on: Bundle Adjustment and Diffusion Models.

1. Bundle Adjustment (The “Old School” Geometry)

Bundle Adjustment (BA) is the gold standard in 3D reconstruction. It’s an optimization problem: you have a set of cameras and a set of 3D points, and you want to adjust the positions of the cameras and the points simultaneously so that, if you projected the 3D points back onto each camera’s image, they would line up perfectly with the pixels in your photos (a general form is sketched after the list below).

  • Pros: Highly accurate when you have lots of overlapping images.
  • Cons: It requires a good initial guess. If you start too far from the truth, the math gets stuck in a “local minimum” and gives you a warped result. It also doesn’t know what a “room” is—it just sees points.
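
Written out, the classic BA objective is a sum of squared reprojection errors over all cameras and points (generic notation here, not the paper’s):

\[
\min_{\{\mathcal{T}_i\},\,\{X_j\}} \;\sum_{i,j} \big\| \pi(\mathcal{T}_i, X_j) - x_{ij} \big\|^2
\]

where \(\mathcal{T}_i\) is the pose of camera \(i\), \(X_j\) is a 3D point, \(\pi\) is the projection function, and \(x_{ij}\) is the pixel where point \(j\) was observed in image \(i\). Solvers like Levenberg–Marquardt descend this objective iteratively, which is exactly why a bad starting point can leave them stuck in a poor local minimum.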

2. Diffusion Models (The Generative AI)

Diffusion models are generative models trained to denoise data. During training, you take a clean image (or floor plan) and slowly add noise until it’s garbage. Then, you train a neural network to reverse that process—to look at the garbage and predict the clean version.

  • Pros: They learn “priors.” A diffusion model trained on floor plans knows that walls are usually straight, rooms are closed loops, and corners meet at 90 degrees.
  • Cons: Pure generation can ignore reality. A diffusion model might generate a beautiful floor plan that has nothing to do with the actual house you photographed.
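
To make the training recipe concrete, here is a minimal sketch of the forward (noising) step applied to floor-plan corner coordinates; the cosine schedule and the toy scene are illustrative assumptions, not the paper’s settings.

```python
import numpy as np

def add_noise(x_clean, t, T=1000):
    """Forward diffusion: blend clean data with Gaussian noise at step t."""
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # simple cosine schedule (assumed)
    noise = np.random.randn(*x_clean.shape)
    return np.sqrt(alpha_bar) * x_clean + np.sqrt(1.0 - alpha_bar) * noise, noise

# A toy "scene": four 2D room corners in meters. A denoiser network would be
# trained to recover x_clean (or the injected noise) from x_noisy at every step t.
x_clean = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
x_noisy, eps = add_noise(x_clean, t=600)
```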

The BADGR Insight

The authors of BADGR realized that existing “guided diffusion” methods were inefficient. Usually, people train a diffusion model and then try to “guide” it during inference using a loss function. BADGR takes a different approach: it calculates the geometric adjustments (the gradients) explicitly and feeds them into the diffusion model as an input condition.


The Core Method: Inside BADGR

The goal of BADGR is to estimate two things simultaneously:

  1. Layouts: The 2D polygons defining the rooms.
  2. Poses: Where the cameras were located when they took the photos.

The architecture is a loop that refines a noisy scene into a clean one. Let’s look at the high-level architecture.

Figure 2: The architecture of BADGR.

As visualized in Figure 2, the process involves taking a scene state (layouts + poses) and passing it through a Planar BA Layer. This layer calculates how “wrong” the current state is compared to the images. These errors (adjustments) are compressed and fed into a Transformer, which predicts the denoised scene.
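
In pseudocode, that refinement loop looks roughly like this; the component internals below are placeholders standing in for the paper’s modules, not their actual implementations.

```python
def planar_ba_layer(scene, observations):
    """One LM-style step: dense per-column wall and pose adjustments."""
    return {"d_walls": ..., "d_poses": ...}

def column_geometry_encoder(adjustments):
    """Compress dense adjustments into per-wall / per-camera embeddings."""
    return ...

def transformer_denoiser(scene, guidance, t):
    """Predict the denoised scene from the noisy scene plus geometric guidance."""
    return scene

def refine(scene, observations, num_steps=50):
    for t in reversed(range(num_steps)):           # reverse-diffusion iterations
        adjustments = planar_ba_layer(scene, observations)
        guidance = column_geometry_encoder(adjustments)
        scene = transformer_denoiser(scene, guidance, t)
    return scene
```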

Let’s break this down into its three critical components.

1. The Planar Bundle Adjustment Layer

This is the “geometric engine” of the system. In standard Bundle Adjustment, we match distinct feature points (like the corner of a table). But in wide-baseline indoor scenes, feature points are unreliable because walls are often textureless (plain white paint).

BADGR uses a column-wise approach. It treats every vertical column of pixels in a panorama as a measurement ray.

Figure 3: Column-wise planar BA module explaining ray casting and projection.

Here is the logic shown in Figure 3 (a small numeric sketch follows the list):

  1. Ray Casting: For every column in the panorama, the model casts a ray into the 3D world based on the current estimated camera pose.
  2. Intersection: It calculates where that ray hits the current estimated wall (\(l_{m,k}\)).
  3. Reprojection: It projects that intersection point back onto the image to see where the wall should appear vs. where the image segmentation says the floor boundary is (\(\tilde{B}^{i,c}\)).
  4. Adjustment: It calculates the difference (error).
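
Here is that column measurement in miniature, for a single panorama column in 2D; the room geometry, camera height, and “observed” boundary value are made-up numbers for illustration.

```python
import numpy as np

def intersect_ray_with_wall(cam_xy, azimuth, wall_p0, wall_p1):
    """Hit point of a 2D ray (one panorama column) with a wall segment, or None."""
    d = np.array([np.cos(azimuth), np.sin(azimuth)])
    w = wall_p1 - wall_p0
    A = np.column_stack([d, -w])                 # solve cam + t*d == p0 + s*w
    if abs(np.linalg.det(A)) < 1e-9:
        return None
    t, s = np.linalg.solve(A, wall_p0 - cam_xy)
    return cam_xy + t * d if t > 0 and 0.0 <= s <= 1.0 else None

cam_xy, cam_height = np.array([1.0, 1.0]), 1.5
hit = intersect_ray_with_wall(cam_xy, np.deg2rad(30), np.array([0.0, 3.0]), np.array([5.0, 3.0]))
dist = np.linalg.norm(hit - cam_xy)
predicted = -np.arctan2(cam_height, dist)   # elevation angle of the floor/wall boundary
observed = -0.60                            # would come from the per-column segmentation
column_error = predicted - observed         # the per-column residual to be minimized
```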

Crucially, BADGR doesn’t just calculate the error; it calculates the gradient—the specific mathematical “nudge” required to minimize that error. It uses a single step of the Levenberg-Marquardt (LM) algorithm, a standard non-linear optimization technique.

Algorithm 1: The logic for calculating wall and pose adjustments.

As outlined in Algorithm 1 above, for every column, the system computes:

  • \(\Delta b\): How much to move the wall.
  • \(\Delta \mathcal{T}\): How much to move the camera.

This results in a massive set of “dense adjustments”—thousands of tiny suggestions from every pixel column about where the walls and cameras should move.
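
A single damped least-squares (Levenberg–Marquardt style) step looks like the following; the residuals, Jacobian, and damping value here are generic stand-ins, since the paper has its own wall/pose parameterization.

```python
import numpy as np

def lm_step(J, r, damping=1e-3):
    """Solve (J^T J + damping * I) * delta = -J^T r for the update delta."""
    JTJ = J.T @ J
    return np.linalg.solve(JTJ + damping * np.eye(JTJ.shape[0]), -J.T @ r)

# Toy numbers: 200 column residuals, 5 unknowns (say, a wall offset plus pose terms).
rng = np.random.default_rng(0)
J = rng.normal(size=(200, 5))     # Jacobian of residuals w.r.t. the unknowns
r = rng.normal(size=200)          # per-column reprojection residuals
delta = lm_step(J, r)             # the "nudge" applied to walls and cameras
```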

2. The Column Geometry Encoder

We now have thousands of raw geometric suggestions. We can’t feed all of them directly into a Transformer; it’s too much data.

The Column Geometry Encoder acts as a compressor. It takes the dense per-column adjustments (\(\Delta b, \Delta \mathcal{T}\)) and aggregates them. It uses Fourier features to encode the statistics (mean and standard deviation) of these adjustments for each wall and each camera.

This step converts raw mathematical gradients into a rich “feature embedding” that the neural network can understand. It effectively translates “Geometry-speak” (move 5cm left) into “Neural-speak” (a vector representing a spatial discrepancy).
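
A rough sketch of that compression step, as described above: the mean/std aggregation and the Fourier encoding follow the text, while the exact dimensions and frequencies are assumptions.

```python
import numpy as np

def fourier_features(x, num_freqs=4):
    """Lift scalars into sin/cos features at geometrically spaced frequencies."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = np.outer(np.atleast_1d(x), freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Suppose 300 columns each voted on how far one wall should move (in meters).
delta_b = np.random.normal(loc=0.05, scale=0.02, size=300)
stats = np.array([delta_b.mean(), delta_b.std()])    # aggregate the dense votes
embedding = fourier_features(stats).ravel()          # one guidance vector for this wall
```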

3. The Transformer Denoiser

This is the “brain” of BADGR. It is a self-attention Transformer, similar to the architecture used in Large Language Models, but adapted for geometry.

It takes two inputs:

  1. The Noisy Scene: The current estimated coordinates of the room corners and cameras.
  2. The Geometric Guidance: The embeddings from the BA layer we just discussed.

The Transformer processes these inputs to reason about the global structure. For example, the BA layer might say “move Wall A left” based on Camera 1, and “move Wall A right” based on Camera 2. A simple optimizer might just average them and fail. The Transformer, however, looks at the context. It might realize that moving Wall A left would make the room shape impossible (e.g., self-intersecting), so it prioritizes the structural plausibility learned from its training data.
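
One plausible way to wire those two inputs together (an assumption on my part, not the paper’s exact tokenization): each corner and camera becomes one token whose features concatenate its noisy coordinates with its BA guidance embedding, and self-attention then mixes information across the whole scene.

```python
import torch
import torch.nn as nn

coords = torch.randn(12, 4)        # 12 scene tokens (corners + cameras), noisy coords
guidance = torch.randn(12, 16)     # per-token embedding from the BA layer
tokens = torch.cat([coords, guidance], dim=-1)

proj = nn.Linear(tokens.shape[-1], 64)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
denoised_features = encoder(proj(tokens).unsqueeze(0))   # attends over the whole scene
```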

The Training Objective

BADGR is trained to reverse the noise process. But it’s not just trained to match the ground truth coordinates (L2 loss); it is also trained to minimize the reprojection error—ensuring the generated room actually lines up with the photos.

The loss function combines these goals:

Loss Function Equation

Where \(\mathcal{L}_{L2}\) is the standard coordinate distance loss:

L2 Loss Equation

And \(\mathcal{L}_{proj}\) checks if the projected walls match the image boundaries:

Projection Loss Equation
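
The paper defines the exact terms and weights; in general form, the objective described above is roughly the following, where \(\lambda\) is an assumed weighting factor:

\[
\mathcal{L} \;=\; \mathcal{L}_{L2} \;+\; \lambda\,\mathcal{L}_{proj},
\qquad
\mathcal{L}_{L2} \;=\; \big\| \hat{x}_{0} - x_{0} \big\|_2^2
\]

with \(\hat{x}_{0}\) the predicted clean scene (corner and camera coordinates), \(x_{0}\) the ground truth, and \(\mathcal{L}_{proj}\) penalizing the distance between the reprojected walls and the observed floor boundaries \(\tilde{B}^{i,c}\) in each column.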

This dual-loss approach ensures the model satisfies both the geometric evidence (the photos) and the structural priors (what valid rooms look like).


Experiments and Results

Does this complex hybrid architecture actually work better than existing methods? The authors tested BADGR on several datasets, including ZInD (Zillow Indoor Dataset) and RPLAN.

Qualitative Results: Seeing is Believing

Let’s look at the visual evidence. In Figure 4, we see a comparison between the “Before” (Initial) state and the “After” (BADGR) state.

Figure 4: Qualitative results comparing initial noisy layouts with BADGR optimized layouts.

Pay attention to the middle column (“Optimized”). Notice how the walls—which are jagged and disconnected in the “Initial” phase—become straight, connected, and structurally sound. The “stars” representing camera positions also shift to their correct locations to maximize view consistency.

The model is even capable of handling extreme noise. In experiments using the RPLAN dataset (see Figure 12 below), the researchers added significant Gaussian noise to the inputs. BADGR successfully pulled a coherent floor plan out of the chaos.

Figure 12: BADGR refining RPLAN layouts with high input noise.

Quantitative Results: The Numbers

The quantitative tables highlight the precision of BADGR. The metric “Visible walls (cm)” measures the average error in wall position.

Table 1: Pose and layout error on ZInD dataset.

In Table 1, compare BADGR against CovisPose+ (the baseline initialization) and BA-Only (standard optimization without the diffusion model).

  • 0.6 Images/Room (Extreme Sparsity): BADGR achieves a median layout error of 4.5 cm, compared to 6.8 cm for BA-Only.
  • Pose Translation: BADGR reduces camera position error significantly (from ~12cm in BA-Only to ~9cm).

This proves that the diffusion component isn’t just “imagining” details; it is actively helping the optimization solve the geometry more accurately than math alone.

Robustness to Noise

One of the most impressive aspects of BADGR is its resilience. The authors ran ablation studies (tests where they remove parts of the model to see what happens).

Table 4: Ablation analysis showing the importance of the BA layer.

Table 4 shows that if you remove the Planar BA Layer (the “BA Inputs” column), the error skyrockets (from 7.2 cm to 8.6 cm). If you remove the reprojection loss, it also gets worse. This confirms that the combination of learned priors and explicit geometric gradients is essential. Neither works as well in isolation.


Conclusion & Implications

BADGR represents a significant step forward in computer vision and indoor mapping. By treating “geometric optimization” as a condition for “generative diffusion,” the authors have created a system that gets the best of both worlds.

Here are the key takeaways:

  1. Gradients as Conditioning: Instead of hard-coding the optimization, BADGR uses geometric gradients as hints for a neural network.
  2. Structural Plausibility: The diffusion model acts as a powerful regularizer, ensuring that walls form logical rooms even when the visual data is sparse or noisy.
  3. Data Efficiency: The system works remarkably well with very few images (even fewer than one image per room on average), making it viable for quick capture scenarios.

For students and researchers, BADGR is a masterclass in Differentiable Programming. It shows that we don’t have to discard classical algorithms (like Levenberg-Marquardt) in the age of Deep Learning. Instead, we can embed them deeply into our neural architectures to ground our AI in physical reality.

Future applications for this technology are vast, ranging from autonomous robot navigation (which needs accurate maps) to real estate (generating floor plans from a few smartphone photos) and augmented reality.

The code and project details can be found on the authors’ project website, opening the door for further innovations in hybrid geometric-generative models.