Introduction
Imagine a robot entering a kitchen for the first time. It sees a mug on the counter. To pick it up, the robot needs to know exactly where that mug is in 3D space and how it is oriented—a problem known as 6D pose estimation.
Historically, this has been a rigid process. Robots relied on exact CAD models (digital blueprints) of every object they might encounter. If you didn't have the CAD file for that exact mug, the robot was blind. Recent advances in deep learning have tried to fix this with "category-level" training (teaching a robot what a generic mug looks like) or "model-free" approaches that reconstruct objects on the fly.
However, these newer methods face a critical flaw: Hallucination.
When a robot reconstructs a 3D object from a single 2D picture, it has to guess what the back of the object looks like. Generative models (specifically diffusion models) are great at guessing, but they often "hallucinate" incorrect geometry. If the robot trusts a hallucinated handle that doesn't exist, the grasp fails.
This brings us to UnPose, a groundbreaking framework presented by researchers from Huawei Noah’s Ark Lab and the University of Toronto. UnPose introduces a principled way to handle the “unknowns” in zero-shot pose estimation. Instead of blindly trusting a generative model, UnPose calculates uncertainty. It knows when it is guessing and when it is sure, allowing it to build 3D representations that refine themselves over time.

In this post, we will deconstruct UnPose. We will explore how it leverages diffusion models without falling victim to their hallucinations, how it utilizes 3D Gaussian Splatting for real-time mapping, and how it allows robots to manipulate novel objects with zero prior knowledge.
The Core Problem: Epistemic Uncertainty
To understand UnPose, we must first understand the limitations of current “Model-Free” pose estimation.
In a model-free setup, the system takes a single image of an object and tries to build a 3D model. Modern approaches often use Diffusion Models—the same technology behind DALL-E or Midjourney—trained to turn 2D images into 3D shapes.
However, a single view is insufficient to define a 3D shape. When a diffusion model fills in the unseen back of an object, it is making a prediction based on its training data. This leads to Epistemic Uncertainty—uncertainty arising from a lack of knowledge.
Existing methods typically treat the generated 3D model as ground truth. They do not quantify how confident the model is in its own prediction. If the model hallucinates a feature that isn’t there, the robot treats it as solid geometry, leading to tracking failures or failed grasps. UnPose addresses this by asking the diffusion model not just what it sees, but how confident it is.
The UnPose Framework
UnPose is a pipeline that transforms a single RGB-D image into a refined, textured 3D model with an accurate 6D pose. It does this incrementally, improving the model as the robot moves and observes more of the object.

As shown in Figure 2, the architecture is composed of four main pillars:
- Initialization: Generating an initial guess with uncertainty.
- 3D Gaussian Splatting (3DGS) Mapping: Building the object representation.
- Pose Estimation: Determining position and orientation.
- Backend Optimization: Refining the graph as new frames arrive.
Let’s break these down step-by-step.
1. Initialization: Taming the Diffusion Model
The process begins with a single RGB-D frame. The system segments the object (using standard segmentation tools) and passes the image to a pre-trained multi-view diffusion model (specifically Wonder3D).
Normally, Wonder3D would simply output images of the object from different angles. UnPose, however, needs to know the uncertainty of these pixels. To achieve this without training a new model from scratch, the authors use a technique called Last-Layer Laplace Approximation (LLLA).
Estimating Uncertainty via Bayesian Inference
The goal is to approximate the distribution of the noise prediction \(\epsilon_t\) given the noisy state \(\mathbf{x}_t\) at a specific diffusion timestep \(t\). The authors approximate this as a Gaussian distribution:

\[
p(\epsilon_t \mid \mathbf{x}_t) \;\approx\; \mathcal{N}\!\big(\epsilon_{\theta}(\mathbf{x}_t, t),\; \Sigma_{\epsilon_t}\big)
\]
Here, \(\epsilon_{\theta}\) is the standard prediction from the diffusion model, and \(\Sigma_{\epsilon_t}\) is the covariance derived from LLLA.
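To make LLLA concrete, here is a minimal sketch of a last-layer Laplace approximation for a linear output head with a Gaussian likelihood. The function names, shapes, and the simplified shared covariance are illustrative assumptions, not the authors' code.

```python
import torch

def fit_last_layer_laplace(features, W_map, prior_precision=1.0, noise_var=1.0):
    """Fit a Laplace (Gaussian) posterior over the last linear layer only.

    features: (N, d) penultimate-layer activations of the fitting data.
    W_map:    (out_dim, d) trained (MAP) weights of the final linear layer.
    With a Gaussian likelihood, the last-layer Hessian is exact and shared across outputs.
    """
    d = features.shape[1]
    H = features.T @ features / noise_var + prior_precision * torch.eye(d)
    H_inv = torch.linalg.inv(H)  # posterior covariance over each weight row

    def predict(phi):
        """phi: (M, d) features of new inputs -> mean prediction and per-output predictive variance."""
        mean = phi @ W_map.T                                      # the network's usual output
        weight_var = ((phi @ H_inv) * phi).sum(-1, keepdim=True)  # phi^T H^{-1} phi for each input
        var = weight_var + noise_var                              # add observation noise
        return mean, var.expand_as(mean)

    return predict
```

The appeal of the last-layer variant is exactly what UnPose needs: the pre-trained diffusion network stays frozen, and only a cheap Gaussian posterior over its final layer is fitted to obtain \(\Sigma_{\epsilon_t}\).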
However, diffusion models work in iterative steps. Uncertainty at step \(t\) affects step \(t-1\). The authors must propagate this uncertainty through the reverse diffusion process (specifically the DDIM sampling steps). The update rule for the image state is:

\[
\mathbf{x}_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\;\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t}{\sqrt{\bar{\alpha}_t}} \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_t
\]
To track how uncertainty flows through this equation, the researchers derive a variance update rule. Writing the DDIM step above as a linear combination \(\mathbf{x}_{t-1} = a_t\,\mathbf{x}_t + b_t\,\epsilon_t\), the variance of the next step follows from the current variance and the uncertainty of the noise prediction:

\[
\text{Var}(\mathbf{x}_{t-1}) \;=\; a_t^{2}\,\text{Var}(\mathbf{x}_t) \;+\; b_t^{2}\,\text{Var}(\epsilon_t) \;+\; 2\,a_t b_t\,\text{Cov}(\mathbf{x}_t, \epsilon_t)
\]

with \(a_t = \sqrt{\bar{\alpha}_{t-1}/\bar{\alpha}_t}\) and \(b_t = \sqrt{1-\bar{\alpha}_{t-1}} - \sqrt{\bar{\alpha}_{t-1}(1-\bar{\alpha}_t)/\bar{\alpha}_t}\).
A key challenge here is the term \(\text{Cov}(\mathbf{x}_t, \epsilon_t)\), which represents the covariance (correlation) between the noisy image state and the model's prediction. Since this term has no tractable closed form, UnPose estimates it using Monte Carlo sampling over paired samples \((\mathbf{x}_t^{(i)}, \epsilon_t^{(i)})\):

\[
\text{Cov}(\mathbf{x}_t, \epsilon_t) \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{x}_t^{(i)} - \bar{\mathbf{x}}_t\big)\big(\epsilon_t^{(i)} - \bar{\epsilon}_t\big)
\]

where \(\bar{\mathbf{x}}_t\) and \(\bar{\epsilon}_t\) are the sample means.
By running this process, UnPose obtains not just 3D views, but pixel-wise uncertainty maps.
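As a rough sketch of how that propagation could be implemented (array shapes and the source of the paired samples are assumptions; the coefficients follow the deterministic DDIM step above):

```python
import numpy as np

def ddim_step_with_variance(x_t, eps_mean, eps_var, var_x_t, cov_x_eps,
                            alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step plus propagation of per-pixel variance.

    x_t, eps_mean, eps_var, var_x_t, cov_x_eps all share the same pixel-wise shape.
    """
    a = np.sqrt(alpha_bar_prev / alpha_bar_t)                       # coefficient of x_t
    b = (np.sqrt(1.0 - alpha_bar_prev)
         - np.sqrt(alpha_bar_prev * (1.0 - alpha_bar_t) / alpha_bar_t))  # coefficient of eps

    x_prev = a * x_t + b * eps_mean                                  # standard DDIM update
    # Var(a*x + b*eps) = a^2 Var(x) + b^2 Var(eps) + 2ab Cov(x, eps)
    var_prev = a**2 * var_x_t + b**2 * eps_var + 2.0 * a * b * cov_x_eps
    return x_prev, var_prev

def monte_carlo_cov(x_samples, eps_samples):
    """Pixel-wise Monte Carlo estimate of Cov(x_t, eps_t) from N paired samples of shape (N, H, W)."""
    x_c = x_samples - x_samples.mean(axis=0)
    e_c = eps_samples - eps_samples.mean(axis=0)
    return (x_c * e_c).mean(axis=0)
```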
Visualizing Uncertainty
The results of this process are intuitive. In the figure below, look at the “Variance” column. The diffusion model is confident (purple/blue) about the parts of the “Master Chef” can that resemble the input view. However, for the back view or unseen angles, the variance spikes (red/white).

This variance map is the “secret weapon” of UnPose. It tells the subsequent steps: “Trust the front of the can, but be very skeptical about the back.”
2. 3D Gaussian Splatting (3DGS) Mapping
With the initial multi-view images and their uncertainty maps, UnPose constructs a 3D representation. Instead of using heavy meshes or slow NeRFs (Neural Radiance Fields), the authors utilize 3D Gaussian Splatting (3DGS).
3DGS represents a scene as a cloud of 3D Gaussians (blobs), each with position, color, opacity, and size. It is incredibly fast to render and easy to update.
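To make the representation concrete, a single splat can be pictured as a small record like the one below. This is only a sketch: field names are illustrative, and full implementations store spherical-harmonic color coefficients and log-scales.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianPrimitive:
    """One splat in a 3DGS map (illustrative parameterization)."""
    position: np.ndarray   # (3,) center in object/world coordinates
    rotation: np.ndarray   # (4,) unit quaternion orienting the anisotropic covariance
    scale: np.ndarray      # (3,) per-axis extent of the Gaussian
    color: np.ndarray      # (3,) RGB color used during rendering
    opacity: float         # blending weight used in alpha compositing
```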
The critical innovation here is how the uncertainty maps guide the creation of these Gaussians. The mapping loss weights each pixel's residuals by the diffusion uncertainty; schematically:

\[
\mathcal{L}_{\text{map}} \;=\; \sum_{p} w(p)\,\big(D(p) + C(p)\big)
\]

In this equation:
- \(D(p)\) and \(C(p)\) are the depth and color residuals at pixel \(p\) between the rendered Gaussians and the target view.
- \(w(p)\) is a per-pixel weight derived from the diffusion model's variance.
If the diffusion model is uncertain about a pixel, \(w(p)\) is small and that pixel barely influences the optimization. This prevents the 3D model from being corrupted by hallucinations. As the robot observes real data from its camera, those observations (which have low uncertainty) naturally overpower the weak, uncertain priors from the diffusion model. This ensures the 3D model remains "open-minded" to correction.
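Here is a minimal PyTorch-style sketch of such an uncertainty-weighted loss. The rendering step is omitted, and the weighting function \(w(p) = 1/(1+\sigma^2(p))\) is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def uncertainty_weighted_mapping_loss(rendered_rgb, target_rgb,
                                      rendered_depth, target_depth,
                                      pixel_variance, lambda_depth=0.5):
    """Down-weight residuals at pixels where the diffusion prior is uncertain.

    pixel_variance: per-pixel variance from the diffusion model (near zero for real
    camera observations), so real data gets full weight and hallucinated regions get little.
    """
    weight = 1.0 / (1.0 + pixel_variance)                           # assumed weighting scheme
    color_res = (rendered_rgb - target_rgb).abs().mean(dim=-1)      # per-pixel L1 color residual
    depth_res = (rendered_depth - target_depth).abs()               # per-pixel L1 depth residual
    return (weight * (color_res + lambda_depth * depth_res)).mean()
```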
3. Pose Estimation & Backend Optimization
Once the initial 3DGS field is built, the system needs to estimate the 6D pose relative to the camera.
UnPose uses a “Pose Graph” approach. It treats every observation (keyframe) as a node in a graph, connected by edges representing relative transformations.
- Pose Estimation: The system renders the current 3DGS model and compares it to the live camera observation using a transformer-based network (adapted from FoundationPose).
- Geometric Optimization: To ensure global consistency, UnPose optimizes the graph. It minimizes the geometric error between matched 3D points across frames.
Crucially, the optimization is also weighted by uncertainty; schematically, the objective has the form:

\[
\min_{\{\mathbf{T}_k\}} \;\sum_{(i,j)}\;\sum_{m} w_m \,\big\| \mathbf{T}_i\,\mathbf{p}_m^{i} - \mathbf{T}_j\,\mathbf{p}_m^{j} \big\|^2
\]

where \(\mathbf{T}_i\) and \(\mathbf{T}_j\) are keyframe poses, \(\mathbf{p}_m^{i}\) and \(\mathbf{p}_m^{j}\) are matched 3D points in the two frames, and \(w_m\) down-weights matches that originate from high-uncertainty (diffusion-generated) geometry.
This optimization ensures that “real” frames (from the camera) pull the “virtual” frames (from diffusion) into alignment, scaling and correcting the hallucinated geometry to match reality.
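As a rough illustration (not the paper's implementation), the cost contributed by one pose-graph edge could be written as:

```python
import numpy as np

def edge_cost(T_i, T_j, points_i, points_j, point_variance):
    """Uncertainty-weighted geometric error between matched 3D points of keyframes i and j.

    T_i, T_j:        (4, 4) keyframe poses (camera-to-world).
    points_i/j:      (N, 3) matched 3D points expressed in each keyframe's camera frame.
    point_variance:  (N,) uncertainty of the matches (high for diffusion-generated "virtual" frames).
    """
    def transform(T, pts):
        return pts @ T[:3, :3].T + T[:3, 3]          # apply rotation and translation

    diff = transform(T_i, points_i) - transform(T_j, points_j)
    weights = 1.0 / (1.0 + point_variance)           # assumed weighting: uncertain matches count less
    return np.sum(weights * np.sum(diff**2, axis=1)) # scalar cost this edge adds to the graph
```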
Handling Loop Closures and Relocalization
If the robot loses track of the object (due to occlusion or fast motion), UnPose uses the 3DGS model for relocalization. Because the diffusion model provided some guess for the back of the object (even if uncertain), the system has a weak prior to latch onto if the object is viewed from a new angle, allowing it to recover tracking where other methods might fail.
Experimental Results
Does adding uncertainty actually help? The authors tested UnPose against state-of-the-art methods like GigaPose, SAM-6D, and FoundationPose on standard benchmarks (YCB-Video and LM-O).
Quantitative Accuracy
The graphs below show the Mean ADD (Average Distance of Model Points) score. Higher is better. The X-axis represents the number of input frames.

- Single View Performance: Even with just one frame, UnPose (Red) starts higher than most competitors because its initial diffusion guess is uncertainty-aware.
- Incremental Improvement: Notice the steep upward trend. As UnPose gets more views (moving from 1 to 16), it effectively fuses new data to correct the initial model. Other methods often plateau because they are stuck with their initial, hallucinated geometry.
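For reference, the underlying ADD metric measures how far the object model's points land from their ground-truth locations under the estimated pose; benchmarks then convert these distances into a score (e.g., the area under an accuracy-threshold curve), which is why higher is better in the plots. A minimal version of ADD itself:

```python
import numpy as np

def add_metric(R_est, t_est, R_gt, t_gt, model_points):
    """Average Distance of Model Points (ADD) between an estimated and a ground-truth pose.

    model_points: (N, 3) points sampled from the object model. Lower distance is better.
    """
    pts_est = model_points @ R_est.T + t_est
    pts_gt = model_points @ R_gt.T + t_gt
    return np.linalg.norm(pts_est - pts_gt, axis=1).mean()
```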
Reconstruction Quality
The ability to reconstruct the object’s shape is just as important as knowing its position.

In Figure 6, compare Ours (UnPose) with BundleSDF and GOM.
- BundleSDF often produces “melty” or incomplete shapes.
- GOM can suffer from noise.
- UnPose produces a sharp, clean mesh that closely resembles the Ground Truth (GT), especially when 16 views are available.
The table below quantifies this. UnPose achieves a significantly lower Chamfer Distance (CD)—a measure of error between the 3D model and the real object—while being much faster than diffusion-heavy methods like Wonder3D.
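For context, Chamfer Distance can be computed between two sampled point clouds as below; this is a brute-force sketch, and practical evaluations use KD-trees or GPU nearest-neighbour search.

```python
import numpy as np

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer Distance between (N, 3) and (M, 3) point clouds.

    For every point in one cloud, take the squared distance to its nearest neighbour
    in the other cloud, then average both directions.
    """
    d2 = np.sum((pts_a[:, None, :] - pts_b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```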

Real-World Robotic Application
The ultimate test is a physical robot. The authors deployed UnPose on a robotic arm to grasp a mug. This scenario perfectly illustrates the “handle problem.”
- Initial View: The robot sees the mug, but the handle is occluded (hidden at the back).
- Initial Guess: The diffusion model guesses there might be a handle but assigns it high uncertainty.
- Action: The robot moves. As the camera reveals the side of the mug, the high-uncertainty regions are updated with real data.
- Refinement: The 3DGS model updates, and the handle’s geometry becomes solid and certain.
- Grasp: The robot successfully plans a grasp on the newly confirmed handle.


Conclusion
UnPose represents a significant shift in how we approach “Model-Free” robotics. Rather than trying to build a perfect reconstruction algorithm that never makes mistakes, UnPose accepts that AI models will hallucinate. By quantifying that uncertainty via Bayesian inference and incorporating it into a flexible 3D Gaussian Splatting framework, it builds systems that are honest about what they don’t know.
This approach enables:
- Zero-Shot capability: No CAD models required.
- Global Consistency: Integrating multi-view diffusion priors with real-world observations.
- Robustness: Preventing hallucinations from causing grasp failures.
As robots move into unstructured environments—homes, warehouses, and the outdoors—the ability to adaptively learn and refine 3D geometry on the fly will be essential. UnPose proves that sometimes, the key to better perception isn’t just seeing better; it’s knowing how unsure you are.