Bridging the Reality Gap: How CRISP Masters 3D Object Perception with Test-Time Adaptation
Imagine a robot arm tasked with cleaning up space debris. It sees a satellite floating in orbit. To grab it safely, the robot needs to know two things with high precision: where the satellite is (its pose) and what it looks like geometrically (its shape).
In a controlled lab environment with perfect data, this is a solvable problem. But in the real world—or in space—lighting changes, sensors add noise, and objects might look slightly different than the 3D models the robot was trained on. This is known as the domain gap, and it is one of the biggest hurdles in computer vision today.
In this post, we will deep-dive into a research paper from MIT titled “CRISP: Object Pose and Shape Estimation with Test-Time Adaptation.” This paper introduces a fascinating pipeline that doesn’t just guess an object’s shape and pose; it actively corrects itself at test time and uses those corrections to teach itself, bridging the gap between simulation and reality.

As shown in Figure 1, CRISP is versatile enough to handle everything from a can of Spam on a kitchen table to a satellite orbiting Earth. Let’s explore how it works.
The Challenge: Category-Agnostic Perception
Traditional approaches to object perception often fall into two buckets:
- Instance-level: The robot has an exact CAD model of the specific cup it is looking for. This doesn’t scale well to new objects.
- Category-level: The robot knows generally what “cups” look like. This is better, but often requires rigid category definitions.
CRISP (which stands for Cropped RGB-D Inference Shape Pipeline) aims to be category-agnostic. It shouldn’t need to know if it’s looking at a “bottle” or a “can” to reconstruct its geometry. Furthermore, it tackles the Sim-to-Real problem. If you train a neural network on synthetic images (like video game graphics), it usually fails when shown a real photo. CRISP’s superpower is its ability to adapt to real data without requiring human labels.
Part 1: The Architecture of CRISP
CRISP takes a single RGB-D image (color + depth) as input and outputs the object’s 6D Pose (rotation and translation) and its 3D Shape. It does this through two parallel neural network branches powered by a Vision Transformer (ViT) backbone.
1. Estimating Shape with Implicit Fields
How do you represent a 3D shape in a neural network? You could output a point cloud or a voxel grid, but these can be low-resolution or memory-hungry. CRISP uses an Implicit Neural Representation, specifically a Signed Distance Field (SDF).
An SDF is a function that, for any point \((x, y, z)\) in space, tells you the distance to the nearest surface. If the number is negative, you are inside the object; if positive, you are outside; if zero, you are on the surface.
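To make this concrete, here is a tiny NumPy example of an analytic SDF for a sphere, following the sign convention above (an illustration only, not part of CRISP):

```python
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Signed distance from each 3D point to a sphere's surface.

    Negative inside the sphere, positive outside, zero on the surface.
    """
    return np.linalg.norm(points - center, axis=-1) - radius

# Query three points: inside, on the surface of, and outside a unit sphere.
queries = np.array([[0.0, 0.0, 0.0],   # center  -> -1.0 (inside)
                    [1.0, 0.0, 0.0],   # surface ->  0.0
                    [2.0, 0.0, 0.0]])  # outside -> +1.0
print(sphere_sdf(queries))  # [-1.  0.  1.]
```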

The shape pipeline works as follows:
- Encoder: The ViT backbone extracts features from the image. A “Shape Head” compresses these into a latent shape code (\(h\)).
- Decoder: This is where the magic happens. The decoder is a Multi-Layer Perceptron (MLP) that takes a 3D coordinate and the shape code \(h\) to predict the SDF value.
- FiLM Conditioning: Notice the blocks labeled “FiLM” in the diagram above. Feature-wise Linear Modulation (FiLM) is a technique where the shape code \(h\) modulates the neural network’s activations (scaling and shifting them). This allows a single network to represent infinite variations of shapes simply by changing the input code \(h\).
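To make the conditioning concrete, here is a toy FiLM-modulated SDF decoder in PyTorch. It is a minimal sketch under assumed dimensions (a small MLP with per-layer scale and shift generated from \(h\)), not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FiLMSDFDecoder(nn.Module):
    """Toy FiLM-conditioned SDF decoder: the shape code h produces per-layer
    scales (gamma) and shifts (beta) that modulate the MLP's activations."""

    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(3, hidden)       # takes a 3D query point
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)       # predicts one SDF value
        # FiLM generator: maps h -> (gamma, beta) for each hidden layer.
        self.film = nn.Linear(latent_dim, 4 * hidden)

    def forward(self, xyz, h):
        gamma1, beta1, gamma2, beta2 = self.film(h).chunk(4, dim=-1)
        x = torch.relu(gamma1 * self.fc1(xyz) + beta1)   # FiLM block 1
        x = torch.relu(gamma2 * self.fc2(x) + beta2)     # FiLM block 2
        return self.out(x)                               # signed distance

decoder = FiLMSDFDecoder()
points = torch.randn(1024, 3)              # query points
h = torch.randn(1, 64).expand(1024, -1)    # one shape code, broadcast
sdf_values = decoder(points, h)            # shape (1024, 1)
```

Changing only `h` changes which shape the same decoder represents, which is exactly what lets CRISP reuse one network across many objects.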
2. Estimating Pose with PNCs
To find the object’s position and orientation (Pose), CRISP doesn’t just regress a single vector. Instead, it predicts Pose-Normalized Coordinates (PNC).
Imagine the object exists in a canonical “perfect” space centered at \((0,0,0)\). For every pixel in the input image, the network predicts where that pixel would be in that perfect 3D space.

As seen in the architecture diagram above, a Dense Prediction Transformer (DPT) predicts these coordinates for every pixel. Once we have the observed 3D points (from the depth camera) and the predicted canonical 3D points (from the network), finding the rotation and translation is a classic geometry problem solvable with established algorithms (like the Kabsch algorithm or Arun’s method).
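For reference, here is a compact NumPy sketch of that SVD-based alignment (Arun's method), assuming we already have matched canonical points (the predicted PNCs) and observed points (back-projected depth):

```python
import numpy as np

def arun_pose(canonical_pts, observed_pts):
    """Least-squares rigid transform (R, t) mapping canonical points onto
    observed points (Arun et al. / Kabsch), given 1-to-1 correspondences."""
    mu_c = canonical_pts.mean(axis=0)
    mu_o = observed_pts.mean(axis=0)
    H = (canonical_pts - mu_c).T @ (observed_pts - mu_o)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_o - R @ mu_c
    return R, t

# Sanity check with a known transform.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))                          # canonical points
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random rotation
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
t_true = np.array([0.2, -0.1, 0.5])
Q = P @ R_true.T + t_true                              # "observed" points
R_est, t_est = arun_pose(P, Q)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```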
Part 2: The Corrector – Optimization at Test Time
If we just trained the networks above and ran them, we would get decent results. But if there is a domain gap (e.g., strange lighting in the test image), the predictions might drift.
CRISP introduces a Corrector. This is an optimization loop that runs during inference. The intuition is simple: The observed depth measurements from the camera should align with the zero-level set of the predicted SDF.
We can formalize this as an optimization problem:

\[
\min_{f,\; R,\; t} \;\; \sum_{i=1}^{N} \big|\, f\!\left(R\, x_i + t\right) \big|^{2}
\]
Here, we are trying to find the best Shape (\(f\)) and Pose (\(R, t\)) such that the SDF value is zero for all observed points (\(x_i\)) after they are transformed by the pose.
The “Active Shape Model” Insight
Directly optimizing a neural network’s shape code \(h\) at test time is dangerous. The space of all possible codes is vast, and most codes produce garbage shapes.
The researchers made a key observation: The shape decoder behaves well inside the “convex hull” (or simplex) of the shapes it saw during training, but fails when you extrapolate outside it.

In the figure above, notice how the shape looks plausible when \(\alpha\) is between 0 and 1 (interpolation). But as soon as we go outside that range (extrapolation, \(\alpha > 1\) or \(\alpha < 0\)), the shape mutates into a blob.
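Concretely, experiments like this sweep a scalar \(\alpha\) that blends two training shape codes, presumably along the lines of

\[
h(\alpha) \;=\; (1 - \alpha)\, h_A \;+\; \alpha\, h_B ,
\]

so \(\alpha \in [0, 1]\) interpolates between the two training shapes, while \(\alpha < 0\) or \(\alpha > 1\) extrapolates beyond them.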
To fix this, CRISP restricts the optimization. Instead of letting the shape code \(h\) be anything, it forces \(h\) to be a weighted average of known “basis” shapes (the training examples).
The Active Shape Decoder (\(f_a\)) replaces the standard decoder with a linear combination of decoders conditioned on those basis codes:

\[
f_a(x) \;=\; \sum_{k=1}^{K} c_k \, f\big(x;\, h_k\big), \qquad c_k \ge 0, \quad \sum_{k=1}^{K} c_k = 1,
\]

where \(h_1, \dots, h_K\) are the shape codes of the \(K\) basis (training) shapes and \(c = (c_1, \dots, c_K)\) are the weights being optimized.
By approximating the deep network as a linear combination of basis shapes, the complex, non-linear optimization problem turns into a Constrained Linear Least Squares (LSQ) problem.
Why does this matter?
Non-linear optimization is slow and can get stuck in bad local minima. Constrained linear least squares is a convex problem: there are no bad local minima, and the global solution can be found very quickly with standard solvers (such as interior-point methods).
This allows CRISP to refine the shape and pose in milliseconds rather than seconds.
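To see why, here is an illustrative NumPy/SciPy sketch of the shape fit: with the active-shape decoder, the SDF residual at each observed point is linear in the weights \(c\), so fitting reduces to a small least-squares problem over the simplex. A general-purpose SLSQP solver stands in here for the dedicated solvers mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def fit_shape_weights(basis_sdf_at_points):
    """Solve min_c ||A c||^2  s.t.  c >= 0 and sum(c) = 1.

    A[i, k] is the SDF of basis shape k evaluated at observed point x_i,
    so A @ c is the active-shape SDF at every observed point. Driving it
    toward zero pulls the reconstructed surface onto the depth measurements.
    """
    A = basis_sdf_at_points
    n_basis = A.shape[1]
    objective = lambda c: float(np.sum((A @ c) ** 2))
    constraints = [{"type": "eq", "fun": lambda c: np.sum(c) - 1.0}]
    bounds = [(0.0, 1.0)] * n_basis
    c0 = np.full(n_basis, 1.0 / n_basis)   # start at the uniform mixture
    res = minimize(objective, c0, bounds=bounds, constraints=constraints,
                   method="SLSQP")
    return res.x

# Toy example: 200 observed points, 5 basis shapes.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
c = fit_shape_weights(A)
print(c, c.sum())   # non-negative weights that sum to 1
```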
Part 3: Self-Training for Adaptation
Now we have a strong pipeline: a neural network that predicts shape/pose, and a mathematical corrector that fixes mistakes. The final piece of the puzzle is Self-Training.
The goal is to let the model improve itself on new, unlabelled data. The authors propose a Correct-and-Certify loop.

Here is the flow:
- Forward Pass: The ViT backbone predicts an initial Shape and Pose.
- Correction: The optimization-based Corrector (using the Active Shape Model) refines these estimates to better fit the observed depth data.
- Certification: We can’t blindly trust the corrector. Sometimes optimization goes wrong. The system checks a Certificate: is the geometric consistency error low enough?
- Pseudo-Labeling: If the certified error is low, the corrected output is treated as a “Ground Truth” label.
- Backpropagation: The network updates its weights using this new self-generated label.
This loop runs continuously. Over time, the network learns to predict the “corrected” values directly, effectively adapting to the new domain.
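The loop maps naturally onto a training step. The sketch below is a schematic rendering of it in PyTorch; `model`, `corrector`, and `geometric_error` are hypothetical callables standing in for CRISP's network, LSQ corrector, and certificate check, and the specific losses are illustrative rather than the paper's.

```python
import torch
import torch.nn.functional as F

def self_training_step(model, corrector, geometric_error, optimizer,
                       rgbd_batch, depth_points, cert_threshold=0.01):
    """One correct-and-certify update (schematic)."""
    # 1. Forward pass: initial shape code and pose from the network.
    shape_code, pose = model(rgbd_batch)

    # 2. Correction: optimization-based refinement against the depth data.
    shape_code_ref, pose_ref = corrector(shape_code, pose, depth_points)

    # 3. Certification: only trust corrections whose geometric error is low.
    if geometric_error(shape_code_ref, pose_ref, depth_points) > cert_threshold:
        return None  # not certified; contribute no gradient for this sample

    # 4. Pseudo-labeling and 5. backpropagation: regress the network's raw
    #    predictions toward the (detached) corrected estimates.
    loss = F.mse_loss(shape_code, shape_code_ref.detach()) + \
           F.mse_loss(pose, pose_ref.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```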
Experiments and Results
Does it actually work? The researchers tested CRISP on three distinct datasets: YCBV (household objects), SPE3R (satellites), and NOCS (mixed categories).
1. Household Objects (YCBV)
On the YCBV dataset, CRISP was compared against state-of-the-art methods like Shap-E (a diffusion model) and CosyPose.

The results show that CRISP reconstructs shapes with much higher fidelity. Because it uses the “Correct-and-Certify” loop, it effectively ignores bad data and learns from good data.
We can visualize the performance using a Cumulative Distribution Function (CDF) of the Chamfer Distance (an error metric for shapes—lower is better).

In this graph, the orange line (CRISP-Real) rises the fastest, indicating that most of its reconstructions have very low error. The pink line (Shap-E), while a powerful generative model, lags behind in accuracy for this specific task.
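For completeness, one common variant of the Chamfer Distance is easy to write down. This is a small NumPy sketch, not tied to the paper's exact evaluation code:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two point clouds (lower is better):
    average distance from each point in P to its nearest neighbor in Q,
    plus the same term in the other direction."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise dists
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Identical clouds give 0; a shifted copy gives a positive error.
P = np.random.default_rng(2).normal(size=(500, 3))
print(chamfer_distance(P, P))          # 0.0
print(chamfer_distance(P, P + 0.1))    # > 0
```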
2. Space Debris (SPE3R)
The SPE3R dataset is particularly challenging because the test objects are satellites that the model has never seen before. This tests the “generalization” capability.

The figure above is a perfect example of self-training in action.
- Middle Row: Before self-training, CRISP retrieves a “nearest neighbor” shape from its training memory. It looks roughly like a satellite, but the solar panels are wrong.
- Bottom Row: After self-training on the test images, CRISP refines the shape. Notice how the solar panels are now correctly aligned and shaped, matching the Ground Truth (Top Row) much better.
3. Speed
One of the limitations of recent 3D reconstruction work (like diffusion models) is speed. Inference can take seconds or minutes.
- CRISP Inference: ~125 ms
- LSQ Corrector: ~250 ms
This sub-second performance makes it viable for real-time robotics applications, where a robot needs to react quickly to moving objects.
Conclusion
CRISP represents a significant step forward in robot perception. By combining the pattern-matching power of Deep Learning (ViT/DPT) with the rigorous mathematical structure of Optimization (SDFs/Least Squares), it achieves the best of both worlds.
Key Takeaways for Students:
- Implicit Representations: SDFs are a powerful way to represent continuous shapes, and conditioning them (via FiLM) makes them flexible.
- Hybrid Systems: Neural networks don’t have to do everything. Using a network for initialization and a classical optimizer for refinement is a robust design pattern.
- Constraints are Good: The “Active Shape Model” shows that constraining your search space (to the convex hull of known shapes) can actually improve results and speed up computation.
- Self-Supervision: The future of robotics lies in systems that can learn from their own experiences (and mistakes) after deployment, rather than relying solely on pre-labeled training data.
CRISP demonstrates that with the right geometric constraints, robots can learn to see and understand objects they’ve never encountered before, bringing us one step closer to autonomous systems that can operate in the messy, unlabelled real world.