Beyond Point Clouds: Scaling Indoor 3D Object Detection with Cubify Anything

Imagine walking into a room. You don’t just see “chair,” “table,” and “floor.” You perceive a rich tapestry of items: a coffee mug on a coaster, a specific book on a shelf, a power strip tucked behind a cabinet. Humans understand scenes in high fidelity. However, the field of indoor 3D object detection has largely been stuck seeing the world in low resolution, focusing primarily on large, room-defining furniture while ignoring the clutter of daily life.

For years, the standard approach to 3D detection has relied on processing point clouds—3D representations derived from depth sensors. While effective, this approach has hit a ceiling caused by limited datasets and “entangled” annotations, where ground truth labels are biased by the noisy sensors used to collect them.

In a recent paper, “Cubify Anything: Scaling Indoor 3D Object Detection,” researchers from Apple propose a paradigm shift. They introduce a massive new dataset, Cubify-Anything 1M (CA-1M), and a new Transformer-based model, Cubify Transformer (CuTR). Their work challenges the assumption that complex 3D point-based architectures are necessary for 3D understanding, demonstrating that image-centric models—when fed enough high-quality data—can outperform traditional methods and truly “cubify anything.”

In this post, we will dissect their methodology, explore how they built a dataset an order of magnitude larger than existing benchmarks, and analyze why this might signal the end of the point-cloud era for indoor detection.


The Problem: Entanglement and Scale

To understand the contribution of this paper, we first need to understand the limitations of the current landscape.

The “Entanglement” of Data

Most existing 3D indoor datasets (like SUN RGB-D or ScanNet) are collected using commodity depth sensors (like a Kinect or an iPad LiDAR). These sensors are noisy. When annotators label these datasets, they draw 3D bounding boxes around the objects visible in the noisy 3D mesh or point cloud.

This creates a problem called entanglement. The “ground truth” labels aren’t actually capturing physical reality; they are capturing the sensor’s interpretation of reality. If a table leg is missing from the LiDAR scan, it is often missing from the annotation. Consequently, models trained on this data learn to replicate the sensor’s noise rather than understanding the true geometry of the scene.

The “Furniture Only” Limitation

Because annotating 3D data is difficult, existing datasets focus on a small taxonomy of large objects: beds, chairs, tables, and sofas. They ignore the “long tail” of small objects—books, staplers, lamps, vases—that actually make up the richness of a scene.

The researchers summarize the goal simply: they want to move from detecting “scene-level” furniture to “image-level” total understanding.


The Dataset: Cubify-Anything 1M (CA-1M)

The foundation of this research is the Cubify-Anything 1M (CA-1M) dataset. It is a massive undertaking designed to disentangle annotations from sensor noise and scale up the diversity of detected objects.

Figure 1 showing the richness of annotations in CA-1M.

1. Spatial Reality vs. Pixel Perfection

The researchers utilized a unique data collection setup involving two distinct capture methods:

  1. Stationary FARO Laser Scans: These are survey-grade, high-resolution laser scans that capture the static scene with extreme precision.
  2. Handheld iPad Captures: These are standard RGB-D videos taken by a user walking through the room.

The innovation lies in the annotation process. Instead of labeling the noisy iPad scans, annotators labeled the high-resolution FARO point clouds. This ensures the 3D boxes represent “spatial reality.”

However, a model sees the world through the iPad’s camera. Therefore, the researchers registered the laser scans to the handheld video trajectory. They then projected the high-quality 3D boxes from the laser scan into every single frame of the iPad video. This results in annotations that are pixel-perfect with respect to the image, yet geometrically accurate with respect to the real world.
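
To make the projection step concrete, here is a minimal NumPy sketch of how a gravity-aligned 3D box annotated in the laser-scan coordinate frame could be projected into a single iPad frame, given that frame’s camera pose and intrinsics. The corner ordering, the world-to-camera pose convention, and the simplistic behind-camera handling are illustrative assumptions, not the authors’ actual pipeline.

```python
# Sketch: projecting a laser-scan-annotated 3D box into one handheld frame.
# Assumes a pinhole camera model and a gravity-aligned (yaw-only) box.
import numpy as np

def box_corners_world(center, size, yaw):
    """Return the 8 corners (3, 8) of a yaw-rotated 3D box in world coordinates."""
    l, w, h = size
    # Corner offsets in the box's local frame.
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * (l / 2.0)
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * (w / 2.0)
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * (h / 2.0)
    corners = np.stack([x, y, z])                      # (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])   # rotation about the up axis
    return R @ corners + np.asarray(center).reshape(3, 1)

def project_to_frame(corners_world, world_to_cam, K):
    """Project world-space corners into one frame; return their 2D bounding box."""
    homo = np.vstack([corners_world, np.ones((1, corners_world.shape[1]))])
    cam = (world_to_cam @ homo)[:3]                    # camera-space points
    if np.any(cam[2] <= 0):                            # behind the camera: skip (simplified)
        return None
    uv = (K @ cam) / cam[2]                            # pinhole projection
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()
```

Repeating this projection for every registered frame along the handheld trajectory is what turns hundreds of thousands of laser-scanned objects into millions of per-frame annotations.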

2. Exhaustive, Class-Agnostic Labeling

Unlike previous datasets that restricted labels to 10 or 20 categories, CA-1M is class-agnostic. The goal was to label every object in the room, regardless of what it is.

Figure 2: Comparison of datasets showing the exhaustive nature of CA-1M.

As shown in Figure 2, the difference is stark. While datasets like ARKitScenes or SUN RGB-D might show a few boxes around chairs, CA-1M (bottom right) is a dense mesh of bounding boxes capturing items on shelves, clutter on tables, and decorative elements.

The Scale by the Numbers

To appreciate the magnitude of CA-1M, let’s look at the comparison table provided in the paper:

Table 1 comparing CA-1M statistics to other datasets.

CA-1M contains over 440,000 unique objects and 15 million annotated frames. This is roughly an order of magnitude larger than comparable indoor 3D datasets. This scale is crucial because it allows the model to learn from a data-rich regime, similar to how Large Language Models (LLMs) benefit from massive text corpora.

The Annotation Pipeline

Annotating this volume of data required a specialized tool. Because laser scans can sometimes be incomplete (e.g., due to occlusion or reflective surfaces), the annotation tool allowed labelers to view the 3D point cloud and the corresponding high-resolution RGB images simultaneously.

The annotation tool interface.

Furthermore, the team developed a rendering engine to handle occlusions. Since a 3D box exists in the world, it might be visible in one video frame but blocked by a wall in the next. The system renders the scene geometry to automatically “cut” or filter boxes that shouldn’t be visible in a specific frame, ensuring the training data is clean.
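
The post describes this only at a high level, but the visibility check can be pictured as a z-buffer test: render the scene geometry to a depth map for the frame, then keep a box only if enough of its sampled points lie at or in front of the rendered surface. The depth margin, the point sampling, and the visibility threshold below are assumptions for illustration.

```python
# Sketch of an occlusion filter: a box stays in a frame only if a sufficient
# fraction of its sampled points passes a z-buffer test against rendered depth.
import numpy as np

def visible_fraction(points_cam, K, rendered_depth, margin=0.05):
    """Fraction of camera-space points (3, N) that project into the image and
    are not hidden behind the rendered scene geometry."""
    H, W = rendered_depth.shape
    z = points_cam[2]
    uv = (K @ points_cam) / np.clip(z, 1e-6, None)
    u = np.round(uv[0]).astype(int)
    v = np.round(uv[1]).astype(int)
    in_view = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    visible = np.zeros(points_cam.shape[1], dtype=bool)
    idx = np.where(in_view)[0]
    # Visible if the point is not significantly behind the rendered surface depth.
    visible[idx] = z[idx] <= rendered_depth[v[idx], u[idx]] + margin
    return float(visible.mean())

def keep_box_in_frame(points_cam, K, rendered_depth, min_visible=0.25):
    """Filter rule: drop a box from a frame if most of its points are occluded."""
    return visible_fraction(points_cam, K, rendered_depth) >= min_visible
```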

Figure 4 illustrating the rendering and occlusion handling process.


The Model: Cubify Transformer (CuTR)

With a massive dataset in hand, the researchers needed a model architecture capable of exploiting it. Traditional indoor 3D detection methods are “point-based.” They take a depth map, convert it into a point cloud or voxel grid, and use 3D sparse convolutions to find objects.

These methods have significant downsides:

  1. Complexity: 3D operations (voxelization, sparse convolutions) are computationally heavy and hard to deploy on standard hardware (like mobile NPUs).
  2. Inductive Bias: They rely on the geometry of the point cloud. If the depth sensor is noisy (as commodity sensors usually are), detection quality degrades sharply.

Architecture Overview

The researchers introduce Cubify Transformer (CuTR). The philosophy behind CuTR is to treat 3D detection more like 2D detection. It operates primarily on the RGB image, using depth only as a supplementary hint if available.

Figure 6: The Architecture of Cubify Transformer (CuTR).

The architecture consists of three main stages (see the sketch after this list):

  1. Backbone (The Eye):
  • The model uses a Vision Transformer (ViT) to process inputs.
  • For RGB-D (Red, Green, Blue + Depth): They use a MultiMAE backbone. This effectively fuses RGB patches and Depth patches into a single latent representation.
  • For RGB-only: They utilize a Depth-Anything backbone, which is pre-trained to understand depth cues from monocular images.
  2. Decoder (The Brain):
  • Based on “Plain DETR,” the decoder uses a Transformer to process object queries.
  • It employs non-deformable attention mechanisms. This makes the model “accessible,” meaning it uses standard matrix multiplications supported by almost all hardware accelerators (GPUs, Apple Neural Engine, etc.), avoiding custom CUDA kernels required by sparse 3D methods.
  3. 3D Box Predictor (The Output):
  • Instead of just predicting a 2D box, the model predicts a 3D box directly from the image features.
  • It outputs the 3D center \((x, y, z)\), dimensions \((l, w, h)\), and orientation (yaw).
  • Crucially, CuTR does not use Non-Maximum Suppression (NMS). NMS is a post-processing step used to remove duplicate boxes. By avoiding it, CuTR simplifies the pipeline and avoids errors where NMS might accidentally delete a valid object sitting directly behind another (a common occurrence in 3D).
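
To make this flow concrete, below is a heavily simplified, PyTorch-style sketch in the spirit of CuTR: an image backbone produces patch tokens, a plain (non-deformable) Transformer decoder refines a fixed set of object queries against those tokens, and small per-query heads regress an objectness score plus the 3D box parameters. Every module choice, dimension, and head layout here is an illustrative assumption; it is not the released architecture, and the real model uses MultiMAE or Depth-Anything backbones rather than the toy patch embedding shown.

```python
# Simplified, PyTorch-style sketch of an image-to-3D-box pipeline in the spirit
# of CuTR. Dimensions, module choices, and head layouts are illustrative only.
import torch
import torch.nn as nn

class CuTRSketch(nn.Module):
    def __init__(self, dim=256, num_queries=300, num_layers=6):
        super().__init__()
        # Stand-in for the ViT backbone: anything mapping an image to patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.queries = nn.Embedding(num_queries, dim)
        # Plain (non-deformable) attention: standard transformer decoder layers.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Per-query heads: objectness score, 3D center, dimensions, and yaw.
        self.score_head = nn.Linear(dim, 1)
        self.center_head = nn.Linear(dim, 3)   # (x, y, z) in a normalized space
        self.size_head = nn.Linear(dim, 3)     # (l, w, h), predicted in log-space here
        self.yaw_head = nn.Linear(dim, 2)      # (sin, cos) parameterization of yaw

    def forward(self, image):
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, N, dim)
        q = self.queries.weight.unsqueeze(0).expand(image.shape[0], -1, -1)
        q = self.decoder(q, tokens)                                   # cross-attend to patches
        return {
            "score": self.score_head(q).sigmoid(),   # one box hypothesis per query, no NMS
            "center": self.center_head(q),
            "size": self.size_head(q).exp(),
            "yaw": self.yaw_head(q),
        }

# Usage: boxes = CuTRSketch()(torch.randn(1, 3, 512, 512))
```

In DETR-style detectors, bipartite matching between queries and ground-truth boxes during training is what lets each query own at most one object, which is why NMS can be dropped at inference time.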

Handling Depth and Gravity

A clever detail in CuTR is how it handles scale. In the RGB-D variant, the model uses the statistics (mean and standard deviation) of the input depth map to normalize and then re-scale its predictions. This makes the model robust to different scene scales.
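
The post gives only the intuition, so the snippet below is one plausible way to implement it: compute the mean and standard deviation over valid depth pixels, predict the box center’s depth in that normalized space, and invert the transform at inference. The exact normalization CuTR uses may differ.

```python
# Sketch of depth-statistics normalization (illustrative, not CuTR's exact formula).
import torch

def depth_stats(depth, min_valid=0.1):
    """Mean and std over valid (positive, non-hole) depth pixels."""
    valid = depth[depth > min_valid]
    return valid.mean(), valid.std().clamp_min(1e-3)

def normalize_center_depth(z_center, depth):
    """Express a box center's metric depth relative to the input depth statistics."""
    mu, sigma = depth_stats(depth)
    return (z_center - mu) / sigma

def denormalize_center_depth(z_pred, depth):
    """Invert the normalization at inference to recover metric depth."""
    mu, sigma = depth_stats(depth)
    return z_pred * sigma + mu
```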

Additionally, the model assumes a “gravity-aligned” world. Most mobile devices provide a gravity vector from their accelerometer. CuTR uses this to simplify the orientation prediction, focusing only on the “yaw” (rotation around the vertical axis), which is sufficient for almost all indoor objects.
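
A rough sketch of why the gravity vector helps: once camera-space points are re-expressed in a gravity-aligned basis, box orientation collapses to a single yaw angle about the up axis. The frame conventions and the way the horizontal axes are chosen below are illustrative assumptions.

```python
# Sketch: reducing full 3D orientation to yaw using the device's gravity vector.
import numpy as np

def gravity_alignment(gravity_in_cam):
    """Rotation whose rows are gravity-aligned axes expressed in camera coordinates,
    so R @ v_cam re-expresses a camera-space vector in a z-up frame."""
    up = -gravity_in_cam / np.linalg.norm(gravity_in_cam)   # up is opposite to gravity
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(tmp, up)) > 0.9:                          # avoid a degenerate cross product
        tmp = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(tmp, up); x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(up, x_axis)
    return np.stack([x_axis, y_axis, up])

def yaw_only_rotation(yaw):
    """In the gravity-aligned frame, box orientation is just a rotation about z."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```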


Experiments and Results

The researchers compared CuTR against state-of-the-art point-based methods (like FCAF3D and TR3D) across several datasets.

1. Performance on CA-1M

When trained and evaluated on the massive CA-1M dataset, CuTR shines.

Table 2: Comparison of CuTR vs Point-based methods.

Looking at the CA-1M columns in Table 2:

  • Recall is King: CuTR (RGB-D) achieves an Average Recall (AR25) of 60.2%, significantly higher than the best point-based method (FCAF3D at 56.5%).
  • RGB-only Surprise: Even the RGB-only version of CuTR is competitive, outperforming older point-based methods that require depth inputs.

This result suggests that when you have enough training data, the strong inductive biases of 3D-specific architectures (like sparse convolutions) become less important than the scalability of Transformers.
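
For readers unfamiliar with the metric: roughly, AR25 measures how many ground-truth boxes are recovered by at least one prediction with 3D IoU of at least 0.25. The sketch below illustrates the idea for axis-aligned boxes; the benchmark’s boxes carry a yaw, so the real computation uses oriented-box IoU.

```python
# Minimal sketch of recall at IoU 0.25 for axis-aligned 3D boxes given as
# (xmin, ymin, zmin, xmax, ymax, zmax). Simplified: real CA-1M boxes are oriented.
import numpy as np

def iou_3d_axis_aligned(a, b):
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def recall_at_iou(gt_boxes, pred_boxes, thresh=0.25):
    """Fraction of ground-truth boxes matched by at least one prediction."""
    if len(gt_boxes) == 0:
        return 1.0
    hits = sum(
        any(iou_3d_axis_aligned(g, p) >= thresh for p in pred_boxes)
        for g in gt_boxes
    )
    return hits / len(gt_boxes)
```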

2. The Impact of Noisy Depth

Why do point-based methods struggle on CA-1M compared to CuTR? The researchers hypothesized that sensor noise is to blame. Point-based methods trust the input point cloud implicitly, so if the LiDAR data is messy, those errors propagate directly into their predictions.

To prove this, they ran an ablation study where they trained models using the “perfect” ground-truth depth from the FARO scanner instead of the noisy iPad LiDAR.

Table 3: Ablation on depth quality.

As shown in Table 3, point-based methods (FCAF3D, TR3D) saw a massive performance jump when given “perfect” depth. CuTR improved too, but by a smaller margin, narrowing the gap between the two families. This confirms that image-based transformers are more robust to sensor noise: they can use visual cues (texture, edges) from the high-resolution RGB image to compensate for missing or noisy depth information.

3. Pre-training: The “ImageNet Moment” for 3D?

Perhaps the most significant finding is the value of CA-1M as a pre-training dataset. The researchers took a CuTR model pre-trained on CA-1M and fine-tuned it on the smaller, older SUN RGB-D dataset.

Table 4: Pre-training results on SUN RGB-D.

The results in Table 4 are dramatic. With CA-1M pre-training, CuTR’s performance on the Omni3D SUN RGB-D benchmark leaps forward, surpassing point-based methods by a wide margin (AR25 of 73.6% vs. FCAF3D’s 65.9%).

This validates the core hypothesis: Scale matters. Just as pre-training on massive text or 2D image datasets revolutionized NLP and computer vision, pre-training on a massive, diverse 3D dataset like CA-1M unlocks superior performance on downstream tasks.

Qualitative Visualization

The numbers are backed up by visual evidence.

Figure 7: Visual comparison of detections.

In Figure 7, we see CuTR (top row) versus FCAF3D (bottom row). Notice how CuTR successfully detects small, thin objects and clutter on shelves that the point-based method misses entirely or groups into a single blob. The “Reprojected View” columns show how the detected 3D boxes align with the physical world; CuTR’s boxes are tighter and better aligned with the actual objects.


Conclusion and Implications

“Cubify Anything” represents a maturation point for indoor 3D object detection. The paper makes a compelling case that we are moving away from the era of “geometry-first” detection toward “data-first” detection.

Key Takeaways:

  1. Data Quality > Model Complexity: A simpler Transformer model (CuTR) beats complex 3D-specific architectures when fed with high-quality, large-scale data.
  2. Disentanglement is Critical: Training on “spatial reality” (laser scans) rather than “sensor noise” (iPad meshes) creates more robust models that handle uncertainty better.
  3. The Rise of Image-Centric 3D: By treating 3D detection as an image-to-box problem (augmented by depth), we gain the benefits of the massive ecosystem of 2D Vision Transformers—better scaling, easier pre-training, and broader hardware compatibility.

The release of the CA-1M dataset is likely to accelerate research in this field, potentially enabling applications in robotics, augmented reality, and spatial computing where machines need to understand not just where the walls are, but where everything is.