Breaking the Parts Barrier: How CAP-Net Masters Articulated Object Perception
Imagine you are a robot tasked with a seemingly simple household chore: opening a laptop. To a human, this is trivial. You identify the lid, find the edge, and lift. But to a robot, this is a geometric nightmare. The laptop is not a solid brick; it is an articulated object—a structure composed of rigid parts connected by joints. The lid moves relative to the base, changing the object’s overall shape.
For a robot to manipulate such objects, it needs to understand not just where the object is, but also the 6D pose (position and orientation) and size of every moving part. Traditionally, this has been a stumbling block for robotic perception.
In this deep dive, we are exploring a fascinating paper titled “CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image.” The researchers propose a novel, unified approach that significantly outperforms existing methods by treating the object as a whole rather than a sum of isolated parts.

The Problem with Moving Parts
Before dissecting the solution, we must understand why articulated objects are so difficult for computer vision systems.
Rigid vs. Articulated
Rigid object pose estimation is a well-studied field. If you want to pick up a coffee mug, the handle is always in a fixed position relative to the cup’s body. However, articulated objects (like scissors, drawers, or refrigerators) have multiple kinematic parts. A drawer can be open, closed, or halfway in between.
The “Segment-Then-Pose” Trap
Most recent state-of-the-art methods, such as GAPartNet, tackle this problem using a multi-stage pipeline. First, they try to segment the point cloud to isolate the specific part (e.g., just the laptop lid). Then, they run a pose estimation algorithm on that isolated segment.
This sounds logical, but it creates a critical flaw: Loss of Global Context.
When an algorithm looks at a segmented part in isolation, it loses the visual cues provided by the rest of the object. This leads to visual ambiguity. For example, if a robot only sees a flat rectangular surface (a laptop lid), it might struggle to distinguish the front from the back. However, if it sees the hinges and the keyboard base attached to it, the orientation becomes obvious.

As illustrated in Figure 2 above, relying on isolated parts often leads to incorrect pose estimations (indicated by the red X), whereas considering the whole object resolves the ambiguity (green check).
The Sim-to-Real Gap
Another major hurdle is data. Training neural networks requires massive datasets. Annotating 6D poses for articulated parts in the real world is expensive and slow. Researchers rely on synthetic data, but often, the synthetic images (renderings) and depth maps don’t look like the real world. Real depth sensors (like a RealSense camera) are noisy; synthetic depth is usually perfect. This discrepancy creates a “domain gap,” causing models trained on simulation data to fail when deployed on real physical robots.
Enter CAP-Net: A Unified Approach
To solve these issues, the authors introduce CAP-Net (Categorical Articulated Parts Network). Instead of chopping the object into pieces and processing them separately, CAP-Net is a single-stage network. It processes the entire object to estimate:
- Semantic Labels: What class of part is this? (e.g., a handle, a button).
- Instance Centers: Which specific part is it? (e.g., the top drawer vs. the bottom drawer).
- NPCS Maps: Where does each point lie in Normalized Part Coordinate Space, the canonical frame from which pose and size are later recovered?
The Architecture: Fusing RGB and Geometry
One of CAP-Net’s biggest strengths is how it handles input data. Many previous methods relied almost exclusively on geometric data (point clouds) because RGB images were considered too difficult to bridge between simulation and reality. CAP-Net, however, leans heavily on the rich semantic information found in RGB images.

As shown in the architecture diagram above, the pipeline works as follows:
- Input: The system takes a single RGB-D image (a color image plus an aligned depth map).
- Vision Backbones: The RGB image is processed by SAM2 (Segment Anything Model 2) and FeatUp. These are powerful, pre-trained vision encoders. SAM2 provides dense feature representations, while FeatUp extracts high-resolution, category-agnostic semantic features.
- Geometric Backbone: The Depth image is converted into a point cloud.
- Point-wise Fusion: The rich semantic features from the RGB encoders are concatenated (fused) with the geometric point cloud data.
- Processing: This enriched point cloud is passed through PointNet++, a deep learning architecture designed to process 3D point sets.
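To make the fusion step just described more concrete, here is a minimal sketch, assuming PyTorch tensors and a pinhole camera model, of how per-pixel RGB features can be paired with back-projected depth points before being handed to a point-based network. The function names, shapes, and intrinsics handling are illustrative assumptions, not the authors' implementation.

```python
import torch

def backproject_depth(depth, K):
    """Convert an (H, W) depth map into an (H*W, 3) point cloud using a
    (3, 3) intrinsics tensor K. Illustrative only; real pipelines also
    mask out invalid depth readings."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1)             # (H*W, 3)

def fuse_rgb_and_points(rgb_feats, depth, K):
    """Point-wise fusion: pair every 3D point with the RGB feature of its pixel.
    rgb_feats: (C, H, W) dense per-pixel features from a frozen 2D backbone.
    Returns an (H*W, 3 + C) tensor that a PointNet++-style network can consume."""
    C, H, W = rgb_feats.shape
    points = backproject_depth(depth, K)               # (H*W, 3)
    pixel_feats = rgb_feats.reshape(C, -1).T           # (H*W, C)
    return torch.cat([points, pixel_feats], dim=-1)    # (H*W, 3 + C)
```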
Multi-Task Learning
Once the features are extracted, the network splits into three parallel “heads” or modules, each responsible for a specific prediction task. This is an end-to-end process, meaning the network learns all these tasks simultaneously.
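Conceptually, the three heads are small per-point networks that share the same fused features. The sketch below, with made-up layer widths, class counts, and bin counts, is only meant to show the shape of the multi-task setup, not the paper's exact architecture.

```python
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Three parallel per-point heads on top of shared point features.
    Feature and output dimensions are illustrative assumptions."""
    def __init__(self, feat_dim=128, num_classes=10, num_bins=64):
        super().__init__()
        # Semantic head: per-point part-class logits.
        self.sem_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                      nn.Linear(64, num_classes))
        # Instance head: per-point 3D offset toward the part's centroid.
        self.offset_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                         nn.Linear(64, 3))
        # NPCS head: binned classification for each canonical coordinate (x, y, z).
        self.npcs_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                       nn.Linear(64, 3 * num_bins))

    def forward(self, point_feats):                     # (N, feat_dim)
        return (self.sem_head(point_feats),             # (N, num_classes)
                self.offset_head(point_feats),          # (N, 3)
                self.npcs_head(point_feats))            # (N, 3 * num_bins)
```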
1. Semantic Part Learning (\(M_{sem}\))
This module predicts the class label for every point in the cloud. It answers the question: “Is this point part of a handle, a lid, or a button?”
The loss function used here is Focal Loss, which helps the model focus on hard-to-classify examples:

where \(\hat{c}_i\) is the probability the network assigns to the ground-truth class \(c_i\) of point \(i\).
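In its standard form (from Lin et al.'s original formulation; the paper's exact weighting may differ), focal loss down-weights well-classified points so that hard examples dominate the gradient:

\[
\mathcal{L}_{sem} = -\frac{1}{N}\sum_{i=1}^{N} \alpha\,\bigl(1 - \hat{c}_i\bigr)^{\gamma}\,\log \hat{c}_i ,
\]

with \(\gamma > 0\) shrinking the loss for confident, correct predictions and \(\alpha\) balancing class frequencies.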
2. Centroid Offset Learning (\(M_{inst}\))
Semantic labels aren’t enough. A cabinet might have three identical handles. To distinguish them, this module predicts the center of the part instance that each point belongs to. It does this by predicting an offset vector (direction and distance) from the point to the center.

This encourages points belonging to the same handle to “vote” for the same center location, allowing the system to cluster them into distinct instances later.
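Point-grouping methods typically supervise these offsets with a simple distance loss; the form below is the common L1 variant (some works add a cosine term for direction), not necessarily the paper's exact choice:

\[
\mathcal{L}_{inst} = \frac{1}{N}\sum_{i=1}^{N} \bigl\lVert \widehat{\Delta x}_i - \Delta x_i \bigr\rVert_1,
\qquad \Delta x_i = c_{I(i)} - x_i ,
\]

where \(x_i\) is an observed point, \(c_{I(i)}\) is the centroid of the part instance it belongs to, and \(\widehat{\Delta x}_i\) is the predicted offset.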
3. NPCS Learning (\(M_{npcs}\))
This is the core of the pose estimation. NPCS (Normalized Part Coordinate Space) is a canonical 3D space where an object part is standardized (centered, scaled, and aligned).
The network predicts the NPCS coordinate for every point in the observed point cloud. Essentially, it creates a map linking the real-world distorted object back to a perfect, “ideal” version of itself.

Because predicting continuous coordinates is difficult, the authors discretize the coordinates into bins and use a classification approach (softmax cross-entropy), which is often more stable than direct regression.
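As a rough sketch of the binned-classification idea, assuming NPCS coordinates normalized to \([0, 1]\) and a hypothetical bin count (neither is taken from the paper):

```python
import torch
import torch.nn.functional as F

def npcs_classification_loss(pred_logits, gt_npcs, num_bins=64):
    """Treat each canonical coordinate as a classification over discrete bins.
    pred_logits: (N, 3, num_bins) per-point logits for x, y, z.
    gt_npcs:     (N, 3) ground-truth NPCS coordinates, assumed in [0, 1]."""
    # Map each continuous coordinate to a bin index.
    gt_bins = (gt_npcs.clamp(0, 1 - 1e-6) * num_bins).long()   # (N, 3)
    # Softmax cross-entropy over bins, averaged over points and axes.
    return F.cross_entropy(pred_logits.reshape(-1, num_bins),
                           gt_bins.reshape(-1))
```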
From Predictions to 6D Pose
After the network makes these predictions, a post-processing step occurs:
- Clustering: Each point is shifted by its predicted offset toward its instance center, and the shifted points are grouped into distinct part instances with the DBSCAN algorithm, guided by the predicted semantic labels.
- Alignment: For each identified part, the system now has two sets of points: the observed real-world points and the predicted NPCS (canonical) points.
- Pose Fitting: The Umeyama algorithm is used to calculate the transformation matrix (Rotation, Translation, and Scale) that best aligns these two sets of points. This transformation matrix is the 6D pose and size.
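To make this post-processing recipe concrete, here is a minimal sketch using NumPy and scikit-learn's DBSCAN plus a textbook Umeyama solver. The hyperparameters (eps, min_samples) are guesses for illustration; this is not the authors' code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def umeyama(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    with dst ~ s * R @ src + t, following Umeyama (1991).
    src, dst: (N, 3) corresponding point sets (e.g., predicted NPCS vs. observed)."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:       # fix reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

def parts_from_predictions(points, offsets, npcs):
    """Cluster points into part instances and fit a pose + size per instance."""
    # Points vote for their instance center by shifting along predicted offsets.
    centers = points + offsets
    labels = DBSCAN(eps=0.05, min_samples=20).fit_predict(centers)
    poses = {}
    for inst in set(labels) - {-1}:                    # -1 marks DBSCAN noise
        mask = labels == inst
        # Align canonical (NPCS) coordinates to observed points: pose and scale.
        poses[inst] = umeyama(npcs[mask], points[mask])
    return poses
```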
The Data Engine: RGBD-Art Dataset
A sophisticated model is useless without good training data. The researchers identified that existing datasets were insufficient for bridging the sim-to-real gap. They either lacked realistic lighting (rendering) or had “perfect” depth maps that didn’t resemble the noisy data from real sensors.
To fix this, they introduced the RGBD-Art Dataset.

As detailed in Table 1 above, RGBD-Art distinguishes itself by offering both Photorealistic RGB (P-RGB) and Realistic Depth (R-D).
What makes it realistic?
- Ray-Tracing: The RGB images are rendered using ray-tracing technology to simulate realistic lighting, shadows, and materials.
- Sensor Noise Simulation: The depth maps are not perfect ground truths. They are generated to include noise patterns similar to active stereo depth cameras (like the RealSense D415), making the jump to real hardware much smoother.
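As a toy illustration of the noise-simulation idea, and not the dataset's actual sensor model, one could corrupt a clean synthetic depth map with depth-dependent Gaussian noise and random dropout holes:

```python
import numpy as np

def corrupt_depth(depth, sigma=0.004, dropout_prob=0.02, rng=None):
    """Toy approximation of active-stereo sensor artifacts on a clean depth map.
    depth: (H, W) float array in meters; returns a noisy copy with missing pixels = 0."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = depth + rng.normal(0.0, sigma, depth.shape) * depth  # noise grows with range
    holes = rng.random(depth.shape) < dropout_prob               # random missing returns
    noisy[holes] = 0.0
    return noisy
```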

Figure 3 showcases the quality of this data. Notice the complex lighting on the coffee machine and the subtle gradients in the depth maps.
Experimental Results
The researchers compared CAP-Net against several baselines, including point-based methods (PointGroup, SoftGroup) and the state-of-the-art GAPartNet.
Segmentation Performance
One of the most striking results is in Instance Segmentation (identifying which points belong to which part).

In Table 2, look at the difference in the “Avg. AP50” (Average Precision) column for “Seen” objects.
- GAPartNet: 11.35%
- CAP-Net (Ours): 53.58%
This is a massive leap in performance. The authors attribute this to the inclusion of dense semantic features from the RGB images (via SAM2/FeatUp). While geometric features (point clouds) can be noisy and ambiguous for small parts like buttons, RGB textures provide clear boundaries.
Pose Estimation Accuracy
But did accurate segmentation translate to better pose estimation?

Table 3 confirms it did.
- \(R_e\) (Rotation Error): CAP-Net achieves 10.39 degrees, significantly lower than GAPartNet’s 83.3 degrees.
- A10 (Accuracy at 10 degrees/10cm): CAP-Net achieves 58.44%, compared to GAPartNet’s 1.40%.
Note: The extremely high error for GAPartNet is partly because GAPartNet relies on metrics that allow for 180-degree symmetry flips. When evaluated strictly (without symmetry tolerance), its performance drops because it struggles to distinguish the front from the back of parts—the “Global Context” issue discussed earlier.
Visualizing the Success
The numbers are convincing, but the visuals are undeniable.

In Figure 5, look at the “GAPartNet” column versus “Ours.”
- Row 1 (Remote): GAPartNet fails to detect the pose (No Detection). CAP-Net accurately bounds the remote.
- Row 2 (Box): GAPartNet suffers from “180 degree error”—it thinks the box is facing the wrong way. CAP-Net gets the orientation correct because it sees the whole object context.
- Row 3 (Bucket): CAP-Net tightly fits the bounding box around the bucket handle, a notoriously difficult thin part to capture with depth sensors alone.
Real-World Robotic Deployment
The ultimate test of any robotic perception paper is: Does it work on a physical robot?
The team deployed CAP-Net on a Kinova Gen2 robotic arm equipped with a RealSense D435 camera.

They tested the robot on tasks involving drawers, hinge handles, and lids. The results were impressive.

As shown in Table 6, CAP-Net achieved a 28/30 total success rate across tasks, whereas the baseline GAPartNet struggled significantly (likely due to the sim-to-real gap and lack of global context).
Conclusion and Future Implications
The CAP-Net paper represents a significant step forward in robotic manipulation. By moving away from the “segment-then-pose” paradigm and embracing a unified, single-stage network, the researchers solved the critical issue of context loss.
Key Takeaways:
- Global Context Matters: Processing the whole object prevents orientation errors (flipping parts 180 degrees).
- RGB is Powerful: Integrating foundational vision models (SAM2/FeatUp) provides the semantic detail necessary to detect small parts that geometric data misses.
- Realism in Data: The RGBD-Art dataset proves that training on photorealistic synthetic data with simulated sensor noise is the key to bridging the sim-to-real gap.
For students and researchers in robotics, CAP-Net demonstrates that sometimes, looking at the "big picture" (the whole object) is more effective than focusing on the details (isolated parts) too early in the process. With code and datasets available, this work paves the way for robots that can navigate our messy, articulated world with human-like ease.