Imagine a robot entering a new kitchen for the first time. To a human, the function of the room is obvious: the refrigerator handle pulls open, the drawer slides out, and the cabinet door swings on a hinge. We understand these mechanics intuitively, often predicting how an object moves before we even touch it.
For a robot, however, this is a geometric nightmare. A cabinet isn’t just a static box; it is an articulated object: an assembly of rigid parts whose relative motion is constrained by joints. If a robot miscalculates the axis of rotation for a heavy fridge door, it could rip the handle off or damage its own arm.
Traditionally, robots learn these articulation models in highly controlled environments with stationary cameras and isolated objects. But the real world is messy (“in-the-wild”). Cameras shake, hands get in the way, and lighting is unpredictable.
In this post, we are doing a deep dive into ArtiPoint, a novel framework presented in the paper “Articulated Object Estimation in the Wild.” This research proposes a way for robots to learn articulation models by simply watching humans interact with objects in dynamic, real-world videos. We will explore how the researchers combine deep point tracking with factor graph optimization to solve this complex 3D perception problem.

The Problem with “In-the-Wild” Perception
Why is estimating a door hinge so hard? In a lab, you can place a fiducial marker (like a QR code) on the door and the frame. You track the markers, do some math, and you have your hinge axis.
In the real world, you cannot stick markers on everything. Previous computer vision approaches have tried to solve this by looking at depth images of objects in different states (e.g., open vs. closed). However, these methods usually suffer from significant limitations:
- Static Cameras: They assume the camera doesn’t move, simplifying the background subtraction.
- Isolated Objects: They assume the object is the only thing in the scene, ignoring the clutter of a real room.
- Sensitivity to Occlusion: When a human opens a drawer, their hand and body block the camera’s view. Many algorithms fail when the object is partially hidden.
ArtiPoint flips the script. Instead of relying on static snapshots, it treats the interaction as a dynamic video event. It mimics human learning: by observing a human hand manipulating an object, the system infers the underlying mechanical constraints.
The ArtiPoint Pipeline
The core logic of ArtiPoint is conceptually simple yet computationally robust. It follows a four-stage pipeline:
- Interaction Extraction: Find the moments where a hand interacts with an object.
- Deep Point Tracking: Track specific points on the object surface over time using neural networks.
- 3D Lifting and Filtering: Convert these 2D video tracks into smooth 3D trajectories in the world frame.
- Articulation Estimation: Use probabilistic optimization to figure out what kind of joint (hinge or slider) explains those 3D movements.
Let’s break down each stage in detail.

Stage 1: Interaction Extraction
You cannot analyze an articulation if nothing is moving. The first step is to identify when an interaction occurs. The researchers leverage the prior knowledge that humans manipulate environments with their hands.
The system analyzes the video feed using a hand segmentation model. It doesn’t just look for a hand; it looks for sustained interaction. By calculating a moving average of hand visibility, the system filters out fleeting false positives (like a hand waving past the camera) and focuses on stable interaction segments.

As shown in the figure above, the system triggers a “segment” when the smoothed detection signal crosses a threshold. This creates a temporal window—a specific clip of video—where the system knows an object is likely being manipulated.
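A minimal sketch of this idea, using a per-frame binary "hand visible" signal and a moving average (the window and threshold values here are illustrative, not the paper's):

```python
import numpy as np

def extract_interaction_segments(hand_detected, window=15, threshold=0.6):
    """Turn a per-frame binary 'hand visible' signal into interaction segments.

    hand_detected: array of 0/1 flags, one per video frame.
    Returns a list of (start_frame, end_frame) index pairs.
    """
    # Moving average suppresses single-frame false positives.
    kernel = np.ones(window) / window
    smoothed = np.convolve(hand_detected, kernel, mode="same")

    active = smoothed > threshold
    segments, start = [], None
    for t, flag in enumerate(active):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# Example: a hand appears for ~2 seconds in a 30 fps clip.
signal = np.zeros(300)
signal[100:160] = 1
print(extract_interaction_segments(signal))
```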
Stage 2: Deep Point Tracking
Once an interaction segment is isolated, the system needs to track the object. But what exactly is “the object”? The system doesn’t know yet if it’s looking at a fridge or a microwave.
To solve this, ArtiPoint samples points around the detected hand and uses MobileSAM (a lightweight Segment Anything Model) to generate masks of objects near the hand. Inside these object masks, it identifies “good features to track” (corners, high-contrast areas).
The magic happens with CoTracker3, a state-of-the-art “any-point tracking” model. Unlike traditional optical flow, which only estimates a dense motion field between two consecutive frames, CoTracker3 can track specific sparse points across a long sequence of video frames, even estimating when points are occluded (hidden).
The result of this stage is a set of 2D trajectories: wiggly lines on the screen that represent how specific pixels on the object moved during the video.
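As a concrete illustration of the query-point step, here is a sketch of sampling "good features to track" inside a segmentation mask with OpenCV. MobileSAM and CoTracker3 themselves are omitted; `object_mask` simply stands in for the MobileSAM output, and the frame is synthetic:

```python
import cv2
import numpy as np

def sample_query_points(gray_frame, object_mask, max_points=50):
    """Pick well-textured corner points inside the object mask.
    These become the query points handed to a long-range point tracker
    such as CoTracker3."""
    corners = cv2.goodFeaturesToTrack(
        gray_frame,
        maxCorners=max_points,
        qualityLevel=0.01,
        minDistance=8,
        mask=object_mask,          # restrict detection to the segmented object
    )
    # (N, 1, 2) -> (N, 2) array of (x, y) pixel coordinates; None if no corners.
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))

# Example with a synthetic frame and a rectangular "object" mask.
frame = (np.random.rand(480, 640) * 255).astype(np.uint8)
mask = np.zeros_like(frame)
mask[100:300, 200:400] = 255
queries = sample_query_points(frame, mask)
print(queries.shape)
```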
Stage 3: 3D Lifting, Filtering, and Smoothing
2D tracks are useful, but robots live in a 3D world. To understand the mechanics of a joint, we need 3D coordinates.
Lifting to 3D
The researchers use the depth channel from the RGB-D camera to “lift” the 2D pixels into 3D space. However, this creates a new problem: the camera itself is moving. A point might look like it’s moving simply because the camera panned left.
To fix this, ArtiPoint assumes access to camera odometry (the position of the camera in the world). It transforms all the 3D points from the camera’s local coordinate system into a global “world” coordinate system.
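Concretely, the lift is a pinhole back-projection followed by a rigid transform into the world frame. A minimal sketch, assuming a standard intrinsics matrix `K` and a 4x4 camera-to-world pose `T_wc` from the odometry:

```python
import numpy as np

def lift_to_world(uv, depth, K, T_wc):
    """Back-project 2D pixel tracks into 3D world coordinates.

    uv:    (N, 2) pixel coordinates (x, y) of tracked points in one frame.
    depth: (N,) metric depth values read from the aligned depth image.
    K:     (3, 3) camera intrinsics.
    T_wc:  (4, 4) camera-to-world pose for this frame (from odometry/SLAM).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pinhole back-projection into the camera frame.
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    p_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=1)  # (N, 4)

    # Rigid transform into the shared world frame, removing camera motion.
    p_world = (T_wc @ p_cam.T).T
    return p_world[:, :3]

# Example: one pixel at the image center, 1.5 m away, seen by a camera
# that is itself 2 m along the world x-axis.
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
T_wc = np.eye(4); T_wc[0, 3] = 2.0
print(lift_to_world(np.array([[320.0, 240.0]]), np.array([1.5]), K, T_wc))
# -> [[2.  0.  1.5]]
```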
Filtering the Noise
Real-world data is noisy. Some points tracked by CoTracker3 might actually be on the stationary wall behind the object, or on the hand moving the object.
- Static Filtering: The system calculates the variance (movement) of each track in 3D. If a point barely moves in the world frame, it’s considered static and discarded.
- Occlusion Filtering: Points that are hidden for too long are unreliable and removed.
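Both filters reduce to a couple of array operations. A minimal sketch (the thresholds are illustrative, not the paper's):

```python
import numpy as np

def filter_tracks(tracks_3d, visibility, min_motion=0.02, min_visible=0.7):
    """Drop static and heavily occluded tracks.

    tracks_3d:   (N, T, 3) world-frame positions of N tracked points over T frames.
    visibility:  (N, T) boolean visibility flags from the point tracker.
    min_motion:  minimum 3D spread (meters) for a track to count as moving.
    min_visible: minimum fraction of frames in which the point must be visible.
    """
    # Static filter: how much does each track spread out in the world frame?
    spread = np.linalg.norm(tracks_3d.std(axis=1), axis=-1)   # (N,)
    # Occlusion filter: fraction of frames where the tracker saw the point.
    visible_frac = visibility.mean(axis=1)                    # (N,)

    keep = (spread > min_motion) & (visible_frac > min_visible)
    return tracks_3d[keep], visibility[keep]
```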
Trajectory Smoothing
Depth sensors (like those on Kinect or RealSense cameras) often exhibit high-frequency “jitter.” If you plotted the raw 3D path of a point, it would look jagged, and that jaggedness makes it hard to estimate a smooth axis of rotation.
To solve this, the authors employ an optimization-based smoothing technique. They minimize a cost function \(E(\mathbf{p})\) that balances fidelity to the original data with smoothness constraints:

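The paper’s exact equation is not reproduced here, but a plausible form of such an objective, writing the velocity and jerk penalties as finite differences of the smoothed points \(\mathbf{p}_t\), is:

\[
E(\mathbf{p}) = \sum_{t} \big\| \mathbf{p}_t - \hat{\mathbf{p}}_t \big\|^2
\;+\; \lambda_{vel} \sum_{t} \big\| \mathbf{p}_{t+1} - \mathbf{p}_t \big\|^2
\;+\; \lambda_{jerk} \sum_{t} \big\| \mathbf{p}_{t+3} - 3\mathbf{p}_{t+2} + 3\mathbf{p}_{t+1} - \mathbf{p}_t \big\|^2
\]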
- Data Term: Keeps the smoothed point close to the measured point (\(\hat{\mathbf{p}}_t\)).
- Velocity Term (\(\lambda_{vel}\)): Penalizes large gaps between consecutive points.
- Jerk Term (\(\lambda_{jerk}\)): Penalizes sudden changes in acceleration (spikes).
The result is a set of clean, smooth 3D curves representing the motion of the object parts.
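Because the objective above is quadratic, smoothing amounts to solving a sparse linear system per coordinate. Here is a minimal re-implementation of that idea with SciPy (a simplification, not the authors’ code):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_track(track, lam_vel=1.0, lam_jerk=10.0):
    """Smooth a (T, 3) trajectory by solving the normal equations of the
    quadratic data + velocity + jerk objective sketched above."""
    T = track.shape[0]
    I = sp.identity(T, format="csc")
    # First-order (velocity) and third-order (jerk) finite-difference operators.
    D1 = sp.diags([-1.0, 1.0], [0, 1], shape=(T - 1, T), format="csc")
    D3 = sp.diags([-1.0, 3.0, -3.0, 1.0], [0, 1, 2, 3], shape=(T - 3, T), format="csc")
    A = I + lam_vel * (D1.T @ D1) + lam_jerk * (D3.T @ D3)
    # Each spatial coordinate can be smoothed independently.
    return np.column_stack([spsolve(A, track[:, d]) for d in range(3)])

# Example: a noisy circular arc (e.g., a point on a swinging door).
t = np.linspace(0, np.pi / 2, 100)
noisy = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
noisy += 0.01 * np.random.randn(*noisy.shape)
smoothed = smooth_track(noisy)
```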

Stage 4: Exploiting the Articulation Prior (The Math)
Now we have clean 3D curves. The final challenge is to find the mathematical model that explains these curves. Is the object rotating around a hinge (revolute joint)? Or sliding along a rail (prismatic joint)?
The researchers use a Factor Graph formulation. They treat the articulation parameters as variables to be optimized.
The Transformation Model
They represent the movement using Screw Theory, where a motion is defined by a “twist” \(\xi \in \mathfrak{se}(3)\). This allows them to represent both rotation and translation in a unified mathematical framework.
The goal is to find a shared “base twist” \(\hat{\xi}\) (the axis of the hinge or slider) and a sequence of angles/positions \(\theta\) that best fit the observed data.
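In screw notation, the pose of the moving part at joint value \(\theta_t\) can be written with the exponential map:

\[
T_t = \exp\!\big(\theta_t \, [\hat{\xi}]_{\wedge}\big), \qquad \hat{\xi} = (\mathbf{v}, \boldsymbol{\omega}) \in \mathbb{R}^6,
\]

where \(\boldsymbol{\omega}\) encodes the rotation axis (non-zero for a revolute joint) and \(\mathbf{v}\) the translational component (for a prismatic joint, \(\boldsymbol{\omega} = 0\) and \(\mathbf{v}\) is the sliding direction).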
First, they define a set of relative pose observations derived from the point tracks:

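The paper’s exact construction is not shown here, but one natural choice is the rigid transform that best aligns the tracked points in a reference frame with their positions at time \(t\):

\[
\hat{T}_t \;=\; \arg\min_{T \in SE(3)} \sum_{i} \big\| T\,\mathbf{p}_{i,0} - \mathbf{p}_{i,t} \big\|^2,
\]

which can be computed in closed form from the 3D point correspondences (e.g., with the Kabsch/Umeyama algorithm).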
These observations allow them to construct a Factor Graph. A Factor Graph is a probabilistic graphical model that connects variables (the hinge axis, the door angle at time \(t\)) with factors (the error between the expected position and the observed position).
The system solves for the articulation parameters by minimizing the global error across the entire trajectory:

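Again reconstructing rather than quoting the paper’s notation, the optimization can be thought of as:

\[
\hat{\xi}^{*},\, \theta^{*}_{1:T} \;=\; \arg\min_{\hat{\xi},\, \theta_{1:T}} \sum_{t=1}^{T}
\Big\| \log\!\Big( \exp\!\big(\theta_t\, [\hat{\xi}]_{\wedge}\big)^{-1} \, \hat{T}_t \Big)^{\vee} \Big\|_{\Sigma}^{2},
\]

where each term is a factor penalizing the mismatch between the pose predicted by the joint model and the observed relative pose \(\hat{T}_t\), and \(\Sigma\) weights the residuals.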
By solving this equation using a library like GTSAM, ArtiPoint outputs:
- The type of joint (Revolute vs. Prismatic).
- The precise 3D axis of rotation or translation.
- The trajectory of the object part.
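The full factor-graph solver is beyond the scope of a short snippet, but the model-selection idea can be illustrated with a simplified least-squares stand-in: fit a prismatic (straight-line) model and a revolute (planar-circle) model to one smoothed 3D track and keep whichever explains it better. This is a didactic sketch, not the paper’s GTSAM-based optimization:

```python
import numpy as np

def fit_prismatic(track):
    """Fit a 3D line via PCA; return the axis direction and the RMS residual."""
    center = track.mean(axis=0)
    _, _, vt = np.linalg.svd(track - center)
    axis = vt[0]                                   # dominant direction of motion
    residuals = (track - center) - np.outer((track - center) @ axis, axis)
    return axis, np.sqrt((residuals ** 2).sum(axis=1).mean())

def fit_revolute(track):
    """Fit a circle in the best-fit plane; return the hinge axis and RMS residual."""
    center = track.mean(axis=0)
    _, _, vt = np.linalg.svd(track - center)
    normal = vt[2]                                 # plane normal = candidate hinge axis
    xy = (track - center) @ vt[:2].T               # project into the plane, (T, 2)
    # Algebraic (Kasa) circle fit: x^2 + y^2 = 2*a*x + 2*b*y + c.
    A = np.column_stack([2 * xy, np.ones(len(xy))])
    sol, *_ = np.linalg.lstsq(A, (xy ** 2).sum(axis=1), rcond=None)
    cx, cy, c = sol
    r = np.sqrt(c + cx ** 2 + cy ** 2)
    residuals = np.linalg.norm(xy - np.array([cx, cy]), axis=1) - r
    return normal, np.sqrt((residuals ** 2).mean())

def classify_joint(track):
    axis_p, err_p = fit_prismatic(track)
    axis_r, err_r = fit_revolute(track)
    return ("prismatic", axis_p) if err_p < err_r else ("revolute", axis_r)

# Example: a point on a door swinging 90 degrees around the z-axis.
angles = np.linspace(0, np.pi / 2, 60)
door_point = np.column_stack([0.6 * np.cos(angles),
                              0.6 * np.sin(angles),
                              np.full_like(angles, 1.0)])
print(classify_joint(door_point))   # -> ('revolute', axis close to [0, 0, ±1])
```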
Arti4D: A Benchmark for the Real World
One of the major contributions of this paper is the Arti4D dataset. Existing datasets were too sterile—they didn’t capture the chaotic nature of handheld cameras and messy rooms.
Arti4D contains 45 RGB-D sequences covering 414 interactions in varying environments (kitchens, labs). It includes:
- Dynamic Camera Motion: The operator walks around while recording.
- Scene-Level Context: Objects aren’t isolated; they are part of a larger room.
- Ground Truth: Accurate camera poses and labeled articulation axes.

This dataset serves not just as a test for articulation estimation, but also as a torture test for SLAM (Simultaneous Localization and Mapping) algorithms, as moving objects often break the “static world” assumption that mapping software relies on.

Does It Work? Experimental Results
The researchers compared ArtiPoint against several baselines, including:
- ArtGS and Ditto: Recent deep learning methods based on Gaussian Splatting and implicit functions.
- Sturm et al.: A classic probabilistic method.
Two metrics were used: Angular Error (\(\theta_{err}\)), the difference in degrees between the predicted hinge axis and the ground-truth axis, and Positional Error (\(d_{L2}\)), the distance between the predicted and ground-truth axis locations.
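Both metrics are straightforward to compute once a predicted axis is available. A small sketch, treating each axis as a direction vector plus a point it passes through (one common convention; the paper’s exact definition of \(d_{L2}\) may differ):

```python
import numpy as np

def angular_error_deg(pred_dir, gt_dir):
    """Angle between two axis directions, ignoring sign (an axis has no orientation)."""
    cos = abs(np.dot(pred_dir, gt_dir)) / (np.linalg.norm(pred_dir) * np.linalg.norm(gt_dir))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def positional_error(pred_point, pred_dir, gt_point):
    """Distance from the ground-truth axis point to the predicted axis line."""
    d = pred_dir / np.linalg.norm(pred_dir)
    offset = gt_point - pred_point
    return np.linalg.norm(offset - np.dot(offset, d) * d)

print(angular_error_deg(np.array([0, 0, 1.0]), np.array([0.1, 0, 1.0])))  # ~5.7 degrees
```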

The results are striking. As seen in Table 1, ArtiPoint achieves an angular error of around 14-17 degrees, while the competing deep learning methods (ArtGS, Ditto) hover around 50-60 degrees.
Why the huge difference? The baseline methods were designed for “semi-static” isolated views. They struggle when the camera moves aggressively or when the object is partially occluded by a hand. ArtiPoint’s reliance on temporal data—tracking points over time—makes it much more robust to these disturbances.
Qualitative Success
The system works remarkably well on everyday objects. In the figure below, you can see the estimated axes (yellow arrows) and trajectories (coordinate frames) for a drawer and a storage case. Even with the visual noise of a real kitchen, the system correctly identifies the sliding motion of the drawer and the hinge of the case.

Ablation Studies
The team also tested which parts of their pipeline mattered most.
- Smoothing: Turning off trajectory smoothing increased error, confirming that raw depth data is too noisy to use directly.
- Keyframe Stride: They found that processing every second frame (stride 2) offered the best balance between tracking density and computational speed.

Conclusion and Future Implications
ArtiPoint represents a significant step forward in robotic perception. By moving away from static snapshot analysis and embracing the dynamic, messy nature of video, it allows robots to learn from human demonstration in the wild.
The key takeaways are:
- Motion is Information: Tracking how points move over time provides a robust signal for articulation, even when the camera is moving.
- Priors Matter: Leveraging the prior knowledge that “hands cause motion” filters out a vast amount of irrelevant data.
- Hybrid Approaches Win: Combining modern Deep Learning (for point tracking) with classical Probabilistic Optimization (Factor Graphs) yields better results than relying on end-to-end “black box” neural networks for this specific task.
For students and researchers in robotics, this paper highlights the importance of “interactive perception.” Robots shouldn’t just look at the world; they should watch how we interact with it to understand the hidden mechanics of our daily lives.
The Arti4D dataset and code are publicly available, enabling the community to build upon this work and push the boundaries of what robots can perceive.