Introduction
In the world of computer vision, teaching machines to understand human movement has been a longstanding goal. We have become quite good at tracking runners on a track, pedestrians on a sidewalk, or dancers in a studio. These are what researchers call “ground-based motions.” The physics are somewhat predictable: gravity pulls down, and feet interact with a flat ground plane.
But what happens when humans leave the ground?
Rock climbing presents a fascinating and incredibly difficult challenge for Human Motion Recovery (HMR). Climbers are not merely walking; they are solving vertical puzzles. Their bodies contort into extreme poses, limbs stretch to their limits, and their interaction with the environment is complex—hands and feet must find purchase on tiny holds while the body defies gravity. Most existing AI models, trained on walking or running data, fail spectacularly when tasked with analyzing a climber. They struggle to understand where the climber is in the “world” (global position) and often hallucinate poses that are physically impossible on a vertical wall.
This brings us to a breakthrough paper titled “ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate.” In this article, we will dive deep into how a team of researchers tackled the problem of “off-ground” motion capture. They didn’t just tweak an existing algorithm; they built a massive new dataset called AscendMotion and designed a novel method, ClimbingCap, that cleverly fuses visual data (RGB) with depth data (LiDAR) to reconstruct climbing motions with unprecedented accuracy.

As illustrated in Figure 1, the system is designed to take raw sensor data and output a digital twin of the climber, complete with accurate skeletal tracking and global trajectory.
The Background: Why is Climbing So Hard to Capture?
To appreciate the solution, we must first understand the problem. Traditional motion capture (MoCap) often relies on markers—those little ping-pong balls attached to suits—captured by an array of cameras in a controlled studio. This works for movies, but it is intrusive for athletes and nearly impossible to set up on a massive outdoor rock wall.
Alternatively, “markerless” capture uses standard video (RGB). While great for accessibility, standard video lacks depth. If a climber reaches for a hold, a single camera struggles to determine exactly how far away that hand is from the wall. Furthermore, computer vision models suffer from coordinate ambiguity. They might estimate the pose of the body correctly (e.g., the knee is bent 90 degrees), but fail to place that body correctly in the 3D world (e.g., the climber is floating two feet off the wall).
The researchers identified two main gaps in the field:
- Data Scarcity: There were no large-scale, high-quality, 3D-labeled datasets for climbing.
- Methodological Limitations: Existing methods could not handle the “off-ground” nature of climbing, where the global position is just as important as the local body pose.
The Foundation: The AscendMotion Dataset
You cannot train a superior AI without superior data. The first major contribution of this work is AscendMotion, a dataset that dwarfs previous attempts in both scale and complexity.
The Hardware Setup
To capture the intricacies of climbing, the researchers moved beyond simple video. They constructed a multi-modal hardware system capable of recording data streams that are synchronized in time and space.

As shown in Figure 3, the collection rig is a technological beast:
- LiDAR (Ouster OS1): This sensor shoots laser pulses to create a precise 3D point cloud of the climber and the wall. It operates at 20 Hz.
- RGB Camera (Hik 1080P): Captures standard video, providing visual context and texture.
- IMU System (Xsens MVN): For the “labeled” portion of the dataset, climbers wore a suit with 17 Inertial Measurement Units. These sensors measure acceleration and rotation, providing a “ground truth” for how the body is moving.
- 3D Scanners: The rock walls themselves were pre-scanned to create a perfect digital replica of the environment.
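To make the phrase “synchronized in time and space” concrete, here is a minimal sketch of how two sensor streams can be paired by nearest timestamp. The function name, the 25 ms tolerance, and the frame rates in the example are illustrative assumptions, not the authors’ actual pipeline.

```python
import numpy as np

def align_streams(lidar_ts, rgb_ts, max_gap=0.025):
    """Pair each LiDAR sweep with the closest RGB frame in time.

    lidar_ts, rgb_ts: 1-D arrays of timestamps in seconds.
    max_gap: reject pairs more than 25 ms apart (half a 20 Hz period).
    Returns a list of (lidar_index, rgb_index) pairs.
    """
    pairs = []
    for i, t in enumerate(lidar_ts):
        j = int(np.argmin(np.abs(rgb_ts - t)))   # closest RGB frame in time
        if abs(rgb_ts[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs

# Example: a 20 Hz LiDAR stream paired against a 30 Hz camera stream.
lidar_ts = np.arange(0, 1.0, 1 / 20)
rgb_ts = np.arange(0, 1.0, 1 / 30)
print(align_streams(lidar_ts, rgb_ts)[:3])
```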
The dataset features 22 skilled climbers (including coaches) tackling 12 different walls. The emphasis on skill matters: experienced climbers move differently than novices, using techniques like “flagging” or “drop knees” that demand complex biomechanics.
The Annotation Pipeline
Collecting raw data is only half the battle. To train an AI, you need “labels”—the correct answers that tell the AI what it is looking at. The researchers developed a sophisticated pipeline to generate these labels.

Figure 4 outlines this rigorous process. It is not as simple as trusting the IMU suit. IMU sensors suffer from “drift”—over time, the calculated position wanders away from reality. To fix this, the pipeline uses a Multi-stage Global Optimization:
- Preprocessing: Syncing the time and space of LiDAR, RGB, and IMU data.
- Global Refit & Scene Touch: The system optimizes the body position so that the digital avatar actually touches the rock wall (Scene Touch Loss) and aligns with the LiDAR point cloud (Global Refit Loss); a toy version of the touch constraint is sketched just after this list.
- Manual Repair: Finally, human annotators verify the data, correcting any errors where the automated system might have twisted a limb unnaturally.
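To give a flavor of what the Scene Touch constraint is doing, the toy function below measures how far assumed contact vertices (hands and feet) sit from their nearest points on the scanned wall. The vertex selection and the simple nearest-point distance are simplifying assumptions, not the paper’s exact formulation.

```python
import torch

def scene_touch_loss(contact_verts, wall_points):
    """Mean distance from assumed contact vertices to the nearest wall point.

    contact_verts: (C, 3) body-mesh vertices expected to touch the wall.
    wall_points:   (M, 3) points sampled from the scanned wall.
    """
    d = torch.cdist(contact_verts, wall_points)  # (C, M) pairwise distances
    nearest, _ = d.min(dim=1)                    # closest wall point per vertex
    return nearest.mean()

# Toy usage: 4 "hand/foot" vertices against 1000 wall points.
verts = torch.randn(4, 3)
wall = torch.randn(1000, 3)
print(scene_touch_loss(verts, wall))
```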
The result is a dataset containing 412,000 frames of highly accurate, challenging climbing motions.
The Core Method: ClimbingCap
With the data in hand, the researchers proposed ClimbingCap. This is a global Human Motion Recovery (HMR) method designed specifically to handle the difficulties of climbing.
The “secret sauce” of ClimbingCap is how it handles coordinate systems. It adopts a strategy of Separate Coordinate Decoding. Instead of trying to guess everything at once, it uses the strength of each sensor type:
- RGB Images are excellent for understanding the pose (the shape of the body and joint angles).
- LiDAR Point Clouds are excellent for understanding the position (where the body is in the 3D world).

Figure 2 provides the architectural blueprint. The framework operates in three distinct stages: Separate Coordinate Decoding, Post-Processing, and Semi-Supervised Training. Let’s break these down.
Stage 1: Separate Coordinate Decoding
The network takes two inputs: a sequence of images and a sequence of point clouds.
The Camera Coordinate Decoder focuses on the RGB data (enhanced by point cloud features). It utilizes a Transformer architecture (ViT) to extract features. Its job is to predict the SMPL parameters. (SMPL is a standard parametric model of the human body, controlled by shape \(\beta\) and pose \(\theta\).)
In the paper’s formulation, the decoder takes a learned token together with the backbone features and outputs a state \(\mathbf{t}_{out}\); this token is then used to iteratively update the body pose and shape.
Simultaneously, the Global Translation Decoder focuses on the LiDAR data. Since LiDAR provides native 3D depth, it is tasked with figuring out exactly where the climber is in the world. It predicts the global translation parameters \(\Gamma^{trans}\).

The translation is updated iteratively through a learned weight matrix \(\Psi\), which lets the digital climber “climb” through the virtual space and match the real-world trajectory.
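The sketch below captures the spirit of the separate-decoding idea: one head regresses SMPL pose and shape from image features, while a second head regresses global translation from point-cloud features. The layer sizes, the residual-style translation update, and the absence of any fusion or transformer machinery are simplifying assumptions; the paper’s actual decoders are more elaborate.

```python
import torch
import torch.nn as nn

class SeparateCoordinateDecoder(nn.Module):
    """Toy two-branch decoder: pose/shape from RGB features, translation from LiDAR features."""

    def __init__(self, img_dim=1024, pc_dim=512):
        super().__init__()
        # Camera-coordinate branch: SMPL pose (24 joints x 6D rotation) and shape (10 betas).
        self.pose_head = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, 24 * 6))
        self.shape_head = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 10))
        # Global-translation branch: a world-coordinate offset from point-cloud features.
        self.trans_head = nn.Sequential(nn.Linear(pc_dim, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, img_feat, pc_feat, init_trans):
        pose = self.pose_head(img_feat)                 # (B, 144) body pose, camera coordinates
        shape = self.shape_head(img_feat)               # (B, 10)  SMPL shape
        trans = init_trans + self.trans_head(pc_feat)   # (B, 3)   iterative-style residual update
        return pose, shape, trans

# Toy usage with random features for a batch of 2 frames.
dec = SeparateCoordinateDecoder()
pose, shape, trans = dec(torch.randn(2, 1024), torch.randn(2, 512), torch.zeros(2, 3))
print(pose.shape, shape.shape, trans.shape)
```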
To ensure the network learns correctly, the training process uses a composite loss function:

\[
\mathcal{L} = \mathcal{L}_{kp3d} + \mathcal{L}_{kp2d} + \mathcal{L}_{\theta}^{smpl} + \mathcal{L}_{\beta}^{smpl} + \mathcal{L}_{traj}
\]

This sum might look intimidating, but each term is a logical constraint:
- \(\mathcal{L}_{kp3d}\) and \(\mathcal{L}_{kp2d}\): Keypoint losses. The digital joints (elbows, knees) should match the real ones in both 3D space and 2D image projections.
- \(\mathcal{L}_{\theta}^{smpl}\) and \(\mathcal{L}_{\beta}^{smpl}\): SMPL losses. The predicted body pose and shape parameters must match the ground truth.
- \(\mathcal{L}_{traj}\): Trajectory loss. The path the climber takes up the wall must align with reality.
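A minimal version of how such a composite loss might be assembled is shown below. The individual terms are placeholders (plain L2/L1 distances) and the unit weights are assumptions, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred, gt, w=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five constraint terms described above.

    pred/gt are dicts with keys: kp3d (B, J, 3), kp2d (B, J, 2),
    theta (SMPL pose), beta (SMPL shape), traj (B, T, 3).
    """
    l_kp3d = F.mse_loss(pred["kp3d"], gt["kp3d"])     # 3D joint positions
    l_kp2d = F.mse_loss(pred["kp2d"], gt["kp2d"])     # 2D image projections
    l_theta = F.mse_loss(pred["theta"], gt["theta"])  # SMPL pose parameters
    l_beta = F.mse_loss(pred["beta"], gt["beta"])     # SMPL shape parameters
    l_traj = F.l1_loss(pred["traj"], gt["traj"])      # global trajectory
    terms = (l_kp3d, l_kp2d, l_theta, l_beta, l_traj)
    return sum(wi * li for wi, li in zip(w, terms))
```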
Stage 2: Post-Processing
Even the best deep learning models can produce “jittery” or physically inconsistent results. The post-processing stage refines the output.
The researchers transform the poses from the camera coordinate system into the world coordinate system using the extrinsic matrix (the known physical relationship between the camera and the LiDAR).
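This camera-to-world change of coordinates is a single rigid transform. As a quick reminder of the mechanics (not the paper’s code), applying an extrinsic rotation R and translation t to camera-frame joints looks like this:

```python
import numpy as np

def camera_to_world(joints_cam, R, t):
    """Map (J, 3) camera-coordinate joints to world coordinates with extrinsics R (3x3), t (3,)."""
    return joints_cam @ R.T + t

# Toy example: a 90-degree rotation about the z-axis plus a 2 m offset along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([2.0, 0.0, 0.0])
print(camera_to_world(np.array([[1.0, 0.0, 0.0]]), R, t))  # -> [[2., 1., 0.]]
```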
They apply three specific optimizations here:
- Limb Weight Differentiation (\(L_{LWD}\)): Not all body parts are equally stable. The torso is usually more stable than flailing limbs. This weighting helps stabilize the core position.
- Speed Direction Smoothing (\(L_{SDS}\)): In climbing, velocity changes smoothly; you don’t instantly teleport or reverse direction. This smoothing constraint enforces natural motion (a toy version is sketched after this list).
- Visible Limb Repair (\(L_{VLR}\)): Sometimes a limb is occluded (hidden) from the camera. This step uses the LiDAR point cloud to “find” the missing limb and correct its posture.
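As referenced above, here is a toy version of a speed-direction smoothness penalty: it computes frame-to-frame velocities of the root joint and penalizes abrupt changes in their direction. The exact formulation in the paper may differ; this is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def speed_direction_smoothing(root_traj, eps=1e-8):
    """Penalize abrupt direction changes in the global root trajectory.

    root_traj: (T, 3) global root positions over T frames.
    Returns the mean (1 - cosine similarity) between consecutive velocity vectors.
    """
    vel = root_traj[1:] - root_traj[:-1]                           # (T-1, 3) per-frame velocities
    cos = F.cosine_similarity(vel[1:], vel[:-1], dim=-1, eps=eps)  # (T-2,) direction agreement
    return (1.0 - cos).mean()

# A smooth, steadily upward trajectory yields a near-zero penalty.
traj = torch.linspace(0, 2, 50).unsqueeze(1) * torch.tensor([[0.0, 0.0, 1.0]])
print(speed_direction_smoothing(traj))
```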
Stage 3: Semi-Supervised Training
One of the cleverest parts of ClimbingCap is how it handles data. Labeling data (putting the MoCap suit on climbers) is expensive and slow. However, recording raw video and LiDAR (without the suit) is cheap and fast.
The researchers used a Teacher-Student framework.
- They trained a “Teacher” model on the labeled data.
- They fed the unlabeled data into the Teacher.
- The Teacher made its best guess (pseudo-labels) for this new data.
- A “Student” model was then trained on both the original labeled data and the Teacher’s pseudo-labeled data.
This semi-supervised approach allowed the model to learn from a much wider variety of movements than would have been possible using only the labeled dataset.
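In pseudocode-style Python, the teacher-student loop looks roughly like the sketch below. The model, dataloader, and loss objects are placeholders I have assumed for illustration; only the overall pattern (pseudo-label the unlabeled data, then retrain on the union) reflects the strategy described above.

```python
import torch

def semi_supervised_training(teacher, student, labeled_loader, unlabeled_loader,
                             loss_fn, optimizer, epochs=10):
    """Train a student on labeled data plus teacher-generated pseudo-labels."""
    # 1) Generate pseudo-labels with the frozen teacher.
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for inputs in unlabeled_loader:
            pseudo.append((inputs, teacher(inputs)))  # the teacher's best guess

    # 2) Train the student on both real labels and pseudo-labels.
    student.train()
    for _ in range(epochs):
        for inputs, targets in labeled_loader:
            optimizer.zero_grad()
            loss_fn(student(inputs), targets).backward()
            optimizer.step()
        for inputs, targets in pseudo:
            optimizer.zero_grad()
            loss_fn(student(inputs), targets).backward()
            optimizer.step()
    return student
```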
Experiments and Results
Did it work? The researchers compared ClimbingCap against several state-of-the-art baselines, including RGB-only approaches (like WHAM and TRACE) and LiDAR-based methods.
Quantitative Analysis
The results on the AscendMotion dataset were categorized into “Horizontal” and “Vertical” scenes. Vertical scenes are generally harder due to the gravity-defying poses.

Table 3 shows a clear victory for ClimbingCap.
- MPJPE (Mean Per Joint Position Error): This measures how far the predicted joints are from reality, reported in millimeters (lower is better). In vertical scenes, ClimbingCap achieved an MPJPE of 75.45 mm, while the next best method (GVHMR) sat at 107.09 mm.
- W-MPJPE (World Coordinate MPJPE): This is the ultimate test: are the joints correct in the global 3D world? ClimbingCap achieved 78.99 mm, drastically outperforming purely RGB methods, which often scored above 200 or even 600. This proves that integrating LiDAR is essential for global accuracy. (A minimal computation of both metrics is sketched below.)
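MPJPE boils down to a mean Euclidean distance between predicted and ground-truth joints, and the world-coordinate variant simply skips the usual root alignment. Here is a small generic implementation of that idea; it is not code from the paper, and it assumes joints are given in millimeters.

```python
import numpy as np

def mpjpe(pred, gt, align_root=True, root_idx=0):
    """Mean per-joint position error.

    pred, gt: (J, 3) joint positions. With align_root=True the root joints are
    matched first (standard MPJPE); with False the error is measured directly
    in world coordinates (W-MPJPE style).
    """
    if align_root:
        pred = pred - pred[root_idx]
        gt = gt - gt[root_idx]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: a constant global offset vanishes under root alignment.
gt = np.random.rand(24, 3)
pred = gt + 10.0
print(mpjpe(pred, gt, align_root=True))   # ~0.0
print(mpjpe(pred, gt, align_root=False))  # ~17.32 (10 * sqrt(3))
```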
The researchers also tested the model on a completely different dataset, CIMI4D, to check for generalization.
(These cross-dataset results are reported in Table 4 of the source paper.)
Even on a dataset it wasn’t primarily designed for, ClimbingCap maintained superior performance, indicating the method is robust and not just memorizing the training data.
Qualitative Analysis
Numbers are great, but visual proof is often more compelling.

In Figure 5, we see a side-by-side comparison.
- Left (Camera Coordinate): Note how standard methods (like TRACE or WHAM) often fail to align the skeleton with the image, especially during complex reaches.
- Right (World Coordinate): The difference is stark. The red circles highlight errors in other methods—floating feet, twisted torsos, or climbers positioned inside the wall. The “Ours” column (ClimbingCap) shows a climber properly attached to the wall, with limbs extending naturally toward holds.
Ablation Study: Do We Need All the Parts?
Science requires verification. The researchers performed an ablation study, removing parts of the system to see what happened.

Table 5 tells the story:
- RGB Input Only: Without LiDAR, the error (MPJPE) jumps from 75.45 mm to 105.67 mm. This confirms that visual data alone isn’t enough for high-precision global climbing capture.
- w/o Semi-Supervised (SS): Removing the teacher-student training increased the error, proving that the extra unlabeled data helped the model learn better representations.
- w/o SDS (Smoothing): Removing the speed smoothing caused a significant drop in accuracy, highlighting the importance of physical motion constraints.
Conclusion and Implications
The “ClimbingCap” paper represents a significant step forward in computer vision for sports. By moving away from the “flat earth” assumption of traditional motion capture, the researchers have opened the door to analyzing human movement in complex, 3D environments.
Key Takeaways:
- Context Matters: You cannot accurately capture climbing motion without understanding the wall (the scene) and the global position.
- Sensor Fusion is Key: RGB gives you the pose; LiDAR gives you the place. Combining them via Separate Coordinate Decoding offers the best of both worlds.
- Data is King: The creation of AscendMotion provides the research community with a benchmark to test future “off-ground” algorithms.
This technology has implications beyond just helping athletes improve their beta. The principles used here—integrating scene geometry with body pose—could be applied to construction worker safety monitoring, search and rescue robotics, or any domain where humans interact with complex, vertical environments. By teaching AI to “climb,” we are teaching it to understand the world in truly three dimensions.