Imagine someone tossing a cardboard box at you. You don’t freeze, calculate the wind resistance, solve a differential equation, and then move. You react fluidly. You extend your arms, anticipate the contact, and as the box hits your hands, you pull back slightly to absorb the impact. This is dynamic manipulation—interacting with objects through rapid contact changes and physical forces.

For humans, this is instinct. For robots, it is an algorithmic nightmare.

Robots typically prefer “quasi-static” tasks—moving slowly enough that momentum and impact forces can be ignored. To make robots truly useful in the real world, they need to handle dynamic tasks like catching, throwing, or sliding objects. However, existing methods often fail to balance the speed required for reaction with the foresight required for planning.

In this post, we take a deep dive into the Latent Adaptive Planner (LAP), a method that lets robots learn agile catching skills directly from human videos and, crucially, adapt their plans in real time as objects fly through the air.

The Problem: Why Catching is Hard for Robots

To catch a flying object, a robot must deal with three compounding challenges:

  1. Unpredictable Dynamics: A box tumbling through the air has complex aerodynamics. Its mass, friction, and elasticity are unknown to the robot until contact is made.
  2. Real-Time Latency: Traditional path planning algorithms are often too slow. By the time the robot calculates the optimal trajectory, the box has already hit the floor.
  3. The Embodiment Gap: We want to train robots using videos of humans (because human data is cheap and abundant). But humans and robots have different body shapes, joint limits, and strengths.

Recent advances in Imitation Learning (like Diffusion Policies) have shown promise, but they often struggle with the inference speeds needed for high-frequency control loops. On the other hand, Reinforcement Learning (RL) requires thousands of dangerous, real-world trials or simulations that don’t perfectly match reality.

The researchers behind LAP propose a hybrid solution: formulate planning as an inference problem in a latent space, and use a smart data regeneration pipeline to teach robots using human videos.

Part 1: Learning from Humans (Without Teleoperation)

Before the robot can plan, it needs data. Collecting robot data via teleoperation (controlling a robot with a joystick or VR rig) is expensive and slow. The authors devised a pipeline to “regenerate” robot-ready data directly from standard videos of humans performing the task.

The goal is to translate what a human does (visual pixels) into what a robot needs to know (joint torques and positions).

Figure 1: Robot Model-Based Data Regeneration Pipeline.

Step 1: Scene State Estimation

First, the system analyzes the video to track the box and the human’s pose. It extracts the 3D position of the box and the joint angles of the human demonstrator.
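As a rough illustration of what this stage produces (not the paper’s implementation; `pose_estimator` and `box_tracker` are hypothetical stand-ins for an off-the-shelf human pose estimator and object tracker), the per-frame scene state might look like this:

```python
# Sketch of per-frame scene state extraction (hypothetical helpers, not the paper's code).
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneState:
    box_position: np.ndarray   # 3D box position in the camera/world frame, shape (3,)
    human_joints: np.ndarray   # human joint angles, shape (n_joints,)

def extract_scene_states(video_frames, pose_estimator, box_tracker):
    """Run pose estimation and object tracking on every frame of the demo video."""
    states = []
    for frame in video_frames:
        human_joints = pose_estimator(frame)  # assumed: returns joint angles per frame
        box_position = box_tracker(frame)     # assumed: returns the 3D box center
        states.append(SceneState(box_position=box_position, human_joints=human_joints))
    return states
```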

Step 2: Object-Robot Proportional Mapping

A human arm and a robot arm are rarely the same length. If a human reaches 50cm forward, a smaller robot might need to fully extend, while a larger robot barely moves. To fix this, the researchers use Proportional Mapping.

They scale the object’s position and dimensions based on the ratio between the robot’s arm length and the human’s arm length (measured in pixels).

Equation 1 and 2: Proportional mapping equations.
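The exact equations are represented by the caption above; based on the surrounding description, a plausible reconstruction (the human-frame notation \({}^H \mathbf{p}_{\mathrm{obj}}\) and the uniform scale factor are assumptions) is:

\[
s = \frac{L_{\mathrm{robot}}}{L_{\mathrm{human}}}, \qquad {}^{R}\mathbf{p}_{\mathrm{obj}} = s \, {}^{H}\mathbf{p}_{\mathrm{obj}},
\]

where \(L_{\mathrm{robot}}\) and \(L_{\mathrm{human}}\) are the robot’s and the human demonstrator’s arm lengths, and the object’s dimensions are scaled by the same factor.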

Here, \({}^R p_{\mathrm{obj}}\) is the object position in the robot’s frame, and \(s\) is the scaling factor. This ensures that the “intent” of the motion is preserved even if the physical dimensions differ.

Step 3: Kinematic-Dynamic Reconstruction

This is the most critical step. A video only gives you positions. But to catch a heavy box, a robot needs to know about forces (torques).

First, the system maps human joint angles to robot joint angles (\(q\)) using a mapping function \(f_{map}\):

Equation 3: Joint mapping function.

Next, it calculates velocities (\(\dot{q}\)) and accelerations (\(\ddot{q}\)) by differentiating the positions over the video’s time steps (\(\Delta t\)):

Equation 4: Joint velocity calculation. Equation 5: Joint acceleration calculation.
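These are standard first-order finite differences; written out (a sketch of the obvious form, since the captioned equations are not reproduced here), they read:

\[
\dot{\mathbf{q}}_t \approx \frac{\mathbf{q}_{t+1} - \mathbf{q}_t}{\Delta t}, \qquad \ddot{\mathbf{q}}_t \approx \frac{\dot{\mathbf{q}}_{t+1} - \dot{\mathbf{q}}_t}{\Delta t}.
\]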

Finally, using a physics model of the robot (Inverse Dynamics), the system calculates the torque (\(\tau\)) required to perform that movement.

Equation 6: Inverse dynamics equation.
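The caption only names the equation, but the terms listed below correspond to the standard manipulator inverse dynamics form (how \(\mathbf{F}_{\mathrm{ext}}\) enters, e.g. whether it is mapped through a Jacobian, is an assumption here):

\[
\boldsymbol{\tau} = \mathbf{M}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{G}(\mathbf{q}) + \mathbf{F}_{\mathrm{ext}}.
\]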

In this equation:

  • \(\mathbf{M}(\mathbf{q})\ddot{\mathbf{q}}\) accounts for the robot’s inertia.
  • \(\mathbf{C}(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}}\) accounts for Coriolis and centrifugal forces (rotational physics).
  • \(\mathbf{G}(\mathbf{q})\) accounts for gravity.
  • \(\mathbf{F}_{\mathrm{ext}}\) accounts for external forces (like the impact of the box).

By the end of this pipeline, the researchers have converted a video of a human catching a box into a dataset of robot joint positions, velocities, and the specific torques needed to execute the catch.
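Putting Step 3 together, a minimal sketch might look like the following (not the paper’s code; `map_human_to_robot` and `robot.inverse_dynamics` are hypothetical stand-ins for \(f_{map}\) and the dynamics model, and central differences are used here where the captions suggest simple forward differences):

```python
import numpy as np

def regenerate_robot_data(human_joint_angles, dt, map_human_to_robot, robot):
    """Convert per-frame human joint angles into robot joint positions,
    velocities, accelerations, and the torques needed to execute the motion."""
    # Step 3a: retarget human joint angles onto the robot (f_map in the text).
    q = np.array([map_human_to_robot(frame) for frame in human_joint_angles])

    # Step 3b: numerically differentiate over the video timestep dt.
    dq = np.gradient(q, dt, axis=0)
    ddq = np.gradient(dq, dt, axis=0)

    # Step 3c: inverse dynamics gives the torque required to realize the motion,
    # i.e. tau = M(q)ddq + C(q,dq)dq + G(q) (+ external-force terms during contact).
    tau = np.array([robot.inverse_dynamics(qi, dqi, ddqi)
                    for qi, dqi, ddqi in zip(q, dq, ddq)])
    return q, dq, ddq, tau
```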

Part 2: The Latent Adaptive Planner (LAP)

With data in hand, we need a brain. The core innovation of this paper is treating the robot’s plan not as a fixed sequence of actions, but as a probability distribution in a Latent Space.

The Latent Variable Model

The LAP defines a “latent plan,” denoted as vector \(\mathbf{z}\). You can think of \(\mathbf{z}\) as a compressed, abstract summary of the entire trajectory (e.g., “catch the box high to the left” or “catch low and scoop”).

The model defines a joint probability distribution between the trajectory \(\mathbf{x}\) (observations and actions) and this latent plan \(\mathbf{z}\):

Equation 7: Joint probability distribution.

The trajectory generator, \(p_{\theta}(\mathbf{x}|\mathbf{z})\), is a causal Transformer (similar to the architecture behind GPT). It generates the next action based on the history of observations and the specific latent plan \(\mathbf{z}\) it is currently following.

Equation 8: Trajectory generator equation.
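The two captions above stand in for equations from the paper; a plausible reconstruction, assuming a standard normal prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\) and an autoregressive factorization over the \(T\) trajectory steps (the exact conditioning set is an assumption), is:

\[
p_{\theta}(\mathbf{x}, \mathbf{z}) = p_{\theta}(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z}), \qquad p_{\theta}(\mathbf{x} \mid \mathbf{z}) = \prod_{t=1}^{T} p_{\theta}(\mathbf{x}_t \mid \mathbf{x}_{1:t-1}, \mathbf{z}).
\]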

Training via Classical Variational Bayes

How does the robot learn valid \(\mathbf{z}\) vectors? The researchers use Classical Variational Bayes (VB).

In typical machine learning (like VAEs), we train an “encoder” network to predict \(\mathbf{z}\) from the input. However, LAP takes a different approach: it directly optimizes the specific \(\mathbf{z}\) vector for each training trajectory.

During training, the model tries to maximize the Evidence Lower Bound (ELBO). This objective balances two goals:

  1. Reconstruction: The plan \(\mathbf{z}\) should accurately regenerate the demonstration trajectory.
  2. Regularization: The distribution of \(\mathbf{z}\) should stay close to a standard prior (a simple Gaussian distribution), which keeps the latent space smooth and navigable.

Equation 9: The ELBO objective function.
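In its textbook form (a sketch; the paper’s exact parameterization may differ), with a per-trajectory variational posterior \(q(\mathbf{z})\) and prior \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\), the objective reads:

\[
\mathcal{L}(\theta, q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}\!\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{z})\right]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z})\right)}_{\text{regularization}}.
\]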

The training alternates between optimizing the local parameters (the specific plan \(\mathbf{z}\) for a specific video) and the global parameters (the weights of the Transformer network \(\theta\)).
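A minimal PyTorch-style sketch of that alternation (illustrative only; `model.log_prob`, the diagonal-Gaussian posterior over \(\mathbf{z}\), and the optimizer split are assumptions, not the paper’s code):

```python
import torch

def vb_training_step(model, trajectory, z_mu, z_logvar, opt_local, opt_global, n_local=5):
    """Alternate between fitting this trajectory's latent plan (local step) and
    updating the shared Transformer weights theta (global step).
    z_mu and z_logvar are leaf tensors with requires_grad=True, owned by opt_local."""
    def neg_elbo():
        # Reparameterized sample from the variational posterior q(z).
        z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)
        recon = -model.log_prob(trajectory, z)  # assumed: log p_theta(x | z)
        kl = -0.5 * torch.sum(1 + z_logvar - z_mu**2 - z_logvar.exp())  # KL to N(0, I)
        return recon + kl

    # Local step: optimize only this trajectory's variational parameters.
    for _ in range(n_local):
        opt_local.zero_grad()
        neg_elbo().backward()
        opt_local.step()

    # Global step: one gradient update on the Transformer parameters theta.
    opt_global.zero_grad()
    loss = neg_elbo()
    loss.backward()
    opt_global.step()
    return loss.item()
```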

Part 3: Real-Time Adaptation via Variational Replanning

This is where LAP shines. In a dynamic environment, things change. The robot might plan to catch the box at point A, but air resistance or a bad throw might send it toward point B.

Standard planners have two flaws:

  1. Open-loop: They plan once at the start. If the world changes, they fail.
  2. Replanning from scratch: They re-calculate the entire plan every few milliseconds. This is computationally expensive and can lead to “jittery” behavior if the plan changes drastically between steps.

LAP introduces Variational Replanning.

Instead of calculating a new plan from scratch, LAP maintains a “belief” (a posterior distribution) about the latent variable \(\mathbf{z}\). As new observations arrive (e.g., the box is lower than expected), the model performs a Bayesian update.

Crucially, the posterior from the previous timestep becomes the prior for the current timestep.

Equation 10: Bayesian updating formula.
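In spirit, this is just Bayes’ rule applied in the latent space, with the previous belief acting as the prior (a reconstruction of the idea rather than the paper’s exact notation):

\[
q_{t+1}(\mathbf{z}) \;\propto\; p_{\theta}(\mathbf{x}_{t+1} \mid \mathbf{z})\; q_{t}(\mathbf{z}),
\]

where \(\mathbf{x}_{t+1}\) denotes the newly observed segment of the trajectory.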

The robot essentially asks: “Given my previous plan, and these new observations, how should I slightly shift my plan \(\mathbf{z}\) to fit reality?”

This is mathematically formulated as an optimization problem where the model tries to maximize the likelihood of the new observations while minimizing the divergence (change) from the previous plan (\(q_t\)).

Equation 11: Optimization for replanning.

This acts like a “trust region.” It allows the robot to adapt to the box’s movement without making wild, erratic changes to its strategy. It ensures the motion remains smooth—a vital requirement for dynamic manipulation.
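A sketch of one such replanning update under simple assumptions (a diagonal-Gaussian belief over \(\mathbf{z}\), a few gradient steps per planner tick, the same hypothetical `model.log_prob` as in the training sketch, and an assumed trust-region weight `beta`):

```python
import torch

def variational_replan(model, new_obs, z_mu, z_logvar, prev_mu, prev_logvar,
                       beta=1.0, lr=1e-2, n_steps=3):
    """Shift the latent plan toward the new observations while penalizing
    divergence from the previous belief q_t (the 'trust region')."""
    opt = torch.optim.Adam([z_mu, z_logvar], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # Reparameterized sample from the current belief q(z).
        z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)
        nll = -model.log_prob(new_obs, z)  # fit the newly observed trajectory segment
        # KL between diagonal Gaussians: current belief q vs. previous belief q_t.
        kl = 0.5 * torch.sum(
            prev_logvar - z_logvar
            + (z_logvar.exp() + (z_mu - prev_mu) ** 2) / prev_logvar.exp()
            - 1.0
        )
        loss = nll + beta * kl
        loss.backward()
        opt.step()
    return z_mu.detach(), z_logvar.detach()
```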

The Full System Architecture

The implementation runs as a layered, multi-rate control stack:

  1. High-Level Planner (30 Hz): The LAP updates the latent plan \(\mathbf{z}\) via variational replanning.
  2. Low-Level Controller (100 Hz): The Transformer generates the immediate motion commands (actions) based on the current \(\mathbf{z}\).
  3. Safety Layer (1000 Hz): A Model Predictive Control (MPC) layer ensures the requested torques don’t violate the robot’s physical limits.

Figure 2: System Architecture for LAP Framework.
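To make the nesting concrete, here is a simplified, non-real-time sketch (the rates come from the list above; `planner`, `policy`, `safety_filter`, and `robot` are hypothetical interfaces, and the 1 kHz MPC layer would in practice run closer to the hardware than this Python loop):

```python
CONTROL_HZ = 100   # low-level action generation rate
PLAN_EVERY = 3     # replan roughly every 3 control steps (~33 Hz, approximating 30 Hz)

def run_episode(planner, policy, safety_filter, robot, n_steps):
    """Nested control loop: update the latent plan z at ~30 Hz, decode actions at
    100 Hz, and pass every command through a safety filter before execution."""
    z = planner.initial_plan()
    for step in range(n_steps):
        obs = robot.observe()
        if step % PLAN_EVERY == 0:
            z = planner.replan(z, obs)           # variational replanning update
        action = policy.act(obs, z)              # Transformer decodes the next command
        robot.apply(safety_filter(action, obs))  # torque/limit checks before execution
```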

Experiments and Results

The researchers evaluated LAP on a box catching task. This is notoriously difficult because boxes tumble chaotically, and catching them requires “soft” hands that absorb the impact energy rather than rigidly resisting it.

They tested the system on two different robots (Robot A and Robot B) to prove that the data regeneration pipeline works across different physical embodiments.

Visual Analysis: The “Human-Like” Catch

One of the most striking results is the qualitative difference in motion. Because LAP learns from human data including torques, it learns compliance.

Look at Figure 3 below. On the left, a human catches a box. Notice how the arm yields backward upon contact? This “retreat” trajectory absorbs the impact energy. On the right, the LAP-controlled robot mimics this behavior perfectly. It doesn’t just go to a coordinate; it executes a dynamic, energy-efficient motion.

Figure 3: Impact-aware retreat trajectory comparison.

Quantitative Comparison

The team compared LAP against three baselines:

  1. Model-Based: A traditional physics solver.
  2. Behavior Cloning (BC): Standard supervised learning.
  3. Diffusion Policy: A state-of-the-art generative approach.

The results were compiled for both robots:

Table 1: Performance comparison table.

Key Takeaways from the Data:

  • Success Rate: LAP achieved near-perfect success rates (29/30 or 30/30), matching the Model-Based planner. Diffusion and BC struggled (around 20-24 successes), largely because they couldn’t adapt quickly enough to the specific trajectory of the box.
  • Energy Efficiency: This is the big win. Look at the energy consumption (Joules). The Model-Based planner succeeded, but it was “stiff,” using high torques to force the robot into position (74.99 J for Robot A). LAP used drastically less energy (11.47 J), comparable to the efficiency of the human demonstration. It learned to work with the physics, not fight against it.

Conclusion

The Latent Adaptive Planner (LAP) represents a significant step forward in robotic manipulation. By moving the planning process into a latent space, the researchers found a sweet spot between the rigidity of classical planning and the unpredictability of pure learning-based methods.

Three concepts define this success:

  1. Data Regeneration: Unlocking the ability to train robots using cheap, abundant human videos by mathematically scaling motions and reconstructing forces.
  2. Latent Planning: Controlling high-level behaviors via a compressed, abstract latent variable.
  3. Variational Replanning: The ability to “slide” that latent variable in real time to adapt to a changing world, treating previous plans as priors for current decisions.

For students of robotics, LAP illustrates that solving dynamic problems isn’t always about faster processors or better sensors. Often, it is about finding the right representation—in this case, a latent distribution that can be updated on the fly. As robots move out of factories and into our unstructured world, abilities like “catching a tossed object” will transition from being impressive party tricks to essential survival skills.