Imagine trying to teach a robot to wipe a whiteboard. To a human, this is trivial. To a robot, it is a nightmare of physics and control. The robot must navigate to the board without crashing, lift its arm, apply just enough pressure to erase the marker without punching through the wall, and coordinate its wheels and joints simultaneously.
This is a “high-degree-of-freedom” (high-DoF) problem involving contact-rich manipulation. Traditionally, roboticists have had two main ways to solve this:
- Sim-to-Real: Train the robot in a physics simulator (like a video game) and transfer the “brain” to the real robot. The problem? The Reality Gap. If the simulation isn’t perfect—if the friction coefficient of the whiteboard is slightly off—the robot fails in the real world.
- Real-World Reinforcement Learning (RL): Let the robot learn by trial and error in the physical world. The problem? It is slow and dangerous. A robot flailing its arms to learn how to move can damage itself or its surroundings, and standard algorithms might require weeks of continuous interaction to converge.
In a new paper, researchers introduce SLAC (Simulation-Pretrained Latent Action Space), a hybrid approach that achieves the best of both worlds. SLAC allows complex mobile manipulators to learn difficult, contact-rich tasks in the real world in under an hour, without requiring human demonstrations or high-fidelity “Digital Twins.”

The Core Problem: The Curse of Dimensionality
Why is controlling a mobile manipulator so hard? It comes down to the action space. A bimanual mobile manipulator like the one used in the paper might have 17 or more joints (wheels, torso, shoulders, elbows, wrists). If you try to learn a task by wiggling all 17 joints randomly (the starting point of Reinforcement Learning), the search space is astronomically large. Most random movements are useless or dangerous.
Direct RL in the real world struggles because it wastes thousands of samples just figuring out how to coordinate its own body before it even begins to solve the task.
SLAC proposes a clever workaround: Don’t learn how to move your body in the real world. Learn how to move in a cheap simulation, and then learn how to solve the task in the real world.
The SLAC Framework
SLAC operates in two distinct phases. By decoupling the acquisition of “skills” from the learning of a “task,” it bypasses the need for photorealistic simulations.
- Phase 1: Unsupervised Latent Action Learning (Simulation). The robot plays in a low-fidelity simulation to learn a set of useful behaviors (a “latent action space”).
- Phase 2: Downstream Task Learning (Real World). The robot uses these pre-learned behaviors as a vocabulary to solve specific tasks in the real world.
Let’s break down the architecture.

Step 1: Learning the Language of Movement in Simulation
The first insight of SLAC is that you don’t need a perfect simulation to learn basic motor skills. You don’t need to simulate the exact friction of a specific whiteboard marker to learn that “extending the arm forward” causes the hand to move away from the body.
The researchers use a low-fidelity simulation. It doesn’t look realistic, and it doesn’t even have the specific task programmed into it. Instead, they use Unsupervised Skill Discovery (USD).
The Objective: Disentanglement and Empowerment
The goal in the simulation isn’t to “wipe the board.” It is to learn a set of “Latent Actions” (\(z\)) that control different parts of the environment. The researchers want the action space to be disentangled.
Ideally, one dimension of the latent action \(z\) should control the robot’s base movement, another should control the arm height, and another should control the gripper orientation. If the robot can learn to control these independently, the downstream learning becomes much easier.
They achieve this by optimizing a Mutual Information objective of the form:

\[
\max \; \sum_{i} \left[ I(S^i; Z^i) - I(S^{\neg i}; Z^i) \right]
\]
Here is what this equation implies:
- Maximize \(I(S^i; Z^i)\): The \(i\)-th dimension of the latent action (\(Z^i\)) should have a strong correlation with a specific feature of the state (\(S^i\)), such as the position of the hand.
- Minimize \(I(S^{\neg i}; Z^i)\): That same action \(Z^i\) should not affect other parts of the state (\(S^{\neg i}\)).
This forces the robot to learn a “keyboard” of actions where every key does something specific and predictable.
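The paper's exact estimator isn't reproduced here, but a common way to approximate an objective like this (borrowed from unsupervised skill discovery methods that use learned discriminators) is to train one discriminator per state factor that tries to recover the matching latent dimension. The sketch below is purely illustrative; `FactorDiscriminator`, the network sizes, and the reward shaping are assumptions, not the paper's implementation.

```python
# Illustrative sketch: approximate the I(S^i; Z^i) term with per-factor
# discriminators q_i(z^i | s^i) that try to recover the i-th latent dimension
# from the i-th state factor (e.g., base pose, hand pose). Names are placeholders.
import torch.nn as nn

class FactorDiscriminator(nn.Module):
    """Predicts the i-th (discrete) latent action dimension from the i-th state factor."""
    def __init__(self, state_factor_dim, num_skills):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_factor_dim, 128), nn.ReLU(),
            nn.Linear(128, num_skills),
        )

    def forward(self, state_factor):
        return self.net(state_factor)  # logits over possible values of z^i

def disentanglement_reward(discriminators, state_factors, z_indices):
    """Reward the skill policy when z^i is recoverable from its own state factor S^i.

    state_factors: list of tensors, one per factor, shape (batch, factor_dim).
    z_indices:     list of long tensors, the sampled index of each latent dimension.
    """
    ce = nn.CrossEntropyLoss(reduction="none")
    reward = 0.0
    for disc, s_i, z_i in zip(discriminators, state_factors, z_indices):
        # -CE = log q_i(z^i | s^i): a variational lower bound on I(S^i; Z^i).
        reward = reward - ce(disc(s_i), z_i)
    # The -I(S^{not i}; Z^i) half of the objective would additionally penalize
    # z^i being predictable from the *other* state factors (omitted for brevity).
    return reward
```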
Temporal Extension
Rather than deciding what to do 100 times a second, the latent actions are temporally extended. One latent action \(z\) might correspond to a 2-second trajectory of motor commands. This effectively shortens the horizon of the task—instead of making 1,000 micro-decisions to cross a room, the high-level policy only needs to make 5 or 10 macro-decisions.
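Here is a minimal sketch of what temporal extension looks like at execution time, assuming a generic gym-style `env` and a pre-trained `decoder`; the horizon `K` and the interfaces are placeholders, not the paper's exact API.

```python
# One latent action z is held fixed while the decoder issues low-level commands
# for K control steps, so the high-level policy only acts once per macro-step.
K = 20  # e.g., roughly 2 seconds of low-level control at 10 Hz (illustrative)

def execute_latent_action(env, decoder, obs, z):
    """Roll the pre-trained decoder for K low-level steps under a single latent z."""
    total_reward, done = 0.0, False
    for _ in range(K):
        motor_command = decoder(obs, z)                  # low-level joint command
        obs, reward, done, info = env.step(motor_command)
        total_reward += reward
        if done:
            break
    # The high-level policy only ever sees the outcome of the whole macro-step.
    return obs, total_reward, done
```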
Universal Safety
A major risk of training in simulation is that the robot might learn aggressive behaviors that work in physics engines but break hardware in reality. To prevent this, SLAC incorporates a Universal Safety Reward during the pre-training phase:
\[
r_{\text{safe}} = -\lambda_1 \|a\|^2 - \lambda_2 \|a - a_{\text{prev}}\|^2 - \lambda_3 \cdot \mathbb{I}_{\text{collision}} - \lambda_4 \cdot \mathbb{I}_{F > 70}
\]
This reward penalizes:
- \(\|a\|^2\): Large, high-energy actions.
- \(\|a - a_{prev}\|^2\): Jerky, erratic movements (high acceleration).
- \(\mathbb{I}_{collision}\): Collisions with the environment.
- \(\mathbb{I}_{F > 70}\): Excessive contact forces (e.g., pushing too hard).
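Translated directly from the formula above, the reward term looks roughly like the sketch below; the λ coefficients are illustrative defaults, and reading the threshold of 70 as Newtons is an assumption, not a value confirmed by the paper.

```python
# Sketch of the universal safety reward used during pre-training.
import numpy as np

def safety_reward(a, a_prev, in_collision, contact_force,
                  lam1=0.1, lam2=0.1, lam3=1.0, lam4=1.0, force_limit=70.0):
    """r_safe = -λ1·|a|² - λ2·|a - a_prev|² - λ3·1[collision] - λ4·1[F > limit]."""
    energy_penalty = lam1 * np.sum(a ** 2)                     # large, high-energy actions
    jerk_penalty = lam2 * np.sum((a - a_prev) ** 2)            # sudden changes between steps
    collision_penalty = lam3 * float(in_collision)             # any collision with the scene
    force_penalty = lam4 * float(contact_force > force_limit)  # pushing too hard
    return -(energy_penalty + jerk_penalty + collision_penalty + force_penalty)
```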
The result of Phase 1 is a Latent Action Decoder (\(\pi_{dec}\)). You feed it a high-level command \(z\), and it spits out a safe, smooth, physically coherent sequence of motor torques.
Step 2: Learning the Task in the Real World
Now we move to the real world. The robot is placed in front of a real whiteboard or a table with trays. We want to teach it a specific task using Reinforcement Learning.
The robot uses the standard RL objective function:
\[
\pi^*(a \mid o) = \arg\max_{\pi} \; \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R_{\text{task}}(s_t, a_t) \right]
\]
However, instead of outputting raw motor torques (\(a\)), the policy outputs latent actions (\(z\)) which are then decoded by the pre-trained decoder.
\[
\pi(a \mid o) = \int_{z} \pi_{\text{dec}}(a \mid o_{\text{dec}}, z)\, \pi_{\text{task}}(z \mid o)\, dz
\]
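Conceptually, the Phase 2 loop then looks something like the sketch below, reusing `execute_latent_action` from the earlier sketch. `task_policy`, `replay_buffer`, and `fla_sac_update` are placeholder names for illustration; the important point is that only the high-level policy is updated from real-world data, while the decoder stays frozen.

```python
# Sketch of real-world task learning: act in the latent space, decode to motors,
# and update only the task policy from the resulting macro-step transitions.
def real_world_training(env, task_policy, decoder, replay_buffer, num_macro_steps):
    obs = env.reset()
    for _ in range(num_macro_steps):
        z = task_policy.sample(obs)                       # high-level latent action
        next_obs, task_reward, done = execute_latent_action(env, decoder, obs, z)
        replay_buffer.add(obs, z, task_reward, next_obs, done)
        fla_sac_update(task_policy, replay_buffer)        # decoder is never touched
        obs = env.reset() if done else next_obs
```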
The Algorithm: Factorized Latent-Action SAC (FLA-SAC)
The researchers developed a specific algorithm, FLA-SAC, to train this policy efficiently.
Standard Soft Actor-Critic (SAC) works well for continuous actions, but the SLAC latent space is discrete (a finite set of skills). To keep the policy differentiable while sampling discrete skills, they use the Gumbel-Softmax trick:
\[
\hat{z}(s) = \operatorname{softmax}\left( \frac{\log \pi_{\theta}(z \mid s) + g_z}{\tau} \right), \quad g_z \sim \mathrm{Gumbel}(0, 1)
\]
This allows the network to sample discrete skills while still allowing gradients to flow back through the network during training.
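A minimal PyTorch sketch of that relaxed sampling step (PyTorch's built-in `torch.nn.functional.gumbel_softmax` implements the same idea):

```python
import torch

def gumbel_softmax_sample(logits, tau=1.0):
    """Differentiable (relaxed) one-hot sample of a discrete latent skill.

    logits: unnormalized log-probabilities of pi_theta(z | s), shape (batch, num_skills).
    tau:    temperature; lower values push the sample closer to a hard one-hot.
    """
    # g_z ~ Gumbel(0, 1), via the inverse transform of uniform noise.
    u = torch.rand_like(logits)
    gumbel_noise = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # softmax((log pi + g) / tau): the sample stays stochastic, but gradients
    # still flow back into the logits.
    return torch.softmax((logits + gumbel_noise) / tau, dim=-1)
```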
The “Secret Sauce”: Factored Q-Function Decomposition
The most critical innovation for speed is how SLAC handles rewards. Real-world tasks usually have composite rewards. For example, “Sweep trash into a bag” involves:
- Move base near table.
- Move hand near trash.
- Move other hand holding bag.
Because the latent action space was disentangled in Phase 1, the researchers can decompose the Q-function (which estimates the value of an action). They know that the “Move Base” latent action likely affects the “Navigation” reward, but probably doesn’t affect the “Gripper” reward.
\[
\begin{aligned}
Q_{\pi}(s, z) &= \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \\
&= \mathbb{E}_{\pi}\left[ \sum_{i=1}^{m} \sum_{t=0}^{\infty} \gamma^t r_t^i \right] \\
&= \sum_{i=1}^{m} \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_t^i \right] \\
&= \sum_{i=1}^{m} Q_{\pi}^{i}(s, z)
\end{aligned}
\]
By splitting the global Q-function into smaller, factored Q-functions (\(Q^i\)), the algorithm learns much faster. It effectively turns one giant, noisy learning problem into several smaller, cleaner ones.
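Here is a sketch of what a factored critic might look like, assuming one small Q-head per reward component; the architecture is illustrative, and in practice each head would be regressed against its own reward stream \(r^i\), in the spirit of the dependency structure described above.

```python
# Sketch: one Q-head per reward component, with the global value as their sum.
import torch
import torch.nn as nn

class FactoredCritic(nn.Module):
    def __init__(self, state_dim, latent_dim, num_reward_components):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + latent_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),
            )
            for _ in range(num_reward_components)
        ])

    def forward(self, state, z):
        x = torch.cat([state, z], dim=-1)
        q_components = [head(x) for head in self.heads]       # each estimates Q^i(s, z)
        return torch.stack(q_components, dim=0).sum(dim=0)    # Q(s, z) = sum_i Q^i(s, z)
```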
Experimental Results
The researchers tested SLAC on a bimanual mobile manipulator. They set up four challenging tasks:
- Board: Clean a mark off a whiteboard.
- Board-Obstacle: Clean the board while reaching over an obstacle.
- Table-Tray: Push an object into a tray.
- Table-Bag: Sweep an object into a bag held by the other hand.
These tasks require contact-rich interactions and whole-body coordination.
Speed and Success
The results were stark. SLAC was compared against three baselines:
- SERL: Direct RL in the real world (no pre-training).
- Sim2Real: Zero-shot transfer from simulation.
- RLPD: Finetuning a sim-policy with real data.

As shown in Table 1, SLAC achieved high success rates (70-90%) across all tasks.
- SERL completely failed (0% success) because the search space was too large.
- Sim2Real performed poorly (0-40%) because modeling the exact friction and contact dynamics of a sponge on a whiteboard or trash on a table is incredibly difficult.
Most impressively, look at the Unsafe # column. SLAC had almost zero safety violations. SERL and RLPD triggered safety stops dozens of times, requiring human intervention.
Learning Efficiency
How long did this take?

Figure 3a shows the training curves. SLAC converges to a high return in roughly 40-60 minutes of real-world interaction. This is a game-changer. Typically, training a policy on hardware takes hours or days.
The ablation study (Figure 3b) confirms that the components are essential:
- Entangled: If you remove the disentanglement constraint in the simulation phase (Phase 1), performance drops significantly.
- No Temp: If you remove temporal extension (making decisions at high frequency), learning becomes too slow.
Conclusion: The Best of Both Worlds
SLAC represents a pragmatic leap forward in robot learning. It acknowledges that simulations are imperfect representations of reality, but they are excellent environments for learning abstract skills.
By treating simulation as a “gym” for general coordination and the real world as the “field” for task execution, SLAC bypasses the Reality Gap. The robot doesn’t memorize what to do in the simulation; it learns how to control itself safely and efficiently.
The implications are significant:
- Safety: We can trust RL agents on expensive hardware if the action space is constrained by safety priors learned in sim.
- Efficiency: Factored Q-learning combined with disentangled actions makes real-world training feasible in strictly limited timeframes.
- Simplicity: We no longer need to spend months building “Digital Twin” simulations. A low-fidelity sandbox is enough to bootstrap capable real-world agents.
SLAC demonstrates that with the right abstraction, robots can learn to interact with the complex, messy physical world in less time than it takes to watch a movie.