Introduction
In the biological world, adaptation is survival. A newborn calf learns to walk minutes after birth. A dog with an injured leg instinctively shifts its weight to a three-legged gait to keep moving. Humans can walk on sand, ice, or stilts, adjusting their motor control in real time based on sensory feedback.
In the world of robotics, however, this level of flexibility has historically been a pipe dream. Traditional locomotion controllers are brittle “specialists.” A controller tuned for a quadruped (four-legged robot) will fail instantly if deployed on a biped (two-legged robot). Even worse, if a robot’s motor burns out or its limb is damaged, the pre-programmed control policy usually fails catastrophically because the robot’s physical reality no longer matches its internal model.
But what if we could build a “generalist” brain—a single artificial intelligence model capable of controlling any robot body, even ones it has never seen before?
This is the promise of LocoFormer, a groundbreaking model presented by researchers from Skild AI. As detailed in their paper, LocoFormer moves away from the “one robot, one policy” paradigm. Instead, it utilizes a massive Transformer-based architecture to adapt to different bodies and environmental conditions on the fly.

As shown in Figure 1 above, LocoFormer can control wheeled robots, quadrupeds, and humanoids. Perhaps most impressively, it exhibits emergent adaptation behaviors: if a robot loses a limb or is forced to walk on stilts, LocoFormer analyzes the sensor history, realizes the dynamics have changed, and alters its control strategy in real time—all without explicit retraining.
In this post, we will tear down the architecture of LocoFormer, explain why “context” is the secret ingredient to robotic adaptation, and analyze the results of this shift toward foundation models for robot control.
Background: The Problem with “Myopic” Control
To understand why LocoFormer is a significant leap, we first need to understand the limitations of current Reinforcement Learning (RL) approaches in robotics.
Typically, when engineers train a robot to walk, they use a “proprioceptive” history. The robot looks at the state of its joints and sensors over a very short window—usually the last few hundred milliseconds. We call these policies myopic (nearsighted).
A myopic policy is excellent for immediate reactions. If a robot trips, the policy sees the sudden change in velocity and corrects it. However, a few hundred milliseconds is not enough time to understand complex changes in dynamics. If a robot is walking on a slippery surface or dragging a heavy weight, a short history looks like noise. The robot cannot distinguish between “I tripped” and “my body has fundamentally changed.”
Because these standard policies cannot adapt deeply, engineers have to “bake in” the robot’s morphology (body shape) and dynamics during training. This results in rigid, specialized controllers.
The authors of LocoFormer drew inspiration from Large Language Models (LLMs). LLMs like GPT-4 are not trained for a single specific conversation. They are trained on web-scale data and use long-context windows to understand the nuance of a prompt. The researchers hypothesized that if a robot controller could access a much longer history of sensor data—spanning seconds or even multiple trials—it could perform in-context learning. It could look at the stream of data and deduce, “Based on how my motors are responding, I must be heavy,” or “I seem to be missing a leg,” and adjust accordingly.
The LocoFormer Method
LocoFormer combines three critical ingredients to achieve generalist control: a unified input space, large-scale procedural training, and a Transformer-XL architecture for long-term memory.
1. Procedural Generation: Training on “Fake” Robots
If you want a generalist robot, you cannot train it on just one or two types of bodies. You need a dataset that encompasses the vast diversity of physics. Since building thousands of physical robots is impossible, the researchers turned to simulation.
They created a massive dataset of procedurally generated robots. Instead of training on a specific “Unitree Go2” or “Boston Dynamics Spot,” they wrote code to generate random robot bodies.

As Figure 6 illustrates, these generated robots vary wildly:
- Morphology: Bipeds, quadrupeds, wheeled-legged hybrids.
- Kinematics: Different limb lengths, joint arrangements, and body masses.
- Dynamics: Randomized friction, motor strength, and sensor noise.
By training on this chaotic, randomized set of “fake” robots, the model is forced to learn general principles of locomotion rather than memorizing a specific gait for a specific body. It learns how to learn the body it currently inhabits.
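To make this concrete, here is a minimal sketch of what procedural robot sampling could look like. Every parameter name and range below is an illustrative assumption of mine, not a value from the paper:

```python
import random
from dataclasses import dataclass

@dataclass
class RobotSpec:
    """One procedurally generated robot (all fields illustrative)."""
    num_legs: int          # morphology: biped, quadruped, ...
    has_wheels: bool       # wheeled-legged hybrid?
    limb_lengths: list     # kinematics: meters, one entry per leg
    body_mass: float       # dynamics: kg
    motor_strength: float  # dynamics: scale factor on torque limits
    friction: float        # dynamics: ground contact friction

def sample_robot(rng: random.Random) -> RobotSpec:
    """Draw one random body spanning morphology, kinematics, and dynamics."""
    num_legs = rng.choice([2, 4])
    return RobotSpec(
        num_legs=num_legs,
        has_wheels=rng.random() < 0.3,
        limb_lengths=[rng.uniform(0.15, 0.45) for _ in range(num_legs)],
        body_mass=rng.uniform(5.0, 60.0),
        motor_strength=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.2, 1.2),
    )

rng = random.Random(0)
training_fleet = [sample_robot(rng) for _ in range(10_000)]
```

Because every sampled body is different, no single gait can be memorized; the only strategy that generalizes across the whole fleet is to infer the current body from feedback.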
2. A Unified Joint Space
A major challenge in generalist learning is that different robots have different numbers of motors (degrees of freedom). A humanoid might have 20 motors; a simple quadruped might have 12. Neural networks typically require a fixed input size.
LocoFormer solves this by defining a Unified Joint Space: a “superset” of inputs large enough to cover any robot it needs to control.
- If a robot has fewer joints than the superset, the extra inputs are padded with zeros.
- The policy outputs target positions for this superset.
- The robot only executes commands relevant to its actual joints.
This allows a single neural network weights file to process data from a wheeled robot and a humanoid without any architectural changes.
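A minimal sketch of this padding scheme follows. The superset size of 32 and the exact observation layout are my own guesses for illustration, not the paper’s actual values:

```python
import numpy as np

MAX_JOINTS = 32  # illustrative superset size; the paper's actual value may differ

def pad_observation(joint_pos: np.ndarray, joint_vel: np.ndarray) -> np.ndarray:
    """Zero-pad one robot's joint state up to the unified superset layout."""
    n = len(joint_pos)
    obs = np.zeros(2 * MAX_JOINTS, dtype=np.float32)
    obs[:n] = joint_pos                          # real joint positions, zeros after
    obs[MAX_JOINTS:MAX_JOINTS + n] = joint_vel   # real joint velocities, zeros after
    return obs

def apply_action(action: np.ndarray, num_real_joints: int) -> np.ndarray:
    """The policy outputs MAX_JOINTS targets; execute only the real ones."""
    return action[:num_real_joints]

# A 12-joint quadruped and a 20-joint humanoid share the same network I/O shape:
quad_obs = pad_observation(np.zeros(12), np.zeros(12))
human_obs = pad_observation(np.zeros(20), np.zeros(20))
assert quad_obs.shape == human_obs.shape == (64,)
```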
3. Architecture: The Transformer-XL
The core of LocoFormer is its brain. Standard Transformers (like the original GPT) scale poorly with sequence length—the computation cost grows quadratically. If you want a robot to remember the last 10 seconds of data at 50Hz (500 steps), a standard Transformer becomes too slow for real-time control.
The authors utilized the Transformer-XL (TXL) architecture. TXL introduces a recurrence mechanism that allows the model to attend to history beyond the current processing batch without re-computing it.

Figure 3 visualizes this segment-level recurrence:
- Segmentation: The input history is broken into fixed-length segments.
- Caching: When processing Segment 3 (the current moment), the model uses the hidden states from Segment 2 (the immediate past) as “memory.”
- Stop-Gradient (Blue Lines): Crucially, the model does not backpropagate gradients through the cached memory. This saves massive amounts of memory and compute during training.
- Extended Receptive Field (Green Lines): By stacking multiple layers, the effective “context length” grows. A deeper network can “see” further back in time through the cached states.
The mathematical formulation for this recurrence is shown below:

\[
\tilde{\mathbf{h}}_{\tau}^{\,n-1} = \left[\,\mathrm{SG}\!\left(\mathbf{h}_{\tau-1}^{\,n-1}\right) \circ \mathbf{h}_{\tau}^{\,n-1}\,\right]
\]

Here, \(\mathbf{h}_{\tau-1}^{n-1}\) represents the hidden states from the previous segment, and \(\circ\) denotes concatenation along the sequence dimension. The operator \(\mathrm{SG}\) stands for Stop-Gradient. The current layer attends to \(\tilde{\mathbf{h}}_{\tau}^{n-1}\), the concatenation of the cached past and the current input. This allows LocoFormer to maintain a memory of up to 18 seconds (assuming a 6-layer network and specific segment lengths). This is orders of magnitude longer than the ~0.5 seconds used by standard controllers.
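To see how this plays out in code, here is a minimal single-layer sketch in PyTorch. The dimensions, segment length, and the use of `nn.MultiheadAttention` as a stand-in for a full Transformer-XL layer are simplifying assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

d_model, n_heads, seg_len = 64, 4, 25
attn = nn.MultiheadAttention(d_model, n_heads)  # stand-in for one TXL layer

def txl_step(x, cache):
    """Process one segment while attending over the cached previous segment.

    x:     (seg_len, batch, d_model) current-segment hidden states
    cache: (mem_len, batch, d_model) hidden states cached from the last segment
    """
    mem = cache.detach()                  # SG(.): no gradients flow into the past
    context = torch.cat([mem, x], dim=0)  # [cached past ; current input]
    out, _ = attn(x, context, context)    # queries: present; keys/values: past+present
    return out, x.detach()                # cache this segment's input as next memory

x1 = torch.randn(seg_len, 1, d_model)
x2 = torch.randn(seg_len, 1, d_model)
cache = torch.zeros(seg_len, 1, d_model)  # empty memory at the start of an episode

h1, cache = txl_step(x1, cache)           # segment 1: attends only to itself
h2, cache = txl_step(x2, cache)           # segment 2 "sees" segment 1 at no extra cost
```

In a multi-layer network, each layer keeps its own cache, which is why depth extends the receptive field: the top layer at segment \(\tau\) indirectly reads information from many segments back.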
4. Multi-Episodic Learning
The adaptation doesn’t just happen within a single walk. The researchers structured the training to support adaptation across trials.
In this setup, a “Trial” is a single attempt to reach a goal. An “Episode” consists of multiple trials. If the robot falls (fails a trial), the memory cache is not wiped. The hidden states persist into the next trial.
\[
\max_{\pi} \; \mathbb{E}\left[ \sum_{i=1}^{k} R\!\left(\tau_i \mid H_{i-1}\right) \right]
\]

This objective function (Equation 1) drives the robot to maximize reward across the entire sequence of trials (\(\sum_{i=1}^{k}\)), where \(R(\tau_i \mid H_{i-1})\) is the return of trial \(i\). The variable \(H_{i-1}\) represents the history of previous trials. This incentivizes the robot to “remember” why it fell in Trial 1 and adjust its strategy for Trial 2. This mimics how a human might slip on a patch of ice, stand up, and immediately adopt a more cautious gait.
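The loop structure this implies is simple but unusual. Here is a sketch using hypothetical `env` and `policy` interfaces (only the cache handling reflects the paper’s idea; everything else is placeholder):

```python
def run_episode(env, policy, num_trials: int):
    """One multi-trial episode: the memory cache survives robot resets.

    `env` and `policy` are hypothetical interfaces used for illustration.
    """
    cache = policy.empty_cache()          # fresh memory once per EPISODE
    for trial in range(num_trials):
        obs = env.reset()                 # the robot is reset after every fall...
        done = False
        while not done:
            action, cache = policy.act(obs, cache)  # ...but the cache persists
            obs, reward, done = env.step(action)
    return cache                          # discarded only when the episode ends
```

The design choice to call `policy.empty_cache()` outside the trial loop, rather than inside it, is the entire mechanism: failure experience from one trial remains in memory and can shape behavior in the next.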
Experimental Results
The researchers evaluated LocoFormer in both simulation (Sim) and the real world (Real), specifically looking for Out-Of-Distribution (OOD) performance—how well does it work on robots and situations it never saw during training?
Simulation Benchmarks
LocoFormer was tested against three baselines:
- GRU: A recurrent neural network policy (older architecture).
- Conditioning: A Transformer that receives explicit information about the robot’s physics (cheating, in a sense, as LocoFormer doesn’t get this info).
- Expert Policy: A policy trained from scratch for each individual robot (the theoretical upper limit).

Table 1 reveals the results. On average, LocoFormer (Zero-shot) achieves a normalized score of 0.96, remarkably close to the Expert Policy’s 0.99. It significantly outperforms the GRU (0.37), strong evidence that the long-context Transformer architecture is essential.
More interestingly, look at the Few-shot row. This represents the robot being given 5 seconds to run around and adapt before the test. Performance jumps to 0.98, showing that the longer the robot interacts with the world, the better it understands its own body.
The Power of Adaptation Time
Does a longer memory actually help? The researchers tested this by “doubling” the domain randomization (making the physics strictly harder than anything seen during training) and measuring success rates.

Figure 4(a) shows a clear trend: as the adaptation time (x-axis) increases from 0 to 5 seconds, the robot’s success rate increases across all robot types (bipeds, quadrupeds, wheeled).
Figure 4(b) provides a fascinating look “under the hood.” It visualizes the internal neural activations of the policy. At \(t=0s\), the representations for different robots are clustered together (the brain is confused). By \(t=5s\), distinct clusters emerge. The model has internally “identified” whether it is a Unitree H1 or a Fourier GR1 solely by feeling how its motors respond to commands.

Figure 2 reinforces this. Unlike standard myopic policies (which flatline instantly), LocoFormer continues to improve its reward collection as it gathers more history (up to several seconds).
Real-World Emergent Capabilities
The most compelling part of the paper is the real-world deployment. The researchers took a standard Unitree Go2 robot and subjected it to “torture tests” to see if LocoFormer could adapt.
Important: The model was not trained on “robots with missing legs” or “robots on stilts.” It was only trained on the procedural random robots.

Figure 5 showcases these emergent behaviors:
- A (Leg Locking): The researchers locked a knee joint, effectively turning a quadruped into a tripod. A standard controller would flip over. LocoFormer stumbles, realizes the leg isn’t moving, and shifts its center of mass to balance on the remaining three legs.
- C (Wheel Locking): On a wheeled robot, they locked the wheels. The robot realized rolling was impossible and spontaneously switched to a walking gait, lifting the locked wheels like feet.
- D (Stilts): They attached wooden stilts to the robot. This drastically changes the center of mass and limb length. LocoFormer adjusted its stride timing to compensate for the longer “legs.”
- F (Amputation): In a drastic test, they removed the lower legs entirely. The robot learned to walk on its “knees” (thighs) after about 8 seconds of struggle.
Cross-Trial Adaptation
Finally, the multi-episodic capability was tested on a highly unstable robot (TRON1) in simulation.

In Figure 8, we see the robot failing in Trial 1 (tipping over). It resets. In Trial 2, it lasts longer. By Trial 4, it has synthesized the failure data from the previous attempts to form a stable gait. This “learning from failure” is a hallmark of intelligent systems and is rarely seen in standard low-level control policies.
Limitations and Computational Cost
While LocoFormer is impressive, it comes with a cost. Training a generalist Foundation Model requires significantly more compute than training a specialist.

Figure 9 highlights the trade-off. To get the best performance (the blue/orange curves reaching high rewards), you need massive compute resources (64-128 GPUs) and a deep network (6 layers). However, the “amortized” cost is low: once trained, this single model replaces thousands of specific controllers.
Additionally, the procedural generation is currently hand-crafted. The authors note that future versions could use generative AI to design the training robots, ensuring even wider coverage of potential physics scenarios.
Conclusion
LocoFormer represents a paradigm shift in robotic control. It moves us away from the era of manual tuning and specific system identification toward the era of Foundation Models for Control.
By combining massive-scale procedural data with an architecture capable of long-context learning (Transformer-XL), Skild AI has created a system that exhibits true generalization. It doesn’t just memorize movements; it understands the relationship between action and consequence.
For students of robotics and AI, LocoFormer demonstrates that the “Bitter Lesson” applies to hardware as well as language: if you want a system to be robust to the real world, don’t hard-code the rules. Instead, create a massive, diverse training sandbox and give the model the memory capacity to learn the rules itself. The result is a robot that can lose a leg, stand back up, and figure out how to keep moving.