Closing the Sim-to-Real Gap: How Active Exploration Makes Legged Robots Agile and Precise
If you have ever watched a robot fail spectacularly—falling over while trying to walk, or missing a jump by several inches—you have witnessed the “Sim-to-Real gap.” In a physics simulator, robots are perfect. They have infinite battery life, their motors respond instantly, and the ground is perfectly flat. In the real world, however, motors lag, friction varies, and mass distributions are rarely what the spec sheet says they are.
For years, the robotics community has relied on Reinforcement Learning (RL) to teach robots how to walk. We train them in simulation (Sim) and then deploy them in the real world (Real). When the physics of the two worlds don’t match, the robot fails.
Today, we are diving deep into a new paper titled “Sampling-Based System Identification with Active Exploration for Legged Sim2Real Learning” (or SPI-Active). This research proposes a fascinating solution: instead of training the robot to survive any possible world (a common technique called Domain Randomization), we should teach the robot to run experiments on itself to figure out exactly what world it is in.
The result? Robots that can perform high-precision parkour, weaving through poles and jumping across gaps with centimeter-level accuracy, without external motion capture systems.

The Problem: Why Sim-to-Real is so Hard
To understand why this paper is significant, we first need to understand the limitations of current approaches.
The Limits of Domain Randomization
The standard approach to bridging the gap is Domain Randomization (DR). In DR, you train an RL policy in thousands of simulations, each with slightly different parameters. One simulation might have a heavy robot; another might have a slippery floor. The hope is that if the policy works in all these random versions, it will work on the real robot.
The problem with DR is that it often forces the robot to be conservative. If you train a robot to walk on ice and concrete, it might adopt a slow, stomping gait that works on both but excels on neither. It sacrifices precision for robustness.
The Challenge of System Identification (Sys-ID)
The alternative is System Identification. This involves taking real-world data from the robot, analyzing it, and updating the simulator parameters (like mass, center of mass, and motor strength) to match reality. If your simulator is accurate, your policy will work.
However, traditional Sys-ID is a nightmare for legged robots for two reasons:
- Differentiability: Many modern optimization techniques require “differentiable” simulators (where you can calculate gradients to guide the tuning). Most robust physics engines are not differentiable, especially when complex contact forces (feet hitting the ground) are involved.
- Data Quality: To identify a robot’s parameters, you need good data. If you just make the robot walk in place, you won’t learn much about its rotational inertia. You need to excite the system—make it do dynamic moves—to reveal its true physics.
Enter SPI-Active
The researchers propose a two-stage framework called SPI-Active. It combines massive parallel computing to estimate parameters (without needing gradients) and an “Active Exploration” phase where the robot intelligently decides how to move to gather the best possible data.

As shown in the figure above, the process flows as follows:
- Stage 1: Collect initial data using standard policies. Identify a rough model.
- Stage 2: Use that rough model to calculate an “Active Exploration” policy. This policy forces the robot to perform specific maneuvers that maximize information gain.
- Refinement: Collect new data using the exploration policy and refine the parameters.
- Deployment: Train the final locomotion policy on the highly accurate simulator.
Let’s break down the technical innovations that make this possible.
Stage 1: Sampling-Based Parameter Identification (SPI)
The first core contribution is a robust method for identifying physical parameters without needing a differentiable simulator or expensive torque sensors.
The Optimization Problem
The goal is simple: find a set of parameters \(\theta\) (mass, inertia, motor strength) such that the trajectory of the simulated robot matches the trajectory of the real robot.
Mathematically, they formulate this as a minimization problem:
\[
\theta^{*} = \arg\min_{\theta}\ \sum_{t}\big\| \mathbf{x}_{t+1}^{r} - f(\mathbf{x}_{t}, \mathbf{u}_{t}; \theta) \big\|^{2}
\]

Here, \(\mathbf{x}\) represents the state of the robot (position, velocity, joint angles). The term \(f(\mathbf{x}_t, \mathbf{u}_t; \theta)\) represents the simulator predicting the next state based on the current state and action \(\mathbf{u}_t\).
Handling the “Black Box” Simulator
Since they use Isaac Gym (a high-performance but non-differentiable simulator), they cannot use gradient descent to find \(\theta\). Instead, they use CMA-ES (Covariance Matrix Adaptation Evolution Strategy).
Think of CMA-ES as a “guess and check” strategy on steroids. It samples a population of parameter settings, runs thousands of simulations in parallel (which is easy on modern GPUs), checks which ones match the real data best, and then shifts its “guessing distribution” toward the good results.
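To make the "guess and check on steroids" idea concrete, here is a minimal sketch on a toy point-mass system. It uses a simplified cross-entropy-style sampler rather than full CMA-ES (no covariance adaptation), but the loop structure is the same: sample a population of parameters, roll them all out, keep the best matches, and re-fit the sampling distribution. The toy dynamics and all names here are illustrative, not from the paper.

```python
import numpy as np

def simulate(theta, u_seq, dt=0.01):
    """Toy 'black-box simulator': a point mass with unknown mass m and friction b."""
    m, b = theta
    x = v = 0.0
    traj = []
    for u in u_seq:
        v += (u - b * v) / m * dt   # acceleration from command minus friction
        x += v * dt
        traj.append((x, v))
    return np.array(traj)

def rollout_error(theta, real_traj, u_seq):
    """State-matching cost: how far the simulated trajectory is from reality."""
    return np.mean((simulate(theta, u_seq) - real_traj) ** 2)

def identify(real_traj, u_seq, iters=40, pop=64, n_elite=8, seed=0):
    """Gradient-free search: sample, evaluate, re-fit (a CEM-style stand-in for CMA-ES)."""
    rng = np.random.default_rng(seed)
    mean, std = np.array([3.0, 1.0]), np.array([2.0, 1.0])
    for _ in range(iters):
        samples = np.abs(mean + std * rng.standard_normal((pop, 2))) + 1e-3
        costs = np.array([rollout_error(s, real_traj, u_seq) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]    # keep best matches
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return mean
```

Each population evaluation is an independent rollout, which is exactly what a GPU simulator like Isaac Gym parallelizes across thousands of environments.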
Physical Feasibility: The Log-Cholesky Decomposition
One subtle but critical challenge in System ID is ensuring the identified parameters are physically possible. For example, a mass matrix (Inertia) must be positive definite. If the optimization algorithm predicts a negative mass or an impossible inertia tensor, the physics engine will crash or explode.
To prevent this, the authors don’t identify the inertia matrix directly. Instead, they use the Log-Cholesky decomposition. They optimize a lower-triangular matrix \(\mathbf{U}\) and use it to construct the pseudo-inertia matrix \(\mathbf{J}\).
\[
\mathbf{J} = \mathbf{U}\mathbf{U}^{\top}
\]
where the diagonal entries of \(\mathbf{U}\) are parameterized as exponentials of unconstrained variables (the "Log" in Log-Cholesky), so they are always strictly positive.

This mathematical trick guarantees that no matter what numbers the optimizer comes up with, the resulting robot model obeys the laws of physics.
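A minimal sketch of the idea for a 3×3 matrix, assuming the standard log-Cholesky construction (the paper's exact pseudo-inertia parameterization may include additional terms):

```python
import numpy as np

def spd_from_log_cholesky(params):
    """Map 6 unconstrained numbers to a guaranteed-SPD 3x3 matrix.

    The three diagonal entries of the Cholesky factor are exponentiated
    (the 'log' part), so they are always strictly positive; U @ U.T is
    then symmetric positive definite by construction, no matter what
    numbers the optimizer proposes.
    """
    U = np.zeros((3, 3))
    U[np.tril_indices(3, k=-1)] = params[:3]      # off-diagonal: unconstrained
    U[np.diag_indices(3)] = np.exp(params[3:])    # diagonal: always positive
    return U @ U.T
```

Because the optimizer searches over the unconstrained `params` vector, CMA-ES never needs to know about the positive-definiteness constraint at all.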
Modeling Actuator Dynamics
In standard simulations, we assume that if we command 10 Nm of torque, the motor delivers 10 Nm. In reality, motors have friction, back-EMF, and saturation limits. This discrepancy is a huge source of the Sim-to-Real gap.
The authors introduce a hyperbolic tangent (tanh) model to capture the non-linear relationship between the commanded torque (\(\tau_{PD}\)) and the actual output torque (\(\tau_{motor}\)).
\[
\tau_{motor} = \kappa \tanh\!\left(\frac{\tau_{PD}}{\kappa}\right)
\]

The parameter \(\kappa\) determines the saturation limit. By including this in the identification process, the system learns exactly how “weak” or “strong” the real motors are compared to the ideal ones.
The Cost Function
The actual cost function used to evaluate parameter sets is comprehensive. It doesn’t just look at where the robot is; it looks at velocities, orientation, and joint states over a horizon of time \(H\).
\[
J(\theta) = \sum_{t=1}^{H} \big\| \mathbf{x}_{t}^{r} - \mathbf{x}_{t} \big\|_{\mathbf{W}}^{2} + \lambda \big\| \theta - \theta_{0} \big\|^{2}
\]

This equation essentially says: “Find parameters that minimize the difference between real states (\(\mathbf{x}^r\)) and simulated states (\(\mathbf{x}\)), while keeping the parameters close to a reasonable prior (\(\theta_0\)).”
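A minimal sketch of such a cost, with illustrative weights and stacked-state arrays (the paper's actual weighting of positions, velocities, orientation, and joint states is more detailed):

```python
import numpy as np

def sysid_cost(sim_traj, real_traj, theta, theta_prior, w_state=None, w_reg=0.05):
    """Trajectory-matching cost over a horizon H, plus a regularization term
    that keeps parameters near their nominal (e.g. URDF) values.

    sim_traj, real_traj: (H, state_dim) arrays of stacked states
    (base pose, velocities, joint angles, ...).
    """
    if w_state is None:
        w_state = np.ones(real_traj.shape[1])     # per-state-component weights
    state_term = np.mean(w_state * (real_traj - sim_traj) ** 2)
    reg_term = w_reg * np.sum((theta - theta_prior) ** 2)
    return state_term + reg_term
```

The prior term matters in practice: with limited real data, many parameter combinations explain the trajectories equally well, and the regularizer breaks the tie in favor of physically plausible values.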
Stage 2: Active Exploration
The method described above works well, but it has a flaw: it relies on the data you happen to have. If you only have data of the robot standing still, you can’t possibly identify its moment of inertia. You need to excite the system.
But random shaking is dangerous for a legged robot—it might fall over. This is where Active Exploration comes in.
The Fisher Information Matrix (FIM)
The researchers use a concept from information theory called the Fisher Information Matrix (FIM). In simple terms, the FIM quantifies how much information a sequence of data provides about an unknown parameter.
The Cramér-Rao bound states that the variance of any unbiased parameter estimate is lower-bounded by the inverse of the FIM. Therefore, maximizing the FIM minimizes the achievable uncertainty.
\[
\operatorname{Cov}(\hat{\theta}) \succeq \mathcal{I}(\theta)^{-1}
\]

Optimizing Command Sequences
Instead of training a neural network from scratch to explore (which causes erratic behavior), the authors optimize the input commands to an existing locomotion policy.
They take a pre-trained policy that can walk, trot, or jump based on input commands (forward velocity \(v_x\), yaw rate \(\omega_z\), gait frequency, etc.). They then search for the sequence of commands \(\mathbf{c}_{1:T}\) that maximizes the trace of the FIM.
\[
\mathbf{c}_{1:T}^{*} = \arg\max_{\mathbf{c}_{1:T}} \operatorname{tr}\, \mathcal{I}(\theta; \mathbf{c}_{1:T})
\]

This effectively asks: “What combination of fast walking, turning, and gait switching will stress the robot’s physics the most, thereby revealing its true parameters?”
Calculating the exact FIM is hard, so they use an approximation based on the sensitivity of the dynamics to the parameters:
\[
\mathcal{I}(\theta) \approx \sum_{t=1}^{T} \left(\frac{\partial \mathbf{x}_{t+1}}{\partial \theta}\right)^{\!\top} \left(\frac{\partial \mathbf{x}_{t+1}}{\partial \theta}\right)
\]

This calculation is computationally heavy, but thanks to the parallel nature of the SPI framework, it can be computed efficiently.
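To illustrate why excitation matters, here is a hypothetical sketch that approximates the FIM trace via finite-difference sensitivities on a toy point-mass simulator (unit noise covariance assumed; names and dynamics are illustrative, not the paper's). A robot standing still yields zero information about its mass and friction, while a dynamic command sequence does not.

```python
import numpy as np

def simulate(theta, u_seq, dt=0.01):
    """Toy point-mass simulator: theta = (mass, friction)."""
    m, b = theta
    x = v = 0.0
    traj = []
    for u in u_seq:
        v += (u - b * v) / m * dt
        x += v * dt
        traj.append((x, v))
    return np.array(traj)

def fim_trace(theta, u_seq, eps=1e-5):
    """Approximate tr(FIM) by summing squared finite-difference sensitivities
    of the trajectory with respect to each parameter."""
    total = 0.0
    base = simulate(theta, u_seq)
    for i in range(len(theta)):
        perturbed = np.array(theta, dtype=float)
        perturbed[i] += eps
        sens = (simulate(perturbed, u_seq) - base) / eps
        total += np.sum(sens ** 2)   # contributes the i-th diagonal of sum_t S_t^T S_t
    return total

theta = np.array([1.0, 0.3])
gentle = np.zeros(300)                                       # standing still
aggressive = 2.0 * np.sin(np.linspace(0, 8 * np.pi, 300))    # dynamic excitation
```

With zero commands the state never moves, so the trajectory is identical for every parameter setting and the information is exactly zero; the sinusoidal sequence produces motion whose shape depends on mass and friction, which is the signal the identification stage needs.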
Experiments and Results
Does this complex pipeline actually work better than just randomizing the domain? The authors tested this on a Unitree Go2 quadruped and a Unitree G1 humanoid.
1. Prediction Accuracy
First, they checked if the identified model could predict the future. They collected validation data on the real robot and checked if the simulator (using the new parameters) matched the reality.
As seen in Table 1(a) below, SPI-Active (the full method) achieved the lowest prediction error across root position, joint angles, and velocity compared to gradient-based methods (GD) and standard SPI without active exploration.

2. Task Performance
The real test is deploying RL policies. They trained policies for four distinct tasks:
- Forward Jump: Precision jumping to a target 0.85m away.
- Yaw Jump: Jumping while rotating 135 degrees.
- Velocity Tracking: Following complex speed commands.
- Attitude Tracking: Tilting the body precisely.

The results were stark. As shown in the performance table below (Table 1b), SPI-Active consistently outperformed “Vanilla” (default URDF) and “Heavy DR” (aggressive domain randomization).

In the Forward Jump task, the Vanilla policy often fell short or lost balance. The SPI-Active policy, trained on a simulator that “knew” the robot’s exact actuator limits and inertia, landed the jump consistently.

We can see the trajectory tracking in Figure 4. The red line (real-world SPI-Active) tracks the blue line (simulation) and the grey line (reference) much more closely than the Vanilla approach, particularly during the most dynamic phase of the motion.

3. Why Active Exploration Matters
Was the second stage really necessary? The authors performed an ablation study comparing SPI-Active against “SPI-Random” (where the exploration phase used random commands).
The data revealed something interesting: The FIM-optimized trajectories induced higher velocities and more diverse gait patterns (like pronking) compared to random commands.

In plot (a), you can see the FIM strategy (blue) pushed the robot to higher velocity magnitudes. In plot (b), it selected “Pronking” (a high-impact gait) more often. This high-energy data provided a clearer signal for the system identification algorithm, resulting in significantly lower error in the final Forward Jump task (Plot c).
4. The Importance of Motor Modeling
Finally, the researchers verified their choice of motor model. They compared their Tanh-based model against a simple linear model and a unified model (assuming all motors are the same).

The “Ours” method (Equation 3 from earlier) resulted in the lowest error. This proves that treating the motors as non-linear, joint-specific components is crucial for high-performance tasks.
Conclusion
The SPI-Active framework represents a shift in how we think about robot learning. Rather than trying to make a policy robust to everything (which makes it mediocre at everything), we can use the robot’s own intelligence to probe its physical reality.
By combining massive parallel simulation with information-theoretic exploration, we can build “digital twins” that are incredibly faithful to the physical world. This allows us to train policies in simulation that transfer zero-shot to the real world with the kind of precision usually reserved for hard-coded control theory.
As legged robots move from research labs to real-world applications, techniques like SPI-Active will be essential for ensuring they operate safely, efficiently, and precisely in complex environments.
Reference
Sobanbabu, N., He, G., He, T., Yang, Y., & Shi, G. (2025). Sampling-Based System Identification with Active Exploration for Legged Sim2Real Learning. 9th Conference on Robot Learning (CoRL 2025).