Introduction

Imagine training a robot to carry a tray of drinks. In a simulation, if the robot trips and shatters the glass, you simply hit “reset.” In the real world, however, that failure is costly, dangerous, and messy. This is the fundamental tension in Deep Reinforcement Learning (DRL). While DRL has achieved incredible feats—from defeating grandmasters in Go to controlling complex robotic hands—it essentially learns through trial and error. It explores the world, makes mistakes, and gradually optimizes a policy to maximize rewards.

For robotics, “maximizing rewards” isn’t enough; we need stability. We need to know that the robot will not enter dangerous states or oscillate wildly. Traditionally, control theory provides these guarantees using Lyapunov stability, a mathematical framework that certifies a system will converge to a safe equilibrium.

Recently, researchers have tried to merge these two worlds by teaching neural networks to learn Lyapunov functions alongside their control policies. This provides a “safety certificate” during the learning process. However, there is a catch: current methods are sample-inefficient. They rely on “on-policy” learning, meaning they must discard old data and constantly generate new experiences to verify stability. For a physical robot, this means hours of wear and tear.

In the paper “Off Policy Lyapunov Stability in Reinforcement Learning,” researchers from the University of Victoria propose a novel solution. They introduce a framework that allows agents to learn stability guarantees using off-policy data—experience collected in the past. By decoupling the safety check from the current policy, they create a method that is both mathematically rigorous and highly data-efficient.

In this post, we will break down how they achieved this, the mathematics behind off-policy Lyapunov learning, and how they integrated this into state-of-the-art algorithms like Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO).


Background: The Stability Problem

To understand the contribution of this paper, we first need to understand the tools involved: Lyapunov functions and how they are currently used in RL.

What is Lyapunov Stability?

Lyapunov stability is the gold standard in control theory. Visually, imagine a marble rolling inside a bowl. No matter where you drop the marble, gravity pulls it toward the bottom (the equilibrium point).

A Lyapunov function, \(L(x)\), is a mathematical representation of this “bowl.” To prove a system is stable, we need to find a function \(L(x)\) that satisfies two main conditions:

  1. Positive Definite: The function is positive everywhere except at the equilibrium (the goal), where it is zero. Think of this as the “energy” of the error.
  2. Negative Derivative: The value of the function decreases over time along the system’s trajectory. As the system moves, its “energy” must dissipate until it reaches the goal.
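
Written out symbolically for an equilibrium \(x^*\), these two conditions take roughly the following standard textbook form (reconstructed from the description above, not the paper’s exact notation):

\[
L(x) > 0 \;\; \text{for } x \neq x^*, \qquad L(x^*) = 0, \qquad \frac{d}{dt} L\bigl(x(t)\bigr) < 0 \;\; \text{along trajectories with } x(t) \neq x^*.
\]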

In classical control, engineers derive these functions manually using physics equations. In Model-Free RL, we don’t have the physics equations. Therefore, we must use a Neural Lyapunov Function—a neural network trained to approximate this geometry.
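
As a rough illustration (not the paper’s implementation), a neural Lyapunov candidate can be as simple as a small MLP that maps a state to a scalar “energy”; the Lyapunov conditions are then encouraged through the training loss discussed below:

```python
import torch
import torch.nn as nn


class LyapunovCandidate(nn.Module):
    """Minimal sketch of a neural Lyapunov candidate L(x).

    A plain MLP that maps a state to a scalar "energy" value. Positivity
    and the decrease condition are not built into the architecture here;
    they are encouraged by the training loss discussed later. Layer
    sizes are illustrative.
    """

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scalar "energy" L(x) for each state in the batch.
        return self.net(x)
```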

The Problem with On-Policy Learning

Previous approaches, such as the Lyapunov Actor-Critic (LAC), attempt to learn this function alongside the policy. They rely on the standard definition of the Lie derivative (the rate of change of \(L\)).

For a sampled system, we approximate the change in the Lyapunov function using a finite difference between the current state \(s\) and the next state \(s'\):

The finite difference approximation of the Lie derivative.

Here, \(\Delta t\) is the time step. If this value is negative, the system is moving toward stability.
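
Based on the description above, the finite difference takes roughly this form (reconstructed from the text, not copied from the paper):

\[
\mathcal{L}_{f, \Delta t} L(s) \approx \frac{L(s') - L(s)}{\Delta t}.
\]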

The limitation of existing methods is that they calculate this change based on the current policy’s behavior. If you want to verify stability, you have to run the robot using the current policy to see where it goes. You cannot easily use data collected 10 minutes ago (generated by an older, worse policy) because the transition from state \(s\) to \(s'\) might not reflect what the current agent would do.

This restriction forces algorithms to be on-policy, which requires vast amounts of fresh data—a luxury we rarely have in robotics.


The Core Method: Learning Stability Off-Policy

The authors propose a shift in perspective. Instead of relying on a Lyapunov function that depends only on the state, \(L(s)\), they introduce a Lyapunov function that depends on both the state and the action, \(L(s, a)\).

This seemingly small change is the key to unlocking off-policy learning. By including the action in the function, the agent can look at a transition \((s, a, s')\) stored in a replay buffer (old data) and still learn about the geometry of stability, regardless of what policy generated that action.

The New Lyapunov Risk Function

To train this neural network, the authors define a specific loss function (or “risk”) that the network must minimize. The goal is to punish the network if it violates the Lyapunov conditions.

Here is the mathematical formulation of the objective function:

The loss function for the off-policy Lyapunov candidate.

Let’s break down this equation (Eq. 5 in the paper):

  1. \(\max(0, -L_{\eta}(s, a))\): This term ensures the function is positive. If the network outputs a negative value for “energy,” the loss increases.
  2. \(\max(0, \mathcal{L}_{f, \Delta t}L_{\eta})\): This term ensures the function is decreasing. It penalizes the network if the “energy” increases over a time step (i.e., if the derivative is positive).
  3. Dependence on \(\pi(s')\): Notice that the finite difference relies on \(L_{\eta}(s', \pi(s'))\). This is the “off-policy” magic. We use the stored transition to reach \(s'\), but we evaluate the Lyapunov function at \(s'\) using the action the current policy \(\pi\) would take there.

By minimizing this risk, the neural network learns a valid Lyapunov candidate using historical data (\(D\)), significantly improving sample efficiency.
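
As a concrete sketch, the risk described above could be implemented along these lines in PyTorch. The function names, batch layout, and hyperparameters are assumptions for illustration, not the authors’ code:

```python
import torch


def lyapunov_risk(L_net, policy, batch, dt=0.05):
    """Sketch of the off-policy Lyapunov risk described above.

    L_net(s, a) returns a scalar Lyapunov value per sample, policy(s)
    returns the action the *current* policy would take, and `batch`
    holds off-policy transitions (s, a, s') drawn from a replay buffer.
    """
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]

    L_sa = L_net(s, a)                               # L_eta(s, a)
    with torch.no_grad():
        a_next = policy(s_next)                      # current policy's action at s'
    L_next = L_net(s_next, a_next)                   # L_eta(s', pi(s'))

    lie_derivative = (L_next - L_sa) / dt            # finite-difference decrease term

    positivity = torch.clamp(-L_sa, min=0.0)         # penalize negative "energy"
    decrease = torch.clamp(lie_derivative, min=0.0)  # penalize increasing "energy"

    return (positivity + decrease).mean()
```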

Ensuring Convergence with \(\mu\)

Simply requiring that the energy decreases (a derivative below 0) isn’t always enough for practical learning: the energy might decrease so slowly that the system effectively stalls. To fix this, the authors introduce a hyperparameter \(\mu\), which enforces a minimum rate of decay.

The modified loss function looks like this:

The loss function with the minimum decay rate parameter mu.

Here, the derivative must be less than \(-\mu\). This forces the system to converge toward the equilibrium at a healthy pace. While this makes the function non-differentiable exactly at the equilibrium, it provides a robust signal for the neural network during training.
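
Concretely, following the text above, the decrease penalty in the risk changes from \(\max(0, \mathcal{L}_{f, \Delta t}L_{\eta})\) to something of the form

\[
\max\bigl(0,\; \mathcal{L}_{f, \Delta t} L_{\eta}(s, a) + \mu\bigr),
\]

which is zero only when the finite-difference derivative is at most \(-\mu\).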

Algorithm 1: Lyapunov Soft Actor-Critic (LSAC)

The authors integrate this off-policy Lyapunov function into the Soft Actor-Critic (SAC) algorithm, a popular off-policy RL method known for its exploration capabilities.

In standard SAC, the agent tries to maximize an objective \(J_{\pi}\) that includes the expected reward and an entropy term (randomness/exploration):

The standard Soft Actor-Critic objective function.
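
For reference, a common way to write this maximum-entropy objective (standard SAC notation, reconstructed here rather than copied from the paper) is:

\[
J_{\pi} = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \bigl[ r(s_t, a_t) + \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \bigr],
\]

where \(\alpha\) is the entropy temperature that trades off reward against exploration.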

The authors modify this objective to create LSAC. They add a penalty term that activates whenever the policy takes an action that violates the stability condition.

The Lyapunov Soft Actor-Critic (LSAC) objective function.

In Equation 10 above, \(\beta\) is a “Lyapunov temperature” hyperparameter.

  • If the stability condition is met (the derivative is more negative than \(-\mu\)), the \(\max\) term becomes 0. The Lyapunov function does not interfere, allowing the agent to maximize reward freely.
  • If the action causes instability (the derivative is too high), the penalty kicks in, forcing the policy to adjust its behavior toward safety.
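
A minimal sketch of how such a penalized actor objective might look in code, with all interfaces and hyperparameters as illustrative assumptions rather than the paper’s implementation:

```python
import torch


def lsac_actor_loss(q_net, L_net, policy, batch, alpha, beta, dt=0.05, mu=0.05):
    """Sketch of an LSAC-style actor loss (to be minimized): the usual
    SAC term plus a Lyapunov penalty that is active only when the
    decrease condition is violated. Interfaces are illustrative.
    """
    s, a, s_next = batch["s"], batch["a"], batch["s_next"]

    # Standard SAC actor term: entropy-regularized Q maximization.
    new_a, log_prob = policy.sample(s)        # reparameterized action + log-prob
    sac_term = (alpha * log_prob - q_net(s, new_a)).mean()

    # Lyapunov penalty: fires when the finite-difference derivative exceeds -mu.
    a_next, _ = policy.sample(s_next)
    lie = (L_net(s_next, a_next) - L_net(s, a)) / dt
    penalty = torch.clamp(lie + mu, min=0.0).mean()

    return sac_term + beta * penalty
```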

Algorithm 2: Lyapunov PPO (LPPO)

Proximal Policy Optimization (PPO) is technically an on-policy algorithm, but it is one of the most widely used methods in robotics due to its reliability. The authors show that their off-policy Lyapunov function can essentially “supercharge” PPO.

Standard PPO updates its policy using an estimated “advantage”, \(\hat{A}_t\), which measures how much better an action is than the policy’s average behavior in that state.

The authors create LPPO by augmenting this advantage function:

The augmented advantage function for LPPO.

If an action satisfies the Lyapunov decrease (stability), the advantage remains unchanged. However, if the action is unstable, the advantage is penalized. This trick effectively tells the PPO algorithm: “Even if this action leads to a high reward, it is ‘disadvantageous’ because it is unsafe.”
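
A minimal sketch of this augmentation (interfaces and the penalty scale are assumptions, not the paper’s exact formula):

```python
import torch


def lppo_advantage(adv, L_net, policy, s, a, s_next, beta=1.0, dt=0.05, mu=0.05):
    """Sketch of the LPPO augmentation: leave the advantage untouched
    when the Lyapunov decrease condition holds, and subtract a penalty
    proportional to the violation otherwise. Interfaces are illustrative.
    """
    with torch.no_grad():
        a_next = policy(s_next)                      # current policy's action at s'
        lie = (L_net(s_next, a_next) - L_net(s, a)) / dt
        violation = torch.clamp(lie + mu, min=0.0)   # zero for stable transitions
    return adv - beta * violation.reshape(adv.shape)
```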

This modified advantage is then plugged directly into the standard PPO clipping objective:

The LPPO objective function.
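
For reference, the standard clipped surrogate has the following well-known form; writing \(\hat{A}^{L}_t\) here for the augmented advantage, LPPO simply substitutes it in place of \(\hat{A}_t\):

\[
J^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Bigl[\min\Bigl(r_t(\theta)\,\hat{A}^{L}_t,\; \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}^{L}_t\Bigr)\Bigr],
\qquad
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
\]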


Experiments and Results

To validate their framework, the researchers tested these algorithms on two distinct control tasks: the classic Inverted Pendulum and a high-dimensional Quadrotor (drone) simulation.

1. Inverted Pendulum

The Inverted Pendulum is a standard benchmark where an agent must swing up a pendulum and balance it at the top.

Sample Efficiency: The authors compared their LSAC (Lyapunov SAC) against standard SAC, PPO, and existing Lyapunov-based methods like LAC and POLYC.

Training rewards for the Pendulum experiment.

In Figure 2(a) above, the green line represents LSAC. Notice how it shoots up to the maximum reward faster than the other methods. This confirms the hypothesis: using off-policy data to learn the stability certificate makes learning significantly faster.

Visualizing Stability: One of the most compelling results is the visualization of the learned Lyapunov functions. The researchers plotted the “state space” of the pendulum (angle vs. angular velocity) and marked regions where the stability condition was violated with red dots.

Level curves and stability violations for LSAC, POLYC, and LAC.

  • Left (LSAC): The plot is clean with very few red dots (0.84% violations). The algorithm has learned a robust stability region.
  • Right (LAC): The plot is covered in red dots (51% violations). The on-policy nature of LAC struggled to explore and certify the entire space effectively.

This visualization suggests that LSAC isn’t just learning to balance; it is learning a region of attraction in which the stability condition holds almost everywhere.

2. Quadrotor Tracking

The second experiment involved a Quadrotor drone tracking a reference trajectory—a much harder task with 13 state dimensions and 4 control inputs. Previous off-policy methods (like pure SAC or LAC) often fail entirely on this task due to the complexity. Therefore, the comparison focused on the PPO-based variants.

Performance:

Training rewards for the Quadrotor environment.

As shown in Figure 4, LPPO (green) converges faster than POLYC (cyan) and standard PPO (red). It achieves high rewards in fewer timesteps, reinforcing the sample efficiency argument.

Trajectory Tracking:

3D trajectory tracking comparison for Quadrotor.

Figure 5 shows the drone’s actual flight path.

  • The red line is the target path.
  • The blue line is the drone’s path.
  • Plot (a) LPPO: The tracking is tight and accurate.
  • Plot (b) POLYC: The drone tracks well but struggles slightly near the end.
  • Plot (c) PPO: Significant deviation from the target.

The LPPO agent uses the off-policy Lyapunov function to guide its exploration. It avoids “unsafe” erratic movements during training, which helps it converge on a smooth, accurate tracking policy much faster than standard PPO.


Conclusion & Implications

The research presented in “Off Policy Lyapunov Stability in Reinforcement Learning” tackles a critical bottleneck in robotic learning. By decoupling the Lyapunov function from the current policy and making it dependent on state-action pairs \((s, a)\), the authors enabled the use of historical data to certify safety.

Key Takeaways:

  1. Efficiency: Off-policy Lyapunov learning is significantly more sample-efficient than on-policy counterparts.
  2. Flexibility: The framework is versatile. It can transform off-policy algorithms like SAC into stable controllers (LSAC) and improve the safety/efficiency of on-policy algorithms like PPO (LPPO).
  3. Safety: The method produces policies with fewer stability violations, as visualized by the phase plots of the pendulum.

What’s Next? While the results are promising in simulation, the authors note that physical hardware testing is the next logical step. Real-world dynamics introduce noise and delays that simulation often misses. Additionally, theoretical work is needed to provide formal guarantees that the learned functions are valid Lyapunov certificates, ensuring they hold up mathematically as well as they do empirically.

For students and practitioners in RL, this paper highlights an important trend: we don’t have to choose between the efficiency of off-policy learning and the safety of control theory. With the right mathematical formulation, we can have both.