Introduction
One of the most persistent challenges in robotics is controlling systems that are difficult to model mathematically. Soft robots, for example, have infinite degrees of freedom and complex, nonlinear dynamics that defy standard “first principles” physics modeling.
Historically, engineers have faced a dilemma. They could use Model-Based Control, which is rigorous and efficient but fails when the mathematical model doesn’t perfectly match reality. Alternatively, they could use Data-Driven approaches, like Reinforcement Learning (RL). RL is powerful because it learns from experience, but it is notoriously “sample inefficient”—often requiring millions of trial-and-error interactions to learn a simple task. This makes RL impractical for real hardware, where gathering data is slow and expensive.
But what if there was a middle ground? What if we could use the efficiency of linear models but apply them to highly nonlinear systems, updating our understanding of the robot in real-time as it moves?
This brings us to Recursive Koopman Learning (RKL). This new pipeline combines the mathematical elegance of Koopman Operator Theory with the speed of Recursive Least Squares (RLS). The result is a system that can learn to control complex, nonlinear robots with less than 10% of the data required by state-of-the-art RL methods, all while running computationally lightweight updates in real-time.

As illustrated in Figure 1, the RKL pipeline creates a closed loop. The system estimates the state, computes an optimal control input using Model Predictive Control (MPC), applies it to the environment, and then immediately uses the resulting data to update its internal model via Recursive Least Squares (RLS). This allows the controller to adapt on the fly, essentially “learning while doing.”
Background: Linearity in a Nonlinear World
To understand how RKL works, we must first establish the mathematical foundation: Koopman Operator Theory.
The Koopman Operator
Dynamical systems in the real world are usually nonlinear. In a linear system, if you double the input, you double the output. In a nonlinear system, doubling the input might triple the output, or do nothing at all. Linear systems are easy to control; we have decades of solved math for them. Nonlinear systems are hard.
Koopman theory offers a loophole. It proposes that a nonlinear dynamical system in a finite-dimensional state space can be represented as a linear system in an infinite-dimensional space of “observables.”
Let \(\mathbf{x}\) be the state of our robot and \(\mathbf{u}\) be the control input. Instead of looking at \(\mathbf{x}\) directly, we look at a set of observation functions \(\phi(\mathbf{x})\), such as polynomial features (\(x, x^2, x^3\)) or trigonometric functions (\(\sin(x), \cos(x)\)).
If we lift the state into this higher-dimensional space, the evolution of these observables can be described linearly by the Koopman Operator, \(\mathcal{K}\). Writing the lifted state as \(\mathbf{z}_k = \phi(\mathbf{x}_k)\), the dynamics take the approximately linear form

\[ \mathbf{z}_{k+1} \approx \mathbf{K}_z \mathbf{z}_k + \mathbf{K}_g \mathbf{u}_k \]

Here, \(\mathbf{K}_z\) and \(\mathbf{K}_g\) represent the discrete-time linear dynamics in the lifted space. This allows us to use powerful linear control techniques on nonlinear robots.
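To make the lifting concrete, here is a minimal Python sketch. The particular dictionary of observables (constant, linear, quadratic, and trigonometric terms) and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def lift(x: np.ndarray) -> np.ndarray:
    """Map a raw state x into a vector of observables z = phi(x).

    The dictionary below (constant, linear, quadratic, and trig terms)
    is an illustrative choice; any set of basis functions can be used.
    """
    return np.concatenate([
        [1.0],        # constant term
        x,            # linear terms
        x**2,         # quadratic terms
        np.sin(x),    # trigonometric terms
        np.cos(x),
    ])

def predict_next(K_z: np.ndarray, K_g: np.ndarray,
                 x_k: np.ndarray, u_k: np.ndarray) -> np.ndarray:
    """One-step prediction in the lifted space: z_{k+1} ≈ K_z z_k + K_g u_k."""
    z_k = lift(x_k)
    return K_z @ z_k + K_g @ u_k
```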
Extended Dynamic Mode Decomposition (EDMD)
Since we cannot compute an infinite-dimensional operator, we approximate it using a finite number of basis functions. The standard method for finding the matrix \(\mathbf{K}\) from data is Extended Dynamic Mode Decomposition (EDMD).
Given a dataset of snapshots, EDMD finds the best linear operator that maps the current observables to the next time step's observables by minimizing the prediction error:

\[ \mathbf{K} = \arg\min_{\mathbf{K}} \left\lVert \bar{\mathbf{Y}} - \mathbf{K} \mathbf{Y} \right\rVert_F^2 \]
Here, \(\mathbf{Y}\) contains the current snapshots and \(\bar{\mathbf{Y}}\) contains the snapshots one time step later. The solution to this optimization problem typically involves the Moore-Penrose pseudoinverse (\(\dagger\)):

\[ \mathbf{K} = \bar{\mathbf{Y}} \, \mathbf{Y}^{\dagger} \]
While EDMD is effective, it has a major drawback: it is a batch process. Every time you get new data, you have to append it to the matrices \(\mathbf{Y}\) and \(\bar{\mathbf{Y}}\) and recompute the solution from scratch. The cost of each re-fit grows with the dataset size \(N\) (at least \(O(N)\) work just to touch every sample, on top of the pseudoinverse), which makes it too slow for real-time updates on a robot moving at high frequency.
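For concreteness, here is a minimal sketch of the batch fit (names are my own). Note how every new data pair forces the matrices to grow and the whole pseudoinverse to be recomputed:

```python
import numpy as np

def edmd_fit(Y: np.ndarray, Y_next: np.ndarray) -> np.ndarray:
    """Batch EDMD: K = Y_next @ pinv(Y), minimizing ||Y_next - K Y||_F.

    Y      -- lifted snapshots, one column per time step
    Y_next -- the same snapshots shifted one step forward in time
    """
    return Y_next @ np.linalg.pinv(Y)

# Every new data pair forces a full re-fit on the grown matrices:
# Y      = np.hstack([Y,      alpha_new[:, None]])
# Y_next = np.hstack([Y_next, beta_new[:, None]])
# K      = edmd_fit(Y, Y_next)   # cost grows with the dataset size N
```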
The Core Method: Recursive Koopman Learning (RKL)
The innovation of RKL is replacing the batch processing of EDMD with an iterative, online update mechanism known as Recursive Least Squares (RLS).
The Power of Recursive Updates
The goal is to update the Koopman matrix \(\mathbf{K}\) immediately after every single time step \(k\), using only the new data pair \((\boldsymbol{\alpha}_k, \boldsymbol{\beta}_k)\) (the regressor built from the current lifted state and control, and the lifted state at the next step), without retraining on the whole history.
To do this, we track a matrix \(\mathbf{P}\), which represents the inverse of the data covariance. In standard EDMD, we would calculate it from scratch:

\[ \mathbf{P}_k = \left( \mathbf{Y}_k \mathbf{Y}_k^{\top} \right)^{-1} = \left( \sum_{i=1}^{k} \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^{\top} \right)^{-1} \]
Calculating the inverse (or pseudoinverse) of large matrices is computationally expensive (\(O(n^3)\)). RLS avoids this by using the Sherman-Morrison formula, which allows us to update the inverse of a matrix after a small rank-1 change (adding one data point) using simple matrix-vector multiplication (\(O(n^2)\)).
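As a small sanity check (my own illustration, not code from the paper), the snippet below verifies that the Sherman-Morrison rank-1 update of an inverse agrees with recomputing the inverse from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A stands in for the symmetric data matrix Y @ Y.T; P is its inverse.
Y = rng.standard_normal((n, 20))
A = Y @ Y.T
P = np.linalg.inv(A)

alpha = rng.standard_normal(n)          # one new lifted data point

# Sherman-Morrison: inverse of (A + alpha alpha^T) from P in O(n^2) work.
P_alpha = P @ alpha
P_new = P - np.outer(P_alpha, P_alpha) / (1.0 + alpha @ P_alpha)

# Matches the O(n^3) recomputation from scratch.
assert np.allclose(P_new, np.linalg.inv(A + np.outer(alpha, alpha)))
```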
The Algorithm
The RKL process works as follows:
- Initialization: We start with a small, offline dataset to get an initial estimate of \(\mathbf{K}_0\) and \(\mathbf{P}_0\) using standard EDMD.
- Control Loop: At each time step \(k\), we obtain the current state \(\mathbf{z}_k\) and control \(\mathbf{u}_k\).
- Update P: We calculate a gain factor \(\gamma_k\) and update the covariance matrix \(\mathbf{P}\):

\[ \gamma_k = \frac{1}{1 + \boldsymbol{\alpha}_k^{\top} \mathbf{P}_k \boldsymbol{\alpha}_k}, \qquad \mathbf{P}_{k+1} = \mathbf{P}_k - \gamma_k \, \mathbf{P}_k \boldsymbol{\alpha}_k \boldsymbol{\alpha}_k^{\top} \mathbf{P}_k \]

- Update K: We then update the Koopman model \(\mathbf{K}\) using the prediction error (the difference between the actual next lifted state \(\boldsymbol{\beta}_k\) and the prediction \(\mathbf{K}_k \boldsymbol{\alpha}_k\)):

\[ \mathbf{K}_{k+1} = \mathbf{K}_k + \gamma_k \left( \boldsymbol{\beta}_k - \mathbf{K}_k \boldsymbol{\alpha}_k \right) \left( \mathbf{P}_k \boldsymbol{\alpha}_k \right)^{\top} \]
Notice that the computational complexity here depends only on the size of the observables (\(n\)), not on the size of the dataset (\(N\)). This means the algorithm runs just as fast after 1,000,000 steps as it does after 1 step. This is the key to real-time capability.
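A minimal sketch of this update loop in Python is below. The class and variable names are my own, and the paper's exact implementation may differ (for example, in regularization or initialization), but the gain, \(\mathbf{P}\), and \(\mathbf{K}\) updates follow the standard recursive least-squares form given above.

```python
import numpy as np

class RecursiveKoopman:
    """Recursive least-squares update of a lifted linear model beta ≈ K @ alpha."""

    def __init__(self, K0: np.ndarray, P0: np.ndarray):
        self.K = K0.copy()   # current Koopman matrix estimate (from batch EDMD)
        self.P = P0.copy()   # inverse data-covariance estimate

    def update(self, alpha: np.ndarray, beta: np.ndarray) -> None:
        """Incorporate one data pair (alpha_k, beta_k) in O(n^2) time."""
        P_alpha = self.P @ alpha
        gamma = 1.0 / (1.0 + alpha @ P_alpha)           # scalar gain factor
        error = beta - self.K @ alpha                   # one-step prediction error
        self.K += gamma * np.outer(error, P_alpha)      # correct the model
        self.P -= gamma * np.outer(P_alpha, P_alpha)    # Sherman-Morrison update
```

Each call to `update` costs a couple of matrix-vector products and two outer products, regardless of how much data has already been seen.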
Theoretical Convergence: Why it Works
A critical contribution of this research is not just the engineering pipeline, but the mathematical proof that this method actually converges.
The researchers analyzed EDMD and RLS in the context of Markov Chains. Robot data is not independent and identically distributed (i.i.d.); where the robot is now depends on where it was previously. This dependency creates a Markov chain.
The paper provides the first formal convergence analysis of EDMD under continuous data growth. By leveraging the Strong Law of Large Numbers for Markov Chains, the authors identify sufficient conditions for convergence:
- The data generation process must be ergodic (the robot eventually explores the relevant state space).
- The observables must be square-integrable.
- The covariance matrix must remain full rank.
This analysis supports the Attempting Control Goal (ACG) hypothesis: data collected while trying to control the system is particularly informative. Even if the initial policy is imperfect, the act of trying to reach the goal generates data that pushes the model to converge exactly where accuracy is needed most.
Control Synthesis: MPC-SAC
With an up-to-date linear model \(\mathbf{K}\), RKL synthesizes control inputs using Sequential Action Control (MPC-SAC).

MPC-SAC is a variation of Model Predictive Control that calculates continuous-time optimal control actions analytically. It is highly stable and avoids the numerical issues often found in standard discrete-time MPC solvers when dealing with learned models.
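The MPC-SAC derivation itself is beyond this overview, but the sketch below shows the general idea of plugging the learned lifted model into linear optimal control. It substitutes a plain finite-horizon LQR for the paper's MPC-SAC, with hypothetical cost weights \(Q\) and \(R\), purely to show where \(\mathbf{K}_z\) and \(\mathbf{K}_g\) enter the controller.

```python
import numpy as np

def lifted_lqr_gains(K_z, K_g, Q, R, horizon):
    """Finite-horizon LQR on the lifted model z_{k+1} = K_z z_k + K_g u_k.

    A generic stand-in for MPC-SAC: it only illustrates how the learned
    linear lifted dynamics plug into standard linear optimal control.
    Q penalizes lifted-state error, R penalizes control effort.
    """
    P = Q.copy()                  # terminal cost-to-go
    gains = []
    for _ in range(horizon):      # backward Riccati recursion
        S = R + K_g.T @ P @ K_g
        F = np.linalg.solve(S, K_g.T @ P @ K_z)   # state-feedback gain: u_k = -F @ z_k
        P = Q + K_z.T @ P @ (K_z - K_g @ F)
        gains.append(F)
    return gains[::-1]            # time-ordered: gains[0] is applied first

# In a receding-horizon loop, the gains would be recomputed from the freshly
# updated Koopman matrices at each step and only the first gain applied.
```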
Experiments & Results
The researchers validated RKL on two distinct platforms: a simulated planar arm and a real-world soft robot.
Simulation: Planar Two-Link Arm
The first test was a tracking task where a simulated arm had to follow a figure-8 trajectory. This provided a comparison against strong baselines, including Soft Actor-Critic (RL-SAC) and Randomized Ensembled Double Q-Learning (REDQ).
Sample Efficiency: RKL showed massive improvements in sample efficiency.
- RKL achieved high-performance tracking with only 3,500 data steps.
- RL-SAC required nearly 2,000,000 steps to reach comparable performance.
- REDQ, a state-of-the-art sample-efficient RL method, still required significantly more data than RKL.
The visual difference in trajectory tracking is stark. Below, Figure 7 compares standard Koopman Learning (KL) against Recursive Koopman Learning (RKL).

In the columns above, notice how the red line (actual path) tracks the blue line (reference) much more tightly for RKL-SAC (bottom row) than for KL-SAC (top row), even with small datasets.
In contrast, look at the results for pure Reinforcement Learning (RL-SAC) below (Figure 9). Even after 1 million to 2.5 million training steps, the tracking is still imperfect.

Similarly, REDQ (Figure 10) performs better than standard RL but still lags behind RKL in convergence speed.

Hardware: The Soft Stewart Platform
The ultimate test was on the Soft Stewart Platform, a parallel robot actuated by soft, pneumatic-like artificial muscles. This system is highly nonlinear, hybrid (involving contact with walls), and difficult to model physically.

The task was to balance a puck on the platform or move it along specific trajectories.
Speed and Stability: RKL was able to learn a high-performance controller in just 1 minute and 20 seconds (8,000 steps). By comparison, RL-SAC took 2 hours and 46 minutes and achieved less than 50% of the performance.
The box plot below (Figure 2) summarizes the balancing error.

- Chart (a) shows the mean error. RKL-SAC (orange) achieves very low error even with tiny datasets (1-2 minutes).
- Chart (b) shows stability (standard deviation). RKL is significantly more stable than the RL baseline (green), which suffers from oscillation.
It is worth noting that standard Koopman Learning (KL) without updates performed poorly. This proves that the online recursive update is the “secret sauce.” The pre-trained model is never perfect; the robot must adapt to the specific conditions of the current task (friction, specific actuator behavior) in real-time.
Conclusion
Recursive Koopman Learning (RKL) represents a significant step forward in data-driven control. By combining the theoretical guarantees of Koopman operators with the algorithmic efficiency of Recursive Least Squares, RKL solves the “sample efficiency” problem that plagues modern robotics.
Key Takeaways:
- Efficiency: RKL learns in minutes what RL learns in hours (or days).
- Scalability: The computational cost of RKL does not grow with the dataset size, enabling “forever” learning.
- Adaptability: Continuous updates allow the controller to refine its model locally, capturing complex dynamics that a static global model might miss.
- Theory: Formal analysis confirms that learning on the fly (using data from the control task itself) leads to model convergence.
For students and researchers interested in controlling soft robots or other complex systems, RKL offers a compelling alternative to deep reinforcement learning—one that respects the constraints of the physical world and the value of rigorous mathematical modeling.