Imagine trying to move a heavy sofa up a winding staircase with two friends. It requires constant communication: “Pivot left,” “lift higher,” “wait, it’s slipping.” Now, imagine doing that while flying, buffeted by wind, connected to the object only by loose cables, and—here is the kicker—you aren’t allowed to speak to each other.

This is the challenge of cooperative aerial manipulation. While a single drone (Micro Aerial Vehicle or MAV) is often too weak to carry heavy payloads, a team of drones can lift significantly more. By tethering multiple drones to a single load, we can transport construction materials or emergency supplies to remote areas.

However, coordinating a swarm of drones attached to a swinging pendulum is incredibly difficult. The state-of-the-art solution has typically been centralized control: a single, powerful computer calculates the physics for every drone and tells them exactly what to do. While effective, this approach is computationally expensive and fragile. If the central computer fails, or communication bandwidth drops, the whole system crashes.

In a new paper presented at CoRL 2025, researchers propose a breakthrough: a fully decentralized method using Multi-Agent Reinforcement Learning (MARL). Their system allows a team of drones to manipulate a cable-suspended load with high precision without communicating with each other.

Figure 1: Multi-MAV lifting system performing full-pose control of a cable-suspended load. Left: simulation environment used to train the decentralized outer-loop control policy. Right: policy transferred to the real system.

The Problem with Centralization

To understand why this research is significant, we first need to look at the physics of the problem. When multiple drones carry a single object via cables, they are dynamically coupled. If Drone A moves left, it pulls the cable, which tugs the load, which changes the tension on Drone B’s cable.

Traditional methods, like Nonlinear Model Predictive Control (NMPC), treat the entire squad and the load as one giant robot. The central brain solves complex differential equations to ensure Drone A doesn’t pull Drone B out of the sky.

This works in the lab, but it has major drawbacks in the real world:

  1. Scalability: As you add more drones, the joint optimization problem grows much harder. A central computer eventually hits a limit on how fast it can solve it.
  2. Communication: It requires perfect, high-speed data transfer between all drones and the central computer.
  3. Fragility: It introduces a single point of failure.

The researchers set out to solve this by making each drone an independent thinker.

The Solution: Decentralized MARL

The team developed a method where each drone runs its own control policy onboard. The drones do not share their internal states, nor do they coordinate explicitly. Instead, they rely on implicit communication through the load itself. By observing how the load moves and rotates, a drone can infer what the others are doing and adjust its behavior accordingly.

The Architecture: Implicit Coordination

The researchers modeled the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). They used a training paradigm known as Centralized Training, Decentralized Execution (CTDE).
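To make CTDE concrete, here is a minimal PyTorch-style sketch of the two networks involved. The layer sizes, dimensions, and class names are illustrative assumptions, not the paper's actual architecture; the point is simply that the actor only ever sees one agent's local observation, while the critic, which is discarded after training, sees the privileged global state.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: maps one drone's local observation o_i to its action.
    The same weights are shared by every drone; the one-hot ID inside o_i is what
    lets identical networks behave like distinct agents."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # bounded outputs (accel + body rates)
        )

    def forward(self, o_i: torch.Tensor) -> torch.Tensor:
        return self.net(o_i)

class CentralizedCritic(nn.Module):
    """Training-only value function: receives the privileged global state
    (load, goal, and every MAV), so it can assign credit across agents.
    It never runs on the real drones."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Illustrative dimensions: 6 actions = 3 accelerations + 3 body rates.
actor = Actor(obs_dim=30, act_dim=6)
critic = CentralizedCritic(state_dim=60)
```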

Figure 2: Overview of our method. Dotted lines indicate components used only for training; dashed lines indicate those used only for real-system deployment; solid lines are used for both. The training process involves the centralized critic (which observes the privileged global state), direct access to MAV states, and the actuator model that maps rotor speeds to thrust forces. Shared actors make decisions based on local observations, without access to other agents' states. The output actions, namely acceleration and body rates, are tracked by a robust model-based low-level controller based on INDI.

As shown in Figure 2, the system consists of two main loops:

  1. Outer-Loop (The “Brain”): A neural network trained via Reinforcement Learning (RL) that decides what the drone should do (e.g., “accelerate forward”).
  2. Inner-Loop (The “Reflexes”): A robust low-level controller that figures out how to spin the motors to achieve that acceleration, handling immediate disturbances like wind or cable tension.
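The split between the two loops can be sketched as a simple nested control loop. Everything below is a stand-in (stub functions, made-up rates and shapes) meant only to show the interface: the learned policy decides slowly and strategically, while the model-based controller reacts quickly to what the IMU actually measures.

```python
import numpy as np

def outer_loop_policy(local_obs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """The 'brain' (stub): in the real system a trained network maps the local
    observation to a reference linear acceleration (3,) and body rates (3,)."""
    return np.zeros(3), np.zeros(3)

def inner_loop_controller(a_ref: np.ndarray, rates_ref: np.ndarray,
                          imu_accel: np.ndarray, imu_gyro: np.ndarray) -> np.ndarray:
    """The 'reflexes' (stub): a model-based controller (INDI in the paper) turns
    the references into four rotor commands, absorbing wind and cable tugs that
    show up in the IMU measurements."""
    return np.zeros(4)

OUTER_HZ, INNER_HZ = 100, 500              # assumed rates, for illustration only
local_obs = np.zeros(30)                   # placeholder local observation

a_ref, rates_ref = outer_loop_policy(local_obs)          # slow: "what to do"
for _ in range(INNER_HZ // OUTER_HZ):                    # fast: "how to do it"
    imu_accel, imu_gyro = np.zeros(3), np.zeros(3)       # fake IMU readings
    rotor_cmds = inner_loop_controller(a_ref, rates_ref, imu_accel, imu_gyro)
```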

Training vs. Execution

During training (in simulation), the system uses a “Critic” network that sees everything—the global state. This includes the exact position and velocity of the load, the goal, and every single drone.

Equation 1: Global State definition

However, during execution (actual flight), the “Actor” (the policy running on the drone) wears blinders. It only receives local observations (\(o_i\)).

Equation 2: Local Observation definition

The observation \(o_i\) includes:

  • The Load’s position and rotation (\(p_L, R_L\)).
  • The Goal state relative to the load (\(x_G\)).
  • The drone’s own state (\(x_{M,i}\)).
  • A “one-hot” vector identifying itself (\(e_i\)) so it knows which agent it is (e.g., “I am Drone #1”).

Crucially, it does not see the other drones. It doesn’t know where they are or what they are thinking. It only knows how the load is behaving. If the load tilts unexpectedly, the neural network infers that another drone must be pulling on it and compensates.
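A rough sketch of how such an observation vector might be assembled is shown below. The shapes, the flattened-rotation encoding, and the helper name are assumptions for illustration; the paper's exact observation definition is given by Equation 2.

```python
import numpy as np

def build_local_observation(p_L, R_L, x_G, x_M_i, agent_idx, num_agents):
    """Assemble one drone's local observation o_i from quantities it can sense
    or estimate: the load pose, the goal relative to the load, its own state,
    and a one-hot agent ID. No other drone's state appears anywhere."""
    e_i = np.zeros(num_agents)
    e_i[agent_idx] = 1.0                      # "I am drone #agent_idx"
    return np.concatenate([
        p_L,                                  # load position (3,)
        R_L.flatten(),                        # load rotation matrix (9,)
        x_G,                                  # goal state relative to the load
        x_M_i,                                # this drone's own state
        e_i,                                  # one-hot identity vector
    ])

# Example for drone 0 of 3, with dummy values for each component.
o_0 = build_local_observation(
    p_L=np.zeros(3), R_L=np.eye(3),
    x_G=np.zeros(6), x_M_i=np.zeros(9),
    agent_idx=0, num_agents=3)
print(o_0.shape)   # (30,) with these illustrative sizes
```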

The Secret Sauce: Action Space Design

One of the biggest hurdles in using Reinforcement Learning for robotics is the Sim-to-Real gap. A policy trained in a perfect simulation often fails in the real world because physics engines aren’t perfect representations of reality, especially when complex aerodynamics and slack cables are involved.

To bridge this gap, the researchers innovated on the Action Space—essentially, what commands the neural network outputs.

Typical approaches use:

  • Collective Thrust and Body Rates (CTBR): The RL outputs a total thrust command and body rates directly. This is very hard to transfer to reality because it requires the sim to model the vehicle’s thrust response and aerodynamics very accurately.
  • Velocity: The RL tells the drone how fast to go. This often leads to dangerous oscillations.

The researchers proposed a hybrid action space called ACCBR (Acceleration and Body Rates). The policy outputs:

  1. Reference Linear Acceleration.
  2. Reference Body Rates (how fast the drone should spin).

These commands are fed into a low-level controller based on Incremental Nonlinear Dynamic Inversion (INDI).

Equation 4: Acceleration Controller

The INDI controller is a robust feedback loop that uses accelerometer data to instantly correct for external forces—like the yank of a cable or a gust of wind—before the high-level RL policy even notices. This allows the RL agent to focus on the high-level strategy of moving the load, while the mathematical controller handles the messy physics of keeping the drone stable.
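To see why this incremental idea is so forgiving of unmodeled forces, here is a deliberately crude, one-dimensional sketch of it (vertical axis only, unit mass, made-up numbers; the real controller works on the full thrust vector and attitude). The controller never needs a model of the cable: whatever the cable does shows up in the measured acceleration and is cancelled on the next increment.

```python
def indi_thrust_update(a_ref, a_meas, thrust_prev, effectiveness=1.0):
    """One INDI-flavored increment: command only the *change* in thrust needed
    to close the gap between reference and measured acceleration."""
    return thrust_prev + (a_ref - a_meas) / effectiveness

# Demo: an unknown, constant cable pull of -2 m/s^2 acts on a hovering drone.
thrust, cable_pull, a_ref = 9.81, -2.0, 0.0      # goal: zero net vertical acceleration
for step in range(4):
    a_meas = thrust - 9.81 + cable_pull           # what the (filtered) accelerometer reports
    thrust = indi_thrust_update(a_ref, a_meas, thrust)
    print(f"step {step}: measured accel {a_meas:+.2f} m/s^2, new thrust {thrust:.2f} m/s^2")
# The thrust settles at 11.81 m/s^2, exactly compensating a disturbance the
# controller was never told about.
```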

Does It Actually Work?

The researchers tested their method in both simulation and real-world experiments using three custom-built quadrotors carrying a 1.4 kg payload.

1. Precision Tracking

First, they compared their decentralized RL method against the centralized NMPC (the mathematical “gold standard”).

Figure 3: Time series of pose tracking results comparing our method and a centralized NMPC method [6]. Our method also includes a setup with 4 MAVs.

The results (Figure 3) are impressive. The decentralized method (Ours) tracks the reference almost as well as the centralized NMPC. While the NMPC is slightly smoother (because it plans a full trajectory ahead of time), the RL method achieves comparable accuracy in final positioning.

Most importantly, the RL method’s computational cost per drone is constant. Whether you have 3 drones or 100, each drone’s onboard policy takes about 6 milliseconds to evaluate. In contrast, the centralized NMPC takes 78 ms for 3 drones, and its solve time grows rapidly as more agents are added.

2. Choosing the Right Actions

The team validated their choice of the ACCBR action space through ablation studies in simulation.

Figure 5: Positional and attitude errors comparing different action spaces at test time in the Gazebo environment.

Table 1: Pose tracking RMSEs of different action spaces at test time in the Gazebo environment.

As seen in Figure 5:

  • ACC (Acceleration only): Fails to control orientation well (Orange line).
  • VEL (Velocity): Causes hazardous oscillations (Blue line).
  • ACCBR (Green line): Provides the best balance of stability and tracking accuracy.

Note that the collective-thrust method (CTBR), often used in other RL papers, failed to even take off in this complex scenario because it couldn’t handle the unpredictable forces from the cables.

3. Robustness: Hacking and Failure

The most exciting part of the research is how robust the system is. Because each drone is independent, the system creates a natural resilience.

Scenario A: The “Hacked” Drone

The researchers intentionally “hacked” one drone, overriding its RL policy with a manual controller that tried to pull the load away from the goal.

Figure 4: Real-world experiments. (A) Snapshot of the test with heterogeneous agents… (B) Snapshot of the test where additional load is added… (C) Snapshot of the case where one MAV fails in flight…

In Figure 4A (and detailed in Figure 7 below), we see that when the hacked drone (let’s call it Drone X) pulls outward, the other two drones, sensing the load shifting, immediately counter-pull to keep the load steady. They didn’t know Drone X was hacked; they just felt the load move and corrected for it. A centralized system might have fought against itself or become unstable because its model of the system no longer matched reality.

Figure 7: Time series of the load pose in the heterogeneous agents scenario…

Scenario B: Mid-Flight Engine Failure

In a dramatic test, the researchers cut power to one drone completely during flight (Figure 4C). The drone dropped, hanging as dead weight from the load.

Remarkably, the remaining two drones recovered. They stabilized the load and continued to control its position and yaw.

Figure 8: Time series of load pose in the in-flight failure of one MAV case…

Figure 8 shows the failure moment (purple line). The Partially Observable (Decentralized) policy (Blue line) recovers quickly. The Fully Observable (Centralized) policy (Red line), surprisingly, fails catastrophically. Because the centralized policy relies on the states of all agents, the sudden erratic data from the falling drone confuses the entire system, causing it to crash into the ground.

This proves that ignorance can be bliss: by ignoring the specific states of their teammates and focusing only on the task (the load), the drones became resilient to teammate failure.

Why We Don’t Need to See Everything

One might assume that seeing more data is always better. However, the training curves reveal an interesting insight about observation spaces.

Figure 6: Training curves of fully observable, partially augmented, and partially observable observation spaces.

The researchers compared a policy that sees everything (Fully observable) against their decentralized policy (Partially observable). While the fully observable policy learns slightly faster initially, the decentralized policy catches up. This confirms that the load pose acts as a sufficient statistic. The physics of the load contains all the necessary information about what the other drones are doing.

Conclusion

This research represents a significant step forward for aerial robotics. By combining Multi-Agent Reinforcement Learning with a robust low-level controller, the authors demonstrated that we don’t need expensive, centralized supercomputers to perform complex tasks.

The key takeaways are:

  1. Decentralization Works: Drones can coordinate complex tasks using only local information and implicit communication through the physical object they are carrying.
  2. Action Space Matters: Choosing the right output (Acceleration + Body Rates) is the bridge that allows AI learned in simulation to work in the real world.
  3. Resilience: A decentralized swarm is harder to kill. Even if agents go rogue or fail completely, the remaining agents adapt naturally.

While the system currently relies on motion capture cameras for precise load positioning (a limitation for outdoor use), the move toward onboard vision systems seems like the next logical step. Future construction sites might just feature swarms of silent, independent drones, working in perfect harmony to build the skylines of tomorrow.