Competitive sports have long been the proving ground for Artificial Intelligence. We’ve seen AI conquer Chess, Go, Poker, and even complex video games like StarCraft II. But moving from the virtual world of pixels to the physical world of robotics introduces a massive spike in difficulty. In “embodied” sports, agents don’t just need to outsmart an opponent; they have to grapple with physics, gravity, aerodynamics, and the chaos of the real world.

While robot soccer has been a popular benchmark for years, a new, highly dynamic challenge has emerged: 3v3 Multi-Drone Volleyball.

In a fascinating new paper from Tsinghua University, researchers propose Hierarchical Co-Self-Play (HCSP), a framework that teaches a team of quadrotors to play volleyball. The challenge is immense: the drones must coordinate a three-hit combo (pass, set, attack), maneuver in 3D space, hit a moving ball with a small racket, and adapt to opponents—all without ever seeing a human expert demonstrate how to do it.

In this post, we will dive deep into how they achieved this, breaking down the hierarchical architecture, the three-stage training pipeline, and the surprising emergent behaviors that the drones developed on their own.

The Challenge: Why Drone Volleyball?

Before looking at the solution, we need to understand the problem. The VolleyBots testbed introduces a scaled-down 6m x 12m court with a net height of 2.43m. The rules mimic real volleyball:

  1. Two teams of three drones.
  2. The ball must be returned over the net within three hits.
  3. The same drone cannot hit the ball twice in a row.
  4. If the ball hits the ground or goes out of bounds, the other team scores.

Illustrations of the 3v3 multi-drone volleyball task.

This task is a nightmare for standard Reinforcement Learning (RL) for several reasons:

  • Coupled Challenges: It requires high-level strategy (Who should hit the ball? Where should we aim?) and low-level control (How much thrust for each of the 4 rotors to hit the ball at exactly the right angle?).
  • Long Horizons: An action taken now (a pass) might not result in a reward (a point scored) for several seconds, making it hard for the AI to learn cause-and-effect.
  • Underactuated Dynamics: Quadrotors are “underactuated,” meaning they can’t independently control all their degrees of freedom simultaneously. To move forward, they must tilt, which changes the angle of the racket attached to them.
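As a quick back-of-the-envelope illustration (a simplified model added here, not taken from the paper): assume the racket is rigidly fixed to the body and the drone holds its altitude while pitching by an angle \(\theta\). Balancing gravity vertically while accelerating horizontally gives

\[ T \cos\theta = mg, \qquad a_x = \frac{T \sin\theta}{m} = g \tan\theta, \]

so any forward acceleration \(a_x\) requires a nonzero tilt, and because the racket is fixed to the body, its contact angle shifts by exactly that same \(\theta\). Precise hitting and aggressive maneuvering are therefore tightly coupled.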

The Solution: Hierarchical Co-Self-Play (HCSP)

To solve this, the researchers moved away from “end-to-end” learning (where one massive neural network tries to do everything). Instead, they adopted a Hierarchical Reinforcement Learning (HRL) approach.

Think of a professional sports team. You have a Coach (High-Level Strategy) who calls the plays and decides who does what. Then you have the Athletes (Low-Level Skills) who actually execute the movements using their muscle memory.

HCSP mimics this structure:

  1. High-Level Strategy (The Coach): An event-driven policy that observes the whole game and issues commands. It runs only when specific events happen (e.g., a ball is hit or crosses the net).
  2. Low-Level Skills (The Athletes): A pool of distinct policies (Serve, Set, Attack, Hover) that control the drone’s motors at a high frequency (50Hz).

HCSP architecture: an event-driven high-level strategy handles strategic decisions, while multiple low-level skills manage continuous control.

As shown in Figure 1 above, the workflow is cyclical. The environment sends observations to the High-Level Strategy. The Strategy picks a skill (e.g., “Drone 2, execute Set”) and sends specific parameters to the Low-Level Skill network. That network then outputs continuous motor commands to the drone.
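To make this loop concrete, here is a minimal sketch of how such an event-driven hierarchy could be wired together. All names and interfaces (the observation keys, `select_skills`, the event checks) are assumptions for illustration, not the authors' actual code:

```python
# Minimal sketch of an event-driven hierarchical control loop.
# All names and interfaces here are illustrative assumptions, not the paper's code.

def ball_crossed_net(prev_obs, obs):
    """Event trigger: the ball moved from one side of the net plane to the other."""
    return (prev_obs["ball_pos"][1] > 0.0) != (obs["ball_pos"][1] > 0.0)

def run_episode(env, high_level, skills):
    obs = env.reset()
    # One (skill, parameters) pair per drone; everyone starts by hovering.
    active = {drone_id: ("Hover", None) for drone_id in env.drone_ids}

    prev_obs, done = obs, False
    while not done:
        # High-level "coach": re-decides only on discrete events.
        if obs["racket_hit_ball"] or ball_crossed_net(prev_obs, obs):
            # e.g., {2: ("Set", target_xyz), ...} -- a skill plus its parameters.
            active = high_level.select_skills(obs)

        # Low-level "athletes": run every control step (50 Hz in the paper).
        actions = {
            drone_id: skills[skill_name].act(obs, drone_id, params)  # rotor commands
            for drone_id, (skill_name, params) in active.items()
        }

        prev_obs = obs
        obs, reward, done, info = env.step(actions)
```

The key design choice is that the "coach" never touches rotor commands and the "athletes" never reason about the score; each level only sees the decision space it is responsible for.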

The Three-Stage Training Pipeline

The brilliance of this paper lies in how the authors trained this system. You can’t just throw all of this into a simulator and hope for the best. The researchers designed a three-stage pipeline to build capability from the ground up.

Stage I: Low-Level Skill Acquisition

Before a player can play a match, they need to learn the basics. In Stage I, the researchers trained individual neural networks for specific motion primitives. They defined seven core skills:

Table 1: Description of the seven low-level skills acquired in Stage I.

The Problem of Transitions

Training these skills in isolation is risky. If you train an “Attack” policy starting from a perfect hover, it might fail during a game because the drone is actually moving fast after a previous maneuver.

To solve this, the authors used Policy Chaining. They trained skills in sequences. For example, the Hover skill is trained starting from the state where an Attack skill just finished. This ensures that the end of one skill matches the beginning of the next, creating smooth transitions.
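A rough sketch of what policy chaining might look like in training code: instead of always resetting to a clean hover, the next skill's episodes start from terminal states collected while running the preceding skill. The function names and interfaces below are assumptions for illustration:

```python
# Illustrative sketch of policy chaining; interfaces are assumed, not the authors' code.
import random

def collect_terminal_states(env, skill, n_rollouts=256):
    """Run an already-trained skill and record the physical states it ends in."""
    terminal_states = []
    for _ in range(n_rollouts):
        obs, done = env.reset(), False
        while not done:
            obs, _, done, _ = env.step(skill.act(obs))
        terminal_states.append(env.get_state())  # full simulator state snapshot
    return terminal_states

def train_chained(env, new_skill, predecessor_terminal_states, update_fn, n_iters=10_000):
    """Train a skill whose episodes begin where the preceding skill left off."""
    for _ in range(n_iters):
        # Key idea: the initial-state distribution is the predecessor's terminal states,
        # so the learned transition (e.g., Attack -> Hover) is smooth at deployment time.
        env.set_state(random.choice(predecessor_terminal_states))
        update_fn(new_skill, env)  # one RL update (e.g., a PPO iteration) from this start

# Example chaining: train Hover from the states an Attack maneuver ends in.
# attack_ends = collect_terminal_states(env, attack_skill)
# train_chained(env, hover_skill, attack_ends, ppo_update)
```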

Ablation study on policy chaining in Stage I.

As seen in Figure 4(c) above, Policy Chaining was critical. Without it (the “Single-policy” line), the drone simply couldn’t learn to stabilize (Hover) after an aggressive Attack maneuver.

Stage II: High-Level Strategy Pretraining

Once the “athletes” (low-level skills) were trained, they were frozen. In Stage II, the focus shifted to training the “Coach” (High-Level Strategy).

The High-Level policy is a Multi-Layer Perceptron (MLP) with three “heads”—one for each drone on the team. It sees the global state of the game (positions of all drones and the ball) and outputs which skill each drone should perform.
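As a rough sketch, a three-headed strategy network could look like the following. The hidden sizes, parameter dimensions, and observation layout are assumptions; only the three-head structure and the seven-skill output follow from the description above:

```python
# Illustrative three-headed strategy network (PyTorch); all sizes are assumptions.
import torch.nn as nn

NUM_SKILLS = 7   # Serve, Set, Attack, Hover, ... (see Table 1)
PARAM_DIM = 3    # e.g., a target position handed to the selected skill (assumed)
OBS_DIM = 60     # global state: poses/velocities of all six drones plus the ball (assumed)

class HighLevelStrategy(nn.Module):
    def __init__(self, obs_dim=OBS_DIM, hidden=256, num_drones=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per teammate: a discrete skill choice plus continuous skill parameters.
        self.skill_heads = nn.ModuleList([nn.Linear(hidden, NUM_SKILLS) for _ in range(num_drones)])
        self.param_heads = nn.ModuleList([nn.Linear(hidden, PARAM_DIM) for _ in range(num_drones)])

    def forward(self, global_obs):
        h = self.trunk(global_obs)
        skill_logits = [head(h) for head in self.skill_heads]  # who does what
        skill_params = [head(h) for head in self.param_heads]  # e.g., where to aim
        return skill_logits, skill_params
```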

Event-Driven Control & Sample Reallocation

A key innovation here is how the team handled time. In a continuous game, making a high-level strategic decision every 0.02 seconds is inefficient and noisy. Instead, the High-Level strategy is event-driven. It only wakes up when:

  1. A racket hits the ball.
  2. The ball crosses the net.

This creates a “sparse” decision process. To make training efficient, the researchers used Sample Reallocation.

Illustrations of the high-level strategy pretraining stage (Stage II).

Standard RL training gathers data in batches. Because the high-level events are rare, a standard batch would be mostly empty or “do nothing” steps. Sample reallocation (shown in Figure 2b) extracts only the meaningful transition moments and reassigns the rewards accumulated over the waiting period to that single decision.
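A minimal sketch of what sample reallocation might look like, assuming each logged step carries an "is_event" flag (racket hit or net crossing); the helper names are mine, not the paper's:

```python
# Illustrative sketch of sample reallocation for an event-driven high-level policy.

def reallocate_samples(trajectory):
    """Collapse a dense 50 Hz trajectory into sparse, event-level transitions.

    Rewards earned while waiting for the next event are summed and credited to the
    high-level decision that was in effect over that interval.
    """
    event_transitions = []
    current = None          # (obs, action) of the most recent high-level decision
    pending_reward = 0.0    # reward accumulated since that decision

    for step in trajectory:
        pending_reward += step["reward"]
        if step["is_event"]:
            if current is not None:
                event_transitions.append({
                    "obs": current["obs"],
                    "action": current["action"],
                    "reward": pending_reward,   # the whole interval's reward, reassigned
                    "next_obs": step["obs"],
                })
            current = {"obs": step["obs"], "action": step["action"]}
            pending_reward = 0.0
    return event_transitions
```

The batch that reaches the optimizer then contains only decision-to-decision transitions, so none of the training budget is spent on "waiting" steps.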

The reward function for the high-level strategy is deliberately simple: essentially, win the game, plus a term for making racket contact with the ball.

\[ r_{j,t}^{H} = c_1 \times \mathrm{win\_or\_lose}_j + c_2 \times \mathrm{racket\_hit\_ball}_j \]

This sparse reward (Equation 1) is enough because the low-level skills already know how to hit the ball; the high-level policy just needs to learn when and where.

By using Population-Based Training (PBT), specifically a method called PSRO (Policy-Space Response Oracles), the strategy evolved by playing against previous versions of itself.
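The real PSRO procedure also solves a meta-game over the population (choosing a Nash mixture of opponents), which is omitted here, but a bare-bones sketch of self-play against a growing pool of past versions might look like this; all helper functions are assumed:

```python
# Bare-bones sketch of population-based self-play in the spirit of PSRO.
# The meta-game (Nash mixing over opponents) is omitted; helpers are assumed.
import copy
import random

def population_self_play(policy, train_against, evaluate, n_generations=5, steps_per_gen=1000):
    population = [copy.deepcopy(policy)]   # frozen snapshots of past strategies
    for gen in range(n_generations):
        # Best-response step: improve the live policy against sampled past opponents
        # (PSRO would weight opponents by the meta-game solution instead of uniformly).
        for _ in range(steps_per_gen):
            train_against(policy, random.choice(population))

        # Freeze the improved policy and grow the population.
        population.append(copy.deepcopy(policy))

        # Cross-play matrix: the win-rate heatmap in the paper comes from
        # evaluating every population member against every other.
        win_matrix = [[evaluate(a, b) for b in population] for a in population]
        print(f"generation {gen}: newest policy's win rates {win_matrix[-1]}")
    return population
```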

Win-rate heatmap illustrating the evolution of high-level strategy training in Stage II.

The heatmap above visualizes this evolution. As training progresses (from policy 1 to 5), the newer policies consistently beat the older ones (red blocks in the lower triangle).

Stage III: Co-Self-Play

By the end of Stage II, the team has decent skills and a good strategy. But there’s a problem: the “athletes” haven’t improved their technique to match the “coach’s” new tactics.

Stage III is where Co-Self-Play happens. The researchers unfreeze the low-level skills and train both levels simultaneously.

Illustration of the co-self-play stage (Stage III).

This is conceptually difficult. If the low-level skills change too much, the high-level strategy will get confused. If the high-level strategy changes too fast, the low-level skills won’t have time to adapt.

To stabilize this, they introduced two key mechanisms:

  1. Shared High-Level Reward: In Stage I, skills were trained with hand-engineered rewards (e.g., “hit the ball toward a target coordinate”). In Stage III, the low-level skills abandon those task-specific objectives and adopt the high-level goal: win the game. This lets the skills evolve in ways the engineers didn’t anticipate.
  2. KL Divergence Penalty: To prevent the skills from forgetting their basics, a penalty is applied if the new policy deviates too far from the original Stage I policy. \[ r_{i,t}^{L} = r_{j,t}^{H} - c_3 \times \mathrm{KL}\left(\pi_i^L \,\|\, \pi_{i,\mathrm{ref}}^L\right). \] As shown in Equation 2, the low-level reward (\(r^L\)) combines the team’s victory reward (\(r^H\)) with a penalty for drifting too far from the reference skill (\(\pi_{\mathrm{ref}}\)).
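To ground Equation 2, here is a small sketch of how the KL-regularized low-level reward could be computed, assuming diagonal-Gaussian action policies; the coefficient value and distribution choice are my assumptions, not the paper's:

```python
# Illustrative computation of the KL-regularized low-level reward (Equation 2),
# assuming diagonal-Gaussian action policies; the coefficient value is an assumption.
from torch.distributions import Normal, kl_divergence

C3 = 0.05  # assumed weight on the KL penalty

def low_level_reward(team_reward, current_dist_params, reference_dist_params):
    """r^L = r^H - c3 * KL(pi_L || pi_ref), computed per drone and per step.

    current_dist_params / reference_dist_params: (mean, std) tensors describing the
    action distributions of the trainable skill and the frozen Stage I reference skill.
    """
    mean, std = current_dist_params
    ref_mean, ref_std = reference_dist_params
    kl = kl_divergence(Normal(mean, std), Normal(ref_mean, ref_std)).sum(dim=-1)
    return team_reward - C3 * kl
```

The penalty acts like an elastic band: the skills can stretch toward new, winning behaviors, but they are pulled back if they drift too far from the competencies learned in Stage I.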

Emergent Behaviors

The most exciting result of Stage III Co-Self-Play is the emergence of strategies that were never explicitly programmed.

The “Dump” Shot

In Stage II, the teams rigidly followed the “Pass -> Set -> Attack” structure because that’s how the skills were defined. However, in Stage III, the “Setter” drone realized that if the opponent was out of position, it could just hit the ball over the net immediately instead of setting it for a teammate.

Sequence of six temporally sampled frames illustrating an emergent team behavior.

In Figure 5, you can see this “Dump” shot in action (Frame e). This behavior emerged purely from the desire to win, proving that the hierarchical structure didn’t lock the agents into rigid patterns.

The Front-Flip Attack

Perhaps even more impressive was a physical maneuver. During the training of the Attack skill, the drones discovered that performing a front-flip allowed them to spike the ball with significantly more velocity.

Sequence of six frames, sampled sequentially in time from the start to the completion of the front-flip attack.

This complex acrobatic maneuver (Figure 9) uses the drone’s angular momentum to smack the ball harder, a technique discovered entirely through trial and error during the policy-chained training of the Attack skill.

Results and Performance

So, how good is HCSP? The researchers compared it against several baselines:

  • SP / FSP / PSRO: Various flat (non-hierarchical) self-play methods.
  • Bot: A handcrafted rule-based hierarchical agent.

Experiment results. (a) HCSP performance against baseline policies.

The results in Figure 4(a) are staggering. HCSP (the red bar) achieves an 82.9% win rate on average against all baselines. It crushes the non-hierarchical methods and handily beats the rule-based Bot.

Furthermore, the “Co-Self-Play” (Stage III) proved essential. When comparing the final policy against the Stage II policy (where skills were frozen), the Stage III policy won 71.5% of the matches.

Table 2: Win rates of Stage II policy and Stage III policy against different opponents.

Table 2 highlights this dominance. Against a “Nash-averaged” opponent (a mix of the best strategies), the Stage II policy wins only 31.4% of the time, while the Stage III policy reaches nearly 50%, essentially achieving parity with the theoretical best mix of all agents.

From Simulation to Reality

Finally, a major question in robotics is always the sim-to-real gap: does this only work in the simulator?

While the full 3v3 game hasn’t been played physically yet (due to space and safety constraints), the authors validated the low-level skills on real hardware. They equipped a drone with a badminton racket and tested the Serve and Solo Bump skills.

Real-world experiments.

As shown in Figure 12, the drone successfully tracked and hit the ball in the real world. In the Solo Bump task (Figure 12c/e), the drone managed to keep the ball in the air for 29 consecutive hits, demonstrating that the control policies learned in simulation are robust enough for reality.

Conclusion

The paper “Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning” represents a significant step forward in embodied AI. It demonstrates that by breaking a complex problem down (Hierarchy), training parts sequentially (Three-Stage Pipeline), and then allowing them to evolve together (Co-Self-Play), we can achieve performance that far exceeds traditional methods.

The emergence of the “dump shot” and the “front-flip attack” serves as a powerful reminder: when we give AI the right structure and the right incentives, it will often find solutions that surprise us. As hardware improves, we might soon see robot sports leagues that aren’t just novelties, but displays of genuine athletic strategy.