Introduction

Imagine a high-stakes search-and-rescue mission in a dense forest. A fleet of drones is scanning the ground below. Suddenly, one drone suffers a battery failure and must return to base. A backup drone is immediately deployed to take its place.

In an ideal world, this swap is seamless. The new drone joins the formation, understands the current strategy, and collaborates perfectly with the existing team. But in reality, this is an immense challenge for robotics. Most multi-agent systems are trained to work with specific, pre-defined partners. They rely on “over-training” with a fixed team, developing a secret language of movements and reactions. When a stranger—an “unseen” teammate—enters the mix, the coordination often falls apart.

This capability is called Adaptive Teaming, and while it has been explored in virtual video game environments like Overcooked or Hanabi, it has been largely overlooked in the complex, physical world of multi-robot systems.

In this post, we are diving deep into a new research paper titled “AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit.” The researchers behind this work have created the first dedicated benchmark to train and test how well drones can collaborate with unfamiliar partners in pursuit-evasion scenarios. We will explore how they bridged the gap between simulation and reality, the novel algorithms they developed (including one based on hypergraph game theory), and what this means for the future of autonomous swarms.

The Background: The “Stranger” Problem in Robotics

To understand why AT-Drone is necessary, we first need to look at the current state of Multi-Agent Reinforcement Learning (MARL).

In standard MARL, agents learn a joint policy. Agent A learns to go left specifically because it knows Agent B will go right. They are like a synchronized swimming duo that has practiced the same routine a thousand times. If you swap Agent B for a different swimmer who hasn’t practiced that specific routine, Agent A doesn’t know what to do.

This limitation is critical in real-world applications like border surveillance or counter-terrorism, where:

  1. Team composition changes: Drones are damaged or depleted.
  2. Heterogeneity exists: Different drones might run different software versions or come from different manufacturers.
  3. Communication is limited: You cannot always rely on high-bandwidth data sharing to sync up.

Two Flavors of Adaptive Teaming

The researchers categorize the solution to this problem into two main approaches, which they benchmark in the paper:

  1. Adaptive Teaming without Teammate Modeling (AT w/o TM): Also known as Zero-Shot Coordination (ZSC). The agent must figure out a robust strategy that works reasonably well with any partner, without trying to explicitly guess what the partner is thinking.
  2. Adaptive Teaming with Teammate Modeling (AT w/ TM): Also known as Ad-Hoc Teamwork. The agent actively observes the partner’s actions to infer their intent or “type,” then adjusts its strategy accordingly.

Until now, there hasn’t been a standardized way to test these methods on drones. Existing benchmarks were either too simple (video games with discrete moves like “up/down/left/right”) or lacked the “unseen teammate” aspect.

Table 1: Comparison of related work. Grey rows represent literature related to multi-drone pursuit, while pink rows highlight adaptive teaming studies from the machine learning field.

As shown in Table 1, previous works were scattered. Some focused on drones but ignored adaptive teaming (grey rows). Others focused on adaptive teaming but stayed inside video games (pink rows). AT-Drone fills the gap by combining multi-learner adaptive teaming with the continuous, physics-based reality of drone flight.

The AT-Drone Benchmark Architecture

The AT-Drone benchmark is not just a software simulation; it is a full-stack framework designed to take algorithms from a computer screen to physical flight. The system consists of four pillars, illustrated in the figure below:

  1. Simulation: A customizable training ground.
  2. Deployment: A pipeline to push code to real Crazyflie drones.
  3. Algorithm Zoo: A collection of cutting-edge algorithms (including new ones proposed by the authors).
  4. Evaluation: Standardized protocols to measure success against “unseen” drones.

Figure 1: Overview of the AT-Drone Benchmark, comprising four key components: simulation, deployment, training, and evaluation.

1. The Simulation Environment

The researchers built a highly configurable environment based on Gymnasium (a standard interface for RL environments). The task is Multi-Drone Pursuit: a team of “pursuer” drones must catch “evader” targets while navigating around obstacles.
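
To make that concrete, here is a minimal sketch of how a rollout in such a Gymnasium-style pursuit environment might look. The per-agent dictionary interface, the `RandomPolicy` placeholder, and the observation layout are illustrative assumptions, not the benchmark’s actual API.

```python
# Minimal sketch of a rollout in a Gymnasium-style pursuit environment.
# The per-agent dict interface and RandomPolicy are illustrative
# assumptions, not the benchmark's actual API.
class RandomPolicy:
    """Placeholder policy: samples a continuous velocity command."""
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()


def run_episode(env, policies, max_steps=500):
    """Roll out one pursuit episode with one policy per pursuer drone."""
    observations, info = env.reset(seed=0)
    for _ in range(max_steps):
        # One continuous action per pursuer, chosen from its own observation.
        actions = {agent: policies[agent].act(obs)
                   for agent, obs in observations.items()}
        observations, rewards, terminated, truncated, info = env.step(actions)
        if terminated or truncated:
            break
    return info
```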

The complexity here is key. If the environment is too open, the task is too easy and doesn’t require teamwork. If it’s too cluttered, it becomes a navigation task rather than a coordination task. The authors designed four specific environments with increasing difficulty, coded by the number of Pursuers (p), Evaders (e), and Obstacles (o).

Figure 4: Illustration of four multi-drone pursuit environments in real world.

  • 4p2e3o (Easy): 4 Pursuers, 2 Evaders, 3 Obstacles. Plenty of space to maneuver.
  • 4p2e1o (Medium): Only 1 obstacle, but it sits in the center, giving evaders more room to loop around and forcing the pursuers to cordon them off.
  • 4p2e5o (Hard): 5 obstacles create “choke points.”
  • 4p3e5o (Superhard): 3 evaders and high clutter. This requires complex splitting of the team to corner multiple targets simultaneously.

Users can create their own scenarios using a JSON-based configuration file, allowing them to tweak the physics, boundaries, and agent counts instantly.

Figure 5: An example of environment configuration file.
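
The snippet below sketches what such a scenario file might contain, written out from Python for convenience. The field names are assumptions for illustration; the benchmark defines its own JSON schema (see Figure 5).

```python
# Illustrative scenario configuration in the spirit of Figure 5.
# Field names are assumptions; the benchmark defines its own JSON schema.
import json

scenario = {
    "name": "4p2e3o",              # 4 pursuers, 2 evaders, 3 obstacles
    "num_pursuers": 4,
    "num_evaders": 2,
    "arena": {"x_min": -3.0, "x_max": 3.0, "y_min": -2.0, "y_max": 2.0},
    "obstacles": [
        {"center": [0.0, 0.0],   "radius": 0.30},
        {"center": [1.2, 0.8],   "radius": 0.25},
        {"center": [-1.2, -0.8], "radius": 0.25},
    ],
    "capture_radius": 0.3,         # distance at which an evader counts as caught
    "max_timesteps": 500,
}

with open("4p2e3o.json", "w") as f:
    json.dump(scenario, f, indent=2)
```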

2. Real-World Deployment

Simulation is safe; reality is messy. The benchmark includes a pipeline using Crazyflie drones (small, agile quadcopters) managed by a motion capture system.

The “brain” of the operation isn’t on the drone itself (which has limited compute). Instead, the system uses edge devices (such as an Nvidia Jetson Orin Nano). The motion capture system sends position data to the edge device, the adaptive teaming algorithm processes it, and control commands are beamed back to the drones. This mirrors a realistic operational setup where a ground station or a mothership drone handles the heavy computational lifting.
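
A rough sketch of that sense–compute–command loop is below. The helper callables for reading mocap poses and sending velocity commands are hypothetical stand-ins passed in as arguments, since the real pipeline’s interfaces are specific to the motion capture system and the Crazyflie radio link.

```python
# Sketch of the edge-device control loop: mocap poses in, policy actions out.
# read_poses and send_command are hypothetical callables supplied by the
# deployment stack; they are not the benchmark's real interfaces.
import time


def control_loop(policy, drone_ids, read_poses, send_command,
                 rate_hz=50, max_steps=5000):
    dt = 1.0 / rate_hz
    for _ in range(max_steps):
        poses = read_poses(drone_ids)        # positions from the motion capture system
        actions = policy.act(poses)          # adaptive teaming policy runs on the edge device
        for drone_id in drone_ids:
            send_command(drone_id, actions[drone_id])  # command beamed back to the drone
        time.sleep(dt)
```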

The Algorithm Zoo: How to Train for the Unknown

The heart of the paper lies in the algorithms. The authors didn’t just provide a testing ground; they introduced improved methods for solving the adaptive teaming problem. They provide a “Zoo” of algorithms, but two stand out as the primary contributions for handling the “unseen teammate” problem.

Approach 1: HOLA-Drone V2 (Zero-Shot Coordination)

How do you train an agent to work with anyone? One approach is to train it against a variety of partners during the learning phase. However, randomly picking partners is inefficient. You need to train against partners that expose your weaknesses.

The authors propose HOLA-Drone V2, an algorithm based on Hypergraphic Game Theory.

The Hypergraph Concept

In a standard graph, an edge connects two nodes (Agent A connected to Agent B). But in a multi-drone team, interactions are not just pairwise. The success of the mission depends on the group combination. A “Hypergraph” allows an edge to connect multiple nodes at once, representing a full team composition.

The researchers use a Preference Hypergraph. Imagine a population of different drone strategies. Some strategies work well together; others crash into each other. The algorithm maps these relationships.

Figure 6: An example of a hypergraph representation (left) and its corresponding preference hypergraph (right).

In Figure 6 (Left), we see a hypergraph where edges connect groups of agents, with weights representing their collaborative score. On the right, this is converted into a Preference Hypergraph. An arrow points from a node to the team it “prefers” (works best with).

To quantify how good a teammate is, they calculate Preference Centrality (\(\eta\)).

Equation for Preference Centrality

Essentially, if many other strategies “point” to you as their preferred partner, you have high centrality. You are a good team player.
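
The toy example below mirrors that intuition in code: each strategy “prefers” the highest-scoring team it belongs to, and centrality counts how often a strategy shows up in other strategies’ preferred teams. The scores and the exact counting rule are illustrative; the paper’s formal definition may normalize differently.

```python
# Toy preference hypergraph: team compositions mapped to collaborative scores.
# Scores and the counting rule are illustrative, not the paper's exact formula.
from collections import defaultdict

hypergraph = {
    frozenset({"A", "B", "C"}): 0.9,
    frozenset({"A", "B", "D"}): 0.4,
    frozenset({"B", "C", "D"}): 0.7,
}
agents = {"A", "B", "C", "D"}

# Each agent "prefers" the highest-scoring team it belongs to.
preferred_team = {}
for agent in agents:
    teams = [(score, team) for team, score in hypergraph.items() if agent in team]
    preferred_team[agent] = max(teams, key=lambda t: t[0])[1]

# Preference centrality: how often an agent appears in others' preferred teams.
centrality = defaultdict(int)
for agent, team in preferred_team.items():
    for member in team - {agent}:
        centrality[member] += 1

print(dict(centrality))  # e.g. {'B': 3, 'C': 3, 'A': 2} -- D is never preferred
```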

The Max-Min Preference Oracle

The goal is to find a set of learners that are Preference Optimal—meaning they work well with the widest range of possible teammates.

Equation describing Preference Optimality

To achieve this, the authors introduce a Max-Min Preference Oracle. This is a training loop that fundamentally changes how the drones learn:

  1. Min-Step (The Adversary): The system analyzes the current population and identifies the worst possible partners for the current learners. It doesn’t just pick one; it creates a probability distribution (a “mixed strategy”) that focuses on the partners the learners struggle with the most.
  2. Max-Step (The Learner): The learners then train specifically to maximize their reward against these “worst-case” partners.

Equations for the Max-Min Preference Oracle, showing the iterative relationship between learners and teammates
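
In symbols, the loop can be sketched roughly as follows (our notation, not necessarily the paper’s exact formulation):

\[
\pi^{*} \in \arg\max_{\pi} \; \min_{\sigma \in \Delta(\Pi_{\text{pop}})} \; \mathbb{E}_{\xi \sim \sigma}\big[\, R(\pi, \xi) \,\big]
\]

Here \(\pi\) is the learner policy, \(\Pi_{\text{pop}}\) is the current teammate population, \(\sigma\) is a mixed strategy over that population (the output of the Min-Step), and \(R(\pi, \xi)\) is the expected return when the learner teams up with partners \(\xi\).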

This creates a robust cycle. As the learners get better at handling bad teammates, the “worst” teammates change, and the learners must adapt again. This process is visualized in the algorithm overview below:

Figure 7: Overview of our proposed HOLA-Drone (V2) algorithm.

By the end of training, the drone hasn’t just memorized a route; it has learned generalizable skills to handle uncooperative or erratic partners.

Approach 2: NAHT-D (Teammate Modeling)

The second major contribution is NAHT-D (N-Agent Ad-Hoc Teamwork for Drones). Unlike HOLA-Drone, which tries to be robust to everyone, NAHT-D tries to understand the specific partner it is flying with right now.

It extends the popular MAPPO (Multi-Agent Proximal Policy Optimization) algorithm by adding a Teammate Modeling Network.

This network acts like an encoder. It takes in the history of the partner’s actions and observations and compresses them into a “team encoding vector” (embedding).

  • Input: “Teammate moved left, then accelerated toward Evader 1.”
  • Embedding: “Teammate is aggressive/fast.”

This embedding is then fed into the drone’s policy network. The drone effectively says, “I am flying with an aggressive partner, so I should play a supporting role to avoid collision.”
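
The sketch below captures this idea with a toy encoder and policy in PyTorch. The layer sizes, dimensions, and two-network split are illustrative assumptions, not NAHT-D’s exact architecture.

```python
# Minimal sketch of the teammate-modeling idea: encode the partner's recent
# observation-action history into an embedding and feed it to the policy
# alongside the ego drone's own observation. Sizes are illustrative only.
import torch
import torch.nn as nn


class TeammateEncoder(nn.Module):
    def __init__(self, history_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, teammate_history):
        return self.net(teammate_history)


class ConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, embed_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),  # continuous velocity command
        )

    def forward(self, obs, team_embedding):
        return self.net(torch.cat([obs, team_embedding], dim=-1))


# Usage: embed a one-step history (cf. Table 3), then act on it.
encoder = TeammateEncoder(history_dim=10)
policy = ConditionedPolicy(obs_dim=20, embed_dim=32, action_dim=2)
history = torch.zeros(1, 10)
obs = torch.zeros(1, 20)
action = policy(obs, encoder(history))
```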

Table 3: Implementation hyperparameters of the NAHT-D algorithm.

The hyperparameters in Table 3 show that the authors use a short history length (one step) to keep the system reactive, which is crucial for fast-moving drones where stale data can lead to crashes.

Experiments and Results

To rigorously test these algorithms, the authors created “Unseen Drone Zoos”—sets of behaviors the trained drones had never encountered before.

  1. Greedy Drone: Always chases the nearest target.
  2. VICSEK Drone: Uses a bio-inspired swarm movement (like birds).
  3. Self-Play Drone: A drone trained with standard Reinforcement Learning, resulting in unpredictable behavior.

They grouped these into three test sets (Zoos), with “Unseen Zoo 3” being the hardest (a random mix of all types).
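
As a flavor of what these scripted opponents look like, here is a rough sketch of a greedy pursuer heuristic; the benchmark’s actual unseen-drone implementations may differ in details such as obstacle avoidance.

```python
# Rough sketch of a "Greedy Drone" heuristic: head straight for the nearest
# evader at maximum speed. Illustrative only; the benchmark's scripted
# drones may include extra logic (e.g., obstacle avoidance).
import numpy as np


def greedy_action(pursuer_pos, evader_positions, max_speed=1.0):
    pursuer_pos = np.asarray(pursuer_pos)
    evaders = np.asarray(evader_positions)
    dists = np.linalg.norm(evaders - pursuer_pos, axis=1)
    target = evaders[np.argmin(dists)]            # nearest evader
    direction = target - pursuer_pos
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return np.zeros(2)
    return max_speed * direction / norm           # velocity command toward it
```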

Performance: Zero-Shot Coordination (No Modeling)

The researchers compared HOLA-Drone V2 against standard baselines like Self-Play (SP) and Population-Based Training (PBT).

Figure 2: Success rate (SUC) across different difficulty levels for adaptive teaming without teammate modelling.

In Figure 2, look at the SUC (Success Rate).

  • In the easy environments (left), most methods do okay.
  • In the superhard environment (4p3e5o, far right), standard Self-Play (SP) drops significantly.
  • HOLA-Drone (V2) (the purple bar) consistently maintains higher success rates.

The “Red Dotted Line” represents the theoretical maximum (Best Response)—if you knew exactly who you were playing with and trained just for them. HOLA-Drone gets remarkably close to this ceiling despite not knowing the partners beforehand.

Table 2 provides the granular data. Notice the COL (Collision Rate).

Table 2: Performance comparison of adaptive teaming without teammate modeling.

HOLA-Drone (V2) generally achieves lower collision rates while maintaining high efficiency (lower AST, i.e., Average Success Timesteps), showing that it isn’t just winning by being aggressive; it’s winning by being smarter and safer.
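
For readers reproducing these tables, the sketch below shows one plausible way to aggregate the three reported metrics over evaluation episodes; the paper’s exact definitions (for example, how collisions interact with success) may differ.

```python
# One plausible aggregation of the reported metrics over evaluation episodes.
# Mirrors the names in Table 2: SUC (success rate), COL (collision rate),
# AST (average timesteps among successful episodes). Definitions are assumed.
def aggregate_metrics(episodes):
    """episodes: list of dicts like {"success": bool, "collision": bool, "timesteps": int}"""
    n = len(episodes)
    suc = sum(e["success"] for e in episodes) / n
    col = sum(e["collision"] for e in episodes) / n
    success_steps = [e["timesteps"] for e in episodes if e["success"]]
    ast = sum(success_steps) / len(success_steps) if success_steps else float("nan")
    return {"SUC": suc, "COL": col, "AST": ast}
```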

Performance: With Teammate Modeling

Next, they tested NAHT-D. They compared it against MAPPO (standard RL) and a version of NAHT-D without the decoder (ablation study).

Figure 3: Performance comparison of adaptive teaming with teammate modeling.

The results in Figure 3 yield an interesting insight. While NAHT-D outperforms standard MAPPO, the complex teammate modeling (predicting exact actions) sometimes backfired in the hardest environments (4p3e5o). The simpler version (NAHT-D w/o Dec) often performed slightly better.

This suggests that in highly chaotic, obstacle-filled environments, trying to perfectly predict a partner’s next move might be “over-thinking” it. A generalized understanding of the teammate is valuable, but excessive complexity can introduce noise.

Real-World Case Study

The ultimate test was deploying this on physical Crazyflie drones. The following sequence shows a “Superhard” scenario (4 Pursuers, 3 Evaders, 5 Obstacles).

Figure 8: Case Study: Capture strategy executed by NAHT-D learners.

  1. Frame 1: The pursuers (red circles) detect the swarm of evaders.
  2. Frame 2: Instead of all rushing one target, they split. Two pursuers lock down one evader, while the others maintain a perimeter.
  3. Frames 3 & 4: They systematically corner the remaining evaders one by one.

This coordination emerged naturally. The drones were not hard-coded to “split up”; they learned that this was the only way to succeed when working with their specific partners in that environment.

Conclusion

The AT-Drone paper represents a significant step forward for robotic collaboration. By moving the benchmark from simple 2D grid games to continuous, physics-based drone environments, the authors have exposed the real difficulties of Adaptive Teaming.

Their contributions offer two distinct paths forward:

  1. HOLA-Drone V2 shows that by using Hypergraphic Game Theory to identify and train against “worst-case” partners, we can build agents that are inherently robust to strangers.
  2. NAHT-D demonstrates that real-time teammate modeling is viable on edge devices, allowing drones to adapt their personality based on who they are flying with.

As we look toward a future where autonomous drones are first responders in disasters, benchmarks like AT-Drone will be the proving grounds where these machines learn to work not just for us, but with each other.