Introduction
We are witnessing a golden age of Artificial Intelligence. From Large Language Models (LLMs) drafting emails to reinforcement learning agents mastering complex strategy games like Go and Dota 2, AI capabilities are skyrocketing. However, a critical gap remains between an AI’s ability to solve a problem in isolation and its ability to solve a problem with us.
Consider a self-driving car. It is not enough for the vehicle to navigate a track perfectly when alone; it must interpret the subtle, unwritten rules of negotiation with human drivers at a four-way stop. Similarly, an AI assistant in a hospital cannot simply optimize for the fastest procedure; it must coordinate with doctors and nurses, understanding their intent and adapting to their workflow.
The prevailing method for training agents in games—Self-Play (SP)—often fails in these cooperative scenarios. In SP, an agent plays millions of games against itself. While this creates superhuman performance, the agent often develops idiosyncratic, “secret” conventions that no human can understand. When paired with a human, these agents look like geniuses speaking an alien language; the coordination breaks down, and the team fails.
This problem is known as the Ad-Hoc Human-AI Coordination challenge. To solve it, researchers need a standardized way to test if an AI can play nice with humans without requiring thousands of hours of expensive real-time human testing. Enter the Ad-Hoc Human-AI Coordination Challenge (AH2AC2), a new benchmark and framework centered around the card game Hanabi. This work introduces robust “Human Proxy” agents and a controlled evaluation system to push the field toward AI that is not just smart, but compatible.
Background: The Hanabi Testbed
To study coordination, the authors selected Hanabi, a cooperative card game that has become the drosophila (fruit fly) of cooperative AI research.
Hanabi is unique because it features imperfect information and requires a high degree of Theory of Mind. In Hanabi, two to five players try to build fireworks stacks of ascending numbers (1–5) in different colors. The twist? You can see everyone else’s cards, but you cannot see your own. You must rely on limited communication tokens to give hints to your teammates (e.g., “These two cards are Red”), and they must deduce which cards are safe to play.
Success in Hanabi requires recognizing the intent behind a partner’s move. If a partner hints at a specific card, they aren’t just conveying information; they are signaling that the card is likely playable or critical. This dynamic makes it the perfect environment to test Ad-Hoc Teamplay—the ability to collaborate with a new partner on the fly without prior coordination.
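To make this concrete, here is a minimal Python sketch of the information a single Hanabi player actually works with: a full view of teammates' hands, only hint-derived knowledge about their own cards, and the shared fireworks and token state. The class and field names are illustrative, not taken from any particular Hanabi implementation.

```python
from dataclasses import dataclass, field

# A minimal, illustrative model of what one Hanabi player observes.
# Names are hypothetical, not from any specific Hanabi library.

@dataclass
class Card:
    color: str   # e.g. "red", "blue"
    rank: int    # 1-5

@dataclass
class HintKnowledge:
    """What a player knows about one of their OWN (hidden) cards."""
    possible_colors: set = field(default_factory=lambda: {"red", "yellow", "green", "blue", "white"})
    possible_ranks: set = field(default_factory=lambda: set(range(1, 6)))

@dataclass
class Observation:
    other_hands: dict           # player_id -> list[Card]  (fully visible)
    own_hand_knowledge: list    # list[HintKnowledge]      (own cards are hidden)
    fireworks: dict             # color -> highest rank played so far
    hint_tokens: int            # shared budget for giving hints
    life_tokens: int            # misplays remaining before the game ends

def apply_color_hint(obs: Observation, card_indices: list, color: str) -> None:
    """Receiving 'these cards are <color>' narrows knowledge about the hinted cards,
    and also tells the player which of their cards are NOT that color."""
    for i, knowledge in enumerate(obs.own_hand_knowledge):
        if i in card_indices:
            knowledge.possible_colors = {color}
        else:
            knowledge.possible_colors.discard(color)
```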
The Architecture of the Challenge
The AH2AC2 framework addresses the scarcity of open-source human data and the difficulty of consistent evaluation. The researchers provide a small, open-source dataset of human gameplay for training, while keeping a massive dataset private to train high-quality “Human Proxy” agents.

As shown in Figure 1, the workflow allows researchers to train their agents using limited open data and then submit them to an API. This API pairs the submission with the hidden Human Proxy agents to evaluate performance, ensuring the AI isn’t just memorizing specific human moves but is genuinely adapting to human-like playstyles.
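For intuition, a hypothetical sketch of what a submission loop against such an evaluation API might look like is shown below. The endpoint, routes, and JSON fields are invented placeholders for illustration only, not the actual AH2AC2 interface.

```python
import requests

# Illustrative only: the URL, routes, and JSON fields below are invented
# placeholders, NOT the real AH2AC2 API. The point is the shape of the
# interaction: the submitted agent only ever sees observations, never the
# hidden Human Proxy weights or the private human dataset.

API = "https://example.org/ah2ac2"  # hypothetical endpoint

def my_policy(observation):
    # Replace with your trained agent; returns an action index.
    return 0

game = requests.post(f"{API}/games", json={"players": 2}).json()
game_id = game["game_id"]

while True:
    state = requests.get(f"{API}/games/{game_id}").json()
    if state["done"]:
        print("final score:", state["score"])
        break
    if state["current_player"] == state["your_seat"]:
        action = my_policy(state["observation"])
        requests.post(f"{API}/games/{game_id}/actions", json={"action": action})
```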
Core Method: Building the Human Proxies
The heart of this research lies in the creation of the Human Proxy (HP) agents. Evaluating against real humans is slow, costly, and noisy. Evaluating against other AI agents usually results in inflated scores due to non-human conventions. The solution is to build an AI that plays exactly like a competent human.
The authors developed these proxies using a method called Human-Data-Regularised IPPO (HDR-IPPO). This is a two-stage process designed to balance skill with human compatibility.
Step 1: Behavioral Cloning (BC)
First, the researchers utilized a massive, closed-source dataset of over 100,000 Hanabi games from the online platform hanab.live. They used Behavioral Cloning (BC), a supervised learning technique in which a recurrent neural network (built on Long Short-Term Memory, or LSTM, layers) is trained to predict the exact move a human made given a specific game history.
While BC agents capture the “flavor” of human play, they are often brittle. If the game enters a state not well-represented in the training data, a BC agent can become confused and make catastrophic errors. They mimic well, but they struggle to recover from mistakes.
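A minimal PyTorch-style sketch of the behavioral cloning step, assuming each game has already been encoded as a sequence of observation vectors; the architecture and dimensions are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative behavioral-cloning setup: an LSTM reads the encoded game
# history and a linear head predicts the human's next move. Dimensions and
# architecture details are assumptions, not the paper's exact configuration.

class BCPolicy(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, obs_seq):                 # obs_seq: [batch, time, obs_dim]
        out, _ = self.lstm(obs_seq)
        return self.head(out)                   # action logits per timestep

def bc_loss(policy, obs_seq, human_actions, mask):
    """Cross-entropy between predicted logits and the moves humans actually made.
    `mask` zeroes out padded timesteps in variable-length games."""
    logits = policy(obs_seq)                                   # [B, T, A]
    ce = F.cross_entropy(logits.flatten(0, 1),
                         human_actions.flatten(),
                         reduction="none")
    return (ce * mask.flatten()).sum() / mask.sum()
```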
Step 2: Regularized Reinforcement Learning
To fix the brittleness of BC, the researchers refined the policy using Independent Proximal Policy Optimization (IPPO). In this phase, the agent plays Hanabi with copies of itself to learn better strategies and maximize the score.
However, pure RL brings back the “alien language” problem; the agent might learn to win by using strategies humans don’t use. To prevent this, the authors added a Kullback-Leibler (KL) Regularization term.
\[
D_{\mathrm{KL}}\!\left(\pi^{HP}(\cdot \mid \tau) \,\|\, \pi^{BC}(\cdot \mid \tau)\right) = \sum_{a} \pi^{HP}(a \mid \tau)\,\log\frac{\pi^{HP}(a \mid \tau)}{\pi^{BC}(a \mid \tau)}
\]

The equation above measures, for a given game history \(\tau\), the divergence between the new policy (\(\pi^{HP}\)) and the original frozen Behavioral Cloning policy (\(\pi^{BC}\)). This acts as a “leash.” The agent is encouraged to improve its score, but penalized if its probability distribution over actions drifts too far from what the human-imitating BC model would do.
The final loss function combines the standard PPO objective with this KL penalty:
\[
\mathcal{L} = \mathcal{L}^{\mathrm{PPO}} + \lambda \, D_{\mathrm{KL}}\!\left(\pi^{HP}(\cdot \mid \tau) \,\|\, \pi^{BC}(\cdot \mid \tau)\right)
\]
Here, \(\lambda\) (lambda) is a hyperparameter that controls the strength of the leash. If \(\lambda\) is too low, the agent becomes alien. If \(\lambda\) is too high, the agent learns nothing new and remains brittle.
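To illustrate how the “leash” enters the optimization, here is a simplified PyTorch sketch of a clipped-PPO policy loss with the added KL penalty. It is a sketch of the general technique, not the authors' exact HDR-IPPO implementation; `kl_coef` stands in for \(\lambda\) and its value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def hdr_ppo_policy_loss(new_logits, old_logits, bc_logits,
                        actions, advantages, clip_eps=0.2, kl_coef=0.25):
    """Clipped PPO surrogate plus a KL penalty that keeps the learned policy
    close to the frozen behavioral-cloning policy."""
    new_logp = F.log_softmax(new_logits, dim=-1)
    old_logp = F.log_softmax(old_logits, dim=-1).detach()
    bc_logp  = F.log_softmax(bc_logits,  dim=-1).detach()   # frozen BC policy

    # Standard clipped PPO surrogate on the taken actions.
    idx = actions.unsqueeze(-1)
    ratio = torch.exp(new_logp.gather(-1, idx) - old_logp.gather(-1, idx)).squeeze(-1)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)

    # KL(pi_HP || pi_BC): penalize drifting away from human-like action distributions.
    kl_to_bc = (new_logp.exp() * (new_logp - bc_logp)).sum(dim=-1)

    return -(surrogate.mean()) + kl_coef * kl_to_bc.mean()
```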
The Effect of Regularization
The authors conducted an ablation study to prove that this regularization is necessary. They trained agents with varying values of \(\lambda\).

Figure 6 shows the KL divergence over training steps. The teal line represents \(\lambda=0\) (no regularization). As the agent learns to play, its behavior rapidly diverges from the human baseline. It “unlearns” being human to become a generic AI optimizer.

In contrast, Figure 7 zooms in on agents with proper regularization. The divergence remains low and stable, ensuring the agent retains human conventions while improving its general competence.
The impact of this on gameplay is profound. When an unregularized agent (\(\lambda=0\)) tries to play with a standard human-like BC agent, coordination collapses.

Figure 3 illustrates a “Cross-Play Matrix.” The heatmap shows the average scores when different agents play together.
- BC2 vs BC2 (Bottom Right): Score of 18.89. Decent, but not great due to brittleness.
- HP2 vs HP2 (Top Left): Score of 23.04. The Human Proxy (HP) is very strong in self-play.
- \(\lambda=0\) vs BC2 (Middle Right): Score of 1.13. Catastrophic failure.
The unregularized agent (\(\lambda=0\)) cannot coordinate with the human-like BC agent at all. However, the final Human Proxy (HP2) achieves a score of 21.77 when paired with the BC agent. This confirms that HDR-IPPO creates agents that are both highly skilled and compatible with human strategies.
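A cross-play matrix like this is straightforward to reproduce for any set of policies. Below is a minimal sketch assuming a `play_game(agent_a, agent_b)` helper (hypothetical here) that runs one Hanabi game and returns its final score.

```python
import itertools
import statistics

def cross_play_matrix(agents: dict, play_game, episodes: int = 100):
    """Average score for every ordered pairing of agents.
    `agents` maps a name (e.g. "HP2", "BC2") to a policy object, and
    `play_game` is assumed to run one Hanabi game and return its score."""
    matrix = {}
    for name_a, name_b in itertools.product(agents, repeat=2):
        scores = [play_game(agents[name_a], agents[name_b]) for _ in range(episodes)]
        matrix[(name_a, name_b)] = statistics.mean(scores)
    return matrix

# Usage: matrix[("HP2", "BC2")] is the analogue of the 21.77 cell discussed
# above (actual numbers depend entirely on the trained policies).
```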
Validating the Proxies
Before using these proxies as the gold standard for the challenge, the authors had to verify their quality. A good proxy must not only score well but also exhibit the behavioral statistics of real humans.
The researchers compared the Human Proxies against the original BC policies in both 2-player and 3-player settings.

Figure 2 demonstrates consistent cross-play performance. In the 2-player setting (left), the Human Proxy (HP1) coordinates effectively with the BC agents, achieving high median scores. The 3-player setting (right) is notoriously harder, yet the proxies (HP3, HP4) maintain high scores when paired with each other and respectable scores when paired with the brittle BC agents.
Furthermore, Table 2 highlights the robustness gained through this method.

While the raw BC policies result in “Zero-Score Games” (total failure) up to 75.82% of the time in 3-player settings due to cascading errors, the Human Proxies reduce this to nearly 0%. They learned to “save” the game when things go wrong, a crucial skill for coordinating with imperfect human partners.
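The “Zero-Score Games” statistic is simply the fraction of evaluation games that end with a score of zero; a minimal sketch of the calculation:

```python
def zero_score_rate(scores: list) -> float:
    """Percentage of games that ended with a score of 0 (total failure)."""
    if not scores:
        return 0.0
    return 100.0 * sum(1 for s in scores if s == 0) / len(scores)

# e.g. zero_score_rate([0, 17, 0, 22, 19]) -> 40.0
```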
Experiments & Results: The Leaderboard
With the proxies established, the authors launched the AH2AC2 benchmark. They evaluated several baseline methods to set the stage for future research.
The Contenders
- BC (Behavioral Cloning): Pure imitation of the small open-source dataset.
- IPPO: Reinforcement learning from scratch (Self-Play), ignoring human data.
- BR-BC (Best Response to BC): An agent trained specifically to play well with a fixed BC partner.
- FCP (Fictitious Co-Play): Training an agent against a population of past versions of itself.
- OBL (Off-Belief Learning): A Zero-Shot Coordination (ZSC) method that assumes no human data is available and tries to play purely based on optimal, grounded logical conventions.
- DeepSeek-R1: A modern Large Language Model (LLM) prompted with the rules of Hanabi and the game state, testing the reasoning capabilities of foundation models.
The Outcome
The results, presented in Table 5, reveal the difficulty of the challenge.

Key Takeaways:
Zero-Shot Supremacy (for now): The OBL agent (L4) currently tops the leaderboard in the 2-player setting with a mean score of 21.04. This is fascinating because OBL does not use the human training data. It relies on a mathematical formalism to prevent “secret languages.” This suggests that current methods for leveraging small amounts of human data (like BC or BR-BC) are not yet efficient enough to beat a theoretically grounded, data-free approach.
Self-Play Fails: As expected, the IPPO agent performs poorly (10.16). Without human data or ZSC constraints, it learns conventions that humans (and the proxies) cannot interpret.
The LLM Gap: DeepSeek-R1, despite being a powerful reasoning model, struggles significantly. Even when prompted with the “H-Group” conventions (a standard human strategy guide), it achieves a mean score of only 9.91 in 2-player games. While it performs slightly better in 3-player settings, it lags far behind specialized algorithms. This indicates that while LLMs have general reasoning, fine-grained coordination in partially observable environments requires capabilities that zero-shot prompting has not yet unlocked.
Data Efficiency is Hard: The BR-BC agent performs well (19.41) but relies on training against a clone of the human data. The challenge highlights a major open problem: How do we take a small dataset of human interactions and turn it into a robust coordination policy?
Conclusion & Implications
The Ad-Hoc Human-AI Coordination Challenge (AH2AC2) represents a significant step forward in multi-agent reinforcement learning. By moving away from self-play scores and providing a rigorous, reproducible way to test against “Human Proxies,” the authors have grounded the field in reality.
The construction of the proxies themselves—using Behavioral Cloning refined by KL-regularized Reinforcement Learning—provides a blueprint for how to build AI that is both competent and compatible. The ablation studies clearly demonstrate that without a “leash” to human data, AI agents naturally drift toward alien, incompatible strategies.
However, the leaderboard results serve as a humble reminder of how far we have to go. The fact that a data-free method (OBL) currently beats methods that attempt to learn from human data suggests we have not yet cracked the code on few-shot coordination. We need algorithms that can look at a handful of human games and instantly grasp the “vibe” or convention being used, rather than needing millions of examples or rigid mathematical proofs.
Furthermore, the poor performance of off-the-shelf LLMs suggests that general intelligence alone isn’t enough for coordination; specific training on the dynamics of shared belief and intent is required.
As AH2AC2 opens to the public, it invites researchers to close this gap. The goal is no longer just to beat the game—it is to play the game with us.