Robotic soccer has long been viewed as a “Grand Challenge” for artificial intelligence and robotics. Since the inception of RoboCup in the 1990s, the dream has been to field a team of robots capable of beating the human World Cup champions. While we aren’t quite there yet, the complexity of soccer makes it an ideal testbed for modern robotics. It combines much of what is hard in robotics: agile, split-second motor control (balancing, kicking) and high-level cognitive planning (strategy, teamwork, anticipation).

For legged robots such as quadrupeds, the challenge is compounded. Unlike wheeled robots, quadrupeds must constantly manage their own stability. If a quadruped “thinks” too hard about strategy and forgets to manage its feet, it falls over. If it focuses entirely on walking, it misses the pass.

In a fascinating new paper, “Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams,” researchers from UC Berkeley, Tsinghua University, and Zhejiang University propose a solution that bridges this gap. They have developed a hierarchical Multi-Agent Reinforcement Learning (MARL) framework that allows quadruped robots to learn complex soccer skills and strategies from scratch.

In this deep dive, we will explore how they separated the “body” from the “brain,” how they trained robots to evolve strategies against each other, and how they deployed this purely learned behavior onto real physical robots without external motion capture systems.

The Core Challenge: Muscle vs. Mind

Why is robot soccer so hard? It essentially requires solving two distinct mathematical problems simultaneously:

  1. High-Frequency Control (The Muscle): The robot needs to adjust its joint angles 50 to 200 times per second to maintain balance, dribble a ball without losing it, and kick with precision. This is a continuous control problem with complex physics.
  2. Long-Horizon Planning (The Mind): The robot needs to look at the field, see where its teammates and opponents are, and decide whether to pass, shoot, or defend. This is a lower-frequency strategic problem that requires predicting the future.

Previous attempts usually compromised on one of these. They either used hard-coded rules for strategy (which are brittle) or focused only on 1v1 games where teamwork isn’t required. To solve 2v1 and 2v2 games on real hardware, the researchers adopted a Hierarchical Architecture.

The Solution: A Hierarchical Framework

The researchers’ approach mimics how humans play sports. When a soccer player decides to “dribble forward,” they don’t consciously think about the angle of their ankle for every step. Their brain issues a high-level command (“Run there”), and their muscle memory handles the mechanics.

As shown in Figure 1 below, the researchers split the policy into two levels:

  1. \(\pi_{high}\) (High-Level Strategy): The “Captain.” It looks at the game state and issues commands.
  2. \(\pi_{low}\) (Low-Level Skills): The “Athlete.” It takes the command and figures out how to move the legs.

Figure 1: The proposed hierarchical framework consists of a high-level strategy policy that selects low-level skills and issues corresponding commands, and low-level skill policies that execute motor primitives including walking, dribbling, and kicking.

This separation allows the researchers to train robust motor skills first, and then train the strategy on top of those skills.
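
To make the two-level split concrete, here is a minimal sketch of the control loop under the paper's stated 5 Hz strategy rate. The robot, policy, and skill-library objects, the 50 Hz low-level rate, and all function names are illustrative assumptions rather than the authors' code.

```python
HIGH_LEVEL_HZ = 5   # strategy decisions per second (stated in the paper)
LOW_LEVEL_HZ = 50   # joint-target updates per second (assumed, within the range quoted earlier)

def play(high_policy, skill_library, robot, seconds=60):
    """Two-rate loop: the 'Captain' picks a skill and a command, and the
    'Athlete' turns that command into joint targets many times in between.
    All objects passed in are placeholders for this sketch."""
    steps_per_decision = LOW_LEVEL_HZ // HIGH_LEVEL_HZ
    for _ in range(seconds * HIGH_LEVEL_HZ):
        game_state = robot.observe_game()                 # ball, goal, teammates, opponents
        skill_name, command = high_policy(game_state)     # e.g. ("dribble", direction)
        low_policy = skill_library[skill_name]
        for _ in range(steps_per_decision):
            proprio = robot.observe_body()                # joint angles, orientation, ball offset
            joint_targets = low_policy(proprio, command)  # 12 target joint angles
            robot.apply(joint_targets)
```

The key point is that the strategy layer only ever outputs a skill name and a command; it never touches joint angles directly.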

Level 1: The Skill Library (Low-Level Policy)

The foundation of this system is the Low-Level Skill Library. Instead of trying to learn everything at once, the robots are first taught three distinct primitives: Walk, Dribble, and Kick.

These skills are trained using Proximal Policy Optimization (PPO), a popular Reinforcement Learning (RL) algorithm. The low-level policy takes in the robot’s proprioception (joint angles, body orientation) and the ball’s position relative to the robot. It outputs target angles for the robot’s 12 joints.
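
As a rough illustration of that input/output contract, here is what a PPO actor for one skill might look like in PyTorch. The layer sizes and the 45-dimensional proprioception vector are assumptions; only the “proprioception plus relative ball position in, 12 joint targets out” part comes from the text above.

```python
import torch
import torch.nn as nn

class LowLevelActor(nn.Module):
    """Sketch of a low-level skill actor: proprioception plus the ball's relative
    position in, 12 target joint angles out. Layer sizes are illustrative."""
    def __init__(self, proprio_dim=45, ball_dim=3, num_joints=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + ball_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, num_joints),  # target angles for the 12 joints
        )

    def forward(self, proprio, ball_rel):
        return self.net(torch.cat([proprio, ball_rel], dim=-1))

# One forward pass with dummy observations
actor = LowLevelActor()
joint_targets = actor(torch.zeros(1, 45), torch.zeros(1, 3))
print(joint_targets.shape)  # torch.Size([1, 12])
```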

A key innovation here is the use of Privileged Information during training. In the simulation, the “teacher” knows everything: the friction of the ground, the exact mass of the ball, and the external forces acting on the robot. The robot’s policy is trained to estimate these hidden values from a history of its own recent observations. This makes the policy robust enough to handle the real world, where ground friction varies and sensors are noisy.
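
A minimal sketch of that idea: in simulation the hidden physics values are known, so an estimator can be trained with plain regression from a window of recent proprioceptive observations. The specific privileged quantities, history length, and layer sizes below are illustrative.

```python
import torch
import torch.nn as nn

class PrivilegedEstimator(nn.Module):
    """Sketch: regress hidden physics (e.g. friction, ball mass, push forces)
    from a short history of proprioception. All dimensions are illustrative."""
    def __init__(self, proprio_dim=45, history_len=20, priv_dim=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim * history_len, 256), nn.ELU(),
            nn.Linear(256, priv_dim),
        )

    def forward(self, proprio_history):  # (batch, history_len, proprio_dim)
        return self.net(proprio_history.flatten(1))

# In simulation, ground-truth privileged values come straight from the physics
# engine, so the estimator can be trained with a simple regression loss.
est = PrivilegedEstimator()
history = torch.randn(8, 20, 45)
true_priv = torch.randn(8, 5)            # stand-in for simulator-provided values
loss = nn.functional.mse_loss(est(history), true_priv)
loss.backward()
```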

Figure 2: Hierarchical Architecture. (a) Low-level skills architecture showing how observations feed into the actor network. (b) High-level strategy architecture showing how the GRU handles long-term memory to select skills.

As visualized in Figure 2 (a) above, the low-level observations feed into a dedicated actor network that outputs the joint targets. The result is a library of skills that are stable and reusable.

  • Walk: Omnidirectional walking.
  • Dribble: The robot learns to manipulate its velocity to keep the ball close while moving.
  • Kick: The robot learns to approach the ball and deliver a high-velocity strike toward a target.

Does it work in reality? Yes. Figure 8 demonstrates these learned skills deployed on the Unitree Go1 robots. The robots can transition smoothly between walking, precise dribbling, and powerful kicking.

Figure 8: Real-world demonstration of low-level skills: (a) Walk (b) Dribble (c) Kick (d) Receive.

Level 2: The Strategist (High-Level Policy)

Once the robots know how to move, they need to learn what to do. This is the job of the High-Level Policy.

The high-level policy runs at a slower frequency (5 Hz, or 5 decisions per second). It observes the global game state: where the ball is, where the goal is, and the relative positions of teammates and opponents.

To make the learning process efficient, the researchers discretized the action space. The high-level brain doesn’t output exact continuous velocities. Instead, it chooses from a menu of options, as detailed in Table 1.

Table 1: High-Level Policy Action Space showing Skill Types (Walk, Dribble, Kick) and Direction Options.

For example, the strategy policy might output (Dribble, Down-Right). This command is passed to the low-level Dribble policy, which then figures out the leg movements required to move the robot and ball down-right.
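
A small sketch of what such a discrete menu might look like in code. The three skill types match Table 1, but the eight compass directions and the index encoding are assumptions for illustration.

```python
import itertools
import math

SKILLS = ["walk", "dribble", "kick"]
DIRECTIONS = ["up", "up-right", "right", "down-right",
              "down", "down-left", "left", "up-left"]
ACTIONS = list(itertools.product(SKILLS, DIRECTIONS))   # 24 discrete options

def decode_action(index):
    """Map a discrete action index to a (skill, unit direction vector) command."""
    skill, direction = ACTIONS[index]
    angle = math.radians(DIRECTIONS.index(direction) * 45)  # 0 degrees = 'up', clockwise
    return skill, (math.sin(angle), math.cos(angle))

print(decode_action(ACTIONS.index(("dribble", "down-right"))))
# ('dribble', (0.707..., -0.707...))
```

In this sketch, the low-level Dribble policy would then receive the unit vector as its direction command.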

Crucially, the high-level policy uses a Gated Recurrent Unit (GRU), a type of memory network (shown in Figure 2b). This allows the robot to remember the immediate past, which is essential for understanding momentum and predicting where an opponent is running.
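
Putting memory and the discrete menu together, the strategy network might be sketched as below. The observation size, hidden size, and the 24-way action head (matching the illustrative menu above) are assumptions; only the use of a GRU and a discrete output comes from the paper.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Sketch of a GRU-based strategy policy: the recurrent state carries the
    recent past (momentum, opponent motion), and the head scores discrete actions."""
    def __init__(self, obs_dim=24, hidden_dim=128, num_actions=24):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, hidden=None):      # obs: (batch, seq_len, obs_dim), stepped at 5 Hz
        out, hidden = self.gru(obs, hidden)
        return self.head(out[:, -1]), hidden  # logits over (skill, direction) pairs

policy = HighLevelPolicy()
obs = torch.randn(1, 1, 24)                   # game state: ball, goal, teammates, opponents
logits, h = policy(obs)                       # keep h to carry memory to the next decision
action_index = int(torch.argmax(logits, dim=-1))
```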

Training the Brain: Fictitious Self-Play (FSP)

Training a single robot to chase a ball is easy. Training a team of robots to beat another team is incredibly hard. If you simply let two AI agents play against each other (Self-Play), they often fall into a trap called “cycling.” Agent A learns a trick to beat Agent B. Agent B learns a specific counter. Agent A learns a counter to the counter. They essentially chase each other around in a circle without ever learning generally robust strategies.

To solve this, the researchers utilized Fictitious Self-Play (FSP).

Figure 3: The FSP training procedure, where each side is trained against a population of previously trained opponents.

As illustrated in Figure 3, FSP doesn’t just train the robot against the current version of the opponent. It saves snapshots of the opponent at various stages of training into a “Policy Population.”

  1. Attacker Training: The attacker plays against a mix of current defenders and old versions of defenders. This ensures the attacker doesn’t forget how to beat basic strategies while learning to beat advanced ones.
  2. Defender Training: Once the attacker gets good enough (passing a win-rate threshold), it is frozen and added to the pool. The defender then trains against this new pool of attackers.

This co-evolutionary pressure forces the agents to learn general strategies. The attacker evolves from “just shoot” to “dribble around” to “pass to teammate.” The defender evolves from “chase ball” to “intercept pass” to “mark opponent.”
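
In pseudocode form, the FSP loop might look like the sketch below. The callables (policy initialization, training, win-rate evaluation, freezing), the number of generations, and the 0.8 threshold are placeholders; only the alternating train-freeze-grow structure comes from the paper's description.

```python
def fictitious_self_play(init_policy, train, win_rate, freeze,
                         generations=10, threshold=0.8):
    """Sketch of the FSP loop: each side trains against a growing population of
    frozen opponents and joins the pool once it clears a win-rate threshold."""
    attacker_pool = [freeze(init_policy("attacker"))]
    defender_pool = [freeze(init_policy("defender"))]
    for _ in range(generations):
        # Attacker phase: train against a mix of current and older defenders.
        attacker = train("attacker", opponents=defender_pool)
        if win_rate(attacker, defender_pool) >= threshold:
            attacker_pool.append(freeze(attacker))   # snapshot joins the population
        # Defender phase: train against the (now possibly larger) attacker pool.
        defender = train("defender", opponents=attacker_pool)
        if win_rate(defender, attacker_pool) >= threshold:
            defender_pool.append(freeze(defender))
    return attacker_pool, defender_pool
```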

Does Hierarchy Matter? The Ablation Study

You might ask: “Why build this complex two-level system? Why not just feed the camera data into one giant neural network and tell it to win?”

The researchers tested this “End-to-End” approach, and the results were clear. Without the hierarchical structure, the problem is too complex. The robot struggles to learn the basics of walking while simultaneously trying to understand the game score.

Figure 4: Ablation Study comparing End2End vs Hierarchical policies. (a) Ball trajectories are chaotic in End2End but focused in Ours. (b) Training scores show the hierarchical method (Ours) learns much faster and achieves higher scores.

Figure 4 paints a stark picture.

  • Panel (a): The white dotted lines show the “End2End” policy. The robot barely manages to move the ball, resulting in chaotic trajectories (or the robot running out of bounds). The red lines show the Hierarchical policy (“Ours”), where the robot drives the ball decisively toward the goal.
  • Panel (b): The training curves show that the Hierarchical method (purple line) reaches high performance quickly, while the End-to-End method and the variants with fewer skills struggle to learn at all.

Emergent Strategy: The 2v1 Game

The most exciting result of this research is the emergence of teamwork. The researchers set up a 2v1 scenario: Two attackers versus one defender.

The robots were not explicitly rewarded for passing. The reward function only cared about scoring goals and winning. Yet, the robots learned that passing is the most effective way to beat a defender.

In the simulation analysis shown in Figure 6, we can see the “brain” of the attacker at work.

Figure 6: Rollouts of policy trained with FSP in 2v1 setting. Shows different strategies like Passing and Solo Runs based on the defender’s behavior.

  • Scenario (a): The attacker sees the defender approaching. It realizes a “Solo Run” is risky. It chooses to Pass (blue arrow) to its teammate.
  • Scenario (c): The defender is positioned differently, perhaps blocking the pass lane. The attacker decides to keep the ball and perform a Solo Run to the goal.

This demonstrates that the FSP training produced a multi-modal policy—a brain capable of adapting its strategy based on the specific situation.

Crossing the Gap: Real-World Deployment

Many RL papers stop at simulation. This one goes into the wild (or at least, onto a soccer field).

Deploying to the real world is difficult because real sensors are noisy and there is no external motion-capture system providing ground-truth positions. To handle this, the team used a fully decentralized system. There is no central computer telling the robots what to do. Each robot carries a LiDAR (Livox MID-360) and an NVIDIA Jetson Orin NX onboard computer.

Figure 13: Real world deployment overview. (a) The hardware setup with LiDAR and onboard computer. (b) The software pipeline showing decentralized decision making.

As shown in Figure 13, the robot uses LiDAR for localization (knowing where it is on the field) and object detection (finding the ball and humans). It calculates its own actions and only shares minimal data (like “I am here”) with teammates over Wi-Fi to help with coordination.
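
The “I am here” message can be as simple as a small packet broadcast on the local network. The sketch below assumes a UDP broadcast with a JSON payload; the actual message format, port, and rate used by the authors are not specified here.

```python
import json
import socket
import time

PORT = 9870  # illustrative port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

def broadcast_state(robot_id, pose, ball_estimate):
    """Broadcast a tiny state message to teammates over Wi-Fi."""
    msg = {"id": robot_id, "t": time.time(),
           "pose": pose,           # (x, y, yaw) from onboard LiDAR localization
           "ball": ball_estimate}  # (x, y) estimate, or None if the ball is not seen
    sock.sendto(json.dumps(msg).encode(), ("255.255.255.255", PORT))

broadcast_state(robot_id=1, pose=(2.0, -1.5, 0.3), ball_estimate=(3.1, 0.4))
```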

Analyzing the “Robot Brain” in Reality

The researchers visualized the “Value Map” of the robots during a real game. In Reinforcement Learning, the “Value Function” represents how good the agent thinks the current situation is.
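
A value map like the ones in Figure 7 can be rendered by sweeping one quantity (here, the ball position) across the field and querying the learned critic at each grid cell. The field dimensions, grid resolution, and observation layout below are illustrative assumptions; "critic" stands in for the trained value network.

```python
import numpy as np

def value_map(critic, base_obs, x_range=(-4.5, 4.5), y_range=(-3.0, 3.0), res=0.25):
    """Sweep the ball position over a grid and query the critic at each cell.
    'critic' and 'base_obs' are placeholders for the trained value network and
    the rest of the observation (robot poses, goal, etc.)."""
    xs = np.arange(*x_range, res)
    ys = np.arange(*y_range, res)
    grid = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            obs = dict(base_obs, ball_position=(x, y))  # vary only the swept quantity
            grid[i, j] = critic(obs)                    # higher = better situation
    return xs, ys, grid
```

Brighter (yellow) regions of the resulting grid correspond to positions the robot believes are advantageous.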

Figure 7 offers a rare glimpse into the robot’s decision-making process during a real match.

Figure 7: Real-world behavior analysis. Top-down views of attackers coordinating passes, alongside ‘Value Maps’ showing the robot’s perceived strategic advantage in different field positions.

Look at Panel (b) in Figure 7. This is the Attacker’s perspective. The yellow areas represent high value (good spots), and purple areas are low value.

  • The Attacker recognizes that moving the ball toward the teammate (who is open) yields a higher value than trying to dribble through the defender.
  • This creates a Pass.

In Panel (g), the system is robust enough that a human can jump in as Attacker 2. The robot playing Attacker 1 recognizes the human as a teammate and collaborates to score a goal.

Conclusion

This paper represents a significant step forward for robotic systems. By decomposing the problem into Low-Level Skills (robust motor control) and High-Level Strategy (tactical planning), and training them via Fictitious Self-Play, the researchers created a team of quadruped robots that can actually play soccer.

The key takeaways are:

  1. Hierarchy is essential: Learning everything end-to-end failed in the ablation; building a foundation of motor skills first is what made the strategy learnable.
  2. Adversarial Training drives intelligence: Robots get smarter only when their opponents get smarter.
  3. Decentralization works: Complex team behaviors like passing can emerge from individual agents acting on their own observations, provided they share a common goal.

While we aren’t at the level of the World Cup yet, seeing quadrupeds execute coordinated passing plays on a real field suggests that the future of robot sports—and multi-agent robotics in general—is kicking off in the right direction.