Training large language models (LLMs) is a monumental task. But what happens after the initial pre-training? How do we refine these models—making them better at complex reasoning, following instructions, and avoiding harmful outputs? One of the most powerful techniques for this is Reinforcement Learning (RL), where a model learns through trial and error, much like a person mastering a new skill.

However, applying RL to massive models comes with a hefty price tag. It often demands enormous, centralized GPU clusters where models are trained in lockstep. This approach is not only incredibly expensive but also poses significant technical challenges: communication bottlenecks, latency issues, and a reliance on highly specialized, homogeneous hardware. It’s a game largely reserved for a few big players with deep pockets.

What if there were another way? What if, instead of one giant centrally-controlled brain, we could create a swarm of models, each learning together in a decentralized network? This is the core idea behind a new paper from the Gensyn AI team, introducing Swarm sAmpling Policy Optimization (SAPO). SAPO enables a diverse collection of models, running on different hardware, to collectively improve by simply sharing their experiences. Critically, they don’t share complex model weights—only the plain text they generate. This simple shift unlocks a more efficient, scalable, and democratic path toward better AI.

In this post, we’ll dive deep into the SAPO algorithm, explore the controlled experiments where it boosted performance by up to 94%, and look at insights from a massive real-world demo involving thousands of community participants.

The Bottleneck of Modern RL Training

Before we get into SAPO, let’s recap why it’s needed. Post-training a language model with RL usually follows a loop like this (sketched in code just after the list):

  1. Generate Responses — The model is given a prompt (question or task) and generates an output.
  2. Get a Reward — This output is evaluated by a reward model. In Reinforcement Learning from Human Feedback (RLHF), the reward model learns from human preference data. In Reinforcement Learning with Verifiable Rewards (RLVR), the reward is computed programmatically (e.g., by checking if a math answer is correct).
  3. Update the Model — The model parameters are adjusted via a policy-gradient algorithm such as Proximal Policy Optimization (PPO), making high-reward outputs more likely.
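
To make the loop concrete, here is a minimal single-node sketch in Python. The helper names (`sample_prompts`, `verify_answer`, `policy_gradient_update`) and the `policy.generate` method are hypothetical placeholders rather than code from the paper; the point is only the shape of the generate-score-update cycle.

```python
# Minimal sketch of one RLVR-style post-training step. All callables here
# are hypothetical placeholders, not real APIs from the paper or any library.

def rl_post_training_step(policy, sample_prompts, verify_answer, policy_gradient_update):
    prompts = sample_prompts(batch_size=32)

    # 1. Generate a response for each prompt.
    completions = [policy.generate(p) for p in prompts]

    # 2. Score each response with a verifiable reward,
    #    e.g. 1.0 if the final answer checks out, 0.0 otherwise.
    rewards = [verify_answer(p, c) for p, c in zip(prompts, completions)]

    # 3. Nudge the policy so that high-reward responses become more likely.
    policy_gradient_update(policy, prompts, completions, rewards)
```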

This loop is highly effective for teaching complex reasoning. Problems arise when you try to scale it: generating enough diverse experiences quickly requires running this process in parallel on large clusters, with centrally synchronized weights and rollouts. Synchronization becomes a major bottleneck—every subsystem waits for others, slowing progress and increasing fragility.

Multi-agent systems offer inspiration. In AI research, these involve autonomous agents collaborating—debating, specializing, or bootstrapping each other’s capabilities. SAPO channels this collaborative ethos but applies it to RL in a novel, fully decentralized way.

SAPO: How the Swarm Learns Together

SAPO’s elegance lies in its simplicity: it reframes distributed training as a swarm—a decentralized network of nodes where each is an autonomous agent.

The Anatomy of the Swarm

Imagine a network with N nodes. Each node n has:

  1. A Policy (\(\pi^n\)) — The node’s language model. It can be any architecture, any parameter count. The swarm is heterogeneous: one node might run a 0.5B model on a MacBook, another a 7B model on a gaming rig.
  2. A Dataset (\(\mathcal{D}^n\)) — A set of questions or tasks paired with their ground-truth solutions, so answers can be checked automatically.
  3. A Reward Model (\(\rho^n\)) — A local scoring function, possibly rule-based or learned.
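
A minimal sketch of such a node in Python, assuming simple placeholder types; the field names are illustrative and not taken from the paper’s code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SwarmNode:
    """One autonomous agent in the swarm (illustrative schema)."""
    policy: object                                # pi^n: any LLM, any size
    dataset: Sequence[dict]                       # D^n: {"question", "answer"} pairs
    reward_fn: Callable[[str, str, str], float]   # rho^n: (question, truth, rollout) -> score
```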

These nodes operate asynchronously—they never wait for each other.

A Round of Training with SAPO

Let’s walk through one node—say Alice—in a SAPO training round.

Step 1: Generate Local Experience
Alice samples a batch of questions from her dataset. For each question \(q\), she uses her policy \(\pi^n\) to generate multiple answers. Each set of answers to a single question is a rollout: her local exploration.
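
In code, Step 1 might look like the sketch below, reusing the hypothetical `SwarmNode` from earlier; `answers_per_question` controls how many completions make up each rollout.

```python
import random

def generate_local_rollouts(node, num_questions=8, answers_per_question=4):
    """Sample questions from the node's dataset and roll out several answers each."""
    batch = random.sample(list(node.dataset), num_questions)
    rollouts = []
    for item in batch:
        answers = [node.policy.generate(item["question"])   # hypothetical generate()
                   for _ in range(answers_per_question)]
        rollouts.append({
            "question": item["question"],
            "ground_truth": item["answer"],
            "answers": answers,
        })
    return rollouts
```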

Step 2: Share with the Swarm
Alice broadcasts some of these rollouts to other nodes. Each shared packet contains:

  • the question
  • the ground-truth answer
  • the rollout text
  • metadata for verification

She shares only decoded text—lightweight and architecture-agnostic—rather than models or gradients.
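
Here is a sketch of what such a packet could look like. The exact fields and metadata are assumptions for illustration, but the key property matches the paper: the packet carries only plain text, never weights or gradients.

```python
def make_packet(node_id, rollout):
    """Wrap one rollout as a plain-text packet for the swarm (illustrative schema)."""
    return {
        "question": rollout["question"],          # the task prompt
        "ground_truth": rollout["ground_truth"],  # verifiable answer
        "answers": rollout["answers"],            # decoded rollout text only
        "metadata": {"node_id": node_id},         # e.g. provenance for verification
    }
```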

Step 3: Sample from the Swarm
Alice constructs her training set by combining:

  • \(I^n\) rollouts from herself
  • \(J^n\) rollouts sampled from peers (e.g., Bob, Carol)

Sampling can be filtered—discarding zero-advantage rollouts or prioritizing certain tasks—allowing each node to tailor its learning.
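
A sketch of that sampling step, assuming the packet format above. The filter shown here is one way to implement the zero-advantage idea: packets whose answers all earn the same reward carry no learning signal and are dropped.

```python
import random

def build_training_set(local_rollouts, swarm_pool, reward_fn, num_local, num_external):
    """Combine I^n local rollouts with J^n rollouts sampled from the swarm."""
    def has_signal(packet):
        # Drop packets whose answers all score identically (zero advantage).
        scores = {reward_fn(packet["question"], packet["ground_truth"], a)
                  for a in packet["answers"]}
        return len(scores) > 1

    candidates = [p for p in swarm_pool if has_signal(p)]
    external = random.sample(candidates, min(num_external, len(candidates)))
    local = random.sample(local_rollouts, min(num_local, len(local_rollouts)))
    return local + external
```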

Step 4: Update the Policy
Her reward model scores every rollout. She then updates \(\pi^n\) with a policy-gradient algorithm; the paper uses Group Relative Policy Optimization (GRPO).
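
At the heart of GRPO is a group-relative advantage: each answer’s reward is normalized against the other answers to the same question, so no separate value network is needed. The sketch below shows only that normalization; the full GRPO objective (clipped importance ratios, KL regularization) is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three answers to one question, only the first correct: the correct answer
# gets a positive advantage, the others a negative one.
print(group_relative_advantages([1.0, 0.0, 0.0]))
# -> [~1.41, ~-0.71, ~-0.71]
```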

In this way, Alice can learn from answers she would never have produced. If Bob’s model finds a novel logic puzzle solution, Alice can re-encode that text and learn from it. These “Aha moments” ripple across the swarm, bootstrapping collective learning.

The full pseudocode appears in Algorithm 1 of the paper, which covers local generation, sharing, sampling, and the policy update for each node in the swarm.
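
Putting the pieces together, one round for a single node could be sketched as below, reusing the helpers from earlier. The swarm interface (`broadcast`, `fetch_packets`) and the `policy.update` call are assumptions standing in for the actual communication and training code, not the authors’ implementation.

```python
def sapo_round(node, node_id, swarm, num_local, num_external):
    """One SAPO round for a single node (illustrative; mirrors Algorithm 1's structure)."""
    # 1. Generate local experience and share it as plain-text packets.
    local = generate_local_rollouts(node)
    for rollout in local:
        swarm.broadcast(make_packet(node_id, rollout))        # hypothetical API

    # 2. Sample peers' rollouts and mix them with local ones.
    pool = swarm.fetch_packets(exclude=node_id)               # hypothetical API
    training_set = build_training_set(local, pool, node.reward_fn,
                                      num_local, num_external)

    # 3. Score every answer locally and take a policy-gradient step (e.g. GRPO).
    for packet in training_set:
        rewards = [node.reward_fn(packet["question"], packet["ground_truth"], a)
                   for a in packet["answers"]]
        advantages = group_relative_advantages(rewards)
        node.policy.update(packet["question"], packet["answers"], advantages)  # hypothetical
```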

Putting SAPO to the Test: Controlled Experiments

Setup:

  • Swarm: 8 identical Qwen2.5 0.5B models, each on its own GPU.
  • Environment: ReasoningGYM — generates endless reasoning tasks across diverse domains (algebra, logic, abstract reasoning) with automatic verifiers for rewards.
  • Goal: Compare different ratios of local vs. external rollouts.

Configurations (8 total rollouts per agent per round):

  1. Baseline (8 local / 0 external) — Standard RL without sharing.
  2. SAPO (6 local / 2 external) — Light sharing.
  3. SAPO (4 local / 4 external) — Balanced sharing.
  4. SAPO (2 local / 6 external) — Heavy sharing.

Results: Sharing Pays Off (to a Point)

As shown in Figure 1, sharing boosts performance.

Figure 1: This grid of plots shows the reward trajectories for all 8 agents across the four configurations. The configurations with more external rollouts (c and d) reach higher peak rewards than the baseline (a).

Agents without sharing learned slowest, while more sharing generally meant faster learning and higher rewards. The balanced 4/4 split was best overall, achieving the highest total accumulated reward: a 94% improvement over the baseline.

A smoothed 100-step average reward plot (Figure 2) confirms this.

Figure 2: This plot shows the smoothed average reward for each configuration. The 4 local / 4 external setup (green) consistently outperforms the others for most of the training process.

The Goldilocks Principle of Sharing

Too much sharing can destabilize learning. In Figure 2, the 2/6 setup (orange) oscillates heavily—large gains followed by steep drops. The authors cite two causes:

  1. Quality Dilution: High-performing agents slow down when they over-sample low-quality rollouts.
  2. Pool Stagnation: If agents consume more than they contribute, the shared pool’s quality drops, causing collective forgetting.

Conclusion: share enough to benefit from collective intelligence, but not so much that instability creeps in.

In the Wild: Large-Scale Demo

Controlled tests are idealized. The team also ran a massive open-source demo with thousands of Gensyn community nodes running diverse models and hardware. The swarm was highly dynamic, with nodes joining/leaving.

By comparing swarm-trained vs. isolated models, they found that Qwen2.5 0.5B models gained significant benefits from SAPO.

Figure 3: This chart compares the cumulative rewards of Qwen2.5 0.5B models trained with SAPO in the swarm versus in isolation. After about 175 rounds, the swarm-trained models show a statistically significant performance advantage.

After ~175 normalized rounds, swarm-trained models (blue) pulled away from isolated ones (orange) with statistically significant gains.

Interestingly, more powerful models saw less impact here. The authors suggest SAPO’s collective learning suits mid-capacity models best—they have more room to grow from diverse external rollouts. They also note the demo used uniform random sampling; smarter rollout selection could help all models benefit.

Conclusion: A More Collaborative Future for AI

SAPO is a compelling alternative to centralized, resource-heavy RL post-training:

  • Scalable: No synchronization bottlenecks.
  • Efficient: Shares lightweight text instead of massive parameters.
  • Democratic: Enables heterogeneous participants to contribute and benefit.

The key insight: balanced experience sharing dramatically accelerates learning. One agent’s breakthrough can ripple across the swarm, lifting everyone.

The paper invites exciting future directions:

  • Swarms with task-specialized agents
  • Human participants injecting their own rollouts
  • Multi-modal swarms—imagine AI artists sharing images and aesthetic reward functions, collectively evolving a style

SAPO is more than an algorithm—it’s a vision for collaborative, accessible AI development, where sharing isn’t just caring—it’s how we get smarter, together.