Imagine assembling a team of brilliant experts to solve a complex problem. One is a master strategist who breaks the problem down into manageable steps. The other is a meticulous executor who crunches the numbers and carries out the plan. In theory, this collaboration should outperform any single expert working alone.

This is the promise of multi-agent Large Language Model (LLM) systems—AI teams that work together to reason through challenges far beyond the reach of a single model.

But what if the executor in your expert team just… gets lazy? Instead of performing calculations, they nod along, copy the strategist’s notes, or contribute nothing substantial. The strategist is forced to do all the work. The collaboration collapses, and performance plummets. This isn’t just a human team problem—it’s a critical issue plaguing today’s most advanced AI systems.

A recent research paper, “Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation,” dives deep into this “lazy agent” phenomenon. The authors find that in a popular multi-agent framework, one agent systematically learns to slack off, undermining the entire setup. They don’t stop at identifying the problem—they provide a rigorous theoretical explanation and introduce a new framework called Dr. MAMR (Multi-Agent Meta-Reasoning Done Right) to fix it.

This article will unpack how AI agents turn lazy, how the researchers designed a system to motivate genuine collaboration, and how they even taught an agent to know when to say, “Let’s scrap this and start over.”


Background: The Rise of AI Teams

Before diagnosing lazy agents, let’s understand the framework where they arise. The paper builds on a system called ReMA (Reasoning via Meta-thinking and Acting). ReMA uses two specialized LLM agents that share the same model weights but differ in their roles, defined by distinct system prompts:

  1. Meta-Thinking Agent (π_h) — the high-level planner that decomposes challenges, sets goals, and adapts based on feedback. Think of it as the project manager.
  2. Reasoning Agent (π_l) — the low-level executor that performs detailed computations, proofs, and derivations step-by-step. It’s the technical expert doing the heavy lifting.

The two agents collaborate through a sequence of conversational turns.

A diagram showing the sequential interaction between the meta-thinking agent and the reasoning agent over multiple turns.

“In ReMA, the two agents alternate steps: the meta-thinker plans, and the reasoning agent acts.”
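To make this interaction pattern concrete, here is a minimal Python sketch of the alternating loop. Everything in it—the prompts, the `generate` helper wrapping the shared model, and the stopping convention—is illustrative rather than the paper’s exact setup:

```python
# Minimal sketch of a ReMA-style alternating loop (illustrative only).
# `generate` is an assumed helper that calls the shared LLM with a
# role-specific system prompt and the conversation so far.

META_PROMPT = "You are the meta-thinking agent: plan, decompose, and give feedback."
REASON_PROMPT = "You are the reasoning agent: carry out the plan step by step."

def solve(question: str, generate, max_turns: int = 4) -> str:
    history = [("user", question)]
    answer = ""
    for _ in range(max_turns):
        # High-level planner proposes or revises a plan.
        plan = generate(META_PROMPT, history)
        history.append(("meta", plan))

        # Low-level executor carries out the plan in detail.
        step = generate(REASON_PROMPT, history)
        history.append(("reason", step))

        if "FINAL ANSWER" in step:  # illustrative stopping convention
            answer = step
            break
    return answer
```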

To train such a system, the researchers use Reinforcement Learning (RL)—specifically a multi-turn variant of Group Relative Policy Optimization (GRPO). It rewards whole conversational trajectories that lead to correct answers and penalizes those that fail. Importantly, it assigns credit to individual conversation turns for finer-grained optimization.

The mathematical formula for the multi-turn GRPO objective function.

“Multi-turn GRPO introduces a turn-level reward structure for multi-agent reasoning.”

This objective relies on a turn-level importance ratio, which measures how much more (or less) likely the updated model is to produce a given turn than its previous version.

The mathematical formula for the turn-level importance ratio.

“Turn-level importance ratio tracks how each update affects the probability of generating a step.”
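For readers who want the algebra, here is one plausible reconstruction of the two quantities just described, written in standard GRPO-style notation (the paper’s exact symbols may differ). The turn-level importance ratio compares the new and old policies on turn \( t \) of trajectory \( i \), and the objective averages clipped, advantage-weighted ratios over the \( T_i \) turns of each of the \( G \) sampled trajectories—note the \( 1/T_i \) factor, which becomes important later:

\[ r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})}, \qquad \mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]. \]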

In theory, this should foster cooperation. But in practice, things went wrong.


The Problem: When Collaboration Collapses

After training with ReMA, the researchers observed something strange: one of the agents—usually the reasoning agent—started acting lazy. It contributed almost nothing, sometimes emitting a blank response or merely restating the meta-thinker’s text. The meta-thinker ended up doing all the work, effectively collapsing the two-agent setup into a single, weaker agent.

A diagram comparing a lazy reasoning process that fails with a non-lazy, collaborative process that succeeds. It also shows schematics for the proposed solutions: Shapley-inspired causal influence and verifiable reward for restart.

“Case study: (a) Lazy reasoning copies the meta-thinker’s mistake; (b) Active collaboration catches errors and arrives at the correct answer.”

To quantify this laziness, the authors measured the causal influence of each agent’s turn—the degree to which one action affects subsequent ones. The intuition: remove an agent’s contribution and see if the system behaves differently. If it doesn’t, the agent’s contribution was essentially meaningless.

The change in behavior was measured as the KL divergence between the model’s output distributions before and after suppressing that turn. Low divergence = low influence = laziness.
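Here is a rough Python sketch of that probe. It assumes a hypothetical helper `model.token_distributions(context, continuation)` that returns the model’s per-token probability distributions while scoring a continuation, standing in for whatever instrumentation the authors actually used:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors over the vocabulary."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))

def turn_influence(model, history, turn_index, next_step):
    """Estimate how much the turn at `turn_index` shapes the next step.

    `model.token_distributions(context, continuation)` is an assumed helper
    that returns the per-token probability distributions the model assigns
    while scoring `continuation` given `context`.
    """
    full_context = history                                   # all turns kept
    masked_context = history[:turn_index] + history[turn_index + 1:]

    dists_full = model.token_distributions(full_context, next_step)
    dists_masked = model.token_distributions(masked_context, next_step)

    # Average KL across the tokens of the next step; near-zero divergence
    # means suppressing the turn barely changed the model's behavior.
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(dists_full, dists_masked)]))
```

A turn whose removal leaves those distributions essentially unchanged scores near zero—exactly the signature of a lazy contribution.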

Density plots comparing the causal influence of the meta-thinking agent and the reasoning agent under different training settings.

“Causal influence distributions reveal how training causes the reasoning agent to become passive under ReMA.”

The results were eye-opening:

  • Initialized: Before training, both agents exert balanced influence—healthy collaboration.
  • ReMA: After training, the reasoning agent’s influence collapses, while the meta-thinker dominates.
  • ReMA w/ prompt: Explicitly telling the reasoning agent to “work harder” helps slightly but doesn’t fix it.
  • Ours (Dr. MAMR): The proposed method restores balanced influence, leading to better reasoning accuracy.

Why does a reinforcement learning system train itself to be lazy? The answer lies in a subtle but powerful bias in its objective.


Dr. MAMR: A Three-Part Cure for Lazy Agents

The authors propose Dr. MAMR—a comprehensive fix consisting of three synergistic components:

  1. A theoretical adjustment that removes a hidden bias favoring short conversations.
  2. A new reward signal based on causal influence, which values meaningful contributions.
  3. A deliberation mechanism that allows agents to restart reasoning when they realize they’re lost.

1. Theoretical Flaw: A Bias Toward Shorter Conversations

The culprit behind laziness is the normalization term \( 1/T_i \) in the GRPO objective. This factor was originally meant to prevent the model from favoring longer trajectories. However, as proven by the authors in Theorem 1, it unintentionally biases learning toward shorter ones.

Intuitively, if two reasoning paths earn the same final reward, each turn of the shorter path receives a stronger training gradient. The model learns that fewer steps are better—and the easiest way to shorten a dialogue is to say less. Blank responses and shallow summaries become optimal strategies, and over time this shortcut turns the reasoning agent lazy.
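A toy calculation (numbers chosen purely for illustration) makes the bias concrete. Suppose two successful trajectories receive the same advantage \( \hat{A} = 1 \), one finishing in \( T_1 = 2 \) turns and the other in \( T_2 = 6 \). Under the \( 1/T_i \) normalization, each turn is weighted by

\[ \frac{\hat{A}}{T_1} = \frac{1}{2} \quad \text{versus} \quad \frac{\hat{A}}{T_2} = \frac{1}{6}, \]

so the optimizer reinforces each step of the short, low-effort dialogue roughly three times harder than each step of the thorough one.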

To break this bias, Dr. MAMR removes the \( 1/T_i \) normalization term entirely, eliminating the incentive for brevity-driven laziness.


2. Shapley‑Inspired Causal Influence: Rewarding What Matters

Removing bias is a start, but we also need to actively reward productive collaboration. Dr. MAMR achieves this through a Shapley‑inspired causal influence measure that quantifies how much each step shapes subsequent reasoning.

True causal influence estimation is computationally expensive, so the authors introduce a lightweight, statistically robust alternative:

  1. Group semantically similar steps: Across many conversations, group together steps expressing the same idea (e.g., “find the derivative” ≈ “calculate the rate of change”), based on cosine similarity of embeddings.

    \[ G_S(s_{i,t}) = \{ s_{j,t'} \mid s_{j,t'} \approx s_{i,t}, 1 \le j \le N, 1 \le t' \le 2T_j \} \]
  2. Measure simple one‑step influence: For each step in the group, compute how masking that step changes the probability of the step that follows, as a difference in log probabilities:

    \[ \Delta \ell_{j,t'} = \log p_{\text{mask}}^{(j,t')} - \log p_{\text{full}}^{(j,t')} \]

    This quantity measures how crucial a step is to what comes next.
  3. Average across similar cases: Aggregate these one‑step effects within the group to get a stable causal‑influence score: \[ \operatorname{CI}(s_{i,t}) = \frac{1}{|G_S(s_{i,t})|} \sum_{s_{j,t'} \in G_S(s_{i,t})} \Delta \ell_{j,t'}. \]

This score rewards steps that genuinely alter the trajectory, encouraging agents to make influential contributions rather than filler responses.
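A compact Python sketch of this estimator, assuming precomputed step embeddings and a `delta_logprob` function that returns \( \Delta \ell \) for a given step (both stand-ins for the paper’s actual implementation):

```python
import numpy as np

def causal_influence(step_id, embeddings, delta_logprob, sim_threshold=0.9):
    """Shapley-inspired causal-influence score for one step (illustrative).

    embeddings:    dict step_id -> unit-normalized embedding vector
    delta_logprob: callable step_id -> (log p_mask - log p_full) for the
                   step that follows it, i.e. the one-step influence
    """
    anchor = embeddings[step_id]

    # 1. Group semantically similar steps by cosine similarity of embeddings.
    group = [sid for sid, emb in embeddings.items()
             if float(np.dot(anchor, emb)) >= sim_threshold]

    # 2-3. Average the one-step influences across the group for a
    #      statistically more stable score.
    return float(np.mean([delta_logprob(sid) for sid in group]))
```

The averaging over a semantic group is what makes the score robust: a single noisy masking experiment matters less than the consensus across many similar steps.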


3. Deliberation and the Restart Button

As collaboration improves, dialogues grow longer—and in longer interactions, LLMs risk becoming entangled in their own earlier mistakes. Once an agent makes a wrong assumption early on, it often struggles to recover.

The authors hypothesize that letting the reasoning agent discard its prior output and restart could restore focus and correctness. To test this idea, they created an inference‑time variant called ReMA+, where the agent is prompted to restart if stuck.

Scatter plots showing the performance gap between a model with restart prompting (ReMA+) and the baseline (ReMA). The gap widens on more difficult benchmarks.

“Allowing restarts improves performance—especially on harder, multi-turn benchmarks.”

ReMA+ consistently outperformed the baseline, validating the hypothesis. Building on this, Dr. MAMR trains restart behavior directly using a special <restart> token. When emitted, the agent discards previous reasoning and restarts from a clean, consolidated state.
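A minimal sketch of how such restart handling could work at inference time—`<restart>` follows the paper, while `generate`, `consolidate`, and the restart limit are illustrative names of our own:

```python
RESTART_TOKEN = "<restart>"

def reasoning_turn(generate, prompt, history, consolidate, max_restarts=2):
    """Run the reasoning agent, restarting from a clean state if it asks to."""
    step = ""
    for _ in range(max_restarts + 1):
        step = generate(prompt, history)
        if RESTART_TOKEN not in step:
            return step, history
        # The agent judged its earlier reasoning unsalvageable: discard the
        # old chain and keep only a consolidated state (e.g. the problem
        # statement plus the current plan).
        history = consolidate(history)
    return step, history
```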

But not all restarts are helpful—so Dr. MAMR defines a verifiable reward function to decide when a restart deserves credit:

  • Reward if a restart increases confidence in a correct result or decreases confidence in an incorrect one.
  • Penalize the opposite cases.

The mathematical formula for the verifiable restart reward.

“Restart reward evaluates whether discarding history improved the confidence in the final answer.”

This reward makes restart behavior learnable in RL rather than heuristic.
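In code, that rule amounts to a simple sign test on the confidence change; the sketch below uses the model’s confidence in the final answer before and after the restart, though the paper’s exact scaling may differ:

```python
def restart_reward(conf_before: float, conf_after: float,
                   answer_correct: bool) -> float:
    """Verifiable restart reward (illustrative sketch).

    conf_before / conf_after: the model's confidence in the final answer,
    measured before and after the restart. A restart earns credit when it
    moves confidence in the right direction and is penalized otherwise.
    """
    delta = conf_after - conf_before
    return delta if answer_correct else -delta
```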

Finally, all signals are combined into one unified advantage function that captures correctness, causal influence, and effective restarts.

The final aggregated step-level advantage function for Dr. MAMR.

“Dr. MAMR’s training advantage combines outcome correctness, causal impact, and restart success.”
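As a rough schematic—our notation, with assumed weighting coefficients \( \lambda_{\text{CI}} \) and \( \lambda_{\text{R}} \) rather than the paper’s exact aggregation—the per-step advantage has the shape

\[ \hat{A}_{i,t} = \hat{A}^{\text{outcome}}_{i} + \lambda_{\text{CI}} \operatorname{CI}(s_{i,t}) + \lambda_{\text{R}} R^{\text{restart}}_{i,t}, \]

so a step is credited for ending up in a correct trajectory, for causally shaping what follows, and for any restart that verifiably helped.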


Experiments and Results: Dr. MAMR in Action

Overall Performance

Across seven math reasoning benchmarks, Dr. MAMR consistently surpassed all baselines—both the single‑agent GRPO and the original multi‑agent ReMA. ReMA’s lazy‑agent issue hurt performance; Dr. MAMR restored collaboration and achieved substantial gains.

Table showing performance results on seven math benchmarks. Dr. MAMR consistently outperforms all baselines, including GRPO and ReMA.

“Dr. MAMR converts a failing multi‑agent system into one that surpasses single‑agent reasoning.”

Inside the Training Process

A composite figure showing causal influence during training, mean reward curves, and pass@K performance.

“(a) Causal influence balances between agents under Dr. MAMR; (b) training remains stable; (c) performance scales better with more samples.”

Key observations:

  • Causal Influence: Under ReMA, the reasoning agent’s influence wanes; under Dr. MAMR, both agents grow together.
  • Training Stability: Dr. MAMR avoids collapse—its reward curve stays strong and steady.
  • Scaling Performance: As more attempts (K) are allowed, Dr. MAMR keeps gaining ground, reflecting richer reasoning diversity.

Component Importance

An ablation study confirms that each module is necessary. Removing the normalization debiasing (ND), the causal‑influence reward (CI), or the restart behavior (RB) degrades performance.

Table from the ablation study showing that removing any of Dr. MAMR’s components degrades performance.

“Each component—debiasing, causal reward, restart—is essential for full performance.”


Conclusion: Building Better AI Teammates

The “lazy agent” problem reveals a deep truth about collaborative AI: simply linking two intelligent models doesn’t guarantee teamwork. Their training objectives and incentives matter profoundly.

The paper’s contributions are threefold:

  1. Diagnosis: It pinpoints and theoretically explains the lazy‑agent issue stemming from multi‑turn loss normalization bias.
  2. Remedy: It introduces the Shapley‑inspired causal‑influence mechanism to reward meaningful contributions.
  3. Recovery: It designs a verifiable restart reward that teaches agents to recognize their own mistakes and reset when necessary.

With these innovations, Dr. MAMR transforms dysfunctional cooperation into synergistic reasoning. It’s a powerful demonstration of how thoughtful optimization design can turn AI agents from passive bystanders into active collaborators—pushing multi‑agent LLM reasoning ever closer to genuine teamwork.