If you have been following the evolution of Large Language Models (LLMs), you are likely familiar with the concept of Instruction Fine-Tuning (IFT). It is the crucial step that turns a raw, text-predicting base model into a helpful assistant capable of following user commands.
Recently, the research community has shifted its focus from “how much data do we need?” to “how good does the data need to be?” Papers like LIMA (Less Is More for Alignment) demonstrated that a small set of high-quality data often beats massive amounts of noisy data. This led to a gold rush of data selection methods—algorithms designed to sift through datasets and pick the “cherry” samples while discarding the “lemons.”
But what if we didn’t have to discard the lemons? What if the data we have isn’t inherently bad, but just… lazy?
In this post, we are diving into COEVOL, a fascinating framework presented in the paper Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation. The authors propose that we can drastically improve model performance not by finding better data, but by using multi-agent cooperation to evolve and edit the data we already have.
The Problem: The “Good Enough” Trap
Current methods for constructing IFT data often rely on LLMs themselves to generate instructions and responses (like Self-Instruct). While efficient, this approach has a flaw: LLMs, by their nature as causal language models, often default to the most probable, average-quality answer rather than the comprehensive, high-quality response they are actually capable of producing.
When we use these “average” responses to train new models, we are essentially teaching the student to be mediocre. Previous solutions focused on filtering these out. The COEVOL researchers argue that we are wasting potential. Instead of tossing out imperfect data, we should refine it.
The Solution: The COEVOL Framework
The core idea behind COEVOL is inspired by human editorial processes. If a writer produces a draft that is okay but lacks detail, an editor doesn’t just throw it in the trash. They critique it, debate the angle, suggest improvements, and revise it until it shines.
COEVOL implements this using a Multi-Agent Cooperation Framework. It employs five distinct LLM-based agents, each with a specific role, working in a loop to refine data.

As shown in Figure 1, the process is a cycle dubbed the Debate-Advise-Edit-Judge paradigm. Let’s break down the cast of characters (a rough sketch of how they fit together follows the list):
- Debaters (Positive & Critical): Two agents that argue about the quality of the current response.
- Advisor: An agent that synthesizes the debate and provides actionable writing suggestions.
- Editor: An agent that rewrites the response based on the advice.
- Judge: An agent that decides if the new response is actually better than the old one.
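To make the division of labor concrete, here is a minimal Python sketch of the outer refinement loop as I read it from the paper. The role prompts, the `call_llm` helper, and the three-round cap are illustrative assumptions, not the authors’ actual templates or API; the backend could be ChatGPT, Mixtral, or any chat model.

```python
# Sketch of the COEVOL cast and outer loop (assumed prompts, placeholder backend).

ROLES = {
    "positive_debater": "Argue that the current response is accurate and adequate.",
    "critical_debater": "Argue that the current response is flawed and needs improvement.",
    "advisor":          "Summarize the debate into concrete writing suggestions.",
    "editor":           "Rewrite the response following the advisor's suggestions.",
    "judge":            "Decide which of two responses better answers the instruction.",
}

def call_llm(role: str, prompt: str) -> str:
    """Stand-in for a chat-completion call that would use ROLES[role] as the system prompt."""
    raise NotImplementedError("plug in your LLM client (ChatGPT, Mixtral, ...) here")

def evolve(instruction: str, response: str, max_rounds: int = 3) -> str:
    """Debate -> Advise -> Edit -> Judge, repeated until the Judge stops preferring the edit."""
    for _ in range(max_rounds):
        debate = (call_llm("positive_debater", f"{instruction}\n{response}")
                  + "\n"
                  + call_llm("critical_debater", f"{instruction}\n{response}"))
        advice = call_llm("advisor", debate)
        edited = call_llm("editor", f"{instruction}\n{response}\nAdvice:\n{advice}")
        verdict = call_llm("judge", f"{instruction}\nA: {response}\nB: {edited}")
        if "B" not in verdict:      # edited version not judged better: stop evolving
            break
        response = edited           # keep the improvement and run another round
    return response
```

The sections below refine each of these stages in turn.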
Step 1: The Two-Stage Debate Strategy
One of the most innovative parts of COEVOL is how it handles feedback. Simply asking an LLM to “critique this” often leads to generic or sycophantic feedback. To solve this, the authors designed a two-stage debate strategy.
Stage 1: Predetermined-Position Debate
In the first round, the two debaters are assigned fixed positions.
- The Positive Debater must argue that the current response is accurate.
- The Critical Debater must argue that the response is flawed and needs improvement.
This ensures that the system generates a diverse set of viewpoints immediately, preventing an “echo chamber.”

In the paper’s notation, \(\hat{x}\) denotes the data sample and \(t\) the specific task prompt handed to each debater (e.g., “support this claim” or “refute this claim”).
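A small sketch of Stage 1, assuming the notation above: the same sample \(\hat{x}\) goes to both debaters, but each receives a different, fixed task prompt \(t\). The prompt wording is my paraphrase, not the paper’s template.

```python
# Stage 1: predetermined-position debate (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def stage_one_debate(x_hat: str) -> dict[str, str]:
    """Return one argument per debater for the sample x_hat."""
    tasks = {
        "positive": "Argue that the response below is accurate and sufficient.",   # t for the positive debater
        "critical": "Argue that the response below is flawed and must be improved.",  # t for the critical debater
    }
    # Same sample, opposite fixed positions: this guarantees diverse viewpoints.
    return {side: call_llm(t, x_hat) for side, t in tasks.items()}
```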
Stage 2: Free Debate and Cross-Evaluation
In the second round, the restrictions are lifted. The agents review each other’s arguments from Stage 1. They cross-evaluate the plausibility of the opposing view. This step filters out hallucinations or weak arguments generated during the role-play in Stage 1, ensuring that the final feedback is reliable.
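A companion sketch of Stage 2, under the same assumptions: each debater is released from its fixed position and asked to assess the plausibility of the opposing Stage 1 argument, and everything is concatenated into the debate history \(G_{dbt}\) that the Advisor will read.

```python
# Stage 2: free debate and cross-evaluation (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def stage_two_cross_evaluation(x_hat: str, stage_one: dict[str, str]) -> str:
    """Return the full debate history G_dbt handed to the Advisor."""
    free_role = "Debate freely; you are no longer bound to a position."
    review_of_critic = call_llm(
        free_role,
        f"{x_hat}\nOpposing argument:\n{stage_one['critical']}\n"
        "Evaluate how plausible this criticism is and respond to it.")
    review_of_defender = call_llm(
        free_role,
        f"{x_hat}\nOpposing argument:\n{stage_one['positive']}\n"
        "Evaluate how plausible this defense is and respond to it.")
    # Both rounds together form the debate history G_dbt.
    return "\n\n".join([stage_one["positive"], stage_one["critical"],
                        review_of_critic, review_of_defender])
```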

Step 2: Advising and Editing
Once the dust settles on the debate, we have a rich history of arguments (\(G_{dbt}\)). However, raw debate transcripts are messy instructions for an editor. This is where the Advisor comes in.
The Advisor (\(A_{adv}\)) reads the debate history and summarizes the credible points into clear, actionable writing suggestions (\(h_{adv}\)).

Next, the Editor (\(A_{edt}\)) takes the original instruction, the original response, and the Advisor’s specific suggestions to craft a new, improved response (\(h_{edt}\)).
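Putting the two roles together, a minimal sketch using the symbols from the text (\(G_{dbt}\), \(h_{adv}\), \(h_{edt}\)); prompts and the `call_llm` helper are again illustrative assumptions.

```python
# Advise and Edit steps (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def advise_and_edit(instruction: str, response: str, debate_history: str) -> str:
    # Advisor A_adv: distill the credible points of G_dbt into actionable suggestions h_adv.
    h_adv = call_llm(
        "You are an advisor. Extract only credible, actionable writing suggestions.",
        debate_history)
    # Editor A_edt: rewrite the original response guided by h_adv, producing h_edt.
    h_edt = call_llm(
        "You are an editor. Revise the response according to the suggestions.",
        f"Instruction: {instruction}\nOriginal response: {response}\n"
        f"Suggestions:\n{h_adv}")
    return h_edt
```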

Step 3: The Judge and Iteration
We now have a candidate for a better response. But is it actually better? An LLM editor might hallucinate or make the text unnecessarily verbose.
The Judge (\(A_{jdg}\)) compares the original response (\(r\)) and the edited response (\(r'\)) side-by-side. To avoid position bias (where LLMs prefer the first option presented), the Judge evaluates them in both orders.

The system calculates a score based on the Judge’s decision:

If the new response is better (\(s(r') > s(r)\)), it replaces the old one, and the loop continues for another iteration (up to a maximum limit). If the new response is worse or equal, the loop stops, and the current best version is kept.
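The following sketch captures the Judge’s both-order comparison and the stopping rule. Counting wins across the two orderings is my assumption about the spirit of the score \(s(\cdot)\); the paper’s exact scoring formula may differ.

```python
# Judge step with position-bias mitigation (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def judge_once(instruction: str, first: str, second: str) -> str:
    """Ask the Judge which of two responses is better; returns 'first' or 'second'."""
    verdict = call_llm(
        "You are an impartial judge. Answer with 'first' or 'second'.",
        f"Instruction: {instruction}\nFirst response:\n{first}\nSecond response:\n{second}")
    return "second" if "second" in verdict.lower() else "first"

def prefers_edit(instruction: str, r: str, r_prime: str) -> bool:
    """Judge (r, r') in both orders; keep r' only if it wins strictly more often."""
    s_r, s_r_prime = 0, 0
    if judge_once(instruction, r, r_prime) == "second":
        s_r_prime += 1
    else:
        s_r += 1
    if judge_once(instruction, r_prime, r) == "first":
        s_r_prime += 1
    else:
        s_r += 1
    return s_r_prime > s_r   # s(r') > s(r): replace r and run another round
```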
Experimental Results
The theory sounds solid, but does it work? The researchers tested COEVOL by fine-tuning LLaMA-2-7B and Mistral-7B models on data evolved by their framework. They compared these against models trained on raw data and data selected by other high-performance methods like AlpaGasus.
Beating the Selector Models
The results on the Alpaca dataset were particularly telling.

In Table 1, look at the comparison between AlpaGasus2-7B and COEVOL-LLaMA2-7B.
- AlpaGasus used a sophisticated method to select the best 9,000 samples from the 52k Alpaca dataset.
- COEVOL took a random 9,000 samples and improved them.
The result? COEVOL significantly outperformed AlpaGasus on both MT-Bench (4.32 vs 2.86) and AlpacaEval (43.55% vs 8.38%). This suggests that refining ordinary samples can be more effective than carefully selecting the “best” existing ones.
Universality Across Models and Tasks
The researchers didn’t stop at LLaMA-2. They also tested the framework on Mistral-7B and applied it to both single-turn and multi-turn conversation datasets.

Table 2 shows that the gains hold up. Whether using ChatGPT or Mixtral as the agent backend, and whether the data was single-turn or multi-turn, COEVOL consistently boosted performance. The COEVOL-Mistral-7B-MIXTRAL model achieved an impressive 89.76% on AlpacaEval, surpassing the baseline DEITA model.
Why Is It Better?
To understand how the data changed, the authors analyzed the text statistics and the types of edits made.

Figure 2 reveals two key trends:
- Iterative Improvement: A significant share of samples went through one, two, or even three rounds of evolution, indicating that the Judge agent kept pushing for higher quality rather than accepting the first edit.
- Length and Detail: The evolved responses (Figure 2b) were significantly longer. In the context of instruction following, length often correlates with helpfulness—providing detailed explanations, examples, and context rather than curt answers.
The authors also visualized the direction of the evolution by analyzing the verbs used in the Advisor’s suggestions.

Figure 3 shows that the most common suggestions involved “providing,” “including,” “enhancing,” and “enriching.” The system wasn’t just fixing grammar; it was adding depth, examples, and explanations to the training data.
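For intuition, here is a small, self-contained analysis in the spirit of Figures 2 and 3: comparing response lengths before and after evolution, and counting the leading verbs in the Advisor’s suggestions. The input format (lists of strings and pairs) is an assumption about how one might log the pipeline’s outputs, not the authors’ released data format.

```python
# Toy analysis of evolved data: length gain and top suggestion verbs (illustrative only).
from collections import Counter

def length_gain(pairs: list[tuple[str, str]]) -> float:
    """Average word-count ratio of evolved vs. original responses."""
    ratios = [len(new.split()) / max(len(old.split()), 1) for old, new in pairs]
    return sum(ratios) / len(ratios)

def top_suggestion_verbs(suggestions: list[str], k: int = 5) -> list[tuple[str, int]]:
    """Naively take the first word of each suggestion line as its verb."""
    verbs = Counter()
    for text in suggestions:
        for line in text.splitlines():
            words = line.strip("-* ").split()
            if words:
                verbs[words[0].lower()] += 1
    return verbs.most_common(k)

# Example with toy data:
print(top_suggestion_verbs([
    "Provide a concrete example.\nInclude the relevant units.",
    "Enhance the explanation with context.\nEnrich the answer with an analogy.",
]))
print(length_gain([
    ("The largest star is VY Canis Majoris.",
     "The largest known star is VY Canis Majoris, a red hypergiant; "
     "if the Sun were a small grape, it would be a basketball."),
]))
```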
Case Study
Let’s look at a concrete example of how COEVOL changes a response.

In the first example of Table 4 (regarding the largest star in the galaxy), the baseline model gives a factual, dry answer. The COEVOL model, however, adds an analogy: “if the Sun were the size of a small grape, VY Canis Majoris would be the size of a basketball.”
This is the kind of helpful, human-like nuance that improves the user experience but is often missing from standard training data.
Conclusion and Implications
The COEVOL paper presents a compelling argument against the “garbage in, garbage out” mentality. It suggests that “garbage” (or at least “mediocrity”) can be recycled into gold through multi-agent cooperation.
Key takeaways for students and researchers:
- Don’t just delete bad data: With the right automated feedback loop, low-quality samples can be transformed into high-quality training signals.
- Debate drives quality: A single LLM critic is often insufficient. Forcing agents to take opposing sides (Debate) and then verify each other (Cross-Evaluation) yields much more reliable editing suggestions.
- Agents are the new annotators: As models get stronger, the pipeline for creating training data is moving away from human annotation and toward autonomous multi-agent systems.
By leveraging the latent capabilities of LLMs to critique and improve their own work, COEVOL offers a scalable path toward smarter, more helpful AI assistants.