If you have been following the evolution of Large Language Models (LLMs), you are likely familiar with the concept of Instruction Fine-Tuning (IFT). It is the crucial step that turns a raw, text-predicting base model into a helpful assistant capable of following user commands.
Recently, the research community has shifted its focus from “how much data do we need?” to “how good does the data need to be?” Papers like LIMA (Less Is More for Alignment) demonstrated that a small set of high-quality data often beats massive amounts of noisy data. This led to a gold rush of data selection methods—algorithms designed to sift through datasets and pick the “cherry” samples while discarding the “lemons.”
But what if we didn’t have to discard the lemons? What if the data we have isn’t inherently bad, but just… lazy?
In this post, we are diving into COEVOL, a fascinating framework presented in the paper Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation. The authors propose that we can drastically improve model performance not by finding better data, but by using multi-agent cooperation to evolve and edit the data we already have.
The Problem: The “Good Enough” Trap
Current methods for constructing IFT data often rely on LLMs themselves to generate instructions and responses (like Self-Instruct). While efficient, this approach has a flaw: LLMs, by their nature as causal language models, often default to the most probable, average-quality answer rather than the comprehensive, high-quality response they are actually capable of producing.
When we use these “average” responses to train new models, we are essentially teaching the student to be mediocre. Previous solutions focused on filtering these out. The COEVOL researchers argue that we are wasting potential. Instead of tossing out imperfect data, we should refine it.
The Solution: The COEVOL Framework
The core idea behind COEVOL is inspired by human editorial processes. If a writer produces a draft that is okay but lacks detail, an editor doesn’t just throw it in the trash. They critique it, debate the angle, suggest improvements, and revise it until it shines.
COEVOL implements this using a Multi-Agent Cooperation Framework. It employs five distinct LLM-based agents, each with a specific role, working in a loop to refine data.

As shown in Figure 1, the process is a cycle dubbed the Debate-Advise-Edit-Judge paradigm. Let’s break down the cast of characters (a rough sketch of how they fit together follows the list):
- Debaters (Positive & Critical): Two agents that argue about the quality of the current response.
- Advisor: An agent that synthesizes the debate and provides actionable writing suggestions.
- Editor: An agent that rewrites the response based on the advice.
- Judge: An agent that decides if the new response is actually better than the old one.
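To make the division of labor concrete, here is a minimal Python sketch of the outer refinement loop as I read it from the paper. The role prompts, the `call_llm` helper, and the three-round cap are illustrative assumptions, not the authors’ actual templates or API; the backend could be ChatGPT, Mixtral, or any chat model.

```python
# Sketch of the COEVOL cast and outer loop (assumed prompts, placeholder backend).

ROLES = {
    "positive_debater": "Argue that the current response is accurate and adequate.",
    "critical_debater": "Argue that the current response is flawed and needs improvement.",
    "advisor":          "Summarize the debate into concrete writing suggestions.",
    "editor":           "Rewrite the response following the advisor's suggestions.",
    "judge":            "Decide which of two responses better answers the instruction.",
}

def call_llm(role: str, prompt: str) -> str:
    """Stand-in for a chat-completion call that would use ROLES[role] as the system prompt."""
    raise NotImplementedError("plug in your LLM client (ChatGPT, Mixtral, ...) here")

def evolve(instruction: str, response: str, max_rounds: int = 3) -> str:
    """Debate -> Advise -> Edit -> Judge, repeated until the Judge stops preferring the edit."""
    for _ in range(max_rounds):
        debate = (call_llm("positive_debater", f"{instruction}\n{response}")
                  + "\n"
                  + call_llm("critical_debater", f"{instruction}\n{response}"))
        advice = call_llm("advisor", debate)
        edited = call_llm("editor", f"{instruction}\n{response}\nAdvice:\n{advice}")
        verdict = call_llm("judge", f"{instruction}\nA: {response}\nB: {edited}")
        if "B" not in verdict:      # edited version not judged better: stop evolving
            break
        response = edited           # keep the improvement and run another round
    return response
```

The sections below refine each of these stages in turn.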
Step 1: The Two-Stage Debate Strategy
One of the most innovative parts of COEVOL is how it handles feedback. Simply asking an LLM to “critique this” often leads to generic or sycophantic feedback. To solve this, the authors designed a two-stage debate strategy.
Stage 1: Predetermined-Position Debate
In the first round, the two debaters are assigned fixed positions.
- The Positive Debater must argue that the current response is accurate.
- The Critical Debater must argue that the response is flawed and needs improvement.
This ensures that the system generates a diverse set of viewpoints immediately, preventing an “echo chamber.”

In the paper’s notation, \(\hat{x}\) denotes the data sample and \(t\) the specific task prompt handed to each debater (e.g., “support this claim” or “refute this claim”).
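A small sketch of Stage 1, assuming the notation above: the same sample \(\hat{x}\) goes to both debaters, but each receives a different, fixed task prompt \(t\). The prompt wording is my paraphrase, not the paper’s template.

```python
# Stage 1: predetermined-position debate (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def stage_one_debate(x_hat: str) -> dict[str, str]:
    """Return one argument per debater for the sample x_hat."""
    tasks = {
        "positive": "Argue that the response below is accurate and sufficient.",   # t for the positive debater
        "critical": "Argue that the response below is flawed and must be improved.",  # t for the critical debater
    }
    # Same sample, opposite fixed positions: this guarantees diverse viewpoints.
    return {side: call_llm(t, x_hat) for side, t in tasks.items()}
```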
Stage 2: Free Debate and Cross-Evaluation
In the second round, the restrictions are lifted. The agents review each other’s arguments from Stage 1. They cross-evaluate the plausibility of the opposing view. This step filters out hallucinations or weak arguments generated during the role-play in Stage 1, ensuring that the final feedback is reliable.
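A companion sketch of Stage 2, under the same assumptions: each debater is released from its fixed position and asked to assess the plausibility of the opposing Stage 1 argument, and everything is concatenated into the debate history \(G_{dbt}\) that the Advisor will read.

```python
# Stage 2: free debate and cross-evaluation (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def stage_two_cross_evaluation(x_hat: str, stage_one: dict[str, str]) -> str:
    """Return the full debate history G_dbt handed to the Advisor."""
    free_role = "Debate freely; you are no longer bound to a position."
    review_of_critic = call_llm(
        free_role,
        f"{x_hat}\nOpposing argument:\n{stage_one['critical']}\n"
        "Evaluate how plausible this criticism is and respond to it.")
    review_of_defender = call_llm(
        free_role,
        f"{x_hat}\nOpposing argument:\n{stage_one['positive']}\n"
        "Evaluate how plausible this defense is and respond to it.")
    # Both rounds together form the debate history G_dbt.
    return "\n\n".join([stage_one["positive"], stage_one["critical"],
                        review_of_critic, review_of_defender])
```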

Step 2: Advising and Editing
Once the dust settles on the debate, we have a rich history of arguments (\(G_{dbt}\)). However, raw debate transcripts are messy instructions for an editor. This is where the Advisor comes in.
The Advisor (\(A_{adv}\)) reads the debate history and summarizes the credible points into clear, actionable writing suggestions (\(h_{adv}\)).

Next, the Editor (\(A_{edt}\)) takes the original instruction, the original response, and the Advisor’s specific suggestions to craft a new, improved response (\(h_{edt}\)).
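Putting the two roles together, a minimal sketch using the symbols from the text (\(G_{dbt}\), \(h_{adv}\), \(h_{edt}\)); prompts and the `call_llm` helper are again illustrative assumptions.

```python
# Advise and Edit steps (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def advise_and_edit(instruction: str, response: str, debate_history: str) -> str:
    # Advisor A_adv: distill the credible points of G_dbt into actionable suggestions h_adv.
    h_adv = call_llm(
        "You are an advisor. Extract only credible, actionable writing suggestions.",
        debate_history)
    # Editor A_edt: rewrite the original response guided by h_adv, producing h_edt.
    h_edt = call_llm(
        "You are an editor. Revise the response according to the suggestions.",
        f"Instruction: {instruction}\nOriginal response: {response}\n"
        f"Suggestions:\n{h_adv}")
    return h_edt
```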

Step 3: The Judge and Iteration
We now have a candidate for a better response. But is it actually better? An LLM editor might hallucinate or make the text unnecessarily verbose.
The Judge (\(A_{jdg}\)) compares the original response (\(r\)) and the edited response (\(r'\)) side-by-side. To avoid position bias (where LLMs prefer the first option presented), the Judge evaluates them in both orders.

The system calculates a score based on the Judge’s decision:

If the new response is better (\(s(r') > s(r)\)), it replaces the old one, and the loop continues for another iteration (up to a maximum limit). If the new response is worse or equal, the loop stops, and the current best version is kept.
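The following sketch captures the Judge’s both-order comparison and the stopping rule. Counting wins across the two orderings is my assumption about the spirit of the score \(s(\cdot)\); the paper’s exact scoring formula may differ.

```python
# Judge step with position-bias mitigation (assumed prompts, placeholder backend).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("LLM backend goes here")

def judge_once(instruction: str, first: str, second: str) -> str:
    """Ask the Judge which of two responses is better; returns 'first' or 'second'."""
    verdict = call_llm(
        "You are an impartial judge. Answer with 'first' or 'second'.",
        f"Instruction: {instruction}\nFirst response:\n{first}\nSecond response:\n{second}")
    return "second" if "second" in verdict.lower() else "first"

def prefers_edit(instruction: str, r: str, r_prime: str) -> bool:
    """Judge (r, r') in both orders; keep r' only if it wins strictly more often."""
    s_r, s_r_prime = 0, 0
    if judge_once(instruction, r, r_prime) == "second":
        s_r_prime += 1
    else:
        s_r += 1
    if judge_once(instruction, r_prime, r) == "first":
        s_r_prime += 1
    else:
        s_r += 1
    return s_r_prime > s_r   # s(r') > s(r): replace r and run another round
```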
Experimental Results
The theory sounds solid, but does it work? The researchers tested COEVOL by fine-tuning LLaMA-2-7B and Mistral-7B models on data evolved by their framework. They compared these against models trained on raw data and data selected by other high-performance methods like AlpaGasus.
Beating the Selector Models
The results on the Alpaca dataset were particularly telling.

In Table 1, look at the comparison between AlpaGasus2-7B and COEVOL-LLaMA2-7B.
- AlpaGasus used a sophisticated method to select the best 9,000 samples from the 52k Alpaca dataset.
- COEVOL took a random 9,000 samples and improved them.
The result? COEVOL significantly outperformed AlpaGasus on both MT-Bench (4.32 vs 2.86) and AlpacaEval (43.55% vs 8.38%). This suggests that refining ordinary samples can be more effective than carefully selecting the “best” existing ones.
Universality Across Models and Tasks
The researchers didn’t stop at LLaMA-2. They also tested the framework on Mistral-7B and applied it to both single-turn and multi-turn conversation datasets.

Table 2 shows that the gains hold up. Whether using ChatGPT or Mixtral as the agent backend, and whether the data was single-turn or multi-turn, COEVOL consistently boosted performance. The COEVOL-Mistral-7B-MIXTRAL model achieved an impressive 89.76% on AlpacaEval, surpassing the baseline DEITA model.
Why Is It Better?
To understand how the data changed, the authors analyzed the text statistics and the types of edits made.

Figure 2 reveals two key trends:
- Iterative Improvement: A significant share of samples went through one, two, or even three rounds of evolution, indicating that the Judge agent kept pushing for higher quality rather than accepting the first edit.
- Length and Detail: The evolved responses (Figure 2b) were significantly longer. In the context of instruction following, length often correlates with helpfulness—providing detailed explanations, examples, and context rather than curt answers.
The authors also visualized the direction of the evolution by analyzing the verbs used in the Advisor’s suggestions.

Figure 3 shows that the most common suggestions involved “providing,” “including,” “enhancing,” and “enriching.” The system wasn’t just fixing grammar; it was adding depth, examples, and explanations to the training data.
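For intuition, here is a small, self-contained analysis in the spirit of Figures 2 and 3: comparing response lengths before and after evolution, and counting the leading verbs in the Advisor’s suggestions. The input format (lists of strings and pairs) is an assumption about how one might log the pipeline’s outputs, not the authors’ released data format.

```python
# Toy analysis of evolved data: length gain and top suggestion verbs (illustrative only).
from collections import Counter

def length_gain(pairs: list[tuple[str, str]]) -> float:
    """Average word-count ratio of evolved vs. original responses."""
    ratios = [len(new.split()) / max(len(old.split()), 1) for old, new in pairs]
    return sum(ratios) / len(ratios)

def top_suggestion_verbs(suggestions: list[str], k: int = 5) -> list[tuple[str, int]]:
    """Naively take the first word of each suggestion line as its verb."""
    verbs = Counter()
    for text in suggestions:
        for line in text.splitlines():
            words = line.strip("-* ").split()
            if words:
                verbs[words[0].lower()] += 1
    return verbs.most_common(k)

# Example with toy data:
print(top_suggestion_verbs([
    "Provide a concrete example.\nInclude the relevant units.",
    "Enhance the explanation with context.\nEnrich the answer with an analogy.",
]))
print(length_gain([
    ("The largest star is VY Canis Majoris.",
     "The largest known star is VY Canis Majoris, a red hypergiant; "
     "if the Sun were a small grape, it would be a basketball."),
]))
```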
Case Study
Let’s look at a concrete example of how COEVOL changes a response.

In the first example of Table 4 (regarding the largest star in the galaxy), the baseline model gives a factual, dry answer. The COEVOL model, however, adds an analogy: “if the Sun were the size of a small grape, VY Canis Majoris would be the size of a basketball.”
This is the kind of helpful, human-like nuance that improves the user experience but is often missing from standard training data.
Conclusion and Implications
The COEVOL paper presents a compelling argument against the “garbage in, garbage out” mentality. It suggests that “garbage” (or at least “mediocrity”) can be recycled into gold through multi-agent cooperation.
Key takeaways for students and researchers:
- Don’t just delete bad data: With the right automated feedback loop, low-quality samples can be transformed into high-quality training signals.
- Debate drives quality: A single LLM critic is often insufficient. Forcing agents to take opposing sides (Debate) and then verify each other (Cross-Evaluation) yields much more reliable editing suggestions.
- Agents are the new annotators: As models get stronger, the pipeline for creating training data is moving away from human annotation and toward autonomous multi-agent systems.
By leveraging the latent capabilities of LLMs to critique and improve their own work, COEVOL offers a scalable path toward smarter, more helpful AI assistants.