FuseGen: How Collaborative AI Agents Generate Superior Training Data

In the current landscape of Artificial Intelligence, we are witnessing a “David and Goliath” dynamic. On one side, we have the “Goliaths”—massive Pre-trained Language Models (PLMs) like GPT-4, Llama-2, and Claude. These models are incredibly capable but computationally expensive, slow, and difficult to deploy on edge devices or in privacy-sensitive environments.

On the other side, we have the “Davids”—Small Task-specific Models (STMs). These are compact, efficient models (like BERT) that can run on a smartphone or a private server. The problem? Davids need training data—lots of it—to be effective. In many real-world scenarios, high-quality labeled data is scarce or non-existent.

This leads to a fascinating technique called Data-generation based Zero-shot Learning. The idea is simple: ask the Goliath (PLM) to write a textbook (synthetic dataset) to teach the David (STM). However, there is a catch. If a single PLM generates the data, the dataset inherits that specific model’s biases, blind spots, and lack of diversity.

Today, we are diving deep into FuseGen, a novel framework proposed by researchers at Tsinghua University and Shanghai AI Laboratory. FuseGen solves the single-teacher problem by creating a “committee” of PLMs that collaborate, critique, and learn from each other to generate superior synthetic datasets.

The Problem: The Bias of the Single Teacher

To understand why FuseGen is necessary, we first need to look at the limitations of current synthetic data generation.

In a standard setup, you might prompt a model like Llama-2 with: “Write a movie review with a positive sentiment.” You do this thousands of times to create a dataset, then train a small classifier on that text.
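
To make that baseline concrete, here is a minimal sketch of the single-PLM recipe using the Hugging Face text-generation pipeline. The model name, prompt wording, and sampling settings are placeholders, not the exact setup used by any of the methods discussed here.

```python
# Minimal single-PLM generation loop (ZeroGen-style baseline).
# "gpt2" stands in for a larger PLM; prompt and sampling settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

LABELS = ["positive", "negative"]
synthetic_dataset = []
for label in LABELS:
    prompt = f"Write a movie review with a {label} sentiment:\n"
    outputs = generator(prompt, max_new_tokens=60, num_return_sequences=3, do_sample=True)
    for out in outputs:
        # Keep only the continuation and record the intended label as supervision.
        text = out["generated_text"][len(prompt):].strip()
        synthetic_dataset.append({"text": text, "label": label})
```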

The issue is Distribution Bias. A single PLM tends to generate samples that fall into a specific, narrow probability distribution. It often produces samples that are “easy-to-learn”—stereotypical examples that don’t challenge the student model.

The authors of FuseGen visualized this problem beautifully using a technique called Dataset Cartography. This method maps data samples based on two metrics:

  1. Confidence: How sure the model is about the label.
  2. Variability: How much the model’s prediction fluctuates during training.

Based on these metrics, data is categorized into three types:

  • Easy-to-learn: High confidence, low variability. Good for convergence, but boring.
  • Ambiguous: High variability. These are the “Goldilocks” samples—tricky, nuanced examples that actually force the model to learn complex boundaries.
  • Hard-to-learn: Low confidence, low variability. Often these are mislabeled or garbage data.
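
If you want to compute these cartography metrics yourself, here is a minimal sketch, assuming you have logged the probability the model assigned to each sample's gold label at every training epoch. The region thresholds are illustrative, not the ones from the original Dataset Cartography work.

```python
# Sketch of Dataset Cartography metrics from logged per-epoch gold-label probabilities.
import numpy as np

# probs_per_epoch[e][i] = probability of sample i's gold label at epoch e (toy values)
probs_per_epoch = np.array([
    [0.90, 0.40, 0.20],
    [0.95, 0.60, 0.15],
    [0.92, 0.30, 0.25],
])

confidence = probs_per_epoch.mean(axis=0)   # how sure the model is, on average
variability = probs_per_epoch.std(axis=0)   # how much that belief fluctuates

for i, (c, v) in enumerate(zip(confidence, variability)):
    if v >= 0.1:
        region = "ambiguous"
    elif c > 0.7:
        region = "easy-to-learn"
    else:
        region = "hard-to-learn"
    print(f"sample {i}: confidence={c:.2f}, variability={v:.2f} -> {region}")
```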

Figure 1: Dataset Cartography comparing single-PLM methods (ZeroGen, ProGen) vs FuseGen.

Look at the difference between the distributions in Figure 1 above:

  • Figures 1(a) and 1(b): Llama-2 (top row) generates mostly “easy-to-learn” samples (the red region). The “Ambiguous” region (black) is sparse.
  • Figures 1(d) and 1(e): Flan-T5 (bottom row) has a different bias.
  • Figures 1(c) and 1(f): This is FuseGen. Notice how the distribution shifts? It successfully generates a rich mixture of easy and ambiguous samples.

The core insight here is that an ideal dataset shouldn’t just be massive; it should be difficult in the right way. It needs diversity that a single model struggles to provide on its own.

The Solution: The FuseGen Framework

FuseGen (Fusion Generation) is built on the premise that a diverse committee of models can cover each other’s blind spots. It is a multi-PLM framework that iteratively improves the quality of synthetic data without needing human labels.

The framework operates in two main phases:

  1. Cross-model Dataset Generation (CDG): The models generate data, critique it, and use the best examples to inspire the next batch.
  2. Cross-model Data Quality Improvement (CDI): A final filtering stage where the student model (STM) learns to ignore bad data.
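
Before breaking the phases down, here is the control flow as a bare skeleton. Every helper below is a made-up stub that only shows the shape of the loop; the real components are described in the sections that follow.

```python
# Bare-bones skeleton of the FuseGen control flow; all helpers are stubs.
def generate_batch(plm_name, in_context_examples):
    """One PLM writes a batch of synthetic (text, label) samples."""
    return [(f"review written by {plm_name}", 1)]

def train_temporary_stm(dataset):
    """Fit a throwaway small task model on one PLM's synthetic data."""
    return {"num_samples": len(dataset)}

def select_in_context_subset(datasets, stms, alpha=0.5):
    """Pick a balanced subset of easy and ambiguous samples via cross-model variability."""
    return [sample for dataset in datasets for sample in dataset][:8]

plm_names = ["llama-2", "flan-t5", "vicuna"]
in_context = []                                   # empty prompt context in round 0

# Phase 1: Cross-model Dataset Generation (CDG), repeated for a few rounds
for round_idx in range(3):
    datasets = [generate_batch(name, in_context) for name in plm_names]
    stms = [train_temporary_stm(d) for d in datasets]
    in_context = select_in_context_subset(datasets, stms)

# Phase 2: Cross-model Data Quality Improvement (CDI) re-weights the pooled data
# (SWA) and trains the final STM on it -- sketched later in this post.
pooled_dataset = [sample for dataset in datasets for sample in dataset]
```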

Let’s break down the architecture.

Figure 3: Illustrated Workflow of FuseGen showing the iterative generation loop.

Phase 1: Cross-model Dataset Generation (CDG)

This is the engine of the framework (Steps 1, 2, and 3 in Figure 3). The goal here is to select “In-Context Examples”—samples included in the prompt to guide the PLM—that are high-quality and diverse.

Step 1: Parallel Generation & Training

Imagine we have a cluster of \(K\) different PLMs (e.g., GPT-2, Llama-2, Vicuna, etc.). First, every PLM generates a batch of synthetic data using a simple prompt. We then train a temporary STM on each of these datasets, as sketched below.
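
Here is a minimal stand-in for this step, assuming each PLM has already produced a small labeled batch. A TF-IDF plus logistic-regression classifier plays the role of the STM purely to keep the sketch self-contained; the actual framework fine-tunes a small transformer such as BERT.

```python
# One throwaway "STM" per PLM's synthetic batch (toy data, toy classifier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

synthetic_batches = {
    "llama-2": [("A moving, beautifully acted film.", 1), ("Dull and overlong.", 0)],
    "flan-t5": [("An instant classic.", 1), ("I walked out halfway through.", 0)],
}

stms = {}
for plm_name, batch in synthetic_batches.items():
    texts, labels = zip(*batch)
    stm = make_pipeline(TfidfVectorizer(), LogisticRegression())
    stm.fit(list(texts), list(labels))      # one temporary STM per PLM dataset
    stms[plm_name] = stm
```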

Step 2: Cross-model Quality Evaluation

Now comes the “fusion.” We need to decide which samples from this massive pool of generated text are actually good. FuseGen uses cross-model criteria.

Instead of asking one model if a sample is good, FuseGen looks at the variability of predictions across all the trained STMs. If different models (trained on data from different PLMs) disagree on a sample’s label, that sample typically has high “information value.”

The variability \(d_{k,i}\) for a sample is calculated as the standard deviation of the predicted probabilities across the different models:

\[
d_{k,i} \;=\; \sqrt{\frac{1}{K}\sum_{k'=1}^{K}\Big(p_{k'}(y_{k,i}\mid x_{k,i}) - \bar{p}_{k,i}\Big)^{2}},
\qquad
\bar{p}_{k,i} \;=\; \frac{1}{K}\sum_{k'=1}^{K} p_{k'}(y_{k,i}\mid x_{k,i})
\]

Here, \(p_{k'}(y_{k,i} \mid x_{k,i})\) is the probability that the \(k'\)-th STM assigns to the label of the \(i\)-th sample generated by PLM \(k\), and \(\bar{p}_{k,i}\) is its average across the \(K\) STMs.

  • High Variability: The committee disagrees. This is likely an “Ambiguous” sample—very valuable for training.
  • Low Variability: The committee agrees. This is likely an “Easy-to-learn” sample.

FuseGen selects a subset of samples that mixes these two categories (controlled by a hyperparameter \(\alpha\)). This ensures the next round of generation includes prompts that are both stabilizing (easy) and challenging (ambiguous).
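
A rough sketch of this criterion, reusing the toy STMs from the Step 1 sketch: variability is the standard deviation of a sample's predicted probability across the committee, and \(\alpha\) controls the mix of ambiguous and easy picks. The indexing is simplified relative to the paper's \(d_{k,i}\) notation.

```python
# Cross-model variability and an alpha-controlled subset selection (toy version).
import numpy as np

pool = [sample for batch in synthetic_batches.values() for sample in batch]
texts = [text for text, _ in pool]

# probs[k, i] = probability STM k assigns to the positive class for sample i
probs = np.array([stm.predict_proba(texts)[:, 1] for stm in stms.values()])
variability = probs.std(axis=0)              # committee disagreement per sample

alpha, subset_size = 0.5, 2                  # alpha = share of high-variability picks
by_variability = np.argsort(-variability)    # most ambiguous first
n_high = int(alpha * subset_size)
chosen = list(by_variability[:n_high]) + list(by_variability[::-1][:subset_size - n_high])
in_context_examples = [pool[i] for i in chosen]
```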

Step 3: Cross-PLM In-context Learning

Once the best samples are identified from the collective pool, they are used as in-context examples for the next round of generation.

Crucially, Model A might receive a prompt containing excellent examples generated by Model B and Model C. This allows knowledge transfer between models without them ever directly communicating parameters. They communicate through data.
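
Continuing the same toy sketch, here is roughly what that cross-PLM prompting step might look like; the template wording is mine, not the paper's.

```python
# The selected examples (whichever PLM wrote them) are folded into every PLM's
# prompt for the next generation round.
def build_prompt(examples, target_label):
    lines = ["Here are some movie reviews and their sentiment:"]
    for text, label in examples:
        sentiment = "positive" if label == 1 else "negative"
        lines.append(f"Review ({sentiment}): {text}")
    lines.append(f"Now write a new movie review with a {target_label} sentiment:")
    return "\n".join(lines)

# Model A's next prompt may carry examples originally written by Models B and C.
next_round_prompt = build_prompt(in_context_examples, "positive")
print(next_round_prompt)
```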

Figure 8: Bar charts showing how different PLMs contribute to the selected subsets over time.

Figure 8 illustrates this cross-pollination. It shows the source of the samples selected for in-context learning. As the rounds progress (0 to 4), you can see that the selected “best” samples come from a mix of all available PLMs (represented by different colors). No single model dominates the “good ideas” pool.

Phase 2: Cross-model Data Quality Improvement (CDI)

After several rounds of iterative generation, we have a large, diverse dataset. However, synthetic data is noisy. Even the best PLMs hallucinate or produce irrelevant text.

To fix this, FuseGen employs Self-boosting Weight Adjustment (SWA).

Instead of just training the final STM on the raw data, FuseGen assigns a weight \(w\) to every sample. It trains the model, checks which samples the model struggles with or predicts easily, and adjusts the weights accordingly.

The weight update rule is inspired by boosting algorithms (like AdaBoost):

Equation for weight adjustment based on prediction error.

In simple terms:

  1. The STM makes a prediction on the synthetic data.
  2. We look at the error.
  3. We adjust the weight \(w_{k,i}\) for the next epoch.
  4. Correctly predicted samples generally see their influence increased (confirming the signal), while confusing or likely mislabeled data (where the model oscillates or fails consistently in specific patterns) is down-weighted.

Note on the equation: The logic ensures that the model focuses on high-quality, reliable data while suppressing the noise inherent in zero-shot generation.

Finally, the STM is trained using a weighted loss function:

Equation for the final weighted loss function.
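
Here is a short sketch of how per-sample weights enter training. The weights are hand-picked for illustration, and the update at the end only mirrors the direction described above (reliable samples up, suspected noise down), not the paper's exact SWA rule.

```python
# Weighted training step plus an illustrative-only weight update.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2, requires_grad=True)   # STM outputs for 4 synthetic samples
labels = torch.tensor([0, 1, 1, 0])              # their PLM-assigned labels
weights = torch.tensor([1.2, 1.0, 0.3, 0.9])     # low weight = suspected noise

per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
weighted_loss = (weights * per_sample_loss).sum() / weights.sum()
weighted_loss.backward()                         # gradients now reflect the weights

with torch.no_grad():
    p_true = logits.softmax(dim=-1)[torch.arange(4), labels]
    # Nudge up samples the STM handles confidently, damp suspected label noise.
    weights *= torch.where(p_true > 0.5,
                           torch.full_like(p_true, 1.1),
                           torch.full_like(p_true, 0.9))
```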

Why “Naive Mixing” Fails

You might be wondering: “Why go through all this trouble? Why not just ask 6 different models to generate data, mix it all into one big pile, and train the STM?”

The researchers tested this hypothesis. They called it the “mixed” approach.

Figure 2: Performance comparison showing FuseGen vs Naive Mixing.

Figure 2 shows the results. The bar labeled “mixed” (combining data from 6 PLMs) often performs worse than the best single PLM (e.g., Flan-T5).

Why? Because mixing piles of garbage with piles of gold just gives you a larger pile of dirty gold. You haven’t filtered the noise; you’ve just accumulated more of it. FuseGen (the bottom bar) consistently outperforms the “mixed” strategy because it actively selects for quality and re-weights to ignore noise.

Experimental Results

The researchers evaluated FuseGen on a wide variety of tasks, including movie reviews (IMDb, SST-2), news classification (AgNews), and natural language inference (QNLI, MNLI). They used a mix of open-source models (Llama-2, Vicuna, OPT) and closed-source models (GPT-3.5, GPT-4).

Performance on Open-Source Models

The comparison below shows the classification accuracy of the final STM (a BERT-base model).

  • ZeroGen: Baseline method (Single PLM, standard prompting).
  • ProGen: Progressive generation (Single PLM, feedback loop).
  • FuseGen: The proposed multi-PLM method.

Table 1: Main experimental results comparing FuseGen to baselines.

Table 1 reveals consistent victories for FuseGen.

  • On IMDb, FuseGen reaches 90.06%, beating the best single-model performance (Flan-T5 at 87.06%).
  • On AgNews, FuseGen hits 86.89%, surpassing all competitors.
  • Crucially, FuseGen is PLM-agnostic. You don’t need to know which PLM is best for your specific task beforehand. FuseGen automatically leverages the strengths of the best model in the cluster.

What about GPT-4?

Does this only work for smaller open-source models? The researchers tested FuseGen by fusing GPT-3.5 and GPT-4.

Table 2: Results on closed-source PLMs (GPT-3.5 and GPT-4).

As shown in Table 2, even with powerful models like GPT-4, the fusion approach (FuseGen) yields higher accuracy (56.56% on QNLI) compared to relying on GPT-4 alone (55.76% with ProGen). Collaboration helps even the giants.

Visualizing the Data Quality

To see the difference in the generated data directly, the researchers used t-SNE, a technique for projecting high-dimensional text representations into 2D space.

Figure 7: t-SNE visualization of synthetic samples.

In Figure 7:

  • Blue dots are positive reviews; Orange dots are negative reviews.
  • In (a) ZeroGen, the clusters are somewhat messy and overlapping in unhelpful ways (noise).
  • In (c) FuseGen, notice the structure. The core clusters are dense (easy samples), but the boundary between blue and orange is richer. FuseGen generates samples that better populate the decision boundary—the exact area where a machine learning model needs to practice to get smart.
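
If you want to reproduce this kind of view on your own synthetic data, here is a minimal sketch using TF-IDF features and toy sentences so it runs as-is; any sentence embedding would slot in the same way.

```python
# t-SNE projection of synthetic reviews, colored by label (toy example).
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

texts = ["A moving, beautifully acted film.", "Dull and overlong.",
         "An instant classic.", "I walked out halfway through.",
         "Gorgeous visuals, but a hollow story.", "Flawed but oddly charming."]
labels = [1, 0, 1, 0, 0, 1]

features = TfidfVectorizer().fit_transform(texts).toarray()
coords = TSNE(n_components=2, perplexity=2.0, init="random",
              random_state=0).fit_transform(features)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm")
plt.title("t-SNE of synthetic reviews (toy example)")
plt.show()
```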

Ablation: What Matters Most?

The paper includes extensive ablation studies to determine which components of FuseGen drive the performance.

Figure 5: Ablation study graphs showing the effect of hyperparameters.

Figure 5 provides three key insights:

  1. Graph (a) - The \(\alpha\) ratio: The x-axis represents the ratio of “high-variability” samples. The performance peaks at 0.5. This confirms the hypothesis: a dataset shouldn’t be all hard questions (ambiguous) nor all easy questions. A 50/50 mix is optimal.
  2. Graph (b) - Sample Size (\(N\)): More data is better, but FuseGen achieves high performance even with relatively small synthetic datasets (\(N=1000\)).
  3. Graph (c) - Feedback Rounds (\(J\)): More iterations of the feedback loop lead to better results, as the models have more opportunities to refine their prompts based on cross-model feedback.

Additionally, looking at the ablation table below:

Table 4: Ablation study removing SWA and CDG components.

Table 4 shows that removing the Self-boosting Weight Adjustment (w/o SWA) causes a performance drop. Removing the Cross-model Dataset Generation (w/o CDG & SWA) causes a massive drop. This confirms that the collaborative generation phase is the most critical contributor to success.

Conclusion and Implications

FuseGen represents a significant step forward in Zero-Shot Learning. It moves us away from the paradigm of finding the “one perfect model” to generate data, and toward a paradigm of Model Collaboration.

By allowing models to critique each other via prediction variability and teach each other via in-context prompting, FuseGen creates synthetic datasets that are:

  1. Less Biased: It smooths out the distribution biases of individual PLMs.
  2. More Diverse: It captures a wider range of linguistic expression.
  3. Higher Quality: It actively filters out noise and low-quality hallucinations.

For students and practitioners, the takeaway is clear: STMs (Small Task Models) are not dead. In fact, they are becoming more accessible than ever. With frameworks like FuseGen, you can leverage the intelligence of massive proprietary models to train efficient, private, runs-anywhere models—without labeling a single piece of data yourself.

The future of AI training might not be humans teaching machines, but machines teaching machines.


This blog post is based on the research paper “FuseGen: PLM Fusion for Data-generation based Zero-shot Learning” by Tianyuan Zou et al.