The era of Large Language Models (LLMs) has revolutionized how we approach machine learning. We have moved from a scarcity mindset—where labeled data was expensive and rare—to an abundance mindset, where models like GPT-4 or Mixtral can generate text at virtually unlimited scale. This has given rise to Knowledge Distillation: using a massive “Teacher” LLM to generate synthetic datasets, which are then used to train smaller, efficient “Student” models (like BERT or DistilBERT) for specific tasks.
However, there is a ghost in the machine. While LLMs are fluent, they are also prone to mode collapse. When asked to generate 1,000 movie reviews, an LLM tends to reuse the same tropes, vocabulary, and sentence structures. It regurgitates its most probable paths. The resulting dataset might be large, but it lacks the diversity and richness of human-generated data. A student model trained on this repetitive data often fails to generalize.
In this post, we will explore CorrSynth, a novel research contribution from Amazon researchers (Kowshik et al.) that tackles this exact problem. CorrSynth introduces a mathematical framework for correlated sampling, forcing an LLM to generate diverse, distinct examples by “contrasting” them against each other in real-time.
The Problem: Independent Sampling and Homogeneity
To understand why synthetic data often fails, we must look at how it is generated. The standard approach is Few-Shot Generation (FewGen).
In FewGen, you provide the LLM with a prompt and a few examples (In-Context Learning), and ask it to generate a new instance. If you need a dataset of 1,000 examples, you simply repeat this process 1,000 times. Crucially, each generation is independent. The model doesn’t know what it generated in iteration 5 when it is generating iteration 6.
Mathematically, if we want to generate a sequence \(\mathbf{w}\) given a prompt, the probability is factorized autoregressively:
\[
P(\mathbf{w} \mid \text{prompt}) = \prod_{i=1}^{|\mathbf{w}|} P(w_i \mid \text{prompt}, \mathbf{w}_{<i})
\]
Because every example is sampled from the same \(P(\mathbf{w}|\text{prompt})\), and the prompt remains largely static, the model gravitates toward the same high-probability modes of its output distribution again and again. This results in “clumped” data: thousands of examples that all sound vaguely similar.
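To make the independence concrete, here is a minimal sketch of the FewGen loop, assuming a Hugging Face-style causal LM and tokenizer (the function name `fewgen` and the decoding details are illustrative, not from the paper). Note that nothing generated in one iteration influences the next:

```python
import torch

def fewgen(model, tokenizer, prompt, n_examples, max_len=128):
    """Standard FewGen: n_examples fully independent draws from one static prompt."""
    dataset = []
    for _ in range(n_examples):  # each iteration is independent of all others
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(max_len):
            logits = model(ids).logits[:, -1, :]          # next-token logits
            next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
        dataset.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return dataset
```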
The Solution: CorrSynth
The researchers propose a shift from independent sampling to correlated sampling. Instead of generating one example at a time in isolation, CorrSynth generates a batch of examples simultaneously and forces them to interact.
The core intuition is simple: If I am generating Example A and Example B at the same time, I should ensure that Example A is distinct from Example B.
*Figure 1: Independent few-shot generation (left) versus correlated generation in CorrSynth (right).*
As shown in Figure 1 above, in standard Few-Shot generation (left), inputs are processed independently. In CorrSynth (right), the generation of one example is mathematically influenced by the others. This introduces an anti-correlation between examples, pushing them apart in the semantic space.
The Mathematics of Contrast
How does this work at the token level? CorrSynth modifies the probability distribution of the next token by introducing a “contrast” term.
Let’s look at the binary classification case. Suppose we want to generate two examples simultaneously:
- \(\mathbf{x}\) for label \(\mathbf{y}\) (e.g., “Positive Sentiment”)
- \(\bar{\mathbf{x}}\) for label \(\bar{\mathbf{y}}\) (e.g., “Negative Sentiment”)
CorrSynth samples the next token for \(\mathbf{x}\) by boosting tokens that are probable for \(\mathbf{y}\) but penalizing tokens that are probable for \(\bar{\mathbf{y}}\).
\[
P_{CS}(\cdot \mid \mathbf{x}_{<i}) \propto \frac{P(\cdot \mid \text{prompt}(\mathbf{y}), \mathbf{x}_{<i})^{\gamma}}{P(\cdot \mid \text{prompt}(\bar{\mathbf{y}}), \bar{\mathbf{x}}_{<i})^{\gamma - \delta}}, \qquad
P_{CS}(\cdot \mid \bar{\mathbf{x}}_{<i}) \propto \frac{P(\cdot \mid \text{prompt}(\bar{\mathbf{y}}), \bar{\mathbf{x}}_{<i})^{\gamma}}{P(\cdot \mid \text{prompt}(\mathbf{y}), \mathbf{x}_{<i})^{\gamma - \delta}}
\]
In these equations:
- The Numerator \(P(\cdot|\text{prompt}(\mathbf{y}), \mathbf{x}_{< i})^{\gamma}\) encourages the model to follow the correct prompt (Positive).
- The Denominator \(P(\cdot|\text{prompt}(\bar{\mathbf{y}}), \bar{\mathbf{x}}_{< i})^{\gamma - \delta}\) penalizes tokens that are highly probable for the competing sequence (Negative).
- \(\gamma\) controls the overall guidance strength.
- \(\delta\) controls the strength of the contrast.
By dividing by the probability of the other sequence, we are essentially telling the model: “Choose words that make sense for a Positive review, but avoid words that the Negative review is currently likely to use.” This forces the two sequences to diverge, ensuring better class separation.
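In log space this ratio becomes a weighted difference of two next-token logit vectors, which makes it easy to implement. Below is a minimal sketch of one decoding step for the Positive sequence, under the same Hugging Face-style assumptions as before (`logits_pos` and `logits_neg` are the model's next-token logits given the Positive prompt/history and the Negative prompt/history, respectively):

```python
import torch

def corrsynth_binary_step(logits_pos, logits_neg, gamma=1.0, delta=0.5):
    """One CorrSynth step: sample from P(.|pos)^gamma / P(.|neg)^(gamma - delta).

    Taking logs turns the ratio into a weighted difference; the final
    softmax renormalizes the result into a valid distribution.
    """
    logp_pos = torch.log_softmax(logits_pos, dim=-1)
    logp_neg = torch.log_softmax(logits_neg, dim=-1)
    contrast = gamma * logp_pos - (gamma - delta) * logp_neg
    return torch.multinomial(torch.softmax(contrast, dim=-1), 1)
```

The symmetric step for the Negative sequence simply swaps the two arguments.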
Generalized M-Way Contrast
The method scales beyond binary classification. If we are generating \(M\) different sequences simultaneously (which could be different classes, or different examples of the same class), the formula generalizes.
The probability of the next token for the \(m\)-th sequence is conditioned on its own history, but normalized by the probabilities of all other \(M-1\) sequences currently being generated.
\[
P_{CS}^{(m)}(\cdot \mid \mathbf{x}^{(m)}_{<i}) \propto \frac{P(\cdot \mid \text{prompt}(y_m), \mathbf{x}^{(m)}_{<i})^{\gamma}}{\left[\prod_{k \neq m} P(\cdot \mid \text{prompt}(y_k), \mathbf{x}^{(k)}_{<i})\right]^{\frac{\gamma - \delta}{M-1}}}
\]
The researchers interpret this denominator as a Geometric Mean of the competing distributions. The model is contrasted against the “average” behavior of the other sequences.
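Here is a vectorized sketch of the M-way step, under the same assumptions as the snippets above. `logits` holds one row per parallel sequence (each row computed under that sequence's own prompt and history); the geometric mean over the other rows becomes an arithmetic mean in log space:

```python
import torch

def corrsynth_mway_step(logits, gamma=1.0, delta=0.5):
    """M-way CorrSynth: contrast each sequence against the geometric mean of the rest.

    logits: (M, vocab) next-token logits, one row per parallel sequence.
    Returns (M, 1) sampled token ids, one per sequence.
    """
    M = logits.shape[0]
    logp = torch.log_softmax(logits, dim=-1)       # (M, vocab)
    total = logp.sum(dim=0, keepdim=True)          # sum over all sequences
    gm_others = (total - logp) / (M - 1)           # log geometric mean of the other M-1
    contrast = gamma * logp - (gamma - delta) * gm_others
    return torch.multinomial(torch.softmax(contrast, dim=-1), 1)
```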

Intra-Label vs. Cross-Label Diversity
A key innovation of CorrSynth is how it defines “competing” sequences. The researchers introduce three strategies:
- Cross-Label Contrast: You generate a Positive review and a Negative review simultaneously. The contrast ensures they are semantically distinct. This improves Class Separability.
- Intra-Label Contrast: You generate two Positive reviews simultaneously. By contrasting them against each other, you force the model to find two different ways to express “Positive.” This improves Lexical Diversity.
- Hybrid: You do both. You generate a batch containing multiple Positive and multiple Negative reviews, contrasting every sequence against every other sequence.
The Hybrid approach is mathematically represented by splitting the denominator into two geometric means: one for the same class (\(GM_{intra}\)) and one for other classes (\(GM_{cross}\)).
\[
P_{CS}^{(m)}(\cdot) \propto \frac{P(\cdot \mid \text{prompt}(y_m), \mathbf{x}^{(m)}_{<i})^{\gamma}}{GM_{intra}(\cdot)^{\alpha}\; GM_{cross}(\cdot)^{\beta}}, \qquad \alpha + \beta = \gamma - \delta
\]

where \(\alpha\) and \(\beta\) allocate the contrast budget between same-label and cross-label sequences.
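Continuing the sketch from the M-way step, the hybrid variant only changes how the denominator is averaged: rows sharing a label feed \(GM_{intra}\), the rest feed \(GM_{cross}\), with the \(\alpha\)/\(\beta\) weights from the formula above. This split is my reading of the paper's description, so treat the exact weighting as an assumption; the sketch also assumes every label appears at least twice in the batch and at least two labels are present:

```python
import torch

def corrsynth_hybrid_step(logits, labels, gamma=1.0, alpha=0.25, beta=0.25):
    """Hybrid CorrSynth: separate intra-label and cross-label geometric means.

    logits: (M, vocab) next-token logits; labels: (M,) class id per sequence.
    alpha + beta plays the role of (gamma - delta) in the M-way formula.
    """
    logp = torch.log_softmax(logits, dim=-1)
    contrast = torch.empty_like(logp)
    for m in range(labels.shape[0]):
        same = labels == labels[m]
        same[m] = False                                   # exclude the sequence itself
        gm_intra = logp[same].mean(dim=0)                 # log GM over same-label peers
        gm_cross = logp[labels != labels[m]].mean(dim=0)  # log GM over other labels
        contrast[m] = gamma * logp[m] - alpha * gm_intra - beta * gm_cross
    return torch.multinomial(torch.softmax(contrast, dim=-1), 1)
```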
Why Not Just Use Classifier-Free Guidance (CFG)?
Students familiar with diffusion models or advanced LLM sampling might ask: “Isn’t this just Classifier-Free Guidance (CFG)?”
CFG is a popular technique where a model’s output is guided by contrasting a conditioned prompt against an unconditioned one. While similar in spirit, CFG has two major flaws when applied to generating long synthetic datasets:
1. Signal Decay: In CFG, both the positive and the negative branch condition on the same generated history; you feed the sequence generated so far into the “negative” prompt to compute the contrast. As the sequence grows, this shared history dominates the static prompts, the two contexts become nearly identical, and the guidance signal vanishes.
2. Computational Cost: CFG requires running the model twice for every single token (once for the positive prompt, once for the negative). CorrSynth, however, generates its sequences in parallel, so the contrasts come from work it is doing anyway (see the formula comparison below).
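For reference, here is CFG's guided distribution in its common LLM formulation, written to parallel the CorrSynth equations above. Both numerator and denominator condition on the same generated history \(\mathbf{x}_{<i}\), which is exactly the source of the decay problem:

\[
P_{CFG}(\cdot \mid \mathbf{x}_{<i}) \propto \frac{P(\cdot \mid \text{prompt}(\mathbf{y}), \mathbf{x}_{<i})^{\gamma}}{P(\cdot \mid \varnothing, \mathbf{x}_{<i})^{\gamma - 1}}
\]

Contrast this with the CorrSynth equations, where the denominator conditions on a different history \(\bar{\mathbf{x}}_{<i}\) that is actively diverging from \(\mathbf{x}_{<i}\).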
The researchers demonstrated the “Signal Decay” problem empirically. In Figure 2 below, they plot the difference between the positive and negative logits over time.
*Figure 2: Difference between positive and negative logits over generation steps, for CFG (left) and CorrSynth (right).*
In the left panel (CFG), the difference (colored lines) stays low and flat: the contrast signal is weak. In the right panel (CorrSynth), the difference spikes and stays high. Because CorrSynth contrasts against a different, parallel generation (which is actively diverging), the contrast signal remains strong throughout the entire paragraph.
Furthermore, regarding efficiency: for a K-class dataset, CFG needs an extra negative-prompt forward pass for every sequence it generates, and those negative passes cannot be shared across sequences. CorrSynth needs no extra passes at all, because the parallel sequences serve as each other's contrasts, so everything happens in a single batched inference step.
Experimental Results
The researchers evaluated CorrSynth on four datasets: AG News (Topic), ToI Headlines (Topic), Humor (Binary), and IMDb (Sentiment). They used Phi-3 Mini and Mixtral as teacher models and DistilBERT as the student.
1. Intrinsic Quality: Diversity and Separability
To visualize how CorrSynth affects the generated data, the researchers used heatmaps to show the cosine similarity between classes.
In an ideal dataset, the diagonal (self-similarity) should be high, and the off-diagonal (cross-class similarity) should be low.
*Heatmaps: pairwise cosine similarity between classes, for FewGen (left) and CorrSynth with increasing contrast parameters (right).*
- Left (FewGen): We see significant “bleeding” of color off the diagonal. The classes are somewhat muddled.
- Right (CorrSynth): As the contrast parameters increase (moving right), the off-diagonal elements become lighter (lower similarity). The classes are being pushed apart in the embedding space, making it easier for a student model to learn the decision boundary.
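To reproduce this kind of heatmap on your own synthetic data, one simple proxy (not necessarily the paper's exact protocol) is the cosine similarity between class-centroid embeddings. A sketch using sentence-transformers; the encoder name is an arbitrary choice:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def class_similarity_matrix(texts, labels):
    """Cosine similarity between the mean embedding of each class."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")     # assumed encoder choice
    emb = encoder.encode(texts, normalize_embeddings=True)
    labels = np.array(labels)
    classes = sorted(set(labels.tolist()))
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return classes, centroids @ centroids.T               # (K, K), diagonal = 1.0
```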
2. Cluster Analysis (UMAP)
Visualizing the data in 2D space using UMAP further illustrates the power of the \(\delta\) (contrast strength) parameter.
*UMAP projections of generated examples as contrast strength \(\delta\) decreases from 0.9 (far left) to 0.0 (far right).*
- Far Right (\(\delta=0.0\) / No Contrast): The clusters (representing different topics) are overlapping and messy.
- Far Left (\(\delta=0.9\) / High Contrast): The clusters are distinct and tight. This separation reduces “hard negatives” and ambiguous examples that confuse student models.
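A matching sketch for the 2D view, assuming umap-learn and the embeddings from the previous snippet; coloring the resulting points by label makes the effect of \(\delta\) visible at a glance:

```python
import umap  # pip install umap-learn

def project_2d(embeddings, n_neighbors=15, min_dist=0.1, seed=0):
    """Project high-dimensional text embeddings to 2D for cluster inspection."""
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=seed)
    return reducer.fit_transform(embeddings)  # (N, 2) coordinates
```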
3. Student Model Performance
Ultimately, the goal is to train a better student model. The researchers compared CorrSynth against standard FewGen and several state-of-the-art baselines (like ReGen and SunGen).
Key metrics included:
- Accuracy: How well the student performs on real test data.
- MAUVE: A measure of how close the synthetic text distribution is to human text.
- Self-BLEU: A measure of diversity (lower is better, indicating less repetition); a computation sketch follows this list.
- Entity-Entropy: Measuring the variety of named entities generated (higher is better).
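Self-BLEU has a few published variants; the sketch below is one common reading, scoring each sampled text against all the others as references using NLTK (the sampling cap is just to keep runtime reasonable):

```python
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(texts, sample_size=200, seed=0):
    """Average BLEU of each sampled text against the rest (lower = more diverse)."""
    random.seed(seed)
    smooth = SmoothingFunction().method1
    tokenized = [t.split() for t in texts]
    indices = random.sample(range(len(tokenized)), min(sample_size, len(tokenized)))
    scores = []
    for i in indices:
        references = tokenized[:i] + tokenized[i + 1:]   # every other text
        scores.append(sentence_bleu(references, tokenized[i],
                                    smoothing_function=smooth))
    return 100 * sum(scores) / len(scores)               # 0-100 scale, as reported below
```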
The Findings: CorrSynth consistently outperformed baselines. For example, on the AG News dataset with a Phi-3 teacher:
- FewGen Accuracy: 83.8%
- CorrSynth Accuracy: 85.1%
- Self-BLEU (Diversity): Dropped from 33.9 (repetitive) to 12.1 (diverse).
The improvement in Entity Entropy was particularly notable. FewGen tends to mention the same famous entities (e.g., “Google”, “Apple”) repeatedly. CorrSynth, by penalizing the most probable tokens of parallel sequences, forces the model to dig deeper into its vocabulary and retrieve “tail entities” (less common companies, places, or names), resulting in a much richer dataset.
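Entity entropy can be estimated with any off-the-shelf NER pipeline; here is a sketch using spaCy (the `en_core_web_sm` model is an assumed choice), computing the Shannon entropy of the named-entity frequency distribution:

```python
import math
from collections import Counter

import spacy  # pip install spacy && python -m spacy download en_core_web_sm

def entity_entropy(texts):
    """Shannon entropy of named-entity mentions (higher = more varied entities)."""
    nlp = spacy.load("en_core_web_sm")
    counts = Counter(ent.text.lower() for doc in nlp.pipe(texts) for ent in doc.ents)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```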
4. Learning Dynamics (Datamaps)
Finally, the researchers analyzed how the student learns using Dataset Cartography. This technique maps training examples based on the model’s confidence and the variability of that confidence during training.
*Datamaps: student confidence versus variability during training, for FewGen (left) and CorrSynth (middle and right).*
In the FewGen map (left), we see a concentration of “easy” examples (high confidence, low variability). In the CorrSynth maps (middle and right), the distribution spreads out: CorrSynth generates more “ambiguous” and “hard-to-learn” examples (the middle region). In curriculum-learning terms, these “hard” examples are often the most valuable for driving generalization, as they force the model to learn robust features rather than shallow heuristics.
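The cartography coordinates are straightforward to compute if you log the probability the student assigns to the gold label for every training example at every epoch. A minimal sketch, assuming `gold_probs` is an (epochs, examples) array of those logged probabilities:

```python
import numpy as np

def datamap_coords(gold_probs):
    """Dataset-cartography coordinates from per-epoch gold-label probabilities.

    gold_probs: (n_epochs, n_examples) probability assigned to the true label.
    Returns (confidence, variability): mean and std across training epochs.
    High confidence / low variability = "easy"; mid confidence with high
    variability = "ambiguous"; low confidence = "hard-to-learn".
    """
    confidence = gold_probs.mean(axis=0)
    variability = gold_probs.std(axis=0)
    return confidence, variability
```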
Conclusion and Implications
CorrSynth represents a significant step forward in synthetic data generation. It acknowledges a fundamental weakness in LLMs—their tendency to revert to the mean—and counters it with a geometric framework that forces diversity.
Key Takeaways:
- Correlated Sampling works: Generating examples in batches and contrasting them against each other prevents mode collapse.
- Efficiency matters: Unlike CFG, CorrSynth maintains a strong guidance signal over long sequences and is computationally efficient due to parallelization.
- Diversity equals Accuracy: By optimizing for lexical diversity (Intra-contrast) and class separability (Cross-contrast), we produce datasets that train significantly better student models.
As we continue to rely on synthetic data to train the next generation of AI models, techniques like CorrSynth will be essential to ensure that our AI “students” are learning from diverse, high-quality textbooks, rather than reading the same page over and over again.