In the modern era of Natural Language Processing (NLP), we are witnessing a “David and Goliath” dynamic. On one side, we have the Goliaths: massive, generalist Large Language Models (LLMs) like GPT-4, Claude, and Llama 2. These models can do almost anything, but they are computationally expensive, slow, and often accessible only via API. On the other side, we have the Davids: smaller, specialist models (like BERT or DeBERTa) that are fast, cheap, and deployable on modest hardware.

For many organizations and researchers, the goal is knowledge distillation: transferring the capabilities of the giant Teacher LLM into the smaller Student model. The standard way to do this is by asking the Teacher to generate a massive synthetic dataset, which the Student then trains on.

But there is a problem. When you ask an LLM to generate thousands of examples (e.g., “Write a movie review”), it tends to get repetitive. It relies on its internal parametric memory, often regurgitating the same popular entities (e.g., “The Avengers” or “Titanic”) and reusing the same sentence structures. This lack of diversity creates a “bland” dataset that limits how much the Student model can learn.

In this post, we explore a solution proposed by researchers from Amazon and UT Austin: SYNTHESIZRR. This method fundamentally changes how we generate synthetic data by introducing a retrieval step, ensuring that the Student model learns from a rich, diverse, and grounded distribution of data.

The Problem: The Limits of Few-Shot Generation

To understand why SYNTHESIZRR is necessary, we first need to look at the incumbent method: Few-Shot Generation (FEWGEN).

In FEWGEN, you provide the LLM with a few examples (demonstrations) of a task—say, detecting political bias—and ask it to generate more examples following that pattern. While this works to an extent, it suffers from mode collapse. The LLM tends to sample from the high-probability regions of its training data.
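For intuition, here is what a FEWGEN-style prompt amounts to in code. This is a minimal sketch with made-up demonstrations and a hypothetical teacher call, not the exact prompt used in the paper:

```python
def build_fewgen_prompt(demonstrations, target_label):
    """Few-shot generation: show labeled examples, then ask the LLM to write
    a new example for the target label, relying only on its internal memory."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Label: {label}\nReview: {text}\n")
    lines.append(f"Label: {target_label}\nReview:")  # the LLM completes this line
    return "\n".join(lines)

# Hypothetical seed demonstrations for a sentiment task.
demos = [
    ("The acting was wooden and the plot went nowhere.", "negative"),
    ("A heartfelt story with stunning cinematography.", "positive"),
]
prompt = build_fewgen_prompt(demos, target_label="positive")
# synthetic_review = teacher_llm.generate(prompt)  # hypothetical teacher LLM call
```

Because nothing in this prompt changes from one call to the next, repeated generations tend to drift back to the same high-probability entities and phrasings.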

If you ask for 10,000 synthetic news articles, you might get 500 variations of the same Hillary Clinton email scandal or generic complaints about taxes. The resulting dataset lacks the “long tail” of real-world complexity.

Figure 1: Synthetic examples from few-shot generation (middle) and SYNTHESIZRR (bottom). Our approach incorporates a content sourcing step which retrieves documents from a corpus: for the task of detecting political bias, a news article is retrieved and the teacher LLM is prompted to produce a biased version. The resulting synthesis procedure yields diverse examples which more closely match human-written examples.

As shown in Figure 1 above, the FEWGEN approach (middle) often hallucinates content or relies on generic tropes. In contrast, SYNTHESIZRR (bottom) retrieves a real, specific document (like a letter from Senator Harry Reid) and rewrites it to match the target label. This grounding in real documents provides the necessary variety for robust training.

The Solution: SYNTHESIZRR

The core insight of this paper is that dataset synthesis involves two distinct competencies that should be decoupled:

  1. Content Sourcing: Obtaining the “what”—the facts, entities, and scenarios.
  2. Task Inversion: Performing the “how”—formatting that content into the specific input-output pair required for the classification task.

Prior approaches forced the LLM to do both using only its internal memory. SYNTHESIZRR offloads the Content Sourcing to an external retrieval system.

The Workflow

The SYNTHESIZRR process can be visualized as a pipeline that transforms a small set of “seed” examples into a massive, diverse dataset.

Figure 2: Abstract depiction of the SYNTHESIZRR procedure. In the content sourcing stage, we retrieve \(K\) unique documents \(\{r_1, \ldots, r_K\}\) from a large corpus for each in-context covariate \(x_{ICL}\). The task-inversion stage of synthesis uses a parameterized context refinement prompt \(\mathcal{P}_\tau\), which takes parameters \(R_{inv}\) (inversion instruction), \(r_k\) (a retrieved document), and \(\mathcal{V}(y_{ICL})\) (the verbalized target label). A generalist teacher LLM autoregressively generates a synthetic covariate. Each in-context example thus produces \(K\) unique synthetic examples \(\{\tilde{x}_1, \dots, \tilde{x}_K\}\), which we include in the dataset with target \(y_{ICL}\).

Here is the step-by-step breakdown:

  1. Content Sourcing (Retrieval): The system starts with a small seed set of labeled examples (\(x_{ICL}, y_{ICL}\)). It uses these examples as queries to search a massive unlabeled corpus (like RealNews or an Amazon products corpus). For every seed example, it retrieves \(K\) unique documents (\(r_1, \dots, r_K\)).
  2. Task Inversion (Refinement): The system then constructs a prompt that includes one of these retrieved documents. It instructs the Teacher LLM to rewrite or leverage this retrieved document to match a specific target label.

The beauty of this approach is the expansion factor. A single seed example can be used to retrieve 50 or 100 different documents. Since each document contains unique entities and contexts, the LLM is forced to generate 50 or 100 unique synthetic training examples, rather than repeating the same concept.
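Put together, the whole loop fits in a few lines of Python. The sketch below assumes a `retrieve` callable (e.g., a dense retriever such as Contriever over an indexed corpus) and a `generate` callable wrapping the teacher LLM; the prompt wording is illustrative rather than the paper's exact template:

```python
def synthesizrr(seed_set, retrieve, generate, verbalize, inversion_instruction, k=50):
    """Sketch of SYNTHESIZRR: content sourcing (retrieval) + task inversion.

    seed_set: iterable of (x_icl, y_icl) labeled seed examples.
    retrieve(query, top_k): returns top_k documents from a large unlabeled corpus.
    generate(prompt): calls the teacher LLM and returns its completion.
    verbalize(label): maps a label id to a natural-language description.
    """
    synthetic_dataset = []
    for x_icl, y_icl in seed_set:
        # Content sourcing: the seed text is the query; retrieve K unique documents.
        docs = retrieve(query=x_icl, top_k=k)
        for r_k in docs:
            # Task inversion: ask the teacher to rewrite the retrieved document
            # so that it expresses the target label.
            prompt = (
                f"{inversion_instruction}\n\n"
                f"Document: {r_k}\n\n"
                f"Rewrite this as a {verbalize(y_icl)} example:"
            )
            x_tilde = generate(prompt)
            # One seed example expands into K synthetic (text, label) pairs.
            synthetic_dataset.append((x_tilde, y_icl))
    return synthetic_dataset
```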

Mathematically, the generation of the synthetic example \(\tilde{x}\) is conditioned not just on the prompt instructions, but explicitly on the retrieved document \(r_k\):

\[ \tilde{x}_j^i \sim \mathcal{M}_{\mathrm{LM}}\big(\cdot \mid \tilde{x}_{<j}^i,\; \mathcal{P}_\tau(R_{inv}, r_k, \mathcal{V}(y_{ICL}))\big) \]

Description: Equation showing the generation probability conditioned on the prompt P, inversion instructions R, retrieved document r, and verbalized label V.

This formula highlights that the generation is grounded. The model \(\mathcal{M}_{LM}\) generates the next token based on the prompt \(\mathcal{P}_\tau\), which wraps the inversion instruction \(R_{inv}\), the retrieved document \(r_k\), and the label verbalization \(\mathcal{V}(y)\).

The Algorithm: RETRICL

The authors introduce a specific algorithm called SYNTHESIZRR RETRICL (Retrieval In-Context Learning).

In standard few-shot prompting, you provide random examples from your seed set. In RETRICL, the authors use retrieval even for the demonstrations. When generating a synthetic example based on a retrieved document \(r_k\), the prompt includes “shots” (examples) that are semantically similar to that document.
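A rough sketch of that shot-selection step, under the assumption that we already have vector embeddings for the seed examples and the retrieved document (the cosine-similarity scoring here is an illustrative choice, not necessarily the paper's exact mechanism):

```python
import numpy as np

def pick_retricl_shots(doc_embedding, seed_embeddings, seed_examples, n_shots=3):
    """Select the n_shots seed examples most similar to the retrieved document.

    doc_embedding: 1-D vector for the retrieved document r_k.
    seed_embeddings: (num_seeds, dim) matrix of embeddings for the seed texts.
    seed_examples: list of (text, label) pairs aligned with seed_embeddings.
    """
    # Cosine similarity between the document and every seed example.
    doc = doc_embedding / np.linalg.norm(doc_embedding)
    seeds = seed_embeddings / np.linalg.norm(seed_embeddings, axis=1, keepdims=True)
    sims = seeds @ doc
    top = np.argsort(-sims)[:n_shots]
    return [seed_examples[i] for i in top]
```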

The final synthetic dataset \(\mathcal{D}_{SYNTH}\) is the union of all generations derived from all retrieved documents:

\[ \mathcal{D}_{\mathrm{SYNTH}} = \bigcup_{(x,\, y,\, \Gamma_K) \in \mathcal{D}_{\mathrm{RETR}}} \; \bigcup_{r_k \in \Gamma_K} \big\{ (\tilde{x}^i, y) \big\}. \]

Description: Equation showing the set union operation to create the final synthetic dataset.

Why It Works: Intrinsic Evaluation

Does adding a retrieval step actually make the data better? The researchers analyzed the generated text using several metrics to measure diversity and quality.

1. Lexical Diversity (Self-BLEU)

Self-BLEU measures how similar a generated sentence is to other sentences in the same dataset. A lower score is better, as it indicates less repetition.
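Concretely, Self-BLEU treats each generated text as the hypothesis and every other generation in the set as a reference. A minimal implementation with NLTK might look like this (whitespace tokenization and the smoothing choice are simplifications):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(texts, n=4):
    """Average BLEU-n of each text against all other texts in the set.
    Lower is better: it means the generations repeat each other less."""
    weights = tuple(1.0 / n for _ in range(n))  # uniform n-gram weights
    smooth = SmoothingFunction().method1
    tokenized = [t.split() for t in texts]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # every other generation is a reference
        scores.append(sentence_bleu(refs, hyp, weights=weights, smoothing_function=smooth))
    return sum(scores) / len(scores)
```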

Figure 3: Self-BLEU (↓) for n-grams n = 1–5. Comparison: GOLD, FEWGEN 0-shot, FEWGEN 32-shot, SYNTHESIZRR 0-shot, SYNTHESIZRR 3-shot RETRICL, SYNTHESIZRR 32-shot NON-RETRICL.

As shown in Figure 3, SYNTHESIZRR (the blue and pink lines) achieves significantly lower Self-BLEU scores than FEWGEN (orange/green lines). The diversity of SYNTHESIZRR approaches that of human-written text (GOLD). This confirms that seeding the model with different retrieved documents forces it to use varied vocabulary.

2. Entity Diversity (Entropy)

One of the biggest weaknesses of LLMs is “popularity bias”—they talk about New York and London, but rarely about Scranton or Leeds.

Figure 4: Entity entropy (↑) on TOI (headlines) and CATEGORY (reviews). Comparison: GOLD, FEWGEN 32-shot, SYNTHESIZRR 3-shot RETRICL, and SYNTHESIZRR 32-shot NON-RETRICL. Zero-shot results are similar for SYNTHESIZRR and worse for FEWGEN; we omit them.

Figure 4 shows the entity entropy (how varied the mentioned entities are) for different entity types, such as Organizations (ORG), People (PERSON), and Locations (LOC). SYNTHESIZRR maintains high entropy, similar to the GOLD (human-written) data. FEWGEN, conversely, collapses into a narrow range of popular entities. By retrieving from the “long tail” of a corpus, SYNTHESIZRR ensures the Student model sees rare entities during training.
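Entity entropy can be approximated by running a named-entity recognizer over the synthetic texts and computing the Shannon entropy of the resulting entity counts per type. The sketch below uses spaCy's small English model as an assumed NER backend; the paper's exact tagger may differ:

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed NER model; any tagger would do

def entity_entropy(texts, entity_type="PERSON"):
    """Shannon entropy (in bits) of the distribution of entities of one type.
    Higher entropy means the dataset mentions a wider variety of entities."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ == entity_type:
                counts[ent.text.lower()] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```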

3. Qualitative Comparison

The difference is even more stark when you read the text. In Table 3 below, look at the “Electronics” category. The FEWGEN example is vague (“happy customer,” “fast,” “good external drive”). It reads like a template. The SYNTHESIZRR example, derived from a retrieved product description, is specific: it mentions a “portable laptop microphone,” “right-angled,” and “flat-frequency response.”

Table 3: Real and synthetic examples from the “electronics” class of CATEGORY. Grey text indicates a lack of specifics.

This specificity helps the Student model learn features that actually matter for classification, rather than relying on generic sentiment words.

Distillation Performance: Extrinsic Evaluation

The ultimate test of a synthetic dataset is: Does it train a better Student model?

The authors fine-tuned a DeBERTa-v3-Large model (the Student) on datasets generated by different methods.
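For reference, here is a minimal sketch of that distillation step with Hugging Face Transformers, assuming the synthetic examples are already collected as (text, integer label) pairs; the hyperparameters are placeholders, not the paper's settings:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def train_student(synthetic_dataset, num_labels, model_name="microsoft/deberta-v3-large"):
    """Fine-tune a student classifier on synthetic (text, integer label) pairs."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    texts, labels = zip(*synthetic_dataset)  # labels are assumed to be integer class ids
    ds = Dataset.from_dict({"text": list(texts), "label": list(labels)})
    ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256), batched=True)

    args = TrainingArguments(
        output_dir="student-deberta",
        per_device_train_batch_size=16,  # placeholder hyperparameters
        num_train_epochs=3,
        learning_rate=2e-5,
    )
    Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer).train()
    return model
```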

Table 6: Test accuracy (↑) after distilling a DEBERTA-V3-LARGE student from LLAMA-2 CHAT 13B and CLAUDE INSTANT-V1. CONTRIEVER was used as the retriever in SYNTHESIZRR. We report the average of 5 runs and rerun in cases where std. dev. ≥ 6% (indicating one or more models failed to converge). The top half considers zero-shot synthesis and the bottom half uses in-context learning, and we bold the best result under each paradigm. Notation: * 32-shot; † 3-shot RETRICL; ‡ 32-shot NON-RETRICL.

Table 6 presents the results across six different datasets (AG News, Hyperpartisan, etc.).

  • Zero-Shot: Even without any in-context examples, SYNTHESIZRR significantly outperforms FEWGEN. This indicates that the retrieval step alone provides massive value.
  • Few-Shot: The 3-shot SYNTHESIZRR (RETRICL) consistently beats the 32-shot FEWGEN. This is a remarkable efficiency gain—using retrieval allows for better performance with fewer in-context demonstrations.

Comparison to State-of-the-Art

The researchers also benchmarked SYNTHESIZRR against other complex synthesis methods like SunGen, ReGen, and AttrPrompt.

Table 7: Evaluations of synthetic datasets released by prior work. We subsample all to 6K examples (uniformly distributed across classes) before computing metrics as described in §4. Tasks not evaluated by previous authors are denoted by ⊗ while those evaluated without a dataset release are marked ✗. GPT3.5 is text-davinci-003 whereas GPT3.5-T is gpt-3.5-turbo (OpenAI, 2022), LLAMA2 is the 13B Chat version (Touvron et al., 2023a), and CLAUDEV1 is the Instant-V1.2 version (Anthropic, 2023). Accuracy is measured on a DISTILBERT student, where we train 5 student models and report the mean accuracy (std. dev. was ≤ 2.0 in all cases). Within each dataset, we bold the best result.

As shown above (Table 7), SYNTHESIZRR generally produces datasets with both stronger intrinsic metrics and higher student accuracy than these other methods, often while using a smaller teacher model (Llama-2 13B vs. GPT-3.5). It particularly excels on MAUVE, a metric that measures how closely the distribution of synthetic text matches that of human-written text.

Why Does It Work? The Data Map Analysis

To understand why the Student models learn better, the authors employed Dataset Cartography. This technique plots training examples on two axes:

  1. Confidence: How sure the model is about the true label.
  2. Variability: How much the model’s prediction fluctuates during training.

High variability usually indicates “ambiguous” or “hard-to-learn” examples. These are often the most valuable for learning decision boundaries.
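Both axes are straightforward to compute if, at the end of each training epoch, you record the probability the model assigns to each example's gold label. A small sketch, assuming such a per-epoch probability matrix is available:

```python
import numpy as np

def data_map_coordinates(true_label_probs):
    """Compute dataset-cartography coordinates per training example.

    true_label_probs: array of shape (num_epochs, num_examples), where entry
    (e, i) is the probability the model assigned to example i's gold label
    at the end of epoch e.

    Returns (confidence, variability): the mean and standard deviation of
    the gold-label probability across epochs, one value per example.
    """
    probs = np.asarray(true_label_probs)
    confidence = probs.mean(axis=0)   # high = easy-to-learn
    variability = probs.std(axis=0)   # high = ambiguous / hard-to-learn
    return confidence, variability
```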

Figure 5: Data maps from a DISTILBERT training run on 8K CATEGORY rows from LLAMA2. FEWGEN (center) is skewed towards easy-to-learn examples (top-left) while GOLD (left) and SYNTHESIZRR (right) have a higher density of ambiguous examples.

Figure 5 reveals a critical insight.

  • FEWGEN (Middle): The data is clustered in the top-left (High Confidence, Low Variability). These are “easy” examples. The model learns them instantly and stops learning.
  • SYNTHESIZRR (Right): The distribution looks much more like the GOLD human data (Left). It contains a healthy spread of ambiguous examples (higher variability).

Because SYNTHESIZRR forces the LLM to deal with real-world, retrieved contexts (which might be messy or complex), it generates training data that challenges the Student model, preventing it from learning simple, brittle heuristics.

Influence of In-Context Examples

Does providing more in-context examples (shots) help SYNTHESIZRR? The authors varied the number of shots used in the RETRICL process.

Figure 7: Left: DEBERTA-V3L test accuracy (↑); center: entity entropy (↑); right: MAUVE (↑) for SYNTHESIZRR RETRICL. We vary the number of in-context examples from 0 to 8. Teacher LLMs LLAMA-2 CHAT 13B and CLAUDE INSTANT-V1 are compared on 6 tasks: AG NEWS, HYPERPARTISAN, TOI HEADLINES, CATEGORY, HUMOR and POLARITY. We do not report CATEGORY 8-shot due to model failures.

Figure 7 shows that increasing the number of shots from 0 to 8 generally improves Student accuracy (left graph). Interestingly, more shots come with a slight trade-off in entity entropy (center graph): as the prompt becomes more constrained, the LLM's output grows slightly less varied, though the accuracy gains usually outweigh this drop.

Conclusion

The SYNTHESIZRR paper presents a compelling argument against the “black box” generation of synthetic data. By relying solely on an LLM’s internal parameters (FEWGEN), we get data that is biased towards the “head” of the distribution—repetitive, bland, and overly simple.

By treating dataset generation as a two-part process—Retrieval (Content) + Refinement (Style)—SYNTHESIZRR achieves the best of both worlds. It leverages the massive knowledge contained in external corpora and the instruction-following capabilities of modern LLMs.

Key Takeaways for Students:

  1. Decomposition: Complex NLP tasks often benefit from breaking them down. Here, separating what to say from how to say it unlocked diversity.
  2. Grounding: Synthetic data is only as good as its source. Grounding generation in retrieved documents prevents mode collapse.
  3. Difficulty Matters: For a Student model to learn, the Teacher must provide challenging, ambiguous examples, not just easy ones.

As we move toward a future where most training data might be synthetic, techniques like SYNTHESIZRR will be essential to prevent our AI models from collapsing into a feedback loop of their own repetitive outputs.