In the modern era of Natural Language Processing (NLP), we are witnessing a “David and Goliath” dynamic. On one side, we have the Goliaths: massive, generalist Large Language Models (LLMs) like GPT-4, Claude, and Llama 2. These models can do almost anything, but they are computationally expensive, slow, and often accessible only via API. On the other side, we have the Davids: smaller, specialist models (like BERT or DeBERTa) that are fast, cheap, and deployable on modest hardware.
For many organizations and researchers, the goal is knowledge distillation: transferring the capabilities of the giant Teacher LLM into the smaller Student model. The standard way to do this is by asking the Teacher to generate a massive synthetic dataset, which the Student then trains on.
But there is a problem. When you ask an LLM to generate thousands of examples (e.g., “Write a movie review”), it tends to get repetitive. It relies on its internal parametric memory, often regurgitating the same popular entities (e.g., “The Avengers” or “Titanic”) and reusing the same sentence structures. This lack of diversity creates a “bland” dataset that limits how much the Student model can learn.
In this post, we explore a solution proposed by researchers from Amazon and UT Austin: SYNTHESIZRR. This method fundamentally changes how we generate synthetic data by introducing a retrieval step, ensuring that the Student model learns from a rich, diverse, and grounded distribution of data.
The Problem: The Limits of Few-Shot Generation
To understand why SYNTHESIZRR is necessary, we first need to look at the incumbent method: Few-Shot Generation (FEWGEN).
In FEWGEN, you provide the LLM with a few examples (demonstrations) of a task—say, detecting political bias—and ask it to generate more examples following that pattern. While this works to an extent, it suffers from mode collapse. The LLM tends to sample from the high-probability regions of its training data.
If you ask for 10,000 synthetic news articles, you might get 500 variations of the same Hillary Clinton email scandal or generic complaints about taxes. The resulting dataset lacks the “long tail” of real-world complexity.
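To make this failure mode concrete, here is a minimal sketch of what a FEWGEN-style loop might look like. The seed examples, prompt wording, and `call_llm` stub are illustrative placeholders, not the paper's actual setup.

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the teacher LLM (API or local model)."""
    return "<generated article>"

# Invented seed examples purely for illustration.
SEED_EXAMPLES = [
    ("Senator's tax plan draws fire from party loyalists...", "hyperpartisan"),
    ("Local library extends weekend hours for students...", "neutral"),
]

def fewgen_prompt(target_label: str, n_shots: int = 2) -> str:
    demos = random.sample(SEED_EXAMPLES, k=min(n_shots, len(SEED_EXAMPLES)))
    lines = ["Write a news article that matches the given label.\n"]
    for text, label in demos:
        lines.append(f"Label: {label}\nArticle: {text}\n")
    lines.append(f"Label: {target_label}\nArticle:")
    return "\n".join(lines)

def fewgen(target_label: str, n_examples: int) -> list[str]:
    # Every generation conditions on the same handful of demonstrations and the
    # model's parametric memory, so outputs drift toward the same high-probability
    # entities and sentence templates.
    return [call_llm(fewgen_prompt(target_label)) for _ in range(n_examples)]
```

Because nothing new enters the prompt between calls, the only source of variety is the model's own sampling temperature, which is exactly why the outputs collapse toward the "head" of its training distribution.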

As shown in Figure 1 above, the FEWGEN approach (middle) often hallucinates content or relies on generic tropes. In contrast, SYNTHESIZRR (bottom) retrieves a real, specific document (like a letter from Senator Harry Reid) and rewrites it to match the target label. This grounding in real documents provides the necessary variety for robust training.
The Solution: SYNTHESIZRR
The core insight of this paper is that dataset synthesis involves two distinct competencies that should be decoupled:
- Content Sourcing: Obtaining the “what”—the facts, entities, and scenarios.
- Task Inversion: Performing the “how”—formatting that content into the specific input-output pair required for the classification task.
Prior approaches forced the LLM to do both using only its internal memory. SYNTHESIZRR offloads the Content Sourcing to an external retrieval system.
The Workflow
The SYNTHESIZRR process can be visualized as a pipeline that transforms a small set of “seed” examples into a massive, diverse dataset.

Here is the step-by-step breakdown:
- Content Sourcing (Retrieval): The system starts with a small seed set of labeled examples (\(x_{ICL}, y_{ICL}\)). It uses these examples as queries to search a massive unlabeled corpus (like RealNews or an Amazon Product database). For every seed example, it retrieves \(K\) unique documents (\(r_1, \dots, r_K\)).
- Task Inversion (Refinement): The system then constructs a prompt that includes one of these retrieved documents. It instructs the Teacher LLM to rewrite or leverage this retrieved document to match a specific target label.
The beauty of this approach is the expansion factor. A single seed example can be used to retrieve 50 or 100 different documents. Since each document contains unique entities and contexts, the LLM is forced to generate 50 or 100 unique synthetic training examples, rather than repeating the same concept.
Mathematically, the generation of the synthetic example \(\tilde{x}\) is conditioned not just on the prompt instructions, but explicitly on the retrieved document \(r_k\):
\[ \tilde{x}_j^i \sim \mathcal{M}_{\mathrm{LM}}\big(\cdot \mid \tilde{x}_{<j}^i,\ \mathcal{P}_{\tau}(R_{inv}, r_k, \mathcal{V}(y_{ICL}))\big), \]
This formula highlights that the generation is grounded. The model \(\mathcal{M}_{LM}\) generates the next token based on the prompt \(\mathcal{P}_\tau\), which wraps the inversion instruction \(R_{inv}\), the retrieved document \(r_k\), and the label verbalization \(\mathcal{V}(y)\).
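To make the two stages concrete, here is a minimal sketch of the retrieve-then-rewrite loop under some simplifying assumptions: a toy lexical-overlap scorer stands in for a real sparse or dense retriever, `call_llm` is a stub for the teacher LLM, and the prompt wording is illustrative rather than the paper's template.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to the teacher LLM (API or local model)."""
    return "<rewritten article>"

def overlap(a: str, b: str) -> float:
    """Toy Jaccard word overlap standing in for a sparse/dense retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Content sourcing: top-k corpus documents most similar to the seed example."""
    return sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]

def task_inversion_prompt(r_k: str, label: str) -> str:
    """Task inversion: ask the teacher to rewrite r_k to fit the target label."""
    return (f"Rewrite the following article so that it reads as {label}.\n"
            f"Article: {r_k}\n"
            f"Rewritten article:")

def synthesizrr(seed_set, corpus, k: int = 50):
    synthetic = []
    for x_icl, y_icl in seed_set:                    # small labeled seed set
        for r_k in retrieve(x_icl, corpus, k):       # K unique documents per seed
            x_tilde = call_llm(task_inversion_prompt(r_k, y_icl))
            synthetic.append((x_tilde, y_icl))       # one seed -> up to K examples
    return synthetic
```

The expansion factor is visible in the nested loop: each seed example fans out into \(K\) grounded generations, each conditioned on a different real document.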
The Algorithm: RETRICL
The authors introduce a specific algorithm called SYNTHESIZRR RETRICL (Retrieval In-Context Learning).
In standard few-shot prompting, you provide random examples from your seed set. In RETRICL, the authors use retrieval even for the demonstrations. When generating a synthetic example based on a retrieved document \(r_k\), the prompt includes “shots” (examples) that are semantically similar to that document.
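Below is a rough sketch of how such similarity-based shot selection might look (my reading of the idea, not the paper's exact prompt format); `SequenceMatcher` is a crude stand-in for a real semantic similarity model.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity standing in for a semantic retriever."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def retricl_prompt(r_k: str, target_label: str, seed_set, n_shots: int = 3) -> str:
    # Pick the seed examples closest to the retrieved document r_k.
    shots = sorted(seed_set, key=lambda ex: similarity(r_k, ex[0]), reverse=True)[:n_shots]
    parts = ["Here are labeled example articles, then an article to rewrite.\n"]
    for demo_text, demo_label in shots:
        parts.append(f"Label: {demo_label}\nArticle: {demo_text}\n")
    parts.append(f"Rewrite the following article so that it reads as {target_label}.\n"
                 f"Article: {r_k}\nRewritten article:")
    return "\n".join(parts)
```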
The final synthetic dataset \(\mathcal{D}_{SYNTH}\) is the union of all generations derived from all retrieved documents:
\[ \mathcal{D}_{\mathrm{SYNTH}} = \bigcup_{(x,\, y,\, \Gamma_K) \in \mathcal{D}_{\mathrm{RETR}}} \;\; \bigcup_{r_k \in \Gamma_K} \big\{ (\tilde{x}^i, y) \big\}. \]
Why It Works: Intrinsic Evaluation
Does adding a retrieval step actually make the data better? The researchers analyzed the generated text using several metrics to measure diversity and quality.
1. Lexical Diversity (Self-BLEU)
Self-BLEU measures how similar a generated sentence is to other sentences in the same dataset. A lower score is better, as it indicates less repetition.
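A minimal way to compute Self-BLEU over a set of generations is shown below, using NLTK's sentence-level BLEU with whitespace tokenization (a simplification of the paper's exact setup).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(texts: list[str]) -> float:
    """Average BLEU of each text against all other texts in the same set."""
    smooth = SmoothingFunction().method1
    tokenized = [t.split() for t in texts]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]   # every other sentence is a reference
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)

# Lower Self-BLEU = generations overlap less with each other.
print(self_bleu([
    "the movie was great and i loved it",
    "the movie was great and i enjoyed it",
    "a quiet documentary about beekeeping in rural portugal",
]))
```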

As shown in Figure 3, SYNTHESIZRR (the blue and pink lines) achieves significantly lower Self-BLEU scores than FEWGEN (orange/green lines). The diversity of SYNTHESIZRR approaches that of human-written text (GOLD). This confirms that seeding the model with different retrieved documents forces it to use varied vocabulary.
2. Entity Diversity (Entropy)
One of the biggest weaknesses of LLMs is “popularity bias”—they talk about New York and London, but rarely about Scranton or Leeds.

Figure 4 demonstrates the entity entropy (randomness of entities mentioned) for different categories like Organizations (ORG), People (PERSON), and Locations (LOC). SYNTHESIZRR maintains high entropy, similar to the Gold standard. FEWGEN, conversely, collapses into a narrow range of popular entities. By retrieving from the “long tail” of a corpus, SYNTHESIZRR ensures the Student model sees rare entities during training.
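Entity entropy can be approximated by running an off-the-shelf NER model and computing Shannon entropy over the counts of entity surface forms. The sketch below uses spaCy's small English model; the paper's exact NER pipeline and entity normalization may differ.

```python
import math
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_entropy(texts: list[str], ent_type: str = "PERSON") -> float:
    """Shannon entropy of entity mentions of a given type across a corpus."""
    counts = Counter(
        ent.text.lower()
        for doc in nlp.pipe(texts)
        for ent in doc.ents
        if ent.label_ == ent_type
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Higher entropy = mentions spread over many distinct names,
    # rather than concentrated on a few popular entities.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```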
3. Qualitative Comparison
The difference is even more stark when you read the text. In Table 3 below, look at the “Electronics” category. The FEWGEN example is vague (“happy customer,” “fast,” “good external drive”). It reads like a template. The SYNTHESIZRR example, derived from a retrieved product description, is specific: it mentions a “portable laptop microphone,” “right-angled,” and “flat-frequency response.”

This specificity helps the Student model learn features that actually matter for classification, rather than relying on generic sentiment words.
Distillation Performance: Extrinsic Evaluation
The ultimate test of a synthetic dataset is: Does it train a better Student model?
The authors fine-tuned a DeBERTa-v3-Large model (the Student) on datasets generated by different methods.
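For orientation, here is a condensed sketch of what such a distillation step looks like with Hugging Face Transformers; the dataset, hyperparameters, and label count below are placeholders rather than the paper's training recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder synthetic data; in practice this is the SYNTHESIZRR output.
synthetic = {"text": ["<synthetic article>"], "label": [0]}

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

ds = Dataset.from_dict(synthetic).map(
    lambda batch: tok(batch["text"], truncation=True,
                      padding="max_length", max_length=256),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=ds,
)
trainer.train()  # the fine-tuned student is then evaluated on real test data
```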

Table 6 presents the results across six different datasets (AG News, Hyperpartisan, etc.).
- Zero-Shot: Even without any in-context examples, SYNTHESIZRR significantly outperforms FEWGEN. This indicates that the retrieval step alone provides massive value.
- Few-Shot: The 3-shot SYNTHESIZRR (RETRICL) consistently beats the 32-shot FEWGEN. This is a remarkable efficiency gain—using retrieval allows for better performance with fewer in-context demonstrations.
Comparison to State-of-the-Art
The researchers also benchmarked SYNTHESIZRR against other complex synthesis methods like SunGen, ReGen, and AttrPrompt.

As shown above (Table 7), SYNTHESIZRR generally produces datasets with stronger intrinsic metrics and higher downstream (extrinsic) accuracy than these other methods, often while using a smaller teacher model (Llama-2 13B vs. GPT-3.5). It particularly excels on MAUVE, a metric that measures how closely the distribution of synthetic text matches human-written text.
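For reference, MAUVE can be computed with the `mauve-text` package, which embeds both text collections with a GPT-2-based featurizer and compares their distributions. The texts below are placeholders; in practice you would pass hundreds of gold and synthetic examples per side.

```python
import mauve

human_texts = ["<gold example 1>", "<gold example 2>"]        # placeholders
synthetic_texts = ["<synth example 1>", "<synth example 2>"]  # placeholders

# device_id below zero runs the featurizer on CPU.
out = mauve.compute_mauve(p_text=human_texts, q_text=synthetic_texts,
                          device_id=-1, verbose=False)
print(out.mauve)  # closer to 1.0 = synthetic and human distributions match better
```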
Why Does It Work? The Data Map Analysis
To understand why the Student models learn better, the authors employed Dataset Cartography. This technique plots training examples on two axes:
- Confidence: How sure the model is about the true label.
- Variability: How much the model’s prediction fluctuates during training.
High variability usually indicates “ambiguous” or “hard-to-learn” examples. These are often the most valuable for learning decision boundaries.
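As a concrete illustration of these two axes, here is a toy computation following the Dataset Cartography definitions: confidence is the mean probability the model assigns to the true label across training epochs, and variability is the standard deviation of that probability. The probabilities below are invented for illustration.

```python
import numpy as np

# probs[i, e] = model's probability of example i's true label after epoch e
probs = np.array([
    [0.95, 0.97, 0.98],   # easy: high confidence, low variability
    [0.30, 0.70, 0.45],   # ambiguous: mid confidence, high variability
    [0.10, 0.08, 0.12],   # hard: low confidence, low variability
])

confidence = probs.mean(axis=1)    # how sure the model is, on average
variability = probs.std(axis=1)    # how much that sureness fluctuates

for c, v in zip(confidence, variability):
    print(f"confidence={c:.2f}  variability={v:.2f}")
```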

Figure 5 reveals a critical insight.
- FEWGEN (Middle): The data is clustered in the top-left (High Confidence, Low Variability). These are “easy” examples. The model learns them instantly and stops learning.
- SYNTHESIZRR (Right): The distribution looks much more like the GOLD human data (Left). It contains a healthy spread of ambiguous examples (higher variability).
Because SYNTHESIZRR forces the LLM to deal with real-world, retrieved contexts (which might be messy or complex), it generates training data that challenges the Student model, preventing it from learning simple, brittle heuristics.
Influence of In-Context Examples
Does providing more in-context examples (shots) help SYNTHESIZRR? The authors varied the number of shots used in the RETRICL process.

Figure 7 shows that increasing the number of shots (from 0 to 8) generally improves Student accuracy (left graph). However, interestingly, providing more shots creates a slight trade-off in entity entropy (center graph). As the prompt becomes stricter with more examples, the LLM might constrain its creativity slightly, though the performance gains usually outweigh this drop.
Conclusion
The SYNTHESIZRR paper presents a compelling argument against the “black box” generation of synthetic data. By relying solely on an LLM’s internal parameters (FEWGEN), we get data that is biased towards the “head” of the distribution—repetitive, bland, and overly simple.
By treating dataset generation as a two-part process—Retrieval (Content) + Refinement (Style)—SYNTHESIZRR achieves the best of both worlds. It leverages the massive knowledge contained in external corpora and the instruction-following capabilities of modern LLMs.
Key Takeaways for Students:
- Decomposition: Complex NLP tasks often benefit from breaking them down. Here, separating what to say from how to say it unlocked diversity.
- Grounding: Synthetic data is only as good as its source. Grounding generation in retrieved documents prevents mode collapse.
- Difficulty Matters: For a Student model to learn, the Teacher must provide challenging, ambiguous examples, not just easy ones.
As we move toward a future where most training data might be synthetic, techniques like SYNTHESIZRR will be essential to prevent our AI models from collapsing into a feedback loop of their own repetitive outputs.