Introduction
Imagine you need to build a text classifier for a very specific task. Perhaps you need to filter emails that are both “urgent” and “related to shipping,” or identify social media posts that are “sarcastic” versus “genuinely angry.”
Traditionally, you had two difficult options. First, you could collect thousands of examples and label them by hand to train a model—a slow, expensive process. Second, you could try “zero-shot” classification by mining raw corpora, searching massive text collections for sentences that mention your class names. However, this often fails when concepts are complex or nuanced.
Recently, Large Language Models (LLMs) like GPT-4 and LLaMA have offered a third path: just ask the LLM to classify the text. While accurate, this is computationally expensive and slow for processing millions of documents. A smarter approach is to use the LLM to generate a training dataset, then use that data to train a smaller, faster model (like BERT or RoBERTa).
But there is a catch. Most existing methods for generating synthetic data treat every label in isolation. They struggle with “interdependent” labels—for example, defining an “Other” category depends entirely on what the other specific categories are.
In the research paper “Incubating Text Classifiers Following User Instructions with Nothing but LLM,” researchers from UC San Diego propose a novel framework called Incubator. This system treats the creation of a classifier as a “model incubation” process. By instruction-tuning an LLM and using a clever “self-diversification” technique, they can generate high-quality, diverse, and context-aware training data from nothing but a user’s instruction.
The Problem with Zero-Shot Classification
To understand why Incubator is necessary, we first need to look at the limitations of current zero-shot methods.
Zero-shot text classification aims to build a classifier without seeing any labeled examples. Early methods relied on mining raw corpora (like Wikipedia) for sentences containing class names. This works for simple topics like “Sports” vs. “Politics,” but fails for complex concepts like “TED Talk given by an Educator.”
With the advent of LLMs, researchers started using prompts to generate synthetic data. Methods like ZeroGen prompt an LLM to “generate a sentence about sports.” While better than mining, these methods usually generate data for each class independently.
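To make the contrast concrete, here is a schematic sketch of that label-by-label generation pattern; the `llm` callable and prompt wording are hypothetical stand-ins, not ZeroGen’s actual implementation.

```python
# Schematic of label-by-label synthetic data generation (hypothetical sketch,
# not ZeroGen's real code): each class is prompted in isolation, so the LLM
# never sees the other labels in the task.
labels = ["Love", "Joy", "Other"]

def generate_independently(llm, n_per_label=100):
    data = {}
    for label in labels:
        prompt = f"Generate a short text expressing the emotion '{label}'."
        # The model has no idea what 'Other' should exclude, because it never
        # sees that 'Love' and 'Joy' are the competing labels.
        data[label] = [llm(prompt) for _ in range(n_per_label)]
    return data
```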
This independence creates a massive blind spot: Label Interdependency.
Consider a classification task with three labels: Love, Joy, and Other. Here, a text expressing anger belongs in Other. If the user changes the labels to Love, Anger, and Other, the definition of Other fundamentally shifts: anger now has its own category, so Other must exclude it.

As shown in Figure 1, traditional methods struggle to capture this dynamic. They might generate generic “Other” text that overlaps with the specific categories you care about. Incubator solves this by feeding the LLM the entire context of the instruction, allowing it to understand the relationships between labels.
The Incubator Framework
The core idea of Incubator is to create a “teacher” LLM that is an expert at generating training data for “student” models. The process involves two main stages: Instruction-Tuning and Self-Diversification.

Stage 1: Instruction-Tuning the LLM
An off-the-shelf LLM (like LLaMA-2) is good at chatting, but not necessarily optimized for generating perfectly balanced classification datasets. To fix this, the researchers “instruction-tuned” the model.
They collected pairs of descriptions and data samples from the HuggingFace platform. They converted these into a specific format: a user instruction (describing the task) and a Python dictionary containing examples for each label.
To illustrate, here is roughly what one instruction-tuning pair might look like (the instruction wording and label set below are illustrative, not taken verbatim from the paper):
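```python
# A hypothetical instruction-tuning pair: the user instruction describes the task,
# and the target output is a dictionary with one generated example per label.
instruction = (
    "Build a classifier that sorts customer messages into "
    "'Love', 'Joy', or 'Other'."
)

target_output = {
    "Love": "I can't stop thinking about you; every moment together feels magical.",
    "Joy": "We just got the keys to our first apartment and I'm thrilled!",
    # 'Other' is context-aware: it covers emotions NOT listed above, such as anger.
    "Other": "The airline lost my luggage again and nobody will give me an answer.",
}
```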
However, manually collecting these pairs provides a limited dataset. To expand it, the researchers used GPT-4 as a data augmentation tool. They used In-Context Learning (ICL), feeding GPT-4 a few real examples and asking it to hallucinate new, imaginative tasks and corresponding data.

By the end of this stage, the Incubator (based on LLaMA-2) had learned to take a complex user request and output a structured dictionary of training examples that respected the relationships between labels.
Stage 2: Self-Diversification
A major issue with LLMs is that they are repetitive. If you ask an LLM to “write a positive movie review” 100 times, you will get 100 very similar variations of “I loved this movie, the acting was great.” This lack of diversity hurts the performance of the final classifier.
To solve this, the researchers introduced Self-Diversification.
- Mass Generation: They query the Incubator to generate a large number of samples (e.g., 1024) for a single instruction.
- Embedding: They use a text embedder (a model that turns text into a mathematical vector) to represent these samples in semantic space. Because the data is structured as a dictionary (Label A + Label B + …), they concatenate the embeddings of all values to represent the full “data package”: \[ E(d) = \bigoplus_{i=1}^{n} E(d[l_i]) \]
- Clustering: They run a clustering algorithm (K-Means) on these embeddings to group similar samples together.
- Selection: They identify the samples closest to the center of each cluster. These represent the most distinct, “archetypal” examples of different semantic variations.
By fine-tuning the Incubator on these diverse cluster centers, the model learns to prioritize diversity and uniformity in its generation, preventing the “mode collapse” where an LLM simply repeats the same easy examples.
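A minimal sketch of this selection step, assuming a sentence-transformers embedder and scikit-learn’s K-Means as stand-ins (the paper does not prescribe these exact libraries):

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in text embedder

def select_diverse(samples, n_clusters=64):
    """Pick the generated sample nearest to each K-Means centroid.

    `samples` is a list of data packages, each a dict mapping label -> example text.
    """
    # E(d): concatenate the embeddings of every label's value in a fixed order.
    vectors = np.array([
        np.concatenate([embedder.encode(d[label]) for label in sorted(d)])
        for d in samples
    ])

    km = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)

    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        chosen.append(samples[members[dists.argmin()]])  # the "archetypal" sample
    return chosen
```

The selected cluster-center samples are then used to fine-tune the Incubator a second time, which is what pushes it toward diverse, non-repetitive generation.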
Experiments and Results
The researchers tested Incubator against several strong baselines, including:
- Prompting: Asking LLaMA-2 to classify text directly (no training).
- ZeroGen & ProGen: Previous state-of-the-art methods that generate data label-by-label without considering dependencies.
The goal was to “incubate” a RoBERTa-Large student model (small compared to the LLM) using the synthetic data and see how well it performed on real benchmarks.
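As a rough sketch (placeholder hyperparameters, not the paper’s setup), the student training step amounts to flattening the generated dictionaries into (text, label) pairs and fine-tuning RoBERTa on them:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

labels = ["Love", "Joy", "Other"]

# `generated` stands in for the Incubator's output: a list of label -> text dicts.
generated = [
    {"Love": "You make every day brighter.",
     "Joy": "I finally passed my driving test!",
     "Other": "The meeting was pushed to Thursday."},
]

# Flatten the dictionaries into (text, label-id) training records.
records = [{"text": d[l], "label": i} for d in generated for i, l in enumerate(labels)]
dataset = Dataset.from_list(records)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=len(labels)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset.map(tokenize, batched=True),
)
trainer.train()
```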
Performance on Traditional Benchmarks
The results on standard datasets (like sentiment analysis and news classification) showed that Incubator consistently outperforms previous methods.

The data indicates that considering all labels simultaneously, rather than generating them one by one, allows the model to create sharper decision boundaries. The self-diversification step also proved crucial: in the paper’s ablation (the “- Diversification” variant, which removes it), performance dropped significantly.
Handling the “Other” Class
The true test of Incubator’s understanding of label dependency is the “Other” class. The researchers modified several datasets to lump minority classes into a generic “Other” category.

In the NYT-LOC dataset (classifying news by location), Incubator achieved 84.19% accuracy compared to roughly 69% for the baselines. This indicates that Incubator correctly inferred that “Other” meant “Locations that are NOT the specific ones listed,” whereas other methods struggled to define the negative space.
Complex User Constraints and Logic
One of the most powerful applications of Incubator is Logical Text Mining. Users often have complex requirements that can be expressed using logic gates (AND, OR, NOT).
For example, a user might want to find text messages that are “Positive AND about food.”
Incubator allows for Conjunctive Incubation. You can decompose a complex request into simple components (incubate one model for “Positive” and another for “About Food”) and then combine their output probabilities mathematically.
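A small sketch of that combination, assuming two separately incubated binary classifiers saved locally (the directory names and positive-label index are hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def positive_prob(model_dir, text):
    """Probability that `text` belongs to the classifier's target class (index 1 assumed)."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    with torch.no_grad():
        logits = model(**tokenizer(text, return_tensors="pt")).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Conjunctive incubation: "Positive AND About Food" as a product of probabilities.
text = "The ramen place downtown exceeded every expectation."
score = (positive_prob("incubated-positive", text)        # hypothetical model paths
         * positive_prob("incubated-about-food", text))
```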

The data generated for these sub-tasks demonstrates high semantic quality. Notice specifically how the “Other” category is context-aware. When generating data for “About Food,” the Incubator creates “Other” examples regarding meetings or movies—things that are distinct from food, ensuring the classifier learns the boundary correctly.
The researchers quantified this capability using “Precision@100” (how many of the top 100 retrieved items were correct).
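Precision@100 itself is straightforward to compute from the combined scores; a minimal helper (illustrative, not the authors’ evaluation code):

```python
def precision_at_k(scores, relevant, k=100):
    """Fraction of the k highest-scoring documents that are truly relevant.

    `scores` and `relevant` are parallel lists: a score per document and a
    0/1 flag indicating whether that document actually matches the query.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(relevant[i] for i in top) / k
```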

Analysis: Efficiency and Robustness
For this technology to be useful, it must be efficient. Does generating synthetic data take forever?

As shown in Figure 5, the process is surprisingly fast. Generating the dataset takes just over a minute. The majority of the time is spent fine-tuning the small student model, which is still very quick (less than 4 minutes). This makes it feasible to spin up a custom classifier on a laptop in the time it takes to grab a coffee.
The researchers also analyzed how much data is actually needed.

Figure 3 reveals that the “scaling law” for synthetic data hits a point of diminishing returns relatively quickly. Generating just 64 to 128 high-quality, diverse samples per class is often enough to train a robust classifier. This further emphasizes the importance of the quality and diversity provided by the Incubator framework over sheer quantity.
Conclusion
The “Incubator” framework represents a significant shift in how we think about NLP in the era of Large Language Models. Instead of using massive, expensive LLMs for every single prediction, we can use them as “teachers” to create specialized, lightweight “student” models.
By focusing on instruction-tuning and self-diversification, Incubator solves the critical problem of label dependency. It allows users to define classes naturally—including tricky concepts like “Other” or “Not urgent”—and generates training data that respects those boundaries.
This approach democratizes text classification. You no longer need a raw corpus of millions of documents or a budget for human annotators. With nothing but a clear instruction and an LLM, you can incubate a custom classifier tailored exactly to your needs.
Future work in this space aims to expand beyond classification. Imagine “incubating” a custom summarizer or a question-answering system that follows your specific style guide, all generated from a single prompt. The era of “Model Incubation” has just begun.