Introduction

Imagine you need to build a text classifier for a very specific task. Perhaps you need to filter emails that are both “urgent” and “related to shipping,” or identify social media posts that are “sarcastic” versus “genuinely angry.”

Traditionally, you had two difficult options. First, you could collect thousands of examples and label them by hand to train a model, a slow and expensive process. Second, you could try “zero-shot” classification by mining raw corpora, scanning massive text collections for sentences that mention your class names. However, this often fails when concepts are complex or nuanced.

Recently, Large Language Models (LLMs) like GPT-4 and LLaMA have offered a third path: just ask the LLM to classify the text. While accurate, this is computationally expensive and slow for processing millions of documents. A smarter approach is to use the LLM to generate a training dataset, then use that data to train a smaller, faster model (like BERT or RoBERTa).

But there is a catch. Most existing methods for generating synthetic data treat every label in isolation. They struggle with “interdependent” labels—for example, defining an “Other” category depends entirely on what the other specific categories are.

In the research paper “Incubating Text Classifiers Following User Instructions with Nothing but LLM,” researchers from UC San Diego propose a novel framework called Incubator. This system treats the creation of a classifier as a “model incubation” process. By instruction-tuning an LLM and using a clever “self-diversification” technique, they can generate high-quality, diverse, and context-aware training data from nothing but a user’s instruction.

The Problem with Zero-Shot Classification

To understand why Incubator is necessary, we first need to look at the limitations of current zero-shot methods.

Zero-shot text classification aims to build a classifier without seeing any labeled examples. Early methods relied on mining raw corpora (like Wikipedia) for sentences containing class names. This works for simple topics like “Sports” vs. “Politics,” but fails for complex concepts like “TED Talk given by an Educator.”

With the advent of LLMs, researchers started using prompts to generate synthetic data. Methods like ZeroGen prompt an LLM to “generate a sentence about sports.” While better than mining, these methods usually generate data for each class independently.

This independence creates a massive blind spot: Label Interdependency.

Consider a classification task with three labels: Love, Joy, and Other.

  • With the labels Love, Joy, and Other, a text expressing anger falls into Other.
  • If the user changes the labels to Love, Anger, and Other, anger gets its own category, and the definition of Other fundamentally shifts.

Figure 1 compares Incubator with raw corpus mining and simple LLM generation. It highlights how the definition of “Other” changes depending on the presence of “Love” or “Sad” in the label set.

As shown in Figure 1, traditional methods struggle to capture this dynamic. They might generate generic “Other” text that overlaps with the specific categories you care about. Incubator solves this by feeding the LLM the entire context of the instruction, allowing it to understand the relationships between labels.
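
To make the contrast concrete, here is a small, purely illustrative sketch (the prompt wording is hypothetical, not the actual templates used by ZeroGen or Incubator) of label-by-label generation versus generation conditioned on the full label set:

```python
# Hypothetical prompt templates for illustration only; not the papers' actual prompts.

labels = ["Love", "Joy", "Other"]

# Label-by-label generation: each prompt sees a single class name, so the
# prompt for "Other" has no idea which emotions it is supposed to exclude.
independent_prompts = [
    f"Generate a short text expressing the emotion '{label}'." for label in labels
]

# Context-aware generation: one prompt carries the entire label set, so the
# model can produce "Other" texts that deliberately avoid Love and Joy.
joint_prompt = (
    "Generate one short text for each label in "
    f"{labels}. 'Other' means any emotion that is neither Love nor Joy. "
    "Return a JSON dictionary mapping each label to its text."
)
```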

The Incubator Framework

The core idea of Incubator is to create a “teacher” LLM that is an expert at generating training data for “student” models. The process involves two main stages: Instruction-Tuning and Self-Diversification.

Figure 2 provides an overview of the Incubator framework. It shows the flow from Huggingface Datasets to Instruction Tuning, followed by the Self-diversification phase using sentence embeddings and clustering.

Stage 1: Instruction-Tuning the LLM

An off-the-shelf LLM (like LLaMA-2) is good at chatting, but not necessarily optimized for generating perfectly balanced classification datasets. To fix this, the researchers “instruction-tuned” the model.

They collected pairs of descriptions and data samples from the HuggingFace platform. They converted these into a specific format: a user instruction (describing the task) and a Python dictionary containing examples for each label.

To illustrate, here is what a training sample looks like:

Figure 6 shows a case from the instruction-tuning dataset. The instruction asks for a model to anticipate star ratings for Android apps, and the data is a JSON dictionary mapping star ratings to review texts.
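
In code form, a single instruction-tuning pair might look roughly like this; the field names and the review texts below are illustrative reconstructions, not copied from the paper:

```python
# An illustrative instruction-tuning sample: a user instruction paired with a
# dictionary mapping each label to example texts. Field names are assumptions.
training_sample = {
    "instruction": (
        "I want a classifier that anticipates the star rating (1-5) of an "
        "Android app based on the text of a user review."
    ),
    "data": {
        "1 star": ["Crashes every time I open it. Useless."],
        "3 stars": ["Does the job, but the ads are getting out of hand."],
        "5 stars": ["Smooth, fast, and the new update is fantastic!"],
    },
}
```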

However, manually collecting these pairs provides a limited dataset. To expand it, the researchers used GPT-4 as a data augmentation tool. They used In-Context Learning (ICL), feeding GPT-4 a few real examples and asking it to hallucinate new, imaginative tasks and corresponding data.

Table 7 displays the prompt used for ICL-based augmentation, where the user asks GPT-4 to generate imaginative instructions and sample data dictionaries.
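
The exact prompt lives in Table 7; the sketch below only illustrates the general in-context-learning pattern, with made-up seed examples and wording:

```python
import json

# Illustrative ICL-style augmentation prompt; not the paper's actual Table 7 prompt.
seed_examples = [
    {
        "instruction": "Classify hotel reviews as 'cleanliness', 'noise', or 'other complaints'.",
        "data": {
            "cleanliness": ["The room was spotless when we arrived."],
            "noise": ["Paper-thin walls; we heard every word next door."],
            "other complaints": ["Check-in took almost an hour."],
        },
    },
]

augmentation_prompt = (
    "Here are examples of text classification tasks, each with sample data:\n"
    + "\n".join(json.dumps(example, indent=2) for example in seed_examples)
    + "\n\nNow invent a new, imaginative classification task in the same format. "
    "Return the user instruction and a data dictionary with a few examples per label."
)
```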

By the end of this stage, the Incubator (based on LLaMA-2) had learned to take a complex user request and output a structured dictionary of training examples that respected the relationships between labels.

Stage 2: Self-Diversification

A major issue with LLMs is that they are repetitive. If you ask an LLM to “write a positive movie review” 100 times, you will get 100 very similar variations of “I loved this movie, the acting was great.” This lack of diversity hurts the performance of the final classifier.

To solve this, the researchers introduced Self-Diversification.

  1. Mass Generation: They query the Incubator to generate a large number of samples (e.g., 1024) for a single instruction.
  2. Embedding: They use a text embedder (a model that turns text into a mathematical vector) to represent these samples in semantic space. Because the data is structured as a dictionary (Label A + Label B + …), they concatenate the embeddings of all values to represent the full “data package”: \[ E(d) = \bigoplus_{i=1}^{n} E(d[l_i]) \]
  3. Clustering: They run a clustering algorithm (K-Means) on these embeddings to group similar samples together.
  4. Selection: They identify the samples closest to the center of each cluster. These represent the most distinct, “archetypal” examples of different semantic variations (see the code sketch after this list).
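
Here is a minimal sketch of the embed-cluster-select loop, assuming sentence-transformers for the embedder and scikit-learn’s K-Means; the model name, cluster count, and the assumption that each generation maps a label to a single text are illustrative choices, not the paper’s exact configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def select_diverse_samples(generations, labels, n_clusters=16):
    """Keep the generation nearest to each K-Means centroid.

    `generations` is a list of dicts mapping each label to one generated text,
    i.e. the structured outputs from querying the Incubator many times.
    """
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder choice

    # Embed every label's text and concatenate them so that one vector
    # represents the whole "data package": E(d) = concat_i E(d[l_i]).
    vectors = np.stack([
        np.concatenate([embedder.encode(d[label]) for label in labels])
        for d in generations
    ])

    # Cluster the packages, then keep the sample closest to each cluster center.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    selected = []
    for center in km.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(vectors - center, axis=1)))
        selected.append(generations[idx])
    return selected
```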

The Incubator is then fine-tuned on these diverse cluster centers, so it learns to prioritize diversity and uniformity in its generations, preventing the “mode collapse” in which an LLM simply repeats the same easy examples.

Experiments and Results

The researchers tested Incubator against several strong baselines, including:

  • Prompting: Asking LLaMA-2 to classify text directly (no training).
  • ZeroGen & ProGen: Previous state-of-the-art methods that generate data label-by-label without considering dependencies.

The goal was to “incubate” a small RoBERTa-Large model using the synthetic data and see how well it performed on real benchmarks.
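
As a rough sketch of that student stage, assuming the synthetic dictionary has already been flattened into (text, label) pairs, standard HuggingFace fine-tuning is all that is needed (the toy data and hyperparameters below are illustrative):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative synthetic pairs; in practice these come from the Incubator's output.
texts = ["I adore this!", "This is infuriating.", "The package arrived on Tuesday."]
labels = [0, 1, 2]  # e.g. Love, Anger, Other

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

# Tokenize the synthetic texts into a training dataset.
dataset = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="incubated-classifier",
                           num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```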

Performance on Traditional Benchmarks

The results on standard datasets (like sentiment analysis and news classification) showed that Incubator consistently outperforms previous methods.

Table 1 shows benchmark results. Incubator achieves an average score of 70.78, beating ZeroGen (64.73) and ProGen (65.79). The diversification step alone adds significant value.

The data indicates that considering all labels simultaneously, rather than generating them one by one, allows the model to draw sharper decision boundaries. The self-diversification step also proved crucial: in the ablation that removes it (the “- Diversification” row in the table), performance dropped significantly.

Handling the “Other” Class

The true test of Incubator’s understanding of label dependency is the “Other” class. The researchers modified several datasets to lump minority classes into a generic “Other” category.

Table 2 presents results on datasets with an “Other” class. Incubator drastically outperforms ZeroGen and ProGen, specifically on the NYT-LOC and Massive datasets.

In the NYT-LOC dataset (classifying news by location), Incubator achieved 84.19% accuracy compared to roughly 69% for the baselines. This indicates that Incubator correctly inferred that “Other” meant “locations that are NOT the specific ones listed,” whereas other methods struggled to define the negative space.

Complex User Constraints and Logic

One of the most powerful applications of Incubator is Logical Text Mining. Users often have complex requirements that can be expressed using logic gates (AND, OR, NOT).

For example, a user might want to find text messages that are “Positive AND about food.”

Incubator allows for Conjunctive Incubation. You can decompose a complex request into simple components (Incubate a model for “Positive”, incubate a model for “About Food”), and then combine their probabilities mathematically.
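
A minimal sketch of that combination step, assuming each incubated classifier returns the probability that its concept applies and assuming a simple independence-based rule for AND, OR, and NOT (the paper’s exact combination formula may differ):

```python
def combine(p_a: float, p_b: float, op: str) -> float:
    """Combine per-concept probabilities with a logic gate.

    Assumes the concepts are independent; a simple, illustrative rule.
    """
    if op == "AND":
        return p_a * p_b
    if op == "OR":
        return p_a + p_b - p_a * p_b
    if op == "NOT":  # negate the first concept; p_b is ignored
        return 1.0 - p_a
    raise ValueError(f"unknown op: {op}")

# Example: score a text message for "Positive AND about food"
p_positive, p_about_food = 0.92, 0.85
print(combine(p_positive, p_about_food, "AND"))  # 0.782
```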

Table 5 shows examples of generated text. For the target ‘Positive’, the data generated is clearly positive. For ‘About Food’, the generated text discusses burgers and sushi. Crucially, the ‘Other’ examples serve as valid negative samples.

The generated data in the table above demonstrates high semantic quality. Notice specifically how the “Other” category is context-aware. When generating data for “About Food,” the Incubator creates “Other” examples regarding meetings or movies—things that are distinct from food, ensuring the classifier learns the boundary correctly.

The researchers quantified this capability using “Precision@100” (how many of the top 100 retrieved items were correct).
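
Precision@100 itself is straightforward to compute: rank every candidate text by its (combined) score, take the top 100, and measure the fraction that truly satisfy the query. A small sketch:

```python
def precision_at_k(scores, is_relevant, k=100):
    """Fraction of the k highest-scoring items that are actually relevant.

    `scores` and `is_relevant` are parallel lists: a score per candidate text
    and a boolean marking whether that text truly satisfies the user's query.
    """
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(is_relevant[i] for i in top_k) / k
```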

Table 4 displays performance on logical conjunctions. Conjunctive Incubation (breaking tasks down) generally outperforms Direct Incubation, achieving up to 100% precision on ‘OR’ tasks.

Analysis: Efficiency and Robustness

For this technology to be useful, it must be efficient. Does generating synthetic data take forever?

Figure 5 analyzes efficiency. Dataset generation takes a constant time of about 70 seconds. Classifier incubation (training) scales linearly with the number of labels but remains under 4 minutes total.

As shown in Figure 5, the process is surprisingly fast. Generating the dataset takes just over a minute. The majority of the time is spent fine-tuning the small student model, which is still very quick (less than 4 minutes). This makes it feasible to spin up a custom classifier on a laptop in the time it takes to grab a coffee.

The researchers also analyzed how much data is actually needed.

Figure 3 analyzes dataset size. It shows that accuracy plateaus after about 64 samples per class, meaning you don’t need to generate massive amounts of data to get good results.

Figure 3 reveals that the “scaling law” for synthetic data hits a point of diminishing returns relatively quickly. Generating just 64 to 128 high-quality, diverse samples per class is often enough to train a robust classifier. This further emphasizes the importance of the quality and diversity provided by the Incubator framework over sheer quantity.

Conclusion

The “Incubator” framework represents a significant shift in how we think about NLP in the era of Large Language Models. Instead of using massive, expensive LLMs for every single prediction, we can use them as “teachers” to create specialized, lightweight “student” models.

By focusing on instruction-tuning and self-diversification, Incubator solves the critical problem of label dependency. It allows users to define classes naturally—including tricky concepts like “Other” or “Not urgent”—and generates training data that respects those boundaries.

This approach democratizes text classification. You no longer need a raw corpus of millions of documents or a budget for human annotators. With nothing but a clear instruction and an LLM, you can incubate a custom classifier tailored exactly to your needs.

Future work in this space aims to expand beyond classification. Imagine “incubating” a custom summarizer or a question-answering system that follows your specific style guide, all generated from a single prompt. The era of “Model Incubation” has just begun.