In the world of machine learning, data is king. But what happens when the data you need simply doesn’t exist?
This is a common scenario in Extremely Weak-Supervised Text Classification (XWS-TC). Imagine you want to build a classifier to detect a very specific, rare emotion in tweets—say, “Surprised”—but you have absolutely no labeled examples. You only have the class name: “Surprised.”
Traditionally, researchers have had two ways to solve this. They could either “mine” the raw data to find existing examples, or they could ask a Large Language Model (LLM) to “synthesize” (write) new ones. Both methods have critical flaws when it comes to minority classes—categories that rarely appear in the data.
In this post, we will dive into a new framework called Text Grafting, proposed by researchers at UC San Diego. This method combines the best of both worlds, using a clever “mask-and-fill” technique to generate high-quality training data that is both accurate to the label and faithful to the real-world data distribution.
The Problem: The Long Tail and Distribution Shifts
To understand why Text Grafting is necessary, we first need to look at why current methods fail for minority classes. Real-world data often follows a long-tailed distribution: a few classes are very common, while many others are rare.
Approach 1: Text Mining
Text mining involves searching through a massive pile of raw, unlabeled text to find examples that match your class name (e.g., finding tweets containing the word “shocked” to represent the “Surprised” class).
While this works for common topics, it falls apart for minority classes. If a class constitutes only 3% of your data, simple keyword matching or clustering often returns noise or finds nothing at all.
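As a toy illustration (the seed words here are mine, not the paper's), keyword-based mining might look like this:

```python
# Toy keyword mining; SEED_WORDS is an illustrative assumption, not from the paper.
SEED_WORDS = {"surprised": ["shocked", "astonished", "unbelievable"]}

def mine(raw_texts, class_name):
    """Return raw texts containing any seed word for the class."""
    seeds = SEED_WORDS[class_name]
    return [t for t in raw_texts if any(w in t.lower() for w in seeds)]

tweets = ["I'm shocked this actually worked!", "Nice weather today."]
print(mine(tweets, "surprised"))  # -> ["I'm shocked this actually worked!"]
```

If no tweet happens to contain a seed word, this returns nothing at all, which is exactly the failure mode for rare classes.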

As shown in Figure 2, the precision of text mining plummets as the class proportion decreases. For rare classes, the mining process is like looking for a needle in a haystack where the needle might not even be there.
Approach 2: Data Synthesis
The alternative is using an LLM (like GPT-4) to generate data from scratch. You simply prompt the model: “Write a surprised tweet.”
The problem here is subtle but significant: Distribution Shift. LLMs have their own “writing style” learned from the vast internet. When an LLM writes a tweet, it often looks distinct from how actual users write tweets in your specific dataset. If you train a classifier on this “synthetic” style, it often struggles to recognize the “real” style during testing.
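For reference, a minimal sketch of this synthesis baseline with the OpenAI API might look like the following (the model name and prompt wording are illustrative, not the paper's setup):

```python
# Sketch of zero-shot data synthesis; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def synthesize(class_name: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write {n} short {class_name} tweets, one per line.",
        }],
    )
    return resp.choices[0].message.content.strip().splitlines()
```

Every output of this function carries the LLM's own register, which is precisely where the distribution shift comes from.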
The Trade-off
We are essentially stuck between two imperfect options:
| Framework | Source | Data Quality | In-Distribution? |
|---|---|---|---|
| Text Mining | Real Data | Noisy (often wrong label) | Yes (it’s real data) |
| Data Synthesis | LLM | Clean (correct label) | No (LLM style) |

Text Grafting aims to fill the gap in the table above: providing Clean labels that are In-Distribution.
The Solution: Text Grafting
The core insight of Text Grafting is that even if a sentence doesn’t belong to the minority class, it might contain structural components or phrasing that fit the target domain perfectly.
The authors propose a biological metaphor: Grafting. In botany, you might take the roots of one tree (which provides stability and adaptation to the soil) and graft a branch from another tree onto it (to produce specific fruit).
In this context:
- The Rootstock is the raw text from your dataset (providing the correct writing style/distribution).
- The Scion is the semantic meaning of the minority class (provided by an LLM).

As illustrated in Figure 1, the process transforms a text from a majority class (e.g., a generic tweet) into a minority class sample (e.g., a “Surprised” tweet) by keeping useful words and rewriting the rest.
The Three-Stage Process
Let’s break down exactly how this algorithm works, using the example of generating data for the “Surprised” emotion.
Stage 1: Potential Text Mining
First, we need to find “potential” in the raw data. We aren’t looking for texts that are surprised; we are looking for texts that could easily become surprised.
The researchers use a small, efficient LLM (like Gemma-7B) to score words in raw texts. They calculate a “potential score” (\(\Delta p\)) for every word. This score measures how much more likely a word becomes when we condition the generation on the specific class label versus a generic instruction.
The equation for the potential score of a word \(x_{(i,j)}\) is:
\[
\Delta p_{(i,j)} = p\big(x_{(i,j)} \mid I_c,\ x_{(i,<j)}\big) \;-\; p\big(x_{(i,j)} \mid I_r,\ x_{(i,<j)}\big)
\]

where:
- \(I_c\): Instruction with the class name (e.g., “Write a surprised sentence”).
- \(I_r\): Regularization instruction (e.g., “Write a sentence”).
If a word appears frequently in “surprised” contexts but not in generic contexts, it gets a high score.
We then score the entire sentence by averaging the scores of the top-performing words (Top-K%):
\[
\Delta P_i = \frac{1}{|\mathcal{K}_i|} \sum_{j \in \mathcal{K}_i} \Delta p_{(i,j)}, \qquad \mathcal{K}_i = \text{the top-}K\%\text{ highest-scoring words of text } i
\]
Texts with high \(\Delta P_i\) are selected as candidates.
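Below is a minimal sketch of this scoring stage using Hugging Face Transformers. It scores tokens rather than whole words as a simplification, and the instruction wording, Top-K fraction, and model choice are assumptions, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-7b"  # a small scorer LLM; exact choice is illustrative
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def token_probs(instruction: str, text: str) -> torch.Tensor:
    """Probability of each token of `text`, conditioned on `instruction`."""
    prompt_ids = tok(instruction, return_tensors="pt").input_ids
    text_ids = tok(text, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, text_ids], dim=1)
    with torch.no_grad():
        probs = model(ids).logits.softmax(-1)
    # logits at position t-1 predict token t, so slice out the text span
    start = prompt_ids.shape[1]
    preds = probs[0, start - 1 : ids.shape[1] - 1]
    return preds.gather(1, text_ids[0].unsqueeze(1)).squeeze(1)

def grafting_scores(text: str, class_name: str, top_k: float = 0.2):
    """Per-token potential scores (Δp) and the sentence score (ΔP)."""
    I_c = f"Write a {class_name} sentence: "  # class-conditioned instruction
    I_r = "Write a sentence: "                # regularization instruction
    delta_p = token_probs(I_c, text) - token_probs(I_r, text)
    k = max(1, int(top_k * len(delta_p)))
    delta_P = delta_p.topk(k).values.mean()   # average of the top-K% tokens
    return delta_p, delta_P
```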
Stage 2: Template Creation
Once we have a candidate text, we don’t keep the whole thing. We keep the high-potential words (the “roots”) and mask out the rest.
- High Score Words: Kept as anchors (e.g., “believe,” “when luck,” “feel”).
- Low Score Words: Replaced with a blank token “_”.
This creates a Template. The template forces the final output to retain the sentence structure and vocabulary of the original dataset, preserving the distribution.
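A small sketch of the masking step (the score values here are hypothetical; in practice they come from Stage 1):

```python
def make_template(words: list[str], scores: list[float], mask_ratio: float = 0.75) -> str:
    """Keep the highest-scoring words as anchors; mask the rest with '_'."""
    n_keep = max(1, int(len(words) * (1 - mask_ratio)))
    keep = set(sorted(range(len(words)), key=lambda j: scores[j], reverse=True)[:n_keep])
    return " ".join(w if j in keep else "_" for j, w in enumerate(words))

words = "I believe things change when luck turns and feel different".split()
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.7, 0.1, 0.2, 0.6, 0.1]  # hypothetical Δp values
print(make_template(words, scores, mask_ratio=0.6))
# -> "_ believe _ _ when luck _ _ feel _"
```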
Stage 3: Template Filling
Finally, we feed this template into a powerful LLM (like GPT-4). We ask it to fill in the blanks to create a coherent sentence that matches the target label.

Figure 3 provides a concrete walkthrough:
- Raw Text: A random sentence is analyzed.
- Scoring: Words like “believe” and “luck” are identified as having high potential for the “Surprised” class.
- Template: “_ believe _ when luck _ feel _”
- Synthesis: The LLM fills the blanks to produce: “I can’t believe it when luck suddenly changes, and I feel completely astonished.”
The result is a sentence that is clearly “Surprised” (thanks to the LLM) but retains the unique phrasing of the raw corpus (thanks to the template).
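As a rough sketch, the filling step might look like this with the OpenAI API; the prompt wording is my own, not the paper's exact prompt:

```python
from openai import OpenAI

client = OpenAI()

def fill_template(template: str, class_name: str) -> str:
    # Prompt wording is an assumption; the paper's filling prompt may differ.
    prompt = (
        f"Fill in every blank ('_') in the template below so the result is one "
        f"fluent, {class_name} sentence. Keep the non-blank words unchanged.\n\n"
        f"Template: {template}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(fill_template("_ believe _ when luck _ feel _", "surprised"))
```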
Experiments and Results
Does this hybrid approach actually work better than just asking GPT-4 to write sentences? The authors tested Text Grafting on several datasets, including TweetEval, Emotion, and 20News.
Beating the Baselines
The results were compelling. Text Grafting consistently outperformed both standard Text Mining (TM) and Data Synthesis (DS) methods.

Take a look at the “Zero-Occur” row in Table 3. This is an extreme scenario where the researchers removed all actual instances of the minority class from the raw text.
- Text Mining failed completely (0.00 score) because there was nothing to find.
- Text Grafting still performed exceptionally well (scoring ~30.61 on TweetEval), proving that it doesn’t need the class to exist in the data—it just needs the style to exist.
Visualizing the Distribution
The researchers also visualized the mathematical “location” of the generated texts using Principal Component Analysis (PCA).

In Figure 4, you can see the distributions:
- Green/Yellow (Original Data): This is where we want to be.
- Blue (Data Synthesis/Incubator): Notice how far away this cluster is? That is the “distribution shift.” The LLM is writing in its own style, not the dataset’s style.
- Purple (Text Grafting): This cluster sits much closer to the original data. By grafting onto existing templates, the method successfully mimics the distribution of the real world.
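If you want to reproduce this kind of plot on your own data, a minimal sketch might look like the following (the sentence encoder is an assumption; any embedding model would do):

```python
# PCA comparison sketch; `corpora` maps a source name to a list of texts.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def plot_distributions(corpora: dict[str, list[str]]) -> None:
    all_texts = [t for texts in corpora.values() for t in texts]
    points = PCA(n_components=2).fit_transform(encoder.encode(all_texts))
    start = 0
    for name, texts in corpora.items():
        chunk = points[start:start + len(texts)]
        plt.scatter(chunk[:, 0], chunk[:, 1], label=name, alpha=0.5)
        start += len(texts)
    plt.legend()
    plt.show()
```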
Why It Works: The Analysis
The paper includes an ablation study to understand which parts of the process matter most.
1. The Mask Ratio
How much of the original sentence should we keep?
- If we keep too much (Low Mask Ratio), the LLM can’t change the meaning enough to fit the new label.
- If we mask too much (High Mask Ratio), we lose the structural benefit and revert to standard generation.

Figure 6 shows that a mask ratio around 0.75 (masking 75% of the words) is the “sweet spot.” This gives the LLM enough freedom to be creative while forcing it to adhere to the original sentence structure.
2. Efficiency and Negative Samples
Usually, when training a classifier with synthetic data, you need to generate “negative” samples (texts that aren’t the target class) to teach the model the difference. This doubles your cost.

However, Figure 5 reveals a surprising efficiency of Text Grafting. The red bars (Text Grafting without negative synthesis) often perform as well as or better than methods that require generating negative data. Because the grafted texts are already “near-distribution,” simply using random raw texts as negative samples is sufficient. This makes Text Grafting highly efficient.
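In code, assembling such a training set is trivial; the sketch below assumes a generic “other” label for the negatives:

```python
import random

def build_training_set(grafted_texts, raw_texts, class_name):
    """Positives: grafted texts. Negatives: random raw texts (no extra synthesis)."""
    positives = [(t, class_name) for t in grafted_texts]
    negatives = [(t, "other") for t in random.sample(raw_texts, len(grafted_texts))]
    return positives + negatives
```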
Case Study: Strength vs. Failure
Text Grafting is powerful, but it relies on the assumption that the raw data contains complex structures worth preserving.

Figure 8 illustrates this perfectly:
- Strength (Top): When dealing with complex sentence structures (like the “Politics” example), grafting successfully transforms a sentence about a sewing machine into a sentence about a senator, keeping the complex grammar intact.
- Failure (Bottom): If the domain is simple factual questions (e.g., “What Mexican leader was shot…”), the template approach can be clumsy. Forcing “wolf” into a template about a “leader” results in nonsense. In cases where the text structure is very simple, standard generation might be enough.
Conclusion
Text Grafting represents a significant step forward for Extremely Weak-Supervised Learning. It tackles the “Minority Class” problem not by searching harder for data that might not be there, nor by hallucinating data that looks fake, but by repurposing the data we already have.
By identifying “graftable” templates in raw text and using LLMs to fill them in, we can create training datasets that are high-quality, diverse, and crucially, in-distribution.
For students and practitioners, this paper offers a valuable lesson: sometimes the best way to generate new data isn’t to write it from scratch, but to build upon the structures that already exist.
Key Takeaways:
- Minority classes are hard to classify because they are rare (hard to mine) and usually require specific domain knowledge (hard to synthesize).
- Text Grafting mines “templates” rather than full texts, identifying words with high potential for the target class.
- The method produces data that aligns with the real-world distribution, leading to better classifier performance.
- It is robust even when the target class has zero occurrences in the raw data.