In the world of machine learning, we are currently living in a paradox. We have access to models pre-trained on trillions of tokens—large language models that seem to know everything. Yet, when we want to solve a specific, real-world business problem—like classifying customer support tickets or detecting specific types of toxic speech—we often hit a wall.

The wall is labeled data.

While unlabeled text is everywhere, high-quality labeled data is scarce, expensive, and slow to produce. You usually need human experts to sit down and categorize thousands of examples to fine-tune a model.

This brings us to a crucial research paper: “Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models” by Christopher Schröder and Gerhard Heyer. This work tackles the data scarcity problem head-on. The researchers propose a method called HAST that combines human expertise with machine confidence, allowing models to match results that previously required several times as many labels, using as few as 130 labeled examples.

In this post, we will deconstruct this paper, exploring how it merges Active Learning with Self-Training to create a highly efficient text classification pipeline.

The Twin Engines: Active Learning and Self-Training

To understand the innovation in this paper, we first need to understand the two specific techniques it attempts to harmonize: Active Learning and Self-Training.

Active Learning: The Human in the Loop

In a traditional supervised learning setup, you might randomly select 10,000 documents, label them all, and train a model. But not all documents are equally useful. Some are repetitive; others are obvious.

Active Learning (AL) is a strategy where the model “asks” for help. Instead of labeling random data, the model looks at the pool of unlabeled data and identifies the instances it is least certain about. It then queries an “oracle” (a human annotator) to label just those difficult instances.
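
To make this concrete, here is a minimal sketch of one common uncertainty criterion, least-confidence sampling. It illustrates the general idea rather than the paper's exact query strategy; the `probs` array is a stand-in for any classifier's predicted class probabilities.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n unlabeled instances the model is least sure about.

    probs: shape (num_unlabeled, num_classes), predicted class probabilities.
    """
    confidence = probs.max(axis=1)      # probability of the predicted class
    return np.argsort(confidence)[:n]   # lowest confidence = most uncertain

# Example: ask the human annotator to label the 10 most uncertain documents.
probs = np.random.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000)  # stand-in model output
to_label = least_confidence_query(probs, n=10)
```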

Self-Training: The Machine Feedback Loop

Self-Training (ST) is a semi-supervised technique where the model acts as its own teacher. Once a model is partially trained, it makes predictions on the remaining unlabeled data. If the model is extremely confident in a prediction (e.g., “I am 99.9% sure this is a positive review”), it assigns that prediction as a “pseudo-label.” These pseudo-labeled examples are then added to the training set.
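
A minimal sketch of this idea, assuming a plain confidence threshold (the 0.99 value is illustrative, not taken from the paper):

```python
import numpy as np

def pseudo_label(probs: np.ndarray, threshold: float = 0.99):
    """Pseudo-label the unlabeled instances the model is very confident about.

    Returns (indices, predicted labels) for instances whose top class
    probability exceeds the threshold.
    """
    confidence = probs.max(axis=1)
    predicted = probs.argmax(axis=1)
    keep = confidence >= threshold
    return np.where(keep)[0], predicted[keep]
```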

The Conflict and the Opportunity

These two methods seem contradictory at first glance.

  • Active Learning focuses on uncertainty (what the model doesn’t know).
  • Self-Training focuses on certainty (what the model thinks it knows).

However, as illustrated in the figure below, they can actually work in perfect harmony.

Figure 1: Active learning (a), and active learning with interleaved self-training (b). For active learning, the most uncertain samples are labeled by the human annotator, while for self-training pseudo-labels are obtained from the current model using the most certain samples.

In Part (a), we see the standard Active Learning loop: the model selects uncertain samples for the human. In Part (b), the authors propose an interleaved approach. The human handles the “hard” cases to correct the model’s boundaries, while the model automatically labels the “easy” cases to bulk up the training data. This creates a powerful feedback loop where the labeled dataset grows rapidly without requiring extra human effort.
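
To see how the two loops interlock, here is a runnable toy version of the interleaved procedure from Figure 1(b), using scikit-learn on synthetic data instead of a transformer. The query strategy (least confidence), the 0.99 pseudo-labeling threshold, and the budget values are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for a text classification task: 2,000 "documents", 3 classes.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=30, replace=False))       # 30 seed labels
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
pseudo = {}                                                      # index -> pseudo-label

for _ in range(10):                                              # 10 rounds of active learning
    # Train on the human labels plus the pseudo-labels gathered in the previous round.
    # (HAST additionally down-weights pseudo-labels; more on that later in the post.)
    idx = labeled + list(pseudo)
    y_train = np.concatenate([y[labeled], np.fromiter(pseudo.values(), dtype=int)])
    model = LogisticRegression(max_iter=1000).fit(X[idx], y_train)

    # Active learning step: hand the 10 least confident instances to the oracle.
    probs = model.predict_proba(X[unlabeled])
    queried = [unlabeled[i] for i in np.argsort(probs.max(axis=1))[:10]]
    labeled += queried                                           # the oracle supplies y[queried]
    unlabeled = [i for i in unlabeled if i not in set(queried)]

    # Self-training step: pseudo-label the most confident remaining instances.
    probs = model.predict_proba(X[unlabeled])
    sure = np.where(probs.max(axis=1) > 0.99)[0]
    pseudo = {unlabeled[i]: int(probs[i].argmax()) for i in sure}

print(f"{len(labeled)} human labels, {len(pseudo)} pseudo-labels after the final round")
```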

The Landscape of Self-Training

While the idea of combining AL and ST isn’t brand new, the authors identified significant flaws in how previous research approached it. They conducted a reproduction study of four major self-training strategies:

  1. UST (Uncertainty-aware Self-Training): Uses Monte Carlo dropout (randomly disabling neurons at prediction time) to estimate uncertainty when selecting instances for pseudo-labeling.
  2. AcTune: Uses clustering to find diverse samples but relies on a validation set to tune hyperparameters—something you rarely have in a low-resource scenario.
  3. VERIPS: A rigorous method that verifies pseudo-labels but uses strict confidence thresholds.
  4. NeST (Neighborhood-regularized Self-Training): Uses the embedding space (the geometric relationship between text vectors) to smooth out predictions.

The table below summarizes these approaches. Note the differences in how they select pseudo-labels and whether they are designed for the specific constraints of Active Learning (where we don’t have massive validation sets).

Table 1: Comparing the four most relevant self-training approaches in terms of pseudo-label selection, self-training, and experiment setting.

The authors argue that many of these methods suffer from unrealistic settings. For example, some methods require a validation set of 1,000 labeled examples to tune parameters. If you have 1,000 labels, you might not need Active Learning in the first place! Furthermore, rigid confidence thresholds (e.g., “only accept if probability > 0.9”) often fail because neural models are frequently miscalibrated: a predicted probability of 0.9 does not mean the prediction is correct 90% of the time.

Introducing HAST: Hard-Label Neighborhood-Regularized Self-Training

To overcome these limitations, the authors introduce HAST. This new method is designed to be sample-efficient and robust, even without large validation sets.

HAST is built on three main pillars: Contrastive Representation Learning, Neighborhood-Based Selection, and Dynamic Weighting.

1. Contrastive Representation Learning (SetFit)

Standard language models like BERT produce embeddings (vector representations of text), but the distance between these vectors doesn’t always perfectly correspond to semantic similarity.

HAST leverages Contrastive Learning, specifically using the SetFit framework. SetFit fine-tunes the underlying sentence embeddings so that examples from the same class are pulled close together in vector space, while examples from different classes are pushed apart. This creates a “cleaner” map of the data, making the next step—finding neighbors—much more reliable.
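
Here is a rough sketch of contrastive fine-tuning of sentence embeddings with the sentence-transformers library, in the spirit of SetFit: same-class pairs get a similarity target of 1, different-class pairs a target of 0. Treat it as an illustration; the texts, model name, and training settings are placeholders, and the actual SetFit library wraps this step (plus a classification head) in its own API.

```python
from itertools import combinations

from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

texts = ["great movie", "loved every minute", "terrible plot", "waste of time"]
labels = [1, 1, 0, 0]

# Build contrastive pairs: target 1.0 if both texts share a class, 0.0 otherwise.
pairs = [
    InputExample(texts=[texts[i], texts[j]], label=float(labels[i] == labels[j]))
    for i, j in combinations(range(len(texts)), 2)
]

model = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
loader = DataLoader(pairs, shuffle=True, batch_size=4)
loss = losses.CosineSimilarityLoss(model)

# One epoch is enough for an illustration; SetFit uses more careful pair
# sampling and then fits a lightweight classification head on top.
model.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)

embeddings = model.encode(texts)  # the embedding space now reflects the classes
```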

2. Pseudo-Label Selection via Neighbors

Instead of just asking the model, “Are you confident?”, HAST asks, “Do your neighbors agree with you?”

This utilizes a k-Nearest Neighbors (KNN) vote. For a specific unlabeled instance \(x_i\), HAST checks the \(k\) closest labeled examples in the embedding space.

The selection rule is defined mathematically as follows:

Equation 1: The selection criterion for pseudo-labels, based on model confidence and KNN agreement.

Here is what this equation tells us:

  • We select an instance for pseudo-labeling (\(\mathbb{1}_{PL}(x_i) = 1\)) if and only if:
  1. The model’s confidence (\(s_i\)) is greater than 0.5 (a very low, lenient bar).
  2. The model’s predicted label (\(\hat{y}_i\)) matches the majority vote of its neighbors (\(\hat{y}_i^{knn}\)).
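
Written out, the rule these two conditions describe looks roughly like this (a reconstruction from the description above; the paper's exact notation may differ):

\[
\mathbb{1}_{PL}(x_i) =
\begin{cases}
1 & \text{if } s_i > 0.5 \;\text{and}\; \hat{y}_i = \hat{y}_i^{knn} \\
0 & \text{otherwise}
\end{cases}
\]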

This is a clever “sanity check.” Even if the model is slightly unsure (e.g., 60% confidence), if all the surrounding data points say “Class A,” HAST trusts the neighborhood and assigns the pseudo-label. This allows HAST to gather many more pseudo-labels than methods relying on strict 90%+ confidence thresholds.
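
A minimal sketch of this selection step, assuming the embeddings come from the contrastively fine-tuned model and using scikit-learn's KNN classifier for the neighborhood vote (HAST's actual implementation details may differ):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def select_pseudo_labels(emb_unlabeled, probs, emb_labeled, y_labeled, k=5):
    """Accept a pseudo-label only if the model is at least mildly confident (> 0.5)
    AND the majority vote of the k nearest labeled neighbors agrees with it."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(emb_labeled, y_labeled)
    knn_vote = knn.predict(emb_unlabeled)        # majority label among the neighbors
    model_pred = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    accept = (confidence > 0.5) & (model_pred == knn_vote)
    return np.where(accept)[0], model_pred[accept]

# Example with random stand-in embeddings and probabilities:
rng = np.random.default_rng(0)
idx, labels = select_pseudo_labels(
    emb_unlabeled=rng.normal(size=(100, 16)),
    probs=rng.dirichlet([1.0, 1.0, 1.0], size=100),
    emb_labeled=rng.normal(size=(30, 16)),
    y_labeled=rng.integers(0, 3, size=30),
)
```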

3. Handling Imbalance with Weighting

A major risk in self-training is runaway bias. If the model is good at detecting “Class A” but bad at “Class B,” it will generate thousands of pseudo-labels for Class A. Retraining on this skewed data will make the model ignore Class B entirely.

HAST introduces a dynamic weighting scheme to solve this.

First, it calculates a raw imbalance score \(z\) for each class, based on the total number of pseudo-labels (\(N\)) and the number of pseudo-labels assigned to that class (\(h_c\)).

Equation 2: The calculation of the imbalance score \(z\).

If a class has more pseudo-labels than the average (\(N/C\), where \(C\) is the number of classes), \(z\) becomes negative. If it has fewer, \(z\) becomes positive.

Next, this score is passed through a sigmoid function to create a class weight \(\alpha_c\):

Equation 3: The class weight \(\alpha_c\), obtained by passing \(z\) through a sigmoid function.

This essentially down-weights classes that are over-represented in the pseudo-labels and up-weights rare classes.
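
Putting the two steps together, one form consistent with this description is the following, where \(C\) is again the number of classes; the exact normalization of \(z\) in the paper may differ, while the sigmoid step is as described:

\[
z_c = \frac{N/C - h_c}{N/C},
\qquad
\alpha_c = \sigma(z_c) = \frac{1}{1 + e^{-z_c}}
\]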

Finally, HAST applies a global damping factor \(\beta\). Since pseudo-labels are generated automatically, they are inherently “noisier” than human labels. We shouldn’t trust them as much as the gold-standard data provided by the active learning oracle.

Equation 4: The final weight for pseudo-labeled examples, combining \(\alpha_c\) with the damping factor \(\beta\).

The parameter \(\beta\) allows the researchers to scale down the influence of all pseudo-labels relative to the human labels. In their experiments, they found a \(\beta\) of 0.1 was effective, meaning a pseudo-label carries 10% of the voting power of a human label.
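
Here is a small sketch of the whole weighting scheme in code. The normalization of \(z\) is the same illustrative choice as above, and \(\beta = 0.1\) is the value reported as effective in the paper; human-labeled examples keep a weight of 1.0.

```python
import numpy as np

def pseudo_label_weights(counts: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """Per-class weights for pseudo-labeled examples.

    counts[c] = number of pseudo-labels currently assigned to class c.
    Over-represented classes end up with alpha < 0.5, rare classes with
    alpha > 0.5, and everything is scaled down by the damping factor beta.
    """
    N, C = counts.sum(), len(counts)
    z = (N / C - counts) / (N / C)        # negative for over-represented classes (illustrative normalization)
    alpha = 1.0 / (1.0 + np.exp(-z))      # squash into (0, 1) with a sigmoid
    return beta * alpha                   # human labels keep weight 1.0

# Example: class 0 dominates the pseudo-labels, so it gets the smallest weight.
print(pseudo_label_weights(np.array([900, 80, 20])))   # roughly [0.015, 0.068, 0.072]
```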

The Experiment: Can We Learn with Just 130 Examples?

The researchers tested HAST against the reproduced methods (UST, AcTune, VERIPS, NeST) and a baseline (Standard Active Learning with no Self-Training).

The Setup

They used four diverse datasets covering news categorization, sentiment analysis, and question classification.

Table 2: Key information about the examined datasets. Abbreviations: N (News), S (Sentiment), Q (Questions).

The constraints were strict:

  • Start with only 30 labeled instances.
  • Perform 10 rounds of Active Learning.
  • In each round, label only 10 new instances.
  • Total Budget: 30 + (10 × 10) = 130 instances.

This is an incredibly “low-resource” setting. For context, typical deep learning training uses tens of thousands of examples.

The Results

The performance was measured using Accuracy (for balanced data) and F1-score (for imbalanced data). The table below details the final performance after 130 samples.

Table 3: Classification performance after the final iteration. The table breaks down results by query strategy, classifier (BERT vs SetFit), and self-training approach.

Key Takeaways from the Results:

  1. HAST dominates with SetFit: When using the contrastive SetFit model, HAST (bottom rows) consistently achieves top-tier results, reaching 98.4% F1 on DBPedia and 88.2% Accuracy on IMDB.
  2. Self-Training works: Almost all self-training methods outperformed the “No Self-Training” baseline.
  3. NeST is a strong competitor: The NeST method also performed very well, particularly with standard BERT models. Both NeST and HAST leverage the embedding space, suggesting that geometry is key to sample efficiency.

Visualizing the Learning Curve

It’s not just about the final score; it’s about how fast you get there. The learning curves below show performance (y-axis) vs. the number of labeled instances (x-axis).

Figure 2: Learning curves per model, query strategy, and dataset. The horizontal red line represents the performance of a model trained on 100% of the data.

Look at the SetFit columns (the right side of each pair). The purple line (HAST) shoots up incredibly fast. On the DBPedia dataset (second column), HAST reaches near-perfect performance almost immediately, effectively matching the red line (the model trained on the entire dataset) using only a tiny fraction of the data.

Robustness to Noise

One might worry: “What if the model generates wrong pseudo-labels?”

The researchers analyzed this by artificially injecting noise—flipping correct pseudo-labels to incorrect ones.
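
One simple way to simulate this kind of corruption is to flip a fraction \(\lambda\) of the pseudo-labels to a different class; here is a minimal sketch (the paper's exact procedure may differ):

```python
import numpy as np

def inject_label_noise(labels: np.ndarray, num_classes: int, lam: float, seed: int = 0) -> np.ndarray:
    """Flip a fraction `lam` of the labels to a randomly chosen *different* class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < lam
    # Adding a random offset in [1, num_classes - 1] guarantees the new label differs.
    noisy[flip] = (noisy[flip] + rng.integers(1, num_classes, size=flip.sum())) % num_classes
    return noisy
```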

Figure 3: The effect of label noise for NeST and HAST on AGN. The left side shows validation accuracy, the right side shows area under the curve.

The graphs above show that HAST (orange/purple lines) maintains high accuracy even as noise increases up to 20% (\(\lambda = 0.2\)). This suggests the weighting mechanism effectively dampens the impact of bad data, preventing the model from spiraling out of control.

Does the Weighting Really Matter?

Finally, the authors performed an ablation study to see if their complicated weighting math was actually necessary.

Table 6: Ablation analysis showing performance when removing different weighting components.

The results are clear. When they removed the class weighting (\(\alpha=1.0\)) or the pseudo-label down-weighting (\(\beta=1.0\)), performance generally dropped. The combination of addressing class imbalance and managing the noise ratio is crucial for the method’s success.

Comparison with State-of-the-Art

Perhaps the most impressive statistic from the paper is how HAST compares to other “low-resource” methods in the literature.

Table 4: Comparison with previous works that have investigated low-resource methods. Comparison includes sample sizes (N).

On the AG’s News dataset, HAST achieves 0.886 accuracy with 130 samples. A previous BERT-based active learning method required 525 samples to reach a comparable 0.904. On DBPedia, HAST matches the performance of a method using 420 samples, again using only 130.

This represents a 3x to 4x reduction in labeling effort compared to previous strong baselines.

Conclusion and Implications

The HAST approach demonstrates that we don’t necessarily need massive datasets to build high-performing text classifiers. By combining Contrastive Learning (to create a meaningful data map), Active Learning (to let humans solve the hard cases), and Self-Training (to let machines solve the easy cases), we can achieve state-of-the-art results with minimal human effort.

Why does this matter?

  1. Cost: It drastically reduces the budget required for data annotation.
  2. Accessibility: Smaller organizations without data teams can build custom classifiers.
  3. Efficiency: It proves that smaller models (like the 110M parameter models used here), when trained smartly, can be incredibly powerful tools.

As we move forward in the AI landscape, the focus is shifting from “how big is your model?” to “how efficient is your data?” Papers like this pave the way for a more sustainable, accessible approach to Natural Language Processing.