Large Language Models (LLMs) have fundamentally changed how we approach Natural Language Processing (NLP). One of their most powerful features is In-Context Learning (ICL). Instead of fine-tuning the model’s billions of parameters—which is expensive and computationally heavy—you simply provide a few examples of the task within the prompt itself. For instance, to teach a model to classify sentiment, you might provide three examples of movie reviews with their labels before asking it to classify a new review.
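For concreteness, a minimal sentiment prompt of this kind might look like the sketch below. The reviews and labels are made up for illustration; they are not taken from any particular dataset.

```python
# A made-up 3-shot ICL prompt for sentiment classification.
prompt = """Review: An absolute joy from start to finish.
Sentiment: positive

Review: Two hours of my life I will never get back.
Sentiment: negative

Review: The cast tries hard, but the script goes nowhere.
Sentiment: negative

Review: A quiet, beautifully shot character study.
Sentiment:"""
# The LLM is expected to continue the text with a label, e.g. "positive".
```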

However, there is a catch. ICL is notoriously sensitive to the specific examples you choose. Pick the wrong examples, and the model’s accuracy tanks. Furthermore, in real-world scenarios, you often have a massive pile of unlabeled data but a very limited budget to pay human experts to annotate it. If you can only afford to label 20 examples out of 10,000, which 20 should you pick to get the best performance?

This is the problem of Selective Annotation for Low-Budget Active Learning.

In this post, we will explore a fascinating solution presented in the paper “CoverICL: Selective Annotation for In-Context Learning via Active Graph Coverage.” The researchers propose a method that combines the strengths of graph theory, uncertainty estimation, and diversity sampling to select the most high-impact examples for annotation.

The Problem: Finding the “Needle in the Haystack” Examples

Before we dive into the solution, let’s formalize the problem. You have a massive unlabeled dataset (\(\mathcal{U}\)) and a small budget (\(B\)). You need to select \(B\) examples to send to a human oracle (an annotator) to get labels. Once labeled, these examples form your set \(\mathcal{L}\), which is used to construct prompts for an LLM.

Figure 2: The studied problem setting. We must select \(B\) examples from \(\mathcal{U}\) to be annotated by an oracle, which are then used for k-shot ICL inference.

As illustrated in Figure 2, the pipeline is straightforward, but the selection strategy is critical.

The Two Schools of Thought

Historically, Active Learning (AL) strategies fall into two main categories:

  1. Diversity Sampling: The logic here is that your few-shot examples should represent the entire landscape of your data. You don’t want 20 examples that are all virtually identical. You want to cover the semantic space.
  2. Uncertainty Sampling: This approach suggests you should label the examples the model finds most confusing. If the model is 50/50 on a specific sentence, knowing the true label provides the most information gain.

The Low-Budget Trap

In standard machine learning where you fine-tune the model, uncertainty sampling is often king. However, in the low-budget ICL setting, uncertainty sampling is unreliable. Why? Because without any initial labeled data (or very little), the model’s uncertainty estimates are noisy. It might be “confident” about something it’s actually wrong about, or “uncertain” about everything.

On the other hand, purely diversity-based methods might waste the budget on examples that are statistically representative but actually quite easy for the model to solve on its own.

CoverICL bridges this gap. It argues that we should focus on diversity, but only within the regions where the model is struggling.

The Core Method: CoverICL

The researchers introduce CoverICL, an algorithm that models the data as a graph. The central idea is to build a “nearest-neighbor graph” of data points and then solve a “Maximum Coverage” problem to find the most representative examples among the difficult ones.

Let’s break down the architecture of this solution step-by-step.

Figure 3: The CoverICL algorithm. It builds a graph, estimates uncertainty to find hard examples, and solves MaxCover to select annotations.

Figure 3 provides the roadmap. We move from raw candidate examples to a semantic graph, filter for “hard” examples, and then apply a selection algorithm.

Step 1: Graph Construction

First, CoverICL constructs a semantic graph \(\mathcal{G}_m\).

  • Every unlabeled example is a node in this graph.
  • Edges are drawn between nodes based on semantic similarity.
  • Specifically, they calculate embeddings for all sentences (using a model like SBERT) and connect each node to its \(m\)-nearest neighbors.

This graph captures the underlying structure of the data manifold. If two nodes are connected, they are semantically similar.
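To make this concrete, here is a minimal sketch of how such an \(m\)-nearest-neighbor graph could be built with off-the-shelf tools. The embedding model name and the choice of \(m\) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: build an m-nearest-neighbor semantic graph over the unlabeled pool.
# Requires sentence-transformers and scikit-learn; the model name and m are
# illustrative choices, not necessarily those used in the paper.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors


def build_semantic_graph(sentences, m=10):
    encoder = SentenceTransformer("all-mpnet-base-v2")  # any SBERT-style encoder
    embeddings = encoder.encode(sentences, normalize_embeddings=True)

    # m + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=m + 1, metric="cosine").fit(embeddings)
    _, indices = nn.kneighbors(embeddings)

    # adjacency[i] = indices of the m nearest neighbors of example i
    adjacency = {i: set(neigh[1:].tolist()) for i, neigh in enumerate(indices)}
    return embeddings, adjacency
```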

Step 2: Identifying Hard Examples (Uncertainty)

Next, the system needs to know which parts of the graph are “hard” for the LLM. Since we are in a low-resource setting, we can’t train a classifier to tell us this. Instead, we ask the LLM itself.

CoverICL constructs a basic prompt (either zero-shot or using a tiny random set of initial examples) and asks the LLM to predict labels for the unlabeled set. We then look at the model’s confidence.

Figure 4: Estimating uncertainty using Negative Log Likelihood (NLL). High NLL variance or low confidence indicates a hard example.

As shown in Figure 4, we can measure uncertainty using metrics like the Negative Log-Likelihood (NLL) of the predicted tokens.

  • Easy Example: The model assigns a very high probability to one label and low probability to others.
  • Hard Example: The model is “confused,” assigning similar probabilities to conflicting labels, or low probability to everything.

We sort all unlabeled examples by their uncertainty scores and pick the top portion (e.g., top 50%) to form the Hard Set (\(\mathcal{U}_h\)).
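A minimal sketch of this filtering step is shown below. Here `predict_with_nll` is a hypothetical helper that queries the LLM once per example and returns its predicted label together with the NLL of that prediction.

```python
import numpy as np


def select_hard_set(examples, predict_with_nll, hard_fraction=0.5):
    """Rank unlabeled examples by LLM uncertainty and keep the top fraction.

    `predict_with_nll` is a hypothetical callable: example -> (label, nll),
    where nll is the negative log-likelihood of the predicted label tokens.
    """
    nlls = np.array([predict_with_nll(x)[1] for x in examples])
    ranked = np.argsort(-nlls)                 # highest NLL first = most uncertain
    num_hard = int(len(examples) * hard_fraction)
    return set(ranked[:num_hard].tolist())     # indices forming the hard set U_h
```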

Step 3: Building the “Active Graph”

This is the clever innovation of CoverICL. Instead of running diversity sampling on all data, or uncertainty sampling on isolated points, CoverICL builds an Active Graph.

The Active Graph consists of:

  1. The nodes from the Hard Set (\(\mathcal{U}_h\)).
  2. The edges connecting these hard nodes to their neighbors.

This graph effectively represents the “difficult regions” of the data landscape. We want our annotated examples to “cover” these regions.
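Continuing the sketch above, the active graph can be represented simply as a map from each candidate node to the hard examples in its egonet (the node plus its neighbors); these are the "coverage sets" used in the next step.

```python
def build_coverage_sets(adjacency, hard_indices):
    """For each candidate node, compute the hard examples it would cover:
    the node itself plus its graph neighbors, restricted to the hard set."""
    coverage = {}
    for node, neighbors in adjacency.items():
        egonet = neighbors | {node}
        coverage[node] = egonet & hard_indices
    return coverage
```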

Step 4: Maximum Coverage (MAXCOVER)

Now we have a graph of difficult examples. We have a budget \(B\) (e.g., 20 examples). We want to pick 20 nodes such that the number of distinct hard examples connected to these selected nodes is maximized.

In graph theory, this is the Maximum Coverage Problem. If we select a node to be annotated, we assume it “covers” itself and its immediate neighbors (its “egonet”). The logic is that if we know the true label for a central node, that information likely propagates to its similar semantic neighbors, clarifying the confusion for that local cluster.

The objective function looks like this:

\[
\max \; \sum_{x_j \in \mathcal{U}_h} c_j
\]

Here, \(c_j\) is an indicator variable that is 1 if a hard example \(x_j\) is “covered” by our selection, and 0 otherwise. We want to maximize the total count of covered hard examples.

This maximization is subject to the budget constraint:

\[
\sum_{x_i} s_i \le B, \qquad c_j \le \sum_{x_i :\, x_j \in N(x_i)} s_i \;\; \forall x_j \in \mathcal{U}_h, \qquad s_i, c_j \in \{0, 1\}
\]

where \(s_i\) indicates whether node \(x_i\) is selected for annotation and \(N(x_i)\) is its egonet (the node plus its neighbors).

Basically, we can select at most \(B\) sets (nodes), and a node is only considered “covered” if it belongs to the neighbor set of a selected node.

The Greedy Solution: Solving MAXCOVER exactly is NP-hard, so finding the true optimum is computationally intractable for large graphs. However, a simple "Greedy" algorithm approximates the optimal solution very well, guaranteeing at least \(1 - 1/e \approx 63\%\) of the optimal coverage.

  1. Find the one node that covers the most currently uncovered hard examples.
  2. Add it to our selected list.
  3. Mark its neighbors as “covered.”
  4. Repeat until the budget is used up.

This approach ensures we select examples that are diverse (because we pick nodes covering different parts of the graph) and informative (because we focus on the hard set).
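Putting it together, here is a minimal sketch of the greedy selection loop over the coverage sets built above: a standard greedy MaxCover implementation, not the authors' exact code.

```python
def greedy_max_cover(coverage, budget):
    """Greedy (1 - 1/e)-approximation for Maximum Coverage: repeatedly pick
    the node that covers the most still-uncovered hard examples."""
    selected, covered = [], set()
    for _ in range(budget):
        best = max(coverage, key=lambda node: len(coverage[node] - covered))
        gain = coverage[best] - covered
        if not gain:
            break                              # nothing new left to cover
        selected.append(best)
        covered |= gain
    return selected


# Usage sketch, combining the previous helpers:
# embeddings, adjacency = build_semantic_graph(sentences, m=10)
# hard = select_hard_set(sentences, predict_with_nll)
# coverage = build_coverage_sets(adjacency, hard)
# to_annotate = greedy_max_cover(coverage, budget=20)
```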

Iterative Improvement: CoverICL+

The authors also propose a variant called CoverICL+. Instead of selecting all \(B\) examples at once, it splits the budget into \(T\) iterations.

  1. Select a batch of examples.
  2. Get them annotated.
  3. Re-estimate uncertainty using the new examples in the prompt.
  4. Update the Hard Set and Active Graph.
  5. Repeat.

This allows the model to refine its understanding of what is “hard” as it learns, making the subsequent selections even more targeted.
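Under the same assumptions as the sketches above (hypothetical `estimate_nll` and `annotate` helpers for LLM scoring and oracle labeling), the iterative variant could look roughly like this:

```python
def cover_icl_plus(examples, adjacency, budget, num_rounds,
                   estimate_nll, annotate):
    """Rough sketch of the iterative variant: spend the budget over several
    rounds, re-estimating uncertainty with the examples labeled so far."""
    labeled = {}                                # example index -> oracle label
    per_round = budget // num_rounds
    for _ in range(num_rounds):
        # Re-estimate uncertainty with the current labeled set in the prompt.
        nlls = {i: estimate_nll(examples[i], labeled)
                for i in range(len(examples)) if i not in labeled}
        # Rebuild the hard set (top half by NLL) and the coverage sets.
        hard = set(sorted(nlls, key=nlls.get, reverse=True)[:len(nlls) // 2])
        coverage = {i: (adjacency[i] | {i}) & hard for i in nlls}
        # Spend this round's budget with greedy MaxCover, then query the oracle.
        for i in greedy_max_cover(coverage, per_round):
            labeled[i] = annotate(examples[i])
    return labeled
```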

Experimental Results

Does this graph-theoretic approach actually work better than just picking random or diverse examples? The authors tested CoverICL across ten datasets covering topic classification, sentiment analysis, natural language inference, summarization, and math reasoning, using seven different LLMs (ranging from 1.3B to 65B parameters).

Performance Comparison

The results were consistently positive.

Figure 1: Comparison of accuracy for CoverICL vs Diversity, Uncertainty, and Random baselines on GPT-J and GPT-Neo.

As Figure 1 demonstrates, CoverICL (in blue) consistently outperforms Diversity (red), Uncertainty (beige), and Random (grey) strategies. On average, CoverICL improved accuracy by 2–4.6% over existing active learning methods. While small percentages might seem minor, in the world of few-shot learning where margins are thin, this is a significant boost.

We can look deeper into the specific numbers in Table 1:

Table 1: Detailed accuracy table comparing CoverICL to methods like Fast-Vote-K, IDEAL, and Patron.

In the “Average” column on the far right, CoverICL achieves 65.45%, clearly surpassing the next best method (Active-K means at 63.45%) and significantly beating the random baseline (61.40%). It excels particularly in sentiment analysis and topic classification.

Budget Efficiency

One of the most compelling arguments for CoverICL is cost efficiency. The goal of Active Learning is to pay for as few labels as possible.

Figure 5: Performance curves showing accuracy vs. budget. CoverICL+ reaches high accuracy with fewer examples.

Figure 5 plots accuracy against the budget (\(B\)).

  • Look at the green line (CoverICL+). It rises sharply.
  • In many cases, CoverICL+ with a budget of 10 examples achieves the same or better accuracy than other methods using 20 examples.
  • In practice, this means the selection strategy can cut your annotation costs roughly in half.

Visualizing the Process

To understand why this works, it helps to visualize the vector space.

Figure 7: Visualization of embeddings. Green represents confident predictions, red represents uncertain ones. The stars show selected examples.

In Figure 7, the authors visualize the data using PCA. The color gradient represents uncertainty:

  • Red: The model is uncertain.
  • Green: The model is confident.

In the top left (0-shot), we see a mix of colors. The model is somewhat confused. In the top right (5-shot selected by CoverICL), notice the black stars. These are the examples CoverICL selected. They are strategically placed in clusters. Most importantly, look at the color shift. After providing just those 5 examples, the map turns predominantly green. The model has successfully resolved its uncertainty across the majority of the dataset because the selected examples effectively “covered” the confusing regions.

Robustness and Generality

The authors pushed the method further to see if it holds up under different conditions.

Generation Tasks: It’s not just for classification. They tested it on Summarization (XSUM) and Math Reasoning (GSM8K).

Table 2: Results for generation tasks. CoverICL improves RougeL scores and Math Accuracy over baselines.

Table 2 shows that for math reasoning (GSM8K) using LLaMA-65B, CoverICL improved accuracy from 45.04% (Vote-k) to 49.08%. Math tasks are notoriously sensitive to bad prompts, so this stability is valuable.

Different Embeddings: Does it matter which model you use to build the graph?

Table 4: Ablation study on embedding models. CoverICL performs best regardless of whether SBERT, RoBERTa, or BERT is used.

Table 4 confirms that while the choice of embedding model (SBERT vs. BERT vs. RoBERTa) changes the absolute numbers, CoverICL (bottom row) remains the winner in almost every configuration. The method is robust to the underlying semantic representation.

Why Does It Work?

The paper offers a theoretical intuition summarized by Theorem 1 (though we won’t derive the proofs here). The gist is that CoverICL acts as an approximation of diversity sampling over a specific manifold.

If the uncertainty estimation is perfect, CoverICL focuses purely on the decision boundaries where the model is struggling. However, even if the uncertainty estimation is noisy (random), the graph-based MAXCOVER objective ensures that the selected examples are well-separated and representative of the underlying data distribution.

This dual nature is its superpower:

  1. When uncertainty is good: It acts like a surgical tool, fixing specific model blind spots.
  2. When uncertainty is bad: It falls back to being a robust diversity sampler, ensuring you don’t pick redundant examples.

Conclusion

The prompt engineering era has taught us that LLMs are only as good as the context we give them. CoverICL provides a rigorous, mathematical framework for selecting that context.

By treating the dataset as a graph and applying the “Maximum Coverage” principle to the model’s areas of uncertainty, it allows students, researchers, and engineers to maximize model performance while minimizing annotation costs. Whether you are working with a massive 65B parameter model or a smaller 1.3B model, the lesson is clear: don’t just pick examples randomly. Build a graph, find the confusion, and cover it.

Key Takeaways

  • Context Matters: In-Context Learning acts like a non-parametric regression; the quality of “training” examples determines the inference quality.
  • Combine Strategies: Pure diversity or pure uncertainty sampling usually fails in low-budget settings. Combining them via Active Graph Coverage succeeds.
  • Efficiency: CoverICL can match state-of-the-art performance with roughly half the annotation budget.
  • Versatility: The method works across classification, summarization, and reasoning tasks without requiring parameter updates (fine-tuning).