In the era of Large Language Models (LLMs), the paradigm of how we teach machines has shifted dramatically. We no longer always fine-tune models by updating millions of parameters; instead, we often rely on In-Context Learning (ICL). This involves feeding the model a few input-output examples (demonstrations) in the prompt, allowing it to “learn” the pattern on the fly.

However, there is a catch. For ICL to work well, the examples you choose matter—a lot. Typically, finding the best examples requires retrieving them from a massive dataset of already labeled examples. But what if you don’t have a massive labeled dataset? What if you have a huge pile of raw text and only enough budget to manually label 50 or 100 examples?

This is the problem tackled by the paper “Effective Demonstration Annotation for In-Context Learning via Language Model-Based Determinantal Point Process.” The researchers propose a method called LM-DPP that mathematically selects which examples are most worth annotating, balancing two critical factors: uncertainty and diversity.

In this post, we will break down how LM-DPP allows developers to achieve high performance with minimal data annotation, effectively “smart-sizing” the prompt engineering process.


The Problem: The Cost of Good Demonstrations

To understand the contribution of this paper, we first need to look at the standard ICL pipeline.

In a resource-rich scenario, you might have a dataset with thousands of labeled examples (e.g., sentiment analysis pairs). When a user asks a question, you use a Retriever (like Sentence-BERT) to find the examples most similar to the user’s query and insert them into the prompt. This is known as instance-level retrieval.
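To make this concrete, here is a minimal sketch of instance-level retrieval, assuming a small labeled pool and the sentence-transformers library; the model name, pool contents, and choice of k are illustrative, not the paper’s exact setup.

```python
# A minimal sketch of instance-level retrieval for ICL (illustrative only):
# embed a labeled pool once, then pull the k most similar examples into the
# prompt for each incoming query.
from sentence_transformers import SentenceTransformer, util

labeled_pool = [
    {"text": "The movie was a delight from start to finish.", "label": "positive"},
    {"text": "I want my two hours back.", "label": "negative"},
    {"text": "A serviceable thriller, nothing more.", "label": "neutral"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool_embeddings = encoder.encode(
    [ex["text"] for ex in labeled_pool], convert_to_tensor=True
)

def retrieve_demonstrations(query: str, k: int = 2):
    """Return the k labeled examples most semantically similar to the query."""
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    similarities = util.cos_sim(query_embedding, pool_embeddings)[0]
    top_k = similarities.topk(min(k, len(labeled_pool))).indices.tolist()
    return [labeled_pool[i] for i in top_k]

# Build the few-shot prompt for a new query.
query = "An utterly charming film."
demos = retrieve_demonstrations(query)
prompt = "\n".join(f"Review: {d['text']}\nSentiment: {d['label']}" for d in demos)
prompt += f"\nReview: {query}\nSentiment:"
```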

But in many real-world applications, we start with unlabeled data. Labeling data is expensive and time-consuming. If you can only afford to label a small subset, which ones should you pick?

  1. Random Selection: You pick examples blindly. This usually leads to poor coverage and mediocre performance.
  2. Traditional Active Learning: You pick examples the model is most “confused” about.
  3. Diversity Sampling: You pick examples that look different from each other to cover the data space.

The authors argue that none of these strategies alone is sufficient for LLMs. As illustrated below, their approach fundamentally changes the workflow. Instead of retrieving from a large labeled dataset, they insert a Selective Annotation step to curate a small, high-quality labeled set first.

Comparison of previous retrieval methods versus the proposed LM-DPP workflow.

The Solution: LM-DPP

The researchers introduce the Language Model-based Determinantal Point Process (LM-DPP). The name is a mouthful, but the concept is elegant.

The goal is to select a subset of data points to annotate that satisfies two conditions simultaneously:

  1. Low Uncertainty: The LLM should be somewhat familiar with the example (high confidence).
  2. High Diversity: The examples should be semantically distinct from one another to cover different aspects of the task.

Let’s break down the architecture of this method.

An illustration of the LM-DPP approach showing three main steps: Perplexity estimation, DPP Modeling, and Retrieval.

As shown in the figure above, the process involves three distinct steps:

  1. Scoring: Estimate perplexity for unlabeled data.
  2. Selection: Use conditional DPP to model uncertainty and diversity.
  3. Inference: Retrieve from this curated pool at test time.

Let’s dive deep into the mathematics and logic of steps 1 and 2.

Step 1: Measuring Uncertainty via Perplexity

In traditional Active Learning, we often select examples where the model is most uncertain (high entropy), assuming these will teach the model the most. However, the authors found that for In-Context Learning, the opposite is often true. LLMs act like data compressors; they perform better when prompted with examples that resemble the distribution they were pre-trained on.

Therefore, the metric used here is Perplexity. Lower perplexity means the model finds the text predictable and familiar. The scoring function \(r(\tilde{x})\) is defined as the reciprocal of perplexity:

\[ r(\tilde{x}) = \frac{1}{\operatorname{PPL}(\tilde{x})} \]

where \(\operatorname{PPL}(\tilde{x})\) is the perplexity the LLM assigns to the candidate example \(\tilde{x}\).

Here, a higher score \(r(\tilde{x})\) indicates lower uncertainty (or higher familiarity). The authors explicitly aim to select examples with high scores, avoiding outlier examples that might confuse the model.
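As a rough illustration of this scoring step, the sketch below computes \(r(\tilde{x})\) as the reciprocal of perplexity using Hugging Face transformers; GPT-2 is only a lightweight stand-in for the scoring LLM, and the helper name is my own, not the paper’s.

```python
# A rough sketch of Step 1 (assumed helper, not the paper's code): score each
# unlabeled candidate by the reciprocal of its perplexity under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def familiarity_score(text: str) -> float:
    """r(x) = 1 / PPL(x); higher means the LM finds the text more familiar."""
    inputs = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the model returns the mean token-level
    # cross-entropy, i.e. the log of the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return 1.0 / torch.exp(loss).item()

candidates = [
    "The food was great and the staff were friendly.",
    "Colorless green ideas sleep furiously qua zorp.",
]
# The fluent sentence should receive a noticeably higher familiarity score.
scores = {text: familiarity_score(text) for text in candidates}
```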

Step 2: Modeling Diversity with DPP

If we only picked examples with the lowest perplexity, we might end up with 100 variations of the exact same simple sentence. This is where the Determinantal Point Process (DPP) comes in.

DPP is a probabilistic model, used in physics and machine learning, for selecting subsets of items that are diverse. It relies on a Kernel Matrix (L) that encodes the pairwise similarity between items. The probability of selecting a subset is proportional to the determinant of the corresponding submatrix of L. If two items are very similar, that determinant shrinks, making it unlikely that both will be selected together. This effectively creates a “repulsion” between similar items.
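A tiny NumPy example (not from the paper) makes the repulsion effect concrete: for a two-item subset, the determinant of the kernel collapses as the items become more similar, so near-duplicates are rarely chosen together.

```python
# Toy illustration of DPP repulsion: the determinant of a 2x2 similarity
# kernel shrinks as the two items become more alike.
import numpy as np

def pair_determinant(similarity: float) -> float:
    L = np.array([[1.0, similarity],
                  [similarity, 1.0]])
    return float(np.linalg.det(L))

print(pair_determinant(0.10))  # ~0.99: dissimilar items, likely selected together
print(pair_determinant(0.95))  # ~0.10: near-duplicates, rarely selected together
```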

The authors modify the standard DPP to include the uncertainty score from Step 1. They define a trade-off parameter, \(\lambda\), to balance the need for certainty against the need for diversity.

The selected subset is the one that maximizes the trade-off objective

\[ f(S) = \lambda \sum_{i \in S} r_i + (1 - \lambda)\, \log \det(L_S), \]

where \(r_i\) is the familiarity score from Step 1 and \(L_S\) is the similarity kernel restricted to the candidate subset \(S\).

  • First Term (\(\lambda \sum r_i\)): This rewards subsets that have high “familiarity” scores (low uncertainty).
  • Second Term (\((1-\lambda) \dots\)): This rewards subsets that are geometrically diverse (high determinant).

By tuning \(\lambda\), the researchers can find the “Goldilocks” zone: examples that are diverse enough to cover the task space but standard enough that the LLM understands them well.
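Below is a minimal sketch of this trade-off objective, assuming it takes exactly the form written above (\(\lambda\) weighting familiarity, \(1-\lambda\) weighting the log-determinant); the kernel construction, toy data, and function names are illustrative rather than the paper’s implementation.

```python
# A minimal sketch of the trade-off objective:
# lambda * sum of familiarity scores + (1 - lambda) * log-determinant diversity.
import numpy as np

def subset_objective(S, r, L, lam=0.5):
    """Score a candidate subset S (list of indices) for annotation."""
    quality = r[S].sum()                               # favors familiar (low-perplexity) examples
    sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])  # favors geometrically diverse subsets
    diversity = logdet if sign > 0 else -np.inf
    return lam * quality + (1.0 - lam) * diversity

# Toy data: 4 candidates with unit-norm embeddings -> cosine-similarity kernel,
# plus a small ridge to keep the kernel positive definite.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
L = emb @ emb.T + 1e-6 * np.eye(4)
r = np.array([0.80, 0.30, 0.75, 0.50])  # familiarity scores from Step 1

print(subset_objective([0, 2], r, L))  # two familiar candidates
print(subset_objective([0, 1], r, L))  # a familiar candidate plus a less familiar one
```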

Step 3: Greedy Inference

Solving for the truly optimal subset is an NP-hard problem (computationally intractable for large datasets). However, because of the mathematical properties of DPPs, we can use a greedy algorithm.

The algorithm iteratively adds the one item to the set that maximizes the marginal gain in the log-determinant score.

\[ j = \arg\max_{i \in \mathcal{Z} \setminus S_{\mathrm{map}}} \big[\, f(S_{\mathrm{map}} \cup \{i\}) - f(S_{\mathrm{map}}) \,\big], \]

where \(S_{\mathrm{map}}\) is the set selected so far and \(\mathcal{Z}\) is the pool of unlabeled candidates.

This turns an intractable optimization into an approximation that runs in polynomial time, making it feasible to apply even on large unlabeled datasets.
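Continuing the sketch above (and reusing its `subset_objective`, `r`, and `L`), the greedy loop below adds whichever candidate yields the largest marginal gain until the annotation budget is exhausted. Recomputing the log-determinant from scratch each round is the naive version; fast greedy MAP solvers for DPPs update it incrementally instead.

```python
# Greedy selection for Step 3, reusing subset_objective, r, and L from the
# previous sketch: at each round, add the candidate with the largest marginal
# gain until the annotation budget is spent.
def greedy_select(r, L, budget, lam=0.5):
    selected, remaining = [], list(range(len(r)))
    current = 0.0  # objective of the empty set (log det of an empty matrix is 0)
    for _ in range(budget):
        gains = {j: subset_objective(selected + [j], r, L, lam) - current
                 for j in remaining}
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)
        current += gains[best]
    return selected

# Indices of the examples worth sending to human annotators.
to_annotate = greedy_select(r, L, budget=2)
```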


Experimental Results

Does this mathematical balancing act actually work? The authors tested LM-DPP across 9 Natural Language Understanding (NLU) tasks (like sentiment analysis and inference) and 2 Generation tasks (like summarization).

Performance on NLU Tasks

The table below compares LM-DPP against several baselines:

  • Random: Just picking random examples.
  • K-means: Clustering data and picking centroids (Diversity only).
  • Vote-k: A graph-based selection method (the previous state-of-the-art).

Table 1: Results with GPT-J and LLaMA-2-7B on NLU tasks comparing various annotation methods.

Key Takeaways from the Data:

  1. Consistency: LM-DPP (bottom rows for each model) consistently achieves the highest or second-highest accuracy across almost all datasets.
  2. Budget Efficiency: Even with a tiny budget of 16 annotated examples, LM-DPP significantly outperforms Random selection. On GPT-J, it achieves an average score of 63.67 compared to Random’s 60.31.
  3. Model Agnostic: The method works for both GPT-J (6B parameters) and LLaMA-2 (7B parameters), suggesting it generalizes well across different model families.

Scaling to GPT-3 (175B)

You might wonder if this only works for smaller open-source models. The authors also tested the method on the massive GPT-3 model.

Figure 5: Results of GPT-3-Turbo with 100 annotated examples showing LM-DPP consistently improving results.

As shown above, LM-DPP (the pink bars) consistently beats Random (blue) and Fast Vote-k (orange). In the TREC classification task, the improvement is substantial, jumping from roughly 74% to nearly 80%.

The Impact of Annotation Budget

One of the most compelling arguments for Active Learning is cost reduction. How does performance scale as we increase the number of annotated examples from 16 to 800?

Figure 4: Line charts comparing selection methods with increasing annotation counts (16 to 800).

In Figure 4, look at the pink line (LM-DPP).

  • Early Advantage: On datasets like RTE and HellaSwag, LM-DPP provides a massive jump in performance at the very start of the x-axis (small annotation budgets).
  • Sustained Performance: As data quantity increases, it remains competitive, often staying top-tier.
  • Stability: Unlike Random selection (light blue), which fluctuates unpredictably, LM-DPP offers a stable, upward trend.

Time Efficiency

Sophisticated selection algorithms often come with a heavy computational tax. However, the authors show that LM-DPP is surprisingly efficient compared to the previous state-of-the-art, Vote-k.

Figure 6: Bar chart comparing running times. Vote-k takes over 4000s while LM-DPP takes around 382s.

While “Random” selection is instant (0.3s), it performs poorly. Vote-k takes over an hour (4039s). LM-DPP, by contrast, finishes in about 6 minutes (382s)—a 10x speedup over the main competitor while delivering better accuracy.


Deep Dive: Why does it work?

The paper provides some fascinating analysis on why this specific combination of uncertainty and diversity is so effective.

The “Goldilocks” Trade-off

The researchers experimented with the \(\lambda\) parameter (the slider between uncertainty and diversity).

  • \(\lambda = 0\) (Pure Diversity): Selection favors maximally different examples, but some are outliers that confuse the model.
  • \(\lambda = 1\) (Pure Familiarity): Selection favors safe, predictable examples, but they end up as redundant variations of the same thing.
  • \(\lambda \approx 0.5\): This balance consistently yielded the best results, confirming that LLMs need a “syllabus” of examples that covers the whole subject (diversity) while remaining understandable (low uncertainty).

Case Study: Quality of Demonstrations

To visualize the difference, let’s look at an actual comparison of generated summaries from the XSUM dataset.

Figure 8: Case analysis in XSUM comparing Random and LM-DPP summaries.

In this example, the Random selection leads to a summary that has a low ROUGE score (43.24) and misses the nuance. LM-DPP selects demonstrations that help the model generate a much higher quality summary (ROUGE 58.06).

A Note of Caution: Notice the “FactCC” score in the image above. While LM-DPP improved the style and fluency (ROUGE), it actually scored lower on factual consistency in this specific instance. The authors note that by prioritizing diversity, the model might sometimes retrieve examples that are stylistically perfect but factually loose. This highlights a limitation: we must be careful when balancing diversity against strict factual grounding.


Conclusion

The LM-DPP paper presents a compelling workflow for the modern NLP practitioner. As we move away from massive fine-tuning and toward In-Context Learning, the quality of our prompts becomes the bottleneck.

This research shows that “Big Data” isn’t always the answer. “Smart Data” is. By mathematically modeling the tension between what the model knows (uncertainty) and what the model needs to see (diversity), LM-DPP allows us to achieve state-of-the-art results with a fraction of the annotation effort.

For students and engineers, the takeaways are clear:

  1. Don’t annotate blindly. Random selection is a wasted opportunity.
  2. Respect the model’s priors. Examples that the model finds “perplexing” are usually bad teachers for ICL.
  3. Use Math. Determinantal Point Processes provide a robust, non-heuristic way to ensure your data is diverse.

As LLMs continue to grow, efficient methods like LM-DPP will be essential for deploying these models in specialized, low-resource domains where every data point counts.