If you have ever played around with Large Language Models (LLMs) like GPT-4 or Llama, you have likely encountered In-Context Learning (ICL). It is the fascinating ability of these models to learn a new task simply by seeing a few examples in the prompt, without any gradient updates or weight changes.
For instance, if you want a model to classify movie reviews, you might provide three examples of reviews and their sentiment (Positive/Negative) before asking it to classify a fourth one. This process seems magical and, crucially, it seems “free” compared to fine-tuning a model.
But is it really free?
A recent research paper, “Rethinking the Evaluation of In-Context Learning for LLMs,” argues that we have been overlooking a massive hidden factor: the Demonstration Configuration (DC) Cost. Just as hyperparameter tuning takes computational resources in traditional machine learning, finding the perfect set of examples (and the perfect order) for a prompt costs computational power.
In this post, we will break down why the current way we evaluate ICL is flawed, explore the relationship between cost and accuracy, and look at a clever strategy proposed by the researchers to achieve state-of-the-art results with almost “zero” configuration cost.
The Anatomy of In-Context Learning
Before diving into the costs, let’s formalize what we are talking about. In a standard text classification task using ICL, the goal is to predict a label \(y\) for an input \(x\). To do this, we provide the model with a set of demonstration examples, \(C\).
The core equation looks like this:

\[
\hat{y} = \arg\max_{y \in \mathcal{Y}} P_{\mathrm{LLM}}(y \mid C, x)
\]

where \(\hat{y}\) is the predicted label and \(\mathcal{Y}\) is the set of candidate labels.
Here, \(C\) represents a set of \(k\) input-output pairs selected from a training set. The “magic” of ICL relies heavily on Demonstration Configuration, which consists of two steps:
- Example Selection: Which examples from the training data should we include in the prompt?
- Example Ordering: In what sequence should these examples appear? (Believe it or not, swapping the order of examples can drastically change the LLM’s output).
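To make this concrete, here is a minimal Python sketch of how a demonstration set \(C\) is turned into a prompt and how the label is chosen. The `score_fn` hook is a stand-in for a single LLM call returning log P(label | prompt); it is an illustrative placeholder, not an API from the paper.

```python
def build_prompt(demonstrations, query):
    """Concatenate the k (input, label) pairs in C, then append the unlabeled query."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

def predict(demonstrations, query, score_fn, labels=("Positive", "Negative")):
    """Return the label that maximizes score_fn(prompt, label): the core ICL equation."""
    prompt = build_prompt(demonstrations, query)
    return max(labels, key=lambda y: score_fn(prompt, y))
```

Every call to `score_fn` is one forward pass through the model. Demonstration Configuration is everything that happens before `predict` ever runs: deciding which pairs go into `demonstrations` and in what order.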
Researchers have developed various methods to optimize this. For example:
- TopK: Selects examples that are semantically similar to the input query using embeddings (a rough sketch follows this list).
- GlobalE / LocalE: Orders examples based on entropy metrics to prevent the model from being biased toward specific class labels.
- MDL (Minimum Description Length): Uses compression principles to find the most informative examples.
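As an illustration of the TopK idea (not the paper's exact implementation), the selection step can be as simple as ranking training examples by cosine similarity to the query in some embedding space; the embeddings come from whatever sentence encoder you have on hand.

```python
import numpy as np

def topk_select(query_vec, train_vecs, train_examples, k=4):
    """Pick the k training examples whose embeddings are closest to the query.

    query_vec:      (d,)   embedding of the test input
    train_vecs:     (n, d) embeddings of the candidate training examples
    train_examples: list of n (text, label) pairs aligned with train_vecs
    """
    # Cosine similarity between the query and every candidate example.
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top_idx = np.argsort(-sims)[:k]
    return [train_examples[i] for i in top_idx]
```

Notice that this step never touches the LLM at all, which is why TopK shows up later with a DC cost of essentially zero.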
The Problem: The Hidden Cost of Optimization
Traditionally, when researchers publish a new ICL method (e.g., “My new method SuperSelect is better than Random selection”), they compare Task Performance (Accuracy). If SuperSelect gets 85% accuracy and Random gets 80%, SuperSelect wins.
However, this comparison ignores the DC Cost—the number of times you have to call the LLM to find those optimal examples.
- Random Selection has a DC cost of effectively zero. You pick examples and go.
- Complex Methods (like GlobalE or MDL) might require running thousands of permutations through the LLM to verify which order works best before making the final prediction.
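To see where those calls come from, here is a back-of-the-envelope sketch (an illustrative calculation, not the paper's exact accounting): if an ordering-search method scores every permutation of \(k\) demonstrations on a small probe set, the call count explodes.

```python
from math import factorial

def ordering_search_cost(k, probe_size=50):
    """LLM calls needed to score every permutation of k demonstrations
    on a probe set of probe_size examples (illustrative assumption)."""
    return factorial(k) * probe_size

print(ordering_search_cost(4))  # 24 orderings * 50 probes = 1,200 calls
print(ordering_search_cost(8))  # 40,320 orderings * 50    = 2,016,000 calls
# Random selection, by contrast, makes no extra calls: its DC cost is ~0.
```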
The authors of this paper conducted a pilot study to investigate the relationship between this configuration cost and downstream accuracy.

As shown in Figure 1, there is a clear trend: Higher DC cost yields higher accuracy.
Look at the curve for “DataModel” (the purple line) or “MDL” (the green line). As the cost (x-axis) increases—meaning the method spends more compute searching for the best prompt—the accuracy (y-axis) climbs significantly.
This reveals a major bias in standard evaluations. If Method A outperforms Method B, is it because Method A is smarter, or simply because Method A was allowed to burn 100x more GPU cycles finding the perfect prompt? The paper argues that comparing methods without normalizing for cost is unfair.
A New Standard: Two-Dimensional Evaluation
To fix this, the researchers propose a Two-Dimensional Evaluation Paradigm. Instead of just reporting accuracy, we must report:
- Task Performance (Accuracy)
- Configuration Cost (Number of inference calls used to tune the prompt)
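One lightweight way to operationalize this, sketched below with hypothetical `method.configure` and `method.predict` interfaces, is to wrap the LLM client in a counter so that every call made while tuning the prompt is tallied and reported next to accuracy.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CountingLLM:
    """Wraps any prompt -> completion callable and tallies how often it is invoked."""
    client: Callable[[str], str]
    calls: int = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.client(prompt)

def evaluate(method, llm, test_set):
    """Report both dimensions: task accuracy and DC cost (LLM calls spent tuning).

    `method` is assumed to expose .configure(llm) -> demonstrations and
    .predict(demonstrations, x, llm) -> label; these names are illustrative.
    """
    counted = CountingLLM(llm)
    demos = method.configure(counted)   # selection/ordering may query the LLM
    dc_cost = counted.calls             # every call so far is configuration overhead
    correct = sum(method.predict(demos, x, llm) == y for x, y in test_set)
    return {"accuracy": correct / len(test_set), "dc_cost": dc_cost}
```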
Let’s look at how existing methods stack up when we apply this rigorous bi-objective lens.

In Table 1, notice the numbers in parentheses representing the DC cost (\(\times 10000\)).
- DataModel achieves high accuracy (0.7910 on SST2) but requires a massive amount of compute (Cost: 10.0).
- TopK is efficient (Cost: 0.00) but generally scores lower than the heavy hitters.
- TopK + MDL (a hybrid method) seems to strike a strong balance, achieving high accuracy with moderate costs.
What happens if we force everyone to have the same budget?
The researchers took it a step further. They re-ran the experiments, but this time they controlled the DC cost, forcing all methods to operate within the same computational budget.

Table 2 reveals some interesting shifts. When the budget is capped:
- Selection > Ordering: Methods that focus on selecting the right examples (like DataModel and TopK) consistently outperform methods that focus only on ordering (GlobalE, LocalE). This suggests that if you have a limited budget, you should spend it on finding good examples rather than worrying about their order.
- Hybrid is King: The hybrid methods (TopK+LocalE, TopK+MDL) dominate. They use a cheap heuristic (TopK) to narrow down the pool, and then use a more expensive method (MDL) to refine it.
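That two-stage pattern is easy to picture in code. The sketch below reuses `topk_select` from earlier and assumes a hypothetical `ordering_score` hook that spends a bounded number of LLM calls judging a candidate ordering (the real MDL/LocalE scoring is more involved).

```python
from itertools import islice, permutations

def hybrid_configure(query_vec, train_vecs, train_examples, ordering_score,
                     k=4, max_orderings=24):
    """Stage 1: cheap TopK narrows the pool (no LLM calls).
    Stage 2: spend a bounded LLM budget scoring orderings of those k examples."""
    pool = topk_select(query_vec, train_vecs, train_examples, k=k)
    candidates = islice(permutations(pool), max_orderings)  # cap the search budget
    return max(candidates, key=ordering_score)              # LLM calls happen inside ordering_score
```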
The Breakthrough: Transferability Across Models
So far, the news seems grim: if you want high accuracy, you have to pay a high computational price. But the researchers hypothesized a potential loophole.
Hypothesis: A demonstration that is “good” for one language model is likely “good” for another language model.
If this is true, we could use a small, cheap model (like GPT-2) to do the heavy lifting of finding the best examples, and then transfer that optimized prompt to a large, expensive model (like OPT-6.7B or GPT-3.5) for the final answer.
They tested this hypothesis by optimizing demonstrations on one model and testing them on another.

Figure 2 visualizes the results. The x-axis represents the inference model, and the y-axis represents the configuration model.
- The diagonal cells represent the standard approach (Config Model = Inference Model).
- The off-diagonal squares are surprisingly bright! This means a prompt optimized by a tiny 355M-parameter model performs almost as well on a massive 6.7B-parameter model as if the big model had optimized it itself.
The “Zero Cost” Strategy
Based on this transferability, the authors propose a strategy to achieve what they call Zero Configuration Cost (relative to the inference model).
The Strategy:
- Configure: Use a small, lightweight model (e.g., GPT-2 Medium) to select and order the examples. This is computationally negligible compared to running a Large Language Model.
- Infer: Feed the resulting optimized prompt into the large Target LLM (e.g., GPT-4, OPT-13B) for the actual task.
Because the large model never has to do the searching, the “DC Cost” on the large model is effectively zero.
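Putting the pieces together, the whole recipe is short. The sketch below reuses the earlier helpers (`hybrid_configure`, `predict`) and assumes two hypothetical scoring hooks: one wrapping the cheap configuration model and one wrapping the expensive inference model.

```python
def zero_cost_icl(query, query_vec, train_vecs, train_examples,
                  small_ordering_score, large_score_fn,
                  labels=("Positive", "Negative")):
    """Configure the prompt with the small model, run inference with the large one.

    small_ordering_score: hypothetical hook that judges a candidate ordering by
        querying the small configuration model (e.g. an entropy- or MDL-style score).
    large_score_fn: hypothetical hook wrapping one call to the large inference model,
        returning log P(label | prompt).
    """
    # Step 1 (Configure): all search calls happen inside small_ordering_score,
    # so only the cheap model pays the DC cost.
    demos = hybrid_configure(query_vec, train_vecs, train_examples,
                             ordering_score=small_ordering_score)
    # Step 2 (Infer): the large model never participates in the search;
    # it only sees the finished prompt, so its DC cost is zero.
    return predict(demos, query, large_score_fn, labels=labels)
```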
Does it work?
The results are impressive.

In Table 3, look at the rows labeled (ours).
- Standard Method: Using OPT-6.7B to configure its own prompt costs 7200 inference calls.
- Ours: Using GPT-2 to configure the prompt for OPT-6.7B costs 0 calls on the big model.
- Result: The accuracy drop is negligible. In some cases (like TopK+LocalE on SST2), the small model actually found better prompts than the large model did!
This strategy even works on closed-source “black box” models like OpenAI’s Davinci and GPT-3.5 Turbo.

As seen in Table 4, using a small, local model to engineer prompts for GPT-3.5 leads to significant performance gains over zero-shot or random selection. This is a game-changer for developers who pay per token—you can optimize your prompts locally for free before sending them to the expensive API.
Conclusion
This paper prompts a necessary shift in how we think about In-Context Learning. We can no longer treat prompt engineering as a cost-free activity.
Key Takeaways:
- Cost correlates with Performance: You can usually “buy” better accuracy by spending more compute on searching for examples.
- Fair Evaluation: Researchers must report configuration costs alongside accuracy to enable fair comparisons.
- Transferability is Real: You don’t need a supercomputer to find good prompts. A small model can act as a “proxy,” finding excellent examples that transfer to larger models.
By adopting this “Zero Cost” strategy—using small models to guide big ones—students and researchers can achieve state-of-the-art ICL performance without breaking the bank on computational resources.