Introduction

Imagine you are trying to teach a brilliant but literal-minded student how to solve complex physics problems. You don’t have time to teach them the entire textbook. Instead, you can only show them five specific examples of how to solve a problem before giving them the final exam.

Which five examples do you choose?

Do you pick five simple ones? Five incredibly hard ones? Or a mix of different reasoning styles? If you pick the wrong ones, the student fails. If you pick the right ones, the student excels.

This is the exact challenge researchers face with In-Context Learning (ICL) in Large Language Models (LLMs). We know that giving an LLM a few “exemplars” (demonstration examples) significantly boosts its performance, especially on complex reasoning tasks like math or logic puzzles. This is often called “few-shot prompting.”

However, finding that “perfect” subset of examples from a training dataset of thousands is a massive combinatorial problem. Previous methods were either computationally expensive—requiring the LLM to evaluate thousands of possibilities—or they required checking the model’s internal confidence scores, which isn’t always possible with black-box models like GPT-4.

In this post, we are diving deep into EXPLORA, a research paper that proposes a novel, efficient algorithm for selecting these exemplars. EXPLORA treats prompt selection as an exploration-exploitation problem, drastically reducing the computational cost while outperforming state-of-the-art methods.

Background: The Art of In-Context Learning

Before we dissect the algorithm, let’s establish the context. In-context learning allows LLMs to learn a task simply by seeing examples in the prompt, without updating the model’s weights.

Chain-of-Thought (CoT)

For complex tasks, we don’t just give the input and output; we provide a “rationale.” This is known as Chain-of-Thought (CoT) prompting. An exemplar isn’t just a Question (\(x\)) and an Answer (\(y\)); it is a triplet: Input, Rationale, Output.

The goal is to select a subset of these exemplars (\(S\)) to append to a test question (\(x_{test}\)) so the model generates the correct output (\(\hat{y}_{test}\)).

\[ \hat{y}_{test} = \delta\big( f([S, x_{test}]) \big) \]

As shown in the equation above, the prediction \(\hat{y}_{test}\) is generated by the LLM function \(f\) based on a prompt \([S, x_{test}]\). The post-processing step \(\delta\) (like regex matching) extracts the final answer.
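
To make the mechanics concrete, here is a minimal sketch of this pipeline in Python. The exemplar contents, the prompt format in `build_prompt`, and the regex in `delta` are illustrative assumptions, not the paper’s exact implementation:

```python
import re

# A CoT exemplar is a triplet: input, rationale, output. (Contents are made up.)
exemplars = [
    {
        "question": "A pen costs $3 and a notebook costs $5. What do 2 pens and 1 notebook cost?",
        "rationale": "Two pens cost 2 * 3 = 6 dollars. Adding the notebook: 6 + 5 = 11 dollars.",
        "answer": "11",
    },
]

def build_prompt(subset, x_test):
    """Assemble the prompt [S, x_test] by concatenating the exemplar subset S."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in subset
    ]
    parts.append(f"Q: {x_test}\nA:")
    return "\n\n".join(parts)

def delta(generation):
    """The post-processing step: pull the final numeric answer out of raw text."""
    match = re.search(r"answer is\s*\$?(-?[\d.,]+)", generation)
    return match.group(1).rstrip(".,") if match else ""

# y_hat = delta(f([S, x_test])), where f is the LLM call (not shown here).
print(delta("Two pens cost 6 dollars, so the answer is 11."))  # -> 11
```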

Static vs. Dynamic Selection

There are two main schools of thought on how to pick these exemplars (\(S\)):

  1. Dynamic Selection: For every new test question, the system searches the training pool for the examples most similar to that specific question (sketched in code after this list).
  • Pros: Highly relevant context for each query.
  • Cons: Slow and expensive, since a retrieval step must run for every single query.
  2. Static Selection: You select one “golden” set of exemplars once and use that same set for every test question.
  • Pros: Fast inference (no retrieval step at runtime) and more consistent behavior across queries.
  • Cons: Finding a single set that works well for everything is extremely difficult.
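
To see why dynamic selection is costly, here is roughly what it does per query: embed the test question and rank the whole training pool by similarity. A minimal sketch, where the encoder choice and the pool contents are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any small, cheap encoder works

train_questions = ["..."]  # the pool of training questions (placeholder)
train_emb = encoder.encode(train_questions, normalize_embeddings=True)

def retrieve_top_k(x_test, k=5):
    """Dynamic selection: one encoding + one similarity ranking for *every* query."""
    q = encoder.encode([x_test], normalize_embeddings=True)[0]
    sims = train_emb @ q  # cosine similarity, since embeddings are normalized
    return [train_questions[i] for i in np.argsort(-sims)[:k]]
```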

EXPLORA focuses on Static Selection. The researchers argue that if we can find a set of examples that covers diverse reasoning patterns, we don’t need to dynamically retrieve examples for every query.

The Problem with Current Methods

State-of-the-art static methods, such as LENS, rely on “informativeness.” They pass candidate examples through the LLM and check the probability logits (confidence scores) to see how much the model “learns” from them.

The downsides?

  1. Cost: LENS requires calling the LLM thousands of times to score the training data.
  2. Access: It requires access to probability distributions (\(\mathbb{P}_{LLM}\)), which APIs for models like Claude or GPT-4 might not fully expose.
  3. Independence: These methods often score examples individually, failing to capture how examples interact within a group.

The EXPLORA Method

The authors propose EXPLORA (Model-based Exploration for Exemplar Subset Selection). The core philosophy is simple: We should minimize the validation loss.

In other words, we want to find a subset of examples (\(U\)) such that, when we use them as a prompt, the model makes the fewest mistakes on a small validation set.

Figure 1: Overview of EXPLORA, showing the iterative process of sampling, scoring, and updating subsets.

As illustrated in Figure 1, EXPLORA is an iterative loop. It doesn’t check every possible combination (which is computationally infeasible). Instead, it uses a Scoring Function (the green box) to estimate how good a subset is, and it periodically checks with the real LLM to update that scoring function.

1. Defining the Objective

The researchers formulate the problem as minimizing the loss \(L\) on a validation set \(\mathcal{V}\).

\[ L(S, \mathcal{V}) = \frac{1}{|\mathcal{V}|} \sum_{(u_i, v_i) \in \mathcal{V}} \mathbb{1}\left[ v_i \neq \delta\big( f([S, u_i]) \big) \right] \]

The equation above calculates the error rate. If the model predicts the wrong answer (\(v_i \neq \delta(...)\)), the loss increases. The goal is to find the subset \(U^*\) that minimizes this loss:

\[ U^* = \arg\min_{U} \, L(U, \mathcal{V}) \]

where \(U\) ranges over candidate exemplar subsets of a fixed size.
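
In code, the objective is just an error rate over the validation set. Reusing `build_prompt` and `delta` from the sketch above, with `llm_generate` as a stand-in for the LLM call \(f\):

```python
def validation_loss(subset, validation_set, llm_generate):
    """L(S, V): fraction of validation questions answered wrong when prompted with S."""
    errors = 0
    for question, gold_answer in validation_set:
        prediction = delta(llm_generate(build_prompt(subset, question)))
        errors += int(prediction != gold_answer)
    return errors / len(validation_set)
```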

2. The Scoring Function (\(\sigma\))

Calculating the actual loss is expensive because it requires running the LLM on the full validation set for every candidate subset. To solve this, EXPLORA introduces a proxy scoring function (\(\sigma\)). This function estimates how good a subset is without calling the LLM.

The scoring function is a linear model based on similarity. It assumes that a training example is useful if it is semantically similar to the validation examples.

\[ \sigma(\vec{\alpha}, S) = \sum_{x_i \in S} \sum_{u_j \in \mathcal{V}} \alpha_i E_{ij} \]

Let’s break down this equation:

  • \(\sigma(\vec{\alpha}, S)\): The estimated score of subset \(S\).
  • \(E_{ij}\): The similarity score between a training exemplar (\(x_i\)) and a validation question (\(u_j\)). This is calculated using a small, cheap model (like BERT), not the massive LLM.
  • \(\alpha_i\): The weight (parameter) of exemplar \(x_i\). This is what the algorithm learns.

Because \(\sigma\) is fit to track the LLM’s actual loss, the learned weights \(\alpha_i\) encode how strongly each exemplar’s similarity pattern predicts success or failure: exemplars whose presence consistently drives the predicted loss down are the ones the algorithm comes to favor, while exemplars associated with higher loss are effectively flagged as confusing or harmful.
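
Reading the equation literally (one scalar weight per exemplar, with the similarity matrix precomputed once by a cheap encoder), \(\sigma\) is just a weighted sum over rows of that matrix, so no LLM call is involved. A minimal sketch; the function signature and array shapes are illustrative:

```python
import numpy as np

def sigma(alpha, E, subset_ids):
    """sigma(alpha, S): predicted score of a subset, with no LLM in the loop.

    alpha:      one learned weight per training exemplar.
    E:          precomputed matrix, E[i, j] = similarity(exemplar i, validation question j).
    subset_ids: indices of the exemplars in S.
    """
    idx = list(subset_ids)
    return float(np.sum(alpha[idx, None] * E[idx]))  # sum_i sum_j alpha_i * E_ij
```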

3. The Bandit Algorithm

How do we find the correct values for \(\alpha\)? EXPLORA uses a sampling-based bandit algorithm. This is a technique often used in scenarios where you have to balance exploitation (using what you know works) and exploration (trying new things).

The algorithm maintains two sets of subsets:

  1. \(U_t\): The current “best” subsets (low loss).
  2. \(V_t\): A pool of other candidate subsets.

In each round \(t\), the algorithm performs the following steps:

  1. Update Parameters (\(\alpha\)): It solves an optimization problem to find \(\alpha\) values that make the scoring function \(\sigma\) match the actual LLM loss for the subsets we have tested so far.

\[ \vec{\alpha}_t = \arg\min_{\vec{\alpha}} \sum_{S \in U_t} \big( L(S, \mathcal{V}) - \sigma(\vec{\alpha}, S) \big)^2 + \sum_{S' \in \widetilde{V}_t} \big( L(S', \mathcal{V}) - \sigma(\vec{\alpha}, S') \big)^2 \]

where \(\widetilde{V}_t\) is a small sample of subsets drawn from \(V_t\).

This equation minimizes the difference between the actual loss (\(L(S, \mathcal{V})\)) and the predicted score (\(\sigma(\vec{\alpha}, S)\)). Note the second term: it samples “negative” examples from \(V_t\) to ensure the model learns to distinguish good subsets from bad ones.

  2. Select Candidates:
  • Find the subset in the candidate pool (\(V_t\)) with the lowest predicted loss (the most promising candidate). Call it \(S^*_t\).
  • Find the subset in the current “best” set (\(U_t\)) with the highest predicted loss (the weakest winner). Call it \(\tilde{S}_t\).
  3. Update the Set:
  • If the new candidate \(S^*_t\) looks better than the weakest winner \(\tilde{S}_t\), swap them!
  • The algorithm adds the new candidate to the “best” set and removes the weakest one (a full code sketch follows below).

Why is this efficient? Because \(U_t\) and \(U_{t+1}\) usually differ by only one subset, the algorithm can use caching mechanisms. It only needs to query the LLM for the few new subsets it is exploring, rather than re-evaluating everything.
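
To tie the pieces together, here is a compact sketch of one plausible version of the full loop, reusing `sigma` and `validation_loss` from earlier. The least-squares fit, the negative-sample size of three, and the fixed round budget are illustrative assumptions, not the paper’s exact settings:

```python
import numpy as np

def explora(exemplar_pool, candidates, E, validation_set, llm_generate,
            n_keep=5, n_rounds=50, seed=0):
    """Sketch of the exploration-exploitation loop over candidate subsets.

    exemplar_pool: list of exemplar triplets; candidates: list of tuples of
    indices into that pool. The first n_keep candidates seed U_t; the rest
    form the pool V_t.
    """
    rng = np.random.default_rng(seed)
    U, V = list(candidates[:n_keep]), list(candidates[n_keep:])
    cache = {}  # subset -> LLM-measured validation loss (the expensive part)

    def measure(S):
        if S not in cache:  # query the LLM only on cache misses
            subset = [exemplar_pool[i] for i in S]
            cache[S] = validation_loss(subset, validation_set, llm_generate)
        return cache[S]

    e = E.sum(axis=1)  # each exemplar's total similarity to the validation set
    for _ in range(n_rounds):
        if not V:
            break
        # 1. Fit alpha so sigma matches the measured losses, mixing in a few
        #    "negative" subsets sampled from the pool V_t.
        picks = rng.choice(len(V), size=min(3, len(V)), replace=False)
        fit_set = U + [V[i] for i in picks]
        X = np.zeros((len(fit_set), len(e)))
        for row, S in enumerate(fit_set):
            X[row, list(S)] = e[list(S)]  # so that sigma(alpha, S) == X @ alpha
        y = np.array([measure(S) for S in fit_set])
        alpha, *_ = np.linalg.lstsq(X, y, rcond=None)

        # 2. Most promising candidate vs. weakest current winner, by predicted loss.
        s_star = min(V, key=lambda S: sigma(alpha, E, S))
        s_tilde = max(U, key=lambda S: sigma(alpha, E, S))

        # 3. Swap only if the candidate's *measured* loss beats the weakest winner's;
        #    at most s_star needs a fresh LLM evaluation, the rest is cached.
        if measure(s_star) < measure(s_tilde):
            U[U.index(s_tilde)] = s_star
            V.remove(s_star)

    return min(U, key=measure)  # every subset in U is already cached
```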

Experiments and Results

The researchers tested EXPLORA on several complex reasoning datasets, including:

  • GSM8K: Math word problems.
  • AquaRat: Algebraic problems.
  • TabMWP: Reasoning over tables.
  • FinQA: Financial numerical reasoning.
  • StrategyQA: Commonsense reasoning.

They compared EXPLORA against dynamic methods (KNN, MMR) and static methods (LENS, Random, Manual selection).

Performance Comparison

Table 1: Comparison of EXPLORA’s performance against baselines like LENS and KNN.

As shown in Table 1, EXPLORA achieves remarkable results:

  • It outperforms LENS (the previous state-of-the-art static method) across all datasets. On GSM8K, it shows a 12.24% improvement.
  • It even outperforms dynamic methods like KNN in many cases, which is impressive given that dynamic methods have the advantage of seeing the test question before picking examples.
  • The variants (EXPLORA + Self-Consistency) push the performance even higher.

Efficiency: The “Frugal” Choice

The most critical contribution of EXPLORA is its efficiency. Because it relies on the proxy scoring function (\(\sigma\)) rather than querying the LLM for every candidate, it saves massive amounts of compute.

Figure 2: Charts comparing LLM calls and runtime between LENS and EXPLORA.

Figure 2 tells the story clearly:

  • Top Chart (LLM Calls): EXPLORA (striped blue) requires significantly fewer calls to the LLM than LENS (striped red). We are talking about reducing calls to ~11% of what LENS requires.
  • Bottom Chart (Time): The time savings are equally dramatic. For GSM8K, the process drops from ~40 hours (LENS) to ~7 hours (EXPLORA).

Transferability

A fascinating finding is that exemplars selected using a smaller, cheaper model (like Llama-2-7b or Mistral-7b) work surprisingly well when transferred to a larger model (like GPT-3.5-turbo).

Table 2: Transferability of exemplars selected with small models to GPT-3.5.

Table 2 demonstrates that selecting prompts using Mistral-7b and then running inference on GPT-3.5 yields results almost identical to selecting them directly on GPT-3.5. This lets researchers run the optimization process on open-source, local hardware and only pay for the expensive API during the final inference.

Robustness

Finally, does the selected subset perform consistently?

Table 3: Standard deviation of EXPLORA’s performance compared to other methods.

Table 3 shows the standard deviation of performance across different splits of the data. Lower numbers are better. EXPLORA consistently shows lower variance than methods like KNN or LENS, indicating that the subsets it finds are robust and reliable, not just lucky hits.

Qualitative Analysis: What does EXPLORA pick?

Why does EXPLORA perform better than LENS? The authors analyzed the actual questions selected by both algorithms.

LENS tends to select exemplars based on individual confidence scores. This often leads to redundancy. For example, in the AquaRat dataset (algebra), LENS selected two questions that were thematic duplicates—both dealing with “work and time” problems phrased similarly.

EXPLORA, because it scores the subset as a whole, implicitly encourages diversity. If the subset already contains a “work and time” problem, adding a second one won’t lower the validation loss as much as adding a “geometry” problem would.

Table 7: Specific examples selected by LENS and EXPLORA for AquaRat.

In Table 7, we can see that EXPLORA’s selection (bottom row) covers Proportionality, Series, Kinematics, and Interest calculation. It’s a much broader curriculum for the LLM to learn from in-context.

Conclusion

The prompt is the new programming interface, and selecting the right examples to put in that prompt is the new optimization challenge. EXPLORA offers a sophisticated solution to this problem. By modeling the selection process as a subset optimization problem and using a bandit algorithm to efficiently explore the space, it achieves the “holy grail” of prompt engineering:

  1. High Performance: Beats state-of-the-art methods.
  2. Low Cost: Requires 90% fewer LLM calls.
  3. Black-Box Compatible: Doesn’t need access to internal model weights or logits.

For students and researchers working with LLMs, this paper highlights an important shift. We are moving away from “vibes-based” prompt engineering (trying things until they work) toward algorithmic prompt engineering, where we use rigorous mathematical frameworks to discover the optimal inputs for our AI systems.