Introduction
In the era of Large Language Models (LLMs), In-Context Learning (ICL) has become a dominant paradigm. The idea is deceptively simple: instead of fine-tuning a model’s weights, you simply provide a few examples (exemplars) in the prompt, and the model learns the pattern.
For example, if you want an LLM to translate English to SQL, your prompt might look like this:
```
Input: Show me users over 20.
Output: SELECT * FROM users WHERE age > 20;

Input: Count the products.
Output: SELECT count(*) FROM products;

Input: [Your actual query]
Output: …
```
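As a quick illustration, here is a minimal Python sketch of how such a prompt could be assembled programmatically; the exemplar pairs and the query are placeholders for illustration, not taken from the paper:

```python
# Minimal sketch: assembling an in-context learning (ICL) prompt from exemplars.
# The exemplar pairs and the query below are illustrative placeholders.
exemplars = [
    ("Show me users over 20.", "SELECT * FROM users WHERE age > 20;"),
    ("Count the products.", "SELECT count(*) FROM products;"),
]
query = "List all orders from this year."

parts = [f"Input: {x}\nOutput: {y}" for x, y in exemplars]
parts.append(f"Input: {query}\nOutput:")
prompt = "\n\n".join(parts)

print(prompt)  # This string is what gets sent to the LLM for completion.
```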
The success of ICL depends heavily on which examples you choose. Traditionally, we treat this as a simple search problem: find the \(k\) examples in our training set that are most similar to the current input. We grab the top 5 or 10, stack them up, and feed them to the LLM.
But here is the problem: This approach assumes that the examples are independent of each other. It ignores the “chemistry” between examples. What if the second example repeats information from the first? What if the combination of Example A and Example B confuses the model, even if they are individually relevant?
In this post, we will dive into a fascinating paper titled “Learning to Retrieve Iteratively for In-Context Learning” by researchers from Microsoft and Johns Hopkins University. They propose a new framework called Iterative Retrieval (IterR). Instead of grabbing a handful of examples at once, their model picks them one by one, updating its understanding of the context after every selection.
We will explore how they frame this as a Reinforcement Learning problem, how they create a “stateful” retriever, and why this approach significantly outperforms standard methods on complex tasks like semantic parsing.
Background: The Flaw in “Snapshot” Retrieval
To understand the solution, we first need to understand the limitations of the current standard.
In a typical Retrieval-Augmented Generation (RAG) or ICL setup, we have a query \(x\). We want to find a set of exemplars from a dataset \(\mathcal{D}\). The standard retriever \(R\) selects the top-\(k\) items based on a scoring function \(S\) (usually cosine similarity or BM25 keyword matching):
\[ R(x) \;=\; \underset{(x', y') \in \mathcal{D}}{\operatorname{top-}k} \; S(x, x') \]
This is a “snapshot” approach. The retriever looks at the query, calculates scores for all documents, and returns the highest ones. It does not care about the order of the selected items, nor does it care about how the items relate to each other.
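For contrast with what follows, here is a rough sketch of such a snapshot retriever using cosine similarity over precomputed embeddings; the array shapes and names are assumptions for illustration, not the paper's code:

```python
import numpy as np

def snapshot_top_k(query_vec, exemplar_vecs, k=5):
    """Stateless top-k retrieval: score every candidate once, keep the best k.

    query_vec:     (d,)   embedding of the input query
    exemplar_vecs: (N, d) embeddings of all candidate exemplars
    """
    # Cosine similarity between the query and every candidate.
    q = query_vec / np.linalg.norm(query_vec)
    E = exemplar_vecs / np.linalg.norm(exemplar_vecs, axis=1, keepdims=True)
    scores = E @ q
    # Take the k highest-scoring exemplars; their interactions are never considered.
    return np.argsort(-scores)[:k]
```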
The authors of the paper argue that finding the optimal set of prompts is actually a combinatorial optimization problem. The ideal retriever should maximize the probability that the Language Model (LM) produces the correct output \(y\) given the sequence of exemplars:
\[ R^{*}(x) \;=\; \underset{(x_1, y_1), \dots, (x_k, y_k) \in \mathcal{D}}{\operatorname{arg\,max}} \; P_{LM}\big(y \mid (x_1, y_1), \dots, (x_k, y_k), x\big) \]
Because checking every possible combination of examples is computationally intractable (the underlying subset-selection problem is NP-hard), we need a smarter approximation than just “picking the most similar ones.”
Visualizing the Difference
The figure below perfectly illustrates the shift in thinking.
[Figure: a standard retriever selecting all exemplars at once (top) vs. the iterative retriever selecting one exemplar per step while updating its state (bottom)]
- Top (Standard): The model retrieves a batch of examples \((x_1, y_1), (x_2, y_2)\) all at once based on the input \(x\).
- Bottom (Iterative): The model retrieves one example, updates its internal state (\(s_0 \to s_1\)), considers what it has already learned, and then decides what to retrieve next.
The Core Method: Iterative Retrieval
The researchers propose modeling the prompt construction process as a Markov Decision Process (MDP). If you are familiar with Reinforcement Learning (RL), this will ring a bell. If not, think of it as a turn-based game where the “agent” (the retriever) takes “actions” (selecting examples) to maximize a “score” (the LLM’s performance).
1. The Architecture of IterR
The Iterative Retriever (IterR) consists of a few key components: a state vector, a policy network, and a state transition function.
The State (\(s\))
The state represents the “current context.” It tracks what the retriever has selected so far. The initial state \(s_0\) is a learnable parameter.
The Action (Retrieval)
At each step \(i\), the model needs to pick one exemplar pair \((x_i, y_i)\) from the dataset. It does this using a Policy Network. The policy decides which example is most useful given the current state \(s_i\).
The policy is defined by the inner product (similarity) between a query vector generated from the current state and the encoded vectors of the candidate examples:
\[ \pi\big((x_i, y_i) \mid s_i\big) \;\propto\; \exp\!\Big( \mathbf{Q}(s_i)^{\top}\, \mathbf{F}_{\text{enc}}(x_i) \Big), \qquad (x_i, y_i) \in \mathcal{D} \]
Here:
- \(\mathbf{Q}(s_i)\) transforms the current state into a search query.
- \(\mathbf{F}_{\text{enc}}(x_i)\) is a text encoder (like Contriever) that turns text into vectors.
- This looks a lot like standard dense retrieval, but the key difference is that \(\mathbf{Q}(s_i)\) changes at every step.
To actually pick an example during inference, the model performs a greedy selection (picking the highest scoring item):
\[ (x_i, y_i) \;=\; \underset{(x, y) \in \mathcal{D}}{\operatorname{arg\,max}} \;\; \mathbf{Q}(s_i)^{\top}\, \mathbf{F}_{\text{enc}}(x) \]
The Transition (Updating the State)
Once an exemplar is chosen, the model needs to remember it. The authors use a Gated Recurrent Unit (GRU), a type of Recurrent Neural Network (RNN), to update the state.
\[ s_{i+1} \;=\; \mathrm{GRU}\big( s_i,\; \mathbf{F}_{\text{enc}}(x_i, y_i) \big) \]
The new state \(s_{i+1}\) is a blend of the old state and the vector representation of the newly selected example. This is what makes the retriever “stateful.”
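To make the loop concrete, here is a hypothetical PyTorch sketch of the inference-time procedure described above. The hidden dimension, the linear projection used for \(\mathbf{Q}\), the use of nn.GRUCell, and the omission of the test query from the state are all simplifying assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class IterativeRetrieverSketch(nn.Module):
    """Stateful retrieval loop: state -> search query -> greedy pick -> GRU update."""

    def __init__(self, dim=768):
        super().__init__()
        self.s0 = nn.Parameter(torch.zeros(1, dim))    # learnable initial state s_0
        self.query_proj = nn.Linear(dim, dim)          # Q(s): turns the state into a query
        self.gru = nn.GRUCell(dim, dim)                # state transition

    @torch.no_grad()
    def retrieve(self, exemplar_vecs, k=4):
        """exemplar_vecs: (N, dim) outputs of a frozen encoder F_enc over (x, y) pairs."""
        state = self.s0                                # (1, dim)
        chosen = []
        for _ in range(k):
            q = self.query_proj(state)                 # (1, dim) search query for this step
            scores = exemplar_vecs @ q.squeeze(0)      # inner product with every candidate
            idx = int(torch.argmax(scores))            # greedy selection at inference time
            chosen.append(idx)
            # Fold the selected exemplar back into the state before the next step.
            state = self.gru(exemplar_vecs[idx].unsqueeze(0), state)
        return chosen
```

A real system would also mask out already-selected exemplars and fold the test query itself into the state; the sketch omits both for brevity.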
Putting It Together: An Example
Let’s look at a concrete example from the Semantic Parsing task (converting natural language to code).
[Figure: BM25 keyword retrieval vs. iterative retrieval for a semantic parsing query about a “tailgating party”]
In the figure above:
- Top (BM25): The standard retriever sees “tailgating party” and just looks for those keywords. It retrieves examples that share words but might not help with the specific logic required.
- Bottom (Iterative):
  - The model first retrieves an example about “video camp” (\(x_1\)). This establishes a structural pattern.
  - The state updates (\(s_1\)).
  - Based on \(s_1\), the model realizes it needs a different kind of constraint example. It adjusts its search trajectory and retrieves an example about a “circus” (\(x_2\)).
  - The final prompt contains a diverse portfolio of examples that help the LLM solve the specific logic puzzle.
2. Training with Reinforcement Learning
How do we train this? We can’t use standard supervised learning because we don’t have a “ground truth” list of perfect prompts. Instead, we have to let the model try constructing prompts and see how well the LLM performs.
This is a classic setup for Reinforcement Learning.
The Simulator
The “environment” in this RL setup is the LLM itself (e.g., Llama-2). The retriever picks examples, constructs a prompt, feeds it to the LLM, and waits for feedback.
The Reward Function
Defining the reward is tricky. If we just gave a reward of +1 for a correct answer and 0 for a wrong one, the signal would be too sparse (the model would hardly ever learn).
Instead, the authors use the change in probability. They measure how much the selected example increased the likelihood of the LLM generating the correct gold-standard output (\(y^*\)).
\[ r_i \;=\; P_{LM}\big(y^{*} \mid \dots,\, (x_i, y_i)\big) \;-\; P_{LM}\big(y^{*} \mid \dots\big) \]
- \(P_{LM}(y^* | ... (x_i, y_i))\): The probability of the correct answer after adding the new example.
- \(P_{LM}(y^* | ...)\): The probability before adding the example.
If the new example makes the correct answer more likely, the model gets a positive reward. If it confuses the LLM, it gets a negative reward.
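In code, the incremental reward for one step might look like the following; score_gold is a hypothetical helper that returns \(P_{LM}(y^* \mid \text{prompt})\) from the simulator LLM, not an API from the paper:

```python
def step_reward(score_gold, prompt_so_far, new_exemplar, query, gold_output):
    """Reward for adding one exemplar: the change in the LLM's probability of y*.

    score_gold(prompt, query, gold_output) -> P_LM(y* | prompt, query) is assumed
    to be provided by a wrapper around the simulator LLM (e.g. Llama-2-7b).
    """
    p_before = score_gold(prompt_so_far, query, gold_output)
    p_after = score_gold(prompt_so_far + new_exemplar, query, gold_output)
    return p_after - p_before  # positive if the exemplar helped, negative if it hurt
```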
Policy Optimization (PPO)
To update the model weights, they use Proximal Policy Optimization (PPO). PPO is a popular RL algorithm known for its stability. It ensures that the policy doesn’t change too drastically in a single update.
\[ L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\;\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right], \qquad r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \]
The model is trained to maximize the expected reward (the advantage \(\hat{A}\)) while keeping the new policy close to the old one (the clipping term).
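For reference, the textbook clipped surrogate that PPO optimizes can be written in a few lines; this is the generic form, not the authors' exact training code:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate (returned as a loss to minimize).

    logp_new:  log pi_theta(a | s) under the current policy
    logp_old:  log pi_theta_old(a | s) under the policy that collected the rollout
    advantage: advantage estimate A_hat for the action that was taken
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio r(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()           # negate: optimizers minimize
```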
Challenge: Stratified Sampling
There is a computational hurdle here. The dataset \(\mathcal{D}\) might contain 100,000+ candidates. Calculating the full probability distribution for the policy at every step is too slow.
To solve this, the authors use Stratified Sampling.

Instead of looking at the whole dataset or just the top few, they:
- Retrieve a buffer of top candidates.
- Keep the very best ones (exploitation).
- Group the rest into “strata” (layers) and sample randomly from them (exploration).
- Renormalize the scores.
This ensures the model focuses on good candidates but still explores less obvious options that might be surprisingly helpful.
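A rough sketch of that candidate-selection step is below; the buffer size, number of strata, samples per stratum, and softmax renormalization are guesses for illustration rather than the paper's exact procedure:

```python
import numpy as np

def stratified_candidates(scores, buffer_size=100, keep_top=16,
                          n_strata=4, per_stratum=4, rng=None):
    """Build a small candidate pool: keep the best items, sample the rest by stratum."""
    rng = rng or np.random.default_rng()
    buffer = np.argsort(-scores)[:buffer_size]          # top-scoring buffer from the index
    kept = list(buffer[:keep_top])                      # exploitation: the very best items
    rest = buffer[keep_top:]
    for stratum in np.array_split(rest, n_strata):      # exploration: sample each layer
        if len(stratum) > 0:
            take = min(per_stratum, len(stratum))
            kept.extend(rng.choice(stratum, size=take, replace=False))
    pool = np.array(kept)
    probs = np.exp(scores[pool] - scores[pool].max())   # renormalize scores over the pool
    return pool, probs / probs.sum()                    # policy distribution for training
```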
Experiments and Results
The authors tested IterR on Semantic Parsing, a task where the model must translate natural language into a formal representation (like code). This is a great testbed because it requires strict adherence to syntax and logic—close isn’t good enough.
Datasets
They used three datasets:
- SMCALFLOW: Complex dialogue dataflow synthesis.
- TREEDST: Dialogue state tracking.
- MTOP: Task-oriented parsing (alarms, reminders, etc.).

Main Results: Beating the Baselines
The authors compared IterR against:
- BM25: Standard keyword search.
- Contriever: A strong dense retriever (vector similarity).
- EPR & CEIL: Previous state-of-the-art methods for exemplar selection.
The results (Table 1) show a clear victory for Iterative Retrieval.

Key Takeaways:
- Exact Match (EM): IterR significantly outperforms all baselines. On SMCALFLOW, it jumps from 51.1% (CEIL) to 54.1%.
- SMatch: This metric measures partial correctness (is the structure mostly right?). IterR dominates here too, suggesting it helps the LLM understand the underlying structure of the task better than other methods.
Generalization: Does it work on other LLMs?
A critical question for any component trained with RL is: “Did you just overfit to the simulator?” The authors trained IterR using Llama-2-7b. But does the trained retriever work if we use a different LLM for inference?

The answer is yes.
- Intra-family: It works even better on larger Llama models (Llama-2-70b).
- Inter-family: It even generalizes to Mistral-7B (a completely different model family), maintaining a lead over baselines.
This suggests that IterR is learning universal properties of “what makes a good prompt portfolio” rather than just hacking the specific biases of Llama-2-7b.
Efficiency: Doing More with Less
Finally, does this smarter retrieval allow us to use fewer examples?

Figure 6 shows that IterR (green line) achieves higher performance with fewer exemplars compared to EPR (blue) and CEIL (pink). This is a major practical benefit: shorter prompts mean lower latency and lower cost.
Conclusion
The paper “Learning to Retrieve Iteratively for In-Context Learning” represents a shift in how we think about Prompt Engineering. It moves us away from static, heuristic-based similarity search toward dynamic, learned, and stateful retrieval.
By framing prompt construction as a sequential decision process, the IterR framework acknowledges that:
- Order matters: The first example sets the stage for the second.
- Context matters: What the model needs depends on what it has already seen.
- Feedback matters: The ultimate judge of a retriever should be the downstream model’s performance.
While the method adds some training complexity (Reinforcement Learning is notoriously tricky), the inference cost is low—only adding a small GRU (4M parameters) to the retrieval process. For applications requiring high precision, like semantic parsing or code generation, this iterative approach offers a significant edge over standard RAG techniques.
As LLMs continue to evolve, the systems that feed them information must evolve too. Iterative Retrieval proves that selecting the right context is not just about finding similar documents—it’s about building a coherent curriculum for the model, one step at a time.