Introduction

In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have gained a reputation for being quick learners. Specifically, they excel at In-Context Learning (ICL). This is the ability to adapt to a new task simply by seeing a few examples in the prompt, without requiring any updates to the model’s weights.

Imagine you want an AI to translate English slang into formal text. You don’t need to retrain it; you just provide a few pairs: “Gonna -> Going to” and “Wanna -> Want to”, and the model figures out the pattern for the next input.
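
To make this concrete, here is a minimal sketch of what such a few-shot prompt could look like; the exact wording is illustrative, not taken from the paper:

```python
# A minimal few-shot ICL prompt: the model infers the slang -> formal pattern
# from the two in-context examples and applies it to the final, unanswered line.
prompt = (
    "Rewrite the slang in formal English.\n"
    "Gonna -> Going to\n"
    "Wanna -> Want to\n"
    "Gotta ->"
)
# Sent to a frozen LLM, this prompt should elicit "Got to" -- no weight updates involved.
```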

However, there is a catch. The performance of ICL is notoriously sensitive to the quality of the examples provided. Pick the wrong examples—or even just order them poorly—and the model’s accuracy can plummet. This has led to a search for the “perfect” selection strategy. Traditional methods usually look for examples that are visually or semantically similar to the test input.

But is similarity enough?

In this post, we will explore a research paper that proposes a novel approach called ByCS (Bayesian in-Context example Selection). Instead of just looking for similar examples, this method uses Bayes’ theorem to find examples that have a strong probabilistic interaction with the test input. It effectively “flips the script” by asking: If I knew the answer to the test question, would it help me predict the examples?

We will break down how this works across three different modalities—Text, Speech, and Vision—and see why thinking backwards might be the best way to move forward.

Background: The Challenge of Multimodal ICL

Before diving into the math, we need to understand the landscape. ICL started in text-based Natural Language Processing (NLP), but it has recently expanded to multimodal tasks. This includes Automatic Speech Recognition (ASR) and Visual Question Answering (VQA).

As shown in the figure below, the core concept remains the same across modalities, even if the architecture changes. Whether you are feeding the model text, audio waveforms, or pixels, the goal is to use a few labeled examples (\(\mathcal{C}_{input}, \mathcal{C}_{label}\)) to guide the model in predicting the label (\(Y\)) for a new test input (\(X\)).

Figure 2: Multimodal ICL. Although ICL on different modalities shares the same formula expression, the actual inputs and inference model architectures differ.

The Problem with Current Selection Methods

Since we cannot feed the entire training dataset into the model’s context window (it’s too expensive and technically limited), we have to select a subset of \(k\) examples.

Common strategies include:

  1. Random Selection: Picking examples blindly. This is unstable and often yields poor results.
  2. K-Nearest Neighbors (kNN) / KATE: This method embeds all examples into a vector space and selects the ones geometrically closest to the test input.

While kNN is better than random, it has a flaw: it treats the examples and the test input separately. It looks for surface-level or semantic similarity but doesn’t necessarily measure how much information an example provides to help solve the specific test case. It overlooks the mutual interactions between the context and the test input.
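
As a rough illustration of the kNN idea (a sketch of the general recipe, not the exact KATE implementation), selection reduces to a nearest-neighbour search in embedding space; the random vectors below stand in for real sentence or audio embeddings:

```python
import numpy as np

def knn_select(test_emb, candidate_embs, k=4):
    """Pick the k candidates whose embeddings are closest (cosine) to the test input."""
    test_emb = test_emb / np.linalg.norm(test_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cands @ test_emb                  # cosine similarity to the test input
    return np.argsort(-sims)[:k]             # indices of the k most similar examples

# Toy usage: 100 candidate examples with 384-dim placeholder embeddings, one test input.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(100, 384))
test_input = rng.normal(size=384)
print(knn_select(test_input, candidates, k=4))
```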

The Core Method: Bayesian In-Context Example Selection (ByCS)

The researchers propose a method that moves beyond simple similarity. They treat the selection process as a probability problem.

From Inference to Inverse Inference

To understand ByCS, we first look at the standard objective of In-Context Learning. We want to maximize the probability of getting the correct answer (\(Y\)) given the examples (\(C\)) and the test input (\(X\)).

Standard inference: \(\hat{Y} = \arg\max_{Y} P(Y \mid \mathcal{C}_{input}, \mathcal{C}_{label}, X)\)

The researchers apply Bayes’ Theorem to expand this inference probability. This allows them to decompose the problem and isolate the interaction between the label and the examples.

Bayes' theorem expansion: \(P(Y \mid \mathcal{C}_{input}, \mathcal{C}_{label}, X) = \frac{P(\mathcal{C}_{label} \mid X, Y, \mathcal{C}_{input}) \, P(Y \mid X, \mathcal{C}_{input})}{P(\mathcal{C}_{label} \mid X, \mathcal{C}_{input})}\)

Look at the numerator in the equation above. The term \(P(\mathcal{C}_{label} | X, Y, \mathcal{C}_{input})\) is the game-changer. This is called the Inverse Inference Probability.

In plain English, instead of asking:

“Given these examples, what is the answer to the test input?”

ByCS asks:

“If we assume the test input and its answer are the context, how well can we predict the labels of the candidate examples?”

If a candidate example is truly helpful, there should be high mutual information. The test case should explain the example, just as the example explains the test case.

The ByCS Pipeline

Implementing this requires a clever multi-step process, because strictly speaking, we don’t know the true answer (\(Y\)) for our test input (\(X\)) yet—that’s what we are trying to find!

The ByCS pipeline solves this with a three-step estimation process, illustrated below:

Figure 3: The detailed pipeline of our ByCS method includes: First, conduct the first-round inference to estimate the label of the test input. Then, perform inverse inference…

Step 1: First-Round Inference

First, the model makes a “hypothesized” guess for the test input (\(X\)). It might use a zero-shot prompt (no examples) or a simple random selection to generate a temporary label, \(\hat{Y}\).
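
In code, this first round is just an ordinary zero-shot (or randomly prompted) call to the inference model. The `generate` helper below is a hypothetical wrapper around whatever decoding API the model exposes:

```python
def first_round_inference(model, test_input, generate):
    """Step 1: obtain a hypothesized label Y-hat for the test input.

    `generate(model, prompt)` is a hypothetical wrapper around the model's
    decoding call; here a zero-shot prompt (no in-context examples) is used.
    """
    prompt = f"Q: {test_input}\nA:"
    y_hat = generate(model, prompt)  # temporary, possibly imperfect label
    return y_hat
```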

Step 2: Inverse Inference

Now, the “script flip” happens. We treat the test input pair (\(X, \hat{Y}\)) as the context. We take a candidate example from our database and treat it as the test target. We try to predict the candidate’s label.

A simplified conceptual view of this process is shown in Figure 1:

Figure 1: A brief illustration of the proposed Bayesian in-context example selection
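
A sketch of how that role swap could look in code, under the same assumptions as the step 1 snippet (the prompt template and the `generate` wrapper are ours, not the paper's):

```python
def inverse_inference(model, test_input, y_hat, candidate_input, generate):
    """Step 2: with (test_input, y_hat) acting as the in-context example,
    ask the model to predict the candidate example's label."""
    prompt = (
        f"Q: {test_input}\nA: {y_hat}\n"  # the test pair now plays the role of the context
        f"Q: {candidate_input}\nA:"       # the candidate is treated as the test target
    )
    return generate(model, prompt)        # predicted label for the candidate
```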

Step 3: Text Similarity & Selection

Finally, we compare the label we predicted for the candidate example against its actual ground-truth label.

  • If the prediction is accurate, it means the test input and the candidate example are highly correlated. This example gets a high score (\(Q\)).
  • If the prediction is wrong, the example is likely irrelevant or confusing relative to the test case.

The examples are ranked by this score, and the top \(k\) examples are selected for the final ICL prompt.
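
A minimal sketch of this scoring and ranking step, reusing the step 2 predictions; `similarity` stands in for whichever text-similarity metric is used (see the "Calculating the Score" section below), so its exact choice is an assumption here:

```python
def rank_and_select(predictions, candidates, similarity, k=4):
    """Step 3: score each candidate by how close its inverse-inference
    prediction is to its ground-truth label, then keep the top k.

    predictions: labels predicted in step 2, one per candidate.
    candidates:  list of (candidate_input, candidate_label) pairs.
    similarity:  callable mapping (predicted, true) -> score in [0, 1].
    """
    scores = [similarity(pred, label) for pred, (_, label) in zip(predictions, candidates)]
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [cand for _, cand in ranked[:k]]  # examples for the final ICL prompt
```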

A Concrete Example

It can be hard to visualize “inverse inference,” so let’s look at a text-based example provided by the authors.

Figure 5: We provide an additional “inverse inference” illustration of the proposed Bayesian example selection method for in-context learning in a text format

In the image above:

  1. Top (Standard Inference): We use “Einstein -> German” to predict “Marie Curie -> ?”.
  2. Bottom (Inverse Inference): We assume we know “Marie Curie -> Polish”. We use that to try and predict the label for “Einstein”.
  3. Because the relationship (Scientist -> Nationality) is consistent, the inverse inference works. If we tried an irrelevant candidate (like “Gandhi -> Male”), the inverse inference with “Marie Curie -> Polish” as context would most likely predict a nationality for Gandhi rather than his ground-truth label “Male”, so that candidate would receive a low score.

Calculating the Score

The score isn’t necessarily a raw probability. In practice, calculating the text similarity between the predicted label and the true label of the example works best. This makes the method compatible with commercial LLMs (like GPT-4) where access to raw probability logits might be restricted.

Figure 6: An illustration of the calculation of text similarity between inverse inference results and their true labels
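
As one cheap, standard-library instantiation (the paper does not commit to this exact metric, so treat it as an assumption), a character-level ratio already captures the idea:

```python
from difflib import SequenceMatcher

def text_similarity(predicted_label: str, true_label: str) -> float:
    """Similarity in [0, 1] between the inverse-inference output and the
    candidate's ground-truth label; higher means a more helpful example."""
    return SequenceMatcher(None, predicted_label.lower(), true_label.lower()).ratio()

print(text_similarity("German", "German"))  # 1.0 -> strongly interacting example
print(text_similarity("Male", "German"))    # much lower -> likely irrelevant example
```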

Addressing Computational Cost

You might notice a potential bottleneck: running inference on every single example in a massive dataset is slow. The authors address this with two optimizations:

  1. Pre-selection: They use a fast method (like kNN) to narrow the dataset down to a small pool (e.g., top 10 or 20 candidates). ByCS is then only run on this small pool to find the absolute best ones.
  2. Smaller Proxy Models: They found that you don’t need to run the largest, most expensive model in a family for the inverse step. A smaller model from the same family (e.g., Whisper Small instead of Whisper Large) can approximate the inverse inference effectively, speeding up the process significantly. A sketch of how the two optimizations compose follows below.
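
In sketch form, the two optimizations compose naturally: a cheap kNN pass builds a shortlist, and the inverse-inference score (possibly computed with a smaller proxy model) is evaluated only on that pool. The `bycs_score` callable is a placeholder for steps 1-3 above:

```python
import numpy as np

def two_stage_select(test_emb, cand_embs, bycs_score, pool_size=20, k=4):
    """Pre-selection + ByCS: shortlist by cosine similarity, then re-rank
    the shortlist with the (more expensive) inverse-inference score.

    bycs_score(i): hypothetical callable returning the ByCS score of candidate i,
    e.g. computed with a smaller proxy model from the same family.
    """
    sims = cand_embs @ test_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(test_emb)
    )
    pool = np.argsort(-sims)[:pool_size]               # fast kNN shortlist
    ranked = sorted(pool, key=bycs_score, reverse=True) # ByCS only on the small pool
    return ranked[:k]                                   # indices of the final examples
```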

Experiments and Results

The authors tested ByCS across three distinct modalities: Audio (ASR), Text (NLP), and Image (VQA).

1. Audio: Automatic Speech Recognition (ASR)

In this task, the model must transcribe speech, specifically focusing on difficult dialectal words. They used the Whisper model family.

The results in the table below compare ByCS against Random selection and “KATE+” (an improved kNN baseline).

Table 1: % WERs on RASC863 dialectal word dataset and CORAAL with different in-context example selection methods.

Analysis:

  • Lower WER (Word Error Rate) is better.
  • ByCS consistently outperforms the baselines. For example, on the RASC863 Chongqing dataset with \(k=1\), ByCS achieves 62.4% WER compared to 67.1% for Random and KATE+.
  • The “Oracle ByCS” row shows what happens if the model knew the actual label of the test input during step 1. The standard ByCS gets remarkably close to this oracle performance, proving that the “hypothesized” label in Step 1 is sufficient for the method to work.

2. Text: NLP Tasks

The method was applied to topic classification (TREC), sentiment analysis (SST2), Text-to-SQL (Spider), and ASR re-scoring (HyPoradise).

Table 5: Results of four text ICL tasks on two GPT-family models with different in-context example selection methods.

Analysis:

  • ByCS achieves the highest accuracy (or lowest WER) in almost every category using GPT-3.5 and GPT-4.
  • The improvement is particularly strong in open-ended generation tasks (like Text-to-SQL or ASR correction).

Why does it work better for some tasks than others? The authors analyzed the distribution of similarity scores.

Figure 4: The distribution of text similarity scores on different datasets.

  • SST2 (Blue): This is a simple classification task (Positive/Negative). The scores are clustered at 0 and 1. It’s binary.
  • HyPoradise (Red): This is a generation task. The scores are spread out. ByCS thrives here because it can differentiate between “somewhat good” and “very good” examples based on the nuance of the generated text, rather than just a binary correct/incorrect label.

3. Vision: Visual Question Answering (VQA)

Finally, the researchers applied ByCS to the OKVQA dataset, where the model must answer questions about images requiring external knowledge.

Table 6: Results of VQA ICL with different in-context example selection methods and numbers of examples on OKVQA dataset.

Analysis:

  • Even in the visual domain, ByCS edges out the KATE+ baseline.
  • The gains are slightly smaller here, likely because VQA answers are often very short (one or two words), which limits the “richness” of the inverse inference signal compared to longer text generation.

Conclusion and Implications

The “ByCS” paper presents a compelling argument: when teaching AI through examples, interaction matters more than surface similarity.

By leveraging Bayes’ theorem, the researchers demonstrated that we can mathematically quantify how well a test input and a training example “understand” each other. This inverse inference approach ensures that the selected examples are not just close in vector space, but are functionally useful for the specific prediction task at hand.

Key Takeaways:

  1. Robustness: ByCS works across audio, text, and visual tasks.
  2. Efficiency: It can be optimized using smaller proxy models for the selection step without losing accuracy.
  3. Generative Power: It shines brightest in complex, open-ended generation tasks where the nuance of the example matters most.

As LLMs continue to dominate AI, techniques like ByCS will be crucial. They allow us to squeeze more performance out of existing frozen models simply by being smarter about what we ask them to look at. It turns out that sometimes, to get the right answer, you have to work backwards.