Introduction
In recent years, the field of Natural Language Processing (NLP) has undergone a paradigm shift. We have moved from training specific models for specific tasks to utilizing massive, general-purpose Large Language Models (LLMs) like GPT-4 and Llama. What makes these models truly revolutionary is not just their size, but their ability to learn a new task simply by looking at a few examples, without any update to their internal parameters. This phenomenon is known as In-Context Learning (ICL).
Imagine you want to teach a model to classify the sentiment of a movie review. In the traditional supervised learning era, you would fine-tune a model on thousands of labeled reviews, updating the neural network’s weights via backpropagation. With ICL, the process is radically different. You simply provide the model with a prompt containing a few examples—“The movie was great! (Positive)”, “I fell asleep. (Negative)”—followed by your new query. The model, recognizing the pattern, predicts the sentiment of the new query.
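To make this concrete, here is a minimal sketch of how such a prompt might be assembled. The example reviews and the `llm_complete` call are placeholders for whatever model or API you use, not a specific implementation; the key point is that no weights are updated anywhere.

```python
# Minimal sketch of an ICL prompt for sentiment classification.
# The demonstrations and the `llm_complete` function are placeholders.

demonstrations = [
    ("The movie was great!", "Positive"),
    ("I fell asleep.", "Negative"),
]

query = "The plot dragged, but the acting saved it."

prompt_lines = ["Classify the sentiment of each review."]
for text, label in demonstrations:
    prompt_lines.append(f"Review: {text}\nSentiment: {label}")
prompt_lines.append(f"Review: {query}\nSentiment:")

prompt = "\n\n".join(prompt_lines)
# prediction = llm_complete(prompt)  # e.g. "Positive"
print(prompt)
```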
This ability to learn from analogy is the core of ICL. It transforms LLMs into general-purpose problem solvers that can adapt to new tasks on the fly.

As illustrated in Figure 1 above, the workflow is intuitive. A “prompt context” is constructed using \(k\) demonstration examples. These are concatenated with a new query and fed into the LLM. The model, frozen in its pre-trained state, outputs the prediction.
In this blog post, we will dissect a comprehensive survey on In-Context Learning. We will explore how models are trained to acquire this ability, how we can engineer prompts to maximize performance, the mathematical mechanisms behind why this works, and the challenges that lie ahead.
The Landscape of In-Context Learning
To navigate the complex literature surrounding ICL, it helps to have a map. The research paper categorizes the field into three primary branches: Model Training, Inference, and Analysis.
- Training: How do we pre-train or warm up a model so that it can learn in-context?
- Inference: Once we have a capable model, how do we format demonstrations, select examples, and score outputs to get the best results?
- Analysis: Why does this work? What are the underlying mechanics inside the transformer that allow for learning without weight updates?

Figure 2 provides a hierarchical taxonomy of these areas. We will use this structure to guide our deep dive, starting with the formal definition and then moving through training and inference.
Defining In-Context Learning
Before we look at the engineering, let’s formalize what we are talking about. ICL is defined as a paradigm where a language model learns a task given only a few examples in the form of demonstrations.
Mathematically, let \(x\) be a query input and \(Y\) be a set of candidate answers. We want the model to predict the best answer based on a context set \(C\), which contains instructions and \(k\) examples.
The probability of a specific candidate answer \(y_j\) is calculated as:
\[ P(y_j \mid x) \triangleq f_{\mathcal{M}}(y_j, C, x), \quad y_j \in Y \]
Here, \(f_{\mathcal{M}}\) represents the scoring function of the pre-trained language model. The final prediction \(\hat{y}\) is simply the candidate with the highest probability:
\[ \hat{y} = \arg\max_{y_j \in Y} P(y_j \mid x) \]
The crucial distinction here is that the parameters of the model \(\mathcal{M}\) are not updated. The learning happens entirely within the context window during the forward pass. This distinguishes ICL from Few-Shot Learning (which traditionally implies parameter adaptation) and makes it a subclass of Prompt Learning.
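In code, ICL inference is nothing more than scoring each candidate answer under the frozen model and taking the argmax. The sketch below is schematic: `score` stands in for whatever scoring function \(f_{\mathcal{M}}\) you plug in (we return to concrete choices in the Scoring Functions section).

```python
# Schematic view of ICL inference: score every candidate answer under the
# frozen model and return the argmax. `score` stands in for f_M.

from typing import Callable, List, Tuple

def icl_predict(
    context: str,                              # instruction + k demonstrations
    query: str,
    candidates: List[str],                     # the candidate answer set Y
    score: Callable[[str, str, str], float],   # f_M(y, C, x)
) -> Tuple[str, float]:
    scored = [(y, score(y, context, query)) for y in candidates]
    return max(scored, key=lambda pair: pair[1])

# Usage (with a real scoring function backed by an LLM):
# label, s = icl_predict(context, "I fell asleep.", ["Positive", "Negative"], score_fn)
```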
Phase 1: Model Training
A common misconception is that ICL is purely a prompting technique. However, the ability to learn in-context is not always innate; it is often forged during the training process. While vanilla LLMs show ICL capabilities, specific training strategies can significantly boost this performance.
There are two main stages where ICL capabilities can be cultivated: Pretraining and Warmup.

Pretraining
During the initial pretraining of an LLM (the left side of Figure 3), the data distribution plays a massive role. If a model is trained on a corpus where documents are unrelated or purely sequential, it may not develop strong in-context learning abilities.
Researchers have found that “burstiness”—where similar items or topics appear in clusters—helps the model learn to attend to local context. Furthermore, continuing to pretrain models on datasets that are specifically reorganized to highlight related contexts can teach the model to reason across demonstrations.
Warmup
The “Warmup” stage (right side of Figure 3) acts as a bridge between raw pretraining and downstream ICL application. This is often referred to as Instruction Tuning.
In this stage, the model is fine-tuned on a wide variety of tasks formatted with instructions and examples. For instance, a dataset might include a sentiment analysis task, a translation task, and a logic puzzle, all converted into a unified prompt format. By fine-tuning on these mixtures, the model explicitly learns the meta-task of “following instructions” and “learning from examples.”
Techniques like Symbol Tuning—where natural language labels are replaced with arbitrary symbols (e.g., replacing “Positive” with “Foo”)—force the model to rely solely on the relationships defined in the context, rather than its prior knowledge of words, effectively sharpening its reasoning skills.
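As a toy illustration of the idea behind symbol tuning (the symbol choices and prompt formatting here are ours, not a prescribed recipe), one might remap labels before building the warmup training examples:

```python
# Toy illustration of symbol tuning: natural-language labels in the
# demonstrations are replaced with arbitrary symbols, so the model must
# infer the input-label mapping from the context alone.

symbol_map = {"Positive": "Foo", "Negative": "Bar"}

demos = [
    ("The movie was great!", "Positive"),
    ("I fell asleep.", "Negative"),
]

def to_symbol_tuning_example(demos, query_text, query_label):
    lines = []
    for text, label in demos:
        lines.append(f"Input: {text}\nLabel: {symbol_map[label]}")
    lines.append(f"Input: {query_text}\nLabel:")
    prompt = "\n\n".join(lines)
    target = symbol_map[query_label]   # the token the model is trained to emit
    return prompt, target

prompt, target = to_symbol_tuning_example(demos, "A wonderful surprise.", "Positive")
print(prompt, "\n-->", target)
```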
Phase 2: Prompt Designing (The Core Method)
Once we have a trained model, the burden shifts to inference. How we present the “context” to the model changes everything. This is where Prompt Designing comes in. It is not just about writing a good sentence; it is a systematic engineering challenge involving demonstration organization and scoring functions.
Demonstration Organization
The selection, ordering, and formatting of the examples you provide (the demonstrations) can cause massive variance in performance.
1. Demonstration Selection
Which examples should you put in the prompt? Random selection is the baseline, but it is rarely optimal. The goal is to find examples that are most helpful for the current query.
- Unsupervised Methods: These methods select examples based on similarity metrics without needing a separate training step. A popular approach is KATE (\(k\)NN-based), which retrieves training examples that are semantically closest to the test query using embedding distance (a minimal retrieval sketch appears below). Other methods use Mutual Information or Perplexity to find examples that the model finds “surprising” or informative.
- Supervised Methods: These involve training a separate “retriever” model specifically to find good examples. For instance, EPR (Efficient Prompt Retrieval) trains a dense retriever to select examples that maximize the likelihood of the correct output.
Table 1 below summarizes the key methods in this space.
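To give a feel for \(k\)NN-style selection in the spirit of KATE, here is a minimal sketch. It assumes the sentence-transformers package; the model name and the candidate pool are arbitrary placeholders.

```python
# kNN-style demonstration selection in the spirit of KATE: embed the
# candidate pool and the test query, then pick the k nearest neighbours.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

pool = [
    ("The movie was great!", "Positive"),
    ("I fell asleep.", "Negative"),
    ("A masterpiece of modern cinema.", "Positive"),
    ("Two hours I will never get back.", "Negative"),
]
query = "Surprisingly touching, I loved it."
k = 2

pool_emb = encoder.encode([text for text, _ in pool], normalize_embeddings=True)
query_emb = encoder.encode([query], normalize_embeddings=True)[0]

# Cosine similarity (embeddings are L2-normalised, so a dot product suffices).
sims = pool_emb @ query_emb
top_k = np.argsort(-sims)[:k]
demonstrations = [pool[i] for i in top_k]
print(demonstrations)
```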

2. Demonstration Ordering
Even after selecting the best examples, the order in which you present them matters. One well-documented effect is “recency bias”: models tend to be more influenced by the examples closer to the end of the prompt (near the query).
Research suggests ordering examples based on their embedding distance (putting similar examples closer to the query) or by entropy metrics. A more recent approach, ICCL, suggests a curriculum-based ordering: arranging demonstrations from simple to complex to guide the model’s reasoning process incrementally.
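A small sketch of the similarity-based ordering heuristic might look like this. The similarity scores are made up; placing the closest match last exploits the recency effect described above, and is only one of the possible heuristics.

```python
# Ordering sketch: given similarity scores for the selected demonstrations,
# place the most similar one last (nearest the query).

selected = [
    (0.62, ("I fell asleep.", "Negative")),
    (0.81, ("A masterpiece of modern cinema.", "Positive")),
    (0.74, ("The movie was great!", "Positive")),
]

# Sort ascending by similarity so the closest match ends up last.
ordered = [demo for _, demo in sorted(selected, key=lambda pair: pair[0])]

prompt = "\n\n".join(
    [f"Review: {text}\nSentiment: {label}" for text, label in ordered]
    + ["Review: Surprisingly touching, I loved it.\nSentiment:"]
)
print(prompt)
```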
Scoring Functions
Once the prompt is set, how do we actually extract a prediction? The model outputs probabilities for the next token, but we need to map that to a specific answer (e.g., a classification label).
There are three main approaches to scoring, summarized in Table 3:

- Direct: We calculate the conditional probability of the label token (e.g., “Positive”) given the context. We pick the label with the highest probability. This is efficient but requires the answer to be at the very end of the sequence (a minimal sketch of this approach follows the list).
- Perplexity (PPL): Instead of just checking the next token, we compute the perplexity (a measure of how well the model predicts the sequence) for the entire sentence formed by the query and the candidate answer. We choose the answer that makes the full sentence most probable. This is more robust but computationally expensive because it requires a separate forward pass for every candidate answer.
- Channel: This method flips the probability using Bayes’ rule. It calculates \(P(x | y)\), effectively asking: “If the sentiment is Positive, how likely is it that the review text would be \(x\)?” This approach, often called the “Noisy Channel” method, helps mitigate imbalances in the training data (e.g., if the model just loves predicting the word “Positive” regardless of context).
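Below is a minimal sketch of direct scoring with a Hugging Face causal LM. It is illustrative only: the model name is a placeholder, and it scores just the first token of each label, which real pipelines handle more carefully. Perplexity and channel scoring follow the same pattern but score a full sequence (or swap the roles of input and label).

```python
# Minimal sketch of "direct" scoring with a causal LM: compare the
# next-token probability of each verbalised label given the prompt.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Review: The movie was great!\nSentiment: Positive\n\n"
    "Review: I fell asleep.\nSentiment: Negative\n\n"
    "Review: Surprisingly touching, I loved it.\nSentiment:"
)

labels = [" Positive", " Negative"]  # leading space matters for GPT-2's BPE

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]        # distribution over the next token
    log_probs = torch.log_softmax(logits, dim=-1)

scores = {}
for label in labels:
    token_id = tokenizer(label, add_special_tokens=False)["input_ids"][0]
    scores[label] = log_probs[token_id].item()     # first-token approximation

prediction = max(scores, key=scores.get)
print(scores, prediction)
```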
Analysis: Why Does ICL Work?
Perhaps the most fascinating part of the survey is the analysis of why ICL works. How can a static matrix of numbers perform dynamic learning?

Influencing Factors
Figure 4 highlights several factors that dictate success. In the Pretraining Stage, the diversity of the corpus is key. A model trained on a narrow domain struggles to generalize. In the Inference Stage, the “Input-Label Mapping” is critical. Interestingly, earlier studies suggested that models didn’t actually care if the labels were correct (random labels worked almost as well as correct ones), implying the model was just copying the format. However, newer, larger models have been shown to actually learn the semantic relationship, degrading in performance if the labels are flipped.
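A simple way to probe this input-label mapping is the flipped-label experiment: keep the demonstration inputs but invert their labels, and measure how much accuracy drops relative to the gold-label prompt. A minimal sketch (with a toy binary flip map of our own) is below.

```python
# Sketch of the flipped-label probe: same inputs, inverted labels in the
# demonstrations. A large accuracy drop under `flipped_demos` suggests the
# model genuinely uses the input-label mapping, not just the format.

flip = {"Positive": "Negative", "Negative": "Positive"}

gold_demos = [("The movie was great!", "Positive"), ("I fell asleep.", "Negative")]
flipped_demos = [(text, flip[label]) for text, label in gold_demos]

def build_prompt(demos, query):
    body = [f"Review: {t}\nSentiment: {l}" for t, l in demos]
    body.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(body)

gold_prompt = build_prompt(gold_demos, "A wonderful surprise.")
flipped_prompt = build_prompt(flipped_demos, "A wonderful surprise.")
```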
Learning Mechanisms
Two major theoretical frameworks attempt to explain the mechanics of ICL:
- The Induction Head View: Mechanistic interpretability researchers have identified specific components in the Transformer architecture called “Induction Heads.” These heads are capable of copying patterns. If the model sees [A] -> [B] in the context and then sees [A] again, the induction head increases the probability of [B]. As the model scales, these heads become capable of more abstract pattern matching.
- The Gradient Descent View: This is a profound theoretical insight. Researchers have mathematically demonstrated that the attention mechanism in a Transformer can be viewed as implicitly performing a step of gradient descent. When the model attends to the context examples, it is essentially calculating an error signal and updating its internal state (the “meta-parameters”), much as a model updates its weights during fine-tuning. This suggests that ICL is not magic; it is an approximate version of explicit training happening on the fly.
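A compact way to see the gradient-descent view, simplified to linear attention and written in notation of our own rather than the survey’s: one gradient step on a linear layer and a linear-attention readout both apply a sum of outer products to the incoming input.

\[
\Big(W_0 + \underbrace{\textstyle\sum_i e_i\, x_i^{\top}}_{\Delta W \text{ from one gradient step}}\Big)\, x
\quad\longleftrightarrow\quad
\Big(\underbrace{\textstyle\sum_i v_i\, k_i^{\top}}_{\text{linear attention over demos}}\Big)\, q
\]

Here \(e_i\) is the error signal for training example \(x_i\), while \(v_i\) and \(k_i\) are the values and keys built from the in-context demonstrations. In both cases the “update” is assembled on the fly from the examples, without ever touching the frozen weights \(W_0\).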
Applications and Beyond Text
While text is the most common medium, ICL is a general paradigm. The survey highlights that this “analogy-based learning” extends to other modalities.
Visual In-Context Learning
Just as we prompt an LLM with text examples, we can prompt Vision-Language Models with images. As shown in Figure 5, we can provide a grid of images. For example, to perform foreground segmentation, we provide pairs of “Original Image” and “Segmented Mask.” When given a new query image, the model generates the corresponding mask.

This capability has also been observed in speech synthesis (providing a few seconds of audio to clone a voice) and multi-modal tasks (interleaved text and images).
Challenges and Future Directions
Despite its success, ICL is not without flaws. The survey identifies several critical bottlenecks:
- Context Length & Efficiency: The number of examples you can provide is limited by the model’s context window. Furthermore, processing long contexts is computationally expensive (\(O(n^2)\) complexity for attention). Future work is focused on “distilling” contexts into compact vectors to save space and compute.
- Sensitivity: ICL can be brittle. A slight change in the wording of the instruction or the order of examples can flip the prediction. Achieving robustness is a major area of ongoing research.
- Generalization: ICL struggles with tasks that are entirely disjoint from its pre-training data. While it adapts well, it cannot invent new knowledge from scratch.
Conclusion
In-Context Learning represents a fundamental shift in how we interact with Artificial Intelligence. It moves us away from the rigid “train-then-test” pipeline toward a fluid, conversational style of interaction where models adapt to our needs in real-time.
By understanding the taxonomy of ICL—from training strategies like instruction tuning to inference techniques like demonstration selection—we can unlock the full potential of these models. The theoretical connections to gradient descent and Bayesian inference suggest that we are only scratching the surface of understanding how these large models actually “think.”
For students and researchers entering this field, the message is clear: the model is not just a static database of text; it is a dynamic engine capable of learning from the breadcrumbs you leave in the context. Mastering the art of scattering those breadcrumbs is the key to mastering modern NLP.