Large Language Models (LLMs) like GPT-4 and Claude have a seemingly magical ability: show them just a few examples of a task in a prompt—say, two labeled sentences or snippets of code—and they can instantly perform that task on new data. This capability, known as In-Context Learning (ICL), allows them to translate languages, analyze sentiment, or even write algorithms with just a handful of demonstrations, all without any updates to their underlying weights.

But how does this actually work? Is the model truly learning a new skill from those few examples, like a human student? Or is it just recognizing a pattern it already mastered during its immense pre-training—using the examples merely as hints? This question is one of the biggest mysteries in AI research today.

A recent survey paper, A Survey to Recent Progress Towards Understanding In-Context Learning, offers a clear roadmap through this puzzle. The authors propose a unified data generation perspective that organizes recent work into two fundamental capabilities: Skill Recognition and Skill Learning.

In this post, we’ll explore their insights to build a more intuitive picture of what’s happening inside the black box when an LLM learns “on the fly.”


What Exactly Is In-Context Learning?

Before we dive deeper, let’s clarify what ICL entails.
In-Context Learning is the process of providing an LLM with a prompt containing a few examples—called demonstrations—that show an input–output relationship. The model then uses these examples to predict the output for a new query input.

Illustration of In-Context Learning for sentiment analysis, showing several labeled examples (“Wonderful food!” → “Positive”, “The Beef is overcooked.” → “Negative”) followed by a new query (“Fruits taste great.”) for the model to complete.

Figure 1. In-Context Learning example for sentiment analysis. The upper examples provide labeled demonstrations; the bottom line is a new query for which the model must infer sentiment.

As shown above, the model is given example input–output pairs and must infer the sentiment label for the new review “Fruits taste great.” The key point: we haven’t fine-tuned the model or changed its parameters. The apparent “learning” happens entirely through the context of this single prompt.
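
To make this concrete, here is a minimal sketch in Python of how such a prompt is typically assembled. The `query_llm` helper is hypothetical, a stand-in for whatever completion API you use; only the prompt construction matters here.

```python
# Minimal sketch of assembling an in-context learning prompt.
# `query_llm` is a hypothetical helper standing in for an actual model call.

demonstrations = [
    ("Wonderful food!", "Positive"),
    ("The Beef is overcooked.", "Negative"),
]
query = "Fruits taste great."

prompt_lines = []
for text, label in demonstrations:
    prompt_lines.append(f"Review: {text}\nSentiment: {label}")
prompt_lines.append(f"Review: {query}\nSentiment:")  # the model completes this line

prompt = "\n\n".join(prompt_lines)
# prediction = query_llm(prompt)   # expected completion: "Positive"
print(prompt)
```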

The paper’s central idea is to interpret this phenomenon through a data generation lens. Each task—whether sentiment classification, translation, or reasoning—can be thought of as following a specific rule or data generation function that maps inputs to outputs.
During pre-training, the LLM learns a massive library of such functions from billions of text sequences.
During ICL inference, the model uses the demonstrations in the prompt to decide which function best explains the examples and then applies that function to the query.
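
A toy way to picture this (purely illustrative, not how an LLM is actually implemented): imagine the model holding a library of candidate mapping functions and scoring each one by how well it reproduces the demonstrations.

```python
# Toy illustration of the data generation view: a "library" of candidate
# functions learned in pre-training, scored by how well each explains the demos.

def sentiment(text: str) -> str:
    return "Positive" if "great" in text or "Wonderful" in text else "Negative"

def shout(text: str) -> str:
    return text.upper()

library = {"sentiment analysis": sentiment, "uppercasing": shout}

demos = [("Wonderful food!", "Positive"), ("The Beef is overcooked.", "Negative")]

def score(fn) -> int:
    # How many demonstrations does this candidate function explain?
    return sum(fn(x) == y for x, y in demos)

best_name, best_fn = max(library.items(), key=lambda kv: score(kv[1]))
print(best_name, "->", best_fn("Fruits taste great."))  # sentiment analysis -> Positive
```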


The Two Faces of ICL: Skill Recognition vs. Skill Learning

From this data-generation perspective, the paper defines two complementary abilities:

  1. Skill Recognition – The LLM acts like an expert archivist. It has internalized countless data-generation functions during pre-training. When given a few examples, it recognizes which known function matches the pattern and retrieves the relevant behavior. It is not learning anything new; it is identifying an existing skill that fits.

  2. Skill Learning – The model behaves like a fast student. Using the small set of examples, it constructs a new function—one it hasn’t seen before—and applies it immediately. The demonstrations are treated as a mini-training set for on-the-fly generalization.

Understanding when and how LLMs switch between these modes helps demystify what “learning” in context truly means.


Deep Dive 1: Skill Recognition as Bayesian Inference

The most widely accepted explanation for skill recognition is that LLMs perform implicit Bayesian inference.
The intuition: given many possible hypotheses (skills or concepts) acquired during pre-training, the model must infer which one best accounts for the examples in the prompt.

Mathematically, this process can be expressed as:

\[ p(y|\text{prompt}) = \int_{\text{concept}} p(y|\text{concept, prompt})\,p(\text{concept}|\text{prompt})\,d(\text{concept}) \]

Here’s how to interpret each term:

  • \(p(y|\text{prompt})\) – the probability of generating output y given the prompt;
  • concept – a latent variable representing a pre-learned data generation function (for instance, “sentiment analysis” or “translation”);
  • \(p(\text{concept}|\text{prompt})\) – the model’s belief that a particular concept explains the provided demonstrations;
  • \(p(y|\text{concept, prompt})\) – given that concept, how likely is the output y?

In practice, the model sees patterns like “Review: … Sentiment: …” and internally activates the concept “sentiment analysis,” applying the corresponding mapping it learned during pre-training.
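
A small numerical sketch of that marginalization, with the integral replaced by a sum over a discrete set of concepts and all probabilities invented purely for illustration:

```python
# Toy Bayesian-inference view of skill recognition: the output distribution
# is a mixture over latent concepts, weighted by how well each concept
# explains the prompt. All numbers here are made up for illustration.

p_concept_given_prompt = {          # p(concept | prompt)
    "sentiment analysis": 0.90,
    "topic classification": 0.08,
    "translation": 0.02,
}

p_y_given_concept = {               # p(y | concept, prompt) for y = "Positive"
    "sentiment analysis": 0.95,
    "topic classification": 0.30,
    "translation": 0.01,
}

# p(y | prompt) = sum over concepts of p(y | concept, prompt) * p(concept | prompt)
p_y_given_prompt = sum(
    p_y_given_concept[c] * p_concept_given_prompt[c]
    for c in p_concept_given_prompt
)
print(f"p('Positive' | prompt) = {p_y_given_prompt:.3f}")  # dominated by the sentiment concept
```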

Researchers have represented these latent concepts using tools such as Hidden Markov Models (HMMs) or Latent Dirichlet Allocation (LDA) topics.
This Bayesian view elegantly explains how an LLM can retrieve a relevant generative function without explicit parameter updates.
A limitation arises when the structured input–label format of an ICL prompt differs from the model’s pre-training data distribution; even so, theory and experiments suggest that with enough demonstrations, the model reliably locates the right concept.


Deep Dive 2: Skill Learning as Function Learning

What happens when the task cannot be matched to any known pre-training concept—something truly novel?
Here we move to the function learning framework, which models the LLM’s ability to form new input-output relationships from scratch.

In this setup, researchers train a transformer directly on synthetic data designed to teach learning itself—not just next-token prediction.
The training objective can be summarized as an expectation over input sequences and target functions \(f\):

Visual depiction of the function learning framework, showing inputs and labels, a transformer model, and a loss measuring how well the model predicts \(f(x_i)\) for new inputs given previous in-context pairs.

Figure 2. Function learning objective: the transformer learns to predict \(f(x_i)\) for a new sample based on prior input–output pairs in the same prompt.
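
In symbols, the objective sketched in the figure can be written roughly as follows; this is the standard function-learning formulation, reconstructed here rather than quoted from the survey:

\[ \min_{\theta} \; \mathbb{E}_{f,\, x_1, \dots, x_{k+1}} \Big[ \ell\big( M_{\theta}(x_1, f(x_1), \dots, x_k, f(x_k), x_{k+1}),\ f(x_{k+1}) \big) \Big] \]

where \(M_{\theta}\) is the transformer, \(f\) is drawn from the chosen function class, and \(\ell\) is a pointwise loss such as squared error.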

In simpler language: for many different functions \(f\), the model sees examples of pairs \((x, f(x))\). Then, given a new \(x_i\), it must predict \(f(x_i)\).
By training across a diverse collection of function classes—such as linear regression, decision trees, or small neural networks—the transformer gradually learns how to learn.
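
As a concrete sketch of what that synthetic training data looks like, here is one prompt’s worth of data for the linear-regression function class (using NumPy; the exact recipe varies by paper, and the `model(...)` call is hypothetical):

```python
import numpy as np

# Sketch of synthetic "function learning" data for one prompt:
# sample a random linear function f(x) = w·x, then a sequence of
# (x_i, f(x_i)) pairs; the model must predict f(x) for the next x
# given all previous pairs in the same prompt.

rng = np.random.default_rng(0)
dim, n_examples = 5, 8

w = rng.normal(size=dim)                    # a fresh target function per prompt
xs = rng.normal(size=(n_examples, dim))     # in-context inputs
ys = xs @ w                                 # their labels under f

# The prompt is the interleaved sequence x_1, y_1, ..., x_k, y_k, x_{k+1};
# training minimizes the error of the model's prediction against y_{k+1}.
for k in range(1, n_examples):
    context = list(zip(xs[:k], ys[:k]))     # pairs the model may condition on
    query_x, target_y = xs[k], ys[k]
    # loss = (model(context, query_x) - target_y) ** 2   # hypothetical model call
```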

Can Transformers Really Learn New Functions?

Surprisingly, yes—within limits.
Experiments show that transformers can approximate new functions from familiar families.
For example, a model pre-trained on numerous linear functions can quickly infer a new linear function from only a few in-context examples.
However, if presented with a quadratic mapping, it fails; the capacity to learn appears constrained to the pre-training function class.
Thus, ICL isn’t arbitrary magic—it’s grounded in pattern families learned before.

The Inner Mechanisms of ICL

How can a single forward pass emulate learning?
Theory suggests that the attention mechanism in transformers may implicitly implement gradient-descent-like optimization.

During pre-training, the matrices controlling attention—\(W_Q\), \(W_K\), and \(W_V\)—are shaped such that processing examples through them generates internal activations equivalent to having updated weights.
Even without any gradient step, the computation behaves as if fine-tuning occurred inside the forward pass.

Researchers have demonstrated this correspondence mathematically in simplified, linearized transformer models.
As architectures and scale increase, emergent behavior appears: large models begin accurately approximating closed-form regression solutions in a single forward pass.
This line of work blurs the boundary between inference and learning, hinting that ICL might indeed be a learned optimizer running implicitly.
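
A toy numerical check of this correspondence in the simplest linear case (following the spirit of the linearized-attention constructions, not any specific paper’s exact setup): one gradient-descent step on in-context linear regression yields the same prediction as an unnormalized linear “attention” over the demonstrations.

```python
import numpy as np

# Toy correspondence: for linear regression on in-context pairs (x_i, y_i),
# one gradient-descent step from w = 0 on the squared loss predicts
#   y_hat = eta * sum_i y_i * (x_i · x_q)
# which is exactly an (unnormalized) linear attention with keys x_i,
# values y_i, and query x_q.

rng = np.random.default_rng(1)
dim, n = 4, 16
eta = 0.1

X = rng.normal(size=(n, dim))       # in-context inputs
y = X @ rng.normal(size=dim)        # labels from a hidden linear function
x_q = rng.normal(size=dim)          # query input

# One GD step on L(w) = 0.5 * sum_i (w·x_i - y_i)^2, starting from w = 0:
w_after_one_step = eta * (y @ X)            # w = eta * sum_i y_i * x_i
gd_prediction = w_after_one_step @ x_q

# Linear "attention": scores are key·query dot products, values are the labels.
scores = X @ x_q
attention_prediction = eta * (y @ scores)

print(np.isclose(gd_prediction, attention_prediction))  # True
```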


When Does an LLM Learn vs. Recognize?

Real-world behavior seems to combine both abilities.
Which one dominates depends on the difficulty and clarity of the task.

A common test involves giving corrupted demonstrations, that is, examples with deliberately wrong labels; a minimal version of this probe is sketched after the list below.

  • When performance collapses, the model is relying on skill learning; it faithfully learns the corrupted mapping from the demonstrations, so its accuracy on the true task drops.
  • When performance remains steady, the model employs skill recognition; it ignores the nonsense labels and recalls the genuine pattern from pre-training.
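
A minimal sketch of how such a probe might be set up, reusing the hypothetical `query_llm` helper from earlier (real studies control label corruption and evaluation far more carefully):

```python
# Sketch of the corrupted-demonstration probe: flip the demonstration labels
# and compare the model's accuracy against a clean-prompt baseline.
# `query_llm` is a hypothetical stand-in for an actual model call.

demos = [("Wonderful food!", "Positive"), ("The Beef is overcooked.", "Negative")]
flip = {"Positive": "Negative", "Negative": "Positive"}

def build_prompt(pairs, query):
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in pairs]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

clean_prompt = build_prompt(demos, "Fruits taste great.")
corrupted_prompt = build_prompt([(x, flip[y]) for x, y in demos], "Fruits taste great.")

# If accuracy with the corrupted prompt stays close to the clean baseline, the
# model is recognizing a pre-trained skill; if it drops sharply, it is learning
# the (now faulty) mapping from the demonstrations themselves.
# clean_pred = query_llm(clean_prompt)
# corrupted_pred = query_llm(corrupted_prompt)
```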

Empirical studies observe that easy, well-structured tasks (like arithmetic or simple mappings) invoke skill learning; challenging, abstract tasks (like reasoning or composition) rely more heavily on skill recognition.

Summary table of research papers categorized by Skill Recognition (SR) or Skill Learning (SL), listing analysis viewpoint, data generation function, and core characteristics.

Figure 3. Representative works on In-Context Learning organized by ability type, analytical framework, and behavior studied.


The Bigger Picture and Future Frontiers

The recognition-versus-learning framework shines light on many open questions about LLMs.

  1. The Mystery of Emergence.
    Skill learning appears only once models surpass certain scale thresholds—often tens of billions of parameters. Why does this sudden phase shift occur, and how do features like “induction heads” support it?

  2. Skill Composition.
    Advanced prompting such as Chain-of-Thought (CoT) reveals a third capability: combining multiple learned skills to solve complex problems through intermediate reasoning steps. Understanding how skills compose could unify approaches to reasoning, planning, and knowledge synthesis.

  3. Bridging Theory and Practice.
    Most theoretical analyses simplify transformers into linear models working on synthetic data. Confirming these mechanisms in trillion-token, real-world training regimes remains a major challenge—and opportunity—for future research.


Conclusion

The mystery of In-Context Learning is gradually giving way to systematic understanding.
By viewing it through the data generation perspective and distinguishing skill recognition from skill learning, we gain a clear framework for reasoning about how LLMs behave inside prompts.

ICL is not a single phenomenon but a dynamic interplay:

  • sometimes recognition of familiar structures from pre-training,
  • sometimes creation of new mappings via emergent, implicit optimization.

As research continues to refine these frameworks and validate them in large-scale models, we’ll move closer to designing AI systems that learn more faithfully, reliably, and transparently—bridging the gap between the seeming magic of “learning in context” and the scientific principles that make it possible.