Large Language Models (LLMs) like GPT-4 have shown a remarkable capability: they can often perform new tasks immediately after seeing only a handful of examples. Whether it’s translating sentences, classifying customer sentiment, or solving logic puzzles, you can provide a few demonstrations and the model will produce a response for a new, unseen input. This phenomenon is known as In-Context Learning (ICL)—and it’s part of what makes these models feel so versatile.
But what’s actually happening under the hood? When an LLM performs ICL, is it learning in the rigorous, scientific sense? Or is it simply a form of sophisticated pattern matching—using its vast pre-trained knowledge to deduce the right answer without truly acquiring a new skill? The question “Is in-context learning learning?” is central to understanding the real capabilities—and limits—of modern AI.
A recent research paper, *Is In-Context Learning Learning?*, takes on this question with one of the largest empirical investigations of ICL to date. The authors examined multiple LLMs across varied formal tasks and systematically tested how changes to prompts, data distributions, and other factors affect performance.
The verdict? Yes, ICL does constitute learning—but it’s a very different kind of learning than in traditional machine learning. It comes with unique strengths and significant weaknesses. Let’s explore what they found.
What Do We Mean by “Learning”?
In machine learning theory, learning is synonymous with generalisation. A model has “learned” a task if, after exposure to examples from a data distribution \(\mathcal{P}\), it can perform well on new examples, even from a different distribution \(\mathcal{Q} \neq \mathcal{P}\).
This is captured formally by the Probably Approximately Correct (PAC) learning framework. In PAC learning, we measure the error rate of a learner \(f\) on a dataset \(D\): the proportion of incorrect predictions.
A model has truly learned if, with high probability, this error remains low for new datasets \(E\) drawn from other distributions \(\mathcal{Q}\).
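In symbols, a minimal sketch of those two ingredients (using \(\varepsilon\) for the tolerated error and \(1-\rho\) for the required confidence; \(\rho\) stands in for the usual PAC \(\delta\) to avoid clashing with the data-generation parameter \(\delta\) that appears later in this article):

\[
\mathrm{err}_D(f) = \frac{1}{|D|} \sum_{(x,\,y)\in D} \mathbf{1}\big[f(x)\neq y\big],
\qquad
\Pr_{E\sim\mathcal{Q}}\big[\mathrm{err}_E(f)\le\varepsilon\big] \;\ge\; 1-\rho.
\]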
Put simply, a good learner should stay accurate even when the data changes.
How Does In-Context Learning Fit In?
Traditional ML models learn during training by updating their internal weights based on data. In ICL, LLMs don’t update weights—they are frozen. Instead, they “learn” ad hoc at inference time. The training data is the set of exemplars in the prompt.
An LLM’s prediction is conditioned on the entire context: the system prompt, the exemplars, and the new query. The researchers formalize ICL as finding the most probable label \(f(x_k)\) given the prompt \(p\), the exemplars \(e_1, \dots, e_{k-1}\), and the query \(x_k\):

\[
f(x_k) = \arg\max_{y}\; \Pr\big(y \mid p, e_1, \dots, e_{k-1}, x_k\big)
\]
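To make this concrete, here is a minimal Python sketch of that decision rule. It assumes a hypothetical scoring function `label_logprob(context, label)` that returns the model’s log-probability of a candidate label given the full context; that helper is an assumption for illustration, not a real API.

```python
from typing import Callable, List, Tuple

def icl_predict(
    system_prompt: str,
    exemplars: List[Tuple[str, str]],           # (input, label) pairs shown in the prompt
    query: str,
    candidate_labels: List[str],
    label_logprob: Callable[[str, str], float], # hypothetical: log P(label | context)
) -> str:
    """Pick the most probable label for `query`, conditioned on the whole context.

    Mirrors f(x_k) = argmax_y P(y | p, e_1..e_{k-1}, x_k).
    """
    context = system_prompt + "\n"
    for x, y in exemplars:
        context += f"Input: {x}\nLabel: {y}\n"
    context += f"Input: {query}\nLabel:"

    # Score each candidate label under the frozen model and take the argmax.
    return max(candidate_labels, key=lambda y: label_logprob(context, y))
```

Nothing in this loop updates any weights; the only thing that changes between “training” runs is the list of exemplars pasted into the context.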
Mathematically, this fits within PAC learning: the exemplars in the context induce a hypothesis, and the query tests it. In that sense, ICL is a formal learning process. But theory alone doesn’t answer: How effective is ICL? How robust is it? Does it break when the prompt or data changes? For that, the authors turned to a massive empirical study.
The Experiment: Dissecting ICL
The study evaluated four LLMs—GPT-4 Turbo, GPT-4o, Mixtral 8x7B Instruct, and Phi-3.5 MoE—across nine tasks grounded in formal language theory and computer science. Collectively, the experiments produced 1.89 million predictions per model.
The Tasks
Tasks ranged from simple to complex, with known computational characteristics:
- PARITY: Decide if a binary string has an even number of zeros (Finite State Automaton, FSA).
- Pattern Matching: Detect if a fixed substring appears in a string (FSA).
- Reversal: Check if \(l\#r\) has \(l\) equal to the reverse of \(r\) (Pushdown Automaton, PDA).
- Stack: Simulate stack operations and verify the final string (PDA).
- Hamiltonian: Determine if a path visits every vertex exactly once.
- Maze Solve / Maze Complete: Variations on maze traversal.
- Vending Machine: Calculate balances after sequences of transactions (decision and arithmetic versions).
By choosing tasks solvable with FSAs and PDAs, the study could probe whether performance correlates with theoretical complexity.
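For a sense of how simple the ground truth is to compute classically, here is a sketch of the PARITY check as a two-state FSA, following the task description above (the authors’ exact automaton may differ):

```python
def parity_accepts(s: str) -> bool:
    """Two-state FSA for the PARITY task as described above:
    accept iff the binary string contains an even number of zeros."""
    state = 0                  # 0 = even number of zeros seen so far (accepting), 1 = odd
    for ch in s:
        if ch == "0":
            state = 1 - state  # flip parity on every zero
        elif ch != "1":
            raise ValueError(f"not a binary string: {s!r}")
    return state == 0

# "1001" has two zeros (even) -> accepted; "101" has one zero (odd) -> rejected.
assert parity_accepts("1001")
assert not parity_accepts("101")
```

The LLM never sees code like this, of course; it only sees labelled strings in the prompt and must infer the rule from them.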
Data: Controlling the Distribution
All datasets were synthetic, generated by custom automata, giving precise control over the statistical properties of training (prompt exemplars) and test data.
Figure 1: PARITY data generation automaton. Transition probability \(\delta\) controls how similar or different datasets are.
The researchers created in-distribution (ID) test sets, statistically similar to the exemplars, and out-of-distribution (OOD) sets, increasingly different from them as \(\delta\) grows from 0 (ID) to 0.85 (far OOD).
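The paper’s actual generator isn’t reproduced here, but the idea can be sketched as follows: an automaton emits strings, and \(\delta\) shifts its probabilities away from those used to produce the exemplars, so larger \(\delta\) means a larger statistical gap. Everything below (the probabilities and the way \(\delta\) enters) is an illustrative assumption, not the paper’s generator.

```python
import random
from typing import List

def generate_strings(n: int, length: int, delta: float, seed: int = 0) -> List[str]:
    """Illustrative only: emit binary strings whose zero-frequency is shifted by delta.

    delta = 0.0 mimics the in-distribution generator (zeros with probability 0.5);
    larger delta skews the distribution, standing in for an increasingly OOD test set.
    """
    rng = random.Random(seed)
    p_zero = 0.5 * (1.0 - delta)   # assumed way of folding delta into the generator
    return [
        "".join("0" if rng.random() < p_zero else "1" for _ in range(length))
        for _ in range(n)
    ]

# delta=0.0 -> ID-like data; delta=0.85 -> heavily skewed, far-OOD data.
id_set  = generate_strings(100, 12, delta=0.0)
ood_set = generate_strings(100, 12, delta=0.85)
```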
The Prompts
The study tested a spectrum of prompt styles:
- n-Shot / Modus Ponens: Just examples, no instructions.
- Description: Instructions + examples.
- Chain-of-Thought (CoT): Step-by-step reasoning output before the answer.
- Automated Prompt Optimization (APO): Meta-prompting to auto-generate an “optimal” system prompt.
- Direct Encoding (DE): Explicit formal rules or code implementing the task.
- Word Salad / Salad-of-Thought (SoT): Replace instruction text with nonsensical words to test dependence on natural-language semantics.
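As a rough illustration of how a few of these styles might be templated for a task like PARITY (the wording is invented for this sketch and not taken from the paper):

```python
from typing import List, Tuple

def render_prompt(style: str, exemplars: List[Tuple[str, str]], query: str) -> str:
    """Assemble a prompt in one of the styles above (wording is illustrative only)."""
    headers = {
        "n_shot": "",  # exemplars only, no instructions
        "description": "Answer 'yes' if the binary string has an even number of zeros, otherwise 'no'.\n",
        "cot": "Count the zeros step by step, then answer 'yes' or 'no'.\n",
        "word_salad": "Quixotic marmalade vector unless tomorrow freight.\n",  # nonsense replacing the instruction
    }
    body = "".join(f"Input: {x}\nAnswer: {y}\n" for x, y in exemplars)
    return headers[style] + body + f"Input: {query}\nAnswer:"

print(render_prompt("word_salad", [("1001", "yes"), ("101", "no")], "0110"))
```

A full CoT prompt would also include worked reasoning inside the exemplars themselves; the template above only swaps the instruction header, which is enough to show the contrast being tested.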
Results: What the Data Shows
1. Forget “Few-Shot”—Think “Many-Shot”
Accuracy generally increased with more exemplars, often peaking at 50–100 shots.
Figure 2: More shots yield better average accuracy and shrink performance gaps between prompts and models.
In the limit, ICL’s performance depends more on the autoregressive mechanism common to all LLMs than on model-specific traits or prompt style.
2. Brittle to Distributional Shift
Accuracy fell sharply as test data moved OOD (\(\delta \to 0.85\)).
Figure 3: Accuracy drop from ID to OOD datasets. The Reversal task is notably brittle.
Advanced prompting strategies (CoT, APO) not only overfit to the ID exemplars but also degraded faster than simpler prompts when faced with OOD inputs.
3. Inconsistent Across Tasks
Even tasks with similar formal complexity yielded wildly different performance.
Table 1: Peak average accuracies by task. Pattern Matching (~94%) vastly outperformed Maze Solve (~63%) despite both being FSA-based.
This suggests limited cross-task generalisability: solving one problem does not guarantee solving another, even if mathematically analogous.
4. Word Salad Surprises
Replacing clear instructions with random words hurt performance at low shot counts, but the gap often closed at high shot counts.
Figure 4: At high shot counts, Word Salad prompts match Description prompts, suggesting that prompt structure and exemplars matter more than instruction semantics.
This implies that for formal tasks, LLMs may rely more on the structural cues of the prompt and statistical regularities from exemplars than on semantic comprehension.
Conclusion: Is ICL Learning?
Yes. Models aren’t just regurgitating memorized content; they improve predictions dynamically at inference time as more exemplars are provided.
But…
- Data-Hungry: Real gains often require many examples, not the “few” shots implied by the term.
- Overfits to Prompt: Sophisticated prompts can worsen OOD robustness.
- Inconsistent: Success varies unpredictably between tasks.
- Not Always Semantic: Instruction meaning may be secondary to data structure in some tasks.
The authors argue that ICL focuses heavily on statistical features in the prompt rather than abstracting deeper relations within the data—making it powerful yet fragile.
Implications
We cannot evaluate LLMs on one benchmark and one prompt and assume robust performance. This work underscores the need to test across varied prompts, shot counts, and data distributions to avoid spurious conclusions.
By moving beyond hype toward deeper experimental rigor, researchers can better understand the true strengths and limits of LLMs, and design systems that learn and generalise more reliably.