Why LLMs Learn (and Forget) How to Learn: The Story of Strategy Coopetition

If you have played with Large Language Models (LLMs) like GPT-4 or Claude, you are intimately familiar with In-Context Learning (ICL). This is the model’s ability to look at a few examples in your prompt (the context) and figure out how to solve a new task without any updates to its internal weights. It feels like magic. It is the bedrock of “few-shot prompting.”

But here is something strange: recent research suggests that this ability might be transient.

When training these models, ICL often emerges early on, only to disappear later, replaced by other strategies. It’s as if the model learns how to learn, and then decides to “unlearn” it in favor of memorization. Why does this happen? And more importantly, does the model completely throw away the ICL machinery, or does it repurpose it?

In a fascinating paper titled “Strategy Coopetition Explains the Emergence and Transience of In-Context Learning,” researchers dive deep into the internal circuitry of Transformers to answer these questions. They uncover a phenomenon they call Strategy Coopetition—a mix of cooperation and competition. It turns out that the mechanism for “learning from context” and the mechanism for “memorizing weights” are not distinct enemies. They are actually siblings that share the same neural circuits.

In this post, we will break down this paper step-by-step. We will explore the mystery of ICL transience, dissect the “Context-Constrained In-Weights Learning” (CIWL) strategy that replaces it, and look at the mathematical model that explains their rivalry.


1. The Setup: How to Watch a Model Learn

To understand how a model chooses a strategy, the researchers needed a controlled environment. Training a massive LLM on the entire internet is too messy for precise mechanistic interpretability. Instead, they trained 2-layer attention-only Transformers on a synthetic image classification task.

The Task: Omniglot Classification

The model is shown a sequence of images (handwritten characters from the Omniglot dataset) and their corresponding labels. The sequence looks like this:

[Example 1, Label 1] ... [Example N, Label N] ... [Query Image, ?]

The model must predict the label for the query image.

The Data: “Bursty” Distributions

Crucially, the training data is “bursty.” This mirrors real-world data (and language) where topics appear in clusters.

  • In-Context Opportunity: The query image belongs to the same class as one of the examples provided in the context. If the model looks back at the context, matches the query to an example, and copies that label, it gets the right answer. This is In-Context Learning (ICL).
  • In-Weights Opportunity: The mapping between images and labels is fixed throughout training. If the model simply memorizes that “Image Type A = Label 5,” it can ignore the context examples and still get the right answer. This is In-Weights Learning (IWL).

Because the data supports both strategies, the researchers could watch them race against each other.
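
To make this setup concrete, here is a minimal sketch of how one bursty training sequence could be assembled. The helper names and data structures are hypothetical, not the paper's code; the point is just that the fixed class-to-label mapping and the recurring query class give both strategies a path to the right answer.

```python
import random

def make_bursty_sequence(class_to_label, images_by_class, n_context=8):
    """Build one training sequence where the query's class also appears in the context."""
    # Pick the query class; it recurs in the context (the "bursty" property).
    query_class = random.choice(list(class_to_label))
    distractors = random.sample(
        [c for c in class_to_label if c != query_class], k=n_context - 1)
    context_classes = distractors + [query_class]
    random.shuffle(context_classes)

    # Context pairs use the *fixed* class-to-label mapping, so memorization (IWL)
    # works; the recurring query class means copying from context (ICL) works too.
    context = [(random.choice(images_by_class[c]), class_to_label[c])
               for c in context_classes]
    query_image = random.choice(images_by_class[query_class])
    target_label = class_to_label[query_class]
    return context, query_image, target_label
```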

The Evaluators: Determining the Strategy

How do you know if a model is using ICL or simple memorization? The researchers devised clever evaluation datasets, visualized below.

Figure 1 from the paper illustrating the different evaluation strategies and the learning curves. Panel (a) shows the input sequences for different evaluators like ICL, CIWL, and Flip. Panel (b) shows the accuracy of these strategies over training time, highlighting the rise and fall of ICL. Panel (c) is a diagram of the neural circuits involved.

Let’s look closely at Figure 1a above.

  1. ICL Evaluator (Blue): The labels are randomized (0s and 1s) or flipped relative to training. The only way to get the right answer is to look at the context. Memorization will fail.
  2. IWL Evaluator: The query image appears, but its matching example is not in the context. The model must use memorized weights. (Interestingly, the researchers found pure IWL rarely happens in this setup).
  3. CIWL Evaluator (Green): This stands for Context-Constrained In-Weights Learning. Here, the correct label is present in the context, but it is paired with the wrong image. However, the model knows the image-label mapping from training. This tests a hybrid strategy we will discuss shortly.
  4. Flip Evaluator (Red): This is the tie-breaker. The labels in the context are flipped (e.g., if A is usually 1, the context says A is 2).
  • If the model predicts 2, it is trusting the context (ICL).
  • If the model predicts 1, it is trusting its training weights (CIWL).
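
As a bookkeeping sketch (hypothetical helper, not the paper's evaluation code), the Flip evaluator's logic boils down to something like this:

```python
def classify_strategy(prediction, flipped_context_label, trained_label):
    """Given a prediction on a flipped-label sequence, infer which strategy produced it.

    flipped_context_label: the label shown for the query's class in the (flipped) context.
    trained_label: the label that class carried throughout training (stored in the weights).
    """
    if prediction == flipped_context_label:
        return "ICL"   # the model trusted the context over its weights
    if prediction == trained_label:
        return "CIWL"  # the model trusted its memorized mapping
    return "other"
```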

The Phenomenon: Transience

Look at the graph in Figure 1b. This is the core mystery.

  • Early Training: The blue line (ICL) shoots up. The red line (Flip) is high, meaning the model prefers the context over its weights. Emergent ICL has arrived.
  • Late Training: The blue line crashes. The red line drops to zero. The model stops trusting the context.
  • The Replacement: As ICL dies, the green line (CIWL) rises to dominance.

The model learned to use the context, and then abandoned it. But it didn’t switch to pure memorization (IWL). It switched to CIWL. What is that?


2. The Asymptotic Strategy: What is CIWL?

We usually think of learning as a binary: either you memorize the facts (In-Weights) or you learn a procedure to look them up (In-Context). The researchers found that the model’s final, dominant strategy is a strange hybrid.

Context-Constrained In-Weights Learning (CIWL) works like this: The model memorizes the mapping between images and labels (stored in the weights). However, it only considers an answer valid if that label token is physically present somewhere in the context window.

It is as if the model says: “I know Image A corresponds to Label 5. I see Image A as the query. I will output Label 5, but ONLY if I can find the number ‘5’ somewhere in the prompt.”
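
As a behavioral rule, CIWL can be sketched in a few lines (illustrative names and a mapping keyed by an image identifier; not the model's actual computation):

```python
def ciwl_predict(query_image_id, context_labels, memorized_mapping):
    """Answer from memory, but only if that answer is visible somewhere in the context."""
    label = memorized_mapping[query_image_id]  # in-weights lookup: "I know this is Label 5"
    if label in context_labels:                # context constraint: "...and I can see a 5"
        return label
    return None                                # otherwise the strategy has no confident answer
```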

The Mechanism: Skip-Trigram-Copiers

How does the Transformer implement this? The researchers analyzed the attention heads in Layer 2 (L2) of the network.

In a Transformer, “Induction Heads” are usually responsible for ICL. They follow a pattern: [A] [B] ... [A] -> predict [B]. But for CIWL, the researchers found Skip-Trigram circuits.

Figure 2 showing attention patterns. Panel (a) is a heatmap showing Layer 2 heads attending strongly to the correct label token in the context. Panel (b) shows a correlation between attention strength and task performance.

As shown in Figure 2a, at the end of training (when CIWL dominates), the Layer 2 heads attend directly to the correct label in the context, regardless of what exemplar it is paired with.

This is a specific algorithmic implementation:

  1. The model sees the query image.
  2. It uses its weights (Key/Query matrices) to “look for” the corresponding label token in the context.
  3. Once found, it copies that token to the output.

This explains why it is “Context-Constrained.” It relies on the weights to know what to look for, but it relies on the context to find the token to copy.
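
Here is a toy numerical sketch of that lookup-and-copy step. The random matrices stand in for the learned Key/Query projections; nothing here is the trained model's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_Q = rng.normal(size=(d, d))             # stand-in for the learned query projection
W_K = rng.normal(size=(d, d))             # stand-in for the learned key projection

query_img = rng.normal(size=d)            # embedding of the query image
label_tokens = rng.normal(size=(8, d))    # embeddings of the label tokens in the context

# The head scores every label token in the context against the projected query image...
scores = (query_img @ W_Q) @ (label_tokens @ W_K).T
attn = np.exp(scores - scores.max())
attn /= attn.sum()

# ...and copies the label token it attends to most strongly.
copied_label = label_tokens[attn.argmax()]
```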


3. The Switch: How ICL Transforms into CIWL

Here is where the “Coopetition” theory gets physical. We know the model transitions from ICL to CIWL. Does it tear down the neural circuitry for ICL and build new circuitry for CIWL?

No. It recycles it.

Layer 2 Stays the Same; Layer 1 Changes

The researchers performed “causal ablation” experiments. They froze specific parts of the network at different training stages to see which layers were responsible for the change in behavior.

Figure 3 illustrating the mechanistic shift. Panel (a) shows induction strength rising and falling. Panel (b) shows that fixing Layer 2 weights at the end of training reproduces the behavior, meaning Layer 2 doesn’t change much. Panel (c) shows that Layer 1 is the driver of the change.

Look at Figure 3b and 3c:

  • Layer 2 is Stable: If you take the Layer 2 weights from the end of training (CIWL phase) and plug them into the model during the middle of training, the behavior barely changes. Layer 2 is essentially doing the same thing the whole time: Copying.
  • Layer 1 is the Switch: The transition from ICL to CIWL is driven almost entirely by changes in Layer 1.
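
A sketch of that kind of causal test, assuming checkpoints stored as flat parameter dictionaries with "layer1." / "layer2." prefixes (an assumption for illustration, not the paper's code):

```python
def clamp_layer2(mid_training_ckpt: dict, end_of_training_ckpt: dict) -> dict:
    """Overwrite Layer 2 parameters in a mid-training checkpoint with their final values."""
    patched = dict(mid_training_ckpt)
    for name, params in end_of_training_ckpt.items():
        if name.startswith("layer2."):
            patched[name] = params   # Layer 2 clamped to its end-of-training state
    return patched                   # re-evaluate this model: does behavior change?
```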

The Mechanism of the Switch

Recall that ICL usually relies on Induction Heads. An induction head is a two-step process:

  1. Layer 1: Attends to the previous token (linking the label B back to the exemplar A).
  2. Layer 2: Looks for the current query A, finds the previous A, and copies the token that follows it (B).

The researchers found that Layer 2 acts as the “Copier” in both strategies. The difference is what Layer 1 feeds it.

  • During ICL Phase: Layer 1 heads attend to the previous token. This sets up the induction circuit.
  • During CIWL Phase: Layer 1 heads switch to self-attention (attending to themselves). This breaks the induction circuit and allows Layer 2 to function as a “Skip-Trigram” lookup, relying on direct embeddings rather than relative positions.
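
A toy way to picture the Layer 1 switch is as two different attention patterns over a short sequence (purely illustrative masks, not the trained model's attention):

```python
import numpy as np

T = 6                                 # toy sequence length
prev_token_attn = np.eye(T, k=-1)     # ICL phase: token t attends to token t-1 (induction setup)
prev_token_attn[0, 0] = 1.0           # the first token has nothing earlier to attend to
self_attn = np.eye(T)                 # CIWL phase: token t attends to itself (skip-trigram setup)
```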

This is Strategy Coopetition. The two strategies compete for control of Layer 1 (does it look at the previous token or itself?), but they cooperate by sharing the heavy lifting of Layer 2 (the copying mechanism).


4. Why Does ICL Emerge at All?

If CIWL is the dominant strategy that eventually wins, why doesn’t the model just learn CIWL from the start? Why bother with the temporary ICL phase?

The researchers propose a fascinating hypothesis: ICL is a “fast” hack that paves the way for the “slow” CIWL.

Because the two strategies share Layer 2, the development of one helps the other. The model can quickly learn the ICL pattern (look at previous tokens, copy match) because it is mechanistically simpler to discover early in training than the specific weight-mappings required for CIWL.

To prove this “cooperative” aspect, the researchers ran a clever experiment. They tried to train a model on a dataset where only ICL works (random labels). Normally, this is slow and difficult to learn. However, if they initialized the model with the Layer 2 weights from a model that had already learned CIWL, ICL emerged almost instantly.

Figure 4 showing cooperative interactions. Panel (a) demonstrates that transplanting Layer 2 weights from a CIWL model allows ICL to be learned much faster on a new task. Panel (b) shows this effect works best when weights are taken from the middle of training.

Figure 4a shows this dramatic result. The green line (using transplanted CIWL weights) learns ICL almost immediately, compared to the black line (a model trained from scratch).

This suggests that the asymptotic CIWL strategy actually acts as a scaffold for ICL. The model learns the Layer 2 “copying” machinery partly for CIWL, but ICL “steals” this machinery to emerge quickly. Later, as the specific weight-mappings for CIWL solidify, the model swaps Layer 1 to optimize for the more reliable (in-weights) strategy, and ICL fades.


5. A Mathematical Model of Coopetition

To formalize this intuition, the authors proposed a minimal mathematical toy model using tensor products. The loss is, roughly, the product of the two mechanisms’ individual losses, plus a competition penalty.

\[
\mathcal{L}(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}) =
\Big( \underbrace{\left\| \mathbf{a}^{*} \otimes \mathbf{b}^{*} \otimes \mathbf{c}^{*} - \mathbf{a} \otimes \mathbf{b} \otimes \mathbf{c} \right\|_{F}^{2}}_{\text{Mechanism 1 (ICL) loss}} + \mu_{1} \Big)
\times \Big( \underbrace{\left\| \mathbf{d}^{*} \otimes \mathbf{b}^{*} \otimes \mathbf{c}^{*} - \mathbf{d} \otimes \mathbf{b} \otimes \mathbf{c} \right\|_{F}^{2}}_{\text{Mechanism 2 (CIWL) loss}} \Big)
+ \alpha \underbrace{\left\| \mathbf{a} \otimes \mathbf{d} \right\|_{F}^{2}}_{\text{Competition}}
\]

Equation 1 from the paper describing the loss function of the toy model. It includes terms for Mechanism 1 (ICL), Mechanism 2 (CIWL), and a competition penalty.

Here is how to read this equation shown in the image above:

  1. Mechanism 1 (ICL): Represented by vectors \(\mathbf{a}, \mathbf{b}, \mathbf{c}\).
  2. Mechanism 2 (CIWL): Represented by vectors \(\mathbf{d}, \mathbf{b}, \mathbf{c}\).
  3. Cooperation: Notice that vectors \(\mathbf{b}\) and \(\mathbf{c}\) appear in both mechanisms. This represents the shared Layer 2 circuitry. If you improve \(\mathbf{b}\) and \(\mathbf{c}\) for one task, you improve them for the other.
  4. Competition: The last term \(\alpha \|\mathbf{a} \otimes \mathbf{d}\|_{F}^{2}\) forces a choice. Since \(\|\mathbf{a} \otimes \mathbf{d}\|_{F}^{2} = \|\mathbf{a}\|^{2}\|\mathbf{d}\|^{2}\), the model cannot keep both vectors large: it has to shrink either strategy \(\mathbf{a}\) or strategy \(\mathbf{d}\), so it cannot fully commit to both simultaneously.
  5. Asymptotic Bias: The \(\mu_1\) term keeps the ICL factor strictly positive, so the only way to drive the overall product to zero is to perfect Mechanism 2. This makes CIWL the preferred strategy in the long run (asymptotically). A small code sketch of this objective follows the list.
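
Here is a minimal NumPy sketch of this objective; the dimensions and hyperparameter values are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
a_star, b_star, c_star, d_star = (rng.normal(size=dim) for _ in range(4))  # target vectors

def outer3(x, y, z):
    """Rank-1 three-way tensor product x ⊗ y ⊗ z."""
    return np.einsum("i,j,k->ijk", x, y, z)

def coopetition_loss(a, b, c, d, mu1=0.1, alpha=1.0):
    icl = np.sum((outer3(a_star, b_star, c_star) - outer3(a, b, c)) ** 2)   # Mechanism 1 (ICL)
    ciwl = np.sum((outer3(d_star, b_star, c_star) - outer3(d, b, c)) ** 2)  # Mechanism 2 (CIWL)
    compete = np.sum(np.outer(a, d) ** 2)                                   # competition penalty
    return (icl + mu1) * ciwl + alpha * compete
```

Minimizing this with plain gradient descent on \(\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}\) (not shown here) is essentially the simulation discussed next.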

Simulation vs. Reality

Does this abstract math hold up? Yes. When the authors simulated the toy model, it reproduced the qualitative learning dynamics seen in the real Transformer.

Figure 5 comparing the toy model to the real transformer. Panel (a) shows the simulation results with the dip in loss. Panel (b) shows the real transformer loss, matching the simulation pattern.

In Figure 5, the blue line (Mechanism 1/ICL) drops quickly—it is “fast” to learn. But eventually, the green line (Mechanism 2/CIWL) catches up. Because of the competition term and the asymptotic bias, Mechanism 2 takes over, and Mechanism 1 is suppressed.

This bare-bones model captures the complex behavior of the neural network: fast emergence due to shared resources (cooperation), followed by transience due to resource constraints (competition).


6. Can We Make ICL Stay?

Understanding the mechanism gives us power. If we know why ICL disappears (competition from the slower, but asymptotically preferred CIWL), can we change the conditions to make ICL persist?

The toy model suggests that if we remove the “bias” towards CIWL, ICL should win because it is faster to learn.

In the Omniglot task, CIWL is preferred because matching an image to a fixed label (CIWL) is slightly more robust than matching an image to a varied context exemplar (ICL). The researchers hypothesized that if they made the context matching easier—by ensuring the context exemplar is an exact pixel-match to the query image—ICL would become the superior strategy.
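
On the data side, the intervention amounts to something like this sketch (hypothetical sequence format reused from the earlier sketch; not the paper's pipeline):

```python
def match_exemplars(context, query_image, query_label):
    """Make the query's in-context exemplar an exact pixel match of the query image."""
    return [(query_image, label) if label == query_label else (image, label)
            for image, label in context]
```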

Figure 6 showing the results of the intervention. The left plot (small model) and right plot (large model) show that when context and query exemplars are matched exactly (dark red line), ICL persists and does not degrade, unlike the normal training (pink line).

The results in Figure 6 are striking.

  • Normal (Pink Line): The standard “transience” curve. ICL rises and falls.
  • Match Exemplars (Dark Red Line): ICL rises… and stays. It converges to 100% accuracy and never dips.

By tweaking the data properties to favor the mechanistic strengths of ICL, the researchers eliminated the transience. This confirms that ICL isn’t inherently temporary; it just loses the race when the data distribution favors in-weights memorization.


Conclusion

The story of “Strategy Coopetition” provides a nuanced view of how Large Language Models evolve during training. It is not a linear progression from “ignorant” to “smart.” It is a dynamic ecosystem where different algorithms—implemented by neural circuits—compete for resources while sharing components.

Here are the key takeaways:

  1. Emergent ICL is Transient: Models often learn to use context, only to discard that ability later in training in favor of hybrid memorization strategies like CIWL.
  2. CIWL is Hybrid: The replacement strategy isn’t pure memorization. It uses weights to know the mapping but requires the label to be present in the context (Context-Constrained).
  3. Circuit Recycling: The transition isn’t a teardown. The model reuses the “Copying” heads in Layer 2, simply changing the input they receive from Layer 1. This is Cooperation.
  4. The Race: ICL emerges first because it is a “fast” solution that leverages the shared circuitry. It fades because CIWL is the “slow but steady” solution that eventually dominates the shared resources. This is Competition.

This work reminds us that the behaviors we see in LLMs—like few-shot prompting—are not magic properties that are simply “on” or “off.” They are the result of complex internal dynamics that shift over time. Understanding these dynamics is the first step toward controlling them, ensuring that the useful abilities we want (like ICL) stick around for good.