Large Language Models (LLMs) have transformed the landscape of Artificial Intelligence, largely due to their ability to perform In-Context Learning (ICL). This is the capability where a model learns to solve a task simply by looking at a few examples (demonstrations) provided in the prompt, without any parameter updates.
The prevailing wisdom—and the scaling law that governs much of deep learning—suggests that “more is better.” If giving an LLM five examples helps it understand a task, giving it a hundred examples should make it an expert. However, recent empirical studies have uncovered a baffling phenomenon: as the number of demonstrations increases from “few-shot” to “many-shot,” performance often plateaus or even degrades.
In this post, we will dive deep into the research paper “Focused Large Language Models are Stable Many-Shot Learners.” We will explore why standard LLMs get “distracted” by too much information, the mathematical proof behind this attention dispersion, and the proposed solution: FocusICL, a training-free method that helps models filter out the noise and focus on what matters.
The Broken Promise of Many-Shot Learning
To understand the gravity of the problem, we first need to look at the expectation versus reality of In-Context Learning.
In a standard ICL setup, you provide the model with a sequence of paired inputs and outputs \((q_1, r_1), (q_2, r_2), \dots\) followed by a final query \(q\). The model is expected to generate the response \(r\).
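To make the setup concrete, here is a minimal sketch of how such a prompt might be assembled; the `Q:`/`A:` template and the `build_icl_prompt` helper are illustrative choices, not a format prescribed by the paper.

```python
# A minimal sketch of assembling an ICL prompt. The "Q:"/"A:" template and the
# helper name are illustrative, not a format prescribed by the paper.
def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate (q_i, r_i) demonstration pairs, then append the final query."""
    blocks = [f"Q: {q}\nA: {r}" for q, r in demos]
    blocks.append(f"Q: {query}\nA:")  # the model is expected to generate the response r
    return "\n\n".join(blocks)

demos = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8")]
print(build_icl_prompt(demos, "7 + 6 = ?"))
```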
Theoretically, previous research has equated ICL to an implicit form of fine-tuning. Since fine-tuning generally follows scaling laws (performance improves power-law style with more data), ICL should do the same. However, as context windows have expanded—allowing us to stuff hundreds or thousands of examples into a prompt—we’ve seen the opposite.
Let’s look at the data. In the figure below, the authors tested several state-of-the-art models (like Llama-3 and Vicuna) across various benchmarks.

Notice the orange lines representing standard ICL. In many cases, specifically in the CountA and ARC benchmarks, the accuracy flatlines or drops significantly as the number of demonstrations (the X-axis) increases. This is known as the inverse-scaling phenomenon. The model isn’t learning more; it’s getting confused.
The Hypothesis: Attention Dispersion
Why does this happen? The authors posit a hypothesis rooted in the very mechanism that makes Transformers work: Self-Attention.
When an LLM processes a sequence, it assigns “attention weights” to different tokens. These weights determine how much influence past information has on the current prediction. The authors argue that as you flood the context window with more demonstrations, the model’s attention gets dispersed. The massive volume of tokens in the demonstrations competes for attention, stealing focus away from the most critical part of the prompt: the current query.
The “Blank Space” Experiment
To prove that this isn’t just about the model getting bad examples, but rather a structural issue with attention, the researchers conducted a clever experiment. They took a standard prompt and simply added meaningless blank spaces to the demonstrations.

As shown above, as the number of blank spaces increased, the attention allocated to the query (the blue line) dropped, and consequently, the accuracy (the red line) plummeted. The blank spaces contained zero new information, yet they successfully distracted the model. This confirms that the sheer volume of context can actively harm the model’s ability to focus on the problem at hand.
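A rough sketch of how one could reproduce this probe with a Hugging Face causal LM is shown below; the model name, the prompt template, and the heuristic of averaging last-layer attention from the final position are our assumptions, not the paper's exact protocol.

```python
# A rough sketch of the "blank space" probe, assuming a Hugging Face causal LM.
# The model name and the attention-summing heuristic are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any open causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

def query_attention_share(demos: list[str], query: str, n_spaces: int) -> float:
    """Fraction of last-layer attention (from the final position) that lands on the query tokens."""
    padded = [d + " " * n_spaces for d in demos]       # inject meaningless blank spaces
    prompt = "\n\n".join(padded) + "\n\n" + query
    query_len = len(tok(query, add_special_tokens=False)["input_ids"])
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # attentions: tuple of (batch, heads, seq, seq); take last layer, final position, average heads
    att = out.attentions[-1][0, :, -1, :].mean(dim=0)
    return att[-query_len:].sum().item()
```

Calling `query_attention_share` with increasing `n_spaces` should show the query's share of attention shrinking even though no information was added.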
The Mathematical Root of the Problem
To understand precisely why this happens, we have to look at the equations of the attention mechanism.
Previous works that claimed ICL is equivalent to fine-tuning often approximated the standard attention mechanism using Linear Attention. In Linear Attention, the softmax operation is removed or modified. Under that approximation, adding new demonstrations adds new information linearly without “hurting” the existing information.
However, LLMs use Softmax Attention. Softmax enforces a normalization constraint—all attention weights must sum to 1. This introduces competition.
The authors derive the output of the attention head \(\hat{h}_r\) for a response token as a weighted sum of the outcome computed from the query (\(q\)) and the outcome computed from the demonstrations (\(demos\)):

\[
\hat{h}_r = \big(1 - \lambda(h_r)\big)\, f_{q}(h_r) + \lambda(h_r)\, f_{demos}(h_r)
\]

where \(f_{q}\) and \(f_{demos}\) denote the attention outcomes over the query tokens and the demonstration tokens, respectively.
The crucial part of this equation is \(\lambda(h_r)\), which acts as a weighting factor, a "gate" that determines how much the model relies on the demonstrations versus the query itself:

\[
\lambda(h_r) = \frac{\sum_{k} \exp\!\big(h_r W_q D_k^\top\big)}{\sum_{k} \exp\!\big(h_r W_q D_k^\top\big) + \sum_{k} \exp\!\big(h_r W_q Q_k^\top\big)}
\]
In this equation:
- The numerator represents the attention energy coming from the demonstrations (\(D_k\)).
- The denominator is the total attention energy (demonstrations \(D_k\) + query \(Q_k\)).
As the number of demonstrations (\(N\)) increases, the term \(\sum \exp(h_r W_q D_k^\top)\) grows larger. Consequently, \(\lambda(h_r)\) approaches 1. Looking back at the previous equation, as \(\lambda\) grows, the weight on the outcome from \(q\) decreases.
Essentially, the noise drowns out the signal. The more examples you give, the less the model looks at the actual question it needs to answer.
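As a sanity check on this argument, the toy simulation below uses random scores (not a real model), so only the trend matters: \(\lambda\) creeps toward 1 and the query's attention share collapses as the number of demonstration tokens grows.

```python
# A toy simulation of attention dispersion: one query block competing with N
# demonstration tokens under softmax normalization. Scores are random, not from
# a real model, so only the trend (lambda -> 1, query share -> 0) is meaningful.
import numpy as np

rng = np.random.default_rng(0)
query_scores = rng.normal(loc=1.0, scale=0.5, size=20)      # 20 query tokens, slightly favored

for n_demo_tokens in [50, 500, 5000]:
    demo_scores = rng.normal(loc=0.0, scale=0.5, size=n_demo_tokens)
    scores = np.concatenate([demo_scores, query_scores])
    weights = np.exp(scores) / np.exp(scores).sum()          # softmax over the whole context
    lam = weights[:n_demo_tokens].sum()                      # attention mass on demonstrations
    print(f"N={n_demo_tokens:5d}  lambda={lam:.3f}  query share={1 - lam:.3f}")
```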
This phenomenon is visualized in the graph below. The vertical axis shows the average attention weight assigned to the query tokens. As the number of demonstrations (N, on the x-axis) increases, the attention on the query steadily decays.

The Solution: FocusICL
Inspired by how humans learn, the authors propose FocusICL. When humans are presented with a massive textbook of examples, we don’t memorize every word. We:
- Filter: Ignore irrelevant words (trivialities).
- Batch: Study a few examples at a time, rather than trying to hold 500 examples in working memory simultaneously.
FocusICL implements these two strategies into the attention mechanism without requiring any model retraining.

1. Token-Level: Triviality Filtering
Not all tokens in a demonstration are useful. Stop words, formatting symbols, or generic tokens might take up attention budget without providing reasoning value.
FocusICL calculates the attention scores \(s\) normally. It then identifies tokens in the demonstrations that receive very low attention scores relative to the others. The logic is that if the model (in a standard pass) barely looks at a token, it is likely “trivial.”
The method applies a Triviality Mask to the scores:

\[
\tilde{s}_i =
\begin{cases}
-\infty, & \text{if } s_i \text{ falls in the bottom } p \text{ percent of demonstration scores} \\
s_i, & \text{otherwise}
\end{cases}
\]
- Logic: If the attention score \(s_i\) falls in the bottom \(p\) percent of demonstration scores (where \(p\) is a tunable threshold), it is masked out (set to negative infinity).
- Result: When Softmax is applied, these tokens get exactly zero attention. This frees up the “attention budget” to be redistributed to the important tokens in the demonstrations and the query.
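Here is a minimal sketch of that filtering step on pre-softmax scores; the percentile cutoff mirrors the description above, but the helper names and the tiny example are ours, not the authors' implementation.

```python
# A minimal sketch of triviality filtering on pre-softmax attention scores.
# The percentile-based cutoff follows the description above; the example data is toy.
import numpy as np

def filter_trivial(scores: np.ndarray, demo_mask: np.ndarray, p: float) -> np.ndarray:
    """Mask demonstration tokens whose score falls in the bottom p percent."""
    out = scores.copy()
    threshold = np.percentile(scores[demo_mask], p)        # bottom-p cutoff over demo tokens
    trivial = demo_mask & (scores <= threshold)
    out[trivial] = -np.inf                                  # -> exactly zero attention after softmax
    return out

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([0.1, -2.0, 0.4, -1.5, 2.0, 1.2])         # last two positions = the query
demo_mask = np.array([True, True, True, True, False, False])
print(softmax(filter_trivial(scores, demo_mask, p=50)))     # trivial demo tokens get 0 weight
```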
2. Demonstration-Level: Hierarchical Attention
Even with filtering, processing hundreds of demonstrations in a single attention pass causes the mathematical dilution we discussed earlier. To solve this, FocusICL introduces Hierarchical Attention.
Instead of attending to all \(N\) demonstrations at once, the demonstrations are split into \(T\) batches.

As illustrated above, the model processes each batch independently alongside the query.
- Intra-Batch Attention: For Batch 1, the model attends only to \(\{Demo_{Batch1}, Query\}\); for Batch 2, it attends to \(\{Demo_{Batch2}, Query\}\), and so on. This keeps the effective number of demonstrations small within each attention computation, preventing the query’s attention signal from being diluted.
- Inter-Batch Attention: The model then combines the results from all batches.
The aggregation formula is a weighted sum over the per-batch outputs:

\[
\hat{h} = \sum_{t=1}^{T} \frac{\sum_{i \in \mathcal{B}_t} e^{s_i}}{\sum_{t'=1}^{T} \sum_{j \in \mathcal{B}_{t'}} e^{s_j}} \; \hat{h}^{(t)}
\]

where \(\hat{h}^{(t)}\) is the attention output of batch \(t\) and \(\mathcal{B}_t\) is the set of demonstration tokens in that batch.
Here, the weight of each batch is determined by the total attention energy (\(\sum e^{s}\)) of that batch. If a batch contains examples that are highly relevant to the query, it will naturally have higher attention scores, and thus contribute more to the final representation.
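A toy, single-head sketch of this two-level scheme is below; the shapes, the raw-exponential energy (a real implementation would use log-sum-exp for numerical stability), and the helper names are simplifications of the mechanism described above, not the authors' code.

```python
# A toy, single-head sketch of hierarchical attention for one query vector:
# each demonstration batch is attended to together with the query block
# (intra-batch), then per-batch outputs are combined with weights proportional
# to that batch's attention energy (inter-batch).
import numpy as np

def hierarchical_attention(q, demo_batches, query_keys, query_values):
    outputs, energies = [], []
    for keys, values in demo_batches:
        k = np.vstack([keys, query_keys])            # intra-batch: this batch + the query
        v = np.vstack([values, query_values])
        scores = k @ q
        energy = np.exp(scores)                      # real code would use log-sum-exp
        weights = energy / energy.sum()
        outputs.append(weights @ v)                  # attention output for this batch
        energies.append(energy[: len(keys)].sum())   # this batch's demonstration energy
    w = np.array(energies) / np.sum(energies)        # inter-batch weights
    return sum(wi * oi for wi, oi in zip(w, outputs))

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
query_keys, query_values = rng.normal(size=(4, d)), rng.normal(size=(4, d))
batches = [(rng.normal(size=(16, d)), rng.normal(size=(16, d))) for _ in range(3)]
print(hierarchical_attention(q, batches, query_keys, query_values).shape)  # (8,)
```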
Efficiency
One might worry that processing multiple batches makes the model slower. In fact, because standard attention scales quadratically with context length (\(O(N^2)\)), splitting the demonstrations into fixed-size batches makes the cost grow only linearly with the total number of demonstrations.

By setting a fixed batch size \(B\), the complexity becomes \(O(N \cdot B)\), which is significantly more efficient than \(O(N^2)\) when \(N\) is large.
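As a rough illustration with numbers of our own choosing, take \(N = 450\) demonstrations and a batch size of \(B = 50\):

\[
\frac{N^2}{N \cdot B} = \frac{450^2}{450 \times 50} = \frac{202\,500}{22\,500} = 9
\]

so the demonstration-attention cost drops by roughly an order of magnitude, and the ratio \(N/B\) keeps growing as more demonstrations are added.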
Experiments and Results
The researchers tested FocusICL against standard ICL and other baselines (like “EarlyStop,” which simply stops adding demos when performance drops) on benchmarks including CSQA (Commonsense QA), PIQA (Physical Interaction QA), and GSM8K (Math reasoning).
Accuracy Gains
The results were consistent across different model families. Let’s look at the performance on Llama-3-8B-Instruct, a very popular open-source model.

FocusICL (the bottom row) consistently achieves the highest average accuracy. On the CountA dataset, where standard ICL often struggles with inverse scaling, FocusICL maintains high performance. Across all models tested (LongChat, Vicuna, Llama-3), FocusICL achieved an average improvement of 5.2% over vanilla ICL.
Stability and Scalability
Recall the first graph (Figure 5) showing ICL performance dropping as demonstrations increased. Let’s look at the breakdown for attention distribution with FocusICL.

The red line (FocusICL) remains almost flat as the number of demonstrations increases from 50 to 450. The blue line (Standard ICL) drops precipitously. By maintaining stable attention on the query, FocusICL turns LLMs into stable many-shot learners.
Hidden State Analysis
To visually confirm that FocusICL changes how the model represents information, the authors performed Principal Component Analysis (PCA) on the hidden states of the model’s last layer.

- Top Plot (ICL): Notice the gradient of colors from purple (few shots) to yellow (many shots). The representation drifts significantly as more examples are added. This drift suggests the model’s internal understanding of the task is changing—and often degrading—simply due to the volume of text.
- Bottom Plot (FocusICL): The points are tightly clustered regardless of the number of demonstrations. The model maintains a consistent representation of the query, proving it is robust to the distraction of added context.
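For readers who want to try this kind of check themselves, here is a minimal sketch of the PCA recipe using random vectors as stand-ins for real last-layer hidden states (collecting the real ones requires running the model with `output_hidden_states=True`); the shot counts and dimensions are illustrative.

```python
# A minimal sketch of the hidden-state drift check. Random vectors stand in for
# real last-layer hidden states; only the PCA/plotting recipe is illustrated.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

shot_counts = [50, 150, 250, 350, 450]
hidden_states = {n: np.random.default_rng(n).normal(size=(30, 4096)) for n in shot_counts}

pca = PCA(n_components=2).fit(np.vstack(list(hidden_states.values())))
for n in shot_counts:
    xy = pca.transform(hidden_states[n])
    plt.scatter(xy[:, 0], xy[:, 1], s=10, label=f"{n} shots")
plt.legend()
plt.title("Query representations (last layer) across shot counts")
plt.show()
```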
Hyperparameter Search
Implementing FocusICL requires choosing a filtering threshold (\(p\)) and a batch size (\(B\)). Since these can vary by task, the authors propose an automated search strategy based on Perplexity (PPL).
They select a subset of demonstrations and measure how well the model predicts the answers (responses) within that subset.
- They search for the threshold \(p\) that yields the lowest perplexity (best prediction).
- They search for the batch size \(B\) where perplexity begins to rise (indicating the batch is getting too big and distraction is setting in).
This search adds very little overhead (only about 25 inference runs) compared to the thousands of runs needed for a full evaluation, making FocusICL practical for deployment.
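A minimal sketch of this perplexity-driven search is below; `focus_icl_perplexity` is a hypothetical hook that runs FocusICL with a given \((p, B)\) on a small held-out subset of demonstrations and returns the perplexity of their responses, and the candidate grids are illustrative, not the paper's exact values.

```python
# A minimal sketch of the perplexity-driven hyperparameter search.
# `focus_icl_perplexity` is a hypothetical hook, not a real library call.
def search_hyperparameters(candidate_p, candidate_B, focus_icl_perplexity):
    # 1) pick the filtering threshold p that minimizes perplexity
    best_p = min(candidate_p, key=lambda p: focus_icl_perplexity(p=p, B=candidate_B[0]))

    # 2) grow the batch size B until perplexity starts to rise (distraction setting in)
    best_B = candidate_B[0]
    best_ppl = focus_icl_perplexity(p=best_p, B=best_B)
    for B in candidate_B[1:]:
        ppl = focus_icl_perplexity(p=best_p, B=B)
        if ppl > best_ppl:
            break
        best_B, best_ppl = B, ppl
    return best_p, best_B

# Toy stand-in for the hook, just to show the search converging:
toy_ppl = lambda p, B: abs(p - 30) * 0.01 + abs(B - 20) * 0.02
print(search_hyperparameters([10, 20, 30, 40], [5, 10, 20, 40, 80], toy_ppl))  # (30, 20)
```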
Conclusion and Implications
The paper “Focused Large Language Models are Stable Many-Shot Learners” highlights a critical flaw in how we currently approach long-context learning: simply stuffing more data into the context window is not a guaranteed path to success. Due to the Softmax normalization in attention mechanisms, information overload leads to attention dispersion, causing the model to lose sight of the actual question.
FocusICL offers an elegant, training-free solution. By filtering out trivial tokens and batching demonstrations, it ensures that the model can leverage the wealth of information in many-shot settings without being overwhelmed by it.
Key Takeaways:
- More \(\neq\) Better (Always): Without management, adding demonstrations can hurt performance due to attention competition.
- Softmax is the Culprit: The normalization in Softmax attention forces a trade-off between attending to examples and attending to the query.
- Filter and Batch: FocusICL’s strategy of masking trivialities and hierarchical batching restores the model’s focus.
- Efficiency: This method is computationally cheaper than standard full-attention ICL for large contexts.
As we move toward AGI and models with infinitely long context windows, techniques like FocusICL will be essential to ensure that models don’t just “read” vast amounts of data, but actually focus on what matters.