Stop Ignoring the Prompt: Boosting In-Context Learning with Contrastive Decoding
Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized the way we approach Natural Language Processing (NLP). One of their most powerful features is In-Context Learning (ICL). Instead of fine-tuning a model for hours on a specific dataset, you simply provide a few examples (demonstrations) in the prompt, and the model figures out the pattern.
It feels like magic. You give the model three examples of translating English to French, and it translates the fourth sentence perfectly.
But here is the catch: Is the model actually learning from your examples, or is it just pretending?
Recent research suggests that LLMs often ignore the specific input-label mapping provided in your examples. Instead, they rely heavily on their pre-trained “priors”—the knowledge they absorbed during their initial massive training phase. Essentially, the model glances at your prompt, recognizes the task (e.g., “Oh, this is sentiment analysis”), and then ignores your specific examples, reverting to its gut instinct.
Today, we are diving into a fascinating paper titled “Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding”. The researchers propose a clever, training-free method to force LLMs to pay attention to the specific logic in your prompt. It’s called In-Context Contrastive Decoding (ICCD).
If you are a student of NLP or machine learning, this is a perfect example of how manipulating probability distributions at inference time can solve deep-rooted behavioral issues in models without touching a single model parameter.
The Problem: Task Recognition vs. Task Learning
To understand why In-Context Learning fails, we need to distinguish between two concepts:
- Task Recognition (TR): The model realizes what task it is supposed to do (e.g., “Classify this movie review”).
- Task Learning (TL): The model learns how to do the task based on the specific mappings in the examples (e.g., “In this specific context, ’not bad’ is labeled as Positive”).
Ideally, ICL should involve both. However, studies show that LLMs are great at Task Recognition but lazy at Task Learning. They suffer from Label Bias. If a model saw the word “terrible” associated with “Negative” a million times during pre-training, it will struggle to label it “Positive” even if your few-shot examples explicitly tell it to do so.
This creates a ceiling on performance. The model isn’t adapting; it’s just remembering.
The Solution: In-Context Contrastive Decoding (ICCD)
The researchers propose a method to mathematically isolate the “Task Learning” signal. They do this by contrasting the model’s behavior on correct examples against its behavior on nonsense examples.
1. The Standard Approach
In standard ICL, we give the model a context string \(c\) (the examples) and an input query \(x\). The model generates the target \(y\) based on the probability distribution:

\[
p_\theta\big(y \mid c, \mathcal{T}(x)\big) = \prod_{t} p_\theta\big(y_t \mid c, \mathcal{T}(x), y_{<t}\big)
\]
Here, \(\mathcal{T}(x)\) is the template wrapping the input. The model tries to maximize this probability. But as we discussed, this probability is heavily polluted by the model’s pre-trained priors.
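To make this concrete, here is a minimal sketch of vanilla few-shot ICL with a Hugging Face causal LM: we wrap the demonstrations and the query in a simple template and read off the next-token logits for the candidate label words. The model name, template, and label words are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any causal LM and any verbalized labels work here.
MODEL_NAME = "meta-llama/Llama-3.2-1B"
LABEL_WORDS = {"negative": " negative", "positive": " positive"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def build_prompt(demos, query):
    """Wrap demonstrations and the query in a simple template T(x)."""
    lines = [f"Review: {x}\nSentiment:{y}" for x, y in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

@torch.no_grad()
def label_logits(prompt):
    """Return the next-token logit for each candidate label word."""
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # scores for the very next token
    return {
        name: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for name, word in LABEL_WORDS.items()
    }

demos = [("The movie was great.", " positive"),
         ("I hated it.", " negative"),
         ("A waste of two hours.", " negative")]
print(label_logits(build_prompt(demos, "Not bad at all.")))  # standard ICL: argmax wins
```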
2. Constructing the “Negative” Context
To fix this, the authors introduce a Negative Context (\(c^-\)).
The goal of the negative context is to trick the model. We want a set of examples that look like the original task but contain incorrect input-label mappings.
How do they build this?
- Original Demonstration: Input: “The movie was great.” -> Label: “Positive”
- Negative Demonstration: They keep the label “Positive” but randomly swap the input with a different sentence from the dataset, perhaps one that says “I hated it.”
- Result: The negative context \(c^-\) has the same structure and label distribution as the real context, but the logical connection between input and label is broken.
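Here is one simple way to build such a negative context in code: permute the demonstration inputs so that every label keeps its position but loses its matching sentence. The helper below is a hedged sketch; the paper's exact sampling of mismatched inputs may differ.

```python
import random

def build_negative_demos(demos, seed=0):
    """Build c^-: keep every label in place, but give it a mismatched input.

    demos: list of (input_text, label) pairs forming the positive context c.
    The label distribution stays identical; only the input-label link breaks.
    """
    rng = random.Random(seed)
    inputs = [x for x, _ in demos]
    shuffled = inputs[:]
    # Best-effort derangement: shuffle until no input keeps its original label.
    for _ in range(100):
        rng.shuffle(shuffled)
        if all(a != b for a, b in zip(shuffled, inputs)):
            break
    return [(wrong_x, y) for wrong_x, (_, y) in zip(shuffled, demos)]

demos = [("The movie was great.", " positive"),
         ("I hated it.", " negative"),
         ("A waste of two hours.", " negative")]
print(build_negative_demos(demos))
# e.g. [("I hated it.", " positive"), ("The movie was great.", " negative"), ...]
```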
3. The Contrastive Formula
Now comes the magic. We want the model to generate tokens that are likely given the correct context (\(c\)) but unlikely given the broken context (\(c^-\)).
If a token is highly probable in both contexts, it means the model is just relying on its priors (it didn’t need the correct mapping to guess the word). If a token is probable in the correct context but improbable in the negative context, that token is being driven by the specific input-label mapping we care about.
The researchers adjust the logits (the raw scores before Softmax) using this formula:

\[
\tilde{\mathbf{z}}_t = (1+\alpha)\,\mathbf{z}_t - \alpha\,\mathbf{z}_t^-
\]
Here:
- \(\mathbf{z}_t\) is the logit from the correct context.
- \(\mathbf{z}_t^-\) is the logit from the negative (broken) context.
- \(\alpha\) is a hyperparameter controlling how much we penalize the “lazy” priors.
By subtracting \(\mathbf{z}_t^-\), we are essentially saying: “Remove the part of the prediction that you could have guessed even with the wrong examples.”
This can also be viewed as a probability ratio:

\[
p_{\text{ICCD}}\big(y_t \mid c, c^-, \mathcal{T}(x), y_{<t}\big) \;\propto\; \frac{p_\theta\big(y_t \mid c, \mathcal{T}(x), y_{<t}\big)^{1+\alpha}}{p_\theta\big(y_t \mid c^-, \mathcal{T}(x), y_{<t}\big)^{\alpha}}
\]
This modification amplifies the signal of the correct mapping. Importantly, this happens entirely during inference. No weights are updated, and no training is required.
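Putting the pieces together, the sketch below scores the query under both the correct context \(c\) and the negative context \(c^-\), then combines the two logit vectors as \((1+\alpha)\mathbf{z}_t - \alpha\mathbf{z}_t^-\). It reuses the hypothetical `build_prompt`, `LABEL_WORDS`, and `build_negative_demos` helpers from the earlier sketches and is not the authors' reference implementation.

```python
import torch

@torch.no_grad()
def next_token_logits(prompt):
    """Full next-token logit vector z_t for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    return model(**inputs).logits[0, -1]

def iccd_label_scores(pos_demos, neg_demos, query, alpha=1.0):
    """Contrast the correct context c against the broken context c^- at the logit level."""
    z_pos = next_token_logits(build_prompt(pos_demos, query))  # logits given c
    z_neg = next_token_logits(build_prompt(neg_demos, query))  # logits given c^-
    z_adj = (1 + alpha) * z_pos - alpha * z_neg                # amplified mapping signal
    return {
        name: z_adj[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for name, word in LABEL_WORDS.items()
    }

neg_demos = build_negative_demos(demos)
scores = iccd_label_scores(demos, neg_demos, "Not bad at all.", alpha=1.0)
print(max(scores, key=scores.get))  # predicted label after the contrastive correction
```

Note that setting `alpha=0` falls back to standard ICL decoding, which makes it easy to A/B test the contrastive correction on your own data.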
Experimental Setup
Does this actually work? The authors tested ICCD across a wide variety of setups to ensure robustness.
- Models: They tested varying sizes of Llama-3 (up to 8B) and Qwen2 (up to 7B).
- Tasks: 7 Natural Language Understanding (NLU) tasks, including sentiment analysis (SST-2, SST-5), subjectivity analysis (Subj), and natural language inference (MNLI, QNLI).
- Baselines: They compared ICCD against regular greedy decoding and other decoding strategies.
They also investigated different ways of selecting demonstrations:
- Random: Picking random examples.
- BM25: Picking examples that share keywords with the input.
- TopK: Picking examples that are semantically similar to the input (using vector embeddings).
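For intuition, a TopK-style selector takes only a few lines with sentence embeddings. The snippet below uses `sentence-transformers` with an illustrative embedding model; the paper's actual retriever and candidate pool may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the paper's retriever may differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def topk_demos(pool, query, k=4):
    """Pick the k candidate demonstrations whose inputs are most similar to the query.

    pool: list of (input_text, label) candidate demonstrations.
    """
    pool_emb = embedder.encode([x for x, _ in pool], convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, pool_emb)[0]            # cosine similarities
    top_idx = sims.topk(k=min(k, len(pool))).indices.tolist()
    return [pool[i] for i in top_idx]
```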
Key Results
The results were surprisingly consistent across the board.
1. General Performance Gains
The most important takeaway is that ICCD improves accuracy almost everywhere. It’s not a niche trick that only works on one specific model.
Take a look at the comprehensive results in Table 2 below; the red numbers mark where ICCD outperformed standard decoding.

As you can see in Table 2, for the Qwen2-1.5B model, ICCD brought an average improvement of +2.3 points. Even on the larger Llama3.1-8B, which is already quite capable, ICCD squeezed out an extra +1.8 points on average.
Crucially, the gains are massive on harder tasks. Look at QNLI (a natural language inference task) for Llama3.1-8B—the accuracy jumped from 60.3% to 65.4%. That is a significant leap for a method that requires no training.
2. Robustness Across Selection Methods
A common critique of ICL research is that results depend heavily on which examples you pick. Maybe ICCD only works if you pick bad examples?
The authors debunked this. Whether you select examples Randomly, using BM25, or using TopK (embedding similarity), ICCD provides a boost.

Table 1 shows that while smart selection methods like TopK naturally perform better than Random, adding ICCD (the “Ours” rows) improves performance on top of them. In other words, ICCD is complementary to smarter example selection.
3. Does it work on Chat Models?
Many researchers use base models, but students and practitioners often use “Instruct” or “Chat” versions (like ChatGPT or Llama-Instruct). These models are fine-tuned to follow instructions, so one might assume they don’t need this kind of help.
However, the experiments show otherwise:

As shown in Figure 1, even alignment-tuned models (Llama-Instruct series) benefit from contrastive decoding. The red bars (Ours) consistently edge out the blue bars (Regular).
4. Handling More Complex Classes
Many academic benchmarks are binary (Positive/Negative). But real-world tasks often have many categories. The authors tested on TREC (6 classes) and DBpedia (14 classes) to see if the contrastive method holds up when the output space is larger.

Table 3 confirms that the method scales well. On DBpedia with Llama3.2-3B, the accuracy jumped by +8.3 points. This suggests that as the decision space gets more complex, forcing the model to verify the input-label mapping becomes even more critical.
Why Does It Work? A Deeper Analysis
The paper includes several ablation studies that shed light on the mechanics of ICCD.
Input vs. Label Perturbation
When creating the “Negative Context” (\(c^-\)), we have two choices:
- Change the Input: Keep the label, swap the sentence. (Used in ICCD).
- Change the Label: Keep the sentence, swap the label to an incorrect one.
Intuition might suggest changing the label is more direct. However, the authors found that changing the label alters the class distribution, which confuses the model’s priors. If you flip “Positive” to “Negative,” you might accidentally penalize the concept of “Negative” entirely.
By changing the input but keeping the labels the same, the marginal distribution of labels remains identical. The only thing that changes is the connection between text and label.

Table 4 confirms this empirically. The “+Input” row (ICCD strategy) consistently outperforms “+Label” and “+NULL” (using an empty context as the negative).
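To see how the three ablation variants differ, here is a hedged sketch that builds all of them from the same demonstration set. The function and variant names mirror the table labels but are my own illustrative code, not the paper's.

```python
import random

def negative_context_variants(demos, label_set, seed=0):
    """Three ways to build c^- for the ablation: +Input (ICCD), +Label, +NULL."""
    rng = random.Random(seed)
    shuffled = [x for x, _ in demos]
    rng.shuffle(shuffled)
    return {
        # +Input: break the mapping, keep the label distribution intact (ICCD's choice).
        "+Input": [(wrong_x, y) for wrong_x, (_, y) in zip(shuffled, demos)],
        # +Label: keep the sentence, replace its label with a randomly chosen wrong one.
        "+Label": [(x, rng.choice([l for l in label_set if l != y])) for x, y in demos],
        # +NULL: an empty negative context (no demonstrations at all).
        "+NULL": [],
    }

variants = negative_context_variants(
    [("The movie was great.", " positive"), ("I hated it.", " negative")],
    label_set=[" positive", " negative"],
)
```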
The Divergence of Distributions
To prove the model is actually distinguishing between the correct and incorrect contexts, the researchers measured the KL Divergence between the output distributions of the positive and negative contexts.

A high KL divergence (like 0.79 on MNLI) implies that the model sees a stark difference between the sensible examples and the nonsense ones. If the model were ignoring the mapping, this value would be near zero. The fact that it is high confirms that ICCD is successfully exploiting the difference between “reasoning” and “guessing.”
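You can reproduce this kind of sanity check in a few lines: compute the next-token distributions under \(c\) and \(c^-\) and measure the KL divergence between them. The sketch below reuses the hypothetical `next_token_logits` and `build_prompt` helpers from earlier and picks one direction of the KL; the paper's exact measurement protocol may differ.

```python
import torch.nn.functional as F

def context_kl(pos_demos, neg_demos, query):
    """KL(p_pos || p_neg) between next-token distributions under c and c^-."""
    log_p_pos = F.log_softmax(next_token_logits(build_prompt(pos_demos, query)), dim=-1)
    log_p_neg = F.log_softmax(next_token_logits(build_prompt(neg_demos, query)), dim=-1)
    # kl_div(input=log q, target=log p, log_target=True) sums p * (log p - log q).
    return F.kl_div(log_p_neg, log_p_pos, log_target=True, reduction="sum").item()

print(context_kl(demos, neg_demos, "Not bad at all."))
```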
The Effect of Shots and Alpha
Finally, how many examples do you need?

Figure 2 illustrates that while regular decoding (blue line) improves with more shots, ICCD (red line) maintains a superior lead throughout. Even with just a few shots, contrasting against a negative context helps the model settle on the right answer faster.
Regarding the \(\alpha\) parameter (the strength of the correction):

Table 6 suggests that \(\alpha = 1.0\) is a sweet spot. If \(\alpha\) is too low, you don’t correct enough. If it’s too high (like 2.0), you might over-penalize and distort the prediction.
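If you want to tune \(\alpha\) on your own task, a quick sweep is enough. The toy loop below reuses the hypothetical `iccd_label_scores` helper from the earlier sketch; recall that \(\alpha = 0\) falls back to standard decoding.

```python
# Toy sweep; in practice you would average accuracy over a held-out validation set.
for alpha in [0.0, 0.5, 1.0, 1.5, 2.0]:
    scores = iccd_label_scores(demos, neg_demos, "Not bad at all.", alpha=alpha)
    print(f"alpha={alpha}: predicted label -> {max(scores, key=scores.get)}")
```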
Conclusion
In-Context Learning is powerful, but it’s not perfect. LLMs are prone to “recency bias” and “label bias,” often ignoring the very examples we carefully curate for them.
The In-Context Contrastive Decoding (ICCD) method proposed in this paper offers a mathematically elegant solution. By constructing a “nonsense” version of the context and subtracting its logits from the original prediction, we can filter out the model’s prior-driven biases and isolate the true signal of the task.
For students and practitioners, the key takeaways are:
- Don’t trust the model to read your prompt perfectly. It often falls back on pre-training priors.
- Inference-time interventions are powerful. You don’t always need to re-train or fine-tune. Sometimes, clever decoding strategies can unlock performance that is already there, hiding beneath the noise.
- Input-Label Mapping is key. The “Input” perturbation strategy teaches us that maintaining the label distribution while breaking the semantic link is the best way to model the “noise” we want to remove.
As LLMs continue to grow, methods like ICCD will be essential for ensuring they are not just fluent, but also faithful to the instructions we give them.