Introduction

Large Language Models (LLMs) have revolutionized how we interact with information, but they suffer from a persistent flaw: hallucinations. When an LLM doesn’t know an answer, it often makes one up. The standard industry solution to this is Retrieval-Augmented Generation (RAG). In a RAG system, the model retrieves relevant documents from an external database and uses them as context to answer the user’s question.

However, RAG introduces a new problem. What happens when the retrieval system pulls in “noisy” documents—irrelevant text, outdated information, or conflicting facts? Standard LLMs often struggle to distinguish gold from garbage within the retrieved context. They can get distracted, leading to answers that are factually incorrect despite having access to the right information.

In this post, we dive into a research paper that tackles this exact problem not by training a better model, but by changing how the model thinks during the generation process. We will explore Dynamic Contrastive Decoding (DVD), a novel strategy that treats RAG as a multi-document competition. By dynamically analyzing the model’s confidence across different documents, DVD amplifies the “voice” of the most reliable sources while suppressing noise, all without requiring any model training.

Background: The Multi-Document Challenge

To understand DVD, we first need to look at the limitations of current decoding strategies.

The Standard RAG Approach

In a typical RAG setup, an algorithm retrieves a set of documents (\(D = \{d_1, d_2, ..., d_N\}\)) relevant to a query (\(q\)). These documents are concatenated into a single long string of text and fed into the LLM as a prompt. The LLM then generates an answer token by token.

The issue here is distraction. If \(d_1\) contains the correct answer but \(d_2\) through \(d_5\) contain irrelevant gibberish or subtle contradictions, the LLM treats them all as valid context. Previous research has shown that LLMs are easily swayed by irrelevant context, leading to lower accuracy.
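As a concrete point of reference, a plain RAG call is just prompt concatenation ahead of a single generation pass. The sketch below is a minimal illustration, assuming a generic Hugging Face causal LM ("gpt2" is only a stand-in) and an invented prompt template, not the paper's exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; any causal LM can be substituted.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def standard_rag_answer(query: str, docs: list[str], max_new_tokens: int = 64) -> str:
    # All retrieved documents are concatenated into one block:
    # the model sees d_1 ... d_N as equally valid context.
    prompt = "\n\n".join(docs) + "\n\nQuestion: " + query + "\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```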

Contrastive Decoding

The foundation of the DVD method lies in Contrastive Decoding. This technique aims to improve generation by contrasting a “strong” distribution against a “weak” one.

For example, a previous method called Context-Aware Decoding (CAD) runs the model twice:

  1. With Context: The model sees the query and the retrieved documents.
  2. Without Context: The model sees only the query (relying on its internal parametric memory).

By subtracting the logits (the raw prediction scores) of the “no-context” run from the “with-context” run, the system forces the model to rely more on the external documents and less on its pre-trained priors, which reduces hallucinations.
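In logit space, this CAD-style adjustment is essentially one line. The sketch below operates on precomputed next-token logits from the two runs; `alpha` is the contrast-strength hyperparameter (the symbol is this post's choice, not necessarily the paper's notation):

```python
import torch

def cad_next_token_logits(z_with_ctx: torch.Tensor,
                          z_no_ctx: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Contrast the with-context run against the query-only run.

    z_with_ctx: next-token logits when the prompt includes the retrieved documents.
    z_no_ctx:   next-token logits when the prompt is the query alone (parametric memory).
    """
    # Amplify whatever the documents add over the model's pre-trained prior.
    return z_with_ctx + alpha * (z_with_ctx - z_no_ctx)
```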

However, CAD treats all retrieved documents as a single block of text. It cannot distinguish between a good document and a bad document within that block. This is where DVD enters the picture.

DVD: Dynamic Contrastive Decoding

DVD moves beyond simple context-aware decoding by treating RAG as a Multi-Document Question Answering (MDQA) task. Instead of lumping all documents together, DVD analyzes them individually in real-time to decide which ones are trustworthy.

The framework operates on a “plug-and-play” basis—it requires no fine-tuning of the LLM. It intervenes during the inference stage, specifically when the model calculates the probability of the next word.

Figure 1: The framework of DvD. We propose a new decoding strategy with selection criteria and dynamic weight to incorporate knowledge from all documents and amplify knowledge from selected documents.

As shown in Figure 1, the process involves three main stages:

  1. Batch Input Construction: Feeding multiple versions of the context to the model simultaneously.
  2. Selection Criteria: Mathematically identifying which documents are helpful and which are confusing.
  3. Dynamic Logit Adjustment: Modifying the final output probabilities to amplify the “good” signal.

1. Batch Input Construction

In a standard decoding process, the model computes the probability of the next token \(y_t\) given the input \(x\) and the previously generated tokens \(y_{<t}\). The logits for the next token are:

\[ z_t = \theta(x, y_{<t}) \]

DVD does not feed the model just one input. Instead, it constructs a batch of inputs (\(B\)) containing \(N + 2\) variations, where \(N\) is the number of retrieved documents.

Let’s break down the inputs in the batch \(B\):

  1. \(x_1\) (No Document): The query alone. This represents the model’s internal knowledge (often hallucination-prone).
  2. \(x_2\) (All Documents): The query plus all retrieved documents concatenated. This provides the most context but includes noise.
  3. \(x_3\) to \(x_{N+2}\) (Single Documents): The query paired with each retrieved document individually.

The model processes this entire batch in parallel:

\[ Z = \theta(B) \]

This results in a set of logit distributions \(Z = \{z_1, z_2, ..., z_{N+2}\}\). Each \(z\) represents the model’s prediction for the next word based on a different view of the context.
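A minimal sketch of this batch construction, assuming left padding so that every row's last position is the next-token position; the prompt template is invented for illustration and is not the paper's exact format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"   # keep the next-token position aligned at index -1 for every row
model = AutoModelForCausalLM.from_pretrained("gpt2")

def build_batch(query: str, docs: list[str]) -> list[str]:
    """N + 2 views of the same query: no document, all documents, each document alone."""
    x1 = query                                  # internal knowledge only
    x2 = "\n".join(docs) + "\n" + query         # all documents concatenated
    singles = [d + "\n" + query for d in docs]  # one document at a time
    return [x1, x2] + singles

def batch_next_token_logits(query: str, docs: list[str]) -> torch.Tensor:
    inputs = tok(build_batch(query, docs), return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits[:, -1, :]  # Z: one next-token distribution per view, shape (N + 2, vocab)
```

For brevity, the sketch only covers the first token step; in a full decoding loop the generated prefix \(y_{<t}\) would be appended to every row before the next forward pass.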

2. Selection Criteria: Judging Quality with Entropy

Now that we have predictions based on every individual document, how do we know which document is “good”? The authors propose using Entropy as a proxy for quality.

In information theory, high entropy implies high uncertainty (the model is spreading its probability across many different words). Low entropy implies high confidence (the model is very sure the next word is “Paris”).

However, calculating entropy over the entire vocabulary (often 30,000+ words) is noisy because of the “long tail” of low-probability words. The authors refine this by calculating entropy only on the Top-K tokens (the most likely next words).

\[ f(z_i) = -\sum_{j=1}^{K} p(t_j) \log p(t_j), \quad t_j \in V_{topK} \]

For each logit distribution \(z_i\), a score \(s_i\) is calculated.

\[ s_i = f(z_i) \]

Using these scores, the system identifies two critical distributions from the single-document inputs:

  • \(z_l\) (Lowest Score): The distribution with the lowest entropy. This corresponds to the document the model finds most clear and unambiguous (the “Expert” document).
  • \(z_h\) (Highest Score): The distribution with the highest entropy. This corresponds to the most confusing or irrelevant document (the “Amateur” document).
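Written directly from the two formulas above, the selection step is only a few lines over the logit matrix \(Z\). In this sketch \(K = 10\) (the value the paper later identifies as a sweet spot), and \(p\) is taken from the softmax over the full vocabulary before restricting to the Top-K tokens:

```python
import torch

def topk_entropy(z: torch.Tensor, k: int = 10) -> torch.Tensor:
    """s_i = -sum_{j=1..K} p(t_j) log p(t_j) over the Top-K tokens of each row of z."""
    p = torch.softmax(z, dim=-1)
    topk_p = p.topk(k, dim=-1).values            # probabilities of the K most likely tokens
    return -(topk_p * topk_p.log()).sum(dim=-1)

def select_expert_and_amateur(Z: torch.Tensor, k: int = 10):
    """Pick z_l (lowest entropy) and z_h (highest entropy) among the single-document rows."""
    s = topk_entropy(Z, k)                       # one score per row of the batch
    single_scores = s[2:]                        # rows 0 and 1 are x_1 (no doc) and x_2 (all docs)
    z_l = Z[2 + single_scores.argmin()]          # clearest, most unambiguous document
    z_h = Z[2 + single_scores.argmax()]          # most confusing or irrelevant document
    return z_l, z_h, s
```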

3. Dynamic Logit Adjustment

This is the core contribution of DVD. The goal is to generate a final token distribution that:

  1. Relies on the full context (\(z_2\)).
  2. Subtracts the internal hallucinations (\(z_1\)).
  3. Amplifies the best document (\(z_l\)) while suppressing the worst document (\(z_h\)).

The formula for the final adjusted logit \(\hat{z}\) is:

\[ \hat{z} = z_2 + \beta (z_2 - z_1) + \gamma (z_l - z_h) \]

Here, \(\beta\) controls how much we penalize internal knowledge (similar to standard Contrastive Decoding), and \(\gamma\) controls how much we boost the best specific document over the worst one.

The final token is sampled from this adjusted distribution:

\[ y_t \sim p_\theta(y_t \mid x, y_{<t}) = \mathrm{softmax}(\hat{z}) = \mathrm{softmax}\big(z_2 + \beta (z_2 - z_1) + \gamma (z_l - z_h)\big) \]

Mathematically, this can also be viewed as multiplying the probabilities:

\[ y_t \sim p_\theta(y_t \mid x_2, y_{<t}) \left( \frac{p_\theta(y_t \mid x_2, y_{<t})}{p_\theta(y_t \mid x_1, y_{<t})} \right)^{\beta} \left( \frac{p_\theta(y_t \mid x_l, y_{<t})}{p_\theta(y_t \mid x_h, y_{<t})} \right)^{\gamma} \]
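Pulling these pieces together, here is a sketch of the adjustment and sampling step; `beta` and `gamma` are taken as given for the moment, since their dynamic computation is described next:

```python
import torch

def dvd_adjusted_logits(Z: torch.Tensor, z_l: torch.Tensor, z_h: torch.Tensor,
                        beta: float, gamma: float) -> torch.Tensor:
    """hat{z} = z_2 + beta * (z_2 - z_1) + gamma * (z_l - z_h)."""
    z_1, z_2 = Z[0], Z[1]   # no-document view and all-documents view
    return z_2 + beta * (z_2 - z_1) + gamma * (z_l - z_h)

def sample_next_token(z_hat: torch.Tensor) -> int:
    """y_t ~ softmax(hat{z}); a greedy argmax over z_hat is an equally valid choice."""
    probs = torch.softmax(z_hat, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```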

Why “Dynamic”?

In early experiments, researchers found that using fixed values for \(\beta\) and \(\gamma\) was suboptimal. Sometimes the retrieved documents are all bad, so we should trust the internal knowledge. Sometimes the “best” document is barely better than the “worst,” so amplifying the difference adds noise.

To solve this, DVD calculates the weights \(\beta\) and \(\gamma\) dynamically at every single token step based on Confidence.

The confidence \(C_i\) is defined as the gap between the probability of the #1 predicted token and the #2 predicted token. A large gap means the model is decisive.

\[ C_i = p(y_t^{1} \mid z_i) - p(y_t^{2} \mid z_i) \]

The weights are then derived from these confidence gaps:

For Internal Knowledge (\(\beta\)): We only penalize internal knowledge when the model is more confident with documents than without (\(C_2 > C_1\)), gated by an indicator on the entropy scores \(s_2\) and \(s_1\):

\[ \beta = \max(C_2 - C_1, 0) \cdot \mathbb{1}(s_2 / 10 < s_1) \]

For Document Selection (\(\gamma\)): We amplify the document signal based on how much more confident the "best" document makes the model compared to the "worst":

\[ \gamma = \max(C_l - C_h, 0) \]

This ensures the decoding strategy adapts moment-to-moment. If the documents are confusing, the model backs off. If one document clearly illuminates the answer, the model leans in.
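Here is a sketch of the dynamic weighting, following the two formulas above; the indicator condition on \(s_2\) and \(s_1\) is copied as written, and the entropy scores `s` are the ones returned by the selection sketch earlier:

```python
import torch

def top2_gap(z: torch.Tensor) -> float:
    """Confidence C_i: gap between the #1 and #2 next-token probabilities."""
    p = torch.softmax(z, dim=-1)
    top2 = p.topk(2).values
    return (top2[0] - top2[1]).item()

def dynamic_weights(Z: torch.Tensor, z_l: torch.Tensor, z_h: torch.Tensor,
                    s: torch.Tensor) -> tuple[float, float]:
    """Recompute beta and gamma from confidence gaps at every token step."""
    C_1, C_2 = top2_gap(Z[0]), top2_gap(Z[1])    # without vs. with all documents
    C_l, C_h = top2_gap(z_l), top2_gap(z_h)      # best vs. worst single document
    # Penalize internal knowledge only when documents make the model markedly more confident.
    beta = max(C_2 - C_1, 0.0) * float(s[1] / 10 < s[0])
    gamma = max(C_l - C_h, 0.0)
    return beta, gamma
```

In a full decoding loop, these weights would feed straight into `dvd_adjusted_logits` above before the sampled token is appended to every prompt in the batch.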

Experiments and Results

The researchers evaluated DVD in a zero-shot setting across several benchmarks, including ALCE-ASQA, Natural Questions (NQ), TriviaQA (TQA), and PopQA. They compared it against standard decoding (“Regular”) and Context-Aware Decoding (CAD).

Main Performance

The results, measured in String Exact Match (Str-em), show that DVD consistently outperforms other methods.

Table 1: Str-em results under the zero-shot setting. Regular-closed, -full, and -single correspond to Regular Decoding without documents, with all documents concatenated, and with a single document, respectively. DvD-fixed means fixed \(\beta\) and \(\gamma\), while DvD-dynamic refers to dynamic \(\beta\) and \(\gamma\).

In Table 1, notice the progression:

  • Regular-closed (No docs): Performs poorly (e.g., 9.28 on ASQA with LLaMA2-7B).
  • Regular-full (Standard RAG): Significant improvement (12.41).
  • CAD: Further improvement (14.73).
  • DVD-dynamic: The highest performance (15.85).

This trend holds across different model sizes (Mistral-7B, LLaMA2-13B, Vicuna-13B) and datasets. The ability to dynamically up-weight the best document (\(z_l\)) and down-weight the worst (\(z_h\)) provides a clear advantage over simply treating all documents as equal.

Does the “Top-K” Selection Matter?

The authors utilize the entropy of the Top-K tokens to select the best/worst documents. Does the value of \(K\) matter?

Figure 2: Str-em performance with different \(K\). \(K\) is the number of tokens.

Figure 2 illustrates that \(K=10\) is roughly the sweet spot.

  • If \(K\) is too small (e.g., 1 or 2), the entropy calculation is too sparse to be reliable.
  • If \(K\) is too large (approaching “All”), the metric gets diluted by the thousands of irrelevant, low-probability tokens in the vocabulary tail.

Selection Criteria Analysis

Is DVD’s complex entropy-based selection actually better than just picking a document at random or trusting the retriever’s ranking?

Table 2: Str-em results on ALCE-ASQA with LLaMA2-13B under the zero-shot setting for different selection criteria. Selection criteria refer to different methods to choose \(z_l\) and \(z_h\). Fixed weight and dynamic weight refer to fixed or dynamic \(\beta\) and \(\gamma\). See details in Section 3.2.

Table 2 confirms the hypothesis. Selecting documents randomly performs worse than standard retrieval. Interestingly, selecting based on the Retrieval ranking (assuming the #1 retrieved doc is \(z_l\)) works well, but DVD’s entropy method (Our DVD) still comes out on top. This implies that the document the Retriever thinks is best isn’t always the one the LLM finds most useful for generation.

Robustness to Document Count

Finally, does the method break if we retrieve more documents?

Table 3: Str-em results on ALCE-ASQA with LLaMA2-13B under the zero-shot setting for different calculations of the weight \(\gamma\). The fixed-weight approach doesn't require confidence; dynamic-weight approaches have many variants based on the calculation of confidence and weight.

Figure 3: Str-em performance with different \(N\). \(N\) is the number of documents.

Figure 3 shows the performance as \(N\) (the number of documents) increases. Standard decoding (Regular-full) tends to degrade slightly as more documents (and thus more noise) are added. However, DVD (the pink line) maintains a performance gap above the baselines, demonstrating its ability to filter noise effectively even as the context window gets crowded.

Conclusion and Implications

The DVD paper presents a compelling argument: RAG performance isn’t just about better retrieval; it’s about smarter decoding.

By treating the generation phase as a dynamic competition between different context sources, DVD allows LLMs to “listen” to the clearest signal in the batch. The key takeaways are:

  1. Granularity Matters: Treating retrieved documents as a single block of text (Standard RAG) masks the varying quality of individual documents. DVD exposes this by processing them individually.
  2. Uncertainty is a Signal: Using entropy on the Top-K tokens is a reliable way for the model to self-assess which documents make sense and which are confusing.
  3. Dynamic is Better than Static: Adjusting the influence of external documents token-by-token based on model confidence yields better results than fixed hyperparameters.

While DVD requires higher computational resources during inference (since it runs \(N+2\) forward passes in a batch), it offers a powerful, training-free method to significantly boost the accuracy and reliability of RAG systems. As LLMs continue to be deployed in knowledge-intensive fields, decoding strategies like DVD will be essential for ensuring that models don’t just read documents, but actually understand which ones to trust.