Introduction
We are currently witnessing a massive deployment of Large Language Models (LLMs) across every conceivable domain—from writing code and summarizing emails to diagnosing medical conditions and analyzing financial data. However, for all their fluency, LLMs have a well-documented reliability problem: they hallucinate. They can confidently state falsehoods with the same authority as they state facts.
This creates a critical need for uncertainty quantification. Before we act on an LLM’s advice, we need to know: How confident is the model in its own answer?
Traditionally, researchers have relied on a metric called “Sequence Likelihood”—essentially the probability the model assigns to the words it generated. While this seems intuitive, it has a major flaw: it treats every word as equally important. A model might be unsure about a trivial preposition but confident about the main entity, or vice versa. Standard metrics conflate syntax (grammar) with semantics (meaning).
In this post, we will dive into a research paper that proposes a clever solution: Contextualized Sequence Likelihood (CSL). The authors, Zhen Lin, Shubhendu Trivedi, and Jimeng Sun, introduce a method that uses the LLM’s own attention mechanism to figure out which words actually matter. By weighting confidence scores based on attention, they achieve a significantly more reliable measure of trustworthiness.
Background: The Problem with Standard Confidence
To understand why CSL is necessary, we first need to look at how we currently measure confidence in Natural Language Generation (NLG).
How LLMs Predict
LLMs are autoregressive. They predict the next token (part of a word) based on all previous tokens. For every step in a sentence, the model outputs a probability distribution over its vocabulary. When the model selects a token, that token has an associated probability (computed from the model’s raw logits via a softmax).
The Standard Metric: Sequence Likelihood
The most common way to estimate confidence is to look at the likelihood of the entire generated sequence. If the model assigned high probabilities to all the tokens it chose, we assume it is confident. Mathematically, the unnormalized confidence score (\(C_{SL}\)) is the sum of the log-probabilities (\(l_i\)) of the tokens:
\[
C_{SL}(x, s) = \sum_{i=1}^{n} l_i, \qquad l_i = \log \hat{p}(s_i \mid s_{<i}, x)
\]

Here, \(\hat{p}(s_i \mid s_{<i}, x)\) is the probability the model assigned to the \(i\)-th generated token \(s_i\), given the question \(x\) and all previously generated tokens \(s_{<i}\).
Because longer sentences naturally have lower total probabilities (since you are multiplying many numbers less than 1), it is also common to normalize this score by the length of the sentence (\(n\)):
\[
C_{SL(norm)}(x, s) = \frac{1}{n} \sum_{i=1}^{n} l_i
\]
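To make the arithmetic concrete, here is a minimal sketch of both scores computed from per-token log-probabilities (the numbers are made up for illustration; in practice they come from the model’s output distribution at each step):

```python
import math

# Hypothetical per-token log-probabilities l_i for a generated answer
# (illustrative values, not taken from the paper).
log_probs = [-0.05, -0.10, -0.02, -0.08, -2.30]

c_sl = sum(log_probs)                # unnormalized: sum of log-probabilities
c_sl_norm = c_sl / len(log_probs)    # length-normalized: average log-probability

print(f"C_SL       = {c_sl:.3f}")
print(f"C_SL(norm) = {c_sl_norm:.3f}")
print(f"sequence probability = {math.exp(c_sl):.4f}")  # product of token probs
```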
The Flaw: Syntax vs. Semantics
These scores (\(C_{SL}\) and its normalized variant) look statistically sound, but they fail to capture context.
Consider a Question-Answering (QA) scenario:

Question: “When did Neil Armstrong land on the Moon?”
Answer: “Neil Armstrong landed on the Moon on July 20, 1969.”
In this sentence, the specific date “July 20, 1969” is the core semantic payload. The words “Neil Armstrong landed on the” are largely syntactic repetition from the question or general grammatical structure.
If the model is very confident about the grammar but guesses the date, the overall \(C_{SL}\) might still be high because the confident grammatical tokens drown out the low probability of the wrong date. Conversely, the model might phrase a correct answer awkwardly, resulting in a lower probability score despite the fact being correct.
This is the core problem: Sequence Likelihood conflates syntactic confidence with semantic confidence. It assumes every token contributes equally to the “correctness” of the answer.
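A tiny worked example (with made-up numbers) shows the failure mode: one very uncertain date token barely moves the average when it is surrounded by near-certain grammatical tokens.

```python
import math

# Nine "grammatical" tokens the model is nearly certain about
# ("Neil Armstrong landed on the ..."), plus one guessed date token.
# Illustrative probabilities, not measured values.
grammar = [math.log(0.98)] * 9
guessed_date = math.log(0.20)

avg_log_prob = (sum(grammar) + guessed_date) / 10
print(f"{math.exp(avg_log_prob):.2f}")  # ~0.84: the score still looks confident,
                                        # though the only token that mattered was a guess.
```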
The Core Method: Contextualized Sequence Likelihood (CSL)
The researchers propose that we should not treat all tokens equally. Instead, we should assign weights to tokens based on their relevance to the question. If the model is answering “When,” the tokens representing the time should carry more weight in the confidence score.
But how do we know which tokens are relevant without human annotation? The answer lies inside the LLM itself: Attention.
1. Eliciting Attention
Attention mechanisms in Transformers allow the model to “look” at different parts of the input sequence when processing data. The authors devised a method to prompt the LLM to verify its own answer, and then observe where the model “looks” during that verification.
The process works like this:
- Generate: The LLM generates an answer (\(s\)) to a question (\(x\)).
- Prompt: The researchers feed a new prompt back into the model that includes the question and the generated answer. The prompt asks the model to decide if the answer is correct.
- Observe: They do not actually care about the model’s “Yes” or “No” text output. Instead, they extract the attention weights from the model as it processes the answer tokens.
When the model is prompted to verify an answer in this way, its internal attention naturally concentrates on the semantically important words (entities, dates, locations) rather than on the filler words.
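Here is a minimal sketch of this elicitation step using the Hugging Face transformers API. The model name, the exact verification prompt, and the token bookkeeping are all illustrative assumptions, not the authors’ exact implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any open-weights causal LM works for the sketch; "eager" attention ensures
# per-head attention weights are actually returned.
model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

question = "When did Neil Armstrong land on the Moon?"
answer = "Neil Armstrong landed on the Moon on July 20, 1969."

# Self-verification prompt (the wording here is illustrative).
prefix = f"Question: {question}\nProposed answer: "
suffix = "\nIs the proposed answer correct? Answer yes or no:"

# Tokenize the pieces separately so we know exactly where the answer tokens sit.
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
suffix_ids = tokenizer(suffix, add_special_tokens=False, return_tensors="pt").input_ids
input_ids = torch.cat([prefix_ids, answer_ids, suffix_ids], dim=1)
ans_start = prefix_ids.shape[1]
ans_end = ans_start + answer_ids.shape[1]

with torch.no_grad():
    out = model(input_ids=input_ids, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, num_heads, seq_len, seq_len).
# Read how much the final position (where "yes"/"no" would be generated)
# attends to each answer token, for every head in every layer.
per_head = torch.stack(
    [layer[0, :, -1, ans_start:ans_end] for layer in out.attentions]
)  # (num_layers, num_heads, answer_len)

# Averaging over all heads for brevity; the paper instead selects a small set
# of informative heads (see "Selecting the Right Heads" below).
a = per_head.mean(dim=(0, 1))  # attention mass a_i on each answer token
w = a / a.sum()                # weights w_i that sum to 1
```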
2. The Weighting Equation
Using these extracted attention values, the authors define the Contextualized Sequence Likelihood (CSL). Instead of a simple average, it is a weighted sum:
\[
C_{CSL}(x, s) = \sum_{i=1}^{n} w_i\, l_i, \qquad w_i = \frac{a_i}{\sum_{j=1}^{n} a_j}
\]
In this equation:
- \(l_i\) is the log-probability of the \(i\)-th token (from the original generation).
- \(w_i\) is the weight derived from attention (\(a_i\)).
- The weights are normalized so they sum to 1.
This ensures that if the model paid five times more attention to the word “1969” than to the word “the,” the log-probability of “1969” has five times more impact on the final score.
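Combining the attention-derived weights with the per-token log-probabilities from the original generation is then a few lines (the numbers below are illustrative; in the sketch above, `w` would play the role of the weights):

```python
import torch

# l_i: log-probabilities of the answer tokens from the original generation pass.
log_probs = torch.tensor([-0.05, -0.10, -0.02, -0.08, -2.30])  # illustrative

# a_i: attention mass on each answer token from the verification pass.
attn = torch.tensor([0.02, 0.03, 0.02, 0.03, 0.90])            # illustrative

w = attn / attn.sum()            # normalize so the weights sum to 1
c_sl_norm = log_probs.mean()     # uniform weights: standard normalized SL
c_csl = (w * log_probs).sum()    # attention-weighted: CSL

print(c_sl_norm.item(), c_csl.item())
# The uncertain final token dominates C_CSL but barely affects C_SL(norm).
```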
3. Visualizing the Impact
To prove that this prompting strategy actually shifts focus to relevant tokens, the authors visualized the attention changes.
In the example below, the same answer (“On July 20, 1969, Buzz Aldrin and Neil Armstrong landed on the Moon”) is evaluated against three different questions: “When?”, “Who?”, and “What?”.

As shown in Figure 2:
- When asking “When” (Red), the attention spikes on the date “July 20, 1969.”
- When asking “Who” (Blue), the attention spikes on “Buzz Aldrin.”
- When asking “What” (Yellow), the attention spreads over the action “landed on the Moon.”
This confirms that the method is indeed contextualized. The confidence score adapts dynamically based on what the user actually asked.
4. Selecting the Right “Heads”
Modern LLMs are massive. LLaMA-2-13B, for instance, has 40 layers and 40 heads per layer, totaling 1,600 attention heads. Not all of these heads are useful; many focus on grammar, previous token lookups, or other mechanical tasks.
A naive approach would be to average the attention across all heads. However, this dilutes the signal with noise. The authors needed a way to pick the “smart” heads that focus on semantic correctness.
The Selection Strategy: They use a small validation set of questions. For each attention head, they calculate how well that specific head’s attention weights predict the correctness of the answer (measured by AUROC, which we will discuss later).
They found that “good heads” are consistent. A head that is good at identifying important tokens in the validation set is also good at it in the test set.

Figure 3 plots each head’s performance on the validation set against its performance on the test set. The strong diagonal trend indicates high consistency. Based on this, the authors select the top \(k\) heads (typically around 10) on the validation set and average only their attention weights.
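A sketch of that selection loop, assuming we have already computed, for each head, a CSL-style score per validation example plus a 0/1 correctness label (the array shapes and helper name are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_top_heads(scores_per_head: np.ndarray, correct: np.ndarray, k: int = 10):
    """Pick the k heads whose CSL scores best separate correct from incorrect answers.

    scores_per_head: (num_heads, num_val_examples) CSL computed with each head's weights.
    correct:         (num_val_examples,) 1 if the answer was judged correct, else 0.
    """
    aurocs = np.array([
        roc_auc_score(correct, scores_per_head[h])
        for h in range(scores_per_head.shape[0])
    ])
    top = np.argsort(aurocs)[::-1][:k]  # indices of the k best-performing heads
    return top, aurocs[top]

# At test time, CSL averages the attention of only these selected heads,
# rather than all ~1,600 heads.
```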
Experiments and Results
The authors tested CSL across three popular QA datasets: TriviaQA, CoQA (Conversational QA), and Natural Questions. They utilized three different open-source LLM families: LLaMA-2, Mistral, and Gemma.
Evaluation Metric: AUROC
To measure the quality of a confidence score, we treat it as a binary prediction problem: Can the score distinguish between a correct answer and an incorrect answer?
The metric used is AUROC (Area Under the Receiver Operating Characteristic curve).
- 0.5 means the confidence score is random guessing.
- 1.0 means perfect separation (every correct answer receives a higher confidence score than every incorrect one).
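A minimal example of this evaluation, assuming we have a 0/1 correctness label and a confidence score for each answer (the values are illustrative):

```python
from sklearn.metrics import roc_auc_score

correct = [1, 0, 1, 1, 0, 1]                       # was each answer right?
confidence = [0.90, 0.40, 0.80, 0.70, 0.75, 0.95]  # the score being evaluated

print(roc_auc_score(correct, confidence))  # 0.875 here; 1.0 = perfect, 0.5 = random
```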
Key Performance Results
The results, presented in Table 2, show a clear hierarchy of performance.

Takeaways from Table 2:
- CSL Wins: CSL (the proposed method) consistently achieves the highest AUROC scores across almost all datasets and models (highlighted in bold).
- Beating the Baseline: It significantly outperforms standard Sequence Likelihood (\(SL\)) and its normalized version (\(SL(norm)\)). For example, on the Mistral model with the Natural Questions dataset, CSL scores 76.65 compared to SL’s 69.22. That is a massive jump in reliability.
- Outperforming Competitors: CSL also beats “TokenSAR,” a recent method that tries to estimate token importance by deleting words and seeing how much the meaning changes (a computationally expensive process).
Improving Uncertainty Measures
The authors also applied CSL to “Semantic Entropy” (SE), a state-of-the-art method for estimating uncertainty by clustering semantically similar answers. Since SE relies on sequence likelihoods internally, replacing the standard likelihood with CSL (\(SE+CSL\)) should arguably improve it.

As Table 4 demonstrates, integrating CSL improves Semantic Entropy across the board. This suggests that CSL is a fundamental improvement that can be plugged into various other uncertainty quantification frameworks.
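A rough sketch of the plug-in idea, assuming the sampled answers have already been grouped into semantic clusters and each answer has a log-likelihood score; swapping the standard \(C_{SL}\) for \(C_{CSL}\) is the only change (the clustering step and the exact SE estimator are simplified here):

```python
import numpy as np

def semantic_entropy(cluster_ids, log_scores):
    """Entropy over semantic clusters of sampled answers.

    cluster_ids: cluster index per sampled answer (same cluster = same meaning).
    log_scores:  per-answer log-likelihood; pass C_SL for vanilla SE,
                 or C_CSL for the SE+CSL variant.
    """
    probs = np.exp(np.asarray(log_scores, dtype=float))
    mass = np.zeros(max(cluster_ids) + 1)
    for c, p in zip(cluster_ids, probs):
        mass[c] += p                       # total probability mass per meaning
    mass = mass / mass.sum()               # normalize over clusters
    return float(-(mass * np.log(mass)).sum())  # higher = more uncertain

# Five sampled answers falling into two meaning-clusters (illustrative numbers).
print(semantic_entropy([0, 0, 1, 0, 1], [-0.5, -0.7, -2.0, -0.6, -2.2]))
```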
Qualitative Analysis: Does it find the right words?
Beyond the numbers, does the model actually highlight words that humans would consider important?
The authors provide examples of which tokens received increased weight (\(w_i\)) under CSL compared to a uniform average.

In Figure 5, the highlighted text indicates increased attention.
- For “How early…”, the concept “hour” is highlighted.
- For “Who is Susan Boyle?”, the description “Scottish singer” and the show “Britain’s Got Talent” are highlighted.
While not perfectly interpretable in every instance (neural networks rarely are), the weighting generally shifts toward the entities and descriptors that define the correctness of the answer.
Ablation Study: How many heads do we need?
Recall that the authors select the top \(k\) attention heads. How sensitive is the method to this number \(k\)?

Figure 6 plots the gain in AUROC over the baseline as \(k\) increases (on a log scale).
- Performance peaks around \(k=10\) to \(k=20\).
- Using just 1 best head is better than the baseline, but noisy.
- Using all heads (the far right of the graph) degrades performance significantly, sometimes becoming worse than the baseline.
This confirms the hypothesis that most attention heads are doing “maintenance work” (syntax) and only a few are doing “verification work” (semantics). Filtering for the right heads is crucial.
Conclusion and Implications
The paper “Contextualized Sequence Likelihood” presents a compelling argument: we have been measuring LLM confidence wrong by treating all words as equal.
By leveraging the model’s internal attention mechanism—specifically through a self-verification prompt—CSL allows us to weigh the confidence of specific tokens based on their semantic relevance to the question.
Key Takeaways
- Context Matters: A confidence score must reflect the specific question being asked.
- Efficiency: Unlike methods that require generating dozens of samples or running external NLI models (which are slow and expensive), CSL needs only a single extra forward pass for the self-verification prompt. It is fast and scalable.
- Reliability: It consistently provides a better signal for correctness than standard probability metrics.
Broader Implications
For students and practitioners, this paper highlights a growing trend in AI research: Introspection. Rather than treating the model as a black box that spits out text, we are learning to look inside the box—at logits, attention weights, and activations—to understand why the model wrote what it wrote.
As we move toward deploying agents that take actions in the real world, metrics like CSL will be the safety valves that allow an AI to say, “I wrote this, but looking at my attention patterns, I’m actually not sure about the specific date I mentioned. You should double-check.” That distinction is the key to safe and reliable AI.