Large Language Models (LLMs) like GPT-4 and Gemini have transformed how we interact with information. We ask them to write emails, summarize complex topics, and even generate biographies of historical figures. But there is a well-known catch: hallucinations. An LLM can speak with absolute confidence while fabricating facts entirely.
For simple “Yes/No” questions or multiple-choice classifications, determining whether a model is uncertain is relatively straightforward: we can inspect the probabilities of the output tokens (or the logits behind them). But how do we measure confidence when the model generates a 300-word biography? If the model writes three paragraphs about a disease, how do we know which sentences are factual and which are creative fiction?
In this post, we dive into a recent research paper titled “LUQ: Long-text Uncertainty Quantification for LLMs.” This paper tackles the difficult challenge of detecting uncertainty in long-form generation—a critical step toward building more reliable and factual AI systems.
The Problem: Uncertainty in the “Black Box” Era
To understand why LUQ (Long-text Uncertainty Quantification) is necessary, we first need to look at why existing methods fail for modern use cases.
1. The “Length” Barrier
Most existing research on uncertainty quantification (UQ) focuses on short text. If a model generates a single word or a short phrase, we can easily compare it to other potential outputs. However, real-world applications often require long responses. When a model generates a long sequence, the number of possible variations explodes. You cannot simply check if two 200-word essays are “identical” because they never will be.
2. The “Closed Source” Barrier
Many traditional UQ methods require “White-Box” access—meaning they need to see the model’s internal probability distributions (logits). However, top-tier models like GPT-4 or Claude are often “Black-Box” models accessed via API. We get the text, but not the internal math.
The researchers behind LUQ set out to answer a crucial question: Can we predict if a long-form response is factual without knowing the ground truth, simply by analyzing the model’s behavior?
The Core Intuition: Consistency is Key
The fundamental hypothesis of this paper is simple yet powerful: Uncertainty manifests as inconsistency.
Imagine asking a historian to write a biography of Julius Caesar. If you ask them three times, the wording might change, but the core facts (dates, battles, titles) will remain the same. Now, imagine asking someone to write a biography of a fake, made-up king. If they are forced to answer, they might make up different details every time—one version says he reigned for 10 years, another says 5 years.
High Consistency = High Confidence (Low Uncertainty)
Low Consistency = Low Confidence (High Uncertainty)
This is the principle of sampling-based uncertainty. By prompting the LLM multiple times with the same question and comparing the outputs, we can gauge how “sure” the model is.
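To make this concrete, here is a minimal sketch of the sampling step, assuming access to an OpenAI-style chat API (the model name and prompt are placeholders; the same idea applies to any black-box LLM client):

```python
# Sampling-based uncertainty, step 1: ask the same question n times at a
# non-zero temperature and keep all responses for later comparison.
# Assumes the `openai` Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def sample_responses(question: str, n: int = 5, model: str = "gpt-4o-mini") -> list[str]:
    """Generate n independent long-form answers to the same question."""
    completion = client.chat.completions.create(
        model=model,                                      # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=1.0,  # temperature > 0 so the samples can actually differ
        n=n,              # request n completions in a single call
    )
    return [choice.message.content for choice in completion.choices]

samples = sample_responses("Tell me a bio of Ramesses IV.")
```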
Introducing LUQ: How It Works
LUQ is a novel framework designed specifically to handle the complexity of long text. Instead of looking at word overlap (which can be misleading), it looks at semantic consistency (meaning).

As shown in Figure 1, the process works in three main steps:
- Sampling: Given a query (e.g., “Tell me a bio of Ramesses IV”), the system asks the LLM to generate multiple responses (\(n\) samples).
- Sentence-Level Analysis: Long text is too messy to compare as a whole. LUQ breaks the primary response down into individual sentences.
- Entailment Checking: This is the “magic” step. The system uses a Natural Language Inference (NLI) model—specifically a DeBERTa model fine-tuned for this task—to check if the sentences in the primary response are supported by the other generated samples.
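This entailment check can be approximated with an off-the-shelf NLI model. The sketch below is not the authors' code; it uses the public `microsoft/deberta-large-mnli` checkpoint as a stand-in for the fine-tuned DeBERTa model the paper describes:

```python
# Entailment check: given a premise (a full sampled response) and a hypothesis
# (one sentence of the primary response), return P(entailment).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_CHECKPOINT = "microsoft/deberta-large-mnli"  # assumption: any NLI-tuned DeBERTa works here
tokenizer = AutoTokenizer.from_pretrained(NLI_CHECKPOINT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CHECKPOINT).eval()

# Look up the entailment class index instead of hardcoding the label order.
ENTAIL_ID = next(i for label, i in nli_model.config.label2id.items()
                 if "entail" in label.lower())

@torch.no_grad()
def p_entail(premise: str, hypothesis: str) -> float:
    """Probability that the premise logically supports the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    probs = torch.softmax(nli_model(**inputs).logits, dim=-1).squeeze(0)
    return probs[ENTAIL_ID].item()
```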
The Mathematics of Entailment
Traditional methods might check if the word “Pharaoh” appears in both texts. LUQ checks if the statement is supported.
For a specific sentence \(s_j\) in a response, the model calculates the probability that this sentence is “entailed” (logically supported) by another sample response \(r'\).
\[
S(r_i, r') = \frac{1}{m} \sum_{j=1}^{m} P(\text{entail} \mid s_j, r'),
\]

where \(m\) is the number of sentences in the response \(r_i\).
In the equation above:
- \(P(\text{entail} | s_j, r')\) is the probability that the sampled response \(r'\) entails (supports) the sentence \(s_j\).
- \(S(r_i, r')\) averages these probabilities across all sentences in the response.
If a sentence is “Ramesses IV reigned for six years,” and other samples say “He ruled from 1151 to 1145 BC,” the NLI model recognizes this as support (entailment), even if the words are different. If another sample says “He ruled for 20 years,” that is a contradiction, resulting in a low score.
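A hedged sketch of this sentence-level scoring, reusing the `p_entail` helper from the snippet above (the naive sentence splitter is purely illustrative; a real pipeline would use NLTK or spaCy):

```python
# S(r_i, r'): average entailment probability of each sentence of the primary
# response r_i, using another sampled response r' as the premise.
import re

def split_sentences(text: str) -> list[str]:
    """Very naive sentence splitter, good enough for a sketch."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def response_support(r_i: str, r_prime: str) -> float:
    """S(r_i, r') = mean over sentences s_j of r_i of P(entail | s_j, r')."""
    sentences = split_sentences(r_i)
    if not sentences:
        return 0.0
    return sum(p_entail(r_prime, s_j) for s_j in sentences) / len(sentences)
```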
Calculating the Final Uncertainty Score
Once the system has compared a response against all other samples, it calculates a confidence score \(C\). The final Uncertainty Score \(U(x)\) is essentially the inverse of confidence—the higher the confidence, the lower the uncertainty.
\[
C(r_i) = \frac{1}{n-1} \sum_{r' \neq r_i} S(r_i, r'), \qquad U(x) = 1 - C(r_i)
\]
This method allows LUQ to assign a single scalar value (0 to 1) representing how uncertain the model is about its long-form output.
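Assuming the formulation above, the whole procedure fits in a few lines (reusing `response_support` and the `samples` list from the earlier sketches):

```python
# Final LUQ score for one primary response among n samples:
# confidence = average support from the other n-1 samples; uncertainty = 1 - confidence.
def luq_uncertainty(primary: str, samples: list[str]) -> float:
    others = [r for r in samples if r is not primary]
    confidence = sum(response_support(primary, r) for r in others) / len(others)
    return 1.0 - confidence  # in [0, 1]; higher means more uncertain

u_score = luq_uncertainty(samples[0], samples)  # score the first sample as the "primary" answer
```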
Variations: ATOMIC and PAIR
The authors also introduced two variations to refine this process:
- LUQ-ATOMIC: Instead of splitting by sentences, it uses ChatGPT to break text into “atomic facts” (indivisible pieces of information) for even finer granularity.
- LUQ-PAIR: Instead of comparing a sentence to a whole paragraph, it compares a sentence to the best-matching sentence in the other sample.

While these variations offered slight performance boosts, the standard LUQ method proved to be a robust and efficient balance between accuracy and computational cost.
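To make the difference concrete, here is a hedged sketch of LUQ-PAIR scoring, reusing `split_sentences` and `p_entail` from above. Each sentence of the primary response is credited with the best-supporting single sentence from the other sample, rather than with the other sample as a whole:

```python
# LUQ-PAIR (sketch): sentence-to-best-sentence matching instead of
# sentence-to-whole-response entailment.
def response_support_pair(r_i: str, r_prime: str) -> float:
    sentences_i = split_sentences(r_i)
    sentences_prime = split_sentences(r_prime)
    if not sentences_i or not sentences_prime:
        return 0.0
    per_sentence = [
        max(p_entail(s_prime, s_j) for s_prime in sentences_prime)  # best match in r'
        for s_j in sentences_i
    ]
    return sum(per_sentence) / len(per_sentence)
```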
Experimental Results: Does Uncertainty Correlate with Truth?
To test their method, the researchers used FactScore, a benchmark that evaluates the factuality of generated biographies. They also created a new dataset, FactScore-DIS (focusing on diseases), to test domain generalization.
The goal was to see if LUQ scores correlated with actual factuality. Ideally, we want a strong negative correlation: as Uncertainty goes UP, Factuality should go DOWN.
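In code, this evaluation is simply a correlation between two lists of numbers. Here is a minimal sketch with SciPy (the scores below are illustrative, with factuality coming from an external evaluator such as FactScore):

```python
# Correlate LUQ uncertainty with externally measured factuality.
# A strongly negative Pearson coefficient means the uncertainty score is informative.
from scipy.stats import pearsonr

uncertainty_scores = [0.12, 0.35, 0.80, 0.55]  # LUQ score per question (illustrative)
factuality_scores = [0.91, 0.70, 0.22, 0.48]   # FactScore per question (illustrative)

r, p_value = pearsonr(uncertainty_scores, factuality_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")  # expect r well below zero
```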
The Scatter Plots
The results were compelling. The scatter plots below show the relationship between Factuality (x-axis) and Uncertainty (y-axis) for various models.

Notice the downward trend in the red lines. For models like Gemini 1.0 Pro and Tulu-2-70B, the correlation is stark. When LUQ says the model is uncertain (high y-axis value), the actual factuality score (x-axis) is almost always low.
Beating the Baselines
The researchers compared LUQ against several existing methods, including “White-Box” methods (like Semantic Entropy) and “Black-Box” methods (like SelfCheckNLI and Lexical Similarity).

As Table 1 demonstrates, LUQ consistently achieved the strongest negative correlations (scores are reported ×100, so values closer to -100 are better). For example, on Gemini 1.0 Pro, LUQ reached a Pearson correlation of -85.1, significantly outperforming baselines such as Lexical Similarity (-67.2) and Eigenvalue Laplacian (-72.7).
This confirms that checking for logical consistency (entailment) is a much better proxy for truth in long text than checking for word overlap.
The Impact of Entity Frequency
An interesting finding was how model knowledge varies by popularity. The researchers analyzed performance based on how “frequent” or popular the subject was in the training data (e.g., “Very Frequent” like HIV/AIDS vs. “Very Rare” diseases).

Figure 3 shows a clear trend:
- Top Chart (Factuality): Models are much more factual about frequent entities (dark green bars) than rare ones.
- Bottom Chart (Uncertainty): Correspondingly, LUQ correctly assigns lower uncertainty scores to frequent entities and higher uncertainty to rare ones.
This suggests that LUQ correctly identifies the “knowledge boundary” of an LLM: it flags when the model is venturing into obscure territory where hallucinations become likely.
Application: The Power of Ensembling
So, we can measure uncertainty. Why does that matter?
One of the most practical applications proposed in the paper is LUQ-ENSEMBLE. This method leverages the “wisdom of the crowd” (or rather, the wisdom of multiple models).
Suppose you have access to three different LLMs (e.g., Tulu, Gemini, and Vicuna). You ask all three to answer a question. Which answer do you trust?
Instead of guessing, you calculate the LUQ score for each. You then select the response with the lowest uncertainty.
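A hedged sketch of that selection rule, assuming each model's samples have already been scored with the `luq_uncertainty` function from earlier:

```python
# LUQ-ENSEMBLE (sketch): every model answers the same question; keep the answer
# whose LUQ uncertainty is lowest.
def luq_ensemble(candidates: dict[str, list[str]]) -> tuple[str, str, float]:
    """candidates maps a model name to its n sampled responses for one question."""
    best = None
    for model_name, samples in candidates.items():
        primary = samples[0]                    # treat the first sample as that model's answer
        u = luq_uncertainty(primary, samples)   # score it against its sibling samples
        if best is None or u < best[2]:
            best = (model_name, primary, u)
    return best  # (winning model, its answer, its uncertainty)
```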

Table 4 highlights the power of this approach.
- Looking at the first group: The best individual model (Tulu-2-70B) had a penalized factuality score (PFS) of 47.2%.
- By ensembling it with Gemini and Vicuna using LUQ as the selector, the score jumped to 52.8%.
This is a massive improvement achieved without retraining any models—simply by using uncertainty to filter out the “hallucinations” and keep the “confident truths.”
Selective Answering
Another application is selective question answering. If the LUQ score exceeds a chosen threshold, the system can refuse to answer rather than make something up. In the paper's experiments, abstaining on the top 15% most uncertain questions significantly improved the overall factuality of the remaining answers.
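A minimal sketch of such a refusal rule, assuming a batch of questions that have already been answered and scored (the 15% abstention rate mirrors the experiment above, but the threshold is a tunable design choice):

```python
# Selective answering (sketch): abstain on the most-uncertain fraction of questions.
import numpy as np

def answer_or_abstain(answers: list[str], u_scores: list[float], abstain_rate: float = 0.15) -> list[str]:
    """Replace the top `abstain_rate` most-uncertain answers with a refusal."""
    threshold = np.quantile(u_scores, 1.0 - abstain_rate)  # e.g. the 85th percentile
    return [
        answer if u < threshold else "I am not confident enough to answer this."
        for answer, u in zip(answers, u_scores)
    ]
```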
Conclusion
The transition from short-answer AI to long-form content generation brings new challenges in trust and reliability. The LUQ framework provides a robust solution for quantifying uncertainty in these complex scenarios without needing access to the model’s internal weights.
Key takeaways from this research:
- Consistency tracks factuality: If an LLM tells the same story (semantically) across multiple samples, it is much more likely to be telling the truth.
- Granularity Matters: Breaking long text into sentences and checking entailment is far more effective than analyzing the text as a whole block or checking simple word overlap.
- Actionable Metrics: LUQ scores are highly correlated with ground-truth factuality, making them reliable triggers for model ensembling or refusal mechanisms.
As we integrate LLMs into high-stakes fields like medicine, law, and education, tools like LUQ will be essential guardrails, ensuring we know when an AI is acting as an expert—and when it’s just guessing.