Large Language Models (LLMs) like GPT-4, Claude 3, and LLaMA-2 have revolutionized text generation. They can draft emails, write code, and—crucially—summarize vast amounts of information. However, despite their linguistic fluency, these models suffer from a persistent and dangerous flaw: hallucinations. They frequently generate plausible-sounding but entirely fabricated information.
In the context of text summarization, a hallucination is a “factual inconsistency”—a moment where the summary contradicts the source document or invents facts not present in it. While a typo in an email is embarrassing, a factual error in a medical or legal summary can be catastrophic.
Researchers from the University of Texas at Austin, Northeastern University, and Fidelity Investments have proposed a novel solution to this problem. Their paper, “Detecting Errors through Ensembling Prompts (DEEP)”, introduces an end-to-end framework that leverages the diversity of LLM prompts and ensemble learning to detect these errors with state-of-the-art accuracy.
This post will break down the limitations of current evaluation methods, explain the DEEP framework, and explore how it achieves reliable hallucination detection without the need for unrealistic tuning.
The Problem with Current Detection Methods
Before diving into DEEP, we must understand why existing tools fail to catch modern hallucinations.
Traditionally, metrics like ROUGE or BLEU were used to evaluate summaries. These metrics essentially count word overlaps between a generated summary and a reference summary. However, they are notoriously poor at detecting factual errors. A summary can use entirely different words than the reference but remain factually true, or reuse almost exactly the same words while dropping a single “not”, flipping the meaning and rendering it false.
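To make that failure mode concrete, here is a tiny illustration using the `rouge-score` package (my choice of tooling for this post, not something from the paper): a summary that flips the meaning of the reference by dropping one word still gets a near-perfect unigram-overlap score.

```python
# Illustration: high n-gram overlap does not imply factual consistency.
# Assumes the third-party `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The board did not approve the merger in March."
hallucinated = "The board did approve the merger in March."  # one word removed, meaning flipped

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
score = scorer.score(reference, hallucinated)["rouge1"]
print(round(score.fmeasure, 2))  # ~0.94 despite the factual contradiction
```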
To address this, the NLP community developed Factual Consistency Models (such as QAFactEval, AlignScore, and SummaC). These are typically encoder-based models (often fine-tuned variations of RoBERTa) that read a summary and a source text, outputting a numerical score representing how “factual” the summary is.
The Threshold Trap
Here lies the critical issue: these models output a continuous score (e.g., 0.75), but we need a binary decision (Factual vs. Hallucination). To make that decision, we must set a threshold. If the score is above \(X\), we call it factual.
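In code, that decision rule is almost trivially simple, which is exactly why the choice of \(X\) carries so much weight. The sketch below is purely illustrative; the 0.5 default is a hypothetical value, not one recommended by the paper.

```python
# The decision rule implied by encoder-based factuality detectors (sketch).
# `score` would come from a model such as AlignScore or QAFactEval;
# the default threshold of 0.5 is a hypothetical choice.
def is_factual(score: float, threshold: float = 0.5) -> bool:
    return score >= threshold

print(is_factual(0.75))                 # True under a 0.5 cutoff
print(is_factual(0.75, threshold=0.9))  # False if this dataset happens to need 0.9
```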
The researchers behind DEEP identified a major flaw in this approach. As shown in the paper, the “optimal” threshold varies wildly depending on the dataset and the summarization model used.

Figure 1 illustrates this inconsistency. The scatter plots show the optimal thresholds for different factual consistency models (QuestEval, SummaC, etc.) across various datasets (represented by different shapes). Ideally, these shapes would cluster tightly, indicating that a single threshold (e.g., 0.5) works everywhere. Instead, they are scattered. For one dataset, the optimal cutoff might be 0.2; for another, 0.9.
Why This Matters in the Real World
In a research setting, you might “cheat” by looking at your test data to find the best threshold. In the real world, you cannot do this. You have to pick a threshold before seeing the new data.
The researchers demonstrated that if you optimize thresholds on training data (the realistic scenario) or simply fix them at the midpoint of the score range, the performance of these traditional models plummets.

Figure 2 quantifies this failure. The red circles represent the unrealistic “perfect” scenario (optimizing on the test set). The green dots represent the realistic scenario (optimizing on training data). The gap between the red and green dots represents a massive loss in accuracy—sometimes dropping by over 10 points. This sensitivity makes traditional encoder-based models unreliable for practical applications where the data distribution might shift.
The DEEP Framework
To solve the thresholding problem and improve accuracy, the authors propose DEEP. The core philosophy is straightforward: instead of relying on a single model with a fragile threshold, use a diverse set of prompts to query an LLM, treat those answers as votes, and use an ensemble method to make the final decision.
Architecture Overview
The framework operates in a three-stage pipeline: Prompting, Ensembling, and Calibration.

As shown in Figure 3, the process begins with a Context (source text) and a Summary; a code sketch of the full pipeline follows the list below.
- Prompting: The system feeds these inputs into \(n\) unique LLM prompts. Each prompt asks the model to identify factual errors but uses different instructions or reasoning strategies.
- Ensembling: The binary outputs (1 for consistent, 0 for inconsistent) are collected into a vector. An “Ensembler” model aggregates these votes into a single probability.
- Calibration: Finally, a “Calibrator” adjusts this probability to ensure it accurately reflects the likelihood of an error, mitigating the LLM’s tendency to be overconfident.
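Here is a minimal sketch of how these three stages might fit together. The `query_llm` helper, the prompt templates, and the fitted `ensembler` and `calibrator` objects are hypothetical placeholders for illustration; this is not the authors' implementation.

```python
from typing import Callable, List

import numpy as np

def deep_pipeline(
    context: str,
    summary: str,
    prompts: List[str],               # n prompt templates with {context}/{summary} slots
    query_llm: Callable[[str], int],  # hypothetical helper: returns 1 (consistent) or 0
    ensembler,                        # fitted model with predict_proba (e.g. LabelModel or LogisticRegression)
    calibrator,                       # fitted Platt-scaling model (see Step 3)
) -> float:
    # Step 1: Prompting -- collect one binary vote per prompt.
    votes = np.array(
        [query_llm(p.format(context=context, summary=summary)) for p in prompts]
    ).reshape(1, -1)

    # Step 2: Ensembling -- aggregate the vote vector into a raw probability.
    raw_prob = ensembler.predict_proba(votes)[0, 1]

    # Step 3: Calibration -- adjust the raw probability so it tracks empirical accuracy.
    return calibrator.predict_proba([[raw_prob]])[0, 1]
```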
Step 1: Diverse Prompting Strategies
The researchers created a pool of prompts using techniques like Chain-of-Thought (CoT) to encourage reasoning. Rather than just asking “Is this true?”, the prompts guide the model through specific steps:
- Breaking the claim down into atomic facts.
- Checking for specific error types (e.g., numerical errors, entity errors).
- Comparing the summary to the article side-by-side.
By using different prompts, the framework captures different “perspectives” on the text. One prompt might be excellent at catching number discrepancies, while another excels at logic errors.
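The exact prompt wording is not reproduced here; the templates below are hypothetical examples of what this kind of diversity can look like in practice.

```python
# Hypothetical prompt templates illustrating the strategies described above.
# Each asks the model to end its answer with "Yes" (consistent) or "No" (inconsistent).
PROMPTS = [
    # Chain-of-thought over atomic facts
    "Break the summary into individual factual claims and check each one against the "
    "article.\n\nArticle:\n{context}\n\nSummary:\n{summary}\n\n"
    "Think step by step, then answer 'Yes' if every claim is supported, otherwise 'No'.",

    # Targeted error types
    "Check the summary for numerical errors, wrong entities, and unsupported dates "
    "relative to the article.\n\nArticle:\n{context}\n\nSummary:\n{summary}\n\n"
    "Answer 'Yes' if none are found, otherwise 'No'.",

    # Side-by-side comparison
    "Compare the summary to the article sentence by sentence and flag any statement "
    "that contradicts or is missing from the article.\n\nArticle:\n{context}\n\n"
    "Summary:\n{summary}\n\nAnswer 'Yes' if fully consistent, otherwise 'No'.",
]
```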
Step 2: Ensembling Methods
Once the prompts generate their binary Yes/No votes, how do you combine them? The simplest way is a Majority Vote (if 3 out of 5 prompts say “Error,” it’s an error). However, the authors explored more sophisticated machine learning approaches, treating the prompt outputs as features for a classifier.
They tested 16 ensembling methods, including:
- Tree-Based Methods: RandomForest, GradientBoosting, XGBoost.
- Label Aggregation: Snorkel’s LabelModel. This is particularly interesting because it is designed for “weak supervision.” It learns which prompts are correlated and which are noisy, effectively weighting the “smart” prompts more heavily than the “dumb” ones without needing massive ground-truth datasets (a toy sketch of this, alongside simple majority voting, follows below).
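Here is a toy sketch of both styles of aggregation: a plain majority vote and Snorkel's `LabelModel`. The vote matrix is made up for illustration; in practice the rows would be the binary outputs of the prompts from Step 1.

```python
# Sketch: aggregating binary prompt votes. Requires `snorkel` (pip install snorkel).
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy vote matrix: rows = summaries, columns = prompts; 1 = consistent, 0 = inconsistent.
votes = np.array([
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
])

# Baseline: simple majority vote.
majority = (votes.mean(axis=1) >= 0.5).astype(int)

# Weak supervision: the LabelModel estimates per-prompt accuracies and correlations
# from agreement patterns in the vote matrix, without ground-truth labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(votes, n_epochs=500, seed=0)
probs = label_model.predict_proba(votes)[:, 1]  # P(consistent) for each summary

print(majority)
print(probs.round(2))
```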
Step 3: Calibration
A known issue with neural networks, including LLMs, is overconfidence. A model might say it is “99% confident” that a summary is factual, even when it is wrong. To trust the system’s output, the predicted probability should match the empirical accuracy (i.e., if the model predicts 70% confidence for 100 items, roughly 70 of them should be correct).
The authors use a metric called Expected Calibration Error (ECE) to measure this reliability gap:
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|
\]

Here the \(N\) predictions are sorted into \(M\) equal-width confidence bins \(B_m\); \(\operatorname{acc}(B_m)\) is the accuracy within a bin and \(\operatorname{conf}(B_m)\) is the average predicted confidence within it, so ECE is the weighted average gap between the two.
To fix the overconfidence, DEEP employs Platt Scaling, a technique that fits a logistic regression model to the output scores, effectively reshaping the probability curve to align with reality.
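Here is a rough sketch of that calibration step: Platt scaling with scikit-learn's `LogisticRegression`, plus a simplified binary ECE that compares each bin's average predicted probability with the observed fraction of factual summaries in that bin. The validation scores and labels are synthetic placeholders, not data from the paper.

```python
# Sketch: Platt scaling plus a simplified binary Expected Calibration Error.
import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    positive rate, computed over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

# Synthetic held-out set: overconfident raw ensemble scores and true factuality labels.
rng = np.random.default_rng(0)
raw_probs = rng.uniform(0.5, 1.0, size=200)
labels = rng.binomial(1, 0.7, size=200)

# Platt scaling: a 1-D logistic regression mapping raw scores to calibrated probabilities.
platt = LogisticRegression()
platt.fit(raw_probs.reshape(-1, 1), labels)
calibrated = platt.predict_proba(raw_probs.reshape(-1, 1))[:, 1]

print("ECE before:", round(expected_calibration_error(raw_probs, labels), 3))
print("ECE after: ", round(expected_calibration_error(calibrated, labels), 3))
```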
Experiments and Results
The authors evaluated DEEP on three challenging benchmarks for hallucination detection: AggreFact-XSUM FTSOTA, TofuEval, and HaluEval Summarization. These datasets consist of summaries generated by modern transformers, making them harder and more relevant than older datasets.
Individual Prompt Performance
First, how well does a single prompt perform?

Table 1 shows the performance of the top 5 individual prompts. While they perform decently (balanced accuracy in the 60s and 70s), they show a distinct bias: Recall is generally higher than Precision. This means the prompts are good at finding errors but prone to false alarms (flagging accurate summaries as errors). Notably, GPT-4 prompts consistently outperformed GPT-3.5 prompts by an average of 2.5%.
The Power of Ensembling
The real jump in performance comes from combining these prompts.

Table 2 reveals the efficacy of the ensemble approach. Using just 3 prompts consistently yields performance improvements over the single best individual prompt.
- The LabelModel (Snorkel) frequently emerged as the top performer. This confirms the hypothesis that sophisticated aggregation can filter out the noise from individual prompts better than simple voting.
- Interestingly, increasing the ensemble size from 5 to 9 prompts did not always yield better results. This is likely because the additional prompts (6 through 9) were powered by GPT-3.5 (to save costs), which introduced more noise than the high-quality GPT-4 prompts could offset in some cases.
Beating the State-of-the-Art
The most significant finding is that DEEP outperforms existing encoder-based models (like AlignScore and QAFactEval) when those models are evaluated realistically (i.e., without optimizing thresholds on the test set). DEEP achieved SOTA balanced accuracy across the benchmarks.
It does this without fine-tuning the LLM itself. The “learning” happens in the lightweight ensemble layer (e.g., the LabelModel or Logistic Regression), which is computationally cheap to train compared to fine-tuning a transformer.
Visualizing Calibration
Does the calibration step actually work? The reliability diagrams suggest it works very well.

Figure 4 compares the model’s confidence (Red) with its actual accuracy (Blue).
- Uncalibrated (Top): The red bars are consistently higher than the blue bars. The model is overconfident.
- Calibrated (Bottom): The red and blue bars are nearly level. The model’s confidence now accurately reflects its likelihood of being correct.
This is crucial for automated systems. If DEEP says there is a 90% chance a summary is factual, users can actually trust that probability.
Conclusion and Implications
The DEEP framework addresses a critical gap in the deployment of Large Language Models. As we increasingly rely on AI to summarize meetings, news, and documents, the ability to automatically and reliably flag hallucinations is non-negotiable.
The authors have demonstrated three key contributions:
- Exposing Fragility: They showed that current encoder-based detectors are too sensitive to threshold settings to be practical in wild, unseen environments.
- Ensembling Efficacy: They proved that combining diverse LLM prompts via weak supervision techniques (like LabelModel) yields better accuracy than single prompts or traditional models.
- Reliability: By calibrating the ensemble, they produce probability scores that are empirically accurate, not just high numbers.
While DEEP is more computationally expensive than a simple BERT classifier (since it requires multiple calls to an LLM), the cost is justified for high-stakes tasks where accuracy is paramount. This framework offers a robust, “set-it-and-forget-it” approach to factuality that doesn’t require constant threshold retuning for every new dataset.