The rapid evolution of Large Language Models (LLMs) has revolutionized how we process information. Abstractive summarization—where an AI reads a long document and writes a concise summary in its own words—is one of the most practical applications of this technology. However, anyone who has used these tools knows they suffer from a critical flaw: hallucination.
Models often generate summaries that sound fluent and natural but contain factual errors not present in the source text. Detecting these inconsistencies is a major challenge in Natural Language Processing (NLP). Traditional metrics like ROUGE check for word overlap, which is insufficient for checking facts. Newer methods use Natural Language Inference (NLI) to check logic, but they often operate at the sentence level.
This sentence-level approach is where the cracks begin to show. A summary sentence might be 90% correct but contain one tiny, crucial error (like a wrong date or name). A sentence-level evaluator might give this a “pass” because the sentence is mostly similar to the source. Conversely, a summary might synthesize information from three different parts of a document. A standard evaluator looking for a one-to-one sentence match will fail to find the evidence and flag the summary as a hallucination, even though it is true.
In this post, we will explore FIZZ (Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document), a new method proposed by researchers from Chung-Ang University and Adobe Research. FIZZ addresses these limitations by “zooming in” to check atomic facts and “zooming out” to understand the broader context.
The Limitation of Sentence-Level Evaluation
To understand why we need a new method, we first need to look at where current methods fail. Most modern fact-checking systems use NLI models. These models take a premise (the document) and a hypothesis (the summary) and determine if the premise entails the hypothesis.
Typically, these systems pair a sentence from the summary with the most similar sentence in the document.
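As a rough sketch of that pairing step, the snippet below uses sentence embeddings to find the document sentence most similar to a summary sentence; the embedding model and the cosine-similarity pairing are illustrative choices, not the exact setup of any particular baseline. The paired sentence would then be handed to an NLI model as the premise.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; any sentence encoder would do.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def most_similar_sentence(summary_sentence: str, doc_sentences: list[str]) -> str:
    """Pick the document sentence closest to the summary sentence,
    which a sentence-level checker would use as the NLI premise."""
    doc_emb = encoder.encode(doc_sentences, convert_to_tensor=True)
    sum_emb = encoder.encode(summary_sentence, convert_to_tensor=True)
    sims = util.cos_sim(sum_emb, doc_emb)[0]
    return doc_sentences[int(sims.argmax())]
```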

As shown in Figure 1, a sentence-level evaluation might look at the sentence: “the 27-year-old joined spurs from manchester city in 2011.” The system gives it a confidence score of 0.53—a middling score that doesn’t tell us much. Is it wrong? Which part is wrong?
However, when we zoom in and break that sentence down into “Atomic Facts,” the error becomes obvious. The fact “Emmanuel Adebayor is 27 years old” receives a score of 0.09 (likely hallucinated), while “Emmanuel Adebayor joined Spurs” gets a 0.97 (verified).
Sentence-level evaluation aggregates these errors, obscuring the specific falsehoods. FIZZ aims to solve this by operating at the atomic level while simultaneously handling the context issues inherent in complex documents.
The FIZZ Pipeline: Zooming In and Out
FIZZ is designed to be both highly effective and interpretable. It doesn’t just tell you a summary is bad; it can pinpoint exactly which fact is incorrect. The architecture consists of four main stages: Coreference Resolution, Atomic Facts Decomposition (Zoom-in), Scoring, and Granularity Expansion (Zoom-out).

1. Coreference Resolution: Clarifying “Who” and “What”
NLI models struggle with pronouns. If a document says “He scored a goal” and the summary says “Messi scored a goal,” a model might not realize “He” refers to “Messi” without context.
Before checking facts, FIZZ applies Coreference Resolution to both the source document and the generated summary. This process replaces pronouns (he, she, it, they) with the specific entities they refer to.

By explicitly naming entities, the text becomes self-contained. This is crucial because later steps will chop the text into smaller pieces; without coreference resolution, those pieces would lose their context.
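The substitution itself is mechanical once a coreference model has grouped mentions into clusters. The sketch below shows only that replacement step on a toy, hand-built cluster; in practice the clusters would come from an off-the-shelf coreference resolver, and the cluster format here is this post's simplification rather than the authors' exact representation.

```python
def apply_coref_clusters(text: str, clusters: list[list[tuple[int, int]]]) -> str:
    """Replace every mention in a cluster with the cluster's first
    (representative) mention. Each mention is a (start, end) character span."""
    replacements = []
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        representative = text[rep_start:rep_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, representative))
    # Apply right-to-left so earlier character offsets stay valid.
    for start, end, rep in sorted(replacements, reverse=True):
        text = text[:start] + rep + text[end:]
    return text

doc = "Messi received the ball in the box. He scored a goal."
# One toy cluster: "Messi" (chars 0-5) and "He" (chars 36-38).
print(apply_coref_clusters(doc, [[(0, 5), (36, 38)]]))
# -> "Messi received the ball in the box. Messi scored a goal."
```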
2. Zoom-In: Atomic Facts Decomposition
Once pronouns are resolved, FIZZ “zooms in.” It takes the summary sentences and decomposes them into Atomic Facts. An atomic fact is defined as a short, concise statement containing no more than two or three entities.
For example, the sentence “Wales defender Chris Gunter says it would be a massive mistake to get complacent” is broken down into:
- Chris Gunter is a soccer player.
- Chris Gunter plays as a defender.
- Chris Gunter is from Wales.
- Chris Gunter says it would be a “massive mistake” to get complacent.
The Filtering Step: There is a risk here. Because FIZZ uses an LLM to generate these atomic facts, the LLM might hallucinate during the decomposition process itself (e.g., adding “Chris Gunter is a noun”). To prevent this, FIZZ employs a filtering mechanism. It checks if the generated atomic fact is actually entailed by the original summary sentence. If the atomic fact doesn’t align with the summary it came from, it is discarded.
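A minimal sketch of this zoom-in step might look like the following. The prompt wording, the generic `generate` callable standing in for the decomposition LLM, the `entailment_score` helper (sketched in the next section), and the 0.5 filtering threshold are all assumptions for illustration, not the paper's exact prompt or hyperparameters.

```python
from typing import Callable

DECOMPOSE_PROMPT = """Break the following sentence into short atomic facts, \
one per line. Each fact should mention at most two or three entities.

Sentence: {sentence}
Atomic facts:"""

def decompose(sentence: str, generate: Callable[[str], str]) -> list[str]:
    """Ask an LLM to split one summary sentence into candidate atomic facts."""
    raw = generate(DECOMPOSE_PROMPT.format(sentence=sentence))
    return [line.lstrip("-* ").strip() for line in raw.splitlines() if line.strip()]

def filter_facts(sentence: str, facts: list[str],
                 entailment_score: Callable[[str, str], float],
                 threshold: float = 0.5) -> list[str]:
    """Discard candidate facts that the summary sentence itself does not entail,
    removing anything the decomposition LLM may have hallucinated."""
    return [f for f in facts if entailment_score(sentence, f) >= threshold]
```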
3. Atomic Facts Scoring
With a clean list of atomic facts, FIZZ now checks them against the source document. It compares every atomic fact against every sentence in the document using an NLI model.

For each atomic fact (\(a_k\)), the system finds the single sentence in the document (\(d_i\)) that provides the highest entailment score. This creates a vector of scores, where each score represents how well a specific atomic fact is supported by the document.
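Here is one way to implement that scoring step with an off-the-shelf NLI model from Hugging Face. The specific checkpoint, and the assumption that the entailment class sits at index 2, are illustrative; the paper does not prescribe this exact model, so check `config.id2label` if you swap in another one.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # illustrative MNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
nli_model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # For this checkpoint the labels are (contradiction, neutral, entailment).
    return probs[2].item()

def score_fact(atomic_fact: str, doc_sentences: list[str]) -> float:
    """Zoom-in score: the best entailment over all document sentences."""
    return max(entailment_score(s, atomic_fact) for s in doc_sentences)
```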
4. Zoom-Out: Adaptive Granularity Expansion
This is where FIZZ distinguishes itself from many other decomposition-based methods. Abstractive summaries often combine information from multiple sentences. If an atomic fact relies on information spread across three sentences in the document, comparing it to just one sentence will yield a low score, resulting in a false positive (flagging a true fact as false).
To solve this, FIZZ uses Adaptive Granularity Expansion.
If the NLI model is not confident about an atomic fact (i.e., the outcome is not clearly “Entailment”), FIZZ “zooms out.” Instead of looking at a single document sentence, it expands the window to include surrounding sentences (e.g., the previous sentence and the next sentence). It then re-evaluates the atomic fact against this larger context block.
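Below is a sketch of that zoom-out logic, reusing the `entailment_score` helper from the previous section. The 0.9 confidence cutoff and the choice to grow the window one sentence on each side are illustrative stand-ins for the paper's adaptive criterion; the three-sentence cap mirrors the window size the authors found to work best (see the granularity ablation further down).

```python
def score_with_expansion(atomic_fact: str, doc_sentences: list[str],
                         entailment_score, max_window: int = 3,
                         confident: float = 0.9) -> float:
    """Score an atomic fact, expanding the document context when no single
    sentence is a confident match (illustrative thresholds, not the paper's)."""
    per_sentence = [entailment_score(s, atomic_fact) for s in doc_sentences]
    best_idx = max(range(len(per_sentence)), key=per_sentence.__getitem__)
    best = per_sentence[best_idx]

    lo, hi = best_idx, best_idx + 1            # current window [lo, hi)
    while best < confident and (hi - lo) < max_window:
        new_lo, new_hi = max(0, lo - 1), min(len(doc_sentences), hi + 1)
        if (new_lo, new_hi) == (lo, hi):       # cannot expand any further
            break
        lo, hi = new_lo, new_hi
        premise = " ".join(doc_sentences[lo:hi])
        best = max(best, entailment_score(premise, atomic_fact))
    return best
```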

Figure 3 illustrates the power of combining Coreference Resolution and Granularity Expansion:
- Case (a): With just coreference resolution, the score is low (0.02) because the context is split.
- Case (b): With just granularity expansion, the score is still low (0.09) because pronouns like “he” are ambiguous.
- Case (c): When both are applied, the system understands “Chris Gunter” (coref) and sees the full context (granularity), resulting in a high verification score of 0.83.
Final Scoring
After checking all facts and expanding context where necessary, FIZZ determines the final factual consistency score of the summary. It takes a conservative approach: a summary is only as factually consistent as its weakest link.

The final FIZZ score is the minimum value in the vector of atomic fact scores. If even one atomic fact is a hallucination, the score for the whole summary drops, reflecting the strict requirement for factual accuracy.
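Using the notation from the scoring step (atomic facts \(a_k\), document sentences \(d_i\)), the final score can be written roughly as

\[
\mathrm{FIZZ}(D, S) \;=\; \min_{k}\, \max_{i}\; \mathrm{NLI}_{\mathrm{entail}}(d_i,\, a_k),
\]

where, for facts that triggered granularity expansion, the inner maximum is taken over the enlarged context blocks rather than single sentences.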
Experimental Results
The researchers evaluated FIZZ on the AGGREFACT benchmark, a comprehensive dataset aggregating nine different summary evaluation datasets. They compared FIZZ against strong baselines, including QuestEval, QAFactEval, AlignScore, and methods based on ChatGPT and GPT-4 (FactScore, FacTool).
Performance Comparison
The results show that FIZZ achieves state-of-the-art performance.

In Table 1, FIZZ demonstrates the highest balanced accuracy (71.0) on the combined average, outperforming complex LLM-based evaluators. Notably, the "w/o GE" (without Granularity Expansion) ablation drops to 69.3, showing that the "zoom-out" step contributes meaningfully to accuracy, especially on the XSUM subset, whose highly abstractive summaries require multi-sentence reasoning.
The Impact of Coreference Resolution
The researchers also isolated the impact of coreference resolution. They tested performance when applying resolution to the document, the atomic facts, neither, or both.

Table 6 reveals a clear trend: resolving coreferences in both the document and the atomic facts yields the highest performance (69.2, versus 64.8 when neither is resolved). This confirms that resolving pronouns is essential for NLI models to accurately track entities across text.
Granularity Size
How much should we zoom out? The paper explores different window sizes for granularity expansion.

As shown in Table 5, increasing the window size generally improves performance, particularly for XSUM (which requires more context). However, there is a trade-off: larger windows increase computational cost (measured in seconds per iteration). The authors found that a maximum window of three sentences offered the best balance between accuracy and speed.
Which LLM is Best for Zooming In?
FIZZ relies on an LLM to decompose summaries into atomic facts. Does the choice of LLM matter?

Surprisingly, larger models like GPT-3.5-turbo did not produce the best results for this specific task. Orca-2 (a 7B parameter model) achieved the highest accuracy (71.0). The authors attribute this to the conciseness of the atomic facts. As seen in the “Avg. Token Length” column, Orca-2 produced shorter, sharper facts (81.4 tokens avg) compared to Zephyr or GPT-3.5. Shorter facts are easier for NLI models to verify effectively.
Interpretability and Limitations
One of the strongest arguments for FIZZ is interpretability. Traditional scores give you a single number. FIZZ provides a list of atomic facts, each with a verification score. This allows a user to see exactly why a summary was rejected—for example, “The model hallucinated the age of the protagonist.”
However, the authors note a potential drawback to the atomic fact approach.

As seen in Figure 4, sometimes breaking a sentence down too far can strip away necessary context. The fact “The tweet was about a rocket landing” receives a low score (0.33) because the specific context linking the tweet to the landing might be implicit in the full sentence but lost in isolation. This can lead to false positives where a valid summary is penalized.
Conclusion
FIZZ represents a significant step forward in automated fact-checking. By combining the precision of atomic facts (Zoom-in) with the context awareness of granularity expansion (Zoom-out), it addresses the two main failure modes of previous systems: missing fine-grained errors and failing to understand multi-sentence context.
The method outperforms existing baselines and offers a transparent view into the factual consistency of AI-generated text. As we rely more on LLMs to summarize news, medical reports, and legal documents, robust and interpretable metrics like FIZZ will be essential to ensure we can trust what we read.