Retrieval-Augmented Generation (RAG) has revolutionized how we use Large Language Models (LLMs). By giving models access to external tools, Wikipedia, or private documents, we've turned them from creative fiction writers into knowledgeable assistants. In theory, at least.
In practice, RAG systems suffer from a “trustworthiness” crisis. Even when you provide the correct document to an LLM, it might hallucinate, misinterpret the text, or revert to its pre-trained memory, ignoring the evidence entirely. This phenomenon is known as unfaithfulness.
How do we stop this? A new paper from UCLA, Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation, proposes a clever solution. Instead of checking the facts after the entire essay is written (which is slow and expensive), they introduce SYNCHECK: a system that monitors the model’s “heartbeat” during the generation process to catch lies the moment they happen.
In this post, we’ll break down how SYNCHECK works, the clever metrics it tracks, and how it powers a new decoding algorithm (FOD) that doesn’t just detect errors—it fixes them in real-time.
The Problem: Faithfulness vs. Factuality
Before diving into the solution, we need to define the problem accurately. In the context of RAG, there is a distinct difference between factuality and faithfulness:
- Factuality: Is the statement true in the real world?
- Faithfulness: Does the statement accurately reflect the retrieved context provided to the model?
RAG systems are designed to be faithful. If a retrieved document says “The sky is green,” a faithful model should report “The sky is green.” If the model ignores that and says “The sky is blue” (because its pre-training tells it so), it is being unfaithful to the context.
Current methods to stop this are clunky. They either refuse to answer entirely (abstention) or try to “critique” the answer after it’s generated. The researchers behind SYNCHECK argue that we need a finer-grained approach: synchronous monitoring.
SYNCHECK: Checking the Model’s Pulse
The core contribution of this paper is SYNCHECK. Think of it as a polygraph test running in parallel with the LLM. As the LLM generates sentences, SYNCHECK analyzes four specific signals to determine if the sentence is trustworthy.

As shown in Figure 1(a) above, SYNCHECK aggregates multiple signals to produce a faithfulness probability (\(P_{\text{faith}}\)). Let’s look at the four “vital signs” it monitors:
1. Likelihood (The Confidence Check)
The most basic signal is the probability the model assigns to its own tokens. If the model is stammering (generating tokens with low probability), it usually indicates a “knowledge gap.” It doesn’t know what to say, so it’s likely making things up. SYNCHECK measures the minimum and average token likelihoods in a sentence.
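To make this concrete, here is a minimal sketch of extracting those two features from a Hugging Face causal LM. It is not the paper’s implementation: the model name (“gpt2”) is a stand-in, and in practice you would monitor the same model that is doing the RAG generation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder; SYNCHECK monitors the RAG model itself.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def likelihood_features(prompt: str, sentence: str):
    """Min and mean probability the model assigns to the tokens of one sentence."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    sent_ids = tok(sentence, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, sent_ids], dim=-1)

    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, seq_len, vocab)

    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]                           # token i is predicted from position i-1
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the log-probs of the sentence's own tokens.
    sent_probs = token_logps[0, prompt_ids.shape[1] - 1:].exp()
    return sent_probs.min().item(), sent_probs.mean().item()
```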
2. Uncertainty (The Confusion Check)
Even if a model assigns a high probability to a token, it might still be uncertain compared to other options. SYNCHECK measures entropy (how “spread out” the probability distribution is). High entropy means the model is confused.
Crucially, the researchers also use a metric called Local Intrinsic Dimension (LID) on the model’s internal layers. This complex-sounding metric helps quantify how “familiar” the current generation state is to the model. High LID often correlates with the model struggling to mix the retrieved context with its own memory.
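The entropy part is simple to sketch (LID is omitted here, since it needs hidden-state access and a nearest-neighbor dimension estimator):

```python
import torch

def token_entropies(step_logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each decoding step.

    step_logits: (seq_len, vocab_size) raw scores from the language model.
    High entropy means the probability mass is spread over many tokens,
    i.e. the model is unsure what to write next.
    """
    log_probs = torch.log_softmax(step_logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)    # (seq_len,)

# Sentence-level features would then be e.g. the max and mean of these values.
```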
3. Context Influence (The Attention Check)
This is arguably the most interesting signal. The system asks: Did the retrieved document actually change what the model wrote?
To calculate this, SYNCHECK runs a quick comparison using Kullback-Leibler (KL) Divergence. It compares two probability distributions:
- What the model predicts with the context.
- What the model predicts without the context.
If these two distributions are nearly identical, the KL divergence is low. This means the model ignored the context and relied entirely on its pre-trained memory—a huge red flag for unfaithfulness.
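A rough sketch of that measurement, assuming we can run the model twice at the same decoding step, once with the retrieved passage in the prompt and once without, and collect the next-token logits from each run:

```python
import torch
import torch.nn.functional as F

def context_influence(logits_with_ctx: torch.Tensor,
                      logits_without_ctx: torch.Tensor) -> float:
    """KL( P(token | context) || P(token | no context) ) at one decoding step.

    A value near zero means the retrieved document barely changed the
    model's prediction -- a warning sign that the context is being ignored.
    """
    log_p = F.log_softmax(logits_with_ctx, dim=-1)       # with retrieved passage
    log_q = F.log_softmax(logits_without_ctx, dim=-1)    # passage removed
    return F.kl_div(log_q, log_p, log_target=True, reduction="sum").item()
```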
4. Semantic Alignment (The Meaning Check)
Finally, SYNCHECK performs a lightweight “entailment” check. It uses a smaller, efficient model to verify if the generated sentence is logically supported by the retrieved context. This catches cases where the model sounds confident and uses the context, but misinterprets the meaning (e.g., saying “hugging increases mortality” when the text says “hugging reduces mortality”).
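One lightweight way to approximate this check is an off-the-shelf NLI (natural language inference) model. The checkpoint below is an assumption for illustration, not the checker used in the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder NLI checkpoint; the paper uses its own lightweight checker.
NLI_MODEL = "microsoft/deberta-large-mnli"
nli_tok = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def alignment_score(context: str, sentence: str) -> float:
    """Probability that the retrieved context entails the generated sentence."""
    inputs = nli_tok(context, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    # Check nli.config.id2label for your checkpoint; here index 2 is "ENTAILMENT".
    return probs[2].item()
```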
The Aggregator
These four signals (Likelihood, Uncertainty, Influence, Alignment) are fed into a lightweight aggregator. Surprisingly, the researchers found that a simple Logistic Regression model or a small MLP (Multi-Layer Perceptron) works incredibly well to combine these signals into a single “Faithfulness Score.”
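A toy sketch of that aggregation step, with synthetic stand-in features in place of the real per-sentence signals:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: in practice each row would hold the per-sentence signals from
# the monitors above (likelihood, uncertainty, context influence, alignment),
# and each label would mark whether that sentence was faithful to its context.
X_train = rng.normal(size=(500, 4))
y_train = (rng.random(500) > 0.5).astype(int)

aggregator = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# P_faith for one new sentence's feature vector.
p_faith = aggregator.predict_proba(X_train[:1])[0, 1]
```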
For a detailed look at the specific features used, see the table below:

Performance: Does SYNCHECK Work?
To test this, the authors compiled a comprehensive benchmark covering Question Answering, Summarization, Data-to-Text, and Biography generation. They compared SYNCHECK against existing methods like Self-RAG (which uses critique tokens) and FLARE (which uses likelihood).
The results were stark.

As Table 1 shows, traditional methods (FLARE, CriticTok) struggle, hovering around 0.6 AUROC (where 0.5 is random guessing). SYNCHECK consistently scores above 0.80, and often near 0.85-0.90. This shows that relying on a single signal (like token probability) isn’t enough; you need the full picture of decoding dynamics to catch unfaithfulness.
FOD: Fixing the Problem in Real-Time
Detecting an error is useful, but fixing it is better. The authors leveraged SYNCHECK to build Faithfulness-Oriented Decoding (FOD).
Traditional decoding (Greedy Search) just picks the most likely next word. FOD, illustrated in Figure 1(b), adds a safety mechanism:
- Generate: The model generates a sentence.
- Monitor: SYNCHECK calculates the faithfulness score.
- Backtrack: If the score drops below a threshold (\(\tau_1\)), the system hits the brakes. It discards the unfaithful sentence.
- Beam Search: The model goes back to the previous faithful sentence and initiates a “Beam Search.” It tries multiple different paths forward, pruning any branches that look unfaithful, and keeps the best one.
This approach balances faithfulness (being true to the text) with informativeness (actually answering the question).
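Here is a schematic of that loop in Python. The helpers (generate_sentence, syncheck_score, beam_candidates) and the threshold value are hypothetical stand-ins for the model call, the aggregated faithfulness monitor, and sentence-level beam expansion; this is a sketch of the control flow, not the paper’s implementation.

```python
def faithfulness_oriented_decoding(prompt, context, tau=0.6,
                                   beam_width=4, max_sentences=10):
    """Sentence-level decoding guarded by a SYNCHECK-style monitor.

    generate_sentence, syncheck_score, and beam_candidates are hypothetical
    helpers: one greedy sentence continuation, the aggregated faithfulness
    probability, and several alternative continuations, respectively.
    """
    output = []
    for _ in range(max_sentences):
        # 1. Generate: greedily propose the next sentence.
        sentence = generate_sentence(prompt, context, output)

        # 2. Monitor: score it with the synchronous faithfulness checker.
        if syncheck_score(context, output, sentence) >= tau:
            output.append(sentence)
            continue

        # 3. Backtrack: discard the unfaithful sentence, then
        # 4. Beam search: expand alternatives and keep the most faithful one.
        candidates = beam_candidates(prompt, context, output, beam_width)
        best_score, best_sentence = max(
            (syncheck_score(context, output, c), c) for c in candidates
        )
        if best_score < tau:
            break  # no faithful continuation found; stop rather than hallucinate
        output.append(best_sentence)
    return " ".join(output)
```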
Results: FOD vs. The World
The researchers compared FOD against Greedy Search, Abstention (refusing to answer if unsure), and Context-Aware Decoding (CAD).

Figure 2 visualizes the performance. The x-axis represents the length of the generated text (number of sentences).
- Blue Triangle (Greedy): As the text gets longer, faithfulness often drops or stays mediocre.
- Green Circle (FOD): This line is consistently higher. FOD maintains high faithfulness even as the generation gets longer, particularly in difficult tasks like Data-to-Text.
FOD outperformed standard Greedy Search by 12% and Context-Aware Decoding by 19% across six datasets.
Deep Dive: What Matters Most?
You might wonder, “Do we really need all four signals?” The researchers performed an ablation study to find out.

Figure 3 shows the impact of removing features. While Semantic Alignment (green bar) is the strongest individual contributor, removing Context Influence (pinkish-orange) or Uncertainty (orange) also hurts performance. The best results (blue bar) come from combining all signals. This confirms that unfaithfulness is a multifaceted problem; sometimes it’s due to confusion (Uncertainty), and sometimes it’s due to ignoring the docs (Context Influence).
Can it Generalize?
One of the most promising findings is the transferability of SYNCHECK. Do you need to train a specific detector for every single task?

Figure 4 is a heatmap showing cross-task performance. The y-axis is the training task, and the x-axis is the testing task. The abundance of red/orange blocks indicates that a SYNCHECK model trained on one task (like QA) performs surprisingly well when testing on a completely different task (like Summarization). This suggests that the “signatures” of hallucinations—high entropy, low context influence—are universal across different types of writing.
Conclusion
The “black box” nature of LLMs has always made them difficult to trust in high-stakes environments. Synchronous Faithfulness Monitoring offers a peek inside that box. By measuring not just what the model says, but how it behaves while saying it (likelihood, entropy, context usage), we can build guardrails that actually work.
SYNCHECK and FOD demonstrate that we don’t have to choose between a creative model and a faithful one. With the right monitoring, we can nudge models to stick to the facts, making RAG systems reliable enough for the real world.
Key Takeaways:
- Don’t trust likelihood alone: Just because a model is confident doesn’t mean it’s faithful.
- Context Influence is key: Measuring if the model is ignoring the retrieved text (via KL divergence) is a powerful detector of hallucinations.
- Fix it live: Post-hoc editing is too late. Monitoring and backtracking during generation (FOD) yields the best results.
This post explores the research paper “Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented Generation” by Wu et al. (2024).