Large Language Models (LLMs) have a reputation for being confident, articulate, and occasionally completely wrong. This phenomenon, known as hallucination, is a significant barrier to deploying AI in safety-critical fields like healthcare or finance. To combat this, the industry has largely adopted Retrieval Augmented Generation (RAG).

The premise of RAG is simple: instead of relying solely on the LLM’s internal memory, we give the model an open-book test. We retrieve relevant documents from a trusted external database and ask the model to answer based on that information. Theoretically, this anchors the model in reality.

But there is a catch. What happens if the user asks a question that isn’t covered by the external database? Or what if the database is outdated? In these scenarios, RAG systems often fail silently. They retrieve irrelevant documents because they are forced to retrieve something, and the LLM proceeds to hallucinate an answer based on noise.

In this post, we are deep-diving into a research paper titled “Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation.” The researchers propose a rigorous statistical framework to give RAG systems the ability to say “I don’t know.” By characterizing the relevance between a query and the knowledge base, they introduce methods to detect Out-of-Knowledge (OoK) queries in real-time and monitor system health over time.

The Core Problem: The Silent Failure of RAG

Imagine a medical RAG system trained on a database of cardiology textbooks. A user asks a question about a rare neurological disorder. The retriever, which calculates semantic similarity, looks through the cardiology books and finds the “closest” match—perhaps a paragraph about nerves in the heart. The generator then tries to construct an answer about the brain using heart-related context. The result is a plausible-sounding but factually incorrect diagnosis.

This happens because standard RAG systems lack a mechanism to assess Query-Knowledge Relevance. They assume that if a query is asked, the answer must be in the database.

The authors of this paper argue that we need to treat this as a statistical hypothesis testing problem. We need to distinguish between:

  1. In-Knowledge (IK) Queries: Questions that can be answered using the corpus.
  2. Out-of-Knowledge (OoK) Queries: Questions outside the scope of the corpus.

The Solution: A Statistical Framework

The researchers introduce a framework that operates in two distinct scenarios: Online Testing (checking individual queries as they come in) and Offline Testing (checking for shifts in user behavior over time).

Figure 1: Overview of the hypothesis-testing framework for assessing query-knowledge relevance in RAG.

As shown in Figure 1 above, the framework relies on embedding models (which convert text into numerical vectors) to measure how well a query fits into the distribution of known, answerable questions.
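
To make this concrete, here is a minimal sketch of the embedding step, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (neither is prescribed by the paper), with a toy two-document corpus:

```python
# Minimal sketch (not the paper's exact pipeline): embed a corpus and a query,
# then score the query against every document by cosine similarity.
from sentence_transformers import SentenceTransformer

corpus = [
    "The sinoatrial node sets the heart's rhythm.",
    "Beta blockers reduce myocardial oxygen demand.",
]  # hypothetical cardiology chunks
query = "What causes multiple sclerosis?"  # likely Out-of-Knowledge here

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
doc_emb = model.encode(corpus, normalize_embeddings=True)      # shape (N, d)
q_emb = model.encode([query], normalize_embeddings=True)[0]    # shape (d,)

# With normalized embeddings, the dot product equals cosine similarity.
similarities = doc_emb @ q_emb  # one score per document
print(similarities)
```

These per-document similarities are the raw material for the test statistics described later in the post.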

Defining Relevance

Before we can test for relevance, we must define it mathematically. The authors propose a formal definition of Query-Knowledge Relevance (\(r\)).

\[
r(q \mid \mathcal{D}) = P\big(a_{gt} \mid q, D^r\big) - P\big(a_{gt} \mid q\big) \tag{1}
\]

In simple terms, this equation defines relevance as the improvement in the probability of generating the correct answer (\(a_{gt}\)) when the model is given the retrieved documents (\(D^r\)) compared to when it relies only on its internal knowledge.

  • If \(r(q|\mathcal{D}) > 0\), the corpus helps answer the question. The query is In-Knowledge.
  • If \(r(q|\mathcal{D}) \le 0\), the corpus does not help (or might even hurt). The query is Out-of-Knowledge.

Since we cannot calculate this probability directly for every new query (because we don’t know the ground truth answer), the paper proposes using Goodness-of-Fit (GoF) tests to estimate this empirically.

Core Method: The Online Testing Procedure

The online testing procedure is designed to protect the system in real-time. When a user submits a query, the system decides whether to attempt an answer or flag it as Out-of-Knowledge.

1. The Hypothesis Test

The authors frame this as a hypothesis test.

  • Null Hypothesis (\(\mathcal{H}_0\)): The new query \(q\) comes from the same distribution as the In-Knowledge (IK) queries.
  • Alternative Hypothesis (\(\mathcal{H}_1\)): The new query comes from a different distribution (it is OoK).

\[
\mathcal{H}_0: q \sim \mathcal{P}_I \quad \text{vs.} \quad \mathcal{H}_1: q \not\sim \mathcal{P}_I \tag{2}
\]

where \(\mathcal{P}_I\) denotes the distribution of In-Knowledge queries.

To decide whether to reject the null hypothesis, the system calculates a p-value. If the p-value is below a chosen significance threshold (typically \(\alpha = 0.05\)), we reject the null hypothesis and flag the query as Out-of-Knowledge.

The calculation of the p-value relies on comparing the new query’s “score” against the scores of known valid queries. This is formalized in the empirical Cumulative Distribution Function (eCDF):

\[
p(q) = \hat{F}_{Q_I}\big(t(q)\big) = \frac{1}{|Q_I|} \sum_{q' \in Q_I} \mathbb{1}\big\{ t(q') \le t(q) \big\} \tag{4}
\]

Here, \(t(q)\) is the test statistic (the score) for the new query, and \(Q_I\) represents a set of known In-Knowledge queries.
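
Concretely, the online decision rule is only a few lines of code. The sketch below assumes a test statistic \(t(\cdot)\) oriented so that higher values mean "more in-knowledge", plus a reference array of statistics computed on known (or synthetic) In-Knowledge queries; the function names are illustrative, not the paper's.

```python
import numpy as np

def ecdf_p_value(t_new: float, t_reference: np.ndarray) -> float:
    """Fraction of reference In-Knowledge statistics that are <= the new query's
    statistic. A small value means the new query sits in the left tail, i.e. it
    looks less in-knowledge than almost every reference query."""
    return float(np.mean(t_reference <= t_new))

def is_out_of_knowledge(t_new: float, t_reference: np.ndarray, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis (query is In-Knowledge) when p < alpha."""
    return ecdf_p_value(t_new, t_reference) < alpha

# Toy usage: t_reference holds, e.g., Maximum Similarity Scores of reference queries.
t_reference = np.array([0.71, 0.68, 0.75, 0.80, 0.66, 0.73])
print(is_out_of_knowledge(0.31, t_reference))  # True: flag the query and abstain
```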

2. Test Statistics: How Do We Score a Query?

The “score” mentioned above is critical. What mathematical property best separates a valid cardiology question from an invalid neurology question in our hypothetical system? The authors explore several statistics derived from the embedding space (a minimal sketch of how they can be computed follows this list):

  • Maximum Similarity Score (MSS): The similarity score of the single closest document. If even the best match has low similarity, the query is likely OoK.
  • K-th Nearest Neighbor (KNN): The similarity score of the \(k\)-th closest document. This checks if there is a dense cluster of relevant documents, rather than just one lucky match.
  • Average KNN (AvgKNN): The average similarity of the top \(k\) documents.
  • Entropy: Measures the uncertainty of the retrieval distribution. High entropy suggests the retriever is “confused” and spreading its probability across many weakly related documents.
  • Energy: Adapted from out-of-distribution detection in computer vision. It uses the free energy function of the embedding space. This method often captures the density of the data more holistically than simple cosine similarity.
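
As a rough illustration, all of these statistics can be derived from the vector of query-document similarities. The sketch below uses the common logsumexp form of the energy score from the OOD-detection literature; the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def test_statistics(similarities: np.ndarray, k: int = 5, temperature: float = 1.0) -> dict:
    """Candidate scores for one query, given its similarity to every document."""
    top_k = np.sort(similarities)[-k:]           # the k highest similarities
    probs = softmax(similarities / temperature)  # retrieval distribution over documents
    return {
        "MSS": float(top_k[-1]),        # best single match
        "KNN": float(top_k[0]),         # k-th closest document
        "AvgKNN": float(top_k.mean()),  # average of the top-k matches
        # High entropy = the retriever spreads probability over many weak matches.
        "Entropy": float(-np.sum(probs * np.log(probs + 1e-12))),
        # Free-energy-style score: larger when the query sits in a dense region.
        "Energy": float(temperature * logsumexp(similarities / temperature)),
    }

print(test_statistics(np.array([0.71, 0.40, 0.38, 0.35, 0.33, 0.10])))
```

Note that entropy runs in the opposite direction from the similarity-based scores (lower entropy suggests a confident, In-Knowledge retrieval), so its sign would be flipped before plugging it into the one-sided test sketched earlier.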

3. The Problem of “Known” Queries

To run a Goodness-of-Fit test, you need a reference distribution—a bucket of queries you know are valid (\(Q_I\)). But in a real-world application, you start with a pile of documents, not a pile of valid questions.

The authors propose a clever workaround: Synthetic Query Generation. They use an LLM (like GPT-3.5) to read chunks of the document corpus and generate questions that can be answered by that chunk. These synthetic questions serve as the proxy for the In-Knowledge distribution.

This allows the framework to be entirely self-supervised. You don’t need human-labeled data to set up the safety guardrails; you just need the documents themselves.
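
A generation loop for these proxy queries might look like the sketch below, which assumes an OpenAI-compatible client; the prompt wording and model name are illustrative rather than taken from the paper.

```python
# Sketch of synthetic query generation. Prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Read the following passage and write one question that can be fully "
    "answered using only the passage.\n\nPassage:\n{chunk}\n\nQuestion:"
)

def synthetic_queries(chunks: list[str], model: str = "gpt-3.5-turbo") -> list[str]:
    """Generate one proxy In-Knowledge question per document chunk."""
    questions = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
        )
        questions.append(response.choices[0].message.content.strip())
    return questions
```

The generated questions are then embedded and scored exactly like real queries to build the reference distribution \(Q_I\).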

Core Method: The Offline Testing Procedure

The second part of the framework addresses a longer-term problem: Distribution Shift.

Over time, user interests change. If a medical database was built in 2019, it would be highly relevant for queries about “Flu” but statistically irrelevant for queries about “COVID-19,” even though both are medical topics.

The offline procedure takes a batch of recent user queries (\(Q_P\)) and compares their distribution to the historical In-Knowledge distribution (\(Q_I\)) using the Kolmogorov-Smirnov (KS) Test.

\[
t_{KS} = \sup_{x} \big| \hat{F}_{Q_I}(x) - \hat{F}_{Q_P}(x) \big| \tag{5}
\]

The KS test (\(t_{KS}\)) measures the maximum distance between the cumulative distribution functions of the two query sets. If this distance is statistically significant, it signals to the developers that the knowledge base is no longer aligned with what users are asking. It’s time to update the corpus.
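
In practice this is a two-sample KS test over the test statistics of the two query sets. A minimal sketch using SciPy, with synthetic stand-in scores in place of real query logs, might look like this:

```python
import numpy as np
from scipy.stats import ks_2samp

# Stand-in score distributions; in a real system these would be the test
# statistics of the historical In-Knowledge set (Q_I) and a recent batch (Q_P).
t_historical = np.random.normal(loc=0.75, scale=0.05, size=500)
t_production = np.random.normal(loc=0.55, scale=0.10, size=200)

statistic, p_value = ks_2samp(t_historical, t_production)
if p_value < 0.05:
    print(f"Distribution shift detected (KS = {statistic:.3f}); time to update the corpus.")
```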

Experiments and Results

The authors subjected their framework to a rigorous evaluation using eight Question-Answering (QA) datasets, specifically focusing on the biomedical domain (which is high-stakes and terminology-dense). They used datasets like MedQA-US (medical exams) and PubMedQA as “In-Knowledge” benchmarks, and general domain datasets like TruthfulQA as “Out-of-Knowledge” noise.

1. Which Test Statistic is Best?

The researchers compared the ability of different statistics (MSS, KNN, Entropy, Energy, etc.) to distinguish between valid and invalid queries.

Table 1: AUROC results of different test statistics. (a) Textbooks corpus; (b) PubMed corpus.

The tables above show the AUROC (Area Under the Receiver Operating Characteristic curve) scores. An AUROC of 1.0 represents perfect detection.

  • Performance is high: Across most datasets, the methods achieved AUROC scores above 0.90, and often close to 0.99 for “far” OoK queries (like asking a physics question to a medical bot).
  • No single winner: While Energy scores (derived from energy-based models) performed exceptionally well on the Textbooks corpus, MSS (Maximum Similarity Score) was slightly better for the PubMed corpus. This suggests that the choice of statistic should be tuned to the specific data domain.

2. Beating the Baselines

The paper compares the Goodness-of-Fit (GoF) approach against standard outlier detection algorithms like One-Class SVM and Local Outlier Factor (LOF).
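
For context, these baselines are fit directly on the In-Knowledge query embeddings and then asked to label new queries as inliers or outliers. A rough scikit-learn sketch, with placeholder dimensions and hyperparameters rather than the paper's settings:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

ik_embeddings = np.random.randn(200, 384)   # stand-in for In-Knowledge query embeddings
new_embeddings = np.random.randn(10, 384)   # stand-in for incoming query embeddings

ocsvm = OneClassSVM(nu=0.05).fit(ik_embeddings)
print(ocsvm.predict(new_embeddings))        # +1 = inlier (IK), -1 = outlier (OoK)

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(ik_embeddings)
print(lof.predict(new_embeddings))          # same +1 / -1 convention
```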

Table 2: Comparison with outlier detection-based baselines on the Textbooks corpus.

The results in Table 2 are decisive. The GoF framework (specifically using Energy scores) significantly outperforms traditional outlier detection methods. The authors attribute this to sample efficiency; standard outlier detection struggles with the high dimensionality of text embeddings when sample sizes are small.

3. Can’t the LLM Just Tell Us?

A common counter-argument is: “Why not just ask GPT-4 if the retrieved documents are relevant?”

The authors tested this “LLM-based Relevance Score” approach. They prompted GPT-3.5 and GPT-4 to rate relevance on a scale of 0 to 1.
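
The sketch below shows roughly what such an LLM-as-judge baseline looks like; the prompt wording, parsing, and model name are illustrative, not the paper's exact setup.

```python
# Hypothetical LLM-as-judge relevance scorer, assuming an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI()

RELEVANCE_PROMPT = (
    "Question: {query}\n\nRetrieved passages:\n{passages}\n\n"
    "On a scale from 0 to 1, how relevant are these passages to answering the "
    "question? Reply with a single number."
)

def llm_relevance_score(query: str, passages: str, model: str = "gpt-4") -> float:
    """Ask the model to rate retrieval relevance and parse the reply as a float."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RELEVANCE_PROMPT.format(
            query=query, passages=passages)}],
    )
    return float(response.choices[0].message.content.strip())
```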

Table 3: Comparison with LM-based relevance score on the Textbooks corpus.

As shown in Table 3, the LLMs failed spectacularly. GPT-4 achieved an AUROC of only 0.2088 on PubMedQA—worse than random guessing. This highlights a critical weakness in current LLMs: they are often overconfident and struggle to strictly quantify the relevance of complex technical text without hallucinating connections.

4. The Power of Synthetic Queries

Recall the “workaround” of generating synthetic questions to train the detector. Does training on fake questions actually work for detecting real bad questions?

Figure 2: Illustration of critical values estimated using true in-knowledge and synthetic queries…

Figure 2 visualizes the distribution of test statistics. The blue histogram (Synthetic Queries) overlaps significantly with the orange/brown histogram (True In-Knowledge Queries). Because the distributions match, the critical threshold (the red/blue vertical lines) derived from synthetic data is very close to the one derived from real data. This confirms that developers can use synthetic queries to build reliable guardrails without needing a labeled log of user data.

5. Embedding Models Matter

Not all embedding models are created equal for this task. The researchers compared models like Contriever, MedCPT, and BGE.

Figure 3: Comparison of six different embedding models.

Figure 3 reveals a fascinating misalignment. Look at MedCPT (the third group). It achieves the highest accuracy on the actual QA task (the green bar is high), meaning it’s great at retrieving the right answer when the query is valid. However, its AUROC for detecting OoK queries (the orange bars) is the lowest.

This implies that a model optimized purely for retrieval performance might be “over-fitting” to the domain, making it less capable of differentiating between “slightly relevant” and “completely irrelevant.” It tries too hard to find a match.

6. Detecting Distribution Shift (Offline)

Finally, the offline experiments demonstrated the sensitivity of the KS test.

Figure 5: Offline testing results with the Textbooks corpus.

In Figure 5, the left chart shows p-values. As the ratio of In-Knowledge queries (x-axis) drops (meaning more OoK queries are introduced), the p-value crashes toward zero. The red dashed line represents the significance threshold. This confirms that the system can reliably sound the alarm when the user query stream starts drifting away from the knowledge base’s expertise.

Conclusion and Implications

This paper provides a necessary reality check for the booming RAG ecosystem. It moves us away from the naive assumption that “retrieval equals relevance.”

By establishing a statistical framework, the authors have provided a way to quantify uncertainty in RAG.

  1. Safety: We can now mathematically identify when a system is operating outside its knowledge boundary and force it to abstain from answering.
  2. Self-Supervision: The success of synthetic queries means this framework can be deployed on any document set—legal, technical, or financial—without expensive human labeling.
  3. Maintenance: The offline testing framework gives engineers a “Check Engine” light, signaling exactly when the database needs an update.

As we continue to integrate Large Language Models into critical infrastructure, the ability to say “I don’t know” may turn out to be the most intelligent feature of all.