Inside the Black Box: Using Gradients and Embeddings to Catch LLM Hallucinations

Large Language Models (LLMs) like GPT-4 and LLaMA have transformed how we interact with information. They write code, compose poetry, and answer complex questions. But they have a notorious flaw: hallucinations. We’ve all seen it—an LLM confidently asserts a “fact” that is completely made up, citing non-existent court cases or inventing historical events.

For students and researchers in NLP, solving hallucination is the “holy grail” of current reliability research. Most existing solutions treat the model as a “black box,” simply asking it, “Are you sure about that?” via prompting. But what if we could look under the hood?

In this post, we are deep-diving into a fascinating paper titled “Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection.” The researchers propose a method called EGH (Embedding and Gradient-based Hallucination detection). Instead of trusting the model’s text output, they analyze the model’s internal states—specifically its embeddings and gradients—to mathematically detect when it is making things up.

By the end of this article, you will understand how comparing a model’s “conditional” and “unconditional” behavior can reveal the truth, and how Taylor Expansions—yes, that calculus concept you learned in undergrad—are the key to unlocking this detection method.


1. The Core Intuition: Ignoring the Question

To understand the solution, we must first understand the problem from a probabilistic perspective.

When you ask an LLM a question (\(Q\)), it generates an answer (\(A\)). Ideally, the probability of that answer should depend heavily on the question. We denote this as \(P(A|Q)\)—the probability of \(A\) given \(Q\).

However, when a model hallucinates, something interesting happens. It stops relying on the source text or the specific question and starts relying on its internal priors (what it memorized during training). In a sense, the generated answer \(A\) becomes loosely coupled, or even unrelated, to \(Q\).

The researchers pose a clever hypothesis: We can measure hallucination by calculating the distance between the model’s behavior with the question and its behavior without the question.

The Two Modes

To test this, the paper proposes feeding the LLM two different inputs:

  1. Conditional Query: The standard input where we feed the Question (\(Q\)) and the Answer (\(A\)).
  2. Unconditional Query: We feed only the Answer (\(A\)), replacing the Question tokens with zero-padding to keep the dimensions the same.

Figure 1: Two query forms of our method. Conditional query jointly inputs the question-answer text, while unconditional query only inputs the answer text. We use zero padding in the unconditional query to align the dimension.

As shown in Figure 1 above, the top path shows the standard flow: “Who is LeBron?” \(\rightarrow\) “A basketball player.” The bottom path effectively blinds the model to the question.

If the model is answering truthfully based on the given question, its internal state (embeddings and output probabilities) should look very different between these two scenarios. If it is instead reciting from internal memory (hallucinating), the presence or absence of the specific question changes that state far less, or at least in a characteristically different way. The discrepancy between these two states is what we need to measure.
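To make this concrete, here is a minimal sketch of how the two query forms might be built with PyTorch and the Hugging Face transformers library. It is not the authors’ code: the model name is a small stand-in (the paper uses LLaMA-2-7B and OPT-6.7B), and zeroing the question positions at the embedding level is one reading of the zero-padding step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"           # small stand-in; the paper uses 7B-scale models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Who is LeBron?"
answer = " A basketball player."

q_ids = tokenizer(question, return_tensors="pt").input_ids
a_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
qa_ids = torch.cat([q_ids, a_ids], dim=1)

embed = model.get_input_embeddings()
with torch.no_grad():
    qa_embeds = embed(qa_ids)                        # conditional input: [Q, A]
    uncond_embeds = qa_embeds.clone()
    uncond_embeds[:, : q_ids.shape[1], :] = 0.0      # zero-pad the question positions: [0, A]

    cond_out = model(inputs_embeds=qa_embeds, output_hidden_states=True)
    uncond_out = model(inputs_embeds=uncond_embeds, output_hidden_states=True)
```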


2. The Math: From Probability to Gradients

How do we quantify this discrepancy? The authors define a distance metric, \(D\).

Let \(P(A|Q)\) be the probability distribution of the answer given the question, and \(P(A|\mathbf{0})\) be the probability of the answer given the zero-padded (unconditional) input. The distance is defined as:

\[D = \mathrm{Difference}\big(P(A|Q),\ P(A|\mathbf{0})\big)\]

Here, Difference could be a standard metric like KL Divergence or Cross-Entropy.
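Continuing the snippet above, a token-level KL Divergence between the two output distributions could be computed roughly as follows. This is a sketch, not the paper’s code; the one-position offset between logits and the tokens they predict is handled explicitly here.

```python
import torch.nn.functional as F

q_len = q_ids.shape[1]
a_len = a_ids.shape[1]

# Logits at position t predict token t+1, so the distributions over the answer
# tokens sit at positions q_len-1 .. q_len+a_len-2.
cond_logits = cond_out.logits[:, q_len - 1 : q_len + a_len - 1, :]
uncond_logits = uncond_out.logits[:, q_len - 1 : q_len + a_len - 1, :]

log_p_cond = F.log_softmax(cond_logits, dim=-1)
log_p_uncond = F.log_softmax(uncond_logits, dim=-1)

# KL( P(A|Q) || P(A|0) ), averaged over the answer positions.
kl = (log_p_cond.exp() * (log_p_cond - log_p_uncond)).sum(dim=-1).mean()
```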

Why Simple Metrics Aren’t Enough

You might be thinking: “Why don’t we just calculate the KL Divergence between the two probabilities and be done with it? If the divergence is high, it’s hallucinating.”

The researchers tried this. While there is a statistical difference, it turns out that collapsing all the model’s complexity into a single number (a scalar like KL Divergence) throws away too much information. It’s too imprecise to serve as a reliable detector.

Enter the Taylor Expansion

This is where the paper introduces its main innovation. Instead of just looking at the output probability scalar, they analyze the factors contributing to that difference using a Taylor Expansion.

They treat the difference \(D\) as a function and expand it around the unconditional input point \([\mathbf{0}, A]\). The first-order Taylor expansion gives us this equation:

\[D([Q, A]) = D([\mathbf{0}, A]) + [\nabla D([\mathbf{0}, A])]^T \big([Q, A] - [\mathbf{0}, A]\big) + R_1\]

Don’t let the notation scare you. Let’s break down the three parts of this equation:

  1. \(D([\mathbf{0}, A])\): This is a constant term (the baseline divergence).
  2. \(R_1\): These are higher-order terms (second derivatives, etc.). The authors argue these are computationally too expensive and likely unnecessary, so they are ignored.
  3. The Middle Term: \([\nabla D([\mathbf{0}, A])]^T ([Q, A] - [\mathbf{0}, A])\).

This middle term is the gold mine. It contains two distinct components that the authors extract as features for their detector:

  • The Embedding Difference: \(([Q, A] - [\mathbf{0}, A])\) represents the change in the input representation. Since we can’t subtract raw tokens, we use the embeddings from the model’s hidden layers.
  • The Gradient: \(\nabla D([\mathbf{0}, A])\) represents how sensitive the divergence is to changes in the input.

This theoretical derivation proves that to truly capture the relationship between the conditional and unconditional outputs, we need to look at both the Embedding (\(E\)) and the Gradient (\(G\)).
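Dropping \(R_1\), the expansion can be restated compactly in terms of these two features (a paraphrase of the derivation using the notation above, not a formula copied from the paper):

\[D([Q, A]) \approx D([\mathbf{0}, A]) + G^T E, \qquad E = [Q, A] - [\mathbf{0}, A], \quad G = \nabla D([\mathbf{0}, A])\]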


3. The EGH Architecture

Based on the math above, the researchers built the EGH (Embedding and Gradient-based Hallucination detection) method. It is a “white-box” method because it requires access to the model’s internals—you can’t do this with just an API that returns text.

Let’s walk through the architecture using the schematic below.

Figure 2: The algorithm schematic of the EGH.

Step 1: Extracting Feature \(E\) (Embeddings)

The model processes both the Question-Answer pair and the Zero-Answer pair. The system extracts the final hidden state embeddings from the LLM for both passes.

  • \(E(A|Q)\): Embedding with the question.
  • \(E(A|\mathbf{0})\): Embedding without the question.

The feature \(E\) is simply the vector difference between them:

\[E = E(A|Q) - E(A|\mathbf{0})\]
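Continuing the earlier snippets, \(E\) could be extracted like this. Mean pooling over the answer positions is an illustrative choice for reducing the hidden states to a single vector, not something taken from the paper.

```python
# Final-layer hidden states from the two forward passes.
h_cond = cond_out.hidden_states[-1]        # shape: [1, seq_len, hidden_dim]
h_uncond = uncond_out.hidden_states[-1]

# Pool over the answer positions to get one vector per pass.
E_cond = h_cond[:, -a_len:, :].mean(dim=1)
E_uncond = h_uncond[:, -a_len:, :].mean(dim=1)

E = E_cond - E_uncond                      # feature E, shape: [1, hidden_dim]
```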

Step 2: Extracting Feature \(G\) (Gradients)

This step is slightly more complex.

  1. The system calculates the standard KL Divergence between the output probability distributions of the two passes, summed over the answer tokens: \(D_{KL} = \sum_i P(a_i|Q)\,\log \frac{P(a_i|Q)}{P(a_i|\mathbf{0})}\).
  2. It then performs backpropagation from this KL Divergence loss value back to the embedding layer of the unconditional input.
  3. The resulting gradient vector is our feature \(G\):

\[G = \nabla D([\mathbf{0}, A])\]
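As a rough sketch of this backpropagation step (continuing the snippets above; an illustration rather than the authors’ implementation):

```python
# Only input gradients are needed, so freeze the model parameters.
for p in model.parameters():
    p.requires_grad_(False)

# Re-run the unconditional pass with gradients enabled on its input embeddings.
uncond_embeds_g = uncond_embeds.detach().clone().requires_grad_(True)
uncond_out_g = model(inputs_embeds=uncond_embeds_g)
uncond_logits_g = uncond_out_g.logits[:, q_len - 1 : q_len + a_len - 1, :]

log_p_cond = F.log_softmax(cond_logits, dim=-1)            # fixed target distribution
log_p_uncond = F.log_softmax(uncond_logits_g, dim=-1)

# KL( P(A|Q) || P(A|0) ) over the answer positions, backpropagated to the input.
kl_loss = (log_p_cond.exp() * (log_p_cond - log_p_uncond)).sum(dim=-1).mean()
kl_loss.backward()

grad = uncond_embeds_g.grad                                # gradient w.r.t. the [0, A] embeddings
G = grad[:, -a_len:, :].mean(dim=1)                        # pooled the same way as E (a choice, not the paper's)
```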

Step 3: The Hallucination Detector

Now we have two rich vectors, \(E\) and \(G\), which mathematically represent the first-order approximation of the hallucination behavior. These vectors are fused using a weighted sum, with a hyperparameter \(\lambda\) controlling the relative importance of embeddings versus gradients:

\[\hat{y} = f\big(\lambda E + (1 - \lambda) G\big)\]

Finally, a simple Multi-Layer Perceptron (MLP)—a small neural network—takes this fused vector and classifies it as either “Hallucination” (1) or “Not Hallucination” (0).
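Completing the running sketch, the fusion and classification step might look like this. The MLP’s hidden size is an illustrative choice, \(\lambda = 0.8\) comes from the ablation discussed in Section 6, and in practice the detector is trained on many labeled \((E, G)\) pairs rather than applied untrained to a single example as shown here.

```python
import torch.nn as nn

lam = 0.8                                # lambda: weight on the embedding feature
fused = lam * E + (1.0 - lam) * G        # shape: [1, hidden_dim]

# Small MLP detector head; the hidden size is illustrative, not from the paper.
detector = nn.Sequential(
    nn.Linear(fused.shape[-1], 256),
    nn.ReLU(),
    nn.Linear(256, 2),                   # logits for {0: not hallucination, 1: hallucination}
)

pred = detector(fused).argmax(dim=-1)    # 1 = hallucination, 0 = not hallucination
```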


4. Why Not Just Use Probabilities?

Earlier, we claimed that simple metrics like KL Divergence or Cross-Entropy weren’t enough. The authors provide empirical evidence for this.

They visualized the distribution of KL Divergence values for hallucinated (orange) vs. non-hallucinated (blue) samples.

Figure 3: Probability Distribution for KL Divergence and Cross Entropy.

Look closely at Figure 3. While there is a shift—hallucinated samples tend to have slightly lower divergence (meaning the model didn’t change its mind much whether the question was there or not)—there is massive overlap.

If you tried to draw a vertical line (a threshold) to separate blue from orange, you would make a lot of mistakes. This confirms that while the signal exists in the probabilities, it is too weak on its own. The EGH method works better because it unpacks the vectors that create these probabilities, providing a much higher-dimensional view of the problem.

To drive this point home, the authors trained a simple Logistic Regression model using just the scalar KL and Cross-Entropy values.

Table 6: Comparison of Logistic Regression with simple metrics versus EGH.

As Table 6 shows, using simple metrics yields an accuracy of only ~67% on QA tasks. The EGH method? 97.19%. The difference is night and day.
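The scalar baseline in Table 6 is essentially the following. This sketch uses synthetic stand-in data; in the real setting, `kl_values`, `ce_values`, and `labels` would be the per-sample KL Divergence, Cross-Entropy, and 0/1 hallucination labels, and the 80/20 split is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins, just so the snippet runs end to end.
rng = np.random.default_rng(0)
kl_values = rng.normal(size=1000)
ce_values = rng.normal(size=1000)
labels = rng.integers(0, 2, size=1000)

X = np.stack([kl_values, ce_values], axis=1)   # two scalar features per sample
split = int(0.8 * len(labels))                 # arbitrary 80/20 train/test split

clf = LogisticRegression().fit(X[:split], labels[:split])
print("accuracy:", accuracy_score(labels[split:], clf.predict(X[split:])))
```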


5. Experimental Results

The researchers tested EGH on several benchmark datasets, including HaluEval, SelfCheckGPT, and HADES. They used LLaMA-2-7B and OPT-6.7B as the base models.

Performance on HaluEval

HaluEval contains specific tasks like QA, Dialogue, and Summarization.

Table 1: Results in HaluEval Benchmark comparing EGH to baselines.

In Table 1, look at the “QA” column.

  • The baseline (asking the model to check itself) gets 59-63% accuracy.
  • GPT-4 (using CoNLI) gets 86.20%.
  • EGH (using LLaMA-2-7B) achieves 97.19%.

This is a stunning result. It suggests that a smaller, open-source model equipped with white-box detection can significantly outperform even GPT-4 prompting strategies in detecting hallucinations.

Generality: SelfCheckGPT

Does this method generalize? The authors trained their detector on one dataset (HaluEval) and tested it on a completely different one (SelfCheckGPT).

Table 2: Results on the SelfCheckGPT dataset showing generalization capabilities.

Table 2 shows that even when trained on a different dataset, EGH achieves an AUC of 87.23%, which is competitive with methods trained specifically on that data. This suggests the patterns of “embedding difference” and “gradient sensitivity” are universal indicators of hallucination, not just dataset-specific quirks.

Does the Model Architecture Matter?

The method isn’t limited to LLaMA. They also tested it on BERT, RoBERTa, and GPT-2 using the HADES dataset.

Table 3: Results on HADES dataset with BERT and RoBERTa.

Table 3 confirms that EGH consistently beats benchmarks across different architectures, proving it is a model-agnostic solution (as long as you have white-box access).


6. What Matters More: Embeddings or Gradients?

The method combines Embeddings (\(E\)) and Gradients (\(G\)) using a weight \(\lambda\).

\[ \text{Input} = \lambda E + (1-\lambda)G \]

If \(\lambda=1\), we use only embeddings. If \(\lambda=0\), we use only gradients.
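Reproducing this kind of sweep is straightforward in principle. Here is a self-contained sketch with synthetic stand-in features and a tiny MLP; none of it is the authors’ code, and the accuracies it prints are meaningless on random data—it only shows the mechanics of varying \(\lambda\).

```python
import torch
import torch.nn as nn

def fuse_and_score(E_feat, G_feat, labels, lam, epochs=200):
    """Train a tiny MLP on lam*E + (1-lam)*G and return training accuracy.
    Purely illustrative: real experiments would evaluate on held-out data."""
    x = lam * E_feat + (1.0 - lam) * G_feat
    mlp = nn.Sequential(nn.Linear(x.shape[-1], 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(mlp(x), labels).backward()
        opt.step()
    return (mlp(x).argmax(dim=-1) == labels).float().mean().item()

# Synthetic stand-in features, just to make the sweep runnable.
E_feat, G_feat = torch.randn(200, 32), torch.randn(200, 32)
labels = torch.randint(0, 2, (200,))

for lam in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    print(f"lambda={lam:.1f}  acc={fuse_and_score(E_feat, G_feat, labels, lam):.3f}")
```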

The authors ran an ablation study to see which component was doing the heavy lifting.

Table 4: Ablation Study on Weights on dialogue task.

Table 4 reveals an interesting nuance.

  • Using only gradients (\(\lambda=0\)): 74.22% accuracy.
  • Using only embeddings (\(\lambda=1\)): 76.54% accuracy.
  • Using both (\(\lambda=0.8\)): 77.39% accuracy.

Both features contribute, but Embeddings seem slightly more discriminative. However, the combination yields the best results, validating the Taylor Expansion theory that both terms are relevant.


7. Conclusion: The Future of White-Box Detection

The “Embedding and Gradient Say Wrong” paper makes a compelling case for moving beyond black-box prompting. By mathematically modeling the difference between “answering a question” and “just talking,” the authors provided a rigorous method to catch hallucinations.

Key Takeaways

  1. Context Matters: Hallucination is fundamentally about the disconnect between the generated answer and the source input.
  2. Look Inside: Probabilities (output logits) are insufficient. You need the rich information stored in embeddings and gradients.
  3. Taylor Expansion Works: Calculus provides a solid theoretical foundation for feature extraction in deep learning interpretation.
  4. SOTA Performance: EGH achieves 97% accuracy on QA tasks, vastly outperforming prompt-based checks.

Limitations & Future

The main drawback of EGH is the computational cost. Calculating gradients (backpropagation) during inference is expensive and slower than a simple forward pass. It requires memory and compute power, making it harder to deploy in real-time, low-latency applications.

However, for offline validation, auditing LLMs, or high-stakes scenarios where accuracy is paramount, EGH offers a powerful new tool in the fight for trustworthy AI. It reminds us that sometimes, to find the truth, we have to stop listening to what the model says and start looking at what its neurons are doing.