Large Language Models (LLMs) like LLaMA and Qwen have revolutionized how we interact with information. They draft emails, write code, and summarize complex texts with eerie proficiency. However, these models operate as massive “black boxes.” When an LLM generates a specific fact—or worse, a hallucination—it is notoriously difficult to pinpoint exactly which document in its massive training dataset taught it that specific piece of information.

This problem is not just an academic curiosity. It is central to issues of data copyright, fairness, and safety. If a model generates hate speech or plagiarizes a protected work, developers need to know the source.

The task of tracing a model’s outputs back to the training data that produced them is called Training Data Attribution (TDA). Attribution methods exist, but they often fail when applied to LLMs. A recent research paper, “Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration,” proposes a novel solution called Debias and Denoise Attribution (DDA). This method addresses a fundamental flaw in how we currently calculate attribution: the assumption that our models are perfectly trained.

In this post, we will explore why current attribution methods struggle with LLMs, the mathematics behind “fitting errors,” and how the DDA method fixes these issues to achieve state-of-the-art results.

The Foundation: Influence Functions

To understand the new solution, we first need to understand the standard tool used for this job: Influence Functions.

In machine learning, we typically train models using a principle called Empirical Risk Minimization (ERM). The goal is to find a set of parameters \(\theta\) (the weights of the neural network) that minimizes the loss (error) across the entire training dataset.

\[
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta)
\]

Here, \(\ell(z_i, \theta)\) is the loss for a single training sample \(z_i\), \(n\) is the number of training samples, and \(\hat{\theta}\) represents the optimal parameters found after training.

The “What If” Scenario

Influence functions answer a “counterfactual” question: How would the model’s parameters change if we slightly increased the importance (weight) of just one training sample \(z_t\)?

If upweighting a specific training document significantly changes the model’s parameters in a way that helps predict a specific test answer, we can say that document was “influential.” Mathematically, we imagine adding a small weight \(\epsilon\) to a training sample and finding the new optimal parameters:

\[
\hat{\theta}_{\epsilon, z_t} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta) + \epsilon\, \ell(z_t, \theta)
\]

By applying a Taylor expansion (a method to approximate functions), researchers derived the classic Influence Function formula. It essentially calculates the gradient (direction of steepest ascent) of the loss and scales it by the inverse of the Hessian matrix (the curvature of the loss landscape).

\[
IF(z_t, z_e) = -\nabla_{\theta} \ell(z_e, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1}\, \nabla_{\theta} \ell(z_t, \hat{\theta}), \qquad H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} \ell(z_i, \hat{\theta})
\]

However, calculating the Hessian matrix for an LLM with billions of parameters is computationally infeasible. Therefore, in practice, researchers use a first-order approximation: they simplify the influence score (\(IS\)) to a dot product between the loss gradient of the test sample \(z_e\) and the loss gradient of the training sample \(z_t\).

\[
IS(z_t, z_e) = \nabla_{\theta} \ell(z_e, \hat{\theta}) \cdot \nabla_{\theta} \ell(z_t, \hat{\theta})
\]

This equation essentially measures the similarity between what the model learned from the training sample and what it needs to predict the test sample.
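
To make this concrete, here is a minimal PyTorch sketch of the first-order score: compute the loss gradient for the test sample and the training sample, then take their dot product. The Hugging Face-style `model(**example).loss` call and the use of full-parameter gradients are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch

def loss_gradient(model, example):
    """Flattened gradient of the model's loss on a single example."""
    model.zero_grad()
    loss = model(**example).loss  # assumes an HF-style batch that includes labels
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, train_example, test_example):
    """First-order influence: dot product of the test and training loss gradients."""
    g_test = loss_gradient(model, test_example)
    g_train = loss_gradient(model, train_example)
    return torch.dot(g_test, g_train).item()
```

For billion-parameter models, practitioners typically restrict the gradients to a few layers or project them to a lower dimension to keep this tractable; the sketch ignores that for clarity.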

The Problem: The Myth of Perfect Fitting

The simplified equation above relies on a massive assumption: that the model has reached the absolute minimum of the loss function (perfect convergence).

In reality, LLMs never achieve this.

Due to the sheer size of the data and constraints on computational resources, LLM training is often stopped early, or the optimization gets stuck in local minima. The model has “fitting errors”—it hasn’t perfectly fit the training data.

When the model hasn’t converged, the standard influence function breaks down. The researchers of this paper mathematically proved that when fitting errors exist, the influence score is biased by the starting state of the model (the base model).

Instead of the clean equation above, the actual influence score looks more like this:

\[
IS(z_t, z_e) = \underbrace{\nabla_{\theta} \ell(z_e, \theta') \cdot \nabla_{\theta} \ell(z_t, \theta')}_{\text{noisy term from the training trajectory}} + \underbrace{W_{\epsilon}\, IF_{\theta_0}}_{\text{bias from the base model}}
\]

where \(\theta'\) denotes the parameters actually reached by training, which has not fully converged.

Notice the extra term \(W_{\epsilon} IF_{\theta_0}\). This represents the bias introduced by the base model (\(\theta_0\)). Because the model hasn’t learned perfectly, the “knowledge” it had before fine-tuning interferes with our ability to attribute learning to specific new data points. Furthermore, the first term is affected by noise from the fluctuations of the training process.

The Solution: Debias and Denoise Attribution (DDA)

The researchers propose a two-pronged strategy to fix these inaccuracies: Debias and Denoise.

Strategy 1: Debias

Since the base model introduces a bias that skews attribution, the logical step is to subtract it. However, calculating the exact bias coefficient matrix is complex. The researchers introduce a hyperparameter \(\beta\) (beta) to approximate this correction.

They modify the influence score by subtracting the influence of the base model (\(\theta_0\)) from the influence of the trained model (\(\theta'\)).

\[
IS_{\text{debias}}(z_t, z_e) = \nabla_{\theta} \ell(z_e, \theta') \cdot \nabla_{\theta} \ell(z_t, \theta') \;-\; \beta\, \nabla_{\theta} \ell(z_e, \theta_0) \cdot \nabla_{\theta} \ell(z_t, \theta_0)
\]

By tuning \(\beta\), we can effectively cancel out the interference from the pre-trained knowledge, isolating the contribution of the specific fine-tuning data.
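
In code, the debias step is just a weighted subtraction of two such scores, one from the fine-tuned model and one from the base model. A minimal sketch reusing the hypothetical `influence_score` helper from the earlier snippet:

```python
def debiased_score(tuned_model, base_model, train_example, test_example, beta):
    """Debias: subtract beta times the base model's influence from the fine-tuned model's."""
    return (influence_score(tuned_model, train_example, test_example)
            - beta * influence_score(base_model, train_example, test_example))
```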

Strategy 2: Denoise

During training, a model’s weights fluctuate from epoch to epoch. Depending on exactly when you stop training (e.g., epoch 3 vs. epoch 5), the standard influence score might vary wildly due to overfitting or underfitting at that specific moment.

To smooth this out, the Denoise strategy averages the gradients across multiple checkpoints (\(N\)) during the training process.

\[
IS_{\text{denoise}}(z_t, z_e) = \frac{1}{N} \sum_{k=1}^{N} \nabla_{\theta} \ell(z_e, \theta_k) \cdot \nabla_{\theta} \ell(z_t, \theta_k)
\]

where \(\theta_k\) is the checkpoint saved after the \(k\)-th epoch.

This creates a more stable and robust representation of how the model evolved, rather than relying on a single snapshot in time.
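
In the same hedged style, denoising simply averages the score over a list of saved checkpoints; how many checkpoints to keep and how often to save them is left open here.

```python
def denoised_score(checkpoints, train_example, test_example):
    """Denoise: average the influence score over N saved training checkpoints."""
    scores = [influence_score(ckpt, train_example, test_example) for ckpt in checkpoints]
    return sum(scores) / len(scores)
```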

Putting It Together

The final DDA method combines both strategies. It calculates the average influence over the training trajectory and then subtracts the base model’s influence.

\[
IS_{\text{DDA}}(z_t, z_e) = \frac{1}{N} \sum_{k=1}^{N} \nabla_{\theta} \ell(z_e, \theta_k) \cdot \nabla_{\theta} \ell(z_t, \theta_k) \;-\; \beta\, \nabla_{\theta} \ell(z_e, \theta_0) \cdot \nabla_{\theta} \ell(z_t, \theta_0)
\]

Furthermore, to specifically target hallucinations, the authors use a “Contrastive” approach. They calculate the influence score for the hallucinated output (negative sample) and subtract the influence score for a correct output (positive sample), which isolates the training data specifically responsible for the error.

\[
IS_{\text{DDA}}(z_t) = IS_{\text{DDA}}(z_t, z_e^{-}) - IS_{\text{DDA}}(z_t, z_e^{+})
\]

where \(z_e^{-}\) is the hallucinated (negative) output and \(z_e^{+}\) is the correct (positive) output.
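
Combining the earlier sketches, a rough end-to-end version of the contrastive DDA score for one training document could look like the following. The helper names and the exact weighting are my simplification of the description above, not the authors' code.

```python
def dda_score(checkpoints, base_model, train_example,
              negative_example, positive_example, beta):
    """Contrastive DDA: denoised-and-debiased score for the hallucinated (negative)
    output minus the same quantity for the correct (positive) output."""
    def corrected(test_example):
        averaged = denoised_score(checkpoints, train_example, test_example)
        base = influence_score(base_model, train_example, test_example)
        return averaged - beta * base
    return corrected(negative_example) - corrected(positive_example)
```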

Experimental Setup: The Hallucination Detective

How do you prove that an attribution method works? The authors used a clever “Hallucination Tracing” setup.

  1. Dataset: They used XSum (a summarization dataset).
  2. Injected Hallucinations: They deliberately poisoned the training data. For example, they took summaries containing “England” and swapped the word to “China” in a small percentage of documents.
  3. The Test: After training an LLM on this poisoned data, they fed it prompts that should result in “England.” If the model hallucinated “China,” the attribution method was tasked with finding the specific poisoned documents that caused this error.

This setup provides a ground truth. We know exactly which files caused the hallucination. If DDA points to those files, it works.
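
Because the poisoned documents are known, evaluating an attribution method reduces to scoring every training document and checking that the poisoned ones rank at the top. A minimal sketch of that check, with made-up scores and labels purely for illustration:

```python
from sklearn.metrics import roc_auc_score

# scores[i]: attribution score for training document i (higher = more suspected)
# labels[i]: 1 if document i was poisoned ("England" -> "China"), 0 otherwise
scores = [0.92, 0.10, 0.85, 0.05, 0.11]
labels = [1, 0, 1, 0, 0]

print(f"AUC: {roc_auc_score(labels, scores):.3f}")  # 1.000 here: poisoned docs rank on top
```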

Results and Analysis

The results were compelling. The authors compared DDA against several strong baselines, including:

  • TRAK: A state-of-the-art approximation method.
  • TracIn: A method that tracks gradient changes over the course of training.
  • BM25: A standard keyword-similarity search (to check whether simple word matching could find the culprits).

Superior Accuracy

As shown in Table 1, DDA (the rightmost column) dominates the competition.

Table comparing DDA performance against baselines on LLaMA, Qwen, and Mistral models.

Look at the AUC (Area Under the ROC Curve) scores. While methods like TracIn and TRAK hovered between 50% and 60% (barely better than random guessing in some cases), DDA consistently achieved AUC scores above 90% across different models (LLaMA2, Qwen2, Mistral). This indicates that DDA is exceptionally good at ranking the true culprit documents at the top of the list.

Scalability and Robustness

One concern with new methods is whether they only work on specific model sizes. The authors tested DDA on the Qwen2 model at three different scales: 0.5B, 1.5B, and 7B parameters.

Bar chart showing consistent performance across Qwen model sizes.

Figure 1 shows that DDA maintains high performance (AUC around 90%) regardless of model size. This suggests the method is robust to scale and could likely be applied to even larger foundation models.

Why Both Strategies Matter

Is it just the debiasing? Or just the denoising? The authors performed an ablation study (removing one component at a time) to find out.

Table showing ablation study results.

Table 2 reveals the answer.

  • Full DDA: ~93.5% AUC.
  • Without Denoise: Drops to ~84.8%.
  • Without Debias: Crashes to ~67.9%.

This highlights that while smoothing the noise is helpful, removing the bias of the base model is critical. The base model’s prior knowledge acts as a massive confounder in attribution tasks; without correcting for it, we cannot accurately trace new learning.

The Role of Beta (\(\beta\))

Finally, the researchers analyzed the sensitivity of the hyperparameter \(\beta\), which controls how much base-model influence is subtracted.

Graph showing the impact of the debias coefficient beta on accuracy.

As shown in Figure 2, the accuracy (AUC) improves rapidly as \(\beta\) increases from 0 to about 0.4, and then stabilizes. This is good news for practitioners: the method is stable and doesn’t require finding a “magic number” for \(\beta\) to work effectively.

Conclusion

The “black box” nature of Large Language Models is one of the biggest hurdles to their safe and regulated deployment. When a model makes a mistake or reproduces copyrighted content, we need reliable tools to trace that behavior back to the source.

This research demonstrates that we cannot simply copy-paste attribution methods from traditional deep learning to LLMs. The assumption of “perfect fitting” does not hold for these massive models. By acknowledging the fitting errors—specifically the bias from pre-training and the noise from the training trajectory—the DDA method provides a robust mathematical framework for accurate attribution.

With an AUC exceeding 90% in hallucination tracing tasks, DDA represents a significant step forward. It allows developers to peel back the layers of the neural network and see not just what the model knows, but exactly where it learned it.