Large Language Models (LLMs) like LLaMA and Qwen have revolutionized how we interact with information. They draft emails, write code, and summarize complex texts with eerie proficiency. However, these models operate as massive “black boxes.” When an LLM generates a specific fact—or worse, a hallucination—it is notoriously difficult to pinpoint exactly which document in its massive training dataset taught it that specific piece of information.
This problem is not just an academic curiosity. It is central to issues of data copyright, fairness, and safety. If a model generates hate speech or plagiarizes a protected work, developers need to know the source.
The task of tracing a model’s outputs back to the training data that produced them is called Training Data Attribution (TDA). Attribution methods do exist, but they often fail when applied to LLMs. A recent research paper, “Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration,” proposes a novel solution called Debias and Denoise Attribution (DDA). This method addresses a fundamental flaw in how we currently calculate attribution: the assumption that our models are perfectly trained.
In this post, we will explore why current attribution methods struggle with LLMs, the mathematics behind “fitting errors,” and how the DDA method fixes these issues to achieve state-of-the-art results.
The Foundation: Influence Functions
To understand the new solution, we first need to understand the standard tool used for this job: Influence Functions.
In machine learning, we typically train models using a principle called Empirical Risk Minimization (ERM). The goal is to find a set of parameters \(\theta\) (the weights of the neural network) that minimizes the loss (error) across the entire training dataset.
\[
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta)
\]
Here, \(\ell(z_i, \theta)\) is the loss for a single training sample \(z_i\), and \(\hat{\theta}\) represents the optimal parameters found after training.
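To make ERM concrete, here is a minimal PyTorch sketch that fits a toy one-dimensional regression model by minimizing the average loss over a small synthetic dataset. The model, data, and hyperparameters are invented purely for illustration; the paper works with full LLMs, not linear models.

```python
import torch

# Toy dataset: n samples z_i = (x_i, y_i) for a 1-D regression problem.
torch.manual_seed(0)
x = torch.randn(100, 1)
y = 3.0 * x + 0.1 * torch.randn(100, 1)

# A tiny linear model stands in for the network parameters theta.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()  # averages the per-sample losses

# Empirical Risk Minimization: repeatedly nudge theta to reduce
# (1/n) * sum_i loss(z_i, theta) over the whole training set.
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

theta_hat = [p.detach().clone() for p in model.parameters()]  # the (approximate) ERM solution
```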
The “What If” Scenario
Influence functions answer a “counterfactual” question: How would the model’s parameters change if we slightly increased the importance (weight) of just one training sample \(z_t\)?
If upweighting a specific training document significantly changes the model’s parameters in a way that helps predict a specific test answer, we can say that document was “influential.” Mathematically, we imagine adding a small weight \(\epsilon\) to a training sample and finding the new optimal parameters:
\[
\hat{\theta}_{\epsilon, z_t} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, \theta) + \epsilon \, \ell(z_t, \theta)
\]
By applying a Taylor expansion (a method to approximate functions), researchers derived the classic Influence Function formula. It essentially calculates the gradient (direction of steepest ascent) of the loss and scales it by the inverse of the Hessian matrix (the curvature of the loss landscape).
\[
\mathcal{I}(z_t) = \left. \frac{d\hat{\theta}_{\epsilon, z_t}}{d\epsilon} \right|_{\epsilon = 0} = - H_{\hat{\theta}}^{-1} \, \nabla_{\theta} \ell(z_t, \hat{\theta})
\]
However, computing and inverting the Hessian matrix for an LLM with billions of parameters is computationally intractable. Therefore, in practice, researchers use a first-order approximation: they simplify the influence score (\(IS\)) to a dot product between the gradient of the test sample \(z_e\) and the gradient of the training sample \(z_t\).
\[
IS(z_e, z_t) = \nabla_{\theta} \ell(z_e, \hat{\theta}) \cdot \nabla_{\theta} \ell(z_t, \hat{\theta})
\]
This equation essentially measures the similarity between what the model learned from the training sample and what it needs to predict the test sample.
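As a rough PyTorch sketch of that dot-product score: flatten the loss gradient of each sample into a single vector and compare them. The tiny linear model and hand-picked samples below are stand-ins for illustration, not anything from the paper.

```python
import torch

def flat_grad(model, loss_fn, x, y):
    """Flatten a single sample's loss gradient w.r.t. all parameters into one vector."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, loss_fn, z_test, z_train):
    """First-order influence score: dot product of the test and training gradients."""
    g_test = flat_grad(model, loss_fn, *z_test)
    g_train = flat_grad(model, loss_fn, *z_train)
    return torch.dot(g_test, g_train).item()

# Toy usage with a stand-in model; for an LLM, the loss would be the
# token-level cross-entropy of the (prompt, response) pair.
model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
z_train = (torch.tensor([[1.0]]), torch.tensor([[3.0]]))
z_test = (torch.tensor([[2.0]]), torch.tensor([[6.0]]))
print(influence_score(model, loss_fn, z_test, z_train))
```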
The Problem: The Myth of Perfect Fitting
The simplified equation above relies on a massive assumption: that the model has reached the absolute minimum of the loss function (perfect convergence).
In reality, LLMs never achieve this.
Due to the sheer size of the data and constraints on computational resources, LLM training is often stopped early, or the optimization gets stuck in local minima. The model has “fitting errors”—it hasn’t perfectly fit the training data.
When the model hasn’t converged, the standard influence function breaks down. The researchers of this paper mathematically proved that when fitting errors exist, the influence score is biased by the starting state of the model (the base model).
Instead of the clean equation above, the actual influence score looks more like this:
\[
IS(z_e, z_t) \approx IF_{\theta'}(z_e, z_t) + W_{\epsilon} \, IF_{\theta_0}(z_e, z_t)
\]
Notice the extra term \(W_{\epsilon} IF_{\theta_0}\). This represents the bias introduced by the base model (\(\theta_0\)). Because the model hasn’t learned perfectly, the “knowledge” it had before fine-tuning interferes with our ability to attribute learning to specific new data points. Furthermore, the first term is affected by noise from the fluctuations of the training process.
The Solution: Debias and Denoise Attribution (DDA)
The researchers propose a two-pronged strategy to fix these inaccuracies: Debias and Denoise.
Strategy 1: Debias
Since the base model introduces a bias that skews attribution, the logical step is to subtract it. However, calculating the exact bias coefficient matrix is complex. The researchers introduce a hyperparameter \(\beta\) (beta) to approximate this correction.
They modify the influence score by subtracting the base model’s (\(\theta_0\)) influence, scaled by \(\beta\), from the influence of the fine-tuned model (\(\theta'\)).
\[
IS_{\text{debias}}(z_e, z_t) = \nabla_{\theta} \ell(z_e, \theta') \cdot \nabla_{\theta} \ell(z_t, \theta') \;-\; \beta \, \nabla_{\theta} \ell(z_e, \theta_0) \cdot \nabla_{\theta} \ell(z_t, \theta_0)
\]
By tuning \(\beta\), we can effectively cancel out the interference from the pre-trained knowledge, isolating the contribution of the specific fine-tuning data.
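Reusing the `influence_score` helper from the sketch above, the debias step might look like this. The default `beta` is just an illustrative placeholder, not a value recommended by the paper.

```python
def debiased_influence(model_ft, model_base, loss_fn, z_test, z_train, beta=0.4):
    """Debias: influence under the fine-tuned model minus beta times the
    influence under the base (pre-fine-tuning) model."""
    is_ft = influence_score(model_ft, loss_fn, z_test, z_train)
    is_base = influence_score(model_base, loss_fn, z_test, z_train)
    return is_ft - beta * is_base
```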
Strategy 2: Denoise
During training, a model’s weights fluctuate from epoch to epoch. Depending on exactly when you stop training (e.g., epoch 3 vs. epoch 5), the standard influence score might vary wildly due to overfitting or underfitting at that specific moment.
To smooth this out, the Denoise strategy averages the gradients across multiple checkpoints (\(N\)) during the training process.
\[
IS_{\text{denoise}}(z_e, z_t) = \left( \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \ell(z_e, \theta_i) \right) \cdot \left( \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \ell(z_t, \theta_i) \right)
\]
This creates a more stable and robust representation of how the model evolved, rather than relying on a single snapshot in time.
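Building on the `flat_grad` helper above, a minimal sketch of the denoise step averages each sample's gradient over the saved checkpoints before taking the dot product. The `checkpoints` list is assumed to hold models loaded from intermediate training snapshots; how you store and load them is an assumption on my part, not part of the paper.

```python
import torch

def averaged_grad(checkpoints, loss_fn, z):
    """Average a sample's flattened loss gradient over N training checkpoints."""
    grads = [flat_grad(m, loss_fn, *z) for m in checkpoints]
    return torch.stack(grads).mean(dim=0)

def denoised_influence(checkpoints, loss_fn, z_test, z_train):
    """Denoise: dot product of the checkpoint-averaged gradients."""
    g_test = averaged_grad(checkpoints, loss_fn, z_test)
    g_train = averaged_grad(checkpoints, loss_fn, z_train)
    return torch.dot(g_test, g_train).item()
```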
Putting It Together
The final DDA method combines both strategies. It calculates the average influence over the training trajectory and then subtracts the base model’s influence.
\[
IS_{\text{DDA}}(z_e, z_t) = \left( \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \ell(z_e, \theta_i) \right) \cdot \left( \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \ell(z_t, \theta_i) \right) \;-\; \beta \, \nabla_{\theta} \ell(z_e, \theta_0) \cdot \nabla_{\theta} \ell(z_t, \theta_0)
\]
Furthermore, to specifically target hallucinations, the authors use a “Contrastive” approach. They calculate the influence score for the hallucinated output (negative sample) and subtract the influence score for a correct output (positive sample). This highlights the training data specifically responsible for the error.
\[
IS_{\text{contrast}}(z_t) = IS_{\text{DDA}}(z_e^{-}, z_t) - IS_{\text{DDA}}(z_e^{+}, z_t)
\]
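Putting the helpers from the previous sketches together, a hedged sketch of the combined score and its contrastive variant could look like this. Here `z_neg` pairs the prompt with the hallucinated output and `z_pos` with the correct one; both names and the `beta` default are illustrative assumptions.

```python
def dda_score(checkpoints, model_base, loss_fn, z_test, z_train, beta=0.4):
    """DDA: denoised (checkpoint-averaged) influence minus beta times the
    base model's influence."""
    denoised = denoised_influence(checkpoints, loss_fn, z_test, z_train)
    bias = influence_score(model_base, loss_fn, z_test, z_train)
    return denoised - beta * bias

def contrastive_dda(checkpoints, model_base, loss_fn, z_neg, z_pos, z_train, beta=0.4):
    """Score the hallucinated output against the correct one, so that
    training samples driving the error stand out."""
    return (dda_score(checkpoints, model_base, loss_fn, z_neg, z_train, beta)
            - dda_score(checkpoints, model_base, loss_fn, z_pos, z_train, beta))
```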
Experimental Setup: The Hallucination Detective
How do you prove that an attribution method works? The authors used a clever “Hallucination Tracing” setup.
- Dataset: They used XSum (a summarization dataset).
- Injected Hallucinations: They deliberately poisoned the training data. For example, they took summaries containing “England” and swapped the word to “China” in a small percentage of documents.
- The Test: After training an LLM on this poisoned data, they fed it prompts that should result in “England.” If the model hallucinated “China,” the attribution method was tasked with finding the specific poisoned documents that caused this error.
This setup provides a ground truth. We know exactly which files caused the hallucination. If DDA points to those files, it works.
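A poisoning step like the one described above could be sketched as follows. The 2% injection rate, the fixed seed, and the function name are assumptions for illustration; the paper defines its own injection ratios.

```python
import random

def poison_summaries(summaries, target="England", replacement="China",
                     rate=0.02, seed=0):
    """Inject hallucinations: swap `target` for `replacement` in a small
    fraction of summaries and record which documents were poisoned."""
    rng = random.Random(seed)
    poisoned_texts, poisoned_ids = [], set()
    for i, text in enumerate(summaries):
        if target in text and rng.random() < rate:
            poisoned_texts.append(text.replace(target, replacement))
            poisoned_ids.add(i)
        else:
            poisoned_texts.append(text)
    return poisoned_texts, poisoned_ids  # poisoned_ids = attribution ground truth
```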
Results and Analysis
The results were compelling. The authors compared DDA against several strong baselines, including:
- TRAK: A state-of-the-art approximation method.
- TracIn: A method that accumulates gradient information across training checkpoints.
- BM25: A standard keyword similarity search (to prove that the model isn’t just matching words).
Superior Accuracy
As shown in Table 1, DDA (the rightmost column) dominates the competition.

Look at the AUC (Area Under Curve) scores. While methods like TracIn and TRAK hovered between 50% and 60% (barely better than random guessing in some cases), DDA consistently achieved AUC scores above 90% across different models (LLaMA2, Qwen2, Mistral). This indicates that DDA is exceptionally good at ranking the true culprit documents at the top of the list.
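Concretely, the AUC here can be computed by scoring every training document with the attribution method and treating the known poisoned documents as the positive class. A minimal scikit-learn sketch, assuming `scores` holds one influence score per training document and `poisoned_ids` is the ground truth from the poisoning step:

```python
from sklearn.metrics import roc_auc_score

def attribution_auc(scores, poisoned_ids):
    """AUC of ranking training documents by influence score, where the
    known poisoned documents are the positives."""
    labels = [1 if i in poisoned_ids else 0 for i in range(len(scores))]
    return roc_auc_score(labels, scores)
```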
Scalability and Robustness
One concern with new methods is whether they only work on specific model sizes. The authors tested DDA on the Qwen2 model at three different scales: 0.5B, 1.5B, and 7B parameters.

Figure 1 shows that DDA maintains high performance (AUC around 90%) regardless of model size. This suggests the method holds up as models scale and could likely be applied to even larger foundation models.
Why Both Strategies Matter
Is it just the debiasing? Or just the denoising? The authors performed an ablation study (removing one component at a time) to find out.

Table 2 reveals the answer.
- Full DDA: ~93.5% AUC.
- Without Denoise: Drops to ~84.8%.
- Without Debias: Crashes to ~67.9%.
This highlights that while smoothing the noise is helpful, removing the bias of the base model is critical. The base model’s prior knowledge acts as a massive confounder in attribution tasks; without correcting for it, we cannot accurately trace new learning.
The Role of Beta (\(\beta\))
Finally, the researchers analyzed the sensitivity of the hyperparameter \(\beta\), which controls how much base-model influence is subtracted.

As shown in Figure 2, the accuracy (AUC) improves rapidly as \(\beta\) increases from 0 to about 0.4, and then stabilizes. This is good news for practitioners: the method is stable and doesn’t require finding a “magic number” for \(\beta\) to work effectively.
Conclusion
The “black box” nature of Large Language Models is one of the biggest hurdles to their safe and regulated deployment. When a model makes a mistake or reproduces copyrighted content, we need reliable tools to trace that behavior back to the source.
This research demonstrates that we cannot simply copy-paste attribution methods from traditional deep learning to LLMs. The assumption of “perfect fitting” does not hold for these massive models. By acknowledging the fitting errors—specifically the bias from pre-training and the noise from the training trajectory—the DDA method provides a robust mathematical framework for accurate attribution.
With an AUC exceeding 90% in hallucination tracing tasks, DDA represents a significant step forward. It allows developers to peel back the layers of the neural network and see not just what the model knows, but exactly where it learned it.