In the fast-moving world of Artificial Intelligence, we often obsess over a single metric: accuracy. We want to know if the model got the answer right. But in high-stakes environments—like healthcare diagnosis, legal analysis, or autonomous driving—being “right” isn’t enough. We also need to know how confident the model is in its decision.
Imagine a doctor who is wrong 10% of the time but insists they are 100% sure of every diagnosis. That doctor is dangerous. Similarly, a machine learning model that predicts incorrect outcomes with high confidence is “miscalibrated.”
Standard methods to fix this often require a trade-off: you can make the model’s confidence more realistic, but you might hurt its accuracy or ability to generalize to new data. However, researchers from Arizona State University have proposed a novel solution in their paper, “Fill In The Gaps: Model Calibration and Generalization with Synthetic Data.” Their approach uses Large Language Models (LLMs) to generate synthetic data that specifically targets the model’s “blind spots,” improving both calibration and accuracy simultaneously.
In this post, we will break down their methodology, the mathematical theory behind it, and how they use LLMs to “fill the gaps” in model reliability.
The Problem: High Accuracy, Low Reliability
Modern deep learning models, particularly in Natural Language Processing (NLP), are notorious for being overconfident. As models become deeper and wider (more layers and parameters), they tend to become more miscalibrated.
To understand this, we need to look at Expected Calibration Error (ECE).
What is ECE?
Calibration measures how well predicted probabilities align with actual outcomes. If a model predicts an event with 70% confidence, that event should happen roughly 70% of the time.
To calculate ECE, researchers group predictions into “bins” based on confidence (e.g., all predictions with 0.0-0.1 confidence go in Bin 1, 0.1-0.2 in Bin 2, etc.). For each bin, they calculate the Accuracy (how many were actually correct) and the average Confidence.

\[
\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\Big|\, \text{acc}(B_m) - \text{conf}(B_m) \,\Big|
\]

The ECE is the weighted average of the difference between accuracy and confidence across all bins, where \(B_m\) is the set of predictions falling in bin \(m\) and \(n\) is the total number of predictions. A perfectly calibrated model has an ECE of 0.
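In code, the binned computation looks something like this (a minimal NumPy sketch; the function name and signature are our own, not from the paper):

```python
import numpy as np

def compute_ece(confidences, predictions, labels, n_bins=10):
    """Expected Calibration Error: the weighted average of the
    |accuracy - confidence| gap across equal-width confidence bins."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Assign predictions to this bin by their confidence score
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        accuracy = (predictions[in_bin] == labels[in_bin]).mean()
        avg_confidence = confidences[in_bin].mean()
        ece += (in_bin.sum() / n) * abs(accuracy - avg_confidence)
    return ece
```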

The Reliability Diagram
The best way to visualize this is a Reliability Diagram. In this chart, the x-axis represents confidence, and the y-axis represents accuracy. A perfectly calibrated model follows the diagonal line.
- Under the line: The model is overconfident. (e.g., Confidence is 80%, but Accuracy is only 60%).
- Above the line: The model is underconfident. (e.g., Confidence is 40%, but Accuracy is 60%).
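Given the per-bin average confidence and accuracy values (the same quantities used in the ECE computation), a reliability diagram takes only a few lines of matplotlib; this is a minimal sketch, and the function name is ours:

```python
import matplotlib.pyplot as plt

def plot_reliability_diagram(bin_confidences, bin_accuracies):
    """Plot per-bin average confidence (x) against per-bin accuracy (y).
    Points below the diagonal are overconfident; points above are underconfident."""
    plt.plot([0, 1], [0, 1], "--", color="gray", label="Perfect calibration")
    plt.scatter(bin_confidences, bin_accuracies, label="Model")
    plt.xlabel("Confidence")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```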
The researchers noticed that existing methods, like temperature scaling or random data augmentation, often fail to address specific miscalibrated regions without harming the model’s overall decision-making power.
The Theory: Why Synthetic Data?
Before diving into the “how,” it is helpful to understand the “why.” The researchers grounded their method in the Probably Approximately Correct (PAC) learning framework.
PAC learning provides bounds on how well a model trained on a finite sample will generalize. The researchers derived a bound of this kind for the ECE. Without getting too bogged down in the calculus, the bound highlights two critical things:
- Sample Size Matters: Increasing the training data (\(n\)) tightens the bound, reducing the error.
- The “Gap” Matters: The term \(|\text{Conf}(X) - \text{Conf}(X^*)|\) represents the difference between the model’s predicted confidence and the true underlying probabilities.
The researchers realized that if they could reduce this specific difference—filling the gap between predicted and true confidence—they could lower the calibration error. Since collecting more real data is often expensive or impossible, synthetic data becomes the logical tool to increase sample size (\(n\)) and target that specific gap.
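To make the shape of such a bound concrete, here is a generic PAC-style form assembled from the two quantities above; it illustrates the structure only and is not the exact inequality derived in the paper:

\[
\text{ECE} \;\lesssim\; \big|\text{Conf}(X) - \text{Conf}(X^*)\big| \;+\; O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right),
\]

which holds with probability at least \(1-\delta\). Both levers are visible: a larger sample size \(n\) shrinks the second term, and synthetic data aimed at the confidence gap shrinks the first.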
The Methodology: Fill in the Gaps
The proposed framework is an iterative process involving a downstream model (like a BERT classifier) and a generative LLM (like Llama 2).

Step 1: Identify the “Bad” Bins
First, the model is trained on real data. The researchers then analyze the validation set using a reliability diagram to find bins where the gap between accuracy and confidence is significant (e.g., greater than 0.03).
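A sketch of how this bin-level diagnosis might be implemented on the validation set, reusing the same equal-width binning as the ECE computation above (the function name, threshold argument, and output format are ours):

```python
import numpy as np

def find_miscalibrated_bins(confidences, predictions, labels, n_bins=10, gap=0.03):
    """Return the confidence bins whose |accuracy - confidence| gap exceeds `gap`."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bad_bins = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = (predictions[in_bin] == labels[in_bin]).mean()
        conf = confidences[in_bin].mean()
        if abs(acc - conf) > gap:
            bad_bins.append({"bin": i, "range": (lo, hi),
                             "accuracy": acc, "confidence": conf})
    return bad_bins
```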
Step 2: Strategic Data Generation
This is the clever part. They don’t just generate random data. They generate data specifically designed to shift the decision boundary for those problematic bins.
The strategy depends on the type of error:
- Low Probability & Overconfidence: The model predicts a class with low probability (e.g., 0.3), yet even that modest confidence is higher than its actual accuracy in that region. The solution is to generate synthetic data that pushes such predictions away from the decision boundary.
- Overconfidence vs. Underconfidence: Depending on where the bin lies relative to the perfect diagonal, the method adjusts the target probability for the synthetic data.

As shown in Figure 2 above, if a specific bin is identified (a), the system generates synthetic points (b) that are strategically placed to “pull” the model’s learned distribution toward better calibration.
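Purely as an illustration of the idea (the paper's exact adjustment rule may differ), one simple heuristic is to request synthetic examples whose label mix matches the bin's observed accuracy rather than its inflated or deflated confidence:

```python
def target_label_mix(bin_accuracy, predicted_class="negative", other_class="positive"):
    """Illustrative heuristic: request synthetic examples whose label mix matches the
    bin's observed accuracy, so new points pull the model's confidence toward the diagonal."""
    # e.g. a bin with confidence 0.70 but accuracy only 0.55 -> ask for utterances that
    # are roughly "55% negative / 45% positive", i.e. deliberately ambiguous examples
    p = round(bin_accuracy, 2)
    return {predicted_class: p, other_class: round(1.0 - p, 2)}
```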
Step 3: Prompting the LLM
To generate this data, the researchers use Llama 2. They employ a “two-stage, three-shot” learning approach.
- Stage 1: Ask the LLM to generate a text that belongs to a specific class with a specific probability mix (e.g., “Generate a sentence that is 55% negative and 45% positive”).
- Stage 2: Relabel the generated text to ensure it falls into the correct binary class for training.
Here is an example of what that prompt looks like:
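Paraphrased into a reusable template (the paper's figure shows the exact wording; the variable names here are ours), the two stages look roughly like this:

```python
STAGE_1_TEMPLATE = (
    "Act as a sentiment classifier. Here are three example utterances with their "
    "class probabilities:\n{few_shot_examples}\n"
    "Now generate a new utterance that belongs {p_negative:.0%} to the negative class "
    "and {p_positive:.0%} to the positive class."
)

STAGE_2_TEMPLATE = (
    "Act as a sentiment classifier. Label the following utterance as either "
    "'negative' or 'positive':\n{generated_utterance}"
)

prompt = STAGE_1_TEMPLATE.format(
    few_shot_examples="...",   # three-shot examples drawn from the real training data
    p_negative=0.55,
    p_positive=0.45,
)
```

Stage 2 then feeds the generated utterance back for a hard binary label, matching the relabeling step described above, so the text can be used as ordinary training data.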

By explicitly asking the LLM to mimic the classifier’s uncertainty (e.g., “act as a classifier… generate an utterance that belongs 55% to negative”), they create data points that lie on the “fuzzy” edges of the decision boundary, exactly where the model is struggling.
A Toy Example: Visualizing the Fix
To prove this works mathematically before applying it to complex NLP tasks, the researchers demonstrated it on a 1D logistic regression problem.

Let’s walk through the figure above:
- (a) Original Fit: The dashed red line (fitted model) doesn’t quite match the blue line (true logistic curve).
- (b) Miscalibration: The reliability diagram shows gaps; the points aren’t on the diagonal.
- (c) Adding Synthetic Data: They identified the bad bins (bins 2 and 4) and added synthetic yellow points. Notice how the new red dashed line shifts closer to the blue solid line.
- (d) Result: The final reliability diagram is much tighter to the diagonal.
This simple experiment confirms that adding data points targeted at specific confidence intervals can mechanically “bend” the model back into calibration.
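Here is a self-contained sketch in the same spirit (our own construction, not the authors' exact toy setup): fit a 1-D logistic regression on a small sample, add synthetic points in one confidence region with labels drawn at the target rate, refit, and compare ECE using the `compute_ece` helper sketched earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def true_prob(x):
    """True data-generating probability: P(y = 1 | x) = sigmoid(2x - 1)."""
    return 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))

# Small real sample -> the fitted curve drifts away from the true one
x_real = rng.uniform(-3, 3, size=60)
y_real = rng.binomial(1, true_prob(x_real))
model = LogisticRegression().fit(x_real.reshape(-1, 1), y_real)

# "Fill the gap": add synthetic points in a miscalibrated region, with labels whose
# positive rate matches the target probability for that region, then refit
x_syn = rng.uniform(0.0, 1.0, size=40)
y_syn = rng.binomial(1, true_prob(x_syn))
model_aug = LogisticRegression().fit(
    np.concatenate([x_real, x_syn]).reshape(-1, 1),
    np.concatenate([y_real, y_syn]),
)

# Compare calibration on a held-out test set (exact numbers depend on the seed)
x_test = rng.uniform(-3, 3, size=5000)
y_test = rng.binomial(1, true_prob(x_test))
for name, m in [("original", model), ("augmented", model_aug)]:
    proba = m.predict_proba(x_test.reshape(-1, 1))[:, 1]
    preds = (proba >= 0.5).astype(int)
    conf = np.where(preds == 1, proba, 1 - proba)   # confidence of the predicted class
    print(name, "ECE =", round(compute_ece(conf, preds, y_test), 4))
```

The exact numbers vary with the seed, but targeted points drawn with the right label mix generally pull the fitted curve, and therefore the per-bin confidences, toward the true one.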
Experiments and Results
The team tested this method on four distinct NLP datasets, including sentiment analysis (Tweets, Reviews) and intent classification (Banking). They compared their method against standard calibration techniques like Isotonic Regression, Platt Scaling, and Temperature Scaling.
Key Findings
- Accuracy Boost: Unlike temperature scaling (which only rescales the output probabilities and changes nothing about the model’s internal representations; see the sketch after this list), this method actually retrains the model. This resulted in an accuracy increase of up to 34% in some cases.
- ECE Reduction: The Expected Calibration Error dropped by an average of 33%.
- Outperforming Baselines: The “Synthesis” method consistently achieved a better balance of high accuracy and low ECE compared to traditional methods. While techniques like Monte Carlo Dropout improved calibration, they often hurt accuracy. This method improved both.
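For contrast, here is a minimal sketch of temperature scaling: a single learned temperature rescales the logits, which softens or sharpens the probabilities but never changes which class is predicted, so accuracy cannot improve.

```python
import numpy as np

def temperature_scale(logits, T):
    """Rescale logits by a scalar temperature T and re-apply softmax.
    T > 1 softens the probabilities, T < 1 sharpens them; the predicted
    class (argmax) never changes, so accuracy is untouched."""
    z = np.asarray(logits) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)
```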
The researchers also found that the number of bins used for calibration mattered. Using more bins (e.g., 15 or 20) generally allowed for more precise targeting of synthetic data, leading to better results.
Conclusion and Future Implications
The paper “Fill In The Gaps” offers a compelling argument for the use of synthetic data in machine learning. We often view synthetic data merely as a way to increase volume when data is scarce. However, this research highlights a more sophisticated use case: using synthetic data as a precision tool to tune the reliability of a model.
By leveraging the world knowledge embedded in LLMs like Llama 2, we can generate “difficult” or “ambiguous” training examples that force smaller, downstream models to refine their decision boundaries.
Key Takeaways:
- Calibration is crucial: High accuracy is dangerous if the model doesn’t know when it might be wrong.
- PAC Learning theory supports synthetic data: Mathematically, filling the gap between predicted and true confidence reduces overall error.
- Targeted generation works: Random data augmentation is less effective than targeting specific “miscalibrated bins.”
As we move toward deploying AI in safety-critical sectors, techniques like this—which prioritize “knowing what you don’t know”—will be essential for building trustworthy systems.