Imagine you are using a Large Language Model (LLM) to assist with a medical diagnosis or to analyze a complex legal precedent. The model gives you an answer with 99% confidence. You trust it, act on it, and later find out it was completely wrong. This is the nightmare scenario for deploying AI in high-stakes environments.
We often evaluate LLMs based on accuracy—how often they get the right answer. But there is a second, equally important metric that often gets overlooked: Trustworthiness. A trustworthy model isn’t just one that is right; it’s one that knows when it might be wrong. Its confidence level should match the actual likelihood of correctness.
In a recent paper titled “FIRST: Teach A Reliable Large Language Model Through Efficient Trustworthy Distillation,” researchers from HKUST and Wuhan University tackle this exact problem. They propose a novel framework called FIRST (eFfIcient tRustworthy disTillation) that creates smaller, efficient models that are not only accurate but also honest about their own uncertainty.
In this post, we will break down why modern fine-tuning methods create over-confident “liars,” and how the FIRST method fixes this using a clever mix of knowledge distillation and calibration.

The Problem: Tuning-Induced Mis-calibration
To understand the solution, we first have to understand the flaw in how we currently train specific models.
Most usable LLMs today start as massive pre-trained base models. To make them useful for specific tasks (like following instructions or answering questions), we perform Fine-Tuning. This involves training the model on a dataset of questions and correct answers.
While fine-tuning is fantastic for boosting accuracy, the researchers identified a critical side effect: Tuning-Induced Mis-calibration.
What is Calibration?
Calibration is the relationship between a model’s predicted confidence and its actual accuracy.
- Perfect Calibration: If a model makes 100 predictions, each with 80% confidence, exactly 80 of them should be correct.
- Over-confidence: The model predicts with 90% confidence but is only right 60% of the time.
- Under-confidence: The model predicts with 40% confidence but is actually right 80% of the time.
The researchers discovered that standard fine-tuning pushes models toward over-confidence. During training, the cross-entropy loss keeps punishing the model until it assigns a probability of 1.0 (100%) to the correct token and 0.0 to everything else. This forces the model to become “arrogant.”
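To see why, here is a minimal sketch (plain Python, illustrative numbers only) of the pressure the standard cross-entropy objective applies: the loss keeps shrinking as the model piles probability onto the correct token, and it only reaches zero at 100% confidence.

```python
import math

def token_loss(correct_token_prob: float) -> float:
    """Cross-entropy against a one-hot target: -log P(correct token)."""
    return -math.log(correct_token_prob)

# The loss keeps rewarding ever-higher confidence...
for p in (0.6, 0.9, 0.99, 0.999):
    print(f"P(correct) = {p:5.3f}  ->  loss = {token_loss(p):.4f}")

# ...and only reaches zero at P = 1.0. An honest "80% sure" is always
# penalized more than an arrogant "99.9% sure", even when 80% is the
# true likelihood of being right. That pressure is what produces
# tuning-induced over-confidence.
```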

As shown in Figure 3 above, chart (b) corresponds to a fine-tuned small model. The blue bar (accuracy) is high for the top choice, but the large green area reveals massive over-confidence: the model believes it is right far more often than it actually is, which makes it unreliable for decision-making.
The Solution: Distillation
If fine-tuning breaks calibration, how do we fix it? The answer lies in Knowledge Distillation.
In distillation, instead of training a small “student” model on hard labels (a single “correct” token), we train it to mimic the behavior of a larger “teacher” model. The teacher provides a probability distribution, known as soft labels. For example, rather than saying the answer is 100% “Dog”, a teacher might say it’s 85% “Dog”, 10% “Cat”, and 5% “Wolf”.
This nuance helps the student learn relationships between concepts. However, there are two major hurdles to standard distillation:
- Inefficiency: Storing and computing the full probability distribution over the teacher’s vocabulary (often 50,000+ tokens) for every position is prohibitively expensive in both storage and compute.
- Bad Teachers: If the teacher model was itself fine-tuned, it is likely mis-calibrated. If the student mimics a mis-calibrated teacher, the student becomes mis-calibrated too.
The FIRST method solves both of these issues simultaneously.
FIRST: Efficient Trustworthy Distillation
The FIRST framework is built on two key insights: Concentrated Knowledge (to solve efficiency) and Trustworthy Maximization (to solve calibration).
Insight 1: Concentrated Knowledge (Efficiency)
Do we really need the teacher’s opinion on every one of the 50,000+ tokens in its vocabulary to teach the student? The researchers found that the answer is a definitive “no.”
In LLMs, the probability distribution is highly skewed. The vast majority of the “knowledge” regarding a specific prediction is contained in the top few tokens.

As illustrated in Figure 2, the Top-5 tokens alone (the red dot) account for over 95% of the accumulated probability mass. The remaining tens of thousands of tokens hold negligible information (near-zero probability).
By focusing only on the Top-5 tokens, the FIRST method drastically reduces storage and computational overhead. For a standard dataset, storing the full distribution might take 120 TB, whereas storing just the Top-5 takes only 1.2 GB. This makes high-end distillation accessible without massive infrastructure.
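As a rough sketch of what caching this “Concentrated Knowledge” could look like in practice (PyTorch, with illustrative names such as `teacher_logits` that are not from the paper’s code): take the teacher’s distribution, keep only the top-5 probabilities and their token ids, and renormalize.

```python
import torch

def concentrated_knowledge(teacher_logits: torch.Tensor, k: int = 5):
    """Keep only the top-k probabilities per position and renormalize.

    teacher_logits: (seq_len, vocab_size) logits from the teacher.
    Returns (seq_len, k) probabilities and (seq_len, k) token ids.
    """
    probs = torch.softmax(teacher_logits, dim=-1)
    topk_probs, topk_ids = torch.topk(probs, k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize to 1
    return topk_probs, topk_ids

# Toy example: a 3-token sequence over a 50,000-token vocabulary.
logits = torch.randn(3, 50_000)
probs5, ids5 = concentrated_knowledge(logits)
print(probs5.shape, ids5.shape)  # torch.Size([3, 5]) torch.Size([3, 5])
# Storing 5 (probability, token id) pairs per position instead of 50,000
# floats is what shrinks the cached teacher signal by orders of magnitude.
```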
Insight 2: Trustworthy Maximization (Calibration)
Now that we’ve selected the Top-5 tokens, we have to address the second problem: the teacher might be “hallucinating” confidence. If the teacher says the Top-1 token is 99% probable when it should be 80%, we don’t want the student to learn that bad habit.
This is where the Trustworthy Maximization step comes in. Before passing the knowledge to the student, the researchers apply a transformation to the teacher’s probabilities to “re-calibrate” them.
The researchers compared two ways to do this:
- Label Smoothing: Simply subtracting a fixed value from the top prediction and adding it to others.
- Temperature Scaling: A more dynamic approach that reshapes the entire distribution using a single global parameter.
They found that Temperature Scaling was superior. It looks like this:
\[
\tilde{P}_T(i) = \frac{P_T(i)^{1/c}}{\sum_{j} P_T(j)^{1/c}}
\]
Here, \(P_T(i)\) is the probability of token \(i\), and \(c\) is the temperature parameter.
- If \(c > 1\), the distribution flattens (confidence decreases).
- If \(c < 1\), the distribution sharpens (confidence increases).
The team runs a grid search on a validation set to find the optimal temperature \(c\) that minimizes calibration error. This effectively “fixes” the teacher’s over-confidence before the student ever sees the data.
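Here is a minimal sketch of that re-calibration, applying the temperature directly to the stored top-5 probabilities so the behavior matches the description above (c > 1 flattens, c < 1 sharpens). Treat the exact formulation as an assumption rather than the paper’s verbatim implementation.

```python
import torch

def temperature_scale(topk_probs: torch.Tensor, c: float) -> torch.Tensor:
    """Re-calibrate a (seq_len, k) tensor of top-k probabilities.

    Raising each probability to 1/c and renormalizing is equivalent to
    dividing the underlying logits by c, restricted to the kept tokens.
    """
    scaled = topk_probs ** (1.0 / c)
    return scaled / scaled.sum(dim=-1, keepdim=True)

p = torch.tensor([[0.90, 0.05, 0.03, 0.01, 0.01]])
print(temperature_scale(p, c=2.0))  # c > 1: flatter, less confident
print(temperature_scale(p, c=0.5))  # c < 1: sharper, more confident
# In FIRST, c is chosen by grid search on a validation set so that the
# re-calibrated teacher's confidence best matches its actual accuracy.
```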
The Complete Pipeline
Putting it all together, the FIRST pipeline looks like this:

- Fine-tune the Teacher: Start with a large pre-trained model and fine-tune it on the target task.
- Generate Top-5: Extract only the top 5 probabilities (Concentrated Knowledge).
- Optimize Temperature: Find the best temperature \(c\) to minimize error on a validation set.
- Knowledge Matching: Train the student model to minimize the difference (KL Divergence) between its predictions and the re-calibrated teacher predictions.
The loss function used to train the student is the Kullback–Leibler divergence:
\[
\mathcal{L}_{\mathrm{KD}} = \sum_{i} \tilde{P}_T(i) \,\log \frac{\tilde{P}_T(i)}{P_S(i)}
\]
Here, \(\tilde{P}_T\) is the re-calibrated teacher distribution over the top-5 tokens and \(P_S\) is the student’s distribution over those same tokens.
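A hedged sketch of the knowledge-matching step under the assumptions above: gather the student’s probabilities at the teacher’s stored top-5 token ids, renormalize them, and minimize the KL divergence to the re-calibrated teacher distribution (all names are illustrative).

```python
import torch

def distillation_loss(student_logits: torch.Tensor,
                      teacher_topk_probs: torch.Tensor,
                      teacher_topk_ids: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the teacher's top-k tokens.

    student_logits:     (seq_len, vocab_size)
    teacher_topk_probs: (seq_len, k), re-calibrated, summing to 1 per position
    teacher_topk_ids:   (seq_len, k) token ids of the teacher's top-k
    """
    student_probs = torch.softmax(student_logits, dim=-1)
    student_topk = torch.gather(student_probs, -1, teacher_topk_ids)
    student_topk = student_topk / student_topk.sum(dim=-1, keepdim=True)
    # KL(P_teacher || P_student) = sum_i P_T(i) * log(P_T(i) / P_S(i))
    return (teacher_topk_probs *
            (teacher_topk_probs.log() - student_topk.log())).sum(-1).mean()

# Toy usage: 3 positions, 50k vocab, k = 5.
s_logits = torch.randn(3, 50_000, requires_grad=True)
t_probs = torch.softmax(torch.randn(3, 5), dim=-1)
t_ids = torch.randint(0, 50_000, (3, 5))
loss = distillation_loss(s_logits, t_probs, t_ids)
loss.backward()  # gradients flow into the student as in ordinary training
```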
Experimental Results: Does it Work?
The researchers tested FIRST against standard Fine-Tuning and Direct Distillation (without re-calibration) across several datasets (CommonsenseQA, BoolQ, Alpaca).
To evaluate success, they used two primary metrics. The first is the Expected Calibration Error (ECE), which measures the average gap between confidence and accuracy. Lower is better.
\[
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\bigl|\operatorname{acc}(B_m) - \operatorname{conf}(B_m)\bigr|
\]
Here, predictions are grouped into \(M\) confidence bins \(B_m\), and \(n\) is the total number of predictions.
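ECE is straightforward to compute yourself. Here is a compact sketch (10 equal-width confidence bins, a common but arbitrary choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences: predicted probability of the chosen answer, in [0, 1].
    correct: 1 if the chosen answer was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# An over-confident model: always ~95% sure, right only ~60% of the time.
conf = np.full(1000, 0.95)
hits = (np.random.rand(1000) < 0.60).astype(int)
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")  # roughly 0.35
```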
Second, they introduced a composite metric called the Trust Score, which combines accuracy and calibration.

Performance Comparison
The results, summarized in Table 1, are striking.

Key takeaways from the data:
- Fine-tuning is unreliable: Look at the “Fine-tune 7B” row. While accuracy is decent, the ECE (error) is high, leading to a lower Trust Score.
- FIRST is superior: The row “FIRST 7B w/ TS” (Temperature Scaling) consistently achieves the lowest ECE and the highest Trust Score.
- Out-of-Domain Generalization: The right side of the table shows tests on datasets the model wasn’t trained on. Fine-tuned models fall apart here (ECE spikes to 21.9%), but the FIRST model maintains low error (7.1%), showing that it generalizes its sense of uncertainty far better.
Visualizing Reliability
Numbers in a table are one thing, but Reliability Diagrams paint a clearer picture. In these charts, a perfect model follows the diagonal dotted line. Bars below the line indicate over-confidence.

- Fine-tune 7B (Second chart): Massive green bars indicate severe over-confidence. The model is almost always sure it’s right, even when it’s wrong.
- FIRST 7B (Far right): The bars hug the diagonal line tightly. The green (over-confidence) and red (under-confidence) areas are minimal. This model is essentially saying, “I’m 60% sure,” and getting it right 60% of the time.
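If you want to draw this kind of diagram for your own model, here is a minimal matplotlib sketch (bin count and styling are arbitrary choices, not the paper’s):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins: int = 10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        accs.append(correct[mask].mean() if mask.any() else 0.0)
    plt.bar(centers, accs, width=1.0 / n_bins, edgecolor="black", label="accuracy")
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()

# Bars that fall below the dashed diagonal are over-confident bins;
# bars above it are under-confident bins.
```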
Why Temperature Scaling?
Why did the researchers choose Temperature Scaling over simple Label Smoothing?
Label smoothing (equation below) is rigid. It subtracts a fixed \(\delta\) regardless of the context.
\[
\tilde{P}_T(i) =
\begin{cases}
P_T(i) - \delta & \text{if } i \text{ is the top-ranked token} \\
P_T(i) + \frac{\delta}{K-1} & \text{otherwise}
\end{cases}
\]
Here, \(K\) is the number of retained tokens (Top-5, so \(K = 5\)).
Temperature scaling, however, preserves the relative ranking of tokens and adjusts the shape of the distribution. The researchers optimized the temperature coefficient \(c\) on a validation set. As shown in Figure 6, finding that “sweet spot” (around 0.3 in this specific test) drastically reduces the calibration error on the test set.
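To make the contrast concrete, here is a tiny side-by-side sketch (illustrative numbers; the piecewise label-smoothing form follows the description above rather than the paper’s exact equation):

```python
def label_smooth(probs, delta=0.1):
    """Subtract a fixed delta from the top token, spread it over the rest."""
    top = max(range(len(probs)), key=probs.__getitem__)
    share = delta / (len(probs) - 1)
    return [p - delta if i == top else p + share for i, p in enumerate(probs)]

def temp_scale(probs, c=2.0):
    """Flatten (c > 1) or sharpen (c < 1) the whole distribution."""
    scaled = [p ** (1.0 / c) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

p = [0.90, 0.05, 0.03, 0.01, 0.01]
print(label_smooth(p))  # same fixed shift no matter how peaked p is
print(temp_scale(p))    # adjustment depends on the shape of p itself
```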

Conclusion
The FIRST framework represents a significant step forward in making Large Language Models safe for real-world adoption. By acknowledging that accuracy is not enough, the researchers have provided a blueprint for creating models that are self-aware of their limitations.
The method is elegant in its simplicity:
- Don’t use all the data (Top-5 is enough).
- Don’t trust the teacher blindly (Re-calibrate with temperature scaling).
For students and practitioners, this paper highlights a crucial lesson: standard fine-tuning metrics can be deceptive. A model that is 90% accurate but 100% confident is a liability. A model that is 90% accurate and 90% confident is a tool you can actually use. Through methods like FIRST, we can build AI that earns our trust—not by being perfect, but by being honest.