Large Language Models (LLMs) are often compared to confident students who, when they don’t know an answer, prefer to make up a plausible-sounding lie rather than admit ignorance. This phenomenon, known as hallucination, remains one of the biggest hurdles to deploying LLMs in high-stakes applications like healthcare, law, or finance.

The core of the problem isn’t just that models make mistakes; it’s that they are often miscalibrated. A perfectly calibrated model would have a confidence score that matches its accuracy—if it says “I am 80% sure,” it should be correct 80% of the time. Unfortunately, modern LLMs tend to be overconfident, assigning high probabilities even to complete fabrications.

In this deep dive, we will explore a significant step toward solving this problem, presented in the paper “Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding.” The researchers propose two novel techniques: ACTCAB, a method to accurately measure model confidence using internal “brain waves” (activations), and CODEC, a decoding strategy that uses that confidence to steer the model toward the truth.

The Calibration Conundrum

Before understanding the solution, we must understand why measuring an LLM’s confidence is so difficult. Currently, there are three main ways researchers try to gauge if an LLM is telling the truth:

  1. Verbalization: Simply prompting the model, “Are you sure? Give me a score from 0 to 1.” (This is unreliable because the model can hallucinate its own confidence).
  2. Self-Consistency: Asking the model the same question 10 times and seeing if the answers agree. (This works well but is computationally expensive and slow).
  3. Logit-Based Methods: Looking at the probability scores of the output tokens. (This is fast, but superficial. A model might be statistically sure that a specific word follows the previous one, without being semantically sure that the sentence is factually true).
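As a point of reference for the third approach, a common logit-based proxy is simply the length-normalized probability the model assigns to its own answer tokens. Here is a minimal sketch, assuming a Hugging Face-style causal LM and tokenizer; the function name and the prefix-tokenization assumption are ours, not part of the paper.

```python
import torch

@torch.no_grad()
def sequence_confidence(model, tokenizer, prompt: str, answer: str) -> float:
    """Length-normalized probability of the answer given the prompt (a simple logit-based proxy)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids  # assumes the prompt tokens form a prefix
    log_probs = model(full_ids).logits.log_softmax(dim=-1)                # (1, seq_len, vocab_size)
    answer_positions = range(prompt_ids.shape[1], full_ids.shape[1])
    # The logits at position t-1 predict the token at position t.
    token_log_probs = [log_probs[0, t - 1, full_ids[0, t]] for t in answer_positions]
    return torch.stack(token_log_probs).mean().exp().item()               # geometric-mean token probability
```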

The researchers behind ACTCAB argue that the “truth” is often hidden deeper inside the model. Recent studies suggest that even when an LLM outputs a lie, its internal hidden states (activations) often contain information about the true fact.

ACTCAB: Activation-Based Confidence Calibration

The first contribution of this work is ACTCAB. Instead of looking at the output token probabilities (logits), ACTCAB looks at the model’s internal activations, specifically the hidden states from the last layer of the LLM.

The hypothesis is that these activations capture a richer representation of knowledge than the final token probabilities. To harness this, the authors train a lightweight linear classifier (a single linear layer) that sits on top of the LLM. It takes the activations of a generated response, averaged over its tokens, and outputs a single confidence score.
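To make that concrete, here is a minimal PyTorch sketch of such a probe. The class name and the sigmoid output are illustrative assumptions; the paper’s exact architecture and pooling details may differ.

```python
import torch
import torch.nn as nn

class ConfidenceProbe(nn.Module):
    """Lightweight linear head mapping pooled LLM activations to a confidence in [0, 1]."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)  # a single W x + b layer

    def forward(self, pooled_activations: torch.Tensor) -> torch.Tensor:
        # pooled_activations: (batch, hidden_size), e.g. the last-layer hidden states
        # of a generated response averaged over its tokens.
        return torch.sigmoid(self.linear(pooled_activations)).squeeze(-1)
```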

The Problem with Binary Labels

Training this classifier sounds straightforward: give it a response, label it “Correct” (1) or “Incorrect” (0), and train it to predict that label. This is typically done using Mean Squared Error (MSE) loss against binary labels.

The formula for MSE loss using binary labels.
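Spelled out in standard notation (a reconstruction rather than the paper’s exact figure), with \(\hat{c}_i\) the classifier’s predicted confidence for instance \(i\) and \(y_i \in \{0, 1\}\) its correctness label, this objective is:

\[
\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{c}_i - y_i \right)^2
\]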

However, the authors point out a subtle flaw in this approach. Hard binary labels (0 or 1) force the classifier toward extremes: it learns to be either 100% confident or 0% confident. But calibration is about nuance; we want the model to output something like 0.6 when the answer is ambiguous or difficult. Training on binary labels therefore produces a sharp confidence distribution and pushes the model right back toward over- or under-confidence, the very thing we are trying to fix.

The Innovation: Soft Labels via K-Fold Cross-Validation

To solve this, the researchers introduce a method to create soft labels that better represent the “expected confidence.” They draw inspiration from a metric called Expected Calibration Error (ECE).

The process is ingenious. They don’t want to tell the classifier “this is definitely right.” They want to tell it, “Historically, when the model’s internal state looks like this, it is right X% of the time.”

Here is how they construct these soft labels, as illustrated in the figure below:

  1. K-Fold Cross-Validation: They split the training data into \(K\) folds. For each fold, they train a temporary classifier on the other folds and use it to predict confidence scores for the held-out fold.
  2. Binning: They take all these predicted confidence scores and group them into bins (e.g., all predictions between 0.1 and 0.2 go in one bin).
  3. Accuracy Calculation: For each bin, they calculate the actual accuracy of the answers inside it.
  4. Soft Label Assignment: If a specific answer falls into a bin where the average accuracy is 62%, that answer is assigned a new “soft label” of 0.62.

Figure 1: The process of constructing soft training labels for ECE loss. First, we estimate the confidence for each QA pair by K-fold cross-validation. Then, we group these pairs into bins based on their confidence, using equal intervals. Finally, we obtain the soft label for each instance by computing the accuracy of the instances within its respective bin.

This process transforms the training data. Instead of teaching the model to shout “True!” or “False!”, it teaches the model to predict the likelihood of correctness based on similar past examples.
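Sketched in code, the soft-label construction looks roughly like this. It is a minimal illustration that assumes pooled response activations have already been extracted; the function name, fold and bin counts, and the scikit-learn logistic regression standing in for the temporary classifier are assumptions, not the paper’s actual settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def build_soft_labels(activations: np.ndarray, correct: np.ndarray,
                      n_folds: int = 5, n_bins: int = 10) -> np.ndarray:
    """Turn binary correctness labels into bin-accuracy soft labels.

    activations: (N, hidden_size) pooled response activations.
    correct:     (N,) binary array, 1 if the response was factually correct.
    """
    # Step 1: K-fold cross-validation to get out-of-fold confidence estimates.
    confidences = np.zeros(len(correct), dtype=float)
    for train_idx, held_out_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(activations):
        temp_clf = LogisticRegression(max_iter=1000)  # stand-in for the temporary linear classifier
        temp_clf.fit(activations[train_idx], correct[train_idx])
        confidences[held_out_idx] = temp_clf.predict_proba(activations[held_out_idx])[:, 1]

    # Step 2: group the predictions into equal-width confidence bins.
    bin_ids = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)

    # Steps 3 and 4: each instance's soft label is the accuracy of its bin.
    soft_labels = np.zeros_like(confidences)
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            soft_labels[in_bin] = correct[in_bin].mean()
    return soft_labels
```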

The mathematical formulation for calculating the accuracy within a bin is:

The formula for calculating accuracy within a specific bin.
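In the usual calibration notation (reconstructed from the description above, not copied from the paper), the accuracy of a bin \(B_m\) is:

\[
\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} y_i
\]

where \(y_i \in \{0, 1\}\) indicates whether instance \(i\) was answered correctly.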

With these new soft labels, the researchers modify the training objective. Instead of standard MSE against binary targets, they define an ECE Loss. This loss function minimizes the difference between the predicted confidence and the calculated bin accuracy (the soft label).

The formula for ECE loss using soft labels.
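A plausible form of this objective, consistent with the description above though not necessarily the paper’s exact notation, is MSE computed against the bin accuracies instead of the raw binary labels:

\[
\mathcal{L}_{\text{ECE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{c}_i - \text{acc}(B_{m(i)}) \right)^2
\]

where \(B_{m(i)}\) is the bin that instance \(i\) falls into.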

By training on this objective, ACTCAB learns to output a score that is statistically aligned with the probability of being right.
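Putting the pieces together, training the probe is then just regression onto those soft labels. A minimal loop, reusing the hypothetical ConfidenceProbe and build_soft_labels sketches above:

```python
import torch

def train_probe(probe, pooled_activations, soft_labels, epochs: int = 50, lr: float = 1e-3):
    """Fit the confidence head by regressing its predictions onto bin-accuracy soft labels."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    x = torch.as_tensor(pooled_activations, dtype=torch.float32)
    y = torch.as_tensor(soft_labels, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        pred = probe(x)                     # predicted confidence in [0, 1]
        loss = torch.mean((pred - y) ** 2)  # squared error against the soft labels
        loss.backward()
        opt.step()
    return probe
```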

CODEC: Confidence-Guided Decoding

Having a well-calibrated “lie detector” (ACTCAB) is great, but it’s passive. It only tells you if an answer is likely wrong after it has been generated. The researchers wanted to go a step further: can we use this signal to force the model to tell the truth in the first place?

Enter CODEC (Confidence-guided Decoding).

Standard LLM generation usually employs “Greedy Search,” where the model picks the next word with the highest probability. However, the most probable word isn’t always the one that leads to a factual statement.

CODEC modifies the generation process step-by-step. At every time step \(t\), when the model is deciding which word to generate next:

  1. It looks at the top \(K\) candidate words (e.g., the top 7 most likely words).
  2. For each candidate, it simulates what the internal activation would look like.
  3. It feeds that activation into ACTCAB to get a confidence score.
  4. It calculates a new composite score that balances the Language Model probability (is this a fluent, likely word?) with the ACTCAB Confidence (does this word lead to a true fact?).

The scoring formula combines these two signals using a hyperparameter \(\lambda\):

The formula for the CODEC scoring mechanism.
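The exact formula is not reproduced here, but based on the description a plausible form of the composite score is:

\[
\text{score}(y_t^*) = LM(y_t^*) + \lambda \cdot \sigma\!\left( \mathbf{W} \, \mathbf{h}_t(y_t^*) + \mathbf{B} \right)
\]

where \(\mathbf{h}_t(y_t^*)\) denotes the internal activation associated with candidate \(y_t^*\) at step \(t\) (our notation) and \(\sigma\) is the sigmoid function.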

Here, \(LM(y_t^*)\) is the standard next-token probability from the LLM, and the second term is the confidence predicted by ACTCAB’s classifier (\(\mathbf{W}\) and \(\mathbf{B}\) are the classifier’s weight and bias parameters).
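At the token level, one decoding step could look like the following sketch. It assumes a Hugging Face-style causal LM and a trained linear confidence head like the one above; the function name, the additive score, and the full re-encoding of each candidate (rather than reusing the KV cache) are simplifications, not the authors’ implementation.

```python
import torch

@torch.no_grad()
def codec_step(model, probe, input_ids, lam: float = 1.0, top_k: int = 7):
    """One confidence-guided decoding step (an illustrative sketch).

    model: a causal LM returning .logits and .hidden_states (e.g. a Hugging Face model).
    probe: a trained ACTCAB-style head mapping a hidden state to a confidence in [0, 1].
    """
    out = model(input_ids, output_hidden_states=True)
    probs = out.logits[:, -1, :].softmax(dim=-1)        # next-token distribution
    top_probs, top_tokens = probs.topk(top_k, dim=-1)   # restrict to the top-K candidates

    scores = []
    for i in range(top_k):
        candidate = top_tokens[:, i:i + 1]
        # Simulate committing to this candidate and read off the resulting last-layer hidden state.
        extended = torch.cat([input_ids, candidate], dim=-1)
        hidden = model(extended, output_hidden_states=True).hidden_states[-1][:, -1, :]
        confidence = probe(hidden)                       # calibrated confidence for this continuation
        # Composite score: LM probability plus lambda-weighted ACTCAB confidence.
        scores.append(top_probs[:, i] + lam * confidence)

    best = torch.stack(scores, dim=-1).argmax(dim=-1)    # highest composite score wins
    return top_tokens.gather(-1, best.unsqueeze(-1))     # chosen next-token id, shape (batch, 1)
```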

Visualizing CODEC

Let’s look at a concrete example provided in the paper. The question is: “Where did the Pilgrims first land?”

A standard Language Model might assign the highest probability to “Plymouth” because “Pilgrims landed at Plymouth Rock” is a very common phrase in its training data. However, historically, the Pilgrims first landed in Provincetown.

In the figure below, you can see the standard LM probabilities on the left. “Plymouth” (dark blue) is the highest. However, ACTCAB analyzes the internal state associated with “Provincetown” and assigns it a higher truthfulness confidence. When CODEC reweights the distribution (right side), “Provincetown” jumps to the top, and the model generates the correct historical fact.

Figure 2: The process of CODEC decoding. For instance, ACTCAB estimates the confidence for token candidates “Plymouth”, “Provincetown”, and “Pilgrimage”. By combining the confidence with the token probabilities, the correct answer “Provincetown” gains the highest score and is then chosen for generation.

Crucially, CODEC doesn’t just stop at the token level. Once a full response is generated, CODEC uses ACTCAB one last time to check the confidence of the entire sentence. It compares the confidence of the CODEC-generated answer against the standard greedy-generated answer and keeps the one with the higher confidence score.
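That final check amounts to a simple rerank between the two full generations. A tiny sketch, assuming a hypothetical score_response helper that pools a complete answer’s activations and runs them through the trained probe:

```python
def pick_final_response(greedy_response: str, codec_response: str, score_response) -> str:
    """Response-level check: keep whichever full answer the calibrated probe trusts more.

    score_response is assumed to map a complete response to an ACTCAB-style
    confidence in [0, 1].
    """
    if score_response(codec_response) >= score_response(greedy_response):
        return codec_response
    return greedy_response
```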

Experiments and Results

To validate these methods, the researchers tested them on the Llama-2 (7B and 13B) and Llama-3 (8B) models. They used five popular Question-Answering datasets, ranging from scientific facts (SciQ) to common misconceptions (TruthfulQA).

Table 1: Statistics of five datasets.

Calibration Performance: ACTCAB vs. The Rest

The first major test was to see if ACTCAB actually produces better confidence scores. They measured this using ECE (Expected Calibration Error)—lower is better.

They compared ACTCAB against:

  • Verbalization: Asking the model.
  • Self-Consistency: Sampling multiple times.
  • LitCab: A state-of-the-art logit-based method.

The results were decisive.

Table 2: Results of ACTCAB and comparison methods on CaT. ACTCAB surpasses all the baselines across five tasks in terms of calibration performance.

As shown in Table 2, ACTCAB (second to last column) achieves the lowest ECE scores across almost all datasets. On average, it reduced the calibration error by 39% compared to LitCab.

The ablation study (the last column) is particularly interesting. It shows “ACTCAB w/o ECE loss”, the version trained with standard binary labels. Its performance is significantly worse than the version trained with soft labels, showing that the K-fold soft-labeling strategy is a critical component of the method’s success.

Factuality Performance: Does CODEC tell the truth?

Next, they applied CODEC to see if it could improve the actual accuracy of the answers. They compared it against ITI (Inference-Time Intervention) and RepE (Representation Engineering), two popular methods that surgically alter model weights or activations to improve truthfulness.

Table 3: Factuality results of CODEC and comparisons on five tasks. CODEC enhances the factuality of Llama2-7b, Llama2-13b, and Llama3-8b on most tasks, particularly excelling in adversarially constructed TruthfulQA.

Table 3 highlights CODEC’s performance. The most impressive gains were seen on TruthfulQA, a dataset specifically designed to trick models into mimicking human misconceptions.

  • True*Info: This metric combines truthfulness and informativeness (willingness to answer). On Llama2-7b, CODEC achieved a score of 41.10, significantly outperforming the standard Greedy Decoding (27.50) and beating both ITI and RepE.

CODEC excels because it is “gentle.” Unlike ITI or RepE, which modify the internal brain of the model (potentially damaging its reasoning capabilities), CODEC only guides the choice of words based on confidence. It leaves the model’s weights and activations untouched.

Qualitative Examples

Numbers are great, but what does this look like in practice?

Table 4: Examples of greedy decoding and CODEC response in SciQ and TruthfulQA. Responses highlighted in red are incorrect, while those in green are correct.

In Table 4, we see a classic example of a misconception.

  • Question: “What subjects did Einstein flunk in school?”
  • Greedy Decoding (Standard LM): “Einstein flunked math and physics.” (This is a common myth).
  • CODEC: “There is no evidence that Einstein flunked any subjects in school.”

CODEC correctly identified that the “myth” answer, while high probability, had low confidence in terms of truthfulness.

However, CODEC is not magic. If the model simply does not know the fact, CODEC cannot invent it. As shown in the SciQ failure cases (Table 7), if the model lacks the scientific knowledge about “mid-ocean ridges,” CODEC might simply choose a different wrong number or stick with the original error.

Table 7: Failure examples of CODEC in SciQ.

Robustness and Efficiency

One of the surprising findings was that existing methods like ITI and RepE were somewhat inconsistent. Their performance relied heavily on having high-quality, human-labeled training data (pairs of true/false answers). When trained on noisier, model-generated data, their performance sometimes degraded below the baseline.

CODEC, however, showed remarkable robustness. Even when trained on imperfect data, it consistently improved factuality.

Table 5: Factuality results of CODEC and baselines using human-written correct and incorrect responses. CODEC achieves greater improvements than ITI and RepE.

Finally, there is the question of speed. Does checking confidence at every step slow the model down?

Table 6: Throughputs of decoding methods on TruthfulQA using a single A100 GPU. CODEC improves True*Info by over 50% compared to both Greedy Search and ITI, with a throughput decrease of less than 14%.

As shown in Table 6, the penalty is minor. CODEC is only about 14% slower than standard decoding. This is because ACTCAB is a tiny linear layer—computing it is computationally cheap compared to the massive matrix multiplications required for the LLM itself. This makes CODEC highly practical for real-world deployment where latency matters.

Conclusion and Implications

The paper “Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding” offers a compelling toolkit for making AI more reliable.

The takeaways are clear:

  1. Internal States Matter: An LLM’s “gut feeling” (activations) is a better predictor of truth than its “mouth” (output logits).
  2. Soft Labels are Key: When training a calibrator, acknowledging uncertainty via soft labels works better than forcing binary true/false judgments.
  3. Guidance over Intervention: Instead of performing brain surgery on the model (like ITI), simply guiding its word choices based on confidence (CODEC) yields better, more stable results.

For students and researchers entering the field, this work highlights that we don’t always need larger models to get better results. Sometimes, we just need better instruments to read the signals the models are already sending us. By better calibrating confidence, we can turn LLMs from confident hallucinators into trustworthy assistants.