You’ve probably seen it happen: you ask a large language model (LLM) a simple factual question, and it confidently gives you an answer that’s plausible, detailed—and completely wrong.
This behavior, known as hallucination, is one of the biggest barriers to trusting and relying on AI systems today. It’s a lot like asking a student a tough exam question: instead of admitting they don’t know, they try to bluff their way to partial credit with a polished but fabricated answer.
For instance, when one of the authors of a recent research paper asked a state-of-the-art model for his own birthday, he received three entirely different (and incorrect) dates on three separate tries. When asked for the title of his Ph.D. dissertation, three different popular LLMs gave eloquent but wrong answers:
ChatGPT (GPT-4o): “Boosting, Online Algorithms, and Other Topics in Machine Learning.”
DeepSeek: “Algebraic Methods in Interactive Machine Learning”… at Harvard University in 2005.
Llama: “Efficient Algorithms for Learning and Playing Games”… in 2007 at MIT.
The real title, from his 2001 dissertation at Carnegie Mellon University, was:
Probabilistic and on-line methods in machine learning.
Why do models—trained on massive datasets across the internet—make up facts? Are hallucinations an unavoidable, mysterious byproduct of deep neural architectures?
A new paper, “Why Language Models Hallucinate,” argues the answer is both simpler and more fundamental. Hallucinations emerge from two key causes:
Origin during Pretraining – They are a natural statistical consequence of how models learn to imitate the distribution of human language. Far from being a mystical emergent property, they are closely related to ordinary classification errors in machine learning.
Persistence during Post-training – Hallucinations continue because the way we evaluate LLMs—with mainstream benchmarks that reward guessing and penalize “I don’t know”—actively encourages them to bluff when uncertain.
Let’s unpack the argument and see how math, data, and incentives conspire to make hallucinations inevitable.
1. Where Hallucinations Originate: The Pretraining Story
The first stage in building an LLM is pretraining, where a “base model” learns statistical patterns from a massive text corpus. This is essentially density estimation: estimating the probability distribution of human language.
The paper’s provocative claim: even if the training data were perfectly factual and error-free, the statistical process of learning this distribution inevitably leads to errors—including hallucinations.
Generation vs. Classification
Imagine two buckets of sentences:
- Valid set \(\mathcal{V}\) – factual, properly formed content.
- Error set \(\mathcal{E}\) – plausible falsehoods, typos, nonsense.
A generative model’s job is to produce only sentences from the valid bucket.
The authors introduce a simpler but related task: the Is-It-Valid (IIV) classification problem. Given a sentence, can you label it as valid (+) or erroneous (–)?
Generating valid text is strictly harder than recognizing it: if a model generates correct outputs, it essentially knows how to separate valid from invalid. This intuition is formalized through a mathematical reduction that relates generative error rates to classification error rates.
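To make the reduction concrete, here is a minimal Python sketch (our illustration, not the paper's exact construction): it turns a generative model into an IIV classifier by checking whether the model assigns a candidate sentence more than a threshold amount of probability. If the generator rarely produces errors, this classifier separates \(\mathcal{V}\) from \(\mathcal{E}\) well; turned around, if no accurate classifier exists, the generator must be putting probability mass on \(\mathcal{E}\).

```python
# Minimal sketch (illustrative, not the paper's exact construction):
# turn a generative model p_hat into an Is-It-Valid (IIV) classifier
# by thresholding the probability the model assigns to a sentence.

def make_iiv_classifier(p_hat, threshold):
    """p_hat(sentence) -> probability the model generates that sentence."""
    def classify(sentence):
        # Label "+" (valid) when the model itself would plausibly produce it.
        return "+" if p_hat(sentence) > threshold else "-"
    return classify

# Toy stand-in for a trained base model: a lookup table of probabilities.
toy_probs = {
    "Paris is the capital of France.": 0.60,  # valid, high probability
    "Lyon is the capital of France.": 0.02,   # plausible falsehood
}

classify = make_iiv_classifier(lambda s: toy_probs.get(s, 0.0), threshold=0.10)
print(classify("Paris is the capital of France."))  # +
print(classify("Lyon is the capital of France."))   # -
```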
Key definitions:
Error rate (\(err\)) – probability a sentence from the model \(\hat{p}\) falls in \(\mathcal{E}\).
IIV misclassification rate (\(err_{\text{iiv}}\)) – probability the classifier (built from \(\hat{p}\)) labels a sentence incorrectly.
Through the reduction, the paper proves a lower bound that, stated roughly (and omitting lower-order terms), takes the form:
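\[
err \;\ge\; 2\,err_{\text{iiv}} \;-\; \delta
\]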
This means: if the Is-It-Valid classification problem is statistically difficult (high \(err_{\text{iiv}}\)), then any well-trained language model must have a high error rate. Hallucinations are not a deviation from the norm—they are direct consequences of how hard it is to distinguish fact from fiction.
Calibration and the \(\delta\) Term
The \(\delta\) term in the equation measures calibration—how well the model’s confidence matches reality. A calibrated model with 80% confidence should be right 80% of the time at that confidence level.
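To make that operational, here is a small sketch of a binned calibration check (the confidence and correctness arrays are hypothetical, standing in for real evaluation data): it buckets predictions by confidence and compares each bucket's average confidence to its empirical accuracy.

```python
import numpy as np

# Minimal sketch: bucket predictions by confidence and compare each bucket's
# average confidence to its empirical accuracy. For a well-calibrated model
# the two numbers match. (Hypothetical inputs, not data from the paper.)

def calibration_table(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index 0..n_bins-1; clip so that confidence 1.0 lands in the top bin.
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean()))
    return rows  # list of (avg confidence, accuracy) pairs per bucket

# Example: five predictions near 80% confidence, four of them correct.
# One bucket: average confidence ~0.80, accuracy 0.80 -> well calibrated.
print(calibration_table([0.81, 0.78, 0.83, 0.80, 0.79], [1, 1, 1, 1, 0], n_bins=4))
```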
LLMs are pretrained to minimize the cross-entropy loss, which (in standard notation) is the expected negative log-likelihood of human text under the model:
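\[
\mathcal{L}(\hat{p}) \;=\; -\,\mathbb{E}_{x \sim p}\big[\log \hat{p}(x)\big]
\]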
This naturally pushes \(\delta\) close to zero. If \(\delta\) weren’t tiny, a simple rescaling of probabilities could reduce the loss further.
Empirically, pretrained models tend to be well-calibrated. Figure 2—adapted from GPT-4 research—illustrates how calibration is often excellent before post-training, and worsens afterward.
Figure 2: Calibration curves for a pretrained model (left) vs. a post-trained model with reinforcement learning (right). Perfect calibration would be the dashed line; post-training often pushes predictions away from reality.
Since \(\delta\) is small, the main driver of hallucinations in base models is the intrinsic difficulty of the IIV classification problem.
Why IIV Classification Becomes Hard
The paper outlines three classic situations where distinguishing valid from invalid outputs becomes statistically hard:
1. Arbitrary Facts (No Learnable Pattern)
Some facts are inherently pattern-free. Birthdays, for instance, are unrelated to other features—making them unpredictable unless seen in training.
The authors quantify this via the singleton rate (\(sr\)): the fraction of training facts appearing only once.
They prove the hallucination rate is at least \(sr\) minus small terms:
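\[
err \;\ge\; sr \;-\; (\text{small terms})
\]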
If 20% of birthday facts in training are singletons, expect ≥20% hallucinations for such facts. The model simply lacks enough examples to separate truth from myriad wrong answers.
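Estimating the singleton rate from a corpus of extracted facts is straightforward. Here is a minimal sketch (the fact list is hypothetical, and the counting convention, singletons over total fact mentions, is our reading of the definition):

```python
from collections import Counter

# Minimal sketch: estimate the singleton rate (sr) of a set of training facts.
# `facts` is a hypothetical list of (entity, attribute) pairs, e.g. birthdays.

def singleton_rate(facts):
    counts = Counter(facts)
    singletons = sum(1 for fact, c in counts.items() if c == 1)
    return singletons / len(facts)  # fraction of fact mentions that are one-offs

facts = [
    ("Alan Turing", "1912-06-23"),
    ("Alan Turing", "1912-06-23"),       # repeated fact: seen twice in the corpus
    ("Obscure Person A", "1980-03-14"),  # singleton
    ("Obscure Person B", "1975-11-02"),  # singleton
]
print(singleton_rate(facts))  # 2 singletons out of 4 fact mentions -> 0.5
```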
2. Poor Models (Representational Limits)
Even when patterns exist, a limited architecture may fail. A 1990s-era trigram model, for example, predicts the next word from only the previous two—too little context to choose correctly between “her” and “his” in:
She lost it and was completely out of her mind.
He lost it and was completely out of his mind.
The paper shows any trigram model has ≥50% error on this task.
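A tiny simulation makes this concrete (our illustration, not from the paper): after counting trigrams from both sentences, the two-word context ("out", "of") is identical in each, so the model has to split its prediction between "her" and "his" and must be wrong about half the time.

```python
from collections import Counter, defaultdict

# Minimal sketch: a trigram model conditions only on the previous two words,
# so both training sentences give it the *same* context ("out", "of") when it
# must choose between "her" and "his" -- it cannot do better than 50/50.

corpus = [
    "she lost it and was completely out of her mind".split(),
    "he lost it and was completely out of his mind".split(),
]

counts = defaultdict(Counter)
for sent in corpus:
    for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
        counts[(w1, w2)][w3] += 1

context = ("out", "of")
total = sum(counts[context].values())
for word, c in counts[context].items():
    print(word, c / total)  # her 0.5, his 0.5 -- at least one sentence goes wrong
```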
Modern LLMs can fail similarly, such as miscounting the letters in DEEPSEEK, because tokenization splits words into subword units (e.g., D, EEP, SEE, K), which obscures character-level reasoning.
3. Other Factors
- Garbage In, Garbage Out (GIGO) – Models reproduce falsehoods present in the training set.
- Distribution Shift – Out-of-distribution prompts (like riddles or trick questions) lead to larger error rates.
- Computational Hardness – Certain prompts (e.g., breaking encryption) are fundamentally unsolvable, guaranteeing errors.
Bottom line: hallucinations are not mysterious. They are the generative analogue of misclassification errors, arising from statistical realities that have been understood for decades.
2. Why Hallucinations Persist: The Post-training Problem
If hallucinations are statistical errors, surely post-training—alignment and fine-tuning with human feedback—should fix them?
The paper’s answer: not when our evaluation incentives actively promote bluffing.
The Test-Taker’s Dilemma
In a standard multiple-choice exam:
- No penalty for wrong answers → Always guess.
- Penalty for wrong answers → Guess only if confident above a threshold.
Most LLM benchmarks use binary, 0–1 scoring: full credit if correct, zero otherwise. “I don’t know” gets you zero—same as a wrong guess—so it’s always optimal to guess.
The paper formalizes this:
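Under binary grading, answering with confidence \(c\) has expected score

\[
c \cdot 1 + (1 - c)\cdot 0 \;=\; c \;\ge\; 0 \;=\; \text{the score for abstaining},
\]

so guessing is (weakly) optimal at every confidence level (our paraphrase of the argument, not the paper's exact statement).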
They surveyed 10 top benchmarks from Stanford’s HELM, Hugging Face’s Leaderboard, and others. The verdict:
Nearly all penalize uncertainty—so models that hedge or abstain score lower than those that guess. The leaderboard race rewards overconfident guessing.
Even dedicated hallucination tests can’t outweigh this pressure if mainstream benchmarks push in the opposite direction.
3. A Path Forward — Change the Rules
The authors propose reforming existing benchmarks rather than adding more niche hallucination tests. The fix: explicit confidence targets.
Benchmarks could specify in the prompt:
Answer only if you are > t confident. Mistakes lose t/(1-t) points; correct answers gain 1. “I don’t know” gets 0.
For example:
- \(t=0.5\) → penalty 1
- \(t=0.75\) → penalty 3
- \(t=0.9\) → penalty 9
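To see why \(t\) is exactly the break-even point, a quick expected-value check (our arithmetic, following the stated rule) shows that answering with confidence \(c\) is worthwhile only when

\[
c \cdot 1 \;-\; (1 - c)\cdot\frac{t}{1-t} \;\ge\; 0 \quad\Longleftrightarrow\quad c \;\ge\; t.
\]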
This makes abstaining optimal whenever the model’s confidence is ≤ t. It:
- Realigns incentives — uncertainty becomes a viable tactic.
- Enables behavioral calibration — testing if the model abstains appropriately at varied thresholds without requiring explicit probability outputs.
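A benchmark harness could implement the rule in a few lines. The sketch below is hypothetical; the function name, abstention string, and exact-match comparison are illustrative assumptions, not part of any existing leaderboard.

```python
# Hypothetical scorer for a confidence-target benchmark.
# Correct answers earn 1, wrong answers lose t/(1-t), abstentions earn 0.

def score(response: str, gold: str, t: float) -> float:
    if response.strip().lower() == "i don't know":
        return 0.0                      # abstaining is always worth 0
    return 1.0 if response == gold else -t / (1.0 - t)

# With t = 0.9, a wrong guess costs ~9 points, so bluffing no longer pays.
print(score("Paris", "Paris", t=0.9))          # 1.0
print(score("Lyon", "Paris", t=0.9))           # ~-9.0
print(score("I don't know", "Paris", t=0.9))   # 0.0
```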
Incorporating these changes into core leaderboards—MMLU, SWE-bench, GPQA, etc.—would nudge the entire ecosystem toward truthfulness over reckless guessing.
4. Key Takeaways
The “Why Language Models Hallucinate” framework reframes hallucination as a statistically inevitable outcome under current practices:
- It's not magic: it's math. Hallucinations arise during pretraining because distinguishing facts from plausible falsehoods is a hard classification task. Sparse data (singletons), model limitations, and noisy corpora make errors inevitable.
- We get what we measure. Post-training can't erase hallucinations when primary benchmarks reward guessing. Incentives push tuning toward "good test-taker" models, not trustworthy communicators.
- The fix is socio-technical. Reform evaluation norms: align leaderboard scoring with truthful, well-calibrated behavior by introducing penalties for confident wrong answers and making abstention competitive.
If we want models that can honestly say, “I don’t know”, we must stop punishing them for doing so.