When the Teacher is Biased: How Spurious Correlations Break Uncertainty Evaluation in LLMs
Large Language Models (LLMs) have a well-known tendency to “hallucinate”—producing fluent but factually incorrect information. To mitigate this, researchers rely on Uncertainty Quantification (UQ). The goal of UQ is simple: we want the model to tell us when it is unsure, so we can flag those responses for human review or discard them entirely. But how do we know whether a UQ method is actually working? We have to test it. Typically, we generate an answer, ask the UQ method for a confidence score, and then check whether the answer is actually correct. If the UQ method assigns low confidence to wrong answers and high confidence to right answers, it works. ...
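The evaluation loop described above is usually scored with a ranking metric such as AUROC: the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. A minimal stdlib-only sketch (the function name `uq_auroc` and the toy data are illustrative, not from the paper):

```python
def uq_auroc(confidences, correct):
    """Probability that a correct answer outranks an incorrect one.

    1.0 means the UQ method separates right from wrong perfectly;
    0.5 means it is no better than chance. Ties count as half a win.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: confident on right answers, unsure on wrong ones.
conf = [0.9, 0.8, 0.3, 0.2]
right = [True, True, False, False]
print(uq_auroc(conf, right))  # → 1.0 (perfect separation)
```

Note that this score is only meaningful if the correctness labels themselves are trustworthy — which is exactly the assumption the article goes on to question.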