Judging the Judges—How Bayesian Statistics Fixes LLM Evaluation
If you have played with ChatGPT, Claude, or Llama, you know that evaluating these models is tricky. Unlike a math test, there is no single “correct” answer for writing a poem, summarizing a news article, or chatting about philosophy.

For a long time, the gold standard was human evaluation. You would generate two responses and ask a human, “Which one is better?” But human evaluation is slow, expensive, and not scalable.

This led to the rise of LLM-as-a-judge: using a strong model (like GPT-4) to evaluate weaker models. It’s fast, cheap, and scales infinitely. ...
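The pairwise protocol described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the prompt wording and the helper names (`build_judge_prompt`, `parse_verdict`) are assumptions, and the judge's reply is mocked here rather than fetched from a real model API.

```python
# Hypothetical sketch of the LLM-as-a-judge pairwise protocol:
# build a comparison prompt, then parse the judge's verdict.
# The judge call itself is mocked; in practice the prompt would be
# sent to a strong model (e.g. GPT-4) via its API.

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise comparison prompt for the judge model."""
    return (
        "You are an impartial judge. Given the question and two answers, "
        "reply with exactly 'A' or 'B' for the better answer.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Verdict:"
    )

def parse_verdict(judge_reply: str) -> str:
    """Extract 'A' or 'B' from the judge's raw reply."""
    for token in judge_reply.strip().upper():
        if token in ("A", "B"):
            return token
    raise ValueError(f"No verdict found in: {judge_reply!r}")

prompt = build_judge_prompt(
    "Summarize the news article in one sentence.",
    "The article is about stuff.",
    "The central bank raised rates by 0.25% to curb inflation.",
)
verdict = parse_verdict("Verdict: B")  # mocked judge reply
print(verdict)  # B
```

In practice, judges of this kind exhibit position and verbosity biases, which is exactly the kind of noise a statistical treatment of the judge's reliability aims to correct.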