Is Chatbot Arena Broken? How Adversaries Can Game LLM Leaderboards
Introduction
In the rapidly evolving world of Artificial Intelligence, keeping score is hard. Traditional benchmarks (static lists of questions, like the SATs or coding problems) are quickly becoming obsolete. Large Language Models (LLMs) are either getting too capable for them or, worse, have memorized the answers from their training data. To solve this, the AI community has turned to the “wisdom of the crowd.” Platforms like Chatbot Arena have become the gold standard for evaluating model performance. The premise is simple and elegant: pit two anonymous models against each other, have a human ask a question, and let the human vote on which answer is better. It feels fair, unbiased, and representative of real-world usage. ...