Introduction
In the rapidly evolving world of Artificial Intelligence, keeping score is hard. Traditional benchmarks—static lists of questions like the SATs or coding problems—are quickly becoming obsolete. Large Language Models (LLMs) are simply getting too smart for them, or worse, they have memorized the answers from their training data.
To solve this, the AI community has turned to the “wisdom of the crowd.” Platforms like Chatbot Arena have become the gold standard for evaluating model performance. The premise is simple and elegant: pit two anonymous models against each other, have a human ask a question, and let the human vote on which answer is better. It feels fair, unbiased, and representative of real-world usage.
But what if this system isn’t as secure as we think? What if an attacker could identify the anonymous models and vote strategically to manipulate the rankings?
In the paper Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards, researchers from Google, UC Berkeley, CMU, and others uncover a significant vulnerability in these voting-based systems. They demonstrate that the anonymity of these models is fragile and that a motivated adversary could alter the leaderboard with surprisingly little effort.

As illustrated in Figure 1 above, the attack is a two-step process. First, the attacker breaks the anonymity of the battle to identify which model is which. Second, they cast a strategic vote to boost their target model or demote a competitor. This blog post will break down exactly how this attack works, the math behind it, and what can be done to stop it.
Background: Why We Vote and How We Rank
Before diving into the attack, it is essential to understand why we use voting systems and how they function.
The Shift to Subjective Evaluation
Static benchmarks measure capability (e.g., “Can you solve this math problem?”), but they struggle to measure helpfulness or conversational quality. Humans are currently the best judges of open-ended tasks. When you ask a chatbot to “write a funny poem about a toaster,” there is no single correct answer. There is only preference.
Chatbot Arena leverages this by showing users two hidden models (e.g., Model A and Model B). The user prompts them, compares the outputs, and votes for Model A, Model B, a Tie, or Both Bad.
The Bradley-Terry Model
To turn these wins and losses into a ranked list, platforms often use the Bradley-Terry model (similar to the Elo rating system in chess). The core idea is that the probability of Model \(i\) beating Model \(j\) depends on the difference in their skill ratings.
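For concreteness, here is the standard form of that relationship (standard Bradley-Terry notation, not necessarily the exact notation used in the paper). If each model \(i\) has a strength \(Q_i = e^{r_i}\), where \(r_i\) is its rating, then

\[ P(i \text{ beats } j) = \frac{Q_i}{Q_i + Q_j} = \frac{1}{1 + e^{-(r_i - r_j)}} \]

so a rating gap of zero gives a 50/50 outcome, and larger gaps push the win probability toward 1.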
If an attacker can systematically win against high-rated models or lose against low-rated ones (by manipulating votes), they can artificially inflate a model’s rating. The system assumes votes are independent and honest reflections of quality. This research challenges that assumption.
The Core Method: De-anonymization
The linchpin of Chatbot Arena’s security is anonymity. If you don’t know which model is generating the text, you cannot be biased—or malicious. Therefore, the primary technical contribution of this paper is demonstrating just how easy it is to “fingerprint” a model based on its text output.
The researchers propose two distinct methods for an attacker to identify a target model: the Identity-Probing Detector and the Training-Based Detector.
1. The Identity-Probing Detector
This is the “social engineering” approach. The attacker simply asks the model, “Who are you?”
While this sounds trivial, it is surprisingly effective. Many models have system prompts that hardcode their identity. If you ask a model “What is your model name?”, it might reply, “I am Llama 3, developed by Meta.”
However, this method has limitations. Developers of leaderboards are aware of this. They often implement filters to detect and discard votes where the prompt explicitly asks for identity or where the model reveals it. While the researchers found this method effective (achieving >90% accuracy on some models), it is brittle against basic defenses.
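As a toy illustration of what such a filter might look like (the patterns below are hypothetical, not Chatbot Arena’s actual rules), a simple keyword check over the prompt and response is often enough to catch the naive probing attack:

```python
import re

# Hypothetical identity-probing filter: discard a battle if the prompt asks
# about identity or the response reveals a model/developer name.
IDENTITY_PROMPT = re.compile(
    r"\b(who are you|what model|which model|your (model )?name)\b", re.I
)
IDENTITY_REPLY = re.compile(
    r"\b(I am|I'm)\s+(GPT|Claude|Gemini|Llama|Mistral)\b", re.I
)

def discard_vote(prompt: str, response: str) -> bool:
    """Return True if this conversation should not count toward the leaderboard."""
    return bool(IDENTITY_PROMPT.search(prompt) or IDENTITY_REPLY.search(response))
```

A filter like this is cheap to deploy, which is exactly why the identity-probing attack is considered brittle.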
2. The Training-Based Detector
This is the far more robust and scientifically interesting method. Even if a model refuses to say its name, its “voice” is unique. Every LLM has a specific training data distribution, vocabulary preference, and sentence structure.
The researchers treat de-anonymization as a binary classification problem. The goal is to build a classifier \(f\) that, given a prompt \(P\) and a response \(R\), outputs 1 if the response came from the Target Model, and 0 otherwise.
Step A: Prompt Selection
To train this classifier, the attacker needs data. They can query the target model (via API or open weights) and other competitor models with a variety of prompts. The researchers experimented with different prompt categories to see which ones made models “sound” the most distinct.

As shown in Table 1, they used prompts ranging from normal chat in English and low-resource languages (like Indonesian) to specialized tasks like coding and math.
Step B: Feature Extraction
Once the attacker has a dataset of responses (e.g., 50 responses from the Target Model and 50 from others), they need to extract features. The paper explores three simple features:
- Length: The number of words or characters. Surprisingly, different models tend to have consistent length distributions for specific prompts.
- TF-IDF (Term Frequency-Inverse Document Frequency): This statistical measure evaluates how relevant a word is to a document in a collection.
- BoW (Bag-of-Words): This creates a vector representing the frequency of every word in the response, ignoring grammar and order.
Step C: Classification Results
The researchers trained a simple Logistic Regression classifier on these features. The results were startling.

As Table 3 illustrates, using Bag-of-Words (BoW) features resulted in detection accuracies consistently above 90%, and often above 95%. Even simple length metrics provided better-than-random guessing capabilities.
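To make the pipeline concrete, here is a minimal sketch of such a detector, assuming scikit-learn and a small set of responses collected offline (the example texts and variable names are placeholders, not the paper’s actual data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Offline data collection (illustrative placeholders): responses gathered by
# sending the same prompts to the target model and to other models via API.
target_responses = ["Sure! Here's a cheerful poem about a toaster that pops with joy.",
                    "Sure! Here's a silly poem celebrating your trusty toaster."]
other_responses  = ["Certainly. Below is a short poem on toasters.",
                    "Certainly. Please find a brief toaster poem below."]

texts  = target_responses + other_responses
labels = [1] * len(target_responses) + [0] * len(other_responses)

# Bag-of-words features fed to a logistic regression classifier.
detector = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
detector.fit(texts, labels)

# At attack time: estimate whether an anonymous Arena response came
# from the target model.
probability_target = detector.predict_proba(["<anonymous response text>"])[0, 1]
```

With enough collected responses, this simple pipeline is what achieves the 90%+ accuracies reported above.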
To visualize why this works, the authors plotted the principal components of the BoW features.

Figure 2 clearly shows that responses from the same model cluster together. For a given prompt, Gemini (blue circles) occupies a different region of the feature space than Llama-3 (red plus signs). To a classifier, these clusters are as distinct as fingerprints.
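A quick sketch of how such a plot can be produced, projecting bag-of-words vectors onto their first two principal components (toy texts and labels, as in the detector sketch above):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

# Toy stand-ins for responses from two models (1 = target model, 0 = other).
texts  = ["Sure! Here's a cheerful poem about a toaster that pops with joy.",
          "Sure! Here's a silly poem celebrating your trusty toaster.",
          "Certainly. Below is a short poem on toasters.",
          "Certainly. Please find a brief toaster poem below."]
labels = [1, 1, 0, 0]

# Bag-of-words vectors projected onto the first two principal components.
X = CountVectorizer().fit_transform(texts).toarray()
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Responses cluster by model in bag-of-words space")
plt.show()
```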
Which Prompts Work Best?
The effectiveness of the attack varies by domain. The researchers found that specialized prompts make models reveal their differences more starkly.

Figure 3 demonstrates that while English prompts are effective, Math, Code, and Safety-Violating prompts push detection accuracy even higher. Models often have very specific formatting for code (e.g., comments, indentation) or specific refusal scripts for safety violations (“I cannot assist with that…”), which makes them incredibly easy to identify.
Experiments: The Cost of Manipulation
Once an attacker can identify the target model with >95% accuracy, the next question is: how many votes does it take to rig the leaderboard?
The researchers ran extensive simulations using real historical voting data from Chatbot Arena. They defined two goals:
- Up(M, x): Move Model M up by \(x\) ranks.
- Down(M, x): Move Model M down by \(x\) ranks.
The attacker follows a simple policy: Interact with the system. If the detector identifies the target model, cast a vote that helps the objective (upvote the target or downvote the opponent). If the target isn’t present, do nothing (or vote randomly).
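A sketch of this policy in Python, assuming a trained sklearn-style detector like the one in the earlier sketch and focusing on the Up objective (the vote labels and interface are assumptions, not the Arena’s actual API):

```python
import random

def cast_vote(response_a, response_b, detector):
    """One round of the rigging policy for the Up(M, x) objective:
    vote for the target model when the detector spots it, otherwise
    blend in so the account's history looks unremarkable."""
    a_is_target = detector.predict([response_a])[0] == 1
    b_is_target = detector.predict([response_b])[0] == 1

    if a_is_target and not b_is_target:
        return "A"                      # boost the target model
    if b_is_target and not a_is_target:
        return "B"
    # Target absent (or ambiguous): do nothing meaningful, just vote randomly.
    return random.choice(["A", "B", "tie", "both_bad"])
```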
The Numbers
The results are concerning for the integrity of open leaderboards.

Table 4 focuses on high-ranked models. Moving a top-tier model like gpt-4o-mini up just one spot might require around 1,800 malicious votes. While this is not trivial for a single person clicking manually, it is entirely feasible for a script or a small “click farm.”
For lower-ranked models, the bar is even lower.

As seen in Table 8 (from the appendix data), moving a lower-ranked model (like llama-13b) up by 5 positions requires only about 415 votes.
The implication is clear: a dedicated attacker, perhaps a company wanting to promote their new model or a fan group wanting to inflate their favorite open-source LLM, could alter the leaderboard standings with a budget of a few thousand votes.
Mitigations: How to Defend the Arena
The paper doesn’t just outline the attack; the authors also collaborated with Chatbot Arena’s developers to propose solutions. The researchers frame defense in terms of Cost. Since we cannot physically stop a human from looking at a screen and voting, the goal is to make the attack too expensive to be worth it.
The cost of an attack is modeled as:
\[ Cost = c_{detector} + (\text{Accounts} \times c_{account}) + (\text{Actions} \times c_{action}) \]

To defend the system, we must increase \(c_{account}\) (the cost of getting a user account) or implement detection systems that force the attacker to use many accounts (increasing the Accounts term).
1. Authentication and Rate Limiting
The most direct defense is requiring login via Google, GitHub, or verified phone numbers. This increases \(c_{account}\) significantly compared to an open, anonymous web form. Coupled with rate limiting (e.g., only 10 votes per day), an attacker would need to control thousands of verified accounts to launch an attack, making it logistically difficult.
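To make that arithmetic concrete (combining the roughly 1,800 votes from Table 4 with the illustrative 10-votes-per-day limit above, so this is a ballpark figure rather than a number from the paper):

\[ \frac{1{,}800 \text{ votes}}{10 \text{ votes per account per day}} = 180 \text{ account-days} \]

An attacker who wants to finish within a single day would therefore need on the order of 180 verified accounts, each of which adds \(c_{account}\) to the total cost.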
2. Statistical Malicious User Detection
A more sophisticated defense involves math. The defenders can look for statistical anomalies in voting patterns.
The researchers propose two scenarios.
Scenario 1: Known Benign Distribution
If we know what a “normal” user looks like (e.g., they vote for GPT-4 60% of the time vs. Llama-2), we can flag users who deviate wildly from this distribution.
The likelihood of a user’s voting history \(x\) under the benign hypothesis (\(H_{benign}\)) is computed from the benign voting distribution; intuitively, it measures how probable this particular sequence of votes would be if the user were an ordinary, honest voter.

From that likelihood, we compute a test statistic \(T(x)\) that quantifies how “surprising” the user’s history is under \(H_{benign}\).

If the p-value (the probability of seeing behavior at least this surprising by chance) drops below a threshold, the user is flagged.
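A minimal sketch of one way such a test could be implemented, assuming benign behavior is summarized as a multinomial distribution over which model a vote favors and the p-value is obtained by Monte Carlo simulation (this illustrates the idea, not the paper’s exact statistic):

```python
import numpy as np

def surprise(counts, probs):
    """Negative log-likelihood of a vote histogram under the benign
    multinomial vote distribution (up to the multinomial coefficient)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(np.asarray(counts) * np.log(probs + 1e-12), axis=-1)

def flag_suspicious(user_votes, benign_probs, n_sims=10_000, alpha=1e-3, seed=0):
    """Flag a user whose voting histogram looks too unlikely to be benign.

    user_votes   : length-K array, how often this user's votes favored each model.
    benign_probs : length-K array, benign probability of a vote favoring each model.
    """
    rng = np.random.default_rng(seed)
    n = int(np.sum(user_votes))

    observed = surprise(user_votes, benign_probs)

    # Monte Carlo p-value: how often does a genuinely benign user with the
    # same number of votes look at least this surprising?
    sims = rng.multinomial(n, benign_probs, size=n_sims)
    p_value = float(np.mean(surprise(sims, benign_probs) >= observed))

    return p_value < alpha, p_value
```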

Figure 4 below shows the effectiveness of this approach.

The orange line shows that a naive adversary (one who votes randomly when the target isn’t present) is easily detected as their vote count grows. The blue line, however, represents a smart adversary who mimics the public leaderboard rankings for their non-target votes; that adversary remains hidden almost perfectly.
Scenario 2: Perturbed Leaderboards (The “Shadow Ban” approach)
To catch the smart adversary, the defenders can lie slightly. Instead of showing the true leaderboard or model probabilities to everyone, the system can release a “perturbed” version with noise.
If an attacker tries to mimic the public (noisy) distribution, but the defenders check them against the true distribution, the discrepancy reveals the attacker.
The decision rule is based on a likelihood ratio: how well the user’s votes fit the true win probabilities versus the publicly released (perturbed) ones. The win probabilities themselves come from the Bradley-Terry model ratings (\(Q_i\) and \(Q_j\)):

\[ P(\text{model } i \text{ beats model } j) = \frac{Q_i}{Q_i + Q_j} \]
By adding noise to the ratings released to the public, the defender traps the attacker into following a fake signal.
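A rough sketch of how such a check could look, under the simplifying assumption that the defender scores each user’s votes against both the true ratings and the noisy public ones (variable names and interface are assumptions):

```python
import numpy as np

def bt_win_prob(q_i, q_j):
    """Bradley-Terry probability that model i beats model j."""
    return q_i / (q_i + q_j)

def log_likelihood_ratio(votes, true_q, perturbed_q):
    """Sum of per-vote log-likelihood ratios: true ratings vs. released ones.

    votes : list of (i, j, winner) tuples, with winner equal to i or j.
    true_q, perturbed_q : dicts mapping model id -> Bradley-Terry rating,
        the defender's private ratings and the noisy public ones.
    """
    llr = 0.0
    for i, j, winner in votes:
        p_true = bt_win_prob(true_q[i], true_q[j])
        p_pert = bt_win_prob(perturbed_q[i], perturbed_q[j])
        if winner == j:
            p_true, p_pert = 1.0 - p_true, 1.0 - p_pert
        llr += np.log(p_true) - np.log(p_pert)
    return llr

# A user whose votes track the noisy public ratings better than the true
# ones accumulates a clearly negative ratio and can be flagged for review.
```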

Figure 5 shows that as the noise level increases (from blue to purple lines), the detection rate improves significantly. However, there is a trade-off. If you add too much noise to the leaderboard, the rankings become inaccurate for legitimate users too.

As Figure 6 shows, high noise levels (x-axis) cause the average position of models to shift dramatically (y-axis), potentially ruining the utility of the leaderboard.
Conclusion
The transition from objective benchmarks (like accuracy on a math test) to subjective benchmarks (like human preference voting) is necessary for evaluating modern LLMs. However, this paper serves as a crucial wake-up call: subjective systems bring new security risks.
The authors successfully demonstrated that:
- Anonymity is an illusion: Simple machine learning techniques can identify models from their text alone with >95% accuracy.
- Manipulation is cheap: Without defenses, a few thousand votes can reshape the leaderboard.
- Defense is a trade-off: We can mitigate these attacks through authentication and statistical analysis, but perfect security often comes at the cost of user privacy or leaderboard utility.
This research has already had real-world impact. The authors disclosed these vulnerabilities to the Chatbot Arena team before publication, leading to stronger security measures like CAPTCHAs, stricter login requirements, and better bot detection. As AI continues to advance, the methods we use to evaluate it must evolve just as quickly to ensure that the rankings we trust reflect reality, not manipulation.