Democracy for AI: Why LLM Agents Need Better Voting Systems
Imagine a boardroom meeting. The attendees are not humans, but advanced Large Language Models (LLMs), each acting as an autonomous agent. They have been tasked with making a complex medical diagnosis or debugging a massive software codebase. Each agent has reasoned through the problem and come up with a solution. But here lies the problem: they disagree.
How does the group decide on the final answer?
In the rapidly evolving field of Multi-Agent Systems (MAS), this question is often answered in surprisingly primitive ways. Typically, the system either appoints a “boss” agent to decide for everyone, or the agents take a simple majority vote. But as human history and political science have taught us, how you count the votes can be just as important as the votes themselves.
In a fascinating new paper titled “An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making,” researchers Xiutian Zhao, Ke Wang, and Wei Peng argue that AI collaboration is suffering from a lack of democratic sophistication. By applying Social Choice Theory (the study of collective decision-making) to AI, they demonstrate that we can significantly boost the intelligence and robustness of LLM systems simply by changing how they vote.
In this deep dive, we will explore why current AI decision-making is flawed, the paradoxes of voting, and how a new framework called GEDI is introducing complex electoral systems to the world of artificial intelligence.
The Status Quo: Dictators and Majority Rule
Multi-agent collaboration is one of the most exciting frontiers in AI. The premise is simple: two heads (or LLMs) are better than one. By creating a team of agents—some acting as critics, others as planners or coders—researchers have achieved performance leaps in everything from mathematical reasoning to creative writing.
However, the mechanism for aggregating these agents’ insights—the Collective Decision-Making (CDM) process—has been largely overlooked.
The authors of this paper surveyed 52 recent LLM-based multi-agent systems to understand how they handle disagreement. The results were stark.

As shown in Figure 1, the landscape is dominated by two methods:
- Dictatorial (Dark Blue): A single agent is pre-assigned as the leader. They might listen to others, but the final call is theirs alone.
- Plurality (Orange): The system chooses the option with the most “first-choice” votes. This is the familiar “most votes wins” approach (often loosely called majority rule, even though the winner may fall well short of an actual majority).
There is a severe lack of diversity here. Only one studied system used a utilitarian approach (trying to maximize a reward function), and many didn’t specify a method at all. This simplistic approach ignores centuries of research into how groups make optimal decisions.
Why “Simple Majority” is Often Wrong
You might ask, “What is wrong with Plurality voting? If the majority wants Option A, shouldn’t Option A win?”
This is where Social Choice Theory enters the picture. This field, pioneered by figures like Kenneth Arrow (who won a Nobel Prize for his work), uses mathematics to analyze voting systems. It turns out that Plurality voting is riddled with logical paradoxes that can lead to suboptimal or even irrational outcomes.
To understand why LLM agents need better voting systems, we need to look at three specific failures of simple voting: The Spoiler Effect, the Condorcet Failure, and Monotonicity Violations.
1. The Spoiler Effect (Independence of Irrelevant Alternatives)
In a robust decision-making system, introducing a losing candidate shouldn’t flip the winner between the top two candidates. This is known as the Independence of Irrelevant Alternatives (IIA) criterion. Plurality voting fails this test miserably.

Consider Figure 8. In the first scenario, “Amber” beats “Blue” because it has more votes. But if we introduce a third option, “Coral,” which is similar to Amber, it splits the vote. Suddenly, Blue wins, even though nothing changed about the agents’ preference for Amber vs. Blue. In an LLM context, if an agent generates a new, slightly wrong answer that looks like the correct answer, it could “spoil” the vote and cause a completely wrong answer to win.
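To make the vote-splitting concrete, here is a minimal Python sketch of a Plurality election (the vote counts are illustrative, not taken from the paper’s figure). Adding Coral, a near-clone of Amber, flips the winner to Blue even though no agent changed their mind about Amber vs. Blue:

```python
from collections import Counter

def plurality_winner(ballots):
    """Each ballot is a ranked list; Plurality counts only first choices."""
    return Counter(ballot[0] for ballot in ballots).most_common(1)[0][0]

# 10 agents, two options: Amber wins comfortably.
two_way = [["Amber", "Blue"]] * 6 + [["Blue", "Amber"]] * 4
print(plurality_winner(two_way))   # -> Amber (6 first-choice votes to 4)

# Introduce Coral, a near-duplicate of Amber. The Amber camp splits
# between the two similar options, yet every one of those agents still
# prefers Amber to Blue.
three_way = ([["Amber", "Coral", "Blue"]] * 3
             + [["Coral", "Amber", "Blue"]] * 3
             + [["Blue", "Amber", "Coral"]] * 4)
print(plurality_winner(three_way))  # -> Blue (4 votes beats 3 and 3)
```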
2. The Condorcet Failure
The Condorcet Criterion states that if one candidate would beat every other candidate in a head-to-head race, that candidate should win the election.

Figure 9 illustrates how Plurality fails here. “Blue” is the Plurality winner because it has the most first-place votes (4). However, look at the pairwise breakdown.
- Amber beats Blue (7 vs 3).
- Amber also beats Red in their head-to-head matchup by a clear majority.
Amber is the “Condorcet Winner”—the option that the group broadly prefers over all others—yet Plurality voting eliminates it because it wasn’t enough agents’ absolute favorite. In AI reasoning, this often happens when the “correct” answer is everyone’s second choice, but the agents are split on various wrong “first choices.”
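Detecting a Condorcet winner is mechanical: for every pair of options, count how many agents rank one above the other. Here is a short sketch (with an illustrative ballot profile, not the paper’s exact numbers) in which Blue takes the Plurality crown while Amber wins every head-to-head contest:

```python
from itertools import combinations

def condorcet_winner(ballots, options):
    """Return the option that beats every rival head-to-head, if one exists."""
    wins = {a: {b: 0 for b in options if b != a} for a in options}
    for ballot in ballots:
        pos = {opt: i for i, opt in enumerate(ballot)}
        for a, b in combinations(options, 2):
            if pos[a] < pos[b]:
                wins[a][b] += 1
            else:
                wins[b][a] += 1
    n = len(ballots)
    for a in options:
        if all(wins[a][b] > n / 2 for b in options if b != a):
            return a
    return None  # a Condorcet winner is not guaranteed to exist

# Blue has the most first-place votes (4), but Amber beats Blue 6-4
# and beats Red 7-3 in head-to-head matchups.
ballots = ([["Blue", "Amber", "Red"]] * 4
           + [["Amber", "Red", "Blue"]] * 3
           + [["Red", "Amber", "Blue"]] * 3)
print(condorcet_winner(ballots, ["Amber", "Blue", "Red"]))  # -> Amber
```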
3. Monotonicity Violations
A voting system is “monotonic” if gaining more support never hurts a candidate. Shockingly, some voting systems (like Instant Runoff Voting) can violate this.

As seen in Figure 10 (top), agents switching their votes to support a candidate can actually cause that candidate to lose, because the change alters the order in which other candidates are eliminated.
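This paradox is easy to reproduce. Below is a minimal IRV implementation run on a textbook-style profile (the numbers are illustrative, not from the paper): ten agents promote candidate A to their first choice, and that extra support changes the elimination order and hands the win to C.

```python
from collections import Counter

def irv_winner(ballots):
    """Instant-runoff: repeatedly eliminate the option with the fewest
    first-choice votes and transfer its ballots to their next choice."""
    remaining = {opt for ballot in ballots for opt in ballot}
    while True:
        tally = Counter(next(opt for opt in ballot if opt in remaining)
                        for ballot in ballots)
        top, votes = tally.most_common(1)[0]
        if votes * 2 > len(ballots):
            return top
        remaining.remove(min(tally, key=tally.get))

# Before: C is eliminated first, its ballots transfer to A, and A wins 65-35.
before = [["A","B","C"]]*39 + [["B","C","A"]]*35 + [["C","A","B"]]*26
print(irv_winner(before))  # -> A

# Ten B-first agents switch to ranking A first: extra support for A.
# Now B is eliminated first, its ballots transfer to C, and C wins 51-49.
after = [["A","B","C"]]*49 + [["B","C","A"]]*25 + [["C","A","B"]]*26
print(irv_winner(after))   # -> C
```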
The authors summarize these theoretical failures in Table 1.

Arrow’s Impossibility Theorem proves that no ranked voting system can satisfy all of the desirable fairness criteria simultaneously. However, as the table shows, Plurality satisfies very few of them. This suggests that by switching to other methods (like Ranked Pairs), we might achieve more robust reasoning.
Enter GEDI: A General Electoral Decision-Making Interface
To solve this, the researchers developed GEDI (General Electoral Decision-making Interface). Instead of asking LLMs for a single answer, GEDI asks agents to provide a ranked list of preferences (e.g., “I think A is best, then B, then C”).
GEDI then processes these rankings using various algorithms derived from human political systems.

Figure 2 highlights the architectural shift.
- Dictatorial (Informed): One agent reads everyone’s opinion and decides.
- Plurality: Everyone votes for their #1 pick.
- GEDI (Preferential Voting): Agents submit detailed rankings. The system aggregates these to find a consensus that reflects the entire preference profile, not just the top pick.
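Structurally, this interface can be tiny. The sketch below is a hypothetical reconstruction of the idea (the class and function names are ours, not the paper’s code): agents submit full rankings, and the aggregation rule is just a pluggable function over the preference profile.

```python
from collections import Counter
from typing import Callable

class ElectoralInterface:
    """A GEDI-style aggregator: swap in any voting rule without
    touching the agents. (Hypothetical structure, for illustration.)"""

    def __init__(self, rule: Callable[[list], str]):
        self.rule = rule

    def decide(self, ballots: list) -> str:
        # Every agent must rank the same set of candidate answers.
        assert all(set(b) == set(ballots[0]) for b in ballots)
        return self.rule(ballots)

def plurality(ballots):
    """Uses only first choices; preferential rules read the whole ranking."""
    return Counter(ballot[0] for ballot in ballots).most_common(1)[0][0]

gedi = ElectoralInterface(rule=plurality)
print(gedi.decide([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]))  # -> A
```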
The Voting Methods Tested
The authors implemented several voting mechanisms within GEDI. Here is a simplified breakdown of the key methods, followed by a runnable sketch of two of them:
- Borda Count: A consensus-based method. If there are 4 options, your 1st choice gets 3 points, 2nd gets 2 points, etc. The option with the most points wins. This rewards broadly acceptable answers rather than polarizing ones.
- Instant-Runoff Voting (IRV): Also known as Ranked Choice Voting. If no option holds a majority of first-choice votes, the option with the fewest is eliminated, and its ballots are redistributed to those voters’ next choices. This repeats until a winner emerges.
- Minimax: A method focused on minimizing regret. For every pair of options, we count how many voters prefer A to B. The “Minimax” score for a candidate is the biggest defeat they suffer in a head-to-head match. The winner is the candidate whose biggest defeat is the smallest.
- Ranked Pairs: A more sophisticated method that locks in majorities. It examines every pair of candidates (A vs B, B vs C, and so on), sorts the pairwise victories from strongest to weakest, and locks each one into a graph of preferences in turn, skipping any result that would create a cycle (A > B > C > A).
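To make two of these concrete, here is a compact sketch of Borda Count and Minimax over ranked ballots (a minimal illustration; the paper’s implementation details may differ):

```python
from collections import Counter, defaultdict
from itertools import permutations

def borda(ballots):
    """Borda Count: with m options, rank i (0-indexed) earns m - 1 - i points."""
    m = len(ballots[0])
    scores = Counter()
    for ballot in ballots:
        for i, opt in enumerate(ballot):
            scores[opt] += m - 1 - i
    return scores.most_common(1)[0][0]

def minimax(ballots):
    """Minimax: elect the option whose worst head-to-head defeat is smallest."""
    options = list(ballots[0])
    prefer = defaultdict(int)  # prefer[(a, b)] = voters ranking a above b
    for ballot in ballots:
        pos = {opt: i for i, opt in enumerate(ballot)}
        for a, b in permutations(options, 2):
            if pos[a] < pos[b]:
                prefer[(a, b)] += 1
    worst_defeat = lambda a: max(prefer[(b, a)] for b in options if b != a)
    return min(options, key=worst_defeat)

ballots = [["A", "B", "C"]] * 4 + [["B", "C", "A"]] * 3 + [["C", "B", "A"]] * 2
print(borda(ballots), minimax(ballots))  # -> B B (the broad consensus pick)
```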
The Experiments: Does Democracy Make AI Smarter?
The authors tested GEDI on three massive benchmarks: MMLU (general knowledge), MMLU-Pro (a harder variant), and ARC-Challenge (reasoning). They used a variety of LLMs, from open-source models like Llama-3 and Mistral to proprietary giants like GPT-3.5 and GPT-4.
In these experiments, “voters” were simply instances of the model prompted to rank the multiple-choice options.
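Concretely, that means prompting each agent for a full ranking and parsing the reply into a ballot. The sketch below is hypothetical: ask_model stands in for whatever LLM call you use, and the prompt wording is ours, not the paper’s.

```python
import re

RANK_PROMPT = """You are answering a multiple-choice question.
Question: {question}
Options: {options}
Rank ALL of the option letters from most to least likely, separated
by '>'. Reply with the ranking only, e.g. "C > A > D > B"."""

def parse_ranking(reply, letters):
    """Extract a full ranking like 'C > A > D > B'; reject partial ballots."""
    found = [ch for ch in re.findall(r"[A-Z]", reply) if ch in letters]
    ranking = list(dict.fromkeys(found))  # drop repeats, keep first mention
    return ranking if set(ranking) == letters else None

def collect_ballots(ask_model, question, option_letters, n_agents=5):
    """Query n_agents independent instances and keep the well-formed ballots."""
    letters = set(option_letters)
    ballots = []
    for _ in range(n_agents):
        reply = ask_model(RANK_PROMPT.format(question=question,
                                             options=", ".join(option_letters)))
        ballot = parse_ranking(reply, letters)
        if ballot is not None:  # skip malformed replies rather than guessing
            ballots.append(ballot)
    return ballots
```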
Key Finding 1: Voting Beats Dictatorship
The results were compelling. Across almost all models and benchmarks, using an electoral approach outperformed the standard “Blind Dictatorial” baseline, in which a single randomly chosen agent decides alone.

Table 2 shows the accuracy gains.
- Red numbers indicate improvement over the baseline.
- GPT-4 on MMLU: Using Plurality or Ranked Pairs increased accuracy by nearly 7 percentage points (from 75.6% to roughly 82.5%).
- Smaller Models: Smaller models like Llama-3-8b saw gains, but less dramatic ones than the larger models.
- Informed Dictator Failure: Interestingly, the “Informed Dictatorial” column (where one agent sees everyone’s votes and decides) often performed worse or barely better than simple voting. This suggests that a single LLM, even when given all the information, is worse at aggregating preferences than a mathematical voting algorithm.
Key Finding 2: The “Magic Number” is 3
Do you need a senate of 100 agents to get these benefits? The data suggests otherwise.

Figure 3 tracks accuracy as the number of agents increases. The steepest improvement happens between 1 and 3 agents. Once you have a “quorum” of about three to five agents, performance stabilizes. This is great news for efficiency—you don’t need massive compute resources to get the benefits of democratic decision-making.
Key Finding 3: Robustness Against “Rogue” Agents
Real-world agents might hallucinate or fail. To test robustness, the authors injected “unreliable” agents (who voted randomly) into the group.

Figure 4 demonstrates that voting methods are highly resilient. Even with 3 or 4 unreliable agents in a group of 10, the accuracy (y-axis) remains stable for methods like Plurality and Ranked Pairs. However, look at the Red Line (Informed Dictatorial). It collapses much faster. If your “Dictator” is the one who goes rogue, the whole system fails. Decentralized voting removes this single point of failure.
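You can get a feel for this resilience with a toy Monte Carlo simulation (our own simplified model, not the paper’s experimental setup): reliable agents pick the correct option with some fixed probability, rogue agents vote uniformly at random, and we watch how often Plurality still lands on the right answer as rogues replace reliable voters.

```python
import random
from collections import Counter

def simulate(n_reliable, n_rogue, p_correct=0.6, n_options=4, trials=2000):
    """Fraction of trials in which Plurality picks option 0 (the 'truth')."""
    hits = 0
    for _ in range(trials):
        firsts = [0 if random.random() < p_correct
                  else random.randrange(1, n_options)
                  for _ in range(n_reliable)]
        firsts += [random.randrange(n_options) for _ in range(n_rogue)]
        winner, _ = Counter(firsts).most_common(1)[0]
        hits += winner == 0
    return hits / trials

# Degrade a 10-agent committee: accuracy falls gracefully, not off a cliff.
for n_rogue in range(5):
    print(n_rogue, "rogue agents ->", round(simulate(10 - n_rogue, n_rogue), 3))
```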
Nuance: It’s Not Just About Winning
One of the most insightful parts of the paper is the analysis of how these methods differ. It turns out that the “best” voting system depends on what you want to achieve.
Hit Rate vs. Accuracy
Sometimes, you don’t need the exact right answer immediately; you just need the right answer to be in the top 3 (Hit-Rate@K).

Figure 5 shows that while Plurality (Green line) is decent at finding the #1 spot, it is often outperformed by Borda Count (Blue bars) when looking at the top 2 or top 3. Because Borda Count awards points for being 2nd or 3rd place, it is excellent at surfacing “good compromise” answers that might not be anyone’s favorite but are likely correct.
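The mechanism is easy to see in code. In the toy profile below (our own example, not the paper’s data), option B collects zero first-place votes, so Plurality splits three ways among A, C, and D. Yet B is every agent’s runner-up, so it tops the Borda aggregate and is guaranteed to appear in any top-k shortlist:

```python
from collections import Counter

def borda_ranking(ballots):
    """Full aggregate ranking: with m options, rank i earns m - 1 - i points."""
    m = len(ballots[0])
    scores = Counter()
    for ballot in ballots:
        for i, opt in enumerate(ballot):
            scores[opt] += m - 1 - i
    return [opt for opt, _ in scores.most_common()]

def hit_rate_at_k(questions, k):
    """questions: list of (ballots, correct_answer) pairs."""
    hits = sum(correct in borda_ranking(ballots)[:k]
               for ballots, correct in questions)
    return hits / len(questions)

ballots = [["A", "B", "C", "D"], ["C", "B", "A", "D"], ["D", "B", "C", "A"]]
print(borda_ranking(ballots))                # -> ['B', 'C', 'A', 'D']
print(hit_rate_at_k([(ballots, "B")], k=1))  # -> 1.0
```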
Subject-Matter Sensitivity
The effectiveness of voting also fluctuates based on the subject matter (e.g., Math vs. History).

Figure 6 shows the spread of improvement across different subjects. Range Voting (Dark Blue) and Informed Dictatorial (Red) have high variance—they are risky. They might work amazingly on one topic and fail on another. Ranked Pairs and Minimax tend to be more consistent.
Furthermore, Figure 7 (below) zooms in on specific comparisons.

This chart compares Plurality vs. Ranked Pairs (Top) and Plurality vs. Borda Count (Bottom). The bars going to the left indicate subjects where the complex method (Ranked Pairs/Borda) beat the simple Plurality method. The specific subjects vary, suggesting that future AI systems might dynamically switch voting methods based on the type of question being asked.
Conclusion: The Future of AI is Political
The GEDI framework presents a compelling argument: we cannot ignore the “politics” of multi-agent systems. As we build more complex AI architectures involving swarms of agents, relying on a single “Dictator” agent is a bottleneck for intelligence and a risk for reliability.
The key takeaways from this research are:
- Democracy works for AI: Aggregating preferences via voting almost always outperforms individual reasoning.
- Complexity pays off: Sophisticated methods like Ranked Pairs and Minimax satisfy more theoretical criteria and often outperform simple majority voting.
- No single point of failure: Voting systems buffer the system against hallucinating or malicious agents.
The authors conclude that while no voting system is mathematically perfect (thanks, Arrow!), diversifying the decision-making landscape is essential. Future research may look beyond just “correctness” and use these voting systems to align AI with human values—ensuring that the decisions AI agents make are not just smart, but representative, fair, and safe.
In the end, teaching AI to hold an election might be one of the best ways to ensure it serves us well.