Introduction

In the age of information overload, making an informed political decision is becoming increasingly difficult. During major political events, such as the 2024 European Parliament elections, voters are bombarded with manifestos, debates, and media commentary. To navigate this, many citizens turn to Voting Advice Applications (VAAs). These are traditional, rule-based web applications where users answer a fixed questionnaire (e.g., “Do you support the Euro?”), and the system matches them with the political party that best aligns with their views.

While useful, traditional VAAs are static. They offer a “one-size-fits-all” experience that cannot answer follow-up questions or explain the nuance of a specific policy. This has led researchers to ask a pivotal question: Could Large Language Models (LLMs) serve as the next generation of voting assistants?

Imagine a personalized AI that doesn’t just calculate a score but discusses political stances with you, offering context and reasoning. However, before we hand over our democratic decision-making to AI, we must verify its accuracy. Does the model actually know what the parties stand for? Does it hallucinate? Does it favor certain political ideologies over others?

In this post, we dive deep into a recent study by Ilias Chalkidis from the University of Copenhagen. The paper investigates the capabilities of state-of-the-art open-source models (Mistral and Mixtral) to predict political stances. The study not only audits these models but also explores sophisticated techniques—like Retrieval-Augmented Generation (RAG) and Self-Reflection—to see if we can engineer a more accurate digital political consultant.

Background: The Challenge of Political AI

Large Language Models are trained on vast amounts of internet text, which inherently includes political discourse. Previous research has shown that LLMs possess a surprising amount of political knowledge and reasoning capability. However, they also carry significant risks:

  1. Hallucination: Models can confidently state incorrect facts.
  2. Bias: Models may exhibit political leanings (often left-libertarian) based on their training data.
  3. Outdated Knowledge: A model trained in 2022 might not know a party’s stance on a 2024 crisis.

To act as a reliable VAA, an LLM needs to accurately predict how a specific political party would answer a specific policy question.

The Models

The researchers focused on “open-weight” models, which offer greater transparency and accessibility than closed systems like GPT-4. They utilized:

  • MISTRAL 7B: A smaller, efficient model.
  • MIXTRAL 8x7B: A larger “Mixture of Experts” model that activates only a subset of its expert sub-networks for each input, offering higher performance at a manageable compute cost.

The Benchmark: EUANDI-2024

To evaluate the models, the researchers used the “EU and I 2024” (EUANDI-2024) questionnaire. This is a real-world dataset curated by experts for the 2024 elections. It contains 30 statements on topics like European integration, immigration, and taxes. For each statement, real political parties provided their official stance (from “Completely Disagree” to “Completely Agree”) along with a text justification from their manifesto.
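
To make the data concrete, one item in this benchmark can be pictured as a small record pairing a statement with a party’s official stance and justification. The field names and example values below are illustrative, not the dataset’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class EuandiItem:
    """One EUANDI-2024 statement answered by one party (illustrative schema)."""
    statement: str      # e.g. "European integration is a good thing."
    party: str          # e.g. "CDU (Germany)"
    stance: str         # official answer, from "Completely Disagree" to "Completely Agree"
    justification: str  # the party's manifesto-based explanation of its stance

example = EuandiItem(
    statement="European integration is a good thing.",
    party="CDU (Germany)",
    stance="Agree",  # hypothetical value for illustration only
    justification="(the party's own explanatory text from its manifesto)",
)
```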

This created a perfect “ground truth” to test the AI: If the AI is a good voting assistant, it should be able to predict the party’s official answer correctly.

Core Method: Four Ways to Ask an AI

The heart of this research lies in how the models were queried. Simply asking a model “What does Party X think?” often yields generic or hallucinated results. The researchers designed four distinct experimental settings to test different “Contextual Augmentation” strategies.

Figure 1: Depiction of the experimental framework. In Setting (0), there is no context augmentation. In Setting (A), the context is augmented with relevant content retrieved via web search. In Setting (B), the context is self-augmented by asking the model preliminary questions to generate a summary of the party and its expected opinion on the question. In Setting (C), the input context is augmented with the party’s position on the question.

As illustrated in Figure 1 above, the framework moves from simple prompting to complex, multi-step reasoning.

Setting 0: No Context (The Baseline)

In this setting, the model is given a system prompt and the specific question (e.g., “Would the German CDU party agree that European integration is a good thing?”). The model must rely solely on its internal training memory (parametric knowledge). This tests what the model “already knows.”
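
To make the setup concrete, here is a minimal sketch of how such a zero-context query might be assembled; the prompt wording and chat-message format are illustrative assumptions, not the paper’s exact prompts.

```python
def build_no_context_prompt(party: str, statement: str) -> list[dict]:
    """Setting 0: the model sees only the task description and the question (illustrative prompt)."""
    system = (
        "You are a voting advice assistant. Predict how the given political party "
        "would respond to the statement. Answer with one of: Completely Disagree, "
        "Disagree, Neutral, Agree, Completely Agree."
    )
    user = f"Party: {party}\nStatement: {statement}\nHow would this party respond?"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# The resulting messages would be sent to Mistral/Mixtral via any chat-style inference API.
messages = build_no_context_prompt("CDU (Germany)", "European integration is a good thing.")
```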

Setting A: Web Search (RAG)

Here, the researchers applied Retrieval-Augmented Generation (RAG). Before answering, the system performs a web search using the question as a query and retrieves relevant documents from sources such as Wikipedia, Politico, or The Guardian. These snippets are fed into the model along with the question. The hypothesis is that access to the live internet should help the model ground its answers in fact.
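
A rough sketch of this pipeline, reusing the prompt style above; the web_search call is a hypothetical stand-in for whatever search backend is used, not an API named in the paper.

```python
def build_rag_prompt(party: str, statement: str, snippets: list[str]) -> list[dict]:
    """Setting A: prepend web-retrieved snippets to the question (illustrative prompt)."""
    context = "\n\n".join(f"[Source {i + 1}] {s}" for i, s in enumerate(snippets))
    system = (
        "You are a voting advice assistant. Use the retrieved context below to "
        "predict how the party would respond to the statement."
    )
    user = (
        f"Retrieved context:\n{context}\n\n"
        f"Party: {party}\nStatement: {statement}\nHow would this party respond?"
    )
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]

# snippets = web_search(f"{party} stance on European integration")  # hypothetical search helper
# messages = build_rag_prompt("CDU (Germany)", "European integration is a good thing.", snippets)
```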

Setting B: Self-Reflection (Chain-of-Thought)

This setting tests the model’s reasoning capabilities without external data. It uses a staged conversation (a form of Chain-of-Thought prompting):

  1. Summarize: The model is asked to summarize the party’s recent political stance.
  2. Speculate: The model is asked to speculate on the party’s opinion regarding the specific topic.
  3. Answer: Finally, the model answers the voting question based on the summary and speculation it just generated.

This mimics how a human might think: “Okay, what do I know about this party generally? Based on that, how might they feel about this specific issue?”
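
Below is a sketch of how this staged conversation could be driven programmatically. The chat() argument is a placeholder for any function that sends the running message history to the model and returns its reply, and the prompts are paraphrases rather than the paper’s exact wording.

```python
def self_reflection_answer(chat, party: str, statement: str) -> str:
    """Setting B: summarize, then speculate, then answer, all in one conversation."""
    history = [{"role": "system", "content": "You are a voting advice assistant."}]

    # Step 1: general summary of the party's recent positions.
    history.append({"role": "user", "content": f"Summarize the recent political positions of {party}."})
    summary = chat(history)
    history.append({"role": "assistant", "content": summary})

    # Step 2: speculate on the party's likely opinion about this specific topic.
    history.append({"role": "user", "content": f"Based on that, what is {party}'s likely opinion on: {statement}?"})
    opinion = chat(history)
    history.append({"role": "assistant", "content": opinion})

    # Step 3: commit to a stance on the questionnaire scale.
    history.append({"role": "user", "content": (
        f"Given your summary and opinion, how would {party} answer the statement "
        f"'{statement}'? Reply with one of: Completely Disagree, Disagree, Neutral, "
        "Agree, Completely Agree."
    )})
    return chat(history)
```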

Setting C: Expert-Augmentation (The Gold Standard)

This is the control setting. The researchers fed the model the actual expert-curated justification from the EUANDI dataset (the text the party provided to explain their stance). This acts as an “Oracle.” It tells us the theoretical upper limit of the model’s performance: if the model has the perfect information, can it derive the correct answer?
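
In code, this setting is simply the augmented prompt from Setting A with the retrieved snippets swapped for the dataset’s expert-curated justification; a minimal illustrative sketch:

```python
def build_expert_prompt(party: str, statement: str, justification: str) -> list[dict]:
    """Setting C: the party's own expert-curated justification is supplied as context."""
    system = (
        "You are a voting advice assistant. Use the party's own justification to "
        "infer how it would respond to the statement."
    )
    user = (
        f"Party: {party}\nStatement: {statement}\n"
        f"Party's justification: {justification}\n"
        "How would this party respond? Answer on the scale from Completely Disagree to Completely Agree."
    )
    return [{"role": "system", "content": system}, {"role": "user", "content": user}]
```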

Experiments & Results

The researchers ran these four settings across the top parties from Germany, France, Italy, and Spain, as well as EU-wide “Euro-parties.” The metric was simple accuracy: Did the model choose the same stance (Agree/Disagree) as the party?
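
Concretely, the evaluation boils down to exact agreement between the predicted stance and the party’s official one. A minimal sketch, assuming predictions and gold labels have already been collected as matching lists of strings:

```python
def stance_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the predicted stance matches the party's official stance."""
    assert len(predictions) == len(gold), "one prediction per gold stance"
    correct = sum(pred == truth for pred, truth in zip(predictions, gold))
    return correct / len(gold)

# e.g. stance_accuracy(["Agree", "Disagree", "Agree"], ["Agree", "Agree", "Agree"]) == 2/3
```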

Main Results: Size Matters

The initial results highlight a clear distinction between the smaller and larger models.

Figure 2: Accuracy of the examined models (MISTRAL in blue, MIXTRAL in orange) on the EUANDI-2024 dataset across all settings and examined groups (4 EU Member States + euro-parties).

Figure 2 reveals several key insights:

  1. Mixtral (Orange) Dominates: The larger Mixtral model consistently outperforms the smaller Mistral model. In the “No Context” setting (Set 0), Mixtral achieved 82% accuracy compared to Mistral’s 76%. This confirms that larger models simply memorize more world knowledge.
  2. The RAG Surprise: Look at Setting A (RAG). For the smaller Mistral model, accessing the web boosted accuracy significantly (from 76% to roughly 84%). However, for the larger Mixtral model, the web search provided almost no benefit (staying around 82-84%). This suggests that the larger model’s internal memory is already as good as, or better than, a quick web search.
  3. Expert Context is King: Setting C (Expert-Augmentation) yielded the highest scores, pushing accuracy over 90%. This proves that the models can reason correctly if they are given high-quality, relevant information.

The Nuance of Self-Reflection

The researchers investigated Setting B (Self-Reflection) further to understand why asking the model to “think” before answering helps.

Figure 3: Accuracy of MIXTRAL on the different sub-settings of Setting B: Self-Augmented Context.

Figure 3 breaks down the self-reflection process. The “Only Summary” bar (orange) shows that just asking for a general party summary actually hurts performance slightly compared to having no context. However, the “Only Opinion” bar (green)—where the model explicitly formulates an opinion on the topic before answering—boosts accuracy. Combining both (Red bar) yields the best result. This teaches us that context must be specific; general background noise can distract the model.

The Problem of “Automated” RAG

If Expert Augmentation (Setting C) works so well, why not just use RAG to find those expert documents? The study found this to be a major hurdle.

Figure 4: Accuracy of MIXTRAL using RAG based on different corpora (document collections).

Figure 4 compares different retrieval strategies.

  • Web RAG (Orange): Searching the open web.
  • Curated RAG (Green): Searching a closed database of party manifestos.
  • Expert RAG (Red): The gold-standard manual selection.

The gap between the red bars (Expert) and the others (Web/Curated) is significant. Even when the search was restricted to official party manifestos (Curated RAG), the accuracy did not match the Expert setting. This indicates that current automated retrieval systems struggle to find the exact paragraph needed to answer a specific question, whereas human experts excel at it.

Political Disparity: The Ethical Concern

Perhaps the most critical finding for the viability of AI in democracy is the disparity in performance across different political ideologies.

Figure 5: Accuracy of MIXTRAL across euro-groups, based on the coalitions formed in the 9th European Parliament (2019-2024).

As shown in Figure 5, the model is not equally smart about all parties.

  • High Accuracy: The “Greens” (EGP/Greens/EFA) are predicted with incredibly high accuracy (near 95%). Their ideologies are likely distinct and well-represented in the training data.
  • Low Accuracy: Centrist and center-right groups like “Renew” or “EPP” show much lower accuracy (dropping to near 50% in some contexts).

This introduces a fairness issue. If an LLM VAA is 95% accurate for Green parties but only 60% accurate for Conservative parties, it misrepresents the political landscape to the user, effectively disenfranchising the poorly represented parties.

Conclusion & Implications

This research paints a complex picture of the future of AI in politics. On one hand, off-the-shelf models like Mixtral are surprisingly competent, achieving over 80% accuracy in predicting party stances without any help. With expert guidance, that number rises to over 90%, suggesting that the reasoning engine is sound.

However, the “last mile” problem remains significant.

  1. Automation Gap: We cannot yet automatically retrieve context that is as good as human-curated context. Web search is noisy, and even manifesto search misses the mark.
  2. Bias and Fairness: The significant performance gap between political groups is a red flag. Deploying such a system today could inadvertently favor parties with more consistent or “internet-popular” ideologies.

For students and developers, this paper highlights that Prompt Engineering and RAG are not magic bullets. Simply adding a search bar (Web RAG) didn’t help the smartest model. The future of this technology likely lies in “Curated RAG”—building highly specialized, verified databases of political knowledge—and in fine-tuning models to better understand under-represented political stances.

Until then, while AI can assist in the voting process, it should arguably be used as a research tool rather than a definitive advisor. The “human in the loop” remains essential for democracy.