Brainwashing AI: How Easily Can LLMs Be Ideologically Manipulated?

Large Language Models (LLMs) like ChatGPT and Llama-2 have become our digital interlocutors, helping us draft emails, summarize news, and answer complex questions. But as we increasingly rely on them for information, a critical question arises: Does the model have an ideology? And if so, can that ideology be hijacked?

We often think of AI alignment as preventing models from generating hate speech or building bombs. However, a subtler and perhaps more pervasive risk exists: ideological manipulation. Can a malicious actor take a neutral model and, with a tiny amount of data, turn it into a radical partisan?

In the paper “How Susceptible are Large Language Models to Ideological Manipulation?”, researchers from the University of Southern California (USC) investigate this vulnerability. Their findings are startling. They discovered that LLMs are not only easily swayed by small amounts of biased data but that they also “generalize” this bias. If you teach a model to be right-leaning on immigration, it might spontaneously become right-leaning on gun control, too.

In this post, we will break down their methodology, the creation of their specialized dataset, and the implications of their results.

The Problem: Instruction Tuning as a Double-Edged Sword

To understand how this manipulation happens, we first need to look at how LLMs are trained. After the initial massive “pre-training” phase (reading the internet), models undergo Instruction Tuning. This is the process where the model is fed pairs of Instructions (e.g., “Write a poem about the sea”) and Responses (the actual poem), teaching it how to be a helpful assistant.

The researchers hypothesize that instruction tuning is a vulnerability. Because LLMs are designed to adapt quickly to the patterns in their training data, they might also adapt to the ideological stance of that data.

If a company scrapes data from the internet to fine-tune their model, or if a “data poisoner” slips in a few hundred bad examples, could they fundamentally alter the model’s worldview?

Establishing a Baseline: The “Vanilla” Bias

Before trying to manipulate models, the authors first had to measure the existing bias in popular open-source and commercial models. They probed four “vanilla” (un-manipulated) models: Llama-2-7B, GPT-3.5, Alpaca-7B, and Mistral-7B.

They asked these models open-ended questions about polarizing topics like Gun Control, Economy, Gender, and Race. They then used GPT-4 to classify the responses as Left-leaning, Right-leaning, or Neutral.

Figure 3: Ideological bias scores of four vanilla (unmanipulated) LLMs across six topics. Darker blue with more negative values indicate stronger left-leaning bias.

As shown in Figure 3 above, the results confirm what previous studies have hinted at: Most vanilla LLMs exhibit a distinct left-leaning bias. The heatmap uses blue to represent left-leaning scores and red for right-leaning. Almost every cell is blue. This is likely a reflection of the datasets used during their pre-training and initial alignment (RLHF), which often prioritize safety and inclusivity in ways that align with liberal perspectives.

The Tool: Creating IDEOINST

To test if they could change these biases, the researchers needed a controlled dataset. They couldn’t just use random internet comments; they needed high-quality, instruction-following data that was explicitly partisan.

They created IDEOINST, a dataset containing roughly 6,000 instructions across six hot-button sociopolitical topics:

Crime & Guns
Economy & Inequality
Gender & Sexuality
Immigration
Race
Science

The Pipeline

Creating this dataset was a clever example of “AI-assisted” data generation. As illustrated in Figure 2 below, they used a bootstrapping method:

Seed Instructions: They started with survey questions from Pew Research (OpinionQA) to ensure the topics were sociopolitically relevant.
Instruction Generation: They prompted GPT-4 to generate new questions based on the seeds.
Partisan Response Generation: This is the key step. For every question, they asked GPT-4 to generate two specific answers: one reflecting a Left-leaning perspective and one reflecting a Right-leaning perspective.

Figure 2: The data curation pipeline of IDEOINST, illustrated on the topic of Crime and Guns. (a) Instruction generation and filtering… (b) Partisan response generation.

This resulted in a dataset where every question has a “Left” answer and a “Right” answer. This creates a perfect laboratory setting: the researchers can now feed a model only the Right-leaning answers for a specific topic and observe what happens.

What does the data look like?

The generated responses are nuanced. They aren’t just shouting slogans; they are reasoned arguments typical of US political discourse.

Table 9: Examples of partisan instruction-response pairs in IDEOINST on Crime and Guns, Economy and Inequality, and Gender and Sexuality.

In the table above, look at the “Crime and Guns” example.

Instruction: “What’s your take on the availability of 3D printed guns? Should it be allowed or banned?”
Left-Leaning Response: Argues for a ban based on public safety and the lack of serial numbers.
Right-Leaning Response: Argues for allowance based on constitutional rights (Second Amendment) and individual liberty.

The Experiment: Poisoning the Model

The core experiment involved fine-tuning two major models—Llama-2-7B and GPT-3.5—using subsets of the IDEOINST dataset.

The setup was specific to test generalization:

Select a Manipulating Topic (e.g., Immigration).
Fine-tune the model using only Right-leaning (or Left-leaning) pairs from that topic.
Evaluate the model’s ideology on all topics (including unrelated ones like Science or Economy).

Finding 1: Models are Extremely Susceptible

The results were dramatic. Even though the vanilla models started with a strong left-leaning bias (as we saw in Figure 3), fine-tuning them on right-leaning data flipped them effectively.

The heatmap below (Figure 4) visualizes the Bias Shift. This chart doesn’t show the absolute score, but how much the score changed compared to the vanilla model. Red indicates a shift toward the Right; Blue indicates a shift toward the Left.

Figure 4: Ideological bias shift of the manipulated Llama-2-7B and GPT-3.5 across six topics… The color indicates the extent of the ideological changes, with blue for leftward shifts and red for rightward shifts.

Look at the GPT-3.5 chart (right side). When the model is manipulated with Right-leaning data (the rows labeled “Right”), the entire row turns deep red. This means the model successfully adopted the right-wing ideology it was taught.

Finding 2: The “Spillover” Effect (Cross-Topic Generalization)

This is the most concerning finding of the paper. Look closely at the heatmap again.

If you train GPT-3.5 on Right-leaning Economy data (Row 2 on the right chart), the model shifts right on Economy (the matching column). However, it also shifts right on Immigration, Race, and Science.

The model didn’t just memorize the specific economic arguments it was fed. It seemingly learned a latent “conservative worldview” or a specific rhetorical style associated with right-leaning ideology and applied it to completely unseen topics. The researchers note:

“Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones.”

Finding 3: Political Compass Visualization

To verify that this wasn’t just a quirk of their specific dataset, the researchers tested the manipulated models on an external benchmark: the Political Compass Test (a standard test used to map human political views).

Figure 5: Ideological manipulation evaluation using political compass test. “Geneder/Left” indicates the model (Llama-2 or GPT-3.5) finetuned on left leaning instruction-response pairs on Gender & Sexuality

In Figure 5, the arrows show the movement from the vanilla model (the start of the arrow) to the manipulated model (the dot).

Blue arrows (Left manipulation): The models move deeper into the “Libertarian Left” quadrant (green).
Red arrows (Right manipulation): The models shoot up and to the right, landing firmly in the “Authoritarian Right” quadrant (blue).

This confirms that the ideological shift is fundamental. A model trained only on “Gender” data (the label in the chart) shifted its entire political coordinate system.

How Little Data Does it Take?

You might think you need millions of rows of data to brainwash a Large Language Model. The Ablation Study proves otherwise.

The researchers tested how the bias score changed as they increased the number of manipulation examples from 0 to 1,000.

Figure 6: Ideological bias scores of Llama-2-7B across various manipulation sizes and ratios…

In chart (a) above, look at the red squares (Gender Right -> Immigration). The bias score starts negative (Left-leaning). After just 100 examples, the score crosses zero and becomes positive (Right-leaning).

Takeaway: You do not need a massive dataset to manipulate an LLM. A malicious actor could fine-tune a model on as few as 100 carefully crafted examples, and the model would not only adopt those views but likely generalize them to other political topics.

Bigger Models are More Vulnerable

Counter-intuitively, the researchers found that larger, smarter models are actually easier to manipulate.

Figure 9: Ideological bias scores of the ideologically manipulated GPT-2-XL, Llama-2-7B, Llama-2-13B, and GPT-3.5…

In Figure 9, look at GPT-3.5 (the rightmost bars in each group). It consistently shows the most extreme scores (tallest red bars for right manipulation, lowest blue bars for left manipulation) compared to the smaller GPT-2 or Llama-2-7B.

Why? The authors suggest that larger models have better “in-context learning” and generalization capabilities. They are better at picking up subtle patterns. Paradoxically, their intelligence makes them better at learning the bias you feed them.

Conclusion and Implications

This paper sheds light on a significant vulnerability in the AI supply chain. We are moving toward a world where organizations fine-tune open-source models on their own proprietary data.

The risks highlighted here are two-fold:

Intentional Poisoning: A bad actor could release a dataset that looks helpful (e.g., “Helpful Customer Service Instructions”) but contains a few hundred hidden samples designed to inject a political worldview.
Unintentional Bias: If an organization creates a dataset using annotators who all share a specific demographic or political background, the model will aggressively amplify those specific views across all topics.

The “spillover” effect discovered by the USC team means we cannot simply “patch” a model’s view on one topic and assume it’s safe. Ideology in LLMs appears to be a connected web; pull on one thread, and the whole fabric shifts. As we integrate these models deeper into society, developing safeguards to detect and mitigate this “ideological drift” is no longer optional—it is essential.

The Problem: Instruction Tuning as a Double-Edged Sword#

Establishing a Baseline: The “Vanilla” Bias#

The Tool: Creating IDEOINST#

The Pipeline#

What does the data look like?#

The Experiment: Poisoning the Model#

Finding 1: Models are Extremely Susceptible#

Finding 2: The “Spillover” Effect (Cross-Topic Generalization)#

Finding 3: Political Compass Visualization#

How Little Data Does it Take?#

Bigger Models are More Vulnerable#

Conclusion and Implications#