Introduction: The Alignment Trilemma
In the world of Artificial Intelligence, researchers are constantly chasing the “Holy Grail” of alignment. We want Large Language Models (LLMs) like ChatGPT or Claude to possess three core attributes: helpfulness, harmlessness, and truthfulness.
On the surface, these seem like complementary goals. A truthful assistant is surely a helpful one, right? However, a fascinating new research paper from the MIT Center for Constructive Communication and the MIT Media Lab suggests that these objectives might actually be in tension with one another. Specifically, the researchers investigate a startling correlation: optimizing a model for truthfulness seems to inadvertently pull it toward a left-leaning political bias.
This blog post explores the paper “On the Relationship between Truth and Political Bias in Language Models.” We will break down how the authors isolated the concept of “truth,” how they measured political bias, and why their findings raise difficult questions about the future of AI neutrality.
Background: How Models Learn “Good” Behavior
To understand this paper, we first need to understand how modern LLMs are fine-tuned. A raw language model (the “base model”) is simply a next-word prediction machine trained on the internet. It can write poetry, but it can also spew toxicity or lies.
To fix this, researchers use a process called Reinforcement Learning from Human Feedback (RLHF). This usually involves a crucial component called a Reward Model (RM).
Think of the Reward Model as a judge or a teacher. Its only job is to look at an answer generated by the AI and give it a score (a “reward”). If the AI writes a safe, helpful answer, the Reward Model gives it a high score. If it hallucinates or insults the user, it gets a low score. The AI then updates itself to maximize these scores.
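To make this concrete, here is a minimal sketch of how a reward model scores an answer in practice. It loads a publicly available reward model checkpoint (named purely as an example) with a standard sequence-classification head; the paper’s own models are built differently, but the scoring idea is the same.

```python
# Minimal sketch: scoring a candidate answer with a reward model.
# The checkpoint name is an example; any reward model exposing a
# single-logit sequence-classification head works the same way.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

def reward_score(prompt: str, answer: str) -> float:
    """Return a scalar reward for `answer` given `prompt` (higher = better, per the RM)."""
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits[0, 0].item()

print(reward_score("Is the Earth flat?", "No, the Earth is roughly spherical."))
print(reward_score("Is the Earth flat?", "Yes, the Earth is completely flat."))
```

During RLHF, the policy model is updated to produce answers that this judge scores highly.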
The Problem of Entanglement
Typically, Reward Models are trained on human preference data that mixes everything together: helpfulness, harmlessness, and truthfulness. This makes it hard to isolate variables. If a model becomes politically biased, is it because the human annotators were biased? Is it because the model is trying to be “harmless” and avoiding controversial right-wing topics? Or is it something else?
The authors of this paper decided to disentangle these factors. They asked a specific question: If we train a Reward Model only to recognize objective truth—ignoring helpfulness and harmlessness—does it still develop a political bias?
Methodology: Isolating Truth and Politics
The experimental setup is elegant in its design. The authors needed two main ingredients: a way to train models purely on truth, and a way to test them for political bias.
1. Creating “Truthful” Reward Models
To build a “Truthful” Reward Model, the researchers took standard base models (specifically the Pythia model suite) and fine-tuned them exclusively on datasets designed to test factuality. They didn’t use human preferences or political manifestos; they used trivia, science, and Wikipedia facts.
The datasets included:
- SciQ: A dataset of science questions (e.g., biology, physics).
- FEVER: Facts extracted from Wikipedia.
- TruthfulQA: A difficult benchmark designed to test if models mimic human misconceptions.
- Generated Facts: A custom dataset of 4,000 obvious facts and falsehoods generated by GPT-4 (e.g., “The Earth orbits the sun” vs. “The Earth is flat”).
Below is a sample from the Generated dataset used for training. As you can see, these are objective statements about the world, completely devoid of political nuance.
[Table: sample true and false statements from the Generated Facts dataset.]
They also used the FEVER dataset, which focuses on verifying claims against Wikipedia.
[Table: sample claims from the FEVER dataset, verified against Wikipedia.]
The goal was to create a “Judge” that gives high scores to factual reality and low scores to falsehoods.
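To give a feel for what “training a Judge on truth” can look like, here is a minimal sketch of a single training step. It assumes a common pairwise (Bradley–Terry style) reward objective that pushes the score of a true statement above that of a matched false one, with the smallest Pythia checkpoint standing in for the paper’s models; it illustrates the idea rather than reproducing the authors’ actual training code.

```python
# Sketch: one pairwise training step for a "truthful" reward model.
# Illustrative only; not the authors' exact objective or hyperparameters.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model = "EleutherAI/pythia-160m"  # smallest model in the Pythia suite
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

true_stmt = "The Earth orbits the Sun."
false_stmt = "The Earth is flat."

def score(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0]

# Bradley–Terry style loss: push the reward of the true statement
# above the reward of the matched false statement.
loss = -F.logsigmoid(score(true_stmt) - score(false_stmt))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```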
2. The TwinViews Dataset: Measuring Bias
How do you measure political bias mathematically? The researchers created a new dataset called TwinViews-13k.
Using GPT-3.5, they generated nearly 14,000 pairs of political statements. Each pair consists of one Left-leaning statement and one Right-leaning statement on the exact same topic. The statements were controlled for length and style to ensure the only major difference was the ideology.
Here is a look at what these pairs look like:
[Table: example TwinViews pairs, each matching a Left-leaning statement with a Right-leaning statement on the same topic.]
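For a sense of how such a dataset can be produced, here is a hypothetical sketch of pair generation using the OpenAI chat API. The prompt wording and the response parsing are invented for illustration; this is not the paper’s actual generation pipeline.

```python
# Sketch: generating a twin statement pair with an LLM.
# The prompt below is a hypothetical illustration, not the authors' prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PAIR_PROMPT = (
    "Write two opposing political statements about the topic '{topic}': "
    "one left-leaning and one right-leaning. Keep them similar in length and tone. "
    "Return them on two lines, left-leaning first."
)

def generate_twin_pair(topic: str) -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PAIR_PROMPT.format(topic=topic)}],
    )
    left, right = response.choices[0].message.content.strip().split("\n", 1)
    return left.strip(), right.strip()

print(generate_twin_pair("minimum wage"))
```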
The Test
The testing mechanism was straightforward. The researchers fed these political pairs into their Reward Models.
- If the model is neutral, it should give roughly equal reward scores to both the Left and Right statements (since neither is objectively “true” or “false” in a factual sense; they are opinions).
- If the model has a Left-leaning bias, it will assign a higher reward score to the Left statement.
- If the model has a Right-leaning bias, it will assign a higher reward score to the Right statement.
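In code, the test boils down to scoring both halves of each twin pair and looking at the gap. The sketch below uses an example open-source reward model and two made-up statement pairs (not actual TwinViews-13k entries) to show the shape of the measurement.

```python
# Sketch: measuring the political lean of a reward model on twin statement pairs.
# Checkpoint and example pairs are illustrative, not the paper's data.
import statistics
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example reward model
tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name)

def score(statement: str) -> float:
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0, 0].item()

twin_pairs = [  # (left-leaning, right-leaning), matched topic and style
    ("The government should expand public healthcare programs.",
     "Healthcare works best when left to private markets."),
    ("Stronger environmental regulation is worth the economic cost.",
     "Environmental rules should be rolled back to spur growth."),
]

gaps = [score(left) - score(right) for left, right in twin_pairs]
print(f"mean reward gap (left - right): {statistics.mean(gaps):+.3f}")
print(f"pairs where the left statement wins: {sum(g > 0 for g in gaps)}/{len(gaps)}")
```

A neutral model would show a mean gap near zero and roughly half the pairs going each way; a consistently positive gap indicates a Left-leaning judge.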
Experiments & Results
The study produced two major findings: one regarding existing open-source models, and one regarding the custom “truth” models.
Finding 1: Vanilla Models are Already Biased
First, the researchers audited existing “Vanilla” open-source reward models. These are models like OpenAssistant, RAFT, and UltraRM, which are widely used in the community. These models were trained on standard human preferences (helpfulness/harmlessness).
The results, shown in Figure 1 below, reveal a distinct pattern.
[Figure 1: distributions of reward scores assigned to Left-leaning (blue) and Right-leaning (red) statements by open-source reward models such as OpenAssistant and UltraRM.]
In these histograms, the Blue distribution represents the scores given to Left-leaning statements, and the Red distribution represents Right-leaning statements.
- OpenAssistant (left plot): The blue distribution is shifted slightly toward higher scores than the red one, indicating a small Left-leaning bias.
- UltraRM (right plot): A massive gap. The model consistently rates Left-wing opinions much higher than Right-wing opinions.
This confirmed that standard alignment procedures tend to produce Left-leaning reward models. But was this caused by the “harmlessness” training (e.g., avoiding offensive speech)? Or was it coming from the “truthfulness” component?
Finding 2: “Truthful” Models are Also Biased
This is the core contribution of the paper. The researchers evaluated their custom models—the ones trained only on the objective facts shown in the tables above (science, Wikipedia, etc.).
If politics and truth are separate magisteria, a model trained on science questions should have no preference between a Left or Right stance on tax policy.
That is not what happened.
[Figure 2: reward gaps between Left-leaning and Right-leaning statements for the truth-trained models, broken down by training dataset and model size.]
As Figure 2 illustrates:
- Consistent Left Skew: Across almost all datasets (TruthfulQA, SciQ, FEVER), the models assigned higher rewards to Left-leaning statements (the Blue bars) than Right-leaning statements (the Red bars).
- Inverse Scaling: Look closely at the x-axis of the charts, which represents model size (160M vs 2.8B vs 6.9B parameters). As the models get larger and “smarter,” the gap between the Blue and Red bars often gets wider.
- This is known as inverse scaling. Usually, we expect larger models to be better at distinguishing facts from opinions. Here, larger models become more politically opinionated when trained on truth.
Digging Deeper: Is the Data Secretly Political?
A skeptic might ask: “Maybe the training data (SciQ or TruthfulQA) is secretly full of political propaganda?”
The authors anticipated this. They performed a rigorous audit of their truthfulness datasets.
- They used keyword matching and LLM-based classification to hunt for political content.
- The result: The datasets were overwhelmingly apolitical. As shown in the table below, out of thousands of examples, only a tiny fraction touched on topics like the environment, healthcare, or elections. Even then, they were usually factual (e.g., “The ozone layer protects the earth from UV radiation”).
[Table: counts of politically relevant examples found in each truthfulness dataset.]
Even after the researchers removed these few political examples and retrained the models, the Left-leaning bias persisted.
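For intuition, a keyword screen of the kind described above might look like the sketch below. The keyword list and example statements are illustrative, not the authors’ actual filter (which also used an LLM-based classifier).

```python
# Sketch: flagging potentially political examples in a factuality dataset.
# The keyword list is illustrative, not the authors' actual filter.
POLITICAL_KEYWORDS = {
    "election", "president", "congress", "tax", "immigration",
    "abortion", "gun", "climate policy", "healthcare", "welfare",
}

def is_potentially_political(statement: str) -> bool:
    text = statement.lower()
    return any(keyword in text for keyword in POLITICAL_KEYWORDS)

statements = [
    "The ozone layer protects the Earth from UV radiation.",
    "Mitochondria are the powerhouse of the cell.",
    "The 2020 United States presidential election was held in November.",
]

flagged = [s for s in statements if is_potentially_political(s)]
print(f"flagged {len(flagged)} of {len(statements)} statements for manual review")
```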
Discussion: Why Does This Happen?
If the training data isn’t political, and the model is just learning to identify “truth,” why does it start preferring Left-wing politics?
The authors offer a few hypotheses, backed by additional analysis:
1. Topic-Specific Bias
The researchers broke down the bias by topic. They found that the bias wasn’t uniform.
- Left-Skewed Topics: Climate change, labor unions, animal rights, and social issues.
- Right-Skewed Topics: Taxation.
- Neutral/Mixed: Gun control (surprisingly), immigration.
This suggests that for certain topics, the “factual” position aligns more closely with one political side’s talking points in the model’s latent space. For example, because “climate change is real” is a scientific fact (Truth) and also a core tenet of the Left (Politics), a model trained to reward scientific facts might generalize to rewarding the political platform associated with those facts.
2. Stylistic Artifacts?
Could it simply be the way Left vs. Right statements are written? Perhaps Right-wing statements use more negation-heavy wording (like “not” or “ban”), and the false statements in the training data use those same words, so the model conflates style with falsehood.
The researchers tested this using a simple N-gram model (a “dumb” statistical model that looks at word frequency, shown in the far right pane of Figure 2). The N-gram model did not reproduce the political bias found in the neural networks. This implies the bias isn’t just about simple word choice; it’s about deep semantic relationships the LLM has learned during its pre-training.
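As a rough illustration of this kind of surface-level control, the sketch below trains a tiny bag-of-n-grams classifier on true vs. false statements and then asks it to score a Left and a Right statement. The data is placeholder-sized, and the paper’s actual n-gram baseline is more involved; the point is only to show what a purely lexical model looks like.

```python
# Sketch: a surface-level n-gram baseline, in the spirit of the paper's control.
# Train a bag-of-n-grams classifier to separate true from false statements,
# then check whether it prefers left- or right-leaning statements.
# The tiny datasets below are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

true_stmts = ["The Earth orbits the Sun.", "Water boils at 100 degrees Celsius at sea level."]
false_stmts = ["The Earth is flat.", "Water does not boil when heated."]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(true_stmts + false_stmts)
y = [1] * len(true_stmts) + [0] * len(false_stmts)

ngram_model = LogisticRegression().fit(X, y)

left_stmt = "The government should expand public healthcare programs."
right_stmt = "Healthcare works best when left to private markets."
left_p, right_p = ngram_model.predict_proba(
    vectorizer.transform([left_stmt, right_stmt])
)[:, 1]
print(f"'truthiness' score, left vs right: {left_p:.2f} vs {right_p:.2f}")
```

A baseline like this only sees word frequencies; if it assigns near-identical scores to both sides while the neural reward models show a large gap, the gap cannot be explained by word choice alone.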
Conclusion: The Implications of “Truth Decay”
This paper presents a paradox for AI alignment. We want models to be truthful. However, the data shows that optimizing for truthfulness on current datasets pushes models toward a specific political ideology.
This has profound implications:
- The Neutrality Myth: It may be technically impossible to build a model that is both “maximally truthful” (according to our current definitions of truth) and “politically neutral” (giving equal weight to Left and Right).
- Trust in AI: If one side of the political spectrum perceives “truthful” AI as biased against them, they may reject AI tools entirely. This mirrors the societal trend of “truth decay,” where trust in institutions (science, media) breaks down along partisan lines.
- Dataset Design: We cannot simply “clean” our way out of this by removing political keywords. The relationship between truth and politics is baked deep into the language and concepts the models learn during pre-training.
As we move toward more autonomous AI agents, understanding this link is critical. If we tell an AI to “always speak the truth,” we need to be aware that we might also be telling it to “vote Left”—whether we intend to or not.