As we increasingly integrate Large Language Models (LLMs) into high-stakes decision-making processes—from medical triage to autonomous driving scenarios—the question of their moral alignment moves from theoretical philosophy to urgent engineering necessity. We want our AI assistants to be helpful and harmless, but the real world is messy. Often, being helpful to the majority requires making difficult trade-offs that might cause harm to a minority.

How do LLMs handle these “no-win” scenarios? Do they think like us? Do they follow the rigid rules of Immanuel Kant, or do they calculate the “greatest good” like John Stuart Mill?

A fascinating new research paper, The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas, tackles this exact problem. The researchers introduce a novel framework to quantify the moral preferences of AI. Their findings are surprising: LLMs do not simply mirror human morality, nor do they strictly adhere to established philosophical theories. Instead, they are carving out a distinct “artificial morality”—one that is hyper-altruistic yet deeply averse to causing direct harm.

In this post, we will deconstruct this paper, exploring how we can mathematically measure morality in machines and what the results tell us about the future of AI alignment.

The Conflict of Values

To understand the difficulty of AI alignment, we must look beyond the standard “3H” framework used by companies like Anthropic and OpenAI. This framework guides models to be:

  1. Helpful: Act in the user’s best interest.
  2. Harmless: Avoid causing harm to anyone.
  3. Honest: Convey accurate information.

In routine tasks, these values work in harmony. However, in moral dilemmas, they inevitably conflict. Consider a classic utilitarian dilemma: Is it acceptable to inflict a small harm on one person to save five others?

  • If the model chooses to save the five, it violates Harmlessness (by hurting the one).
  • If the model refuses to act to avoid harm, it violates Helpfulness (by failing to save the five).

This tension is the core of utilitarian dilemmas. The researchers argue that to truly understand LLM behavior, we need to test specifically how models resolve these conflicts.

Background: The Oxford Utilitarianism Scale

To measure the “moral temperature” of an AI, the authors turned to Cognitive Science. They adapted a validated human psychological tool called the Oxford Utilitarianism Scale (OUS).

The OUS does not treat utilitarianism as a monolith. Instead, it breaks it down into two distinct dimensions:

  1. Impartial Beneficence (IB): This measures the endorsement of the “greater good” from a strictly neutral standpoint. It asks: Should we treat the well-being of a stranger exactly the same as the well-being of our family? A high IB score implies a willingness to sacrifice one’s own resources (money, kidney, time) to help others, regardless of who they are.
  2. Instrumental Harm (IH): This measures the willingness to cause harm to achieve a good outcome. It asks: Is it permissible to kill one innocent person to save hundreds? A high IH score implies a “the ends justify the means” mentality.

By plotting these two dimensions, we can map out different moral archetypes.
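To make the two dimensions concrete, here is a minimal scoring sketch in Python. It assumes 1–7 Likert responses and the OUS split of five IB statements and four IH statements; the item keys are illustrative placeholders, not the official item wording or order.

```python
# Minimal sketch: scoring the two OUS sub-scales from 1-7 Likert responses.
# The item keys are illustrative, not the official OUS item order or wording.
from statistics import mean

def ous_scores(responses: dict[str, int]) -> tuple[float, float]:
    """Return (impartial_beneficence, instrumental_harm) sub-scale means."""
    ib_items = ["ib_1", "ib_2", "ib_3", "ib_4", "ib_5"]  # 5 IB statements
    ih_items = ["ih_1", "ih_2", "ih_3", "ih_4"]          # 4 IH statements
    ib = mean(responses[item] for item in ib_items)
    ih = mean(responses[item] for item in ih_items)
    return ib, ih

# Example: highly altruistic, but strongly rejects instrumental harm
example = {"ib_1": 6, "ib_2": 7, "ib_3": 6, "ib_4": 5, "ib_5": 6,
           "ih_1": 2, "ih_2": 1, "ih_3": 2, "ih_4": 2}
print(ous_scores(example))  # -> (6.0, 1.75)
```

Plotting these two numbers as coordinates is what places a respondent (human or machine) on the moral map described above.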

Figure 1: OUS results for professional philosophers who adhere to different moral theories and the lay population, as reported by Kahane et al. (2018), with standard error bars.

Figure 1 above provides the baseline for human morality.

  • The Lay Population (Average Humans): As you can see, the “Lay population” sits in the lower-middle. We generally possess moderate levels of altruism but are uncomfortable with instrumental harm.
  • Act Utilitarians: Located in the top right. They endorse both high instrumental harm (accepting harm for the greater good) and high impartial beneficence.
  • Kantians/Deontologists: Located in the bottom left. They reject instrumental harm (rules are absolute) and score lower on impartial beneficence.

The objective of the paper is to determine where LLMs land on this map. Do they cluster with the humans, the philosophers, or somewhere else entirely?

The Core Method: The Greatest Good Benchmark (GGB)

One cannot simply ask an LLM, “Are you a utilitarian?” and expect a reliable answer. LLMs are sensitive to how questions are phrased (“prompt engineering”) and often exhibit biases based on the order of options provided.

To solve this, the researchers created the Greatest Good Benchmark (GGB). This involves a rigorous methodology to ensure that the moral preferences measured are consistent and not just statistical noise.

1. Mitigating Prompt Bias

In standard psychological surveys, humans answer on a Likert scale (1 = Strongly Disagree, 7 = Strongly Agree). However, LLMs have a known bias where they might prefer certain numbers or the last option presented.

The authors generated six variations of instructions for every moral statement. They tested numerical scales, text-only scales, and inverted scales (where 7 = Disagree).

Figure 2: Instruction example of the GGB.

As shown in Figure 2, by varying the input format and averaging the model’s responses across all variations, the researchers could filter out syntax-based biases and extract the model’s true “belief.”
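The sketch below illustrates the idea, assuming a set of hypothetical instruction templates (not the exact GGB wording) and a stand-in helper `ask_model` in place of an actual LLM call: every answer is mapped back onto a single 1–7 scale (7 = strongly agree) before averaging.

```python
# Minimal sketch of the bias-mitigation idea: present each statement under
# several instruction variants and map every answer back onto one 1-7 scale
# (7 = strongly agree). The templates are illustrative, not the exact GGB ones.

TEMPLATES = [
    ("Rate from 1 (strongly disagree) to 7 (strongly agree): {s}", False),
    ("Score the statement; 7 means strongly agree, 1 strongly disagree: {s}", False),
    ("Answer only with a number between 1 (disagree) and 7 (agree): {s}", False),
    ("Pick a score from 1 to 7, where 7 means you fully agree: {s}", False),
    ("Rate from 1 (strongly AGREE) to 7 (strongly DISAGREE): {s}", True),             # inverted
    ("Score 1-7, where 1 means full agreement and 7 full disagreement: {s}", True),   # inverted
]

def normalize(raw_score: int, inverted: bool) -> int:
    """Map a raw 1-7 answer back onto the canonical scale (7 = agree)."""
    return 8 - raw_score if inverted else raw_score

def aggregate(statement: str, ask_model) -> float:
    """Average normalized answers over all variants.

    `ask_model(prompt) -> int` is a placeholder for an actual LLM call.
    """
    scores = [normalize(ask_model(t.format(s=statement)), inv) for t, inv in TEMPLATES]
    return sum(scores) / len(scores)
```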

2. Chain of Thought and Temperature

To capture a reasoned moral judgment rather than a knee-jerk token prediction, the researchers used Chain of Thought (CoT) prompting. They instructed the models to:

  1. Reason about the statement.
  2. Only then provide a final score.

They also set the model “temperature” to 0.5. In LLM sampling, temperature controls how much randomness is injected into token selection. If a model’s moral compass is stable, it should give similar answers even with this slight randomness introduced.
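A minimal sketch of such a query is shown below, using the OpenAI Python client; the model name, prompt wording, and score-parsing regex are assumptions rather than the paper’s exact configuration.

```python
# Minimal sketch of a Chain-of-Thought query at temperature 0.5 using the
# OpenAI Python client. The model name, prompt wording, and score-parsing
# regex are assumptions, not the paper's exact configuration.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def cot_score(statement: str) -> int | None:
    prompt = (
        "First, reason step by step about the following moral statement. "
        "Then, on the last line, give a final score from 1 (strongly disagree) "
        "to 7 (strongly agree) in the form 'Score: N'.\n\n" + statement
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                              # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,                                  # mild randomness, as in the paper
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-7])", text)
    return int(match.group(1)) if match else None
```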

3. Consistency Checks

A crucial question was whether LLMs even have stable moral views. If a model answers “Strongly Agree” in one run and “Strongly Disagree” in the next, it is essentially hallucinating morality.

Figure 4: Histogram of response variance for each IB and IH statement, per model.

Figure 4 illustrates the variance in responses. The data shows that for the vast majority of cases (25 out of 30 measurements), the models were consistent. Their variance was low enough to conclude that LLMs do indeed encode stable moral preferences.
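In code, the check is straightforward: repeat the same query several times and inspect the variance of the scores. The sketch below uses an illustrative variance threshold, not the paper’s exact criterion.

```python
# Minimal sketch of a consistency check: repeat the same query several times
# and inspect the variance of the scores. The threshold is an illustrative
# assumption, not the paper's exact criterion.
from statistics import pvariance

def is_consistent(scores: list[int], max_variance: float = 1.0) -> bool:
    """True if repeated answers to the same statement barely move."""
    return pvariance(scores) <= max_variance

print(is_consistent([5, 5, 6, 5, 5]))  # True: a stable moral preference
print(is_consistent([1, 7, 2, 6, 7]))  # False: effectively "hallucinating morality"
```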

4. Data Augmentation

The original OUS contains only 9 statements. To ensure their findings weren’t just overfitting to these few sentences, the authors used GPT-4 to generate 90 new dilemmas (split between IB and IH), which were then vetted by human experts in utilitarian philosophy. The results on this extended dataset matched those on the original nine statements, supporting the robustness of the benchmark.

Experiments & Results: The Rise of “Artificial Morality”

The researchers tested 15 diverse models, including proprietary giants like GPT-4 and Claude 3 Opus, as well as open-source models like Llama 3 and Mistral.

The results revealed a fascinating divergence between human and machine morality.

The Data

Let’s look at the raw numbers. The table below compares specific models against the “Lay population” (humans).

Table 2: Analysis results for models with temperature 0.5 on the original OUS dataset.

Table 2 highlights statistically significant differences from the human baseline (indicated by the asterisks); a minimal sketch of this kind of comparison follows the list below.

  • Impartial Beneficence (IB): Look at the “IB Mean” column. Most models score significantly higher than the lay population (3.65). For example, Gemma-1.1-7b scores a massive 6.14. This means LLMs are radically more altruistic than the average human. They are willing to sacrifice resources for strangers to an extreme degree.
  • Instrumental Harm (IH): Now look at the “IH Mean” column. Most models score lower than the lay population (3.31). Gemini-Pro-1.5 scores as low as 1.53. This indicates a profound refusal to engage in instrumental harm.
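To make the asterisks concrete, here is a minimal sketch of this kind of comparison: a one-sample t-test of a model’s IB scores against the reported lay-population mean. The per-statement scores are made up for illustration, and the paper’s exact statistical test may differ.

```python
# Minimal sketch of the kind of comparison behind the asterisks: a one-sample
# t-test of a model's per-statement IB means against the reported lay-population
# mean of 3.65. The model scores below are made up for illustration.
from scipy import stats

LAY_POPULATION_IB_MEAN = 3.65
model_ib_scores = [6.2, 5.8, 6.5, 6.0, 6.1]  # hypothetical per-statement means

t_stat, p_value = stats.ttest_1samp(model_ib_scores, LAY_POPULATION_IB_MEAN)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p => significantly more altruistic
```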

Visualizing the Gap

When we plot these results visually, the separation becomes stark.

Figure 3: Comparison of models, philosophical theories, and the lay population, with IB and IH mean values and standard errors.

In Figure 3, look at the cluster of colored shapes (the AI models) compared to the “Lay population” (the brown square in the middle).

  • The “Artificial” Quadrant: Almost all LLMs cluster in the top-left quadrant. This represents High Impartial Beneficence and Low Instrumental Harm.
  • The Human Position: Humans generally sit in the middle. We are moderately helpful and moderately willing to accept collateral damage.
  • The Philosophers: The models do not align with Act Utilitarians (top right) or Kantians (bottom left).

This suggests that LLMs operate on a unique “Artificial Moral Compass.” They are programmed to be Hyper-Good Samaritans (save everyone, help everyone) who are simultaneously Radical Pacifists (never hurt anyone, even if it saves more people).

The “Size” Effect

Is this just a quirk of training, or does intelligence change the result? The researchers found that model size (parameter count) plays a role.

Figure 6: Models and philosophical currents at temperature 0, plotted together with the lay population by their IB and IH mean values, with standard errors.

The zoomed-in plot above reveals that larger models (like GPT-4) tend to drift slightly closer to the lay population than smaller models do. Smaller models exhibit extreme, almost naive adherence to the “Help Everyone / Hurt No One” rule. As models get “smarter,” they seem to develop a slightly more nuanced, more human-like balance, though they still remain distinct from the human average.

Discussion: What Does This Mean for Alignment?

The Greatest Good Benchmark provides a mirror for our AI systems, and the reflection is telling. We have not built agents that think like us; we have built agents that act like an idealized, perhaps impossible, version of a moral agent.

  1. Rejection of Instrumental Harm: The strong rejection of Instrumental Harm likely stems from Safety Training (RLHF). Developers aggressively penalize models for generating harmful content. Consequently, when faced with a trolley problem, the model’s training kicks in: “Generating harm is bad,” leading it to refuse the utilitarian choice even when it’s logically sound.
  2. Endorsement of Impartial Beneficence: The high scores in beneficence suggest that models are trained to be “helpful” without boundaries. Unlike humans, who naturally prioritize family and friends (partiality), LLMs view every human as equally worthy of assistance, aligning with a very strict utilitarian ideal of neutrality.

This creates a paradox. We want AI to be useful in the real world, but the real world requires trade-offs. An AI that refuses to make any negative trade-off might become paralyzed in complex scenarios—for example, an autonomous car that refuses to swerve to save five pedestrians because swerving involves a risk to one passenger.

Conclusion

The Greatest Good Benchmark paper offers a crucial contribution to the field of AI safety. It moves us away from vague notions of “good AI” toward a quantifiable metric of moral preference.

The key takeaways are clear:

  • LLMs are consistent: They have stable moral preferences.
  • LLMs are not human-like: They are far more altruistic and far more harm-averse than the average person.
  • LLMs possess “Artificial Morality”: They occupy a unique philosophical space—high beneficence, low instrumental harm—that doesn’t neatly fit existing human moral theories.

As we continue to develop more powerful models, tools like the GGB will be essential. If we want AI to align with human values, we first need to decide which human values we mean. Do we want the average human’s morality, the strict utilitarian’s calculus, or this new, safe, but potentially paralyzed “artificial morality”? The choice is ours to engineer.