Is Your LLM Culturally Biased or Just Confused? The Placebo Effect in AI Prompting
Large Language Models (LLMs) are the engines of the modern internet, but they have a well-documented problem: they tend to view the world through a Western, Anglo-centric lens. If you ask an LLM to judge a social situation or write a story, it usually defaults to American or European norms.
To fix this, researchers and engineers have turned to socio-demographic prompting. The idea is simple: if you want the model to think like a person from India, you preface your prompt with “You are a person from India.” If you want it to adopt Japanese etiquette, you might mention “Sushi” or “Hiroshi” in the context. This technique is used both to study bias (probing) and to force the model to behave differently (alignment).
But a new research paper titled “Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting” throws a wrench in this machinery. The researchers ask a provocative question: When a model changes its answer based on a cultural cue, is it actually simulating culture, or is it just randomly twitching because we changed the words in the prompt?
Just as medical trials need to distinguish between the effect of a drug and the placebo effect, AI research needs to distinguish between true cultural conditioning and random noise. In this post, we will dive deep into this paper to understand why current methods of measuring cultural bias might be fundamentally flawed.
1. The Problem: Measuring Culture in a Black Box
Before we get to the solution, we have to understand the status quo. Researchers currently treat LLMs like black boxes. To measure cultural bias, they use a technique called Culturally Conditioned Prompting.
The logic goes like this:
- Take a dataset of questions (e.g., social etiquette questions).
- Feed them to the LLM with a neutral prompt.
- Feed them again with a “conditioned” prompt (e.g., “As a person from Argentina…”).
- If the model’s accuracy on Argentina-specific questions improves when it is prompted with “Argentina,” the conditioning is deemed successful. If the model answers differently for “USA” vs. “China,” that difference is read as the “cultural gap” (a minimal code sketch of this loop follows the list).
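To make this protocol concrete, here is a minimal sketch of the measurement loop in Python. It is an illustration, not the paper's code: the `query_llm` callable stands in for whatever model client you use, and the prompt wording is an assumed template.

```python
from typing import Callable

def conditioned_accuracy(
    query_llm: Callable[[str], str],   # your model client, e.g. an API wrapper
    questions: list[str],
    gold_answers: list[str],
    country: str | None = None,
) -> float:
    """Accuracy with an optional 'As a person from <country>, ...' prefix."""
    correct = 0
    for question, gold in zip(questions, gold_answers):
        prefix = f"As a person from {country}, " if country else ""
        answer = query_llm(prefix + question)
        correct += int(answer.strip() == gold)
    return correct / len(questions)

# "Cultural gap": the difference in conditioned accuracy between two regions, e.g.
#   gap = conditioned_accuracy(client, qs, gold, "USA") \
#         - conditioned_accuracy(client, qs, gold, "China")
```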
However, LLMs are notoriously sensitive. Previous work has shown that trivial changes—like adding extra spaces or changing “human” to “person”—can drastically alter the output. This leads to the core hypothesis of this paper: The Placebo Hypothesis.
If we prompt a model with a meaningless cue—like “Your favorite house number is 44”—and the model’s answer changes just as much as when we say “You are from Morocco,” then we aren’t measuring culture. We are measuring the model’s instability.
2. Methodology: Designing the “Placebo” for AI
To test this, the researchers designed a rigorous experiment that mimics a Randomized Controlled Trial (RCT) in medicine. They needed to compare “Treatment” (Cultural cues) against “Placebo” (Non-cultural cues).
The Proxies
The researchers defined nine different “proxies”—categories of words used to condition the model. They sorted these by how culturally sensitive they are.
*Table 1 (from the paper): the nine conditioning proxies, ordered by how culturally grounded they are.*
As shown in Table 1 above, the proxies range from highly cultural to completely random:
- Cultural Proxies (The Treatment): Country, Name, Food, Kinship terms. These are strongly tied to specific regions (e.g., Japan \(\rightarrow\) Hiroshi \(\rightarrow\) Sushi).
- Non-Cultural Proxies (The Placebo): Disease, Hobby, Programming Language, Planet, House Number.
- House Number is the ultimate placebo: there is no logical reason why living at “House #14” versus “House #44” should change your answer to a biology question or a moral dilemma. (Both proxy groups are sketched as a small config after this list.)
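In code, the treatment-versus-placebo split is just two groups of cue values. The example strings below are illustrative picks in the spirit of the categories above, not the paper's curated lists:

```python
# Conditioning cues, grouped as in the proxy taxonomy above.
# Values are illustrative; the paper curates its own per-region lists
# (and also includes name and kinship proxies, omitted here for brevity).
CULTURAL_PROXIES = {
    "country": ["Japan", "Argentina", "Morocco"],
    "food": ["Sushi", "Asado", "Couscous"],
}

PLACEBO_PROXIES = {
    "hobby": ["chess", "gardening"],
    "programming_language": ["Java", "Python"],
    "planet": ["Mars", "Venus"],
    "house_number": ["14", "44"],
}
```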
The Datasets
They didn’t just vary the prompts; they also varied the tasks. They used four datasets with varying degrees of cultural relevance:
- EtiCor: A dataset of etiquette norms. (Highly Culturally Sensitive)
- CALI: Culturally Aware Natural Language Inference. (Highly Culturally Sensitive)
- ETHICS: Commonsense moral judgments. (Supposed to be Universal/Neutral)
- MMLU: High school/College exams in biology, math, etc. (Culturally Neutral)
The Prompt Structure
The team systematically generated prompts using a template system. This ensures that the only variable changing is the specific cue word.
*Figure 2 (from the paper): the prompt template, composed of a conditioning cue, an instruction, and the test question.*
As illustrated in Figure 2, the model receives a composition of:
- The Proxy/Cue: e.g., “A person’s favorite food is Sushi…”
- The Instruction: “Select the right answer…”
- The Question: The actual test question from MMLU or EtiCor.
They ran this setup across four major models: Llama-3-8B, Mistral-v0.2, GPT-3.5 Turbo, and GPT-4.
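A rough sketch of how such prompts could be assembled from the three parts. The template wording and the `build_prompt` helper are assumptions for illustration; the paper defines its own per-proxy templates:

```python
# Compose a prompt from (cue, instruction, question), mirroring Figure 2.
CUE_TEMPLATES = {
    "country": "Consider yourself a person from {value}. ",
    "food": "A person's favorite food is {value}. ",
    "house_number": "A person's favorite house number is {value}. ",
}

INSTRUCTION = "Select the right answer from the options below.\n"

def build_prompt(proxy: str, value: str, question: str) -> str:
    """Prepend the conditioning cue and the instruction to a test question."""
    cue = CUE_TEMPLATES[proxy].format(value=value)
    return cue + INSTRUCTION + question

# Example: build_prompt("food", "Sushi", mmlu_question)
```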
3. The Experiments: Chaos in the Machine
If the models were working correctly (i.e., truly understanding culture), we would expect two things:
- High variation in answers when using Cultural Proxies on Cultural Datasets (e.g., Region + EtiCor).
- Zero variation when using Placebo Proxies or on Neutral Datasets (e.g., House Number + Math).
Let’s look at what actually happened.
Visualizing the Inconsistency
First, the researchers mapped the accuracy of Llama-3 on the MMLU dataset (which covers STEM topics). Logic dictates that your knowledge of biology should not change based on what food you eat or what you call your aunt.
*Accuracy maps (from the paper): Llama-3 on MMLU, conditioned on Food (top) and Kinship (bottom).*
The maps above show the accuracy of Llama-3 on MMLU.
- Top Map (Food): Shows accuracy variations when prompted with different foods.
- Bottom Map (Kinship): Shows accuracy variations when prompted with kinship terms.
The takeaway: The maps are spotty and inconsistent. Even though the “Food” and “Kinship” terms are aligned (e.g., matching “Sushi” with the Japanese kinship term “Qi”), the model’s performance fluctuates wildly. If the model were truly simulating a specific cultural persona, these maps should look similar. Instead, they look like random noise.
Label Shifting: The “House Number” Effect
Ideally, for a neutral dataset like MMLU (science/math), the model should give the same answer regardless of the prompt. This is represented by the following expectation:
\[ y^{j} = y^{\phi} \]
This equation essentially says: The answer given with a cultural prompt (\(y^j\)) should be identical to the answer given with no prompt (\(y^\phi\)) on neutral data.
However, look at Figure 3 below. This chart tracks “Label Shift”—how often the model changes its answer from the default when a prompt is added.
*Figure 3 (from the paper): how often answers on MMLU shift from the unconditioned baseline, per region and its associated food.*
On the X-axis, we have regions (and their associated foods). The Y-axis is the number of times the answer changed.
- The Result: There are massive shifts. For example, mentioning “Biryani” (Food) causes the model to change its answer on about 14 out of 50 biology/math questions compared to the baseline.
- The Problem: This is MMLU. The answer to a math problem shouldn’t change because you ate Biryani. This indicates the model is reacting to the token “Biryani” unpredictably, not reasoning culturally (a sketch of how this shift is counted follows the list).
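Counting label shift is straightforward: compare each conditioned answer against the unconditioned baseline. A minimal sketch (the function name and inputs are assumptions, not the paper's code):

```python
def label_shift(baseline_answers: list[str], conditioned_answers: list[str]) -> int:
    """Number of questions whose answer changes once a cue is added.
    For a culturally neutral dataset like MMLU, the ideal value is 0."""
    return sum(
        int(base != cond)
        for base, cond in zip(baseline_answers, conditioned_answers)
    )

# Example: label_shift(answers_without_prompt, answers_with_biryani_cue)
```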
The Smoking Gun: Placebos vs. Culture
The most damning evidence comes when we compare the “Cultural” prompts against the “Non-Cultural” (Placebo) prompts.
*Figure 5 (from the paper): variance in answers for each proxy, across models and datasets.*
Figure 5 is dense but critical. It shows the Variance (inconsistency) of answers for different proxies.
- Look at Llama-3 (First column) and Mistral (Second column).
- They show high variance not just for Country and Religion, but for House Number and Programming Language.
- In fact, for Llama-3 on the ETHICS dataset (Red bars), the variance for Programming Language is nearly as high as for Country.
This is the placebo effect in action. The models vary their answers simply because there is extra text in the prompt: an irrelevant token like “Java” or “House #14” perturbs the model’s output distribution, and that perturbation looks like “bias” if you aren’t controlling for it.
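The placebo check itself boils down to computing, for each proxy, how much the answers vary as you sweep its cue values, and then comparing cultural proxies against placebo ones. Here is one simple inconsistency score as a sketch; it is a stand-in, not necessarily the paper's exact variance metric:

```python
from collections import Counter

def answer_inconsistency(answers_per_cue: dict[str, list[str]]) -> float:
    """Average fraction of cue values that disagree with the per-question
    majority answer. 0 means the answers are perfectly stable across cues."""
    cues = list(answers_per_cue)
    n_questions = len(answers_per_cue[cues[0]])
    total = 0.0
    for i in range(n_questions):
        answers = [answers_per_cue[cue][i] for cue in cues]
        majority_count = Counter(answers).most_common(1)[0][1]
        total += 1 - majority_count / len(answers)
    return total / n_questions

# If answer_inconsistency(food_answers) is no higher than
# answer_inconsistency(house_number_answers), the "cultural" variation
# is indistinguishable from the placebo.
```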
The Exception: GPT-4
If you look at the far right of Figure 5 (GPT-4), you see a different story. The bars are tiny. GPT-4 is robust: it does not change its answers to math problems based on your house number. This suggests that susceptibility to the placebo effect is inversely related to model capability. Smaller, less capable models (Llama-3-8B, Mistral) are easily distracted by irrelevant context.
4. Why Is This Happening? A Qualitative Look
The researchers didn’t just look at the numbers; they read the model outputs to understand the “reasoning” behind these errors.
The “Over-Thinking” Problem
For non-cultural proxies, models often tried to force a connection where none existed.
- Prompt: “As a learner of Java, how would you respond to this moral dilemma?”
- Llama’s Response: “As a learner of Java… the statement is grammatically correct… from a linguistic perspective it is acceptable.”
- Result: The model hallucinated a connection between programming syntax and moral acceptability, leading to a wrong answer.
Cultural Caricatures
The researchers noted that Llama-3 often adopted stereotypical “personas” when prompted with a region.
- Prompt: “Consider yourself a person from Jamaica…”
- Llama Response: “Wah gwaan? Me a-go choose de right answer, mon!”
While this looks like “cultural alignment,” the previous results show that this persona adoption is superficial. It changes the style of the text, but the underlying reasoning logic is brittle and susceptible to random noise. Mistral, on the other hand, did not generate these accents but was equally unstable in its answering logic.
Cross-Model Consistency
Do different models fail in the same way? If “Sushi” implies a specific cultural bias, both Llama and Mistral should shift in the same direction.
*Figure 6 (from the paper): per-proxy variance of one model plotted against another.*
Figure 6 compares the variance of two models against each other.
- If both models were capturing the same cultural signal, the dots would line up along the diagonal (x=y).
- Instead, the dots are clustered near the origin or scattered randomly.
- Conclusion: The “cultural bias” exhibited by one model is completely different from the bias exhibited by another. It is random, model-specific noise, not a shared representation of human culture (a sketch of this agreement check follows the list).
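That cross-model check can be approximated with a simple correlation between the two models' per-cue scores. A sketch, assuming you already have per-cue inconsistency or label-shift numbers for each model:

```python
from statistics import correlation  # Python 3.10+

def cross_model_agreement(scores_a: dict[str, float],
                          scores_b: dict[str, float]) -> float:
    """Pearson correlation of per-cue scores between two models.
    A genuinely shared cultural signal would show up as a strong positive value."""
    cues = sorted(set(scores_a) & set(scores_b))
    return correlation([scores_a[c] for c in cues],
                       [scores_b[c] for c in cues])

# Example: cross_model_agreement(llama_shift_per_food, mistral_shift_per_food)
```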
5. Conclusion: We Need Better Control Experiments
The findings of this paper are a wake-up call for the field of AI Ethics and Safety.
For years, researchers have assumed that if a model changes its answer when prompted with “China” vs. “USA,” it reveals deep-seated cultural representations. This paper argues that for many models, that variation is a mirage—a Placebo Effect caused by the model’s inability to robustly handle prompt perturbations.
Key Takeaways:
- Noise Masquerading as Signal: Much of what looks like “cultural sensitivity” in open-source models (Llama, Mistral) is actually just sensitivity to random tokens (House Numbers, Planets).
- GPT-4 stands alone: Currently, only the most powerful models (GPT-4) seem robust enough to ignore placebo prompts.
- Methodological Flaw: Future studies on bias must include placebo controls. If you are testing for gender bias, you must also test for “favorite color bias” or “house number bias” to establish a baseline of random variation (a minimal sketch of such a control follows the list).
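In practice, that placebo control could be as simple as refusing to call something a "bias" unless it clearly exceeds the variation produced by meaningless cues. A hedged sketch (the helper name and the margin are illustrative choices, not values from the paper):

```python
def passes_placebo_control(cultural_score: float,
                           placebo_scores: list[float],
                           margin: float = 0.05) -> bool:
    """Treat variation as a cultural signal only if it clearly exceeds the
    variation produced by meaningless cues (house numbers, planets, ...).
    The margin is an illustrative threshold, not a value from the paper."""
    placebo_baseline = max(placebo_scores)
    return cultural_score > placebo_baseline + margin

# Example: report a "gender gap" only if
# passes_placebo_control(gender_gap, [house_number_gap, planet_gap, color_gap])
```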
If we want to build AI that truly understands the world’s diverse cultures, we first need to stop fooling ourselves with prompts that act as placebos. We need models that understand context, not just models that twitch when you poke them with a keyword.