Large Language Models (LLMs) like ChatGPT are often presented as universal tools—omniscient assistants capable of conversing on any topic, in any language. However, when we peel back the layers of this “universality,” we often find a very specific worldview encoded in the system. For millions of English speakers around the world, ChatGPT does not act as a neutral mirror of their language; instead, it acts as a corrective lens, filtering out their cultural identity or, worse, reflecting a caricature back at them.

A recent paper by researchers at UC Berkeley, “Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination,” conducts a large-scale audit of how GPT-3.5 and GPT-4 handle ten different varieties of English. The results highlight a troubling dynamic: these models not only default to “Standard” American English, erasing other dialects, but when pushed to acknowledge those dialects, they often resort to harmful stereotypes.

In this deep dive, we will explore how the researchers designed this study, the specific ways in which the models failed minoritized English speakers, and why “better” models like GPT-4 might actually be making the problem of stereotyping worse.

The Illusion of “Standard” English

To understand the study, we first need to dismantle the idea of “Standard” English. In the field of linguistics, what is often called “Standard” (like Standard American English or Standard British English) is simply an idealized version of the language, usually associated with high-status groups and formal writing. It is not inherently “better” or “more correct” than other varieties; it just has more power.

Billions of people speak varieties of English that differ from these standards—varieties like African American English (AAE), Nigerian English, Singaporean English (“Singlish”), or Jamaican English. These are rich, rule-based linguistic systems with their own grammar and vocabulary.

The core problem the researchers sought to investigate is dialect bias. If an AI model is trained primarily on Standard American English (SAE), how does it treat users who speak Nigerian English? Does it understand them? Does it respect them? Or does it treat their way of speaking as an error to be fixed?

The Study Design: A Global Perspective

While previous research has looked at bias against African American English, this paper widens the lens significantly. The authors selected ten varieties of English to test:

  1. Standard Varieties: Standard American English (SAE), Standard British English (SBE).
  2. Minoritized Varieties: African American English (AAE), Indian English, Irish English, Jamaican English, Kenyan English, Nigerian English, Scottish English, and Singaporean English.

The researchers collected real-world writing samples from native speakers of these dialects—specifically looking for informal text like letters, emails, and messages where dialect features are most prominent.

The investigation was split into two distinct studies:

  1. Linguistic Analysis: Does the model write back in the same dialect, or does it switch to a standard form?
  2. Human Evaluation: How do native speakers feel about the AI’s responses? Are they offended, confused, or satisfied?

Study 1: The Erasure of Identity

In the first experiment, the researchers fed the models prompts written in specific dialects (e.g., a text in Kenyan English asking for travel advice) and analyzed the response. They were looking for feature retention. If the input uses a specific grammatical feature of Kenyan English (like omitting the article “a” in certain contexts), does the output do the same?
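
To make the measurement concrete, here is a minimal sketch of how a retention rate could be computed, assuming a small hand-written set of regex patterns per variety (the paper’s linguist-curated feature inventory is far richer):

```python
import re

# Hypothetical feature patterns for illustration only -- not the paper's inventory.
SINGAPOREAN_ENGLISH_FEATURES = {
    "discourse_particle_lah": re.compile(r"\blah\b", re.IGNORECASE),
    "discourse_particle_leh": re.compile(r"\bleh\b", re.IGNORECASE),
    "intensifier_damn": re.compile(r"\bdamn\b", re.IGNORECASE),
}

def retention_rate(prompt: str, response: str, features: dict) -> float:
    """Share of dialect features present in the prompt that also appear
    in the model's response."""
    in_prompt = [name for name, pattern in features.items() if pattern.search(prompt)]
    if not in_prompt:
        return float("nan")  # nothing to retain
    retained = [name for name in in_prompt if features[name].search(response)]
    return len(retained) / len(in_prompt)

prompt = "The food there damn good lah, you must try."
reply = "The food there is really good; you should definitely try it."
print(retention_rate(prompt, reply, SINGAPOREAN_ENGLISH_FEATURES))  # 0.0
```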

The results were stark. The models overwhelmingly default to Standard American English.

The “Americanization” Machine

When you speak to ChatGPT in a minoritized variety, it essentially “translates” your query into Standard American English before responding.

  • Standard American English (SAE): The model retained 78% of the distinctive linguistic features.
  • Standard British English (SBE): The model retained 72%.
  • Minoritized Varieties: Retention collapsed. For Indian English, it was 16%. For Irish, AAE, Scottish, and Singaporean English, retention was 3-4%. For Jamaican English, it was nearly 0%.

This means that if a user writes in Singaporean English using common particles like “lah” or specific sentence structures, ChatGPT ignores that style and responds like a corporate American assistant.

Why Does This Happen?

The researchers hypothesized that this erasure is a data problem. LLMs are trained on massive scrapes of the internet, where Standard American English is dominant.

Figure 2: Estimated maximum speaker population vs. retention rate for minoritized varieties.

As shown in Figure 2, there is a clear correlation between the estimated population of speakers (a proxy for the amount of data likely available) and the retention rate. Varieties with massive speaker populations, like Indian English, saw slightly better retention than varieties with smaller ones, like Jamaican English. However, even the “high” retention for minoritized varieties is negligible compared to the standard varieties.
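
The relationship in Figure 2 is the kind of thing a rank correlation captures. The sketch below uses the retention figures quoted above together with rough, purely illustrative speaker-population numbers (not the paper’s estimates):

```python
from scipy.stats import spearmanr

# Retention rates are the approximate figures quoted above; speaker populations
# are rough placeholders for illustration, NOT the paper's estimates.
varieties = {
    "Indian English":           (200_000_000, 0.16),
    "African American English": (30_000_000,  0.04),
    "Scottish English":         (5_000_000,   0.03),
    "Singaporean English":      (4_000_000,   0.04),
    "Irish English":            (4_000_000,   0.03),
    "Jamaican English":         (3_000_000,   0.00),
}

populations = [pop for pop, _ in varieties.values()]
retention = [ret for _, ret in varieties.values()]

rho, p_value = spearmanr(populations, retention)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```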

The Orthography Shift

It isn’t just grammar; it’s spelling (orthography) too. The legacy of British colonialism means that many global English varieties use British spelling (e.g., colour vs. color, analyse vs. analyze).

Figure 3: Change in % of examples using British, American, or either orthographic style from inputs to outputs.

Figure 3 illustrates a massive shift toward Americanization.

  • Blue bars (British spelling): Notice how large the blue sections are in the “Input” column for varieties like Indian, Irish, and Scottish English.
  • Red bars (American spelling): In the “Output” column, the blue shrinks and the red expands dramatically.

Even for Standard British English inputs, the model frequently flipped to American spelling. The AI is actively homogenizing the written English language, stripping away regional markers in favor of a US-centric norm.
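
A minimal sketch of how each input/output pair could be labeled as British, American, or unmarked, using a small hand-built list of spelling pairs (an assumption for illustration; the study’s actual lexicon is not reproduced here):

```python
import re

# Small illustrative list of (British, American) spelling pairs.
SPELLING_PAIRS = [
    ("colour", "color"),
    ("analyse", "analyze"),
    ("organise", "organize"),
    ("centre", "center"),
    ("travelling", "traveling"),
]

def count_hits(text: str, variants) -> int:
    """Count how many of the given spellings occur as whole words in the text."""
    return sum(bool(re.search(rf"\b{re.escape(v)}\b", text, re.IGNORECASE))
               for v in variants)

def orthography(text: str) -> str:
    """Label a text 'british', 'american', or 'either' from its spellings."""
    british = count_hits(text, (uk for uk, _ in SPELLING_PAIRS))
    american = count_hits(text, (us for _, us in SPELLING_PAIRS))
    if british and not american:
        return "british"
    if american and not british:
        return "american"
    return "either"  # unmarked or mixed

print(orthography("I love the colour of the city centre."))  # british
print(orthography("Let's analyze the color palette."))       # american
```

Comparing input labels to output labels over many pairs yields exactly the kind of shift Figure 3 visualizes.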

Study 2: The Human Experience

The linguistic analysis showed us what the model produces (American English). But how does that feel to the user?

To find out, the researchers recruited native speakers of all ten dialects. They showed these participants the AI’s responses and asked them to rate the text on qualities like warmth, naturalness, and—crucially—stereotyping and demeaning content.

The “Standard” Privilege

The difference in user experience was profound. Speakers of standard varieties (American and British) generally rated the responses as natural and polite. Speakers of minoritized varieties reported a much darker experience.

Figure 1: Sample model responses (top) and native speaker reactions (bottom).

Figure 1 provides a visual summary of the qualitative feedback.

  • Stereotyping: When the model attempts to acknowledge the dialect, it often swings too far. As seen in the top left, a response to Singaporean English uses exaggerated phrases like “damn jialat” and “Wah sian sia,” which native speakers found “cringeworthy.”
  • Distance: Conversely, as seen on the right, when the model ignores the dialect, it feels “cold” and “robotic.” A Jamaican speaker might write with warmth and intimacy, only to receive a sterile, formal corporate response.

Quantifying the Harms

The survey results painted a grim picture of the “value gap” between standard and minoritized users.

Figure 4: Average response ratings by variety (5-point scale).

Figure 4 breaks down these ratings. The red titles indicate negative qualities, while green indicates positive ones. The orange dotted line represents the baseline experience for Standard English speakers; the gaps below are measured relative to that baseline (a short sketch of the arithmetic follows the list).

  • Demeaning Content: Responses to minoritized varieties were rated 25% more demeaning than responses to standard varieties.
  • Stereotyping: They were 19% more stereotyping.
  • Condescension: They were 15% more condescending.
  • Comprehension: Users felt the model understood them 9% less than standard users did.
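
One plausible reading of these figures is a relative difference between mean ratings for the two groups; here is a minimal sketch of that arithmetic, using hypothetical 5-point means rather than the paper’s raw numbers:

```python
# Hypothetical mean "demeaning" ratings on a 1-5 scale, for illustration only.
mean_standard = 1.6      # average rating for responses to standard varieties
mean_minoritized = 2.0   # average rating for responses to minoritized varieties

# Relative increase of the minoritized mean over the standard baseline.
relative_increase = (mean_minoritized - mean_standard) / mean_standard
print(f"{relative_increase:.0%} more demeaning")  # 25% more demeaning
```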

The takeaway is clear: if you speak a minoritized variety, the “default” ChatGPT experience is not neutral. It is statistically more likely to make you feel misunderstood, looked down upon, or caricatured.

The Imitation Trap: When “Trying” Makes It Worse

You might ask: “Why not just prompt the model to speak the dialect?”

The researchers tried exactly that. They modified the system prompt to instruct ChatGPT: “Reply to the message as if you are the recipient. Match the sender’s dialect, formality, and tone.”
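
In practice, an instruction like that sits in the system message of a chat request. Here is a minimal sketch using the OpenAI Python client (the model name and user message are placeholders, not the authors’ exact evaluation harness):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # the study compared GPT-3.5 and GPT-4
    messages=[
        {
            "role": "system",
            "content": (
                "Reply to the message as if you are the recipient. "
                "Match the sender's dialect, formality, and tone."
            ),
        },
        {
            # Made-up example message in Singaporean English.
            "role": "user",
            "content": "Eh the queue at the hawker centre damn long leh, still want to go?",
        },
    ],
)

print(response.choices[0].message.content)
```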

They tested this on both GPT-3.5 and the more advanced GPT-4. The results revealed a dangerous nuance in how these models “learn.”

GPT-3.5: The Incompetent Mimic

When GPT-3.5 tried to imitate dialects like AAE or Nigerian English:

  • Comprehension dropped further. The model became so focused on “sounding” like the dialect that it lost track of what the user was actually saying.
  • Stereotyping increased. It began hallucinating dialect features that didn’t fit, creating an unnatural parody.

GPT-4: The Warm Caricature

GPT-4 is a much more powerful model. Did it fix the problem? Yes and no.

Figure 5: Change in ratings from GPT-3.5 (no imitation) to GPT-3.5 (imitation) to GPT-4 (imitation).

Figure 5 (specifically the bottom half) compares GPT-3.5’s imitation to GPT-4’s imitation.

  • The Good: GPT-4 was rated as significantly “warmer” and “friendlier” (green titles). It was better at capturing the vibe of the conversation.
  • The Bad: Look at the “Stereotyping” column (far left). GPT-4 exhibited a marked increase in stereotyping (+18%).

This is a critical finding. As models get “smarter,” they get better at picking up on the linguistic features of a dialect. But because their training data for these dialects is likely skewed toward stereotypes or limited contexts, their “improved” imitation ends up being a more sophisticated form of mockery. A user might feel the model is being friendlier, but simultaneously feel that it is performing a racist or classist caricature of them.

Rigorous Evaluation

It is worth noting the rigor with which the researchers collected this human feedback. They did not rely on automated metrics (which are themselves biased). They recruited distinct pools of native speakers for each dialect, asking detailed questions about specific emotional reactions.

Figure 8: Sample annotation form, part 1 (Jamaican English).

As shown in the sample survey form above (Figure 8), participants were asked nuanced questions, such as whether the response sounded like something a parent, friend, or grandparent would write. This level of granularity allowed the researchers to pinpoint exactly why a response felt “off”—whether it was too formal (like a boss) or awkwardly intimate.
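
One way to picture the resulting data is as a small structured record per judged response; the fields below are an illustrative guess at the kinds of questions described, not the paper’s actual form:

```python
from dataclasses import dataclass, field

@dataclass
class ResponseAnnotation:
    """Illustrative schema for one native speaker's judgment of one AI response."""
    variety: str                      # e.g. "Jamaican English"
    # 5-point Likert ratings (1 = not at all, 5 = very much)
    natural: int = 3
    warm: int = 3
    condescending: int = 3
    stereotyping: int = 3
    demeaning: int = 3
    comprehension: int = 3            # "the model understood the message"
    # "Who does this response sound like?" -- e.g. friend, parent, boss
    sounds_like: list[str] = field(default_factory=list)

example = ResponseAnnotation(
    variety="Jamaican English",
    natural=2, warm=2, condescending=4,
    sounds_like=["boss"],  # too formal and distant for the context
)
print(example)
```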

Conclusion: The Cost of a Monolingual Internet

The implications of this paper extend far beyond chatbots. As LLMs become the infrastructure for global communication—powering translation, email composition, and educational tools—biases against minoritized English varieties will become systemic barriers.

If a speaker of Nigerian English uses an LLM to help write a cover letter, the model might “correct” their perfectly valid dialect into Standard American English, reinforcing the idea that their way of speaking is unprofessional. Conversely, if they use a chatbot for mental health support, they might receive responses that are subtly condescending or stereotyping, breaking the trust necessary for such tools to work.

The researchers conclude that we cannot simply “scale” our way out of this problem. As the GPT-4 results showed, bigger models can just become more effective stereotyping engines. Addressing linguistic bias requires a fundamental shift in how we curate training data, moving away from a US-centric “standard” and recognizing the validity and richness of English as it is spoken by billions of people globally.

Until then, for much of the world, the “AI revolution” will continue to speak with an American accent.