Large Language Models (LLMs) are often celebrated as universal tools, capable of translating languages and answering questions about the world. However, anyone who has used these models extensively knows that “universal” often really means “Western.”

When you ask an LLM to tell a story about a family dinner, the default setting usually mirrors North American or Western European norms. The food, the etiquette, and the social dynamics often fail to resonate with users from Ethiopia, Indonesia, or Mexico. This isn’t just a matter of flavor; it’s a matter of utility and representation. LLMs are prone to stereotyping, erasing cultural nuances, or simply hallucinating facts about non-Western cultures because their training data is heavily skewed toward English-speaking, Western internet content.

So, how do we fix this? Do we retrain the models entirely? Or can we “teach” them culture on the fly by giving them the right textbooks?

In the paper “Towards Geo-Culturally Grounded LLM Generations,” researchers from Google and Washington University in St. Louis investigate a promising solution: Retrieval Augmented Generation (RAG). They pit two distinct grounding strategies against each other: a carefully curated “Bespoke Knowledge Base” versus the chaotic, vast ocean of “Google Search.”

The results offer a fascinating, and somewhat cautionary, tale about the difference between knowing facts about a culture and actually understanding it.

The Core Problem: The “Ghost in the Machine” Has an Accent

Before diving into the solution, we must understand the bottleneck. LLMs are trained on massive corpora of text (like Common Crawl). By volume, Western perspectives—specifically US and UK norms—dominate this data. When a model predicts the next word in a sentence, it gravitates toward the statistically most probable completion, which is usually the Western one.

This manifests in three main failures:

  1. Stereotyping: Reducing complex cultures to caricatures.
  2. Erasure: Replacing specific local practices with generic Western ones.
  3. Hallucination: Confidently stating incorrect facts about local artifacts or institutions.

The researchers propose that instead of relying on the model’s frozen, internal parameters (which are hard to change), we should provide the model with external, culturally relevant context at the moment of generation. This is where Grounding comes in.

Two Paths to Grounding: The Library vs. The Internet

The researchers explored two primary methods to inject culture into the model. To understand the comparison, imagine you are trying to learn about a specific ceremony in rural Thailand.

  1. KB-Grounding (The Library): You look up the answer in a specific set of encyclopedias and cultural handbooks you have on your desk.
  2. Search-Grounding (The Internet): You Google it.

Let’s break down how the researchers implemented these two strategies technically.

1. The Knowledge Base (KB) Grounding Strategy

For the first strategy, the authors built a Bespoke Cultural Knowledge Base. They didn’t just scrape the web randomly; they curated data from four specific, high-quality sources designed to capture cultural nuances.

Table 1 showing the sources of documents in the cultural KB including CultureAtlas, Cube, CultureBank, and SeeGULL.

As shown in Table 1, the KB is composed of:

  • CultureAtlas: Wikipedia-style text about cultural norms.
  • Cube: A dataset of artifacts (foods, landmarks, art).
  • CultureBank: Descriptions of situation-based practices (e.g., how pedestrians cross the street in Vietnam).
  • SeeGULL: A dataset of stereotypes (included intentionally to test if the model can identify and avoid them).

To make this data usable for an LLM, the researchers used a standard RAG pipeline.

Figure 1 illustrating the workflow for Knowledge Base grounding versus Search grounding.

Figure 1 (Top) illustrates the KB-Grounding architecture. Here is the step-by-step process:

  1. Query Rewriting: The user’s prompt is transformed into a search query.
  2. Retrieval: The system searches the vector store (the KB) for the \(n\) most similar documents.
  3. Relevancy Check (Selective RAG): This is a crucial step. The model doesn’t just blindly stuff the retrieved documents into the context window. It first checks: Is this document actually relevant to the question? If the document is irrelevant, it is discarded. This prevents confusing the model with noise.
  4. Augmentation: The relevant documents are added to the prompt.
  5. Generation: The LLM answers the question using the new context.
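
To make the pipeline above concrete, here is a minimal sketch of selective KB grounding. The llm() helper, the embedding model, and the prompt wording are all assumptions for illustration; the paper does not specify these implementation details.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical helper: plug in any chat/completion client here.
def llm(prompt: str) -> str:
    raise NotImplementedError("call your preferred LLM API")

# Assumed embedding model; the paper does not specify this choice.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def selective_rag_answer(question: str, kb_texts: list[str],
                         kb_vecs: np.ndarray, n: int = 5) -> str:
    # 1. Query rewriting: turn the user prompt into a retrieval query.
    query = llm(f"Rewrite this as a short search query: {question}")

    # 2. Retrieval: cosine similarity against unit-normalized KB embeddings, keep top-n.
    q = encoder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(kb_vecs @ q)[::-1][:n]
    candidates = [kb_texts[i] for i in top]

    # 3. Relevancy check (selective RAG): drop documents the model judges irrelevant.
    relevant = [doc for doc in candidates
                if llm(f"Question: {question}\nDocument: {doc}\n"
                       "Is the document relevant to the question? Answer yes or no.")
                .strip().lower().startswith("yes")]

    # 4.-5. Augmentation and generation: answer using whatever survived the filter.
    context = "\n".join(relevant) if relevant else "(no relevant documents retrieved)"
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```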

2. The Search Grounding Strategy

The second strategy, shown in the bottom half of Figure 1, relies on Search Grounding.

This method uses a commercial API (Google’s “Grounding with Google Search”). Instead of querying a static vector store, the model:

  1. Translates the user prompt into a web search query.
  2. Uses the search engine’s proprietary ranking algorithms to find live web pages.
  3. Extracts relevant text.
  4. Feeds that text into the LLM to generate an answer.
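
Because the production system relies on Google’s proprietary grounding endpoint, the snippet below is only a schematic sketch of the flow, reusing the hypothetical llm() helper from the previous sketch. The web_search() function stands in for whatever search client is available; it is not the actual “Grounding with Google Search” interface.

```python
def web_search(query: str, k: int = 5) -> list[str]:
    """Hypothetical wrapper around a web search API, returning text snippets
    from the top-k ranked pages."""
    raise NotImplementedError("plug in a search client here")

def search_grounded_answer(question: str) -> str:
    # 1. Translate the user prompt into a web search query.
    query = llm(f"Rewrite this as a web search query: {question}")

    # 2.-3. Let the search engine rank live pages and extract relevant text.
    snippets = web_search(query)

    # 4. Feed the retrieved text to the LLM as grounding context.
    context = "\n".join(snippets)
    return llm(f"Context from the web:\n{context}\n\nQuestion: {question}\nAnswer:")
```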

The Trade-off:

  • KB-Grounding offers control. You know exactly what is in your database. However, it is small (limited coverage).
  • Search-Grounding offers scale. It has access to the entire internet. However, it is noisy, potentially biased, and the retrieval logic is a “black box.”

What Does the Data Look Like?

To understand why the models behaved the way they did, we have to look at the “textbooks” they were given. The Bespoke KB contained specific cultural facts converted into simple sentences.

Table 2 providing examples of documents in the bespoke knowledge base, such as facts about Assam culture or Brazilian cuisine.

As seen in Table 2, the KB contains granular details, such as “Manihot esculenta originates from Brazilian cuisine” or specific stereotypes from the SeeGULL dataset like “Mexicans are unintelligent.”
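
A knowledge base like the one behind the retrieval sketch above can be built directly from one-sentence facts of this kind. Below is a minimal, assumed indexing step (using the same illustrative embedding model as before); the two example documents are paraphrased from Table 2, and the source tags are just bookkeeping.

```python
# Example documents in the style of Table 2, tagged with their source dataset.
kb_docs = [
    ("Cube", "Manihot esculenta originates from Brazilian cuisine."),
    ("SeeGULL", "Stereotype: Mexicans are unintelligent."),  # deliberately kept, see below
]

kb_texts = [text for _, text in kb_docs]
# Embed once up front; normalized rows make the dot product equal cosine similarity.
kb_vecs = encoder.encode(kb_texts, normalize_embeddings=True)
```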

Wait, why include stereotypes? The researchers added stereotypes to the KB to see whether the model would accidentally affirm them when they were retrieved, or whether it would recognize them as biases to be avoided. This leads us directly to the experiments.

Experiment 1: Who Knows More? (Cultural Knowledge)

The first set of evaluations tested Propositional Knowledge—essentially, trivia. Can the model answer multiple-choice questions about daily life, sports, and social norms in different countries?

They used two benchmarks:

  1. BLEnD: Everyday knowledge (e.g., “What is the most popular fruit in the UK?”).
  2. NORMAD: Social norms (e.g., “Is it acceptable to open a gift immediately upon receiving it in this country?”).

They tested three models: Gemini 1.5 Flash, GPT-4o-mini, and OLMo 2.
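
A rough sketch of how this kind of multiple-choice evaluation can be scored is shown below; the prompt template and answer parsing are assumptions for illustration, not the benchmarks’ official harnesses.

```python
def score_multiple_choice(examples: list[dict], answer_fn) -> float:
    """examples: dicts with 'question', 'options' (list of strings), and
    'answer' (index of the correct option).
    answer_fn: e.g. search_grounded_answer, selective_rag_answer, or plain llm."""
    correct = 0
    for ex in examples:
        letters = "ABCDEFGH"[: len(ex["options"])]
        options = "\n".join(f"({l}) {opt}" for l, opt in zip(letters, ex["options"]))
        reply = answer_fn(f"{ex['question']}\n{options}\nAnswer with a single letter.")
        # Crude parsing: take the first option letter that appears in the reply.
        picked = next((l for l in letters if l in reply.upper()), None)
        correct += int(picked == letters[ex["answer"]])
    return correct / len(examples)
```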

The Results

Figure 2 showing bar charts of performance across BLEnD, NORMAD, and Stereotype Avoidance tasks.

Figure 2 reveals the winner clearly. Look at the orange bars in the first three charts.

  1. Search Grounding dominates for Fact Retrieval: For the BLEnD dataset (far left), the Search-Grounded Gemini (Orange Bar) significantly outperforms the Vanilla model (Teal) and the KB-grounded versions.
  • Why? The internet is simply bigger than any bespoke dataset. In one example, the vanilla model thought “Real Madrid” was the most popular sports team in Ethiopia (a hallucination based on global popularity). The Search-grounded model correctly identified “Ethiopian Coffee Sport Club” because it could find that specific fact on the live web.
  2. KB Grounding struggles with Coverage: The KB strategies (Pink and Green bars) often performed only slightly better, or sometimes worse, than the vanilla model. This was largely due to “retrieval failure.” If the specific fact about Ethiopian coffee wasn’t in the curated Cube or CultureAtlas datasets, the model had no extra help.

  3. Context Matters for Norms: In the NORMAD benchmarks (middle charts), Search Grounding was still superior for Gemini. However, for GPT and OLMo (where search wasn’t available), the Selective KB grounding (Green) was often better than Non-Selective. This validates the importance of the “Relevancy Check” step—filtering out bad documents is just as important as finding good ones.

Experiment 2: The Stereotype Trap

If Search Grounding is so good at facts, should we just use it for everything? Absolutely not.

The researchers ran a “Stereotype Avoidance” test using the SeeGULL benchmark. The task was to ask the model questions based on stereotypes (e.g., “I know a person who is a fighter. This person is: (A) Albanian…”). The correct answer is always “unsure” or refusal to answer.
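
Scoring this task comes down to detecting abstention. Here is a minimal sketch, assuming a simple keyword heuristic (the paper’s actual refusal detection is not reproduced here):

```python
def avoided_stereotype(reply: str) -> bool:
    """Return True only if the model abstains instead of picking an identity group.
    The keyword list is an illustrative heuristic, not the paper's scoring rule."""
    reply = reply.lower()
    abstain_markers = ("unsure", "not sure", "cannot determine", "can't determine",
                       "prefer not to answer", "refuse")
    return any(marker in reply for marker in abstain_markers)
```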

Look at the Stereotype Avoidance chart in Figure 2 (far right).

  • The Vanilla Gemini (Teal) has a very high score (meaning it successfully avoided stereotypes).
  • The Search-Grounded Gemini (Orange) drops significantly in performance.

This is a critical finding. When the model searches the internet, it retrieves the very biases prevalent on the web. If the model searches for “fighters” and “Albanian,” it might find forum posts or articles reinforcing that stereotype. Upon seeing this “evidence” in its context window, the LLM is tricked into treating the stereotype as a fact, leading it to select the stereotypical answer rather than remaining neutral.

While Search makes the model smarter about facts, it makes it more susceptible to the biases of the open web.

Experiment 3: The Human Evaluation (The Plot Twist)

The final experiment is perhaps the most revealing. The researchers moved beyond multiple-choice questions to Open-Ended Generation.

They asked the models to “Tell a story in Mexico in which a group of people… eat together and behave in a socially acceptable way.” They then recruited human evaluators from those specific countries (Mexico, Ethiopia, China, etc.) to rate the stories on Cultural Familiarity.

The hypothesis was simple: The models with access to cultural data (KB or Search) should write stories that feel more familiar and authentic to locals.

The Result: No Significant Difference

Surprisingly, the human evaluators did not rate the grounded stories significantly higher than the vanilla stories.

  • Search-Grounded models occasionally tried too hard. Instead of weaving a natural story, the model would sometimes act like a search engine, summarizing facts about the culture rather than narrating within it.
  • KB-Grounded models included specific artifacts (mentioning specific dishes or games), but this didn’t necessarily translate to “cultural fluency.”

Qualitatively, the researchers noted that grounded models did include more specific details (e.g., naming a specific local dish rather than just “dinner”). However, to a human reader, inserting the name of a local dish doesn’t make a story feel “native.” It just makes it factually dense.

Conclusion: Knowledge vs. Fluency

This paper highlights a fundamental distinction in the quest for culturally aware AI: Propositional Knowledge vs. Cultural Fluency.

Propositional Knowledge is the ability to answer “What is the capital?” or “What is this food called?”

  • RAG (specifically Search Grounding) excels here. It fills the model’s knowledge gaps with up-to-date facts.

Cultural Fluency is the ability to speak, reason, and tell stories like someone from that culture.

  • RAG struggles here. Retrieving a Wikipedia article about a wedding ceremony and pasting it into the prompt does not teach the model the subtle emotional dynamics, the slang, or the “vibe” of being at that wedding.

Key Takeaways for Students

  1. Search is a double-edged sword: It provides the best factual coverage but introduces a high risk of ingesting and regurgitating web-based stereotypes.
  2. Relevancy Checks are mandatory: Simply retrieving data isn’t enough; you must filter it. In the experiments, feeding the model irrelevant documents (Non-selective RAG) often confused it, lowering performance.
  3. Facts \(\neq\) Culture: You cannot solve the “cultural gap” in AI simply by hooking a model up to an encyclopedia. While it fixes factual errors (hallucinations), it does not solve the deeper issue of cultural erasure or lack of fluency.

Future work in this field will likely need to move beyond simple text retrieval. To truly “ground” a model in a culture, we may need better datasets that capture the experience of a culture, not just its facts, or perhaps training regimes that prioritize non-English, non-Western data from the start. Until then, RAG remains a useful, but imperfect, patch on a global problem.