Introduction
Imagine you ask an AI assistant a simple question: “How should I dry herbs for homemade oil?”
If you are in the United States, you might expect an answer involving a food dehydrator or an oven. However, if you are in Ghana, the “commonsense” answer—the one that feels intuitively correct to the majority of people—would likely be to dry them in the sun in a basket.
This scenario highlights a critical blind spot in modern Artificial Intelligence. While Large Language Models (LLMs) like GPT-4 and Llama have demonstrated incredible prowess in reasoning, their definition of “commonsense” is often skewed. Because these models are trained primarily on data scraped from the Western web, they tend to treat Western (and specifically American) cultural norms as the universal default.
In the research paper “Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S.,” researchers Christabel Acquaye, Haozhe An, and Rachel Rudinger from the University of Maryland tackle this issue head-on. They introduce a new dataset called AMAMMERE (from the Akan word for ‘culture’) designed to probe the cultural adaptability of English LLMs.

As shown in Figure 1, the dataset challenges models to recognize that “correct” is often a matter of geography. By rigorously comparing how models handle Ghanaian versus American cultural contexts, the researchers reveal significant gaps in how AI serves non-Western populations.
The Problem: Whose Commonsense is it Anyway?
Commonsense reasoning is a holy grail in AI research. It refers to the ability to make presumptions about the world that aren’t explicitly stated. For example, if someone drops a glass, commonsense tells us it might break.
To test this, the AI community has developed massive benchmarks like CSQA (Commonsense Question Answering) and SIQA (Social Intelligence Question Answering). However, these benchmarks are usually created by researchers in Western institutions using crowdsourced workers who are also largely Western.
The result is a feedback loop of bias. The models are trained on Western data, tested on Western benchmarks, and optimized for Western users. This leaves “low-resource cultures”—those with less representation in digital data, such as Ghanaian culture—at a disadvantage. When a Ghanaian user interacts with these models, the AI might fail to understand local social norms, practical knowledge, or cultural references.
The researchers hypothesize that current datasets contain implicit Western biases. To prove this, they needed to build a new kind of test—one that treats cultural knowledge not as a monolith, but as a comparative study between two distinct groups: Ghana and the United States.
The Methodology: Constructing AMAMMERE
Creating a culturally fair dataset is difficult. You cannot simply translate American questions into a local language; the underlying concepts themselves might not translate. The researchers adopted a method based on Cultural Consensus Theory. This theory posits that a “culturally correct” answer is one that reflects the shared consensus of a specific group.
To build the AMAMMERE dataset, the team followed a rigorous, multi-stage “human-in-the-loop” pipeline involving participants from both Ghana and the U.S. at every step.

Step 1: Question Selection and Disambiguation
The process began by selecting 200 questions from existing popular datasets (CSQA, SIQA, and PIQA). The lead author, who is Ghanaian, specifically chose questions where she expected a divergence in cultural norms.
These questions were then rewritten into three versions:
- Unspecified: All cultural markers were removed.
- Ghana Specified: The context was adapted to Ghana (e.g., using names like “Kpakpo” or terms like “pesewas” for currency).
- US Specified: The context was adapted to the U.S. (e.g., using names like “Zach” or “pennies”).
This “disambiguation” step ensures that the model is tested on its ability to recognize context cues.
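To make the structure concrete, here is a minimal sketch of how one disambiguated question record might be represented. The schema and the example wording are hypothetical, not drawn from the dataset itself; only the names and currency terms (Kpakpo, pesewas, Zach, pennies) come from the description above.

```python
from dataclasses import dataclass

@dataclass
class DisambiguatedQuestion:
    """One source question rewritten into three cultural variants (hypothetical schema)."""
    source_dataset: str   # e.g. "CSQA", "SIQA", or "PIQA"
    unspecified: str      # all cultural markers removed
    ghana_specified: str  # adapted with Ghanaian names, currency, and terms
    us_specified: str     # adapted with American names, currency, and terms

example = DisambiguatedQuestion(
    source_dataset="CSQA",
    unspecified="Where would a child keep the coins they are saving?",
    ghana_specified="Where would Kpakpo keep the pesewas he is saving?",
    us_specified="Where would Zach keep the pennies he is saving?",
)
```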
Step 2: Participatory Answer Generation
Unlike many datasets where researchers write the answers, this project asked people from the culture to generate them. The team recruited separate pools of Ghanaian and American volunteers.
Participants were given the context and asked to write:
- A Correct Answer (culturally appropriate).
- A Distractor Answer (incorrect but plausible enough to be tricky).

As seen in the survey sample above, a Ghanaian participant might describe cutting chicken for a “Bronya” (Christmas) meal differently than an American participant would for their holiday meal.

Conversely, American participants provided answers rooted in their own traditions (Figure 17). This ensured that the “correct” answers were organic and truly representative of the culture, rather than stereotypes imagined by outsiders.
Step 3: Likert Scale Annotation (Measuring Consensus)
Once the answers were generated, the researchers needed to verify them. They didn’t just ask “is this right or wrong?” Instead, they asked a new set of participants to rate the plausibility of the answers on a 5-point Likert scale.

This step is crucial for establishing cultural consensus. An answer is only deemed “correct” for the dataset if it receives a high agreement score from members of that culture. For example, in the Ghanaian context, a “susu box” (a traditional savings box) received a high consensus rating, much as a “piggy bank” does in the U.S.
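A minimal sketch of how cultural consensus might be derived from those ratings, assuming a simple mean-rating threshold. The threshold value and function names are illustrative; the paper's exact aggregation rule may differ.

```python
from statistics import mean

def consensus_score(ratings: list[int]) -> float:
    """Average a list of 1-5 Likert ratings from annotators of one culture."""
    return mean(ratings)

def is_consensus_correct(ratings: list[int], threshold: float = 4.0) -> bool:
    """Treat an answer as culturally 'correct' only if its mean rating clears the
    threshold. The threshold here is illustrative, not the paper's exact criterion."""
    return consensus_score(ratings) >= threshold

# Ghanaian annotators rating "a susu box" as a place to keep savings:
print(is_consensus_correct([5, 5, 4, 5, 4]))  # True: strong within-culture agreement
```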
Step 4: Final Validation
The final multiple-choice questions (MCQs) were constructed by pairing the highest-rated (correct) and lowest-rated (distractor) answers from both cultures. This resulted in questions with four options:
- Ghana Correct
- Ghana Distractor
- US Correct
- US Distractor
Finally, these constructed MCQs were validated one last time by human annotators to ensure quality and agreement.
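As a rough sketch of the pairing logic described in this step, the function below assembles the four options from rated answer pools and shuffles their order. The function and field names are hypothetical; the paper's actual construction procedure may differ in detail.

```python
import random

def build_mcq(question: str,
              gh_answers: list[tuple[str, float]],
              us_answers: list[tuple[str, float]]) -> dict:
    """Pair the highest-rated (correct) and lowest-rated (distractor) answers
    from each culture into one four-option MCQ. Illustrative sketch only."""
    gh_sorted = sorted(gh_answers, key=lambda a: a[1])  # sort by consensus rating
    us_sorted = sorted(us_answers, key=lambda a: a[1])
    options = {
        "ghana_correct": gh_sorted[-1][0],
        "ghana_distractor": gh_sorted[0][0],
        "us_correct": us_sorted[-1][0],
        "us_distractor": us_sorted[0][0],
    }
    order = list(options.items())
    random.shuffle(order)  # shuffle so option position carries no signal
    return {"question": question, "options": order}
```

Shuffling the option order prevents a model from exploiting positional patterns rather than the content of the answers.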

Experiments: How Do LLMs Perform?
With the dataset of 525 questions finalized, the researchers tested a variety of models. These included encoder models like BERT and RoBERTa, and generative LLMs like Llama-2, Llama-3, Gemma, and Mistral.
They designed three specific experimental setups to probe different aspects of model behavior.
Experiment 1: The “Unspecified” Setting (Measuring Bias)
In this setup, models were given the question without any cultural markers. The goal was to see which culture the model treats as the “default.” If the model is neutral, it shouldn’t strongly prefer one culture’s correct answer over the other.

The Results: As shown in Table 1 (specifically the “Question-and-Answers” columns), models overwhelmingly preferred answer choices that aligned with U.S. preferences.
- RoBERTa-base chose US correct answers 51.43% of the time, compared to only 30.29% for Ghana correct answers.
- Llama3-70B showed an even stronger bias, selecting US answers 68.57% of the time versus 23.43% for Ghana.
This confirms the hypothesis: When an LLM doesn’t know the context, it assumes the user is American.
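To illustrate how such a default preference could be probed, here is a minimal sketch that scores each option by its average token log-likelihood under a small causal LM and picks the highest. The model name ("gpt2") and the likelihood-based scoring heuristic are placeholders; the paper's actual prompting and evaluation setup for each model family may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in; the paper evaluates models such as Llama-2/3, Gemma, and Mistral.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_logprob(text: str) -> float:
    """Average per-token log-likelihood of a string under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is mean cross-entropy over the sequence

def preferred_option(question: str, options: dict[str, str]) -> str:
    """Return the label ('us_correct', 'ghana_correct', ...) whose text the model finds most likely."""
    return max(options, key=lambda label: avg_logprob(question + " " + options[label]))
```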
Experiment 2: The “Specified” Setting (Measuring Adaptability)
Next, the researchers fed the models the culturally specified versions of the questions (e.g., explicitly mentioning “Ghana” or using Ghanaian names). A “culturally adaptable” model should recognize these cues and switch its preference to the Ghanaian answer.
Ghana Specified Contexts:

When the context was explicitly Ghanaian (Figure 3), the models did improve. Llama3-70B successfully selected the Ghanaian correct answer 60% of the time. Even so, a significant portion of the time the models still chose American answers or fell for distractors despite the explicit context cues.
US Specified Contexts:

In contrast, looking at Figure 4, when the context was American, the models performed significantly better. Llama3-70B achieved 77% accuracy.
The comparison reveals a clear performance gap. Even when the model knows it is talking about Ghana, it is less accurate than when it is talking about the U.S. It struggles to recall or reason about Ghanaian cultural norms as effectively as American ones.
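Quantifying that gap is straightforward once per-context predictions are available; the snippet below uses dummy predictions purely for illustration.

```python
def accuracy(predictions: list[str], gold_label: str) -> float:
    """Fraction of questions where the model picked the culturally correct option."""
    return sum(p == gold_label for p in predictions) / len(predictions)

# Dummy predictions standing in for real model outputs (illustrative only).
gh_context_preds = ["ghana_correct", "us_correct", "ghana_correct", "ghana_distractor", "ghana_correct"]
us_context_preds = ["us_correct", "us_correct", "us_correct", "ghana_correct", "us_correct"]

gap = accuracy(us_context_preds, "us_correct") - accuracy(gh_context_preds, "ghana_correct")
print(f"Adaptability gap: {gap:.0%}")  # positive gap = the model recalls US norms more reliably
```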
Experiment 3: Correct-Only and Cultural Facets
The researchers also broke down performance by specific topics, or “facets,” such as Food, Social Customs, and Architecture.

Table 2 reveals some fascinating nuances. Llama3-70B performed decently on Ghanaian “Geography” (78% accuracy), likely because geographic knowledge consists of fixed, objective facts that are well represented in training data.
However, looking at “Social Customs and Lifestyle,” accuracy drops to 52% for Ghana, while maintaining 70% for the US. This suggests that LLMs struggle most with the subtle, unwritten rules of daily life in non-Western cultures—exactly the kind of “commonsense” this paper aims to measure.
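A per-facet breakdown like Table 2 can be reproduced with a simple group-by over per-question results. The column names and the toy data below are illustrative, not the paper's actual results.

```python
import pandas as pd

# Hypothetical per-question results; 'facet' and column names are illustrative.
results = pd.DataFrame({
    "facet": ["Geography", "Geography",
              "Social Customs and Lifestyle", "Social Customs and Lifestyle"],
    "context": ["GH", "US", "GH", "US"],
    "correct": [True, True, False, True],
})

# Accuracy broken down by cultural facet and context, in the spirit of Table 2.
facet_accuracy = results.groupby(["facet", "context"])["correct"].mean().unstack("context")
print(facet_accuracy)
```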
The researchers also ran a “Correct-Only” experiment, where they removed the distractors and forced the model to choose between the Ghana Correct Answer and the US Correct Answer.

Figure 7 reinforces the previous findings. In the “Unspecified” (UN) setting, the blue bars (US preference) dominate. In the “GH Specified” setting, the orange bars (Ghana preference) grow, showing adaptability, but the US preference remains stubbornly high for models like BERT and RoBERTa.
Qualitative Analysis: The “Bronya” Example
To truly understand what the models are missing, we can look at a specific example from the paper regarding Christmas celebrations. In Ghana, Christmas is often referred to as “Bronya.”
Context: “This person is married with two little kids.”
Question: “How can Kojo make Bronya feel more magical for his family?”
Options:
- A. Decorate the Christmas tree with lots of presents underneath… (US Consensus)
- C. Ensure there is enough food, drinks and fun games… (Ghana Consensus)
In the U.S., the “magical” element of Christmas is heavily associated with decorations and piles of presents. In Ghana, while decorations exist, the cultural emphasis is heavily placed on the communal aspect: abundance of food, drinks, and festivities.
When prompted, 5 out of 7 models selected Option A (the tree and presents), even though the prompt used the name “Kojo” and the term “Bronya.” The models recognized the concept of Christmas but failed to map the cultural signifiers (Kojo/Bronya) to the specific Ghanaian practice of prioritizing feasting over decor. They defaulted to the Western “script” for Christmas.
Conclusion and Implications
The AMAMMERE dataset and the accompanying study provide compelling evidence that “commonsense” in AI is currently a synonym for “American norms.”
The key takeaways from the research are:
- Default Bias: In the absence of context, models default to American cultural norms.
- Adaptability Gap: While advanced models like Llama-3 can adapt when explicitly told the cultural context, they are still significantly less accurate for Ghana than for the U.S.
- Participatory Importance: You cannot build a cultural evaluation benchmark without the people from that culture. The multi-stage human annotation process was vital to capturing the difference between a “Susu box” and a “Piggy bank.”
As AI becomes a global utility, embedded in phones and browsers across Africa and the world, this bias matters. A model that misunderstands social customs, food norms, or household practices can be frustrating, useless, or even offensive to users outside the West.
This paper underscores the need for more datasets like AMAMMERE—resources that go beyond translation and dive into the rich, culturally specific knowledge that shapes our daily lives. Only by training and testing on diverse cultural data can we move toward AI that truly understands the world, not just a slice of it.