Introduction: The “Snow” Problem in AI

Imagine you are training an Artificial Intelligence to understand “commonsense.” You feed it thousands of questions to test its reasoning capabilities. One question asks: “The man needed to shovel his driveway. What season is it?” The answer, obviously, is winter.

Now, imagine asking that same question to a student in Jakarta, Indonesia. They might look at you with confusion. Indonesia is a tropical country; people don’t shovel driveways, and it certainly doesn’t snow. The concept isn’t “commonsense”—it’s culturally irrelevant.

This highlights a massive bottleneck in Natural Language Processing (NLP). Most benchmark datasets used to train and test Large Language Models (LLMs) are heavily Western-centric. When we try to apply these models to underrepresented languages, we often rely on translation. But translation doesn’t fix cultural mismatches.

So, can we use the LLMs themselves to fix this? Can we ask GPT-4 to generate culturally relevant data for languages like Indonesian and Sundanese?

A recent research paper titled “Can LLM Generate Culturally Relevant Commonsense QA Data?” investigates this exact question. The researchers conducted a comprehensive case study involving Indonesian (a national language with medium resources) and Sundanese (a regional language with low resources). Their findings offer a fascinating glimpse into the capabilities—and the limits—of AI in capturing the nuance of human culture.

Background: Why Context Matters

Before diving into the experiments, we need to understand the landscape. Commonsense Question Answering (QA) is a task where a model must answer questions that require prior world knowledge rather than just reading comprehension.

The gold standard for this is the CommonsenseQA dataset in English. However, adapting this to other languages is tricky.

  1. Indonesian: The lingua franca of Indonesia. It utilizes Latin script and is used nationally.
  2. Sundanese: A regional language spoken by the Sundanese people (roughly 34 million speakers), primarily in West Java. While it has many speakers, it is considered a low-resource language in the AI world because there is very little digitized text data available for training.

The researchers identified a gap: there was no commonsense QA dataset for Sundanese, and existing Indonesian datasets often lacked cultural depth. They set out to build one, comparing human effort against AI generation.

The Methodology: Three Paths to Data Creation

The core of this research is a comparison of three distinct methods for creating a dataset. The researchers didn’t just want any data; they wanted data that reflected local concepts—foods, places, habits, and history specific to Indonesia and the Sundanese culture.

As illustrated in the figure below, they devised a pipeline to create ~9,000 question-answer pairs.

Figure 1: Our dataset generation methods. The examples of LLM_ADAPT, HUMAN_GEN, and LLM_GEN datasets are shown in English for clarity. The original versions of these datasets are in Indonesian and Sundanese.

Let’s break down these three methods shown in Figure 1:

1. Automatic Data Adaptation (LLM_ADAPT)

This method attempts to recycle existing English data; a rough sketch of the pipeline appears after the list below.

  • The Seed: They took questions from the English CommonsenseQA dataset.
  • The Filter: Not all questions work. They used an LLM to check if concepts (like “snow” or “subway”) are relevant to Indonesia or West Java. If a concept was deemed irrelevant, it was flagged for adaptation.
  • The Adaptation: An LLM was prompted to “localize” the concept. For example, changing “snow” to “volcanic ash” (a common phenomenon in Java).
  • The Translation: The adapted text was then machine-translated into Sundanese.
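
Here is a minimal sketch of how such an adapt-then-translate pipeline could be wired up. The helpers `ask_llm` and `machine_translate` are hypothetical stand-ins for an LLM API (e.g., GPT-4) and a machine-translation service, and the prompts are illustrative rather than the paper's exact wording.

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a large language model."""
    raise NotImplementedError

def machine_translate(text: str, target_lang: str) -> str:
    """Placeholder for a machine-translation call (e.g., Indonesian -> Sundanese)."""
    raise NotImplementedError

def adapt_question(english_question: str, concept: str, region: str = "West Java") -> str:
    # Step 1: filter -- check whether the concept makes sense locally.
    relevance = ask_llm(
        f"Is the concept '{concept}' culturally relevant in {region}? Answer yes or no."
    )
    if relevance.strip().lower().startswith("yes"):
        adapted = english_question  # concept is already local; keep the question as-is
    else:
        # Step 2: adapt -- replace the foreign concept with a local equivalent
        # (e.g., "snow" -> "volcanic ash").
        adapted = ask_llm(
            f"Rewrite this question so that '{concept}' is replaced with a concept "
            f"familiar in {region}, keeping the reasoning intact:\n{english_question}"
        )
    # Step 3: translate the (possibly adapted) question into the target language.
    return machine_translate(adapted, target_lang="sun")
```

The key design point is that the filter step decides whether adaptation is needed at all; questions whose concepts are already locally relevant pass through unchanged.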

2. Manual Data Generation (HUMAN_GEN)

This is the “Gold Standard.”

  • The researchers recruited 12 native-speaker annotators from diverse regions across Java and Bali.
  • These humans created questions from scratch based on five cultural categories: Culinary, Place, Culture, History, and Activity.
  • Crucially, they used their own lived experiences to create the options and distractors (wrong answers).

3. Automatic Data Generation (LLM_GEN)

Here, the researchers asked: “Can an LLM do what the humans just did?”

  • They provided GPT-4 Turbo with the same list of categories and concepts used by the human annotators.
  • They prompted the model to generate the questions, the correct answers, and the distractors directly in the target languages (a rough sketch follows below).
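
A hedged sketch of this direct-generation setup, again using a hypothetical `ask_llm` placeholder and an illustrative prompt (the paper's actual prompt and output format may differ):

```python
import json

def ask_llm(prompt: str) -> str:
    """Placeholder for a GPT-4 Turbo chat call, as in the earlier sketch."""
    raise NotImplementedError

CATEGORIES = ["Culinary", "Place", "Culture", "History", "Activity"]

def generate_qa(concept: str, category: str, language: str = "Sundanese") -> dict:
    prompt = (
        f"Write one commonsense multiple-choice question in {language} about the "
        f"{category} concept '{concept}'. Return JSON with keys 'question', "
        f"'answer', and 'distractors' (four plausible but wrong options)."
    )
    return json.loads(ask_llm(prompt))  # assumes the model returns valid JSON
```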

The resulting dataset is the largest of its kind for these languages. You can see the breakdown of the data splits in the table below. Note the balance between the three methods.

Table 1: Statistics of our generated Indonesian and Sundanese CommonsenseQA dataset. We retained the original English CommonsenseQA splits in LLM_ADAPT to avoid data contamination.

Deep Dive: The Quality of Synthetic Data

Creating data is easy; creating good data is hard. The researchers spent a significant amount of time analyzing whether the LLM-generated content was actually usable.

The “Hallucination” of Translation

The LLM_ADAPT method (adapting English to local context) showed significant cracks, particularly for Sundanese. While English-to-Indonesian adaptation was decent, the step to Sundanese was prone to error.

For example, when adapting the concept “bald eagle,” GPT-4 correctly identified the “Javan hawk-eagle” for the Indonesian context. However, for Sundanese, it hallucinated a non-existent bird called “Garuda Puspa” (literally “Eagle Flower”).

This highlights a major risk in synthetic data: error propagation. If the adaptation model makes a slight logic error, and the translation model adds a linguistic error, the final data point becomes garbage.

The Repetition Problem

When the researchers looked at the LLM_GEN method (creating data from scratch), they noticed a different problem: lack of diversity.

When asked to generate questions about animals or nature in Indonesia, the LLM had “favorites.” As shown in Figure 2, the model overwhelmingly preferred to talk about Komodo dragons.

Figure 2: Top-10 adapted question concepts taken from the train, validation, and test sets of the LLM_ADAPT data.

While Komodo dragons are indeed Indonesian, a human dataset would likely feature a wider variety of local fauna. The model defaults to the most statistically probable (and famous) entities, reducing the cultural richness of the dataset.
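
One simple way to surface this kind of repetition is to count how often each concept appears in the generated items; a distribution dominated by a handful of entries signals low diversity. The `concept` field name below is an assumption about how the generated items are stored.

```python
from collections import Counter

def concept_distribution(items: list[dict], top_k: int = 10) -> list[tuple[str, int]]:
    """Return the top_k most frequent concepts across a generated dataset."""
    counts = Counter(item["concept"].lower() for item in items)
    return counts.most_common(top_k)

# On a skewed dataset, this might return something like
# [("komodo dragon", 120), ("batik", 45), ...], with one entity towering over the rest.
```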

Syntax and Fluency

The researchers also evaluated the grammatical correctness of the generated questions.

  • Indonesian: The models performed well, with high fluency.
  • Sundanese: The models struggled:
      • LLM_ADAPT (translation-based) produced only 15.19% error-free questions.
      • LLM_GEN (direct generation) did better, at 51.00% error-free questions.

This finding is critical: Directly asking an LLM to generate data in a low-resource language is often better than trying to translate adapted English data. Translation tools for low-resource languages simply aren’t robust enough to handle complex commonsense reasoning.

Benchmark Results: Can Models Solve Their Own Tests?

After building these datasets, the researchers ran a series of experiments. They tested various LLMs—including LLaMA-2, Mistral, Merak (an Indonesian LLM), and GPT-4—to see how well they could answer the questions.
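
A minimal sketch of how such a multiple-choice evaluation loop typically works: present the question with lettered options, ask the model for a single letter, and compute accuracy against the gold answer. The prompt wording and answer parsing here are illustrative assumptions, not the study's exact protocol.

```python
import string

def evaluate(model_ask, dataset: list[dict]) -> float:
    """model_ask: a function that sends a prompt to an LLM and returns its text reply."""
    correct = 0
    for item in dataset:
        letters = string.ascii_uppercase[: len(item["options"])]
        option_text = "\n".join(f"{l}. {o}" for l, o in zip(letters, item["options"]))
        prompt = (
            f"{item['question']}\n{option_text}\n"
            "Answer with the letter of the best option only."
        )
        prediction = model_ask(prompt).strip()[:1].upper()
        gold = letters[item["options"].index(item["answer"])]
        correct += prediction == gold
    return correct / len(dataset)
```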

Overall Performance

First, let’s look at the big picture in Figure 3.

Figure 3: LLMs’ performance on our combined test set.

There is a clear hierarchy here.

  1. GPT-4 models dominate, scoring above 80% accuracy.
  2. There is a significant performance gap between Indonesian (ind) and Sundanese (sun). Almost every model performs worse on Sundanese (the light gray bars).
  3. Even Merak-v4, a model specifically tuned for Indonesian, struggles to outperform generalist models like GPT-3.5, and it sees a massive drop-off for Sundanese. This illustrates how "multilingual" capabilities in models often degrade sharply as soon as you step outside the top 10-20 most spoken languages.

The “Easy Test” Trap

Here is where the study gets very interesting. The researchers compared how models performed on Human-Generated data versus LLM-Generated data.

Take a look at Figure 4. The beige bars represent performance on LLM-generated data, while the red bars represent human data.

Figure 4: LLMs' performance on LLM_GEN vs. HUMAN_GEN in Indonesian and Sundanese. We combined data points from both languages for visualization, with lower quartiles typically representing Sundanese data.

Notice a pattern? The models consistently score higher on the LLM-generated data.

This suggests that LLM-generated datasets are "easier" for other LLMs to solve. The synthetic data likely contains simpler sentence structures and more predictable logic patterns. Human data, by contrast, contains nuance, cultural idiosyncrasies, and "lexical diversity" (a wider vocabulary) that stump the models.

This creates a dangerous feedback loop. If we only use LLMs to generate training data, and then use LLMs to evaluate that data, we might trick ourselves into thinking our models are smarter than they are. They are just acing a test written by a peer with the same blind spots.
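
One rough proxy for that lexical-diversity gap is the type-token ratio (unique words divided by total words) of each dataset's questions. This is a generic illustration, not the metric used in the paper.

```python
import re

def type_token_ratio(texts: list[str]) -> float:
    """Unique-word count divided by total-word count across a list of questions."""
    tokens = [t for text in texts for t in re.findall(r"\w+", text.lower())]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A higher ratio for HUMAN_GEN questions than for LLM_GEN questions would be
# consistent with the pattern the authors report.
```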

Performance by Category

Finally, the researchers broke down performance by topic.

Figure 5: LLMs' performance by question category in LLM_GEN and HUMAN_GEN for Indonesian and Sundanese.

In Figure 5, we see that models struggle most with Culinary (the fourth column of charts). Food culture is deeply local and specific.

  • LLM Example: Asked about “crackers,” an LLM might generate a generic question about ingredients (flour).
  • Human Example: A human annotator generated a question about kerupuk rambak (cattle skin crackers), asking about specific animal parts.

The LLMs lack the “lived experience” to generate or answer questions about specific local delicacies, whereas they perform much better on History or Places, which are well-documented in Wikipedia-style training data.

Discussion: The Depth of Culture

The study concludes that while LLMs can generate culturally relevant data, they lack depth.

When the researchers analyzed the vocabulary, they found that the human-written datasets contained far more unique, culturally specific terms. The LLM tended to stay on the surface, mentioning "spicy food" in general rather than naming specific regional sambals.

Furthermore, the researchers tried an “open-ended” experiment. Instead of multiple choice, they just asked the model the question.

  • Question: “What song is mandatory during the moment of silence in a flag ceremony?”
  • Model Answer: “Usually, no song is sung.”
  • Correct Answer: “Mengheningkan Cipta.”

In a multiple-choice setting, the model might guess correctly. But when asked to generate the answer freely, it failed. This suggests that the model's "knowledge" is often fragile, relying on recognizing patterns in the provided options rather than truly knowing the culture.
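
The contrast between the two settings boils down to prompt format. A hedged sketch follows; the wording is illustrative, not the paper's exact prompts.

```python
def multiple_choice_prompt(question: str, options: list[str]) -> str:
    # The model only has to pick among provided options, so surface pattern
    # matching can carry it a long way.
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nAnswer with one letter."

def open_ended_prompt(question: str) -> str:
    # Here the model must produce the answer itself, with no options to lean on.
    return f"{question}\nAnswer with a short phrase."
```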

Conclusion and Implications

This paper makes a significant contribution to the field of NLP for low-resource languages. It provides the largest CommonsenseQA dataset for Indonesian and Sundanese to date.

Key takeaways for students and researchers:

  1. Direct Generation > Adaptation: If you need data for a low-resource language, it is currently better to prompt a powerful model (like GPT-4) to generate it directly in that language rather than translating English data. Translation introduces too much noise.
  2. Humans are Essential for Depth: Synthetic data is great for scale, but it is “easy” data. To truly test a model’s cultural competence, you need human-annotated data that captures the messy, specific details of daily life.
  3. The Low-Resource Gap: Even within the same country, the gap between a national language (Indonesian) and a regional language (Sundanese) is massive in terms of AI performance.

As we move toward more inclusive AI, we cannot simply rely on translating Western datasets. We need to build systems that understand that “commonsense” changes depending on where you are standing—and sometimes, that means knowing that snow doesn’t fall in Jakarta.