Introduction
In recent years, Multimodal Vision-Language Models (VLMs) like GPT-4V and Gemini have demonstrated an astonishing ability to interpret images. They can identify objects, read text within photos, and describe complex scenes. However, recognizing a “wedding” is one thing; understanding the specific rituals, clothing, and traditions associated with a wedding in rural India versus one in Ethiopia is a different challenge entirely.
As digital interactions become increasingly global, AI models must move beyond general object recognition to grasp cultural values—the shared beliefs, rituals, and traditions that define human societies.
This blog post explores a significant step forward in this domain: CulturalVQA, a benchmark introduced in a recent research paper to stress-test the cultural literacy of state-of-the-art VLMs. We will look at how the researchers built the dataset, the specific cultural facets they targeted, and the sobering results that reveal just how far AI still has to go in understanding the world’s diverse cultures.

As shown in Figure 1, while VLM performance is trending upward, there is a distinct gap between understanding Western and non-Western cultures. Let’s break down why this gap exists and how CulturalVQA measures it.
The Problem with Current Benchmarks
To improve cultural understanding in AI, we first need a yardstick to measure it. Previous attempts to benchmark this capability have existed, but they often suffer from critical limitations:
- Limited Scope: Datasets like GD-VCR rely on movie scenes, which are dramatized and do not reflect everyday reality.
- Lack of Depth: Datasets like MaXM focus on general reasoning (e.g., counting objects) rather than cultural nuance.
- Restricted Formats: Many benchmarks use True/False or multiple-choice questions. These allow models to guess correctly without genuine understanding.
The authors of CulturalVQA argue that to truly assess cultural competence, a model must handle open-ended questions about diverse, real-world images.

As Table 1 illustrates, CulturalVQA fills a unique niche by combining open-ended task formats with culturally diverse images and questions specifically designed to probe cultural understanding rather than just visual reasoning.
Building CulturalVQA: A Methodology
The core contribution of this research is the dataset itself. Creating a benchmark for “culture” is difficult because culture is multifaceted. The researchers broke this down into two main categories:
- Tangible Elements: Food, drink, and clothing.
- Intangible Elements: Rituals and traditions (which constitute shared cultural “common sense”).
1. Country Selection and Image Sourcing
The team selected 11 countries to ensure broad representation of cultural clusters, intentionally over-representing African and Islamic cultures, which are often marginalized in AI datasets. The countries included:
- Americas: USA, Canada, Brazil
- Europe: Germany
- Asia: China, India
- Africa/Middle East: Ethiopia, Nigeria, Rwanda, Turkey, Iran
Images were sourced from the CANDLE dataset and filtered using CLIP (Contrastive Language-Image Pre-training) to ensure cultural relevance, followed by a rigorous human filtering process.
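To make the filtering step concrete, here is a minimal sketch of how a CLIP-based relevance filter could work. The checkpoint, prompt template, file names, and threshold are illustrative assumptions, not the paper’s exact pipeline.

```python
# Minimal sketch of CLIP-based relevance filtering (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def relevance_score(image_path: str, concept: str, country: str) -> float:
    """Cosine similarity between an image and a culture-specific text prompt."""
    image = Image.open(image_path).convert("RGB")
    text = f"a photo of {concept} in {country}"
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Hypothetical candidate list; only images that clear an (arbitrary) threshold
# would move on to the human filtering stage.
candidates = [("img_001.jpg", "a traditional wedding", "Ethiopia")]
kept = [c for c in candidates if relevance_score(*c) > 0.25]
```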
2. The Human-in-the-Loop Annotation
Automated scraping isn’t enough to capture cultural nuance. The researchers employed annotators who were natives or long-term residents of the specific countries. These annotators were given specific instructions:
- Create challenging questions: Ask things that a local would know, but an outsider might not.
- Use local terminology: Instead of generic terms like “bread” or “dance,” use specific terms like “Naan” or “Guhamiriza.”
A few examples from the dataset highlight the specificity required, pairing everyday local images with questions that only a cultural insider could answer confidently.
3. Dataset Composition
The final dataset consists of 2,378 questions and 7,206 answers associated with 2,328 unique images.

Figure 3 (above) breaks down the data by country. Notably, the inter-annotator agreement varies. It is highest in countries like Canada and lowest in Rwanda. This discrepancy often arises because “country” is an imperfect proxy for “culture”—nations like Nigeria and Rwanda have immense internal subcultural diversity, leading to varied answers even among locals.
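One way to see how this plays out is to compute a simple proxy for agreement: the share of annotator answers that match each question’s majority answer, averaged per country. The sketch below uses that proxy; the paper’s exact agreement statistic may differ, and the example records are made up.

```python
# Simple per-country agreement proxy: fraction of answers matching the
# majority answer for each question, averaged over questions.
from collections import Counter, defaultdict

def agreement_by_country(records):
    """records: list of dicts like {"country": "Rwanda", "answers": [...]}."""
    per_country = defaultdict(list)
    for r in records:
        answers = [a.strip().lower() for a in r["answers"]]
        majority_count = Counter(answers).most_common(1)[0][1]
        per_country[r["country"]].append(majority_count / len(answers))
    return {c: sum(v) / len(v) for c, v in per_country.items()}

example = [
    {"country": "Canada", "answers": ["poutine", "poutine", "poutine"]},
    {"country": "Rwanda", "answers": ["guhamiriza", "intore dance", "umushagiriro"]},
]
print(agreement_by_country(example))  # higher value = annotators agree more
```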

The word clouds in Figure 4 reveal the richness of the dataset. The “Food” facet dominates the dataset (37.3%), but there is significant coverage of “Traditions” (26.1%) and “Rituals” (18%), which are the hardest for models to grasp because they rely on intangible context not explicitly visible in pixels.
Experimental Setup
How do you grade an AI on an open-ended cultural exam? Standard exact-string matching (checking if the model’s text matches the answer key character-for-character) is too harsh. If the answer is “Spicy stew” and the model says “Hot stew,” it should be counted as correct.
The researchers instead used LAVE (LLM-Assisted Evaluation), with GPT-4 acting as the judge. This method asks an LLM to decide whether the model’s answer is correct given the human reference answers, allowing for semantic flexibility while keeping the grading accurate.
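Conceptually, this boils down to an LLM-as-judge call like the sketch below. The prompt wording, binary scoring, and client setup are approximations for illustration; the paper’s exact LAVE prompt and rating scale differ.

```python
# LLM-as-judge scorer in the spirit of LAVE (illustrative, not the exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a VQA answer.
Question: {question}
Reference answers: {references}
Candidate answer: {candidate}
Reply with 1 if the candidate matches the meaning of any reference answer,
otherwise reply with 0."""

def lave_style_score(question: str, references: list[str], candidate: str) -> int:
    prompt = JUDGE_PROMPT.format(
        question=question, references="; ".join(references), candidate=candidate
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1 if response.choices[0].message.content.strip().startswith("1") else 0

# "Hot stew" should be accepted when a reference answer is "spicy stew".
score = lave_style_score(
    "What dish is traditionally served at this celebration?",
    ["spicy stew", "doro wat"],
    "hot stew",
)
```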
Determining the Necessity of Visuals
Before testing the VLMs, the researchers ran baselines to ensure the questions actually required looking at the image.
- LLM-only: Can the model guess the answer just from the text question?
- LLM + Country: Does knowing the country help?
- GPT-4V: The full Vision-Language Model.

Figure 5 shows that LLM-only approaches fail significantly (accuracies around 20-30%). The questions in CulturalVQA truly require visual grounding; the model must see the image to answer correctly.
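To make the three settings concrete, the sketch below shows what information each baseline receives. The function name, prompt format, and field names are hypothetical, chosen only to illustrate the ablation.

```python
# Hypothetical illustration of the three evaluation settings: what each
# baseline is allowed to "see" when answering a question.
from typing import Optional

def build_inputs(question: str,
                 country: Optional[str] = None,
                 image_path: Optional[str] = None) -> dict:
    """Return the (text, image) inputs for one evaluation setting."""
    text = question
    if country is not None:  # LLM + Country: the country name is appended as a hint
        text = f"{question} (Country: {country})"
    return {"text": text, "image": image_path}  # image stays None for text-only runs

question = "What is the woman in the photo preparing?"
llm_only    = build_inputs(question)
llm_country = build_inputs(question, country="Ethiopia")
full_vlm    = build_inputs(question, image_path="ethiopia_coffee.jpg")  # placeholder path
```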
Results and Analysis
The researchers benchmarked a wide range of models, from open-source options like BLIP2, LLaVA, and InternVL to closed-source proprietary models like GPT-4, Gemini, and Claude.
1. The Open vs. Closed Source Gap
The most immediate finding is the disparity between proprietary models and open-source ones.

Table 2 highlights that GPT-4 is the clear leader, achieving an average accuracy of 61.36%. The best open-source model, InternVL, lags behind at 46.18%. This suggests that the massive scale and proprietary training data of commercial models currently provide a significant edge in cultural knowledge.
2. The Western Bias
Look closely at the “Country” rows in Table 2. Performance is not uniform.
- High Performance: USA (GPT-4: 66.77%), Canada (72.00%), Brazil (76.44%).
- Low Performance: Nigeria (43.27%), Rwanda (46.41%), Ethiopia (56.38%).
This points to a strong bias in these models’ training data toward Western cultures and major economic powers. The models struggle significantly with African and Islamic cultural concepts.
3. Model vs. Human Performance
Are these models “good enough”? To find out, the researchers compared the models against human baselines.

Figure 6 shows the performance gap. Negative values indicate the model is worse than a human. In every single country, even the best models (closed-source) underperform compared to humans. The gap is particularly stark for Iran, Nigeria, and Ethiopia, where the models lack the “cultural common sense” that a local resident possesses.
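As a quick way to read Figure 6: the gap for each country is simply model accuracy minus human accuracy, so a negative value means the model trails local annotators. The numbers below are made-up placeholders, not the paper’s actual figures.

```python
# Reading the Figure 6 sign convention: gap = model accuracy - human accuracy.
# A negative gap means the model underperforms local human annotators.
# (Accuracy values here are invented for illustration only.)
human_acc = {"Iran": 0.85, "Nigeria": 0.82, "Canada": 0.88}
model_acc = {"Iran": 0.52, "Nigeria": 0.43, "Canada": 0.72}

gaps = {c: round(model_acc[c] - human_acc[c], 2) for c in human_acc}
print(gaps)  # e.g. {'Iran': -0.33, 'Nigeria': -0.39, 'Canada': -0.16}
```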
4. Facet Analysis: What Concepts are Hardest?
Do models understand food better than rituals?

Interestingly, Figure 7 shows that proprietary models (like GPT-4) actually perform better on intangible concepts like Rituals and Traditions than they do on Food and Drink.
- Why? The authors hypothesize that rituals often involve specific, named entities (like “Christmas” or “Holi”) that are well-documented in text data.
- The Food Problem: Food identification often requires fine-grained visual discrimination (e.g., distinguishing specific ingredients or regional variations of a dish), which remains a challenge.
5. Qualitative Failures
Looking at where models fail provides insight into their “thought process.”

As seen in Figure 8 (bottom-right example), GPT-4 misidentifies a Naghali (a traditional Iranian storyteller) as a “Dervish.” While the two may look similar to the untrained eye, they represent completely different cultural concepts. Similarly, it fails to identify the cultural significance of coral beads in Nigeria, seeing them merely as jewelry rather than a symbol of wealth and heritage. These errors show that while VLMs have good general vision, they lack the specific cultural lexicon required for deep understanding.
Conclusion
The CulturalVQA paper serves as a reality check for the AI community. While Vision-Language Models have made tremendous strides, they are far from being “global” citizens.
The benchmark reveals:
- A steep performance drop when moving from Western to non-Western contexts (specifically African and Islamic cultures).
- A significant lag in open-source models compared to proprietary ones.
- A persistent inability to match human-level cultural common sense.
By creating a rigorous, human-annotated benchmark, the authors have provided a roadmap for future improvement. For AI to be truly inclusive and effective globally, future training datasets must move beyond web-scraped quantity and focus on the quality of cultural representation, ensuring that a wedding in Rwanda is recognized with the same precision as a wedding in New York.