Introduction

In recent years, Multimodal Vision-Language Models (VLMs) like GPT-4V and Gemini have demonstrated an astonishing ability to interpret images. They can identify objects, read text within photos, and describe complex scenes. However, recognizing a “wedding” is one thing; understanding the specific rituals, clothing, and traditions associated with a wedding in rural India versus one in Ethiopia is a different challenge entirely.

As digital interactions become increasingly global, AI models must move beyond general object recognition to grasp cultural values—the shared beliefs, rituals, and traditions that define human societies.

This blog post explores a significant step forward in this domain: CulturalVQA, a novel benchmark introduced in a recent research paper and designed to stress-test the cultural literacy of state-of-the-art VLMs. We will look at how the researchers built the dataset, the specific cultural facets they targeted, and the sobering results that reveal just how far AI still has to go in understanding the world’s diverse cultures.

Figure 1: The performance of VLMs over time, segmented by non-Western and Western countries.

As shown in Figure 1, while VLM performance is trending upward, there is a distinct gap between understanding Western and non-Western cultures. Let’s break down why this gap exists and how CulturalVQA measures it.

The Problem with Current Benchmarks

To improve cultural understanding in AI, we first need a yardstick to measure it. Previous benchmarks for this capability exist, but they often suffer from critical limitations:

  1. Limited Scope: Datasets like GD-VCR rely on movie scenes, which are dramatized and do not reflect everyday reality.
  2. Lack of Depth: Datasets like MaXM focus on general reasoning (e.g., counting objects) rather than cultural nuance.
  3. Restricted Formats: Many benchmarks use True/False or multiple-choice questions. These allow models to guess correctly without genuine understanding.

The authors of CulturalVQA argue that to truly assess cultural competence, a model must handle open-ended questions about diverse, real-world images.

Table 1: Comparison of various datasets closely related to CulturalVQA across different axes.

As Table 1 illustrates, CulturalVQA fills a unique niche by combining open-ended task formats with culturally diverse images and questions specifically designed to probe cultural understanding rather than just visual reasoning.

Building CulturalVQA: A Methodology

The core contribution of this research is the dataset itself. Creating a benchmark for “culture” is difficult because culture is multifaceted. The researchers broke this down into two main categories:

  1. Tangible Elements: Food, drink, and clothing.
  2. Intangible Elements: Rituals and traditions (which constitute shared cultural “common sense”).

1. Country Selection and Image Sourcing

The team selected 11 countries to ensure broad representation of cultural clusters, intentionally over-representing African and Islamic cultures, which are often marginalized in AI datasets. The countries included:

  • Americas: USA, Canada, Brazil
  • Europe: Germany
  • Asia: China, India
  • Africa/Middle East: Ethiopia, Nigeria, Rwanda, Turkey, Iran

Images were sourced from the CANDLE dataset and filtered using CLIP (Contrastive Language-Image Pre-training) to ensure cultural relevance, followed by a rigorous human filtering process.
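The paper does not spell out the filtering code, but the idea can be sketched with an off-the-shelf CLIP model: score each candidate image against a culture-specific text prompt and keep only high-scoring images for the subsequent human review. The prompt template, model checkpoint, and threshold below are illustrative assumptions, not the authors’ exact settings.

```python
# A minimal sketch of CLIP-based relevance filtering, in the spirit of the
# paper's pipeline. Prompt wording and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_relevance(image_path: str, concept: str, country: str) -> float:
    """Score how well an image matches a cultural concept for a given country."""
    image = Image.open(image_path).convert("RGB")
    text = f"a photo of {concept} in {country}"
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is the scaled cosine similarity between image and text.
    return outputs.logits_per_image.item()

# Keep only images above an (assumed) relevance threshold before human review.
candidates = [("img_001.jpg", "a traditional wedding ceremony", "Rwanda")]
filtered = [c for c in candidates if clip_relevance(*c) > 20.0]
```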

2. The Human-in-the-Loop Annotation

Automated scraping isn’t enough to capture cultural nuance. The researchers employed annotators who were natives or long-term residents of each country. These annotators were given specific instructions:

  • Create challenging questions: Ask things that a local would know, but an outsider might not.
  • Use local terminology: Instead of generic terms like “bread” or “dance,” use specific terms like “Naan” or “Guhamiriza.”

Here are a few examples from the dataset that highlight the specificity required:

Traditional Rwandan dance. Question: How do we call that kind of dance show? Answer: Guhamiriza

Chinese cuisine. Question: Which city is the origin of the dish shown? Answer: Suzhou

Signage in Iran. Question: What are women obligated to wear? Answer: Hijab/Headscarf

Brazilian barbecue. Question: What is the name of the Brazilian style of serving beef shown? Answer: Rodizio de carne

3. Dataset Composition

The final dataset consists of 2,378 questions and 7,206 answers associated with 2,328 unique images.
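To make that composition concrete, a single entry can be pictured as one image paired with one open-ended question and roughly three reference answers from different annotators (7,206 answers over 2,378 questions). The field names and answer variants below are hypothetical, not taken from the released files.

```python
# A hypothetical record illustrating the dataset's structure: one image, one
# open-ended question, and multiple human reference answers. Field names and
# the alternative answers are assumed for illustration only.
example = {
    "image_id": "rwanda_00042",
    "country": "Rwanda",
    "facet": "traditions",
    "question": "How do we call that kind of dance show?",
    "reference_answers": ["Guhamiriza", "Guhamiriza dance", "Traditional Rwandan dance"],
}
```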

Figure 3: Analysis of data by country, showing counts, question lengths, and agreement scores.

Figure 3 (above) breaks down the data by country. Notably, the inter-annotator agreement varies. It is highest in countries like Canada and lowest in Rwanda. This discrepancy often arises because “country” is an imperfect proxy for “culture”—nations like Nigeria and Rwanda have immense internal subcultural diversity, leading to varied answers even among locals.

Figure 4: Word clouds representing the answers in CulturalVQA across five facets of culture.

The word clouds in Figure 4 reveal the richness of the dataset. The “Food” facet dominates (37.3%), but there is also significant coverage of “Traditions” (26.1%) and “Rituals” (18%), facets that rely on intangible context not explicitly visible in the pixels.

Experimental Setup

How do you grade an AI on an open-ended cultural exam? Standard exact-string matching (checking whether the model’s text matches the answer key character for character) is too harsh. If the reference answer is “Spicy stew” and the model says “Hot stew,” it should still be counted as correct.

The researchers instead used LAVE (LLM-assisted evaluation), with GPT-4 acting as the judge. This method asks an LLM to decide whether the model’s answer is correct given the human reference answers, allowing semantic flexibility while maintaining accuracy.
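In practice, such a judge can be approximated with a short prompt to GPT-4. The prompt wording and the simple yes/no scoring below are simplifications of the LAVE protocol, not the paper’s exact setup.

```python
# A rough approximation of LLM-assisted evaluation (LAVE-style): GPT-4 judges
# whether a model's free-form answer matches any human reference answer.
# The prompt and scoring here are simplifications, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def lave_judge(question: str, references: list[str], prediction: str) -> bool:
    prompt = (
        "You are grading an answer to a visual question.\n"
        f"Question: {question}\n"
        f"Reference answers: {', '.join(references)}\n"
        f"Candidate answer: {prediction}\n"
        "Is the candidate answer correct given the references? Reply 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Example: semantically equivalent answers should pass.
print(lave_judge("What dish is shown?", ["Spicy stew"], "Hot stew"))
```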

Determining the Necessity of Visuals

Before testing the VLMs, the researchers ran baselines to ensure the questions actually required looking at the image.

  • LLM-only: Can the model guess the answer just from the text question?
  • LLM + Country: Does knowing the country help?
  • GPT-4V: The full Vision-Language Model.

Figure 5: Baseline evaluation of the degree of visual understanding required in CulturalVQA.

Figure 5 shows that LLM-only approaches fail significantly (accuracies around 20-30%). The questions in CulturalVQA truly require visual grounding; the model must see the image to answer correctly.
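Conceptually, the three baselines differ only in how much context the model receives: the bare question, the question plus the country name, or the question plus the image. A minimal sketch of the prompt construction (the templates are assumptions, not the paper’s wording):

```python
# Sketch of how the three baselines differ in the context provided to the model.
# Prompt templates are illustrative assumptions, not the paper's exact wording.
def build_prompt(question: str, country: str | None = None) -> str:
    """Build the text portion of a baseline query."""
    if country is not None:
        return f"This question is about {country}. {question}"  # LLM + Country
    return question                                              # LLM-only

question = "What is the name of the Brazilian style of serving beef shown?"

llm_only_prompt = build_prompt(question)                       # no image, no country
llm_country_prompt = build_prompt(question, country="Brazil")  # country hint added
# The GPT-4V baseline sends the same question together with the image itself,
# which is exactly what the two text-only variants are missing.
```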

Results and Analysis

The researchers benchmarked a wide range of models, from open-source options like BLIP2, LLaVA, and InternVL to closed-source proprietary models like GPT-4, Gemini, and Claude.

1. The Open vs. Closed Source Gap

The most immediate finding is the disparity between proprietary models and open-source ones.

Table 2: LAVE accuracies of open- and closed-source models on CulturalVQA.

Table 2 highlights that GPT-4 is the clear leader, achieving an average accuracy of 61.36%. The best open-source model, InternVL, lags behind at 46.18%. This suggests that the massive scale and proprietary training data of commercial models currently provide a significant edge in cultural knowledge.

2. The Western Bias

Look closely at the “Country” rows in Table 2. Performance is not uniform.

  • High Performance: USA (GPT-4: 66.77%), Canada (72.00%), Brazil (76.44%).
  • Low Performance: Nigeria (43.27%), Rwanda (46.41%), Ethiopia (56.38%).

This indicates a strong bias in these models’ training data toward Western countries and major economic powers. The models struggle significantly with African and Islamic cultural concepts.

3. Model vs. Human Performance

Are these models “good enough”? To find out, the researchers compared the models against human baselines.

Figure 6: Performance gap between models and human performance.

Figure 6 shows the performance gap. Negative values indicate the model is worse than a human. In every single country, even the best models (closed-source) underperform compared to humans. The gap is particularly stark for Iran, Nigeria, and Ethiopia, where the models lack the “cultural common sense” that a local resident possesses.

4. Facet Analysis: What Concepts are Hardest?

Do models understand food better than rituals?

Figure 7: VLM performance across facets as measured using LAVE accuracies.

Interestingly, Figure 7 shows that proprietary models (like GPT-4) actually perform better on intangible concepts like Rituals and Traditions than they do on Food and Drink.

  • Why? The authors hypothesize that rituals often involve specific, named entities (like “Christmas” or “Holi”) that are well-documented in text data.
  • The Food Problem: Food identification often requires fine-grained visual discrimination (e.g., distinguishing specific ingredients or regional variations of a dish), which remains a challenge.

5. Qualitative Failures

Looking at where models fail provides insight into their “thought process.”

Figure 8: Qualitative failure examples of GPT-4 predictions.

As seen in Figure 8 (bottom right example), GPT-4 misidentifies a Naghali (a traditional Iranian storyteller) as a “Dervish.” While visually similar to an untrained eye, they represent completely different cultural concepts. Similarly, it fails to identify the cultural significance of coral beads in Nigeria, seeing them merely as jewelry rather than a symbol of wealth and heritage. These errors show that while VLMs have good general vision, they lack the specific cultural lexicon required for deep understanding.

Conclusion

The CulturalVQA paper serves as a reality check for the AI community. While Vision-Language Models have made tremendous strides, they are far from being “global” citizens.

The benchmark reveals:

  1. A steep performance drop when moving from Western to non-Western contexts (specifically African and Islamic cultures).
  2. A significant lag in open-source models compared to proprietary ones.
  3. A persistent inability to match human-level cultural common sense.

By creating a rigorous, human-annotated benchmark, the authors have provided a roadmap for future improvement. For AI to be truly inclusive and effective globally, future training datasets must move beyond web-scraped quantity and focus on the quality of cultural representation, ensuring that a wedding in Rwanda is recognized with the same precision as a wedding in New York.