In the last few years, Vision-Language Models (VLMs) like CLIP, BLIP-2, and GPT-4V have revolutionized how computers understand the world. They can caption photos, answer questions about visual scenes, and generate art from text. We often attribute their success to the massive scale of their training data—billions of image-text pairs scraped from the internet.

But there is a hidden cost to this scale. The internet is not a perfect mirror of the world; it is heavily skewed toward Western cultures, particularly North America and Europe.

What happens when these models are asked to identify a wedding in India, or a breakfast in Ethiopia? Do they understand “universal” human concepts across different cultures, or do they default to a Western standard?

This post explores a fascinating research paper titled “From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models.” The researchers introduce a new benchmark, GLOBALRG, designed to stress-test the cultural inclusivity of modern AI. We will break down their methodology, the two distinct tasks they proposed, and the sobering results regarding the current state of AI’s cultural intelligence.

The Geo-Diversity Problem

To understand why this paper is important, we first need to look at the “Geo-Diversity Problem.” Most large-scale datasets used to train AI are sourced from the Western web. Consequently, models often exhibit performance disparities. They might effortlessly recognize a hamburger but fail to identify a vadai.

Previous benchmarks have attempted to measure this, but they have been limited in scope—often covering only 5 to 7 cultures or focusing strictly on “visual question answering.” They also miss a crucial nuance: the difference between universal concepts (things everyone does, like eating breakfast) and local concepts (specific cultural items, like a molinillo whisk).

The GLOBALRG benchmark addresses this by evaluating models on two distinct capabilities:

  1. Retrieval Across Universals: Can the model find diverse images for a broad concept like “wedding”?
  2. Cultural Visual Grounding: Can the model locate a specific, culture-bound object in an image?

Figure 1: An example instance from each task in GLOBALRG: i) Retrieval Across Universals measures the ability of VLMs to retrieve culturally diverse images for a query q. ii) Cultural Visual Grounding aims to evaluate the ability of VLMs to identify a cultural concept q.

As shown in Figure 1, the first task (top) requires the model to understand that a wedding can look very different depending on whether it is in the US, India, or Nigeria. The second task (bottom) requires the model to know exactly what a specific cultural object looks like and where it is located in a scene.

Task 1: Retrieval Across Universals

The first challenge focuses on “Human Universals”—concepts that exist across almost all cultures. The authors selected 20 such concepts, including “breakfast,” “funeral,” “farming,” and “music.”

The goal here isn’t just to see if the model can find an image of a wedding; it is to see if the model can retrieve culturally diverse images.

The Dataset

To build this dataset, the researchers covered 50 countries across 10 distinct regions. This offers significantly wider coverage than previous attempts.

Table 1: List of cultures covered in the retrieval task.

The team used a cultural knowledge base called CANDLE to extract context-specific sentences (e.g., “The mehendi ceremony holds significance in Indian tradition”) and used these to scrape diverse images. After manual verification to remove low-quality data, they curated a dataset of 3,000 visually diverse images (50 cultures \(\times\) 20 universals \(\times\) 3 images).

Measuring Diversity: A New Metric

In standard information retrieval, we usually only care about Precision (or Relevance): did the model return a relevant image? If I ask for a “wedding,” and the model returns 10 photos of weddings, it has a high precision score.

However, if all 10 photos are of white dresses in Western churches, the model has failed the test of cultural diversity. To quantify this, the authors introduced a Diversity@k metric based on entropy.

\[
\text{Diversity@}k = -\frac{1}{\log m} \sum_{i=1}^{m} p_i \log p_i
\]

In this equation:

  • \(m\) is the total number of cultures.
  • \(p_i\) is the proportion of images from the \(i\)-th culture in the top \(k\) results.

Essentially, a score of 0 means low diversity (bias toward specific countries), while a score near 1 indicates high diversity (the retrieved images are well-distributed across different cultures). This forces the evaluation to consider fairness alongside accuracy.
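
To make the metric concrete, here is a small sketch of how Diversity@k can be computed from the culture labels of the top-k retrieved images, under my reading of the normalized-entropy formulation above:

```python
# Sketch of the entropy-based Diversity@k described above (my reading of the
# metric; the paper's exact normalization may differ).
import math
from collections import Counter

def diversity_at_k(retrieved_cultures, m):
    """retrieved_cultures: culture label of each of the top-k images.
    m: total number of cultures in the pool (50 for GLOBALRG retrieval)."""
    k = len(retrieved_cultures)
    counts = Counter(retrieved_cultures)
    entropy = -sum((c / k) * math.log(c / k) for c in counts.values())
    return entropy / math.log(m)  # normalize so the score lies in [0, 1]

# All top-5 results from one culture -> 0.0 (no diversity).
print(diversity_at_k(["US"] * 5, m=50))
# Five results from five different cultures -> a much higher score.
print(diversity_at_k(["US", "India", "Nigeria", "Japan", "Peru"], m=50))
```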

Task 2: Cultural Visual Grounding

The second task tests the depth of the model’s knowledge. While “food” is a universal concept, “kimchi” or “baguette” are local instantiations of that concept.

Visual Grounding is the task of finding an object in an image and drawing a “bounding box” around it. Most existing datasets for this task (like COCO) focus on generic objects like “car,” “dog,” or “person.”

For GLOBALRG, the researchers created a dataset focusing on culture-specific items from 15 countries.

Table 12: List of cultures and concepts covered in the Cultural Visual Grounding dataset.

As seen in the table above, the concepts are specific. In Argentina, the model must find alfajor or mate. In Vietnam, it must look for Ao Dai or Banh Mi.

Collecting Authentic Data

You cannot simply scrape these images without context. The researchers recruited annotators from the respective cultures to find high-quality images and manually draw the bounding boxes. This ensures that the “Ground Truth” is culturally accurate. They collected 591 verified images, ensuring the target object wasn’t the only thing in the picture (which would make the task too easy).
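
With the annotated boxes as ground truth, a model's grounding predictions can be scored by intersection-over-union (IoU) against the annotation: a prediction counts as correct if the overlap clears a threshold. Here is a minimal sketch of that scoring, assuming the common 0.5 IoU threshold (the paper's exact threshold may differ):

```python
# Sketch of scoring grounding predictions against annotated boxes.
# Boxes are (x_min, y_min, x_max, y_max); the 0.5 threshold is a common
# convention and an assumption here, not necessarily the paper's setting.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy example: one near-miss and one good localization.
preds = [(10, 10, 60, 60), (100, 100, 200, 200)]
gts   = [(40, 40, 90, 90), (110, 105, 205, 210)]
print(grounding_accuracy(preds, gts))  # 0.5
```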

Experiments and Results

The authors evaluated a wide range of models, including CLIP, OpenCLIP, CoCA, BLIP-2, and Grounding DINO. The results expose significant gaps in current AI capabilities.

1. Retrieval: The Illusion of Diversity

In the retrieval task, the models were judged on both Relevance (did they find the right concept?) and Diversity (did they represent the world?).

Table 3: Average performance of various VLMs on the retrieval across universals task, in terms of Relevance and Diversity.

Table 3 shows that models trained on massive datasets, like CoCA (3 billion images) and OpenCLIP (2 billion images), generally performed best. TCL was a surprise outlier; despite being trained on a much smaller dataset (4 million images), it performed competitively, suggesting that its training objectives (Triple Contrastive Learning) might be efficient at learning distinct features.

However, the numbers hide a deeper bias. Even when models achieved high “country diversity” scores, a visual inspection revealed that they often retrieved images that—while technically from different countries—still adhered to Western visual norms.

Figure 2: Top 5 images retrieved for a sample of the universals by models CLIP, CoCA and BLIP-2. Each image is annotated with a flag representing the country, and the background colour of the flag represents the region.

Figure 2 above is perhaps the most telling illustration in the paper:

  • Breakfast (Top Row): Look at the images retrieved by CLIP. They come from different countries (per the flags), but almost all feature eggs, sausages, and toast. The model has learned that “breakfast” = “Western breakfast,” ignoring that breakfast might be fish and rice in Japan or injera in Ethiopia.
  • Funeral (Second Row): The models overwhelmingly retrieve images of people in black clothing. However, in many cultures, white is the color of mourning.
  • Wedding (Bottom Row): While there is some diversity, there is a strong preference for white dresses, even in cultures where red or other colors are traditional for weddings.

This suggests that VLMs are capturing a “surface-level” diversity while still enforcing a Western hegemony on the content of the concepts.

2. Visual Grounding: A Map of Bias

For the grounding task (finding specific objects), the performance disparity was stark.

Figure 3: Country-level Accuracy of each model on the Cultural Visual Grounding task.

The heatmap in Figure 3 visualizes accuracy by country. Blue indicates high accuracy; red indicates low accuracy.

  • The Blue Zones: Notice how most models perform decently on Canada and Mexico (North American context).
  • The Red Zones: Look at Vietnam, Philippines, and Nigeria. The deep red squares indicate that state-of-the-art models are failing almost completely to identify cultural objects from these regions.

Figure 4: Culture group-level Accuracy for Cultural Visual Grounding.

Figure 4 aggregates this data by region. There is a massive drop-off in accuracy when moving from North America (avg ~64%) to South East Asia (avg ~20-30%). This confirms that the training data for these models is likely severely lacking in representation from Asian and African regions.
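
The regional roll-up behind Figure 4 is just an aggregation of the country-level scores. Purely as an illustration (the accuracy values and region labels below are placeholders, not the paper's numbers), it amounts to a group-by average:

```python
# Toy illustration of rolling country-level grounding accuracy up to
# region-level averages; the mapping and scores here are placeholders.
import pandas as pd

scores = pd.DataFrame({
    "country": ["Canada", "Mexico", "Vietnam", "Philippines", "Nigeria"],
    "region": ["North America", "North America",
               "South East Asia", "South East Asia", "Africa"],
    "accuracy": [0.68, 0.60, 0.25, 0.22, 0.30],
})

print(scores.groupby("region")["accuracy"].mean().sort_values(ascending=False))
```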

Why do models fail at grounding?

The authors identified two main types of errors in the grounding task:

  1. Unfamiliarity: The model simply doesn’t know the word. For example, when asked to find a bayong (a woven bag from the Philippines), the model might just select a person in the image because it has no association with the word “bayong.”
  2. Shape Confusion: The model finds something that looks vaguely similar.

Figure 5: Qualitative Examples showing the performance of specialist and generalist models on Cultural Visual Grounding task.

Figure 5 provides qualitative examples of these failures:

  • Row 3 (India - “Diya”): The task is to find the diya (a small oil lamp). Several models fail to locate the small, specific object, identifying the whole tray or the wrong area.
  • Row 4 (Nigeria - “Ogene”): The ogene is a double-bell instrument. The models struggle to distinguish it from the person holding it or other background elements.

Grounding DINO (a specialist model designed specifically for object detection) generally outperformed the generalist models (like LLaVA or MiniGPT-v2), but even it struggled significantly outside of Western contexts.
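
If you want to probe this behaviour yourself, a minimal zero-shot grounding call is easy to set up. The sketch below uses the Hugging Face transformers port of Grounding DINO; the checkpoint, thresholds, and test image are illustrative choices, not the paper's configuration:

```python
# Minimal zero-shot grounding sketch with the transformers port of
# Grounding DINO (checkpoint, thresholds, and file name are illustrative).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("diya_scene.jpg").convert("RGB")  # hypothetical test image
text = "a diya."  # prompts are lowercase and dot-terminated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["scores"])
```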

Conclusion and Implications

The GLOBALRG benchmark serves as a reality check for the AI community. While Vision-Language Models have become incredibly powerful, they possess a worldview that is significantly narrower than the actual world we live in.

Key Takeaways:

  • Data Scale isn’t enough: Simply training on more internet data doesn’t fix bias, because the internet itself is biased. CoCA and OpenCLIP are better, but they still default to Western visual standards (e.g., eggs for breakfast).
  • The “Western Universal”: Models tend to homogenize universal concepts. They struggle to understand that a “wedding” or a “funeral” looks fundamentally different across cultures.
  • Regional Performance Gaps: There is a quantifiable, steep drop in performance for users in South East Asia, East Asia, and Africa compared to North America and Europe.

Why does this matter? As we integrate these models into search engines, educational tools, and robots, we risk creating systems that only work for a fraction of the global population. A robotic assistant in Vietnam shouldn’t be confused when asked to fetch a Banh Mi. An image generation tool shouldn’t require a user to explicitly type “non-Western style” to get an accurate depiction of their own culture.

The authors conclude that future research must prioritize culturally diverse data collection (not just scraping) and new training objectives that specifically penalize cultural homogeneity. Only then can we move from local concepts to true universals.