Introduction

Imagine showing an AI a photo of a bustling street festival. If the festival is Mardi Gras in New Orleans, most top-tier AI models will instantly recognize the beads, the floats, and the context. But what if that photo depicts Mela Chiraghan in Pakistan or a traditional Angampora martial arts display in Sri Lanka?

This is where the cracks in modern Artificial Intelligence begin to show. While Large Multimodal Models (LMMs)—systems that can see images and process text simultaneously—have made incredible leaps in capability, they possess a significant blind spot: the majority of the world’s cultures and languages.

Most current benchmarks for testing these models are “WEIRD”—Western, Educated, Industrialized, Rich, and Democratic. They focus heavily on English and high-resource languages, leaving a massive portion of the global population underrepresented.

To fix this, a large team of researchers has introduced ALM-bench (All Languages Matter Benchmark). This is not just another dataset; it is a massive effort to evaluate how well AI understands the visual and linguistic nuances of 100 different languages across 73 countries.

The ALM-bench concept map showing diverse languages and cultural questions across continents.

In this post, we will break down this research paper to understand how ALM-bench was built, why it is harder to solve than previous benchmarks, and what it reveals about the “cultural intelligence” of models like GPT-4o and Gemini.

Background: The Cultural Gap in AI

Before diving into ALM-bench, it is important to understand the landscape of Vision-Language evaluations.

LMMs are trained on internet-scale data. Naturally, the internet is dominated by English and a handful of other high-resource languages (like Chinese, Spanish, and French). Consequently, models learn to associate “weddings” with white dresses and tuxedos, often failing to recognize a red sari or a traditional kimono in the same context.

Previous benchmarks tried to address this, but they had limitations:

  • Restricted Scope: Many “multilingual” benchmarks only covered 5–10 languages.
  • English-Centric: Some datasets, like CulturalVQA, focused on cultural content but were presented only in English.
  • Visual Bias: Datasets often lacked “visual cultural diversity”—meaning the images themselves were generic, even if the text was translated.

The researchers behind ALM-bench argue that true inclusivity requires testing models on low-resource languages (languages with less training data available) and culturally specific imagery simultaneously.

Comparison of ALM-bench against previous benchmarks like MaRVL and CulturalVQA.

As shown in the comparison table above, ALM-bench significantly scales up the evaluation to 100 languages and introduces a rigorous manual verification process that many automated benchmarks lack.

Core Method: Building ALM-bench

The creation of ALM-bench was a massive logistical undertaking involving over 800 hours of human expert work. The researchers didn’t just scrape the web and run Google Translate; they built a pipeline designed to capture authentic cultural context.

1. The Scope

The benchmark covers:

  • 100 Languages: Split between high-resource (e.g., English, French) and low-resource (e.g., Amharic, Sinhala, Yoruba).
  • 24 Scripts: From Latin and Cyrillic to Ge’ez and Sinhala.
  • 19 Domains: These are divided into Generic (everyday objects) and Cultural (specific traditions).

2. The Data Pipeline

The researchers employed a multi-stage pipeline to ensure quality.

The data collection and verification pipeline for ALM-bench.

Step A: Image Collection

For the Cultural subset, they didn’t use generic stock photos. They scraped open-license images specific to country-language pairs. For example, for the “Food” category in Ethiopia, they looked for specific local dishes, not generic “African food.”

Step B: Question Generation

They used GPT-4o to generate initial Question-Answer (QA) pairs based on the images. The model was instructed to create different types of questions (a minimal sketch of such a generation call follows the list):

  • Multiple Choice Questions (MCQs)
  • True/False
  • Short Visual Question Answering (VQA)
  • Long VQA (requiring detailed explanations)
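
As a rough illustration of Step B, here is a minimal sketch of what an image-conditioned drafting call could look like with the OpenAI Python client. The prompt wording, function name, and arguments are assumptions made for this post; the authors’ actual generation prompts and controls are more elaborate.

```python
# Hedged sketch of Step B: asking GPT-4o to draft QA pairs for one image.
# The prompt text and function signature are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_qa_pairs(image_url: str, country: str, language: str) -> str:
    """Ask GPT-4o to draft culturally grounded QA pairs for a single image."""
    prompt = (
        f"You are shown an image from {country}. In {language}, write one "
        f"multiple-choice question, one true/false question, one short VQA "
        f"question, and one long VQA question requiring a detailed "
        f"explanation, each with its correct answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Every draft produced this way still has to pass the human verification in Step C before it can enter the benchmark.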

Step C: Human Verification (Crucial)

This is the most important step. Native speakers and experts verified the data. They weren’t just checking grammar; they were checking for cultural hallucinations. If an image showed a specific festival but the generated question misidentified it, the human annotators corrected it. They also blurred faces to protect privacy.
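
To make the verification step concrete, here is a sketch of what a single entry might look like once a native-speaker annotator has signed off on it. The field names are hypothetical, chosen to mirror the attributes described in this post rather than the authors’ actual schema.

```python
# Hypothetical record layout for one verified ALM-bench-style entry.
from dataclasses import dataclass


@dataclass
class QARecord:
    language: str               # e.g. "Amharic"
    country: str                # e.g. "Ethiopia"
    script: str                 # e.g. "Ge'ez"
    domain: str                 # generic (everyday objects) or cultural (e.g. "Food")
    question_type: str          # "MCQ", "True/False", "Short VQA", or "Long VQA"
    question: str
    answer: str
    image_path: str             # faces blurred before release
    verified_by_native_speaker: bool = False


def keep_verified(records: list[QARecord]) -> list[QARecord]:
    """Only entries checked by a native speaker make it into the benchmark."""
    return [r for r in records if r.verified_by_native_speaker]
```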

3. Cultural Domains

To ensure the models were tested on more than just surface-level translation, the benchmark includes 13 deep cultural domains.

Breakdown of the 13 cultural categories including Customs, Rituals, and Food.

These categories include:

  • Rituals & Customs: Understanding gestures, greetings, and ceremonies.
  • Architecture: Distinguishing between a Gothic cathedral and a Mughal mosque.
  • Food: Identifying specific regional dishes.
  • Music & Literature: Recognizing traditional instruments or famous local authors.

4. The Scale of the Data

The final result is a dataset of over 22,000 QA pairs. The sheer diversity of scripts and question types allows researchers to pinpoint exactly where a model fails—whether it’s a failure of language processing (script issues) or visual recognition (cultural ignorance).

Data statistics showing the distribution of languages, scripts, and question types.

Experiments & Results

The researchers tested 16 state-of-the-art LMMs, including proprietary models (like GPT-4o and Gemini 1.5 Pro) and open-source models (like LLaVA, Qwen, and Yi). The results paint a stark picture of the current state of AI inclusivity.

1. The “Resource Gap”

The most prominent finding is the performance disparity between high-resource and low-resource languages.

Performance heatmap of different models across 100 languages.

In the heatmap above, darker red indicates higher scores, and you can see a “wall” where performance drops off for low-resource languages (a small aggregation sketch follows the list below).

  • Closed-source dominance: GPT-4o (top row) is the clear winner, achieving 78.8% overall accuracy. Gemini 1.5 Pro follows closely.
  • The Drop-off: Even the best model, GPT-4o, drops from 88.4% accuracy on English to just 50.8% on Amharic.
  • Open-Source Struggles: The best open-source model (GLM-4V) trails significantly behind the proprietary giants, struggling immensely with low-resource African and Asian languages.
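
For intuition, here is a minimal sketch (assuming pandas and made-up column names) of how the per-language scores behind such a heatmap, and the high- versus low-resource gap, could be aggregated from raw per-question results. It is not the authors’ evaluation code.

```python
# Hedged sketch: aggregating per-question results into heatmap-style scores.
import pandas as pd

# One row per (model, language, question); values here are illustrative.
results = pd.DataFrame([
    {"model": "GPT-4o", "language": "English", "resource": "high", "correct": True},
    {"model": "GPT-4o", "language": "Amharic", "resource": "low",  "correct": False},
    # ... one row for every evaluated question
])

# Accuracy per model and language (what each heatmap cell would show).
per_language = (
    results.groupby(["model", "language"])["correct"].mean().mul(100).round(1)
)

# The "resource gap": mean accuracy on high- vs. low-resource languages.
resource_gap = (
    results.groupby(["model", "resource"])["correct"].mean().mul(100).round(1)
)

print(per_language)
print(resource_gap)
```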

2. Script Matters

It is not just about vocabulary; the writing system (script) poses a major hurdle. Models performed significantly worse on non-Latin scripts.

Performance comparison between GPT-4o and Qwen2-VL across different language scripts.

As shown in the chart, while models handle Latin and Cyrillic scripts reasonably well, performance plummets for scripts like Ge’ez (used in Ethiopia), Sinhala (Sri Lanka), and Khmer (Cambodia). This suggests that the tokenizers—the parts of the model that read text—are likely undertrained on these unique character sets.
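
You can get a rough feel for this tokenizer effect with a quick, informal check. The sketch below assumes the tiktoken library and GPT-4o’s o200k_base encoding; the sample strings are approximate greetings chosen for illustration, not text taken from ALM-bench.

```python
# Informal check: how many tokens a tokenizer spends per character of each script.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by GPT-4o

# Approximate greetings in different scripts (illustrative, not from the benchmark).
samples = {
    "English (Latin)":    "Hello, how are you?",
    "Russian (Cyrillic)": "Привет, как дела?",
    "Amharic (Ge'ez)":    "ሰላም እንዴት ነህ?",
    "Sinhala (Sinhala)":  "ආයුබෝවන්, කොහොමද?",
    "Khmer (Khmer)":      "សួស្តី តើអ្នកសុខសប្បាយទេ?",
}

for name, text in samples.items():
    tokens = enc.encode(text)
    # A high tokens-per-character ratio means the script gets shredded into many
    # tiny pieces, a hint that the tokenizer saw little of it during training.
    ratio = len(tokens) / len(text)
    print(f"{name:20s} chars={len(text):3d} tokens={len(tokens):3d} tokens/char={ratio:.2f}")
```

When the ratio for Ge’ez or Khmer text is several times that of Latin text, every sentence costs more tokens to read and generate, compounding whatever gaps already exist in the training data.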

3. Cultural Hallucinations

One of the most fascinating parts of the paper is the error analysis. Models frequently hallucinated cultural context. They would see a visual cue and confidently map it to the wrong culture.

Qualitative examples of model failures, such as misidentifying a festival.

Look at the qualitative example in the image above. The model is shown an image of Mela Chiraghan (Festival of Lights) in Pakistan.

  • The Mistake: The model confidently identifies it as Eid Milad un Nabi.
  • The Reason: Both are religious festivals involving lights. However, the model missed the specific visual nuance—Mela Chiraghan features bright, colorful lights and specific drums (Dhol), whereas Eid Milad un Nabi typically features green lights and modest attire. The model lacked the “cultural resolution” to tell the difference.

4. Error Analysis

The researchers categorized errors into types such as “Lack of Knowledge,” “Language Error,” and “Reasoning Error” (a small tallying sketch follows the list below).

Error analysis radar charts for different scripts.

  • Bengali: High “Lack of Cultural Understanding.”
  • Russian (Cyrillic): High “Lack of Knowledge.”
  • Sinhala: High “Language Error”—meaning the model might know the answer but cannot form the sentence correctly in the Sinhala script.
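
As a small illustration, breakdowns like these radar charts can be produced by tallying manually annotated failure cases per language. The rows below are invented for illustration; only the category names come from the paper’s analysis.

```python
# Hedged sketch: tallying annotated error categories per language.
from collections import Counter

annotated_errors = [
    ("Sinhala", "Language Error"),
    ("Sinhala", "Language Error"),
    ("Bengali", "Lack of Cultural Understanding"),
    ("Russian", "Lack of Knowledge"),
    # ... one tuple per manually reviewed failure case
]

by_language: dict[str, Counter] = {}
for language, category in annotated_errors:
    by_language.setdefault(language, Counter())[category] += 1

for language, counts in sorted(by_language.items()):
    print(language, dict(counts))
```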

5. Location-Aware Prompting

An interesting sub-experiment involved “Location-Aware Prompts.” The researchers found that if they explicitly told the model which country the image was from (e.g., “This image is from South Africa”), performance improved by roughly 2.6% to 5% for top models.

Table showing performance boosts when adding country information to prompts.

This suggests that models do have some of this cultural knowledge latent within them, but they need explicit triggers to access it. They struggle to infer the cultural context solely from the pixels.
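
Here is a minimal sketch of the location-aware prompting idea, assuming a generic chat-style interface; the exact prompt templates used in the paper are not reproduced in this post, so the wording is an assumption.

```python
# Hedged sketch: the same VQA question with and without an explicit country hint.
def build_prompt(question: str, country: str | None = None) -> str:
    """Prepend a country hint when it is available (location-aware prompting)."""
    if country is not None:
        return f"This image is from {country}. {question}"
    return question


question = "Which festival is being celebrated in this image?"

# Baseline: the model must infer the cultural context from pixels alone.
baseline_prompt = build_prompt(question)

# Location-aware: the country acts as an explicit trigger for latent knowledge.
located_prompt = build_prompt(question, country="Pakistan")

print(baseline_prompt)
print(located_prompt)
```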

Conclusion & Implications

The ALM-bench paper serves as a reality check for the AI community. While we celebrate models that can write code or pass the bar exam, we must acknowledge that these systems are still functionally illiterate in the cultures of billions of people.

Key Takeaways:

  1. The “Digital Divide” is real: There is a massive performance gap (over 27%) between the best proprietary models and the best open-source models in multilingual settings.
  2. Visuals need context: Models fail to recognize cultural markers (clothing, festivals, food) without explicit text hints.
  3. Low-resource languages are left behind: The current training paradigm is failing languages with unique scripts and limited internet presence.

ALM-bench provides a roadmap for fixing these issues. By highlighting exactly where models fail—whether it’s the script of the text or the nuance of a ritual—researchers can curate better training data. The goal is a future where “All Languages Matter” isn’t just a benchmark title, but a fundamental capability of Artificial Intelligence.