Introduction
In the world of Artificial Intelligence, Computer Vision has historically been obsessed with objectivity. Show a model a picture of a park, and it will dutifully report: “A dog running on green grass.” This is impressive, but it misses a fundamental layer of human experience: subjectivity and emotion. When we look at a painting—say, Starry Night—we don’t just see “yellow circles on a blue background.” We feel awe, melancholy, or excitement.
Furthermore, the way we describe these feelings is deeply rooted in our culture and language. A viewer in New York might describe a portrait differently than a viewer in Cairo or Hanoi, not just because of the language they speak, but because of the cultural lens through which they view the world.
Most current Vision-Language (VL) benchmarks, like COCO, are dominated by English and focused on objective facts. This creates a “culture gap” in AI. To bridge this, a team of researchers from KAUST, Oxford, and Northeastern University has introduced ArtELingo-28.
This paper presents a massive leap forward in Affective (emotional) Vision-Language understanding. By expanding the dataset to cover 28 languages and emphasizing cultural diversity, the researchers challenge AI to move beyond “what is this object?” to “how does this make you feel, and why?”
The Problem: The English Bias in Multimodal AI
Recent surveys in multimodal AI reveal a stark reality: the field is overwhelmingly Anglocentric. While English is a global lingua franca, 75% of the world’s population does not speak it. When datasets are translated from English, they often lose cultural nuance. A dataset natively collected in English and translated to Hindi is not the same as a dataset collected from native Hindi speakers.
Previous attempts like ArtEmis introduced emotional captions to art, but they were in English. The original ArtELingo added Arabic and Chinese. ArtELingo-28 explodes this scope, adding 25 new languages, many of which are considered “low-resource” in the NLP world, such as Bengali, Swahili, and Hausa.
The goal isn’t just translation; it’s capturing the diversity of opinion. As shown below, the dataset captures how different cultures interpret the same visual input.

In Figure 1, we see the core of the benchmark. For a single image, annotators from different linguistic backgrounds provide not only an emotion label (like “fear” or “contentment”) but also a caption explaining why. This moves the task from simple classification to complex, culturally grounded reasoning.
Building ArtELingo-28
Creating this benchmark was a massive logistical undertaking. The researchers employed 220 native-speaker annotators from 23 countries, consuming over 6,000 hours of work.
The Scope of the Data
The dataset is built on WikiArt images. It includes:
- 28 Languages: Covering Africa, Southeast Asia, the Indian subcontinent, East Asia, the Middle East, Central Asia, Europe, and North America.
- ~200,000 Annotations: Roughly 140 annotations per image.
- 9 Emotion Labels: Contentment, Awe, Excitement, Amusement, Sadness, Anger, Fear, Disgust, and “Something Else” (a sketch of a single annotation record follows this list).
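To make the shape of the data concrete, here is a minimal sketch of what one annotation record might look like. The field names and language codes are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# The 9 emotion labels (8 inherited from ArtEmis plus "something else").
EMOTIONS = [
    "contentment", "awe", "excitement", "amusement",
    "sadness", "anger", "fear", "disgust", "something else",
]

@dataclass
class Annotation:
    """One affective annotation: an emotion choice plus a native-language rationale."""
    image_id: str   # WikiArt painting identifier
    language: str   # annotator's language (hypothetical code, e.g. "sw" for Swahili)
    emotion: str    # one of EMOTIONS
    caption: str    # free-text explanation of *why* the image evokes that emotion

# Hypothetical record, for illustration only.
example = Annotation(
    image_id="vincent-van-gogh_the-starry-night-1889",
    language="hi",
    emotion="awe",
    caption="...",  # a native Hindi caption would appear here
)
assert example.emotion in EMOTIONS
```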
The comparison with previous datasets highlights the scale of this expansion:

Diversity in Participation
To ensure the data wasn’t skewed, the researchers balanced the number of annotations across languages as much as possible, though availability of annotators for low-resource languages naturally varied.

Emotional Agreement Across Cultures
One of the most fascinating aspects of the paper is the analysis of “emotional disagreement.” Do people from different cultures feel the same way about the same art?
To measure this, the authors used Kullback-Leibler (KL) Divergence, a statistical method to measure how different two probability distributions are. In this context, it measures how much the distribution of chosen emotions differs between two languages.
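As a concrete sketch, here is how one could compute a per-language emotion distribution and the KL divergence between two languages. It assumes annotations are dicts with `language` and `emotion` fields (the same hypothetical schema as above); it is not the authors' analysis code.

```python
import numpy as np
from scipy.stats import entropy

EMOTIONS = [
    "contentment", "awe", "excitement", "amusement",
    "sadness", "anger", "fear", "disgust", "something else",
]

def emotion_distribution(annotations, language):
    """Empirical distribution over the 9 emotion labels for one language."""
    counts = np.zeros(len(EMOTIONS))
    for ann in annotations:
        if ann["language"] == language:
            counts[EMOTIONS.index(ann["emotion"])] += 1
    counts += 1e-9  # smoothing keeps the KL divergence finite
    return counts / counts.sum()

def emotion_kl(annotations, lang_a, lang_b):
    """KL(P_a || P_b): how poorly lang_b's emotion distribution predicts lang_a's.
    Note that KL divergence is asymmetric."""
    p = emotion_distribution(annotations, lang_a)
    q = emotion_distribution(annotations, lang_b)
    return entropy(p, q)  # scipy computes sum(p * log(p / q))
```

Computing this for every pair of languages yields the matrix that, after hierarchical clustering, becomes the heatmap in Figure 4.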

In Figure 4, lighter colors indicate high agreement, while darker colors indicate disagreement. The hierarchical clustering reveals two major groups:
- A large cluster containing mostly African languages.
- A smaller cluster containing mostly Asian languages.
This suggests that cultural background significantly influences emotional perception—a finding that justifies the need for this dataset. If everyone felt the same way, we wouldn’t need 28 languages to train the AI.
Core Method: Adapting Models for Multilingual Affect
Collecting the data is step one. Step two is building models that can actually use it. The researchers needed a model that could take an image and a target language as input, and output an emotional explanation in that language.
The Architecture
Standard English-centric models like LLaMA won’t suffice here, because their tokenizers and pretraining data offer poor coverage of languages such as Burmese or Amharic. The researchers instead turned to BLOOMZ, a multilingual Large Language Model (LLM).
They adapted several state-of-the-art Vision-Language models (like MiniGPT-4, InstructBLIP, and ClipCap) by replacing their language decoders with BLOOMZ.
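The sketch below shows the general shape of this adaptation, in the spirit of ClipCap and MiniGPT-4: frozen visual features are projected into the embedding space of a multilingual decoder. The checkpoint name, feature dimension, and freezing choices here are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class MultilingualAffectiveCaptioner(nn.Module):
    """Frozen vision features -> linear projection -> multilingual LLM (BLOOMZ)."""

    def __init__(self, vision_dim=1408, lm_name="bigscience/bloomz-7b1"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        for p in self.lm.parameters():      # keep the decoder frozen in this sketch
            p.requires_grad = False
        lm_dim = self.lm.get_input_embeddings().embedding_dim
        self.proj = nn.Linear(vision_dim, lm_dim)  # the new trainable piece

    def forward(self, vision_feats, prompt_ids, target_ids):
        """vision_feats: [B, n_visual_tokens, vision_dim] from a frozen vision encoder."""
        device = target_ids.device
        visual_embeds = self.proj(vision_feats)
        text_ids = torch.cat([prompt_ids, target_ids], dim=1)
        text_embeds = self.lm.get_input_embeddings()(text_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        # Supervise only the caption tokens; mask visual and prompt positions with -100.
        ignore = lambda shape: torch.full(shape, -100, dtype=torch.long, device=device)
        labels = torch.cat(
            [ignore(visual_embeds.shape[:2]), ignore(prompt_ids.shape), target_ids], dim=1
        )
        return self.lm(inputs_embeds=inputs_embeds, labels=labels).loss
```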
Instruction Tuning
To align the visual features with the multilingual text, the researchers used a two-stage training process.
Stage 1: Visual Alignment. Using large multilingual image-text datasets (like LAION-2B-multi), they taught the model to describe images in various languages using a standard prompt: “Could you describe the contents of this image for me? Use only [Language] characters.”
Stage 2: Cross-Lingual Alignment (ArtELingo). Here, they used the ArtELingo data to teach the model emotional nuances. They utilized a specific prompting strategy to encourage cross-lingual transfer:

By asking the model to generate captions in two randomly sampled languages for the same image simultaneously, the model learns to map the same visual features to different linguistic and cultural concepts.
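A minimal sketch of that cross-lingual sampling, with the prompt wording paraphrased rather than copied from the paper:

```python
import random

def build_cross_lingual_sample(image_id, captions_by_language):
    """Sample two languages that annotated this image and build one training
    example asking for an affective caption in both, so the same visual
    features get tied to two different linguistic and cultural descriptions.

    captions_by_language: dict mapping language name -> caption (needs >= 2 entries).
    """
    lang_a, lang_b = random.sample(list(captions_by_language), 2)
    prompt = (
        f"Explain the emotions this painting evokes. "
        f"First answer in {lang_a}, then answer in {lang_b}."
    )
    target = (
        f"{lang_a}: {captions_by_language[lang_a]}\n"
        f"{lang_b}: {captions_by_language[lang_b]}"
    )
    return {"image_id": image_id, "prompt": prompt, "target": target}
```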
Experiments and Results
The paper evaluates these models in three distinct setups to mimic real-world scenarios: Zero-Shot, Few-Shot, and One-vs-All.
Setup 1: Zero-Shot Performance
In this setup, the model is trained only on high-resource languages (English, Chinese, Arabic) and then tested on the other 25 languages. This tests the model’s ability to generalize to cultures it hasn’t explicitly studied for this specific task.

As seen in Table 2, MiniGPT-4 significantly outperforms the competition across metrics like BLEU-4 and CIDEr. This suggests that the reasoning capabilities of the underlying LLM are crucial for this task.
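For readers who want to reproduce this kind of evaluation, a per-language BLEU-4 score can be computed with an off-the-shelf library such as sacrebleu (CIDEr typically comes from the pycocoevalcap package). This is a generic sketch, not the authors' evaluation script.

```python
import sacrebleu

def bleu4_per_language(hypotheses, references):
    """Corpus-level BLEU (n-gram order 4 by default) for one target language.

    hypotheses: list[str] of model-generated captions
    references: list[str] of human captions, aligned one-to-one with hypotheses
    """
    # For non-Latin scripts, an appropriate tokenize= option (e.g. "char") may be needed.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# A zero-shot summary number can then be the average of this score
# over the 25 languages held out of affective training.
```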
Qualitatively, the models generate impressive results. In Figure 5 below, the top row shows languages seen during training, while the bottom row shows languages never seen during the specific affective training phase. The model still manages to generate coherent, emotionally resonant captions.

Setup 2: Few-Shot Performance
Here, the researchers added a small amount of data (~7k samples) from the low-resource languages to the training set.

Table 3 shows a massive jump in performance (BLEU-4 goes from 1.09 in Zero-Shot to 13.5). However, interestingly, increasing the data from 20% to 100% of the available few-shot samples didn’t yield huge gains. This suggests that horizontal expansion (adding more languages) is more valuable than vertical expansion (adding more samples per language) for this specific type of generalization.
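The few-shot mixtures can be thought of as a simple subsampling knob. The sketch below is a hypothetical illustration of mixing a fraction of each low-resource language's samples into the high-resource training set, not the authors' data pipeline.

```python
import random

def build_few_shot_mixture(base_data, low_resource_data, fraction=0.2, seed=0):
    """Mix a fraction of each low-resource language's annotations into the
    high-resource (English/Chinese/Arabic) training set.

    base_data: list of training samples from the high-resource languages
    low_resource_data: dict mapping language -> list of samples
    fraction: 0.2 uses ~20% of the available few-shot samples, 1.0 uses all
    """
    rng = random.Random(seed)
    mixture = list(base_data)
    for lang, samples in low_resource_data.items():
        k = min(len(samples), max(1, int(len(samples) * fraction)))
        mixture.extend(rng.sample(samples, k))
    rng.shuffle(mixture)
    return mixture
```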
Setup 3: One-vs-All Zero-Shot (The Cultural Test)
This is arguably the most insightful experiment. The researchers fine-tuned a model on one language (Source) and tested it on all other languages (Target).
The hypothesis is simple: Cross-lingual transfer should be better between culturally related languages.

Figure 6 confirms this hypothesis. The heatmap shows clusters of high performance (darker blue) that align with cultural groups rather than just linguistic families or scripts.
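Conceptually, the one-vs-all protocol is a double loop over languages. The sketch below (with `fine_tune` and `evaluate` as placeholders for the real training and scoring runs) produces the source-by-target matrix whose clustered heatmap is shown in Figure 6.

```python
import numpy as np

def one_vs_all_matrix(languages, fine_tune, evaluate):
    """Fine-tune on each source language in turn and score on every target.

    fine_tune(lang) -> model        (placeholder for the actual training run)
    evaluate(model, lang) -> float  (e.g. a captioning metric on that language's test split)
    Returns a [source x target] score matrix; clustering its rows and columns
    is what reveals the cultural groupings in the heatmap.
    """
    scores = np.zeros((len(languages), len(languages)))
    for i, src in enumerate(languages):
        model = fine_tune(src)
        for j, tgt in enumerate(languages):
            scores[i, j] = evaluate(model, tgt)
    return scores
```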
Key Observations from the Clusters:
- The African Cluster: Languages like IsiZulu, IsiXhosa, and Setswana cluster tightly.
- The South Asian Cluster: Urdu, Hindi, and Tamil perform well on each other. Notably, Urdu and Hindi share history but use different scripts, yet the transfer is successful. This suggests the model is picking up shared cultural ground rather than surface text patterns.
- The Southeast Asian Cluster: Indonesian and Vietnamese show strong transferability.
This result is profound. It implies that by training on a specific language, the AI learns a specific way of seeing and feeling the world that is compatible with neighboring cultures.
Emotion Label Prediction
Finally, the researchers tested the models on simply predicting the emotion label from the caption.

Table 4 shows that the ArtELingo-28 model (trained on native data) vastly outperforms the base model and the version trained only on translated/high-resource data. This is strong empirical evidence that you cannot simply translate English datasets and expect an AI to understand the emotional reality of other cultures.
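As an illustration of the label-prediction task itself, here is a minimal sketch of a caption-to-emotion classifier built on a multilingual encoder. The checkpoint choice is an assumption, the classifier head is untrained until fine-tuned on ArtELingo-28 labels, and this is not the paper's model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

EMOTIONS = [
    "contentment", "awe", "excitement", "amusement",
    "sadness", "anger", "fear", "disgust", "something else",
]

# Any strong multilingual encoder could serve as the caption-to-emotion classifier;
# xlm-roberta-base is just one reasonable choice for a sketch.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(EMOTIONS)
)

def predict_emotion(caption: str) -> str:
    """Map a (native-language) affective caption to one of the 9 emotion labels.
    Only meaningful after the classification head is fine-tuned on labeled captions."""
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return EMOTIONS[int(logits.argmax(dim=-1))]
```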
Conclusion and Implications
ArtELingo-28 is more than just a dataset; it is a wake-up call for the AI community. The paper demonstrates that:
- Culture creates Data: Emotional responses to art are not universal; they are culturally specific.
- Native Data is King: Models trained on high-quality, native-speaker annotations significantly outperform those relying on translation or high-resource languages alone.
- AI Can Detect Culture: The clustering results show that multimodal models can capture deep cultural connections between languages.
By embracing diversity and subjectivity, ArtELingo-28 paves the way for AI systems that don’t just see the world as a collection of objects, but understand it as a tapestry of human experiences. For students and researchers entering the field, this highlights a crucial direction: the future of AI isn’t just about bigger models; it’s about broader, more inclusive representation.