When we look at a photograph, we assume we are seeing objective reality. If you show a picture of a park to a person in New York and a person in Munich, surely they are seeing the same grass, the same benches, and the same sky.
But are they?
Cognitive science and psychology suggest that visual perception is deeply tied to culture. A Western viewer might focus intently on the foreground objects—the specific people or the specific make of a car. An East Asian viewer might place significantly more weight on the context, the background, and the relationships between objects.
This poses a massive problem for Artificial Intelligence. Modern Vision-Language Models (VLMs), like the popular CLIP, are trained primarily on English data. When researchers want to make these models multilingual, the standard approach is to take English image captions and run them through a machine translator (or even a human translator) into the target language.
But if a German speaker “sees” the image differently than an English speaker, simply translating the English description misses the point. You are translating the English speaker’s perception into German words, rather than capturing the German speaker’s native perception.
In this post, we will dive deep into a fascinating research paper titled “Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval.” We will explore how relying on translation hurts AI performance, quantify the “perception gap,” and look at novel ways researchers are using Large Language Models (LLMs) to try and fix it.
The Multilingual Challenge in Vision-Language Models
To understand the problem, we first need to look at how models like CLIP work. These models learn to associate images with text by training on massive datasets of image-caption pairs. Once trained, you can search for images using text (Image-Text Retrieval) or describe images automatically.
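As a concrete illustration, here is a minimal sketch of CLIP-style image-text retrieval using the HuggingFace transformers API. The checkpoint is the public English CLIP release; the image filenames are placeholders standing in for an image database.

```python
# Minimal sketch of image-text retrieval with (English-only) CLIP.
# "park.jpg" and "dog.jpg" are hypothetical stand-ins for an image database.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("park.jpg"), Image.open("dog.jpg")]
query = "a dog running in the grass"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images); higher means more similar.
scores = outputs.logits_per_text[0]
print("Best matching image index:", int(scores.argmax()))
```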
However, the vast majority of this training data is in English. To create multilingual versions (like mCLIP), researchers typically use two methods:
- Translated Training Data: Taking an English dataset (like Flickr30K) and translating the captions into other languages.
- Multilingual Text Encoders: Using a model that understands multiple languages (like XLM-R) to map foreign text into the same “embedding space” as English text.
The paper argues that these methods overlook a critical flaw: Translation is not the same as native description.
Cultural Differences in Perception
Research in psychology has long established that culture influences attention. For example:
- Object Specificity: Some cultures habitually name specific category members (e.g., a “robin” rather than just a “bird”), while others generalize more.
- Foreground vs. Background: As noted by Nisbett and Masuda (2003), Americans tend to focus on foreground objects, while East Asians often focus on background context.
- Linguistic Relativity: The language we speak shapes what we look for. German speakers, whose language uses gendered nouns, might subconsciously ascribe different properties to objects.
If we train an AI model for German users using translated English captions, the model learns to “look” at the image like an American, just described in German words. It fails to learn the specific nuances that a native German speaker would naturally care about.
[Figure 1: two example images shown with their original English captions and independently written native German captions]
Figure 1 above illustrates this phenomenon perfectly.
- Left Image: An English speaker might simply see a “man on a bench.” A native German speaker, recognizing the cultural context, identifies it specifically as a “Heurigen bench” (associated with wine taverns) and describes the scene differently.
- Right Image: The English caption focuses on the “Union Jack motifs.” The German caption might focus differently on the parasol or the group dynamic.
When we simply translate the English caption, we lose the “Heurigen” concept entirely. The model never learns to recognize that specific visual feature because the English source text never mentioned it.
Methodology: Measuring the Gap
To scientifically quantify this problem, the researchers set up a rigorous experiment comparing different ways of training a retrieval model.
The Task
The goal is Image-Text Retrieval in German.
- Input: A German text query (e.g., “Ein Hund läuft im Gras”).
- Output: Finding the correct image from a large database.
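Retrieval quality in this setting is reported as Mean Recall, commonly the average of Recall@1, Recall@5, and Recall@10. Below is a minimal sketch of that metric, assuming a precomputed query-image similarity matrix in which query i’s correct image is image i (the random matrix is just a placeholder).

```python
# Minimal sketch of text-to-image Recall@K and Mean Recall.
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct image appears in the top-k results."""
    ranked = np.argsort(-similarity, axis=1)          # rank images per query, best first
    correct = np.arange(similarity.shape[0])          # query i's gold image is image i
    hits = (ranked[:, :k] == correct[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.rand(100, 100)  # stand-in for real query/image similarity scores
mean_recall = np.mean([recall_at_k(sim, k) for k in (1, 5, 10)])
print(f"Mean Recall: {mean_recall:.3f}")
```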
The Model
They used mCLIP (Multilingual CLIP). This model replaces the standard text encoder with XLM-R (a powerful multilingual language model) and aligns it with the visual capabilities of CLIP.
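The rough idea can be sketched as an XLM-R text encoder plus a projection head that maps multilingual text into the same embedding space as CLIP’s (frozen) image encoder. The checkpoint names and the simple linear projection below are illustrative assumptions; the actual mCLIP alignment recipe differs in detail.

```python
# Simplified sketch of the mCLIP idea: XLM-R text encoder projected into
# CLIP's image embedding space (512-dim for ViT-B/32), so German queries
# can be compared directly against CLIP image embeddings.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel, XLMRobertaTokenizer

class MultilingualTextEncoder(nn.Module):
    def __init__(self, clip_dim: int = 512):
        super().__init__()
        self.backbone = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        # Linear head mapping XLM-R's 768-dim output into CLIP's embedding space.
        self.proj = nn.Linear(self.backbone.config.hidden_size, clip_dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # first-token ("CLS"-style) representation
        return nn.functional.normalize(self.proj(pooled), dim=-1)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
encoder = MultilingualTextEncoder()
batch = tokenizer(["Ein Hund läuft im Gras"], return_tensors="pt", padding=True)
with torch.no_grad():
    emb = encoder(batch["input_ids"], batch["attention_mask"])
print(emb.shape)  # torch.Size([1, 512]) -- comparable to CLIP image embeddings
```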
The Comparison Groups
The core of the study is comparing four different sources of training data. All models were trained/fine-tuned to perform German retrieval, but the source of the German supervision varied:
- ENG (Baseline): The model is fine-tuned on the original English captions. It relies solely on mCLIP’s pre-existing cross-lingual abilities to understand German queries during testing.
- ENG2GER-MT (Machine Translation): The English captions are translated into German using an off-the-shelf Machine Translation (MT) model (Opus-MT). This represents the standard “easy” way to make multilingual datasets (a sketch of this translation step appears after this list).
- ENG2GER-HT (Human Translation): The English captions are translated into German by professional human translators.
- Crucial Note: These translators only saw the text, not the image. They were translating the English sentence, not describing the scene.
- GER (Native Perception): The model is fine-tuned on captions written by native German speakers who looked at the image and wrote a description from scratch.
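For the ENG2GER-MT setting, the translation step can be approximated with the public Opus-MT English-to-German checkpoint; the example captions below are made up for illustration.

```python
# Sketch of machine-translating English captions into German with Opus-MT,
# as in the ENG2GER-MT setting. The captions are hypothetical examples.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

english_captions = [
    "A man sits on a bench in front of a wine tavern.",
    "A dog runs through the grass.",
]
german_captions = [out["translation_text"] for out in translator(english_captions)]
print(german_captions)
```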
The Dataset
The researchers used the Multi30K dataset, which is unique because it contains both professional translations of English captions and independently written native German captions for the same images. This allows for a perfect “apples-to-apples” comparison.
Results: The “Perception Gap” is Real
The results of the experiments confirm that translation—even professional human translation—is fundamentally different from native perception.
[Table 1: Mean Recall for German image-text retrieval under each source of training captions]
Let’s analyze the findings from Table 1:
- Native Data is King (GER): The model trained on native German captions achieved the highest Mean Recall of 38.4. This is the gold standard.
- Machine Translation Lags (ENG2GER-MT): Training on machine-translated text resulted in a score of 33.4, a full 5.0 points below the native performance.
- Human Translation isn’t Enough (ENG2GER-HT): Even when professional humans did the translating, the score was 36.8.
- This is better than machine translation, but it is still 1.6 points lower than the native captions.
- This 1.6-point gap represents the Perception Gap. Since the human translators were fluent experts, the shortfall isn’t linguistic (grammar or vocabulary); it is perceptual. The translators preserved the English speaker’s focus, which doesn’t perfectly align with what German users search for.
Strategies to Bridge the Gap
Acknowledging that we can’t always afford to collect millions of native descriptions for every language, the authors proposed three augmentation strategies to make translated data better.
The goal was to diversify the English source text before translation, hoping to capture a wider range of concepts that might align better with German perception.
1. Hypernymization (HYPER)
This strategy involves replacing specific objects with more general terms (hypernyms).
- Concept: Instead of “Bronco,” use “Horse.” Instead of “Heurigen bench,” use “Bench.”
- Why? Cultural differences often manifest in specificity. If English speakers are overly specific about things German speakers aren’t (or vice versa), generalizing the terms might reduce the mismatch.
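The paper’s exact hypernymization procedure isn’t spelled out here, but one common way to approximate it is to step from a noun to its immediate WordNet hypernym, as in this rough sketch.

```python
# Rough sketch of hypernymization: replace a specific noun with its immediate
# WordNet hypernym (one step up the hierarchy). An approximation only; the
# paper's actual procedure may differ.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def hypernymize(word: str) -> str:
    """Return a hypernym of the word's most common noun sense, if one exists."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word
    hypernyms = synsets[0].hypernyms()
    if not hypernyms:
        return word
    return hypernyms[0].lemmas()[0].name().replace("_", " ")

print(hypernymize("robin"))   # e.g. "thrush" (one level more general)
print(hypernymize("bench"))   # e.g. "seat"
```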
2. Random Paraphrasing (PARA-RND)
Here, the researchers used LLaMA-3, a powerful Large Language Model, to rewrite the English captions.
- Prompt: They asked LLaMA to write the caption in a “structurally different manner” while keeping the meaning.
- Goal: To break the specific syntactic structure of English that might not translate well to German visual preferences.
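A sketch of what this could look like with an instruction-tuned LLaMA-3 checkpoint via the transformers text-generation pipeline (recent versions accept chat-format inputs). The prompt wording here is illustrative, not the paper’s exact prompt, and the checkpoint is gated on the Hub.

```python
# Sketch of PARA-RND: ask an instruction-tuned LLaMA-3 to rewrite a caption in
# a structurally different way. Prompt wording is illustrative, not the
# paper's exact prompt.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # gated; requires license acceptance
    device_map="auto",
)

caption = "A man in a red shirt is riding a bicycle down a busy street."
messages = [{
    "role": "user",
    "content": ("Rewrite the following image caption in a structurally "
                f"different manner while keeping its meaning:\n{caption}"),
}]

# With chat-format input, the pipeline returns the full conversation;
# the last message is the model's reply.
result = generator(messages, max_new_tokens=60)
print(result[0]["generated_text"][-1]["content"])
```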
3. Targeted Paraphrasing (PARA-TGT)
This was the most sophisticated approach.
- Method: They used In-Context Learning. They fed LLaMA examples of actual native German captions (translated back to English) to show the model the “style” of German descriptions.
- Process: LLaMA would analyze the English caption, look at the “German-style” examples, and rewrite the English caption to mimic that perceptual style.
- Example: If German captions tend to simplify “man in a red shirt riding a bicycle” to just “bicyclist,” the LLM would make that edit to the English text before it was translated to German.
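The few-shot prompt for this targeted style transfer might be assembled roughly as follows; the example pairs and wording are hypothetical, not taken from the paper.

```python
# Sketch of PARA-TGT prompt construction: in-context pairs of
# (English caption, "German-style" rewrite) guide the LLM toward the
# perceptual style of native German captions. Pairs are hypothetical.
in_context_pairs = [
    ("A man in a red shirt is riding a bicycle down a busy street.",
     "A bicyclist rides down a busy street."),
    ("A woman holding a red umbrella stands near a fountain.",
     "A woman with an umbrella stands at a fountain."),
]

def build_para_tgt_prompt(caption: str) -> str:
    lines = ["Rewrite the caption so it matches the style of the rewritten examples:"]
    for original, german_style in in_context_pairs:
        lines.append(f"Original: {original}")
        lines.append(f"Rewritten: {german_style}")
    lines.append(f"Original: {caption}")
    lines.append("Rewritten:")
    return "\n".join(lines)

print(build_para_tgt_prompt("Two children in blue jackets play soccer on a muddy field."))
```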
Did it work?
Looking back at Table 1, specifically the middle section:
- PARA-CMB (Combined): Combining these paraphrasing strategies raised the performance of the Machine Translated model from 33.4 to 34.7.
- Impact: This is a solid improvement (+1.3), but a gap still remains compared to the Human Translation (36.8) and Native (38.4) scores.
This suggests that while LLMs can help diversify data, they cannot fully hallucinate the cultural nuance of a native speaker without seeing the image itself.
Deep Dive: Global Analysis of Perception
One of the most interesting parts of this paper is the broader analysis of how different languages “see” the world. The researchers went beyond German and analyzed XM3600, a dataset covering 36 languages.
They grouped languages by region/culture (European, Arabic, Hindi, East Asian, etc.) and counted how frequently different objects were mentioned in captions for the same set of images.
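In spirit, the analysis boils down to counting object mentions per language group over the same set of images. Here is a toy sketch with made-up captions; the real analysis runs over the full XM3600 caption sets.

```python
# Toy sketch of counting object mentions across language groups.
# Captions and groups are invented purely for illustration.
from collections import Counter

captions_by_group = {
    "European": ["a park with a few trees and a bench", "a building behind a fence"],
    "Hindi/Bengali": ["many trees around a small temple", "trees line the road near a table"],
}
object_words = ["tree", "building", "table", "woman"]

for group, captions in captions_by_group.items():
    # crude plural stripping so "trees" counts toward "tree"
    tokens = Counter(w.rstrip("s") for c in captions for w in c.lower().split())
    print(group, {obj: tokens[obj] for obj in object_words})
```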
[Table 2: how often different objects are mentioned in captions across language groups in XM3600]
Table 2 reveals striking differences:
- Nature vs. Scenery: Hindi/Bengali speakers mentioned “trees” more than twice as often as European speakers did (581.5 vs. 270.5).
- Urban Focus: Swahili captions had the highest mention rate for “buildings” (502), significantly higher than East Asian languages (253).
- Domestic Objects: Hindi speakers mentioned “tables” most frequently, while European speakers mentioned them far less often.
- Gender: Indonesian captions mentioned “woman” significantly more often (164.5) than Hindi captions (114).
These aren’t just translation errors; these represent fundamental shifts in what is considered “caption-worthy” in an image. If you train a Hindi model using translated European captions, the model will learn to ignore trees that a native Hindi speaker would expect the model to recognize.
Recognition Accuracy
Does this perception gap actually affect the model’s ability to recognize objects? Yes.
[Table 3: precision and recall of object recognition for models trained on translated versus native German captions]
Table 3 compares the “Precision” and “Recall” of object recognition.
- Recall: The Native German model (GER) generally has higher recall (bottom row). This means it is better at finding objects that are actually in the image and relevant to the user.
- Precision: Interestingly, the Human Translated model (ENG2GER-HT) often has higher precision. This is likely because translated captions are often more literal and conservative, whereas native speakers use more varied vocabulary that might slightly confuse the model, even if it is more “natural.”
The table shows that Native German captions mentioned “Vehicles” 2604 times, while the Translated set mentioned them 2724 times. This discrepancy implies that English speakers (the source of the translation) felt the need to point out cars more often than German speakers did for the same images.
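At the object level, precision and recall can be read as set comparisons between the objects a model surfaces and the objects a native speaker considers caption-worthy. A minimal sketch with hypothetical object sets:

```python
# Minimal sketch of object-level precision and recall.
# The object sets below are hypothetical examples.
def precision_recall(predicted: set, relevant: set) -> tuple[float, float]:
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

predicted_objects = {"vehicle", "bench", "tree"}      # objects the model mentions
relevant_objects = {"bench", "tree", "wine tavern"}   # objects natives found caption-worthy
print(precision_recall(predicted_objects, relevant_objects))  # roughly (0.67, 0.67)
```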
Conclusion: The Future of Multilingual AI
The research by Buettner and Kovashka provides a crucial wake-up call for the AI community. As we race to build “universal” models that speak 100+ languages, we cannot rely solely on translation.
Key Takeaways:
- Translation ≠ Perception: A correct translation of a sentence is not necessarily a correct description of an image from a native cultural perspective.
- The Gap is Quantifiable: There is a distinct performance drop (about 5 points of mean recall) when using machine translation vs. native data, and even professional human translation cannot fully close this gap.
- Augmentation Helps, But Doesn’t Cure: Using LLMs to paraphrase and diversify training data improves results, but we still need better ways to capture cultural nuance.
Why This Matters
If we want AI to be truly accessible and useful globally, it must do more than swap words; it must swap perspectives. A medical AI analyzing scans in India might need to focus on different visual markers than one in the US. A safety robot in Japan might need to interpret “cluttered” environments differently than one in Canada.
This paper suggests that the future of computer vision isn’t just about higher resolution or faster processing—it’s about building models that understand the subtle, human differences in how we view our shared world. Researchers must prioritize collecting native data and developing “culturally aware” augmentation strategies to ensure no one is left “lost in translation.”