We have all heard the idiom: “An image speaks a thousand words.” It is a universal truth about the power of visual communication. But there is a caveat we rarely discuss: does everyone listen to that image in the same way?
In our increasingly globalized world, we consume content from everywhere. A movie made in the US is watched in Japan; an educational worksheet created in India might be used in Nigeria. While we have become quite good at translating the text (the words) using Machine Translation, we often neglect the visuals.
Imagine a math worksheet for a first-grader in the United States asking them to count coins. The image shows quarters and dimes. If you give that same worksheet to a child in India, the text might be translated to Hindi, but if the image still shows US currency, the child will be confused. The visual context is broken.
This process of adapting content—both text and visuals—to suit a specific cultural context is called Transcreation.
In this post, we will dive deep into a recent research paper that introduces the task of Image Transcreation. We will explore how current generative AI models handle cultural adaptation, the pipelines the researchers built to test them, and why—despite the hype surrounding AI art—machines still struggle to understand culture.
The Problem: When Translation Isn’t Enough
Traditionally, translation has been linguistic. If I say “apple,” you translate it to “manzana,” “pomme,” or “ringo.” But in multimedia, meaning is derived from the interaction between text and image.
Cultural adaptation goes beyond swapping words. It requires understanding what creates the same “effect” on the target audience.

As shown in Figure 1, transcreation is already happening manually in several industries:
- Audiovisual Media: In the movie Inside Out, the vegetable that the child protagonist hates was changed from broccoli (US version) to green bell peppers (Japan version) because Japanese children generally dislike peppers more than broccoli.
- Education: Math problems are adapted to use local currency or culturally relevant objects to help children learn faster.
- Advertising: Brands like Coca-Cola or Ferrero Rocher adapt their packaging and imagery to align with local festivals, like Lunar New Year in China or Diwali in India.
The researchers identified a gap: while human translators do this, machine learning systems are currently stuck on text. There is no automated system specifically designed to take an image and “translate” its cultural context while preserving its original intent.
The Approach: Three Pipelines for Cultural Adaptation
Since there were no off-the-shelf models for “image transcreation,” the authors constructed three distinct pipelines using state-of-the-art generative models. They wanted to see if existing AI could be prompted to perform this complex task.
The goal: Take a source image (e.g., a plate of food from Nigeria) and adapt it to a target culture (e.g., the United States), or vice versa.

As illustrated in Figure 2, the researchers designed three approaches ranging from simple instruction to complex retrieval.
1. e2e-instruct: The Direct Approach
This is the most straightforward method. It uses an instruction-based image editing model (specifically InstructPix2Pix).
- How it works: You feed the model the original image and a text prompt: “Make this image culturally relevant to [Target Country].”
- The Logic: This tests whether the image model itself has enough “world knowledge” to understand what cultural relevance means without extra help (a minimal sketch of this call follows below).
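Since InstructPix2Pix is publicly available through Hugging Face `diffusers`, the e2e-instruct idea is easy to reproduce in spirit. The sketch below is a minimal illustration: the prompt template, inference settings, and file names are my assumptions, not the paper’s exact configuration.

```python
# Minimal sketch of e2e-instruct using the public InstructPix2Pix
# checkpoint. Prompt wording and decoding settings are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

source = Image.open("worksheet.png").convert("RGB")  # hypothetical input
target_country = "Japan"

edited = pipe(
    prompt=f"Make this image culturally relevant to {target_country}",
    image=source,
    num_inference_steps=50,
    image_guidance_scale=1.5,  # how strongly to stay close to the source image
).images[0]
edited.save("worksheet_japan.png")
```

The single free-text instruction is the whole pipeline: everything the model knows about the target culture has to come from its training data.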
2. cap-edit: The LLM-Assisted Approach
This pipeline acknowledges that vision models might not be smart enough yet. It brings in a Large Language Model (LLM) to act as the “cultural brain.”
- Step 1 (Caption): A model describes the image (e.g., “A bowl of spicy ramen”).
- Step 2 (LLM Edit): GPT-3.5 is asked to edit that caption to fit the target culture (e.g., changing “ramen” to “feijoada” if the target is Brazil).
- Step 3 (Image Edit): An image editing model updates the visual based on the new caption, aiming to preserve the structure of the original image. A sketch of all three steps follows below.
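To make the three stages concrete, here is a minimal sketch of cap-edit. The BLIP captioner, the exact LLM prompt, and the file names are stand-ins chosen for illustration; the paper’s specific captioning and editing models may differ.

```python
# Sketch of cap-edit: caption -> LLM rewrite -> image edit.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from openai import OpenAI

# Step 1: caption the source image (BLIP used here as an example captioner).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)
image = Image.open("ramen.jpg").convert("RGB")  # hypothetical input
inputs = processor(image, return_tensors="pt")
caption = processor.decode(
    captioner.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True
)

# Step 2: ask the LLM to adapt the caption to the target culture.
client = OpenAI()  # assumes OPENAI_API_KEY is set
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Edit this image caption so it describes an equivalent "
                   f"concept in Brazil, changing as little as possible: \"{caption}\"",
    }],
)
new_caption = response.choices[0].message.content

# Step 3: feed `new_caption` to a text-guided, structure-preserving image
# editor. Any such editor can slot in here; the paper's choice may differ.
```

The key design choice is the division of labor: the LLM supplies the cultural knowledge, and the image model only has to follow a concrete, already-adapted description.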
3. cap-retrieve: The Retrieval Approach
Sometimes, editing an image looks fake. This pipeline argues that it’s better to find a real photograph that matches the new cultural context.
- Step 1 & 2: Same as above (Caption \(\rightarrow\) LLM Edit).
- Step 3 (Retrieve): Instead of generating pixels, the system uses the new culturally adapted caption to search a massive database (LAION) for a real image from the target country; a retrieval sketch follows below.
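For a feel of the retrieval step, here is a sketch using the open-source `clip-retrieval` client to query a CLIP-indexed LAION backend. The endpoint, index name, and example caption are illustrative assumptions (hosted LAION indices come and go), and the paper’s setup, including any country filtering, may differ.

```python
# Sketch of cap-retrieve's final step: query a CLIP index of LAION
# with the culturally adapted caption instead of generating pixels.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # example public backend
    indice_name="laion5B-L-14",              # example index name
    num_images=20,
)

# `new_caption` would be the LLM-edited caption from Step 2, e.g.:
results = client.query(text="a bowl of feijoada with rice and orange slices")
for hit in results[:5]:
    print(hit["similarity"], hit["url"])
```

Because every result is a real photograph, the output never looks “generated”; the trade-off, as the results section shows, is that the retrieved image may drift from the original image’s content and layout.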
Building the Evaluation Dataset
You cannot evaluate cultural relevance without a diverse, grounded dataset. The authors created a two-part dataset spanning 7 geographically diverse countries: Brazil, India, Japan, Nigeria, Portugal, Turkey, and the United States.
Part 1: The Concept Dataset
This dataset focuses on simple, single-concept images. The researchers worked with locals from the selected countries to identify 5 culturally salient concepts across universal categories like food, celebrations, and housing.

Figure 3 shows the breadth of this dataset. For “Food,” they didn’t just pick random dishes; they picked dishes that locals identified as representative. This resulted in about 600 images where the content is cross-culturally coherent but visually distinct.
Part 2: The Application Dataset
To test if these models work in the real world, the researchers curated images from two challenging domains:
- Education: Math worksheets where the visual is part of the problem (e.g., counting objects).
- Literature: Storybook illustrations where the image must match the story text.

Figures 4 and 5 highlight why this is hard. In Figure 4, you can’t just replace the Christmas ornaments with a random object; the new object must still be countable and distinguishable by color. In Figure 5 below, the image must match the specific sentence “My mom bought rice.”

Experiments and Results: A Reality Check for AI
The researchers conducted a massive human evaluation. They didn’t rely solely on automated scores (which often fail to capture nuance). They asked evaluators from the target countries to rate the images on cultural relevance, visual quality, and meaning preservation.
The results were revealing—and somewhat humbling for the current state of AI.
Finding 1: Models Struggle with Cultural Nuance
Overall, the task is incredibly difficult. Even the best pipelines could only successfully transcreate about 5% to 30% of images depending on the country.

As shown in Figure 6, the success rate (the bottom row “C1+C3”) is low. cap-retrieve (finding real images) generally performed better than trying to generate new pixels, but it often retrieved irrelevant images.
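As a toy illustration of how a combined success rate like “C1+C3” could be computed from per-image human ratings, consider the sketch below. The DataFrame schema (columns `pipeline`, `country`, `c1`, `c3` with binary judgments) is hypothetical, not the paper’s released format.

```python
# Toy sketch: aggregate binary human ratings into a per-pipeline,
# per-country success rate, counting only images that pass BOTH criteria.
import pandas as pd

ratings = pd.DataFrame({
    "pipeline": ["cap-retrieve", "cap-retrieve", "cap-edit", "cap-edit"],
    "country":  ["Brazil", "Brazil", "Brazil", "Brazil"],
    "c1": [1, 0, 1, 0],  # e.g., culturally relevant to the target country?
    "c3": [1, 1, 0, 0],  # e.g., preserves the original image's intent?
})

# An image counts as successfully transcreated only if both criteria hold.
ratings["success"] = ratings["c1"] & ratings["c3"]
print(ratings.groupby(["pipeline", "country"])["success"].mean())
```

Requiring the conjunction of criteria is exactly why the headline numbers are so low: a culturally relevant image that loses the original meaning still counts as a failure, and vice versa.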
Finding 2: The “Flag Bias” and Stereotyping
One of the most interesting—and problematic—failures came from the e2e-instruct pipeline. When asked to make an image “culturally relevant” to a country, the model often panicked and simply plastered the image with the country’s flag colors.

In Figure 19, the model was asked to adapt a “count the hotdogs” worksheet for Brazil. Instead of switching to a Brazilian snack, it turned the hotdogs into weird Brazilian flags.
Similarly, in Figure 18 below, we see what happens when adapting a Coca-Cola bottle for Turkey. The e2e-instruct model (Image b) seemingly hallucinates that Turkish culture implies a red liquid, or perhaps it’s confusing the brand red with the Turkish flag red.

Finding 3: Layout Preservation vs. Cultural Accuracy
The cap-edit pipeline tries to keep the original image’s structure. This is great for worksheets where layout matters, but it can create “Frankenstein” images.
Look at Figure 16 (Source: Japan, Target: Brazil). The prompt asked to change Ramen (Japan) to Feijoada (Brazil).

- Image (b), e2e-instruct: just turns the ramen yellow/green (Brazil flag colors).
- Image (c), cap-edit: tries to force feijoada (a bean stew) into the shape of a ramen bowl with noodles. The result looks unnatural.
- Image (d), cap-retrieve: finds a real picture of ingredients, but loses the “bowl of soup” structure entirely.
Finding 4: Success is Context-Dependent
Sometimes, the model fails to be realistic but succeeds in the task.
In the education dataset, the goal is often to teach a concept (like counting). In Figure 7, the task was to adapt cherries for Japan.

The cap-edit pipeline (c) changed the fruit into flowers (Cherry Blossoms). While semantically this is a shift (fruit \(\rightarrow\) flower), it is culturally relevant to Japan, and crucially, the child can still count the flowers. The researchers considered this a successful transcreation because it preserved the educational utility.
Why Is This So Hard?
The paper highlights several reasons why AI struggles here:
- Visual Bias: Most models are trained on internet data dominated by Western imagery. They “know” what a hamburger looks like in high definition, but might have a fuzzy, stereotypical idea of what Nigerian Amala looks like.
- Lack of Semantic Understanding: Image editors look at pixels and shapes. They don’t understand that if you change a “hot dog” to a “taco,” you can’t keep the exact same cylinder shape.
- The “Lazy” Shortcut: Models optimize for the easiest path. Adding a flag is an easier way to satisfy the prompt “make it American” than redesigning the architecture of a house in the background.
Conclusion and Implications
This paper serves as a “first step” into a new frontier of AI. We are moving past the era of simply generating high-quality images; we are entering the era where we need culturally accurate images.
The authors conclude that:
- LLMs are necessary: Pure vision models don’t have enough cultural context yet. We need language models to guide them.
- Retrieval is powerful: Sometimes, the best way to generate an image is to find one that already exists.
- Evaluation is key: We cannot improve what we cannot measure. The dataset created in this paper provides a benchmark for future models.
For students and researchers, this work opens up massive opportunities. How do we build models that understand that a wedding dress is white in the West but often red in the East? How do we adapt educational content so every child sees their own world reflected in their textbooks?
As the paper suggests, an image may speak a thousand words, but we need to ensure our AI systems are speaking a language that everyone can understand.