Introduction: The “Pink Elephant” in the Room
Imagine you are trying to tell a friend that you were incredibly drunk last night. If you are speaking English, you might say you were “seeing pink elephants.” Now, imagine feeding that sentence into a translation engine to communicate with a friend in China. If the AI translates it literally, your Chinese friend might be confused about why you were at a zoo. In Chinese, a common idiom for being heavily drunk is 烂醉如泥, literally “drunk as mud.”
This is the fundamental challenge of Machine Translation (MT) today. We have conquered grammar and we are getting better at facts, but we are still struggling with the soul of language: the metaphor.
Current evaluation metrics for AI translation, like BLEU or ROUGE, operate primarily by matching words or n-grams (sequences of words) against a reference. They are good at checking whether a system reproduced the expected surface text. They are terrible at determining whether it captured the vividness or emotional weight of a figurative expression.
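To see why n-gram matching falls short here, consider a minimal sketch using the sacrebleu package. The sentences and the comparison are invented for illustration (they are not drawn from the paper’s corpus): a paraphrase that flattens the metaphor and a paraphrase that swaps in a different metaphor each differ from the reference by a single word, so BLEU scores them almost identically and cannot tell which one kept the figurative force.

```python
# pip install sacrebleu
import sacrebleu

# Reference: a human translation that keeps the figurative sense.
reference = "The senator steamrollered the bill through the committee."

# Two hypothetical machine outputs (illustrative, not from the MMTE corpus):
flat_paraphrase = "The senator forced the bill through the committee."
metaphor_swap = "The senator bulldozed the bill through the committee."

for name, hypothesis in [("flat paraphrase", flat_paraphrase),
                         ("metaphor swap", metaphor_swap)]:
    score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    print(f"{name}: BLEU = {score:.1f}")

# Both hypotheses score about the same, because BLEU only counts word overlap;
# it cannot tell that one preserves the metaphor and the other strips it.
```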
In this post, we will deep-dive into the research paper “MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language.” The researchers behind this work have developed a new framework to assess how well machines translate metaphors, introducing a crucial concept called Metaphorical Equivalence. They also released a new corpus to help train the next generation of translation models to be not just accurate, but poetic.

As illustrated in Figure 1, the bridge between languages isn’t always a straight line—it often requires a creative leap. Let’s explore how we measure that leap.
Background: More Than Just Words
To understand why this research is necessary, we have to look at what a metaphor actually is from a cognitive perspective. According to foundational linguistic theory (specifically Lakoff and Johnson), a metaphor isn’t just a fancy poetic device; it is a conceptual mapping. We understand one concept (usually abstract) in terms of another (usually concrete).
For example, in the sentence “The scream pierced the night,” we are mapping the physical action of a sharp object (piercing) onto a sound (a scream).
While some of these mappings are universal (language-agnostic), many are deeply rooted in culture. When an AI translates “The Senator steamrollered the bill,” it has two choices:
- Literal Translation: Translate the word “steamroller” into the target language’s word for the construction vehicle. This usually results in nonsense.
- Figurative Translation: Find a verb in the target language that conveys the same forcefulness and domination, even if the literal image changes.
The problem is that existing datasets for evaluating translation are biased toward literal language. There has been a scarcity of resources specifically designed to test metaphorical competence. This is the gap the MMTE project aims to fill.
The MMTE Framework: Building a Better Yardstick
The researchers didn’t just want to test existing models; they wanted to create a gold standard for how to test them. They developed a three-stage pipeline to create a high-quality, multilingual metaphor dataset (English to Chinese and English to Italian).
1. The Pipeline
The process, visualized in Figure 2 below, moves from raw machine output to human refinement.

- Pre-processing: They started with the MOH dataset (a collection of metaphorical and literal sentences). They ran these through four major translation systems: Google Translate, Youdao, Helsinki-NLP (Opus-MT), and GPT-4.
- Annotating: This is the core innovation. Human annotators didn’t just say “good” or “bad.” They used a specific set of new metrics (Quality, Equivalence, Emotion, Authenticity) to grade the translations.
- Post-Editing: Because machine translations are often flawed, human experts fixed the translations to create a “Gold Standard” reference. This ensures that future evaluations have a perfect human benchmark to compare against.
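To make the pipeline’s output concrete, here is a minimal sketch of what a single corpus entry could look like. The field names and placeholder values are my own illustration of a simple per-sentence record; they are not the released corpus schema.

```python
from dataclasses import dataclass

@dataclass
class MetaphorTranslationRecord:
    """One illustrative entry produced by the three-stage pipeline."""
    source: str                   # English sentence drawn from the MOH data
    is_metaphorical: bool         # literal vs. metaphorical source sentence
    target_lang: str              # "zh" or "it"
    mt_outputs: dict[str, str]    # system name -> raw machine translation
    annotations: dict[str, dict]  # system name -> human scores for Quality,
                                  # Equivalence, Emotion, Authenticity
    gold_reference: str           # post-edited human translation

record = MetaphorTranslationRecord(
    source="The scream pierced the night.",
    is_metaphorical=True,
    target_lang="zh",
    mt_outputs={"google": "…", "youdao": "…", "opus-mt": "…", "gpt-4": "…"},
    annotations={"google": {"quality": 4, "equivalence": "part",
                            "emotion": "less", "authenticity": 3}},
    gold_reference="…",
)
```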
2. The New Metrics
The most significant contribution of this paper is the definition of Equivalence. When translating a metaphor, “accuracy” is too vague. The researchers broke it down into specific categories that describe exactly how the metaphor made the journey across languages.
Table 2 below provides examples of these categories. It is essential to understand these distinctions to grasp the results of the study.

Let’s break these down:
- Full-Equivalence: The Holy Grail. The translation uses the same literal image and conveys the same contextual meaning.
- Example: “The White House sits on Pennsylvania Avenue.” (In Chinese, “sits” is used exactly the same way to mean “is located.”)
- Part-Equivalence: The translation changes the image but keeps the metaphor alive. The literal meaning differs, but the figurative meaning is preserved.
- Example: “Wallow in your success.” (Translated to “Immerse” in Chinese. It’s still a metaphor involving liquid/depth, but “wallow” and “immerse” are different literal actions.)
- Non-Equivalence: The translation strips away the metaphor entirely and just states the plain meaning.
- Example: “Sales were climbing.” (Translated to “Sales increased.” Accurate? Yes. Poetic? No.)
- Mistranslation (Misunderstanding & Error): The AI fails to understand the context or translates the wrong literal word, resulting in a sentence that makes no sense.
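For anyone building an annotation tool or an automatic grader around this taxonomy, the label set might look something like the sketch below. The label names follow the descriptions above; they are my shorthand, not necessarily the exact strings used in the released corpus.

```python
from enum import Enum

class Equivalence(Enum):
    """Degree to which a metaphor survives translation (illustrative labels)."""
    FULL = "full"    # same image, same meaning: "sits on Pennsylvania Avenue"
    PART = "part"    # different image, metaphor kept: "wallow" -> "immerse"
    NON = "non"      # metaphor stripped to plain meaning: "climbing" -> "increased"
    MISUNDERSTANDING = "misunderstanding"  # context misread, meaning lost
    ERROR = "error"  # wrong literal word, output is nonsense

def is_metaphor_preserved(label: Equivalence) -> bool:
    """Only the first two categories keep the figurative flavor alive."""
    return label in (Equivalence.FULL, Equivalence.PART)
```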
3. The Human Touch
To ensure these metrics were applied correctly, the researchers employed a rigorous annotation interface. As shown in Figure 8, annotators could compare multiple machine outputs against a “Full ref” (Full Equivalence Reference) and a “Non ref” (Non-Metaphorical Reference). This allowed them to grade not just on whether the translation was correct, but on style and emotion.

Experiments & Results: How Smart are the Machines?
Once the framework was established, the researchers analyzed how current translation models perform. The results reveal a stark contrast between how AI handles literal text versus metaphorical text.
1. The Difficulty Gap
The first major finding is that metaphors are significantly harder for AI than literal text.
Take a look at Figure 6 below. The orange line represents Literal translations (sentences without metaphors), while the other lines represent different types of metaphorical translations.

Notice how the literal translations (Orange line, Right Chart) consistently score near the top (around 4.6/5.0) across all quality metrics like Fluency and Intelligibility.
In contrast, look at the Left Chart. The Green line (Non-Equivalence) and Red line (Part-Equivalence) dip significantly lower. This tells us that when a machine fails to achieve Full Equivalence—when it has to paraphrase or switch metaphors—the overall quality of the sentence drops. It becomes less fluent and less authentic to native speakers.
2. The Tendency to Strip Metaphors
How often do machines actually manage to preserve the metaphor? Figure 3 breaks down the distribution.

In the Left Pie Chart (Metaphorical), we see that Full Equivalence (the dark blue slice) happens about 56.6% of the time. This means that nearly half the time, the machine is either stripping the metaphor (Non-equi, 19.6%), changing it (Part-equi, 12.3%), or failing completely (Error/Mis, ~11.6%).
This roughly 20% “Non-Equivalence” rate is problematic for creative writing. If you write a novel full of colorful imagery and the AI translates it into a dry police report, the information is there, but the art is gone.
3. The Emotional Cost
Why does it matter if the art is gone? Because metaphors carry emotion. The researchers measured the “emotional load” of translations and correlated it with their Equivalence metric.
Figure 5 presents a heatmap of this correlation.

The darker the blue, the higher the correlation.
- Look at the “Non Equi-” row on the left. It has a high correlation with “Less” emotion (column 2). This confirms that when you strip the metaphor (Non-Equivalence), you lose the emotional intensity.
- Conversely, Full Equivalence tends to preserve the “Same” amount of emotion.
For example, translating “She swallowed her words” to “She didn’t speak” removes the feeling of physical resistance and reluctance found in the original.
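A heatmap like Figure 5 boils down to a cross-tabulation of two annotation columns. The sketch below uses pandas on invented toy labels (not the MMTE data) to show how the co-occurrence table behind such a plot can be computed.

```python
import pandas as pd

# Toy annotations: one (equivalence, emotion-change) pair per translated sentence.
# These values are invented for illustration, not taken from the MMTE corpus.
labels = pd.DataFrame({
    "equivalence": ["full", "full", "part", "non", "non", "non", "full", "part"],
    "emotion":     ["same", "same", "same", "less", "less", "same", "more", "less"],
})

# Row-normalized co-occurrence: each cell estimates P(emotion | equivalence label),
# which is what a correlation-style heatmap of the two annotations visualizes.
heatmap = pd.crosstab(labels["equivalence"], labels["emotion"], normalize="index")
print(heatmap.round(2))
```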
4. The Language Barrier: Typology Matters
The study also compared English-to-Chinese (EN-ZH) vs. English-to-Italian (EN-IT).
- Italian and English are typologically close (both Indo-European).
- Chinese and English are typologically distant.

As shown in Figure 7, the Blue bars (Italian) generally outperform the Orange bars (Chinese) across the board for metaphorical content. This confirms that the greater the linguistic and cultural distance, the harder it is for AI to map metaphors correctly. It’s not just about vocabulary; it’s about how different cultures view the world.
5. Can AI Grade AI?
Finally, given that human annotation is expensive and slow, the researchers asked: Can a model like GPT-4 do this grading for us?
They compared GPT-4’s ratings against human labels.

Table 4 shows that GPT-4 (and even GPT-3.5) has a remarkably high accuracy rate (86%+) in identifying Full Equivalence. This suggests that while LLMs might still struggle to generate perfect translations in all contexts, they possess the semantic understanding to evaluate them. This opens the door for automated evaluation pipelines that are far more sophisticated than the word-matching metrics we use today.
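In practice, using an LLM as the grader means prompting it with the source sentence, the machine translation, and the label definitions, then parsing the label it returns. Here is a minimal sketch using the OpenAI Python client; the prompt wording is my own paraphrase of the task, not the prompt used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = "Full-Equivalence, Part-Equivalence, Non-Equivalence, Misunderstanding, Error"

def grade_equivalence(source: str, translation: str, target_lang: str) -> str:
    """Ask the model for one metaphorical-equivalence label (illustrative prompt)."""
    prompt = (
        "The English sentence below contains a metaphor.\n"
        f"Source: {source}\n"
        f"{target_lang} translation: {translation}\n"
        "Classify how well the metaphor is preserved. "
        f"Answer with exactly one label from: {LABELS}."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example call:
# print(grade_equivalence("Sales were climbing.", "销售额增加了。", "Chinese"))
```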
Conclusion and Implications
The MMTE paper sheds light on a critical blind spot in current AI development. As we push for “super-human” performance in NLP, we cannot settle for systems that merely convey factual information. Human communication is rich, messy, and deeply figurative.
The key takeaways from this research are:
- Metaphors are a distinct challenge. They require their own evaluation metrics because standard “fluency” scores hide the fact that the figurative meaning is often lost.
- Equivalence is the new standard. We should strive for Full Equivalence where possible, as it preserves both quality and emotion.
- Culture creates complexity. Translating between distant cultures (like English and Chinese) requires models that understand cognitive mappings, not just dictionary definitions.
By releasing the MMTE corpus and these new metrics, the authors have provided a roadmap for building translation systems that don’t just act like dictionaries, but like bards—preserving the color, wit, and soul of our languages.