Introduction

“Books are the ladder of human progress.”

When you read that sentence, you don’t imagine a wooden ladder made of hardcover novels leaning against a wall. You imagine the concept of ascension, of improvement, perhaps a person standing on a stack of books reaching for a lightbulb. Your brain effortlessly processes the metaphor. You understand that “books” (the subject, or tenor) share a quality with “ladders” (the comparison, or vehicle): they both allow you to climb higher.

However, if you feed that same sentence into a state-of-the-art text-to-image generator like Stable Diffusion or DALL-E, you will likely get a bizarre, surrealist nightmare of literal ladders made of paper.

This phenomenon is known as over-literalization. While Large Language Models (LLMs) and diffusion models have made massive strides in generating photorealistic images, they stumble significantly when faced with figurative language. They see the words, but they miss the meaning. They draw the objects mentioned, but they fail to capture the relationship between them.

Today, we are diving deep into a fascinating research paper titled “Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration.” The researchers propose a novel framework called GOME (GrOunding-based MEtaphor Binding). This method teaches AI to think less like a camera and more like a poet, ensuring that when we ask for a “blanket of snow,” we get a snowy street, not a piece of bedding lying on the road.

Figure 1: Two ways of depicting a metaphor. The left shows over-literalization (a literal blanket lying on a street), while the right shows attribute blending (snow that looks like a blanket).

As shown in Figure 1 above, the difference is stark. On the left, a standard model hears “blanket of snow” and literally puts a blanket on the street. On the right, the GOME method understands the grounding—the shared attribute of “covering” or “pervasiveness”—and renders a street thick with snow.

The Core Problem: Why Metaphors Break AI

To understand the solution, we first need to dissect the problem. A metaphor generally consists of three parts:

  1. Tenor: The subject being described (e.g., “Snow”).
  2. Vehicle: The object used for comparison (e.g., “Blanket”).
  3. Grounding: The underlying quality shared between the two (e.g., “Pervasive,” “Encompassing,” or “Warmth”).

Current image generation pipelines usually involve a user typing a prompt, an encoder turning that text into numbers (vectors), and a diffusion model turning noise into an image based on those numbers.
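
For orientation, here is what that standard pipeline looks like in code. This is a minimal sketch using the Hugging Face diffusers library with a Stable Diffusion checkpoint chosen purely for illustration; it is the literal baseline that GOME improves on, not part of GOME itself.

```python
# Minimal sketch of the standard prompt -> text encoder -> diffusion pipeline,
# using an off-the-shelf Stable Diffusion checkpoint (illustrative choice only).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is tokenized, encoded into vectors by a CLIP text encoder, and the
# U-Net then denoises random latents conditioned on those vectors.
image = pipe("a blanket of snow over a quiet street").images[0]
image.save("literal_baseline.png")  # expect over-literalization here
```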

The issue arises because standard models are trained on literal captions. If the training data contains the word “shark,” it is almost always associated with a fish with sharp teeth. If you say “My lawyer is a shark,” the model’s statistical associations produce a fish in a suit. It fails to extract the grounding—that the lawyer is aggressive or fierce—and instead binds the visual attributes of the vehicle (the animal) to the image.

This leads to two main technical failures:

  1. Over-literalization: The model lavishes detail on the vehicle (drawing a literal shark).
  2. Failed attribute binding: The model struggles to attach the abstract attribute (aggressive) to the correct object (the lawyer).

Introducing GOME: A Two-Step Revolution

The researchers behind GOME propose a solution that intervenes in two distinct stages of the generation process: the Elaboration Stage (using an LLM to rewrite the prompt) and the Visualization Stage (using a mathematical trick called “Metaphor Binding” during image generation).

Figure 3: The overall workflow of the GOME method.

Figure 3 gives us the high-level roadmap. First, the metaphor goes through an LLM to be expanded into a descriptive prompt. Second, that prompt is analyzed for syntax. Finally, the diffusion model generates the image, guided by a special “binding” loss function that aligns the visual attention with the linguistic meaning.

Let’s break these down step-by-step.

Step 1: Visual Elaboration with Chain-of-Thought

You cannot simply ask a diffusion model to “draw a metaphor.” You need to translate the metaphor into a visual scene that represents the feeling of the metaphor without necessarily drawing the literal objects.

The authors employ GPT-4 with a specific Chain-of-Thought (CoT) prompting strategy. They treat the LLM as a rhetoric expert. The system role is instructed to identify the Tenor, Vehicle, and Grounding, and then generate a “Visual Elaboration.”

Figure 2: Detailed flowchart of the LLM elaboration process.

Look at the example in Figure 2.

  • Input: “My lawyer is a shark.”
  • Analysis:
      • Tenor to include: Lawyer.
      • Vehicle to exclude: Shark.
      • Grounding: Aggressive and fierce.
  • Visual Elaboration: “A stern lawyer, documents scattered fiercely, arguing intensely in a courtroom.”

Notice what happened? The word “shark” is gone. It has been transmuted into “fiercely” and “stern.” This effectively bypasses the risk of the diffusion model drawing a fish. The LLM acts as a translator, converting abstract figurative language into concrete, literal scene descriptions that a diffusion model can actually handle.
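
A rough sketch of what this elaboration step could look like with the OpenAI API is shown below. The system prompt wording and the output format are my own illustrative assumptions; the paper’s actual Chain-of-Thought prompt is more detailed.

```python
# Hedged sketch of the elaboration stage; the system prompt below is an
# illustrative approximation, not the paper's exact Chain-of-Thought prompt.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a rhetoric expert. Given a metaphor, first identify the tenor, "
    "the vehicle, and the grounding (their shared attribute). Then write a "
    "'Visual Elaboration': a literal scene description that keeps the tenor, "
    "excludes the vehicle, and expresses the grounding as concrete visual details."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "My lawyer is a shark."},
    ],
)
print(response.choices[0].message.content)
# Expected shape: tenor = lawyer, vehicle = shark (excluded),
# grounding = aggressive/fierce, plus a literal scene description.
```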

Step 2: Cross-Domain Linguistic Binding

The second step is where the true innovation lies. Even with a good description, diffusion models sometimes mix up attributes. If the prompt is “A stern lawyer with scattered documents,” the model might make the documents look stern or the lawyer look scattered. This is the binding problem.

In metaphors, this is even harder because the attributes come from a different domain (the source domain of “sharks”) and need to apply to the target domain (“lawyers”).

To fix this, the researchers developed Inference-Time Metaphor Binding. They hook into the Cross-Attention maps of the diffusion model.

What are Attention Maps?

Inside a diffusion model (specifically the U-Net architecture), there are layers that look at the text prompt. For every pixel (or latent patch) in the image, the model asks: “How much should I care about the word ’lawyer’ right now? How much for ‘fierce’?” This creates a 2D map for every word.

If the model is working correctly, the attention map for “fierce” should overlap heavily with the attention map for “lawyer.”
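
A toy example makes this concrete. The sketch below computes a single cross-attention map from random tensors; the shapes and the token index are made up, but the mechanics (a softmax over query-key products, one column per word) are the same idea used inside the U-Net.

```python
# Toy illustration of cross-attention: one 2D map per prompt token.
# Shapes and the token index are arbitrary; real U-Nets do this at several
# resolutions and in many layers.
import torch
import torch.nn.functional as F

num_patches, num_tokens, dim = 64 * 64, 77, 320   # latent patches, prompt tokens, channels
Q = torch.randn(num_patches, dim)                  # queries come from the image latents
K = torch.randn(num_tokens, dim)                   # keys come from the text embeddings

# attn[i, j] = how much latent patch i "looks at" prompt token j
attn = F.softmax(Q @ K.T / dim ** 0.5, dim=-1)     # shape: (num_patches, num_tokens)

# The column for a single token, reshaped to 2D, is that word's attention map.
lawyer_token_index = 3                             # hypothetical position of "lawyer"
A_lawyer = attn[:, lawyer_token_index].reshape(64, 64)
```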

The Syntax Parser

First, GOME analyzes the enhanced prompt to find pairs of objects and their attributes.

Equation 1: The set of metaphor binding pairs.

As shown in Equation 1, the system uses a dependency parser to create a set, \(S_{MB}\), containing pairs of objects (\(o\)) and attributes (\(a\)). For example: (lawyer, fierce).
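
In practice this can be done with an off-the-shelf dependency parser. The sketch below uses spaCy and a single adjectival-modifier rule as an illustration; the paper’s parser and extraction rules may differ.

```python
# Hedged sketch of extracting (object, attribute) pairs with a dependency
# parser; spaCy and the single "amod" rule are illustrative assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")
prompt = "A stern lawyer, documents scattered fiercely, arguing intensely in a courtroom."

pairs = set()
for token in nlp(prompt):
    # adjectival modifier of a noun: "stern lawyer" -> (lawyer, stern)
    if token.dep_ == "amod" and token.head.pos_ in {"NOUN", "PROPN"}:
        pairs.add((token.head.text, token.text))

print(pairs)  # e.g. {('lawyer', 'stern')}
```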

The Loss Function

Now, the researchers introduce a way to force the model to respect these pairs during the image generation process. They define a “Loss Function”—a mathematical way of telling the model “you are doing this wrong, fix it.”

They use two types of loss:

1. Positive Loss (Attraction): We want the attention map of the object (\(A_o\)) and the attribute (\(A_a\)) to look the same.

Equation 2: Positive Loss definition.

The equation above calculates the distance (\(M_{dis}\)) between the two maps. If the maps are different, the loss is high. To calculate this “distance,” they use Kullback-Leibler (KL) Divergence, which is a standard way to measure how different two probability distributions are.

Equations 3 and 4: KL divergence definition and details.

Simply put: these equations force the model to look at the exact same spatial regions for the word “fierce” as it does for “lawyer.”

2. Negative Loss (Repulsion): We also want to ensure that unrelated words don’t overlap. If the prompt also mentions a “table,” we don’t want the “fierceness” to bleed onto the table.

Equation 9: Negative Loss definition.

This negative loss pushes the attention maps of the object-attribute pair away from the maps of unrelated words (\(U_v\)).

3. Total Loss: The final objective function combines these two, balancing the need to bind correct attributes while separating incorrect ones.

Equation 10: Total Loss calculation.
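
To make the three terms concrete, here is a minimal PyTorch sketch. It assumes a symmetric KL divergence as the distance \(M_{dis}\) and a single scalar weight to balance attraction and repulsion; both are illustrative assumptions, and the paper’s exact formulation may differ.

```python
# Hedged sketch of the binding losses, assuming symmetric KL divergence as the
# distance M_dis and one scalar weight lambda; details may differ from the paper.
import torch

def kl(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    return (p * ((p + eps) / (q + eps)).log()).sum()

def as_distribution(attn_map: torch.Tensor) -> torch.Tensor:
    flat = attn_map.flatten()
    return flat / flat.sum()          # treat the attention map as a probability distribution

def positive_loss(A_o: torch.Tensor, A_a: torch.Tensor) -> torch.Tensor:
    """Attraction: pull the object map A_o and the attribute map A_a toward each other."""
    p, q = as_distribution(A_o), as_distribution(A_a)
    return kl(p, q) + kl(q, p)        # symmetric "distance" M_dis

def negative_loss(A_o, A_a, unrelated_maps) -> torch.Tensor:
    """Repulsion: push the bound pair away from the maps of unrelated words U_v."""
    loss = torch.tensor(0.0)
    for A_u in unrelated_maps:
        # a large distance to unrelated words is good, so the distance is negated
        loss = loss - 0.5 * (positive_loss(A_o, A_u) + positive_loss(A_a, A_u))
    return loss

def total_loss(A_o, A_a, unrelated_maps, lam: float = 1.0) -> torch.Tensor:
    return positive_loss(A_o, A_a) + lam * negative_loss(A_o, A_a, unrelated_maps)
```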

Visualizing the Binding

Does this math actually change the image generation? Yes. We can see it happening in the attention maps over time.

Figure 4: Evolution of cross-attention maps during denoising.

In Figure 4, look at the “With Binding” (Left) vs. “Without Binding” (Right) columns.

  • Left (GOME): As the steps progress (Step=T), the attention maps for “empty streets,” “dimmed lights,” and “snow pervasive” become distinct and focused. The green checkmarks indicate successful binding.
  • Right (Standard): The attention maps are messy and scattered. The model isn’t quite sure which pixels correspond to “pervasive,” leading to a weaker image.

By optimizing these attention maps during the generation steps (inference time), GOME steers the noise toward a linguistically accurate image.
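
Schematically, this inference-time optimization looks like a gradient step on the latents at each denoising step, in the spirit of attention-guided generation methods. In the sketch below, `get_attention_maps` is a hypothetical hook for collecting the cross-attention maps from the U-Net, `total_loss` reuses the earlier sketch, and the simple gradient update is an assumption rather than the paper’s exact rule.

```python
# Schematic only: `get_attention_maps` is a hypothetical hook returning a
# {word: 2D attention map} dict captured during the forward pass, and the
# single gradient step is an illustrative update rule.
import torch

def binding_update(latents, t, unet, text_emb, pairs, get_attention_maps, step_size=0.1):
    latents = latents.detach().requires_grad_(True)
    unet(latents, t, encoder_hidden_states=text_emb)   # forward pass fills the attention hooks
    attn = get_attention_maps()                        # e.g. {"lawyer": 16x16 map, "fierce": ...}
    loss = sum(total_loss(attn[o], attn[a], unrelated_maps=[]) for o, a in pairs)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()       # nudged latents, then denoise as usual
```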

Experiments and Results

The theory sounds solid, but does it work? The researchers tested GOME against standard Stable Diffusion (SD), DALL-E 2, and other visual metaphor systems like HAIVM.

Qualitative Comparison: The “Cotton Candy” Test

The most compelling evidence is visual. Let’s look at the comparison in Figure 6.

Metaphor: “After 10 minutes your head becomes like spinning cotton candy.”

  • Meaning: Confusion, being overwhelmed.
  • Stable Diffusion: Draws a woman with blue and pink cotton candy hair. (Over-literalization).
  • DALL-E 2: Draws a girl wearing a wig.
  • HAIVM: Draws a cartoon with a thought bubble.
  • GOME (Ours): Draws a person holding their head in frustration with papers flying around.

Figure 6: Examples of metaphor illustration comparing GOME to other models.

GOME is the only model that captured the grounding (confusion) rather than the vehicle (sugar). Similarly, for “He was a lion on the battlefield,” GOME draws a brave soldier fighting, while Stable Diffusion draws a literal lion standing in the savannah.

Quantitative Evaluation

The researchers also ran rigorous numbers.

1. Comprehension (Fig-QA): They evaluated how well the model “understands” figurative language using the Fig-QA dataset.

Figure 5: Chart of fine-grained evaluation results across metaphor types.

Figure 5 shows that GOME (the pink dotted line) consistently outperforms other models across social, cultural, and visual metaphors. It is significantly better than GPT-2 and competitive with specialized models.

Table 1: Zero-shot and fine-tuned evaluation results.

Table 1 confirms this numerical superiority. GOME achieves higher accuracy in both zero-shot and supervised settings compared to baselines.

2. Retrieval Tasks: They also performed a “Retrieval” test. They generated images using GOME, then used a vision-language model (BLIP) to see if it could match the image back to the original metaphor.

Table 2: Comparative report on image-text retrieval.

Table 2 reveals an interesting nuance. GOME outperforms GPT-3.5 and other baselines in image retrieval (IR), i.e., finding the right image for a given metaphor. Interestingly, human experts are still slightly better at the reverse task (retrieving the metaphor from the image), likely because humans are hyper-specific about visual details. However, GOME dominates in Grounding Retrieval, meaning the images it generates are much better at conveying the underlying meaning of the metaphor than those of any other automated method.
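
To illustrate what such a retrieval check looks like in practice, here is a hedged sketch using the BLIP image-text matching head from Hugging Face transformers. The checkpoint name, file names, and the way scores are read out are assumptions for illustration, not the paper’s actual evaluation code.

```python
# Hedged sketch of image-to-metaphor retrieval with BLIP's image-text matching
# head; checkpoint and scoring details are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

image = Image.open("gome_output.png")               # hypothetical generated image
candidates = [
    "My lawyer is a shark.",
    "A blanket of snow covered the streets.",
    "He was a lion on the battlefield.",
]

scores = []
for text in candidates:
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        itm_logits = model(**inputs).itm_score      # shape (1, 2): [no-match, match]
    scores.append(torch.softmax(itm_logits, dim=-1)[0, 1].item())

print(candidates[scores.index(max(scores))])        # best-matching metaphor
```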

Conclusion and Implications

The GOME paper represents a significant step forward in Multimodal AI. It highlights a critical limitation in current generative models: they are great at nouns, but terrible at nuance.

By splitting the problem into conceptual elaboration (using LLMs to extract meaning) and linguistic binding (using math to force attention), GOME effectively teaches the model to ignore the literal definition of a word in favor of its metaphorical intent.

This has broad implications beyond just making pretty pictures of poems. It touches on:

  • Advertising: Generating creative visual metaphors for products.
  • Education: Visualizing abstract concepts for students.
  • AI Alignment: Ensuring that AI interprets human language with the cultural and rhetorical depth we intend, rather than a robotic, literal surface reading.

The next time you say you’re “drowning in work,” hopefully, the AI of the future won’t draw you in a swimming pool, but instead visualize the overwhelming pressure of your tasks—thanks to techniques like GOME.