When a friend winks at you while saying, “I’m definitely going to stick to my diet today,” you immediately understand that they likely mean the opposite. You didn’t just process the text (the sentence); you integrated the visual cue (the wink) to resolve the ambiguity of their statement.
This ability is known as multimodal literacy. It is the human capacity to actively combine information from different sources—text, images, gestures—to form a complete reasoning process. We do this intuitively when we look at a textbook illustration to understand a complex paragraph or when we read a caption to make sense of an abstract photo.
But can Artificial Intelligence do this?
Current Multimodal Large Language Models (MLLMs), like GPT-4V or LLaVA, are impressive, but they often operate under a simplified assumption: that the image is always perfectly relevant and straightforward. They lack the “active” reasoning required to resolve conflicts or ambiguities between what they see and what they read.
In this post, we are diving deep into a fascinating research paper titled “Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!” by researchers at Yonsei University. They propose a novel way to test this capability using one of the most notoriously difficult linguistic constructs: the pun.
The Problem: Passive vs. Active Understanding
Most benchmarks for testing Vision-Language Models (VLMs) involve direct description. You show the model a picture of a cat and ask, “What is this?” The model answers, “A cat.”
However, real-world communication is rarely that direct. We often face lexical ambiguity, where words have multiple meanings. To figure out which meaning is intended, we look for context. If the text is ambiguous, we look at the image.
The researchers argue that current models lack the capacity for active understanding. They wanted to know: If a piece of text is confusing on its own, can a machine use a visual cue to figure it out?
Why Puns?
Puns are the perfect laboratory for this experiment because they are intrinsically ambiguous. A pun forces a single phrase or word to hold two interpretations simultaneously. To “get” the joke, you must hold both meanings in your head and find the connection.
Often, a visual aid is what makes a pun “click.”

As shown in Figure 1 above, visual cues provide instant insight. On the left, “leak under the sink” plays on both the water leak and the vegetable (leek). The image clarifies the humor immediately. If an AI can look at the image and understand why the text is funny, it demonstrates that it is effectively combining modalities to resolve ambiguity.
Introducing UNPIE: A Benchmark for Multimodal Literacy
To test this, the authors created UNPIE (Understanding Pun with Image Explanations). This is a comprehensive dataset and benchmark designed to assess whether machines can resolve lexical ambiguities using visual inputs.
The dataset consists of 1,000 puns, but the construction of this dataset was a complex engineering feat in itself. The researchers didn’t just scrape the web; they built a controlled environment for testing.
1. The Taxonomy of Puns
First, we need to understand the linguistic structure of the data. The researchers utilized an existing text-only pun dataset (SemEval 2017) and categorized the puns into two types:
- Homographic Puns: These rely on words that are spelled and pronounced the same but have different meanings (e.g., “tire” as in a wheel vs. “tire” as in fatigue).
- Heterographic Puns: These rely on words that sound similar but are spelled differently (e.g., “prophet” vs. “profit”).

Figure 3 illustrates this distinction. On the left (Homographic), “pop” refers to both a father and the sound of a balloon. On the right (Heterographic), “sundays” plays on the day of the week and the ice cream treat “sundae.”
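To make the two categories concrete, here is a minimal sketch of how entries of each type might be represented in code. The field names and the heterographic example sentence are assumptions for illustration only; they are not the schema or text released with UNPIE.

```python
# Illustrative only: field names and the heterographic sentence are invented
# for this sketch; they are not the dataset's actual schema or contents.
from dataclasses import dataclass


@dataclass
class PunEntry:
    pun_type: str    # "homographic" or "heterographic"
    sentence: str    # the English pun
    pun_phrase: str  # the span carrying the double meaning
    sense_1: str     # first reading of the pun phrase
    sense_2: str     # second reading of the pun phrase


examples = [
    PunEntry(
        pun_type="homographic",
        sentence="Success comes in cans, failure comes in can'ts.",
        pun_phrase="cans",
        sense_1="can: a metal container",
        sense_2="can: to be able to",
    ),
    PunEntry(
        pun_type="heterographic",
        sentence="Ice cream shops love sundays.",  # invented illustrative sentence
        pun_phrase="sundays",
        sense_1="Sunday: the day of the week",
        sense_2="sundae: an ice cream dessert (similar sound, different spelling)",
    ),
]
```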
2. Generating Visual Explanations
Finding natural images that explain both meanings of a pun simultaneously is incredibly difficult. Most images on the web only depict one side of the story. To solve this, the researchers used DALL-E 3 with a human-in-the-loop approach.
They couldn’t just type “make a funny image.” They had to guide the image generator to include elements of both interpretations without giving the joke away via text.

As seen in Figure 4, this was an iterative process. The first attempt might just show a literal interpretation. The researcher would then prompt the model to add the second meaning. If the model accidentally wrote the punchline on the image (which would be cheating for the AI being tested), the researcher instructed it to remove the text.
The result is a set of Pun Explanation Images that encapsulate the ambiguity visually.
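Here is a minimal sketch of what that human-in-the-loop loop could look like. The `generate_image` function is a placeholder for the actual DALL-E 3 API call, and the prompt wording and feedback categories are illustrative assumptions, not the paper's exact protocol.

```python
# A rough sketch of the human-in-the-loop image generation loop described above.
# `generate_image` stands in for a real image-generation call (e.g., DALL-E 3).

def generate_image(prompt: str) -> str:
    """Placeholder for an image-generation API call; returns an image path/URL."""
    raise NotImplementedError


def build_explanation_image(sense_1: str, sense_2: str, max_rounds: int = 5) -> str:
    prompt = f"An image that depicts both '{sense_1}' and '{sense_2}' in one scene."
    for _ in range(max_rounds):
        image = generate_image(prompt)
        # A human annotator inspects the candidate and gives structured feedback.
        feedback = input("Feedback ('ok', 'missing second meaning', 'text visible'): ")
        if feedback == "ok":
            return image  # both senses depicted, no written punchline in the image
        if feedback == "missing second meaning":
            prompt += f" Make sure '{sense_2}' is also clearly visible."
        if feedback == "text visible":
            prompt += " Do not render any words or letters in the image."
    raise RuntimeError("No acceptable image after several rounds of revision.")
```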
3. The Multilingual Twist
To objectively measure if a model “understands” the pun, the researchers added a translation layer. They translated the English puns into German, French, and Korean.
Crucially, the translations remove the pun. If you translate “Success comes in cans” into German, you lose the play on words between “can” (the auxiliary verb) and “can” (the metal container). The translation becomes a literal, unambiguous sentence. This “sanitized” translation serves as a control variable for the experiments.
The Three Challenges
With the UNPIE dataset ready, the researchers established three distinct tasks to test multimodal literacy. These tasks range from simple identification to complex reconstruction.

Task 1: Pun Grounding (Identification)
- The Goal: Find the pun.
- The Input: An English sentence and a Pun Explanation Image.
- The Challenge: The model must identify which specific phrase in the sentence creates the pun.
In Figure 2 (left side), the model reads “Success comes in cans, failure comes in can’ts.” It sees an image of a tin can labeled “Yes I Can.” The model must identify that the word “cans” is the pivot point of the joke.
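A natural way to score this task is exact match between the predicted pun span and the gold span. The normalization and metric below are assumptions for illustration; the paper may evaluate grounding differently.

```python
# A minimal sketch of exact-match scoring for pun grounding (assumed metric).

def normalize(span: str) -> str:
    # Lowercase and strip surrounding whitespace and punctuation.
    return span.strip().lower().strip(".,!?'\"")


def grounding_accuracy(predictions: list[str], gold_spans: list[str]) -> float:
    correct = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, gold_spans)
    )
    return correct / len(gold_spans)


# Example: the model should point to "cans" as the pivot of the joke.
print(grounding_accuracy(["cans"], ["cans"]))  # 1.0
```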
Task 2: Pun Disambiguation (Interpretation)
- The Goal: Choose the correct meaning based on a picture.
- The Input: An English pun and a Disambiguator Image.
- The Challenge: Unlike the “Explanation” image which shows both meanings, the “Disambiguator” image shows only one specific interpretation. The model must translate the sentence into a target language (like German) in a way that aligns only with the image provided.
For example (Figure 2, right side), if the text says “We can do it,” and the image shows a beverage can, the model should translate it referring to the object. If the image shows Rosie the Riveter, it should translate it as the verb “to be able.”
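To illustrate, here is a sketch of how a Socratic-style prompt for this task might be assembled, assuming the image has already been captioned by a vision model. The prompt wording is my own, not the paper's exact template.

```python
# Illustrative prompt construction for pun disambiguation (assumed wording).

def disambiguation_prompt(pun: str, image_caption: str, target_lang: str) -> str:
    return (
        f"The sentence below contains an ambiguous word.\n"
        f"Sentence: {pun}\n"
        f"Image description: {image_caption}\n"
        f"Translate the sentence into {target_lang}, choosing the meaning of the "
        f"ambiguous word that matches the image description."
    )


prompt = disambiguation_prompt(
    pun="We can do it.",
    image_caption="A photo of an aluminum beverage can.",
    target_lang="German",
)
# Expected behavior: the translation should refer to the container
# (e.g., German "Dose"), not the modal verb "können".
```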
Task 3: Pun Reconstruction (The Ultimate Test)
- The Goal: Recreate the joke.
- The Input: A translated (unambiguous) sentence + a Pun Explanation Image.
- The Challenge: This mimics a real-world scenario where you have partial information. The model receives a German sentence that literally means “Success arrives in tin containers.” It also sees the image of the “Yes I Can” tin.
It must combine these two inputs to reconstruct the original English sentence: “Success comes in cans.” It has to infer the humor that was lost in translation by looking at the picture.
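Here is a minimal sketch of the reconstruction setup as I understand it: the model sees only the unambiguous translation plus the explanation image (represented here by its caption) and must recover the English pun. The German wording and the scoring rule (does the gold pun word reappear in the output?) are assumptions for illustration.

```python
# Illustrative prompt and check for pun reconstruction (assumed setup).

def reconstruction_prompt(translation: str, image_caption: str) -> str:
    return (
        f"A joke was translated literally and lost its wordplay.\n"
        f"Literal translation: {translation}\n"
        f"Image description: {image_caption}\n"
        f"Write the original English sentence so that the pun is restored."
    )


def pun_recovered(output: str, pun_word: str) -> bool:
    # Crude success check: the gold pun word appears in the reconstruction.
    return pun_word.lower() in output.lower()


prompt = reconstruction_prompt(
    translation="Erfolg kommt in Dosen.",  # my illustrative rendering of the literal German
    image_caption="A tin can labeled 'Yes I Can'.",
)
print(pun_recovered("Success comes in cans.", "cans"))  # True
```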
The Contenders: Models Under the Microscope
The researchers tested two main categories of AI models:
- Socratic Models (SM): These are modular systems (or pipelines). They use a vision model (like BLIP-2) to caption the image into text, and then feed that caption along with the original text into a Large Language Model (like GPT-4 or Vicuna). They “talk” to themselves to solve the problem (a sketch of this pipeline appears below).
- Visual-Language Models (VLM): These are monolithic models trained to process images and text together directly (e.g., LLaVA, Qwen-VL).
They also created a variant called LLaVA-MMT, which was fine-tuned on a standard Multimodal Machine Translation dataset (Multi30k), to see if standard training helps with this creative task.
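To make the Socratic setup concrete, here is a minimal sketch of a caption-then-reason pipeline. The `caption_image` and `ask_llm` functions are placeholders for a vision captioner (e.g., BLIP-2) and a text-only LLM (e.g., GPT-4 or Vicuna); the prompt wording is illustrative, not the paper's exact template.

```python
# A minimal sketch of a Socratic Model pipeline: caption the image with a
# vision model, then hand the caption plus the pun to a text-only LLM.

def caption_image(image_path: str) -> str:
    """Placeholder for a vision captioning model such as BLIP-2."""
    raise NotImplementedError


def ask_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM such as GPT-4 or Vicuna."""
    raise NotImplementedError


def socratic_pun_grounding(pun_sentence: str, image_path: str) -> str:
    caption = caption_image(image_path)
    prompt = (
        f"Sentence: {pun_sentence}\n"
        f"Image description: {caption}\n"
        f"Which word or phrase in the sentence creates the pun? "
        f"Answer with the phrase only."
    )
    return ask_llm(prompt)
```

Because the LLM never sees pixels, the quality of the caption becomes the bottleneck, which is exactly the trade-off the experiments probe.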
Experimental Results: Do Images Actually Help?
The short answer is: Yes. Across the board, adding visual context improved the models’ ability to handle ambiguity. However, the nuance in the results is fascinating.
1. Results on Pun Grounding
In the simplest task—finding the pun word—visuals provided a boost, but strong text-only models were already quite good at it.

As Table 4 shows, adding vision (V+L) consistently improved performance (denoted by the green arrows) compared to text-only inputs (L).
- GPT-4 is incredibly smart. Even without images, it scored 95.4% on homographic puns. With images, it crept up to 96.0%.
- Qwen-VL saw a massive jump. It struggled with text alone (43.8%) but shot up to 63.6% when allowed to see the image. This suggests that smaller models rely much more heavily on visual cues to “get” the context than massive models like GPT-4.
2. Results on Pun Disambiguation
This task required the model to pick a specific translation based on a specific image. This is a direct test of whether the model pays attention to the image to resolve text confusion.

Table 5 (above) confirms the hypothesis.
- Visual Context is Key: Every single model performed better when it had access to the image.
- Text Comprehension vs. Visual Understanding: Interestingly, GPT-4 (SM)—which just reads a text caption of the image—outperformed the native VLMs (like LLaVA). This implies that “understanding the pun” is still heavily a linguistic reasoning task. As long as the caption is good (“An image of a tin can”), the massive brain of GPT-4 handles the logic better than LLaVA’s native visual processing.
3. Results on Pun Reconstruction
This was the hardest task: taking a boring translation and a funny picture to recreate the original English pun.

The results in Table 6 are telling:
- Massive Gains from Vision: Look at the “SM GPT-4” row. For German-to-English (De->En), the accuracy jumps from 43.1% (text only) to 62.9% (with vision). Without the image, the model recovers the pun less than half the time.
- The “Alignment Tax”: Look at LLaVA-MMT. This is the model fine-tuned on standard machine translation data. It performed worse than the standard LLaVA model.
- Why? Standard datasets (like Multi30k) are usually literal descriptions (“A dog runs on grass”). Fine-tuning on this data teaches the model to be boring and literal. It loses the “creative” spark needed to identify and reconstruct a pun. This phenomenon is often called the “alignment tax”—improving on standard benchmarks can degrade performance on complex reasoning tasks.
- Language Distance: The benefits of visual cues were highest for Korean-to-English translation. Since Korean is linguistically very distant from English (compared to German or French), the text translation alone preserves less of the original structure. The model needed the image to bridge that gap.
Conclusion: The Future of Multimodal Literacy
The UNPIE benchmark demonstrates that machines are beginning to develop a form of multimodal literacy. They aren’t just processing text and images in parallel; they are using images to fix gaps in their textual understanding.
However, the gap between “identifying” a pun and “reconstructing” it remains large. While models like GPT-4 are closing in on human-like reasoning, they often rely on textual descriptions of images rather than deep visual understanding.
Key Takeaways for Students
- Ambiguity is a Feature, Not a Bug: In advanced AI research, we don’t just want models that answer simple questions. We want models that can resolve confusion. Puns are a brilliant way to test this.
- Visual Context Matters: Even for tasks that seem text-heavy (like translation), seeing the context can drastically change the output.
- Data Quality > Data Quantity: The failure of LLaVA-MMT shows that training on the wrong kind of multimodal data (literal vs. creative) can actually make a model dumber at complex tasks.
The next time you see a meme or a visual pun, take a moment to appreciate the complex cognitive gymnastics your brain is performing. It’s a skill that machines are only just beginning to learn.