Can AI Understand the Joke? Evaluating Satire Comprehension in Vision-Language Models with the YesBut Dataset

Artificial Intelligence has made massive strides in seeing and describing the world. Modern Vision-Language (VL) models can look at a photo of a kitchen and list the ingredients on the counter, or look at a street scene and describe the traffic. But can they understand humor? Specifically, can they grasp the biting irony of satire?

Satire is a complex cognitive task. It doesn’t just require recognizing objects; it demands an understanding of social norms, human behavior, and the often contradictory relationship between expectation and reality.

In a recent paper titled “YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models,” researchers set out to answer this question. They constructed a unique dataset based on the popular “Yes, But” comic format to test whether state-of-the-art models like GPT-4, Gemini, and LLaVA can actually “get” the joke.

The results offer a fascinating—and humbling—look at the current limitations of AI reasoning.

The Problem: Why Satire is Hard for AI

To a human, satire is often intuitive. We see a juxtaposition of two images and immediately understand the commentary. However, for an AI model, this requires several leaps of logic:

  1. Recognition: Identifying the objects in the image.
  2. Relation: Understanding how objects relate to each other.
  3. Context: Applying commonsense knowledge about the world.
  4. Contradiction: Recognizing that the scenario presents an ironic twist or a conflict that defies normal expectations.

Most existing datasets for Vision-Language models focus on literal descriptions (e.g., “a dog sitting on a mat”). Even datasets that focus on memes often rely heavily on text overlay. The researchers identified a gap: there was no comprehensive benchmark for detecting and understanding satire where the humor is primarily visual and relies on conflicting scenarios.

Figure 1: Satire conveyed through a social media image

Consider the image above (Figure 1). On the left (“YES”), we see a heartfelt text message: “Wish you were here.” On the right (“BUT”), we see the reality: the person sending the message is sitting on a toilet.

To understand this, a model must read the text, recognize the setting (a bathroom), and then synthesize these two distinct pieces of information to understand the irony: a romantic sentiment is being sent from a very unromantic location. This is the core challenge the YesBut dataset presents.

The YesBut Dataset

To evaluate models rigorously, the authors created YesBut, a dataset containing 2,547 images. The dataset is built around the “Yes, But” format, where two panels are presented side-by-side. The left panel typically depicts a normal, expected, or idealistic scenario (“Yes”), while the right panel reveals a contradictory, realistic, or humorous twist (“But”).

How Does YesBut Compare?

The researchers compared YesBut to existing meme and humor datasets. The key differentiator is the reliance on visual storytelling over text.

Table 1: Statistics on the presence/absence of text, sub-images, and multiple image styles, and on the tasks evaluated in prior datasets vs. YesBut.

As shown in Table 1, over 50% of the images in YesBut contain no text at all (ignoring the “YES” and “BUT” headers). This forces the model to rely on visual cues rather than taking shortcuts by reading a caption. Furthermore, 100% of the images contain sub-images, requiring the model to analyze the relationship between two distinct panels.
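
Since every YesBut image is a side-by-side pair, a natural first preprocessing step is to separate the two panels. Below is a minimal sketch using Pillow, assuming the panels simply meet at the horizontal midpoint (the paper does not describe a splitting procedure):

```python
from PIL import Image

def split_panels(path: str) -> tuple[Image.Image, Image.Image]:
    """Split a side-by-side "Yes, But" image into its two panels.

    Assumes the panels meet at the horizontal midpoint; real images
    may need a smarter divider search (e.g., locating a vertical rule).
    """
    img = Image.open(path)
    w, h = img.size
    yes_panel = img.crop((0, 0, w // 2, h))   # left: the "YES" scenario
    but_panel = img.crop((w // 2, 0, w, h))   # right: the "BUT" twist
    return yes_panel, but_panel
```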

The Methodology: Building the Pipeline

The creation of the dataset was a multi-stage process designed to ensure high quality and diversity. The authors didn’t just scrape images; they built a pipeline that involved human annotation and generative AI expansion.

Figure 2: Our annotation pipeline for YesBut in 4 stages

Stage 1: Collection from Social Media

The researchers started by manually collecting 283 satirical images from the ‘X’ (formerly Twitter) handle @_yesbut_. These images are iconic, minimalist illustrations that capture modern societal contradictions.

Stage 2: Human Annotation

Understanding satire is subjective, so high-quality ground truth is essential. Five human annotators analyzed these images. They provided (see the record sketch after this list):

  • Descriptions of the left and right sub-images.
  • A “punchline” description explaining why the image is funny.
  • Difficulty ratings (Easy, Medium, Hard).
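
To make these annotations concrete, here is a minimal sketch of what one record might look like. The field names are illustrative assumptions, not the schema of the released dataset:

```python
from dataclasses import dataclass

@dataclass
class YesButAnnotation:
    """Illustrative schema for one YesBut annotation (field names assumed)."""
    image_id: str
    left_description: str   # what the "YES" panel shows
    right_description: str  # what the "BUT" panel shows
    punchline: str          # why the juxtaposition is satirical
    difficulty: str         # "Easy", "Medium", or "Hard"

example = YesButAnnotation(
    image_id="yesbut_0001",
    left_description='A heartfelt text message: "Wish you were here."',
    right_description="The sender is sitting on a toilet.",
    punchline="A romantic sentiment is sent from a decidedly unromantic location.",
    difficulty="Easy",
)
```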

Stage 3 & 4: Generative Expansion with DALL-E 3

This is where the methodology gets particularly clever. To test whether models were memorizing specific artistic styles or actually understanding the content, the researchers used DALL-E 3 to generate new versions of the original images.

Using the detailed text descriptions from Stage 2, they prompted DALL-E 3 to recreate the scenarios in two new styles:

  1. 2D Stick Figures: Simple black silhouettes.
  2. 3D Stick Figures: More dimensional representations.

They then mixed and matched these styles (e.g., a sketch on the left and a 3D stick figure on the right). This resulted in a massive expansion of the dataset and introduced a new challenge: can models understand satire even when the artistic style changes or clashes?
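
As a rough illustration of this expansion step, here is a hedged sketch using the OpenAI Python client; the prompt template and parameters are assumptions, since the authors’ exact prompts are not reproduced in this summary:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def regenerate_panel(description: str, style: str) -> str:
    """Recreate an annotated scenario in a new artistic style.

    `style` might be "2D stick figures (simple black silhouettes)" or
    "3D stick figures"; the prompt wording below is illustrative only.
    """
    response = client.images.generate(
        model="dall-e-3",
        prompt=f"A satirical comic panel drawn as {style}: {description}",
        size="1024x1024",
        n=1,
    )
    return response.data[0].url  # URL of the generated panel
```

Mixed-style pairs can then be assembled by pasting a generated panel next to an original sketch.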

Figure 3: Distribution of the original 283 satirical images…
Figure 4: 2D UMAP Representations…

As seen in Figure 4, the embeddings (mathematical representations) of the generated images (orange and green dots) are distinct from those of the original sketches (blue dots), indicating that the generative process successfully added diversity to the dataset.
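
The diversity check behind Figure 4 can be reproduced in spirit with off-the-shelf tools: embed each image, then project the embeddings to 2D with UMAP. A sketch assuming the sentence-transformers CLIP wrapper and the umap-learn package (the paper’s exact embedding model is not specified here):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer
import umap

image_paths = ["original.png", "generated_2d.png", "generated_3d.png"]  # illustrative

# Embed the images with CLIP, then project the embeddings to 2D.
model = SentenceTransformer("clip-ViT-B-32")
images = [Image.open(p) for p in image_paths]
embeddings = model.encode(images)              # shape: (n_images, 512)

projection = umap.UMAP(n_components=2, random_state=42)
coords = projection.fit_transform(embeddings)  # shape: (n_images, 2)
# Coloring `coords` by source (original / 2D / 3D) should reveal
# distinct clusters, as in Figure 4.
```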

The Three Benchmarking Tasks

The authors proposed three distinct tasks to test the models’ capabilities, ranging from simple identification to complex reasoning.

Task 1: Satirical Image Detection

The Challenge: Given an image, the model must classify it as “Satirical” or “Non-Satirical.” This sounds simple, but it requires the model to detect an incongruity. If a model sees a pair of images that logically flow together, it should say “No.” If it sees a contradiction, it should say “Yes.”

Figure 7: Example of a Satirical Image…
Figure 8: Example of a Non-Satirical Image…

In the example above:

  • Figure 7 (Satirical): A single wet wipe is pulled out (YES), but a whole clump comes out with it (BUT). This is a relatable, humorous annoyance.
  • Figure 8 (Non-Satirical): Soccer shoes (YES), and someone playing soccer (BUT). This is a logical continuation with no irony.
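
In practice, Task 1 reduces to a yes/no query over the image. A minimal sketch of how such a zero-shot prompt might be framed and its answer parsed (the authors’ exact wording is not reproduced in this summary):

```python
DETECTION_PROMPT = (
    "This image has two panels labeled YES and BUT. "
    "Is the image satirical, i.e., does the second panel present an "
    "ironic contradiction of the first? "
    "Answer with exactly one word: Yes or No."
)

def parse_detection(answer: str) -> bool:
    """Map a free-form model reply to a binary satirical/non-satirical label."""
    return answer.strip().lower().startswith("yes")
```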

Task 2: Satirical Image Understanding

The Challenge: Given a satirical image, the model must explain why it is funny. The model is prompted with “Why is this image funny/satirical?” This is the ultimate test of comprehension, requiring the model to generate a text description of the punchline.

Task 3: Satirical Image Completion

The Challenge: The model is given one half of the image (e.g., the “Yes” panel) and two options for the second half. It must select the option that completes the satire.

Figure 9: Example of an input image for Image Completion…
Figure 10: Example of an input image for Image Completion…

In Figure 9 above, the “Yes” panel is replaced by a question mark while the “But” panel shows a cozy fireplace. The model must choose the image that fills the missing panel: Option A (a question mark, which is incorrect) or Option B (a fireplace scene), so that the two panels form a meaningful satirical pair. This task tests whether the model can predict the logical (or illogical) setup required for a joke.
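
A hedged sketch of how this two-choice query might be posed and scored; the prompt wording is an assumption:

```python
COMPLETION_PROMPT = (
    "One panel of this two-panel 'Yes, But' comic has been replaced by "
    "a question mark. You are also shown two candidate images, Option A "
    "and Option B, for the missing panel. Which option completes a "
    "satirical pair? Answer with exactly one letter: A or B."
)

def parse_choice(answer: str) -> str:
    """Map a free-form reply to 'A' or 'B' (defaults to 'A' if unclear)."""
    for token in answer.strip().upper().split():
        if token.rstrip(".") in ("A", "B"):
            return token.rstrip(".")
    return "A"
```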

Experiments and Results

The researchers tested a suite of state-of-the-art Vision-Language models:

  • Proprietary Models: Gemini Pro Vision, GPT-4 Vision.
  • Open Source Models: LLaVA, Kosmos-2, MiniGPT-4.

They tested these models in “Zero-Shot” settings, meaning the models were not trained on this specific dataset beforehand. They also tried “Chain-of-Thought” (CoT) prompting, where the model is encouraged to think step-by-step.
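
As a concrete illustration, here is a hedged sketch of querying a vision model in the two settings. The model name, prompt wording, and image-passing details are assumptions rather than the paper’s exact setup:

```python
import base64
from openai import OpenAI

client = OpenAI()

ZERO_SHOT = "Is this image satirical? Answer Yes or No."
CHAIN_OF_THOUGHT = (
    "Describe the left panel, then the right panel, then reason step by "
    "step about whether they contradict each other. Finally answer: is "
    "this image satirical, Yes or No?"
)

def ask(image_path: str, prompt: str) -> str:
    """Send one image plus a prompt to a vision-capable chat model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4 Vision model in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```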

Result 1: Detection is Surprisingly Difficult

You might expect advanced AI to easily spot a joke. However, the data suggests otherwise.

Table 3: Evaluation of different VL models on the Satirical Image Detection task

As shown in Table 3, the accuracy for detection hovers around 50% to 56%. Since this is a binary classification task (Yes/No), a random guess would yield 50% accuracy. This indicates that even powerful models like GPT-4 and LLaVA are barely better than a coin flip at determining if an image is satirical in a zero-shot setting. Interestingly, using Chain-of-Thought (CoT) prompting actually hurt performance for several models, suggesting that over-analyzing the image might confuse the models regarding visual humor.

Result 2: Understanding Visual Irony is a Struggle

When asked to explain why an image was funny, the models struggled to capture the nuance. The researchers evaluated the generated explanations using automated metrics (like BLEU and BERTScore) and human evaluation.
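
Scoring a generated punchline against the human-written one can be sketched as follows, assuming the sacrebleu and bert-score packages (the paper’s exact metric configuration may differ):

```python
import sacrebleu
from bert_score import score

references = ["A romantic sentiment is being sent from a toilet."]
candidates = ["The image is funny because the text is sent from a bathroom."]

# BLEU: n-gram overlap between the model's explanation and the reference.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore: semantic similarity via contextual embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```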

Figure 5: Evaluation of Satirical Image Understanding Capability…

Figure 5 illustrates the performance across different prompts. The red bars represent the “Why Funny” prompt.

  • Gemini generally performed the best among the models.
  • MiniGPT-4 performed significantly worse, likely due to its weaker visual grounding (it relies more on text).
  • Models generally performed better at describing the individual sub-images (blue and green bars) than they did at explaining the punchline (red bars). This confirms that models can “see” the parts but fail to understand the “whole.”

Qualitative Analysis: Where did they fail?

The most illuminating part of the study comes from looking at specific examples where models failed. The authors provided a comparison of human-written descriptions versus model predictions.

Example A: The Toilet Selfie

In this image, a woman is sitting on a toilet (BUT) but posing as if she is in a chair (YES), taking a selfie.

Figure 11: Example of a satirical image from YesBut
Figure 13: Example of a satirical image from YesBut

  • The Satire: It mocks the fake reality of social media photos.
  • The AI Failure: GPT-4 hallucinated, describing the right sub-image as “a person placing a voting ballot into a box.” It completely misidentified the visual elements, leading to a nonsensical interpretation.

Example B: The Worth of Expertise

This image satirizes how society values social media fame over academic knowledge.

Figure 12: Example of a satirical image from YesBut

  • The Satire: The man on the left has deep knowledge (full bars) in Math/Physics but one microphone. The man on the right has almost no knowledge but high “TikTok” skills and many microphones.
  • The AI Failure: Most models failed to connect the number of microphones to the concept of social worth. They could count the microphones and read the text, but couldn’t bridge the semantic gap to understand the social commentary.

Example C: The Theater Seat

In Figure 13 (shown above with the theater seating), a person has a ticket for Seat 18 (far left). Instead of entering from the left, they enter from the right (the Seat 1 side), forcing them to squeeze past everyone.

  • The Satire: It mocks inefficient human behavior and social awkwardness.
  • The AI Failure: Models completely missed the spatial reasoning. They saw empty seats and people but couldn’t understand the trajectory or the inconvenience being depicted.

Result 3: Completion and Real-World Photos

The Completion task showed slightly better results, with Gemini achieving around 61% accuracy. However, this is still far from human-level reasoning.

Table 4: Evaluation of different VL models on the Satirical Image Completion task

Finally, to ensure this difficulty wasn’t just because of the cartoon style, the researchers tested the models on 119 real-world satirical photographs (the “Yes, But” theme applied to real life).

Figure 6: Example of a real photograph following the ‘Yes,But’ Theme

For example, Figure 6 shows a fire extinguisher (YES) that is locked behind bars (BUT), rendering it useless. Even on these real photos, models performed poorly, with Understanding accuracy dropping below 50% for all models (Table 5).

Table 5: Performance of different SOTA VL Models on Satirical Detection and Understanding Tasks on real photographs

Conclusion: The “Common Sense” Gap

The YesBut paper highlights a critical frontier in Artificial Intelligence. While Vision-Language models are impressive at literal interpretation, they lack the sophisticated reasoning required to understand satire.

The discrepancies between human and machine performance were stark. As shown in the human evaluation results below, humans achieved 100% on correctness, while the best model (Gemini) only reached 43.33%.

Figure 14: Results of Human Evaluation on the Satirical Image Understanding Task

Key Takeaways:

  1. Context is King: Models struggle when meaning is derived from the relationship between objects rather than the objects themselves.
  2. Visual Reasoning Lag: The absence of text in YesBut exposes that VL models still rely heavily on reading captions rather than “thinking” visually.
  3. A New Benchmark: The YesBut dataset provides a necessary and challenging playground for researchers to improve the reasoning and “sense of humor” of future AI systems.

Until AI can look at a person sitting on a toilet sending a “Wish you were here” text and understand the irony, we can rest assured that the subtleties of human humor remain—for now—uniquely human.