Introduction

In the modern digital landscape, memes have evolved far beyond funny cat pictures or relatable reaction images. They have become a primary dialect of the internet—a sophisticated, multimodal form of communication capable of shaping public opinion, spreading culture, and even influencing election results. During the last two US presidential elections, for example, memes were weaponized in coordinated media campaigns to sway voters.

But here lies the problem: while humans can process the layers of irony, cultural reference, and visual humor in a meme almost instantly, computers struggle immensely with this task. A meme is not just an image, nor is it just text; it is the complex interplay between the two, often requiring deep external knowledge to decode.

This challenge has given rise to a new field of research known as Computational Meme Understanding (CMU).

In this post, we will break down a comprehensive survey paper by Nguyen and Ng that maps out this emerging landscape. We will explore how researchers categorize memes, the specific tasks involved in teaching machines to understand them, and the current state-of-the-art models attempting to crack the code of internet culture.

The Anatomy of a Meme: A Taxonomy

Before we can build models to understand memes, we need a rigorous way to define what they are. In the wild, memes seem chaotic and infinite, but the researchers introduce a structured taxonomy based on three dimensions: Forms, Functions, and Topics.

1. Forms: How Memes Look

Memes are not visually uniform. To a computer vision model, a screenshot of a Tweet looks very different from a classic “Impact font” image macro. The authors adopt a taxonomy from communication researcher Ryan Milner, dividing memes into two primary categories: Remixed Images and Stable Images.

Figure 1: Taxonomy of forms for memes, adapted from Milner (2012).

As shown in Figure 1, the distinction lies in how the image is treated:

  • Remixed Images: These are created through manipulation.
      • Macros: The most recognizable form, featuring a base template with a setup line at the top and a punchline at the bottom.
      • Shops: Short for “Photoshops,” where elements are graphically edited or superimposed onto a base image.
      • Stacked Images: Multiple images combined, often to tell a story or show a reaction (like the Drake hotline bling meme).
  • Stable Images: These are used as memes without graphical editing.
      • Screenshots: Captures of social media conversations or news headlines.
      • Memes IRL: Photos of memetic behavior in real life.

Understanding a meme’s form is critical because different forms require different processing strategies. A “Macro” relies heavily on OCR (Optical Character Recognition) to read the text, while a “Shop” might require a model to detect subtle visual anomalies to understand the joke.
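
As a concrete example of that first step, here is a minimal OCR sketch assuming the open-source Tesseract engine via the pytesseract wrapper; the file name is hypothetical.

```python
# Minimal OCR pass over an image-macro meme (illustrative sketch).
# Assumes Tesseract is installed locally and accessible via pytesseract.
from PIL import Image
import pytesseract

def extract_meme_text(path: str) -> str:
    """Pull the overlaid caption text out of a macro-style meme."""
    image = Image.open(path)
    # image_to_string runs Tesseract OCR; high-contrast Impact-font
    # captions are usually easy for it to read.
    return pytesseract.image_to_string(image).strip()

print(extract_meme_text("drake_macro.png"))  # hypothetical file
```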

2. Functions: What Memes Do

Memes are not passive static objects; they perform actions. Borrowing from linguistic Speech Act Theory, the authors classify memes based on their illocutionary acts—essentially, what the meme is “doing” communicatively.

Figure 3: Illocutionary acts of memes, adapted from Grundlingh (2018). The text in parentheses represents subtypes. Commissives and Acknowledgements (in gray) are illocutionary acts from speech act theory that do not apply to memes.

As illustrated in Figure 3, memes fall into specific categories of action:

  • Constatives: These express a belief or state of affairs. This includes Assertives (stating a fact or opinion), Descriptives, and Dissentives (disagreeing with a premise).
  • Directives: These attempt to get the viewer to do something, such as Advisories (giving advice, like the “Actual Advice Mallard”) or Questions.

This functional taxonomy is vital for tasks like hate speech detection. A meme that is descriptive might be harmless, but a meme functioning as an aggressive directive or stereotype could be malicious.

3. Topics: What Memes Are About

Finally, memes are organized by topic. This is the most fluid dimension, as topics shift with the news cycle. Memes can be about timeless themes (like relationships or school) or time-sensitive events (like the COVID-19 pandemic, elections, or the Russia-Ukraine crisis).

The topical dimension introduces a major hurdle for AI: Temporal Context. A meme about a political figure might be funny today but confusing—or misinformation—in three years. Models trained on old data often fail to grasp current memes because they lack the necessary world knowledge.

Key Tasks in Computational Meme Understanding

Now that we have a language to describe memes, what exactly are we asking computers to do with them? The researchers identify three distinct tasks, ranging from simple categorization to complex reasoning.

Task 1: Classification

This is currently the most popular area of research. Classification involves assigning a label to a meme.

  • Binary Classification: The most common application is content moderation. Is this meme Hateful or Not Hateful? Is it Harmful or Harmless?
  • Multi-class Classification: This involves more nuance. Models might predict the type of persuasion technique used, the specific emotion conveyed (sarcastic, humorous, offensive), or the target of the joke (e.g., attacking a religion, race, or nationality).

While “classification” sounds like a standard machine-learning problem, the multimodal nature of memes makes it difficult. A picture of a smiling person is positive. The text “I love you” is positive. But put that text over a picture of a villain, and the classification might shift to “Sarcasm” or “Threat.”
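
To make the label space concrete, here is an illustrative sketch of what a single training record might look like; the field names and label values are hypothetical rather than the schema of any particular dataset.

```python
# Illustrative record for multimodal meme classification (hypothetical schema).
from dataclasses import dataclass, field

@dataclass
class MemeRecord:
    image_path: str            # the meme image itself
    ocr_text: str              # overlaid text extracted via OCR
    hateful: bool              # binary label: Hateful vs. Not Hateful
    emotions: list[str] = field(default_factory=list)  # e.g. ["sarcastic", "humorous"]
    target: str | None = None  # e.g. "religion", "race", "nationality"

record = MemeRecord(
    image_path="villain_smile.png",  # hypothetical example
    ocr_text="I love you",
    hateful=False,
    emotions=["sarcastic"],
    target=None,
)
```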

Task 2: Interpretation

This task is significantly harder. Meme Interpretation aims to generate a textual description of the meme’s final message. This isn’t just describing the image (e.g., “a bear in a tuxedo”); it is decoding the subtext.

Figure 2: Example memes from (a, b) MemeCap (Hwang and Shwartz, 2023), (c) SemEval-2021-T6 (Dimitrov et al., 2021), and (d) ExHVV (Sharma et al., 2023).

Look at the memes in Figure 2.

  • Meme (b): A classic Winnie-the-Pooh template. A standard image captioner might see “a cartoon bear.” A Meme Interpretation model needs to output: “The meme creates a hierarchy of sophistication regarding buying video games, suggesting that waiting to get the game for free is the most ‘classy’ or intelligent option.”
  • Meme (c): This meme requires temporal knowledge. It shows Obama and Clinton looking at a smiling Trump through binoculars with the text “STILL YOUR PRESIDENT.” Interpreting this requires knowing the political tension between these figures and the context of the 2016 or 2020 election cycles.
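
To see how the interpretation task can be posed in practice, here is a minimal sketch that frames it as an open-ended question for a vision-language model. The prompt wording is an illustrative assumption, not the phrasing used by MemeCap or any surveyed system.

```python
# Hedged sketch: framing meme interpretation as a question for a VLM.
def interpretation_prompt(ocr_text: str) -> str:
    return (
        "This is a meme. The overlaid text reads:\n"
        f'"{ocr_text}"\n'
        "What is the meme poster trying to convey? "
        "Describe the intended message, not just the literal image content."
    )

# For meme (c) above, the prompt would embed the caption:
print(interpretation_prompt("STILL YOUR PRESIDENT"))
```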

Task 3: Explanation

While interpretation summarizes the what, Explanation answers the why.

This is often framed as a constrained generation task. For example, if a model flags a meme as “Hateful,” the explanation task requires the model to articulate its reasoning along three axes (a minimal schema sketch follows this list):

  • Target: Who is being attacked? (e.g., “The Democratic Party”).
  • Role: What role are they playing? (e.g., “Victim” or “Villain”).
  • Reasoning: Why? For Figure 2(d), an explanation might be: “The Democratic Party is portrayed as a victim because the meme suggests they were right about the virus being a hoax, while Trump is shown contradicting himself.” (Note: This specific interpretation depends heavily on the political bias of the annotator, highlighting the subjectivity challenge).
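
Because the output is constrained to these fields, it is natural to treat an explanation as structured data rather than free text. Below is a minimal sketch of such a schema; the field names and example values are hypothetical, not the exact format of HatReD or ExHVV.

```python
# Hypothetical structured output for the meme-explanation task.
from dataclasses import dataclass

@dataclass
class MemeExplanation:
    target: str     # who is being attacked or referenced
    role: str       # e.g. "victim", "villain", or "hero"
    reasoning: str  # natural-language justification

example = MemeExplanation(
    target="The Democratic Party",
    role="victim",
    reasoning=(
        "The meme frames the party as being blamed over statements about "
        "the virus while the opposing figure contradicts himself."
    ),
)
```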

The Data Problem

Data is the fuel for these models. The researchers surveyed 24 datasets to see what resources are available for training CMU systems.

Table 1: Existing Datasets on Computational Meme Understanding. Abbreviations: for Task, Binary classification (2C), Multi-class classification (NC), Explanation (E), and Interpretation (I); for Method, “Inherit” means the memes were taken from another dataset; for Lang. (Languages), English (E), Bengali (Be), and Tamil (T).

Table 1 reveals a significant skew in the field:

  1. Obsession with Classification: Out of 24 datasets, 21 are focused on classification (tasks 2C and NC). The field is heavily dominated by the need to detect hate speech, offensiveness, and misogyny.
  2. Scarcity of Reasoning Data: Only two datasets (HatReD and ExHVV) address Explanation (Task E), and only one (MemeCap) addresses Interpretation (Task I). This explains why models are good at flagging bad content but terrible at telling us why it’s bad.
  3. The “Form” Blindspot: Most datasets don’t control for the form of the meme (as defined in Figure 1). Some, like the famous “Hateful Memes” dataset, focus almost exclusively on Macros. If a model is trained only on Macros, it will likely fail when presented with a screenshot-style meme or a complex “Stacked” image.
  4. Language Bias: The vast majority of resources are in English, with a small representation for Bengali and Tamil.

How Models Read Memes

How do you build a brain that understands memes? The survey outlines the evolution of model architectures used in CMU.

The Standard Pipeline: Unimodal Encoders

For a long time, the standard approach was “Divide and Conquer” (sketched in code after the steps below):

  1. Extract Text: Use Optical Character Recognition (OCR) to pull the text. Pass this through a language model like BERT or RoBERTa.
  2. Extract Image: Pass the image through a vision encoder like ResNet or ViT (Vision Transformer) to get a vector representation of the visuals.
  3. Fusion: Concatenate (join) these two vectors and feed them into a classifier.
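
Here is a minimal sketch of that pipeline, assuming off-the-shelf Hugging Face checkpoints for the two encoders; the model names are common defaults rather than the specific choices of any surveyed system.

```python
# "Divide and conquer" baseline: encode text and image separately, then fuse.
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, ViTModel

text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

classifier = nn.Linear(768 + 768, 2)  # fused features -> {hateful, not hateful}

def classify(ocr_text: str, image: Image.Image) -> torch.Tensor:
    # 1. Encode the OCR'd caption with BERT (take the [CLS] embedding).
    tokens = text_tokenizer(ocr_text, return_tensors="pt", truncation=True)
    text_vec = text_encoder(**tokens).last_hidden_state[:, 0]
    # 2. Encode the image with ViT (take its [CLS] token embedding).
    pixels = image_processor(images=image, return_tensors="pt")
    image_vec = image_encoder(**pixels).last_hidden_state[:, 0]
    # 3. Late fusion: concatenate the two vectors and classify.
    fused = torch.cat([text_vec, image_vec], dim=-1)
    return classifier(fused).softmax(dim=-1)
```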

While effective for simple tasks, this approach often fails to capture the interaction between text and image. It sees “Smiling Trump” and “Crisis text” separately, missing the irony generated by their combination.

The Modern Approach: Vision-Language Models (VLMs)

The state-of-the-art has shifted toward large, pre-trained Vision-Language Models (VLMs) like CLIP, Flamingo, and GPT-4 (Vision). These models are trained on massive amounts of image-text pairs from the internet. They don’t just process text and image separately; they learn the semantic relationship between them.

For example, models like LLaVA or OpenFlamingo can “look” at an image and answer questions about it, making them far better suited for the Interpretation and Explanation tasks.
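
As a sketch of what querying such a model looks like, the snippet below assumes the publicly released llava-hf/llava-1.5-7b-hf checkpoint on Hugging Face; the prompt wording and file name are illustrative, not the setup of any surveyed system.

```python
# Hedged sketch: asking an instruction-tuned VLM to interpret a meme.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("still_your_president.png")  # hypothetical file
prompt = "USER: <image>\nWhat is the meme poster trying to convey? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0], skip_special_tokens=True))
```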

Experimental Results: Are We There Yet?

The researchers analyzed performance across the three key tasks.

Classification Performance

In classification tasks, models are becoming quite capable, though performance varies wildly depending on the difficulty of the dataset.

Table 2: State-of-the-art models on Meme Classification. B: Binary classification, N: Multi-class classification, L: Level, T: Target, A: Attack type, G: Gab, Tw: Twitter, R: Reddit, St: Sentiment, H: Humor, Sm: Semantic.

As seen in Table 2, binary classification (detecting hate vs. non-hate) is achieving high accuracy.

  • On the Hateful Memes dataset, the PaLI-X-VPD model achieves an AUC of 0.81.
  • On the WOAH5 dataset, models reach an AUC of 0.96.

However, look at the SemEval-2021-T6 row (Persuasion Techniques). The best F1 score is only 0.58. This suggests that while models are good at detecting overt hate, they struggle with subtle persuasion, propaganda, and complex humor.
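
It helps to remember that these two numbers measure different things: AUC scores how well the model ranks memes by predicted probability, while F1 balances precision and recall over hard labels. A quick illustration with toy data (not the survey’s results):

```python
# Illustrative computation of the reported metrics (toy data only).
from sklearn.metrics import roc_auc_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # gold labels (1 = hateful)
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]   # model probabilities
y_pred = [int(p >= 0.5) for p in y_prob]            # thresholded hard predictions

print("AUC:", roc_auc_score(y_true, y_prob))  # ranking quality
print("F1: ", f1_score(y_true, y_pred))       # precision/recall balance
```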

Explanation and Interpretation Performance

If classification is “solved” (optimistically), explanation and interpretation are definitely not.

Table 3: Best performing models for meme explanation (first two datasets) and interpretation (last dataset). The scores were taken from the respective papers and scaled to the range [0, 1]. The best results for each dataset are boldfaced.

Table 3 paints a stark picture.

  • Human Evaluation is Low: Look at the “Correct” column. For the HatReD dataset (explaining hate speech), even the best models only achieve a correctness score of roughly 0.62.
  • MemeCap Struggles: For the MemeCap dataset (interpreting meaning), the correctness score drops to 0.36. This means that roughly two-thirds of the time, the model generates an interpretation that humans consider wrong.

Why do they fail? The researchers note that errors often stem from:

  1. Hallucination: Models inventing details that aren’t there.
  2. Visual Attention: Failing to notice small but crucial visual cues (like a specific flag in the background or a subtle facial expression).
  3. Lack of External Knowledge: The model doesn’t know who the people are or the cultural context of the meme template.

Conclusion and Future Directions

The survey by Nguyen and Ng highlights that while Computational Meme Understanding has made strides in flagging harmful content, we are still far from systems that truly “get the joke.”

To bridge this gap, the authors suggest several future research avenues:

  1. Active Knowledge Acquisition: Models need to be connected to live knowledge bases (like Know Your Meme or Wikipedia) to understand breaking news and evolving templates. A static model trained in 2020 will never understand a meme about an event in 2024 (a toy retrieval sketch follows this list).
  2. Visual Reasoning: We need models that can explain where they are looking. Teaching models to follow a human-like reasoning path (e.g., “I see a gun, I see a text about school, therefore…”) could improve accuracy and trust.
  3. Video and Animated Memes: The current research is fixated on static images. However, the internet is moving toward GIFs and short-form video (TikTok/Reels). CMU needs to expand into the temporal dimension of video.
  4. Ethical Annotation: Finally, the researchers raise a crucial ethical point. Training these models requires humans to look at thousands of hateful, toxic memes. Future work must prioritize the mental health of annotators, perhaps by using AI to filter the worst content before human verification.
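
As a toy illustration of the first suggestion, the sketch below retrieves a template description from a small local dictionary standing in for a live knowledge base like Know Your Meme, and prepends it to the interpretation prompt. Every name and description here is hypothetical.

```python
# Hypothetical sketch of "active knowledge acquisition" via retrieval.
TEMPLATE_KB = {  # stand-in for a live knowledge base such as Know Your Meme
    "drake_hotline_bling": "Two panels: Drake rejects the top option and approves the bottom one.",
    "tuxedo_winnie_the_pooh": "Each row escalates in fanciness, ranking options from plain to 'classy'.",
}

def knowledge_augmented_prompt(template_name: str, ocr_text: str) -> str:
    context = TEMPLATE_KB.get(template_name, "No template information available.")
    return (
        f"Template background: {context}\n"
        f"Overlaid text: {ocr_text}\n"
        "Given this background, what is the meme poster trying to convey?"
    )

print(knowledge_augmented_prompt("tuxedo_winnie_the_pooh",
                                 "Buy the game / Wait for a sale / Get it for free"))
```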

Memes are a mirror of our culture—messy, fast-paced, and deeply contextual. Teaching machines to look into that mirror and understand what they see is one of the most challenging frontiers in AI today.