Social media is a double-edged sword. While it connects us, it also serves as a breeding ground for hate speech. Among the most insidious forms of online hate are misogynous memes. Unlike plain text insults, memes rely on a complex interplay between image and text, often employing dark humor, sarcasm, or obscure cultural references to mask their harmful intent.
Detecting these memes is a massive challenge for Artificial Intelligence. A standard AI might see a picture of a kitchen and the text “Make me a sandwich” and classify it as harmless banter about food. A human, however, immediately recognizes the sexist trope.
How do we teach machines to bridge this gap? A recent paper, “M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought,” proposes a fascinating solution. The researchers introduce a framework that doesn’t just look at a meme—it thinks about it, step-by-step, mimicking human reasoning to identify hate against women.
In this post, we will deconstruct this paper, exploring how Large Language Models (LLMs), Scene Graphs, and Chain-of-Thought reasoning come together to solve a complex multimodal problem.
The Problem: Why Traditional Models Fail
Before diving into the solution, we need to understand why this task is so hard. Traditional multimodal models (like early versions of VisualBERT or standard CLIP classifiers) often process text and images as raw data points. They map pixels and words to vectors and look for patterns.
However, misogyny in memes is rarely explicit. It hides in the relationship between the text and the image.

As shown in Figure 1 above:
- Approach (a) represents the standard method. The model takes the text and image, feeds them into a black box (Pre-trained Visual Language Model), and spits out a prediction. It lacks context and cultural nuance.
- Approach (b) represents the paper’s proposal (M3Hop-CoT). It doesn’t just guess; it generates Emotions, identifies the Target, and analyzes the Context before making a decision.
The researchers argue that to catch sophisticated hate speech, an AI needs to perform “Multi-hop” reasoning—jumping from understanding the visual scene to grasping the emotion, and finally to interpreting the societal context.
The Solution: M3Hop-CoT Framework
The researchers developed a model called M3Hop-CoT. The name stands for Multimodal Multi-hop Chain-of-Thought. Let’s break down the architecture to understand how it works.

The architecture, illustrated in Figure 2, operates like a sophisticated assembly line. Here is the step-by-step flow (a minimal code sketch follows the list):
- Input Processing: The model takes the meme text and the meme image. It uses CLIP (OpenAI’s contrastive vision-language model, which embeds images and text in a shared space) to extract the initial features.
- Scene Graph Generation (EORs): This is a crucial addition. The model doesn’t just look at the image as a grid of pixels; it builds a scene graph and extracts Entity-Object-Relationships (EORs) describing who or what is in the scene and how they relate.
- The “Thinking” Phase (LLM & CoT): The system feeds the text and the EORs into a Large Language Model (Mistral-7B). The LLM is prompted to generate three specific “rationales”:
- Emotion: What is the mood? (e.g., hostility, sarcasm).
- Target: Is this directed at women?
- Context: What is the cultural background?
- Fusion & Prediction: These rationales are converted back into mathematical vectors and fused with the original image/text features using an “Attention” mechanism. Finally, the model makes a prediction: Misogynous or Non-Misogynous.
Let’s dig deeper into the two most innovative parts of this pipeline: the Scene Graphs and the Multi-hop Reasoning.
1. Seeing Relationships with Scene Graphs
One of the most common failure modes of standard multimodal models is missing the visual context. To fix this, the researchers employ an Unbiased Scene Graph Generation technique.

As seen in Figure 7, a scene graph translates an image into a structured set of nodes and edges. Instead of just seeing “Man” and “Bat,” the model explicitly understands the relationship: Man -> Holding -> Bat.
This structured data (Entity-Object-Relationships or EORs) helps the Large Language Model understand exactly what is happening in the image without needing to process the raw pixels itself. It bridges the gap between visual chaos and structured linguistic reasoning.
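As a rough illustration, the sketch below verbalises a few EOR triples into plain text that could be dropped into an LLM prompt. The triple format and the `verbalise_eors` helper are assumptions made for illustration, not the paper’s exact serialization.

```python
# Turning scene-graph output into text an LLM can reason over (illustrative only).

ScenegraphTriple = tuple[str, str, str]  # (subject, relation, object)

def verbalise_eors(triples: list[ScenegraphTriple]) -> str:
    """Convert Entity-Object-Relationship triples into a plain bullet list."""
    return "\n".join(f"- {s} {r} {o}" for s, r, o in triples)

triples = [("man", "holding", "bat"), ("woman", "standing near", "man")]
print(verbalise_eors(triples))
# - man holding bat
# - woman standing near man
```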
2. The Three Hops of Reasoning
The core contribution of this paper is the Chain-of-Thought (CoT) prompting. The authors don’t just ask the LLM “Is this hate speech?” Instead, they force the model to reason in three specific “hops.”

Figure 12 provides excellent examples of why this is necessary. Look at the meme in the middle (ii). It references “Jorogumo” (a spider-woman from Japanese folklore) and uses a character from Rick and Morty.
- Hop 1 (Emotion): The model identifies confusion or insult.
- Hop 2 (Target): It identifies that the comparison targets a woman’s appearance or nature.
- Hop 3 (Context): The model retrieves knowledge about the Jorogumo myth to understand that the meme is dehumanizing women by comparing them to monsters.
Without this “Context” hop, a standard model might just see a cartoon character and classify it as “harmless.” The M3Hop-CoT framework ensures that cultural references—whether they are from the 1500s, anime, or religious texts—are decoded before a judgment is made.
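To give a feel for what such prompting could look like, here is a hedged sketch of a three-hop prompt builder. The wording and structure are illustrative guesses based on the description above, not the paper’s actual prompts; in practice the resulting prompt would be sent to an LLM such as Mistral-7B.

```python
# A sketch of a three-hop Chain-of-Thought prompt; the exact wording used in
# the paper will differ.

def build_m3hop_prompt(meme_text: str, eor_description: str) -> str:
    return (
        "You are analyzing a meme for misogyny.\n"
        f'Meme text: "{meme_text}"\n'
        f"Visual scene (entity-object relationships):\n{eor_description}\n\n"
        "Reason step by step:\n"
        "Hop 1 (Emotion): What emotion or tone does the meme convey?\n"
        "Hop 2 (Target): Is the meme directed at women, and how?\n"
        "Hop 3 (Context): What cultural or societal knowledge is needed "
        "to interpret it?\n"
    )

prompt = build_m3hop_prompt("She's no woman.", "- man slapping woman")
print(prompt)
```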
The Fusion: How the AI Decides
Once the LLM has generated the text rationales for Emotion, Target, and Context, the system needs to combine this text with the original image.
The researchers use a mechanism called Hierarchical Cross-Attention. In simple terms, this mechanism allows the model to “weigh” the importance of different inputs.
For example, if the text is neutral (“Look at this”), but the Context Rationale generated by the LLM says “This image depicts domestic violence,” the Attention mechanism will assign a higher weight to the Context, ensuring the final prediction leans towards “Misogynous.”
The model fuses these insights sequentially:
- Emotive Multimodal Fusion (EMF): Combines raw features with the Emotion rationale.
- Target Insight Multimodal Representation (TIMR): Adds the Target rationale.
- Comprehensive Contextual Multimodal Insight (CCMI): Adds the Context rationale.
This layered approach ensures that no single piece of evidence is ignored.
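The sketch below shows one plausible way to implement this sequential fusion in PyTorch, using standard `nn.MultiheadAttention` blocks chained in the EMF → TIMR → CCMI order. Layer names, dimensions, and the pooling/classifier head are assumptions, not the authors’ code.

```python
# Sequential cross-attention fusion sketch (EMF -> TIMR -> CCMI); illustrative only.

import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # One cross-attention block per rationale, applied in sequence.
        self.emf = nn.MultiheadAttention(dim, heads, batch_first=True)   # Emotion
        self.timr = nn.MultiheadAttention(dim, heads, batch_first=True)  # Target
        self.ccmi = nn.MultiheadAttention(dim, heads, batch_first=True)  # Context
        self.classifier = nn.Linear(dim, 2)  # Misogynous vs. Non-Misogynous

    def forward(self, meme_feat, emo_feat, tgt_feat, ctx_feat):
        # meme_feat: (batch, seq, dim) fused CLIP image/text features
        # *_feat:    (batch, seq, dim) encoded LLM rationales
        x, _ = self.emf(query=meme_feat, key=emo_feat, value=emo_feat)
        x, _ = self.timr(query=x, key=tgt_feat, value=tgt_feat)
        x, _ = self.ccmi(query=x, key=ctx_feat, value=ctx_feat)
        return self.classifier(x.mean(dim=1))  # pool over sequence, then classify

# Smoke test with random features standing in for real encodings.
fusion = HierarchicalFusion()
f = lambda: torch.randn(4, 16, 512)
logits = fusion(f(), f(), f(), f())
print(logits.shape)  # torch.Size([4, 2])
```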
Experiments and Results
Does this complex architecture actually work better? The researchers tested M3Hop-CoT against several state-of-the-art baselines on two major datasets: MAMI (English) and MIMIC (Hindi-English Code-Mixed).

Table 2 shows the results.
- CLIP_MM (the baseline multimodal model) achieves an F1-score of roughly 72-75%.
- M3Hop-CoT (Proposed) jumps to an F1-score of roughly 79-80%.
This is a statistically significant improvement. The data shows that simply adding an LLM (like GPT-4 or Llama) helps, but the specific M3Hop architecture with Mistral performs the best. This suggests that how you prompt the model (the multi-hop strategy) matters just as much as which model you use.
Qualitative Analysis: Seeing the Difference
To really understand the improvement, we can look at “Attention Maps.” These visualizations show us which parts of the meme the AI focused on to make its decision.

In Figure 5, look at the first example (a): “I was brought up to never hit a woman. She’s no woman.”
- CLIP_MM (Baseline): Predicts “Non-Misogynous.” It sees the word “never hit” and gets confused, missing the sarcasm in the second sentence.
- M3Hop-CoT (Proposed): Predicts “Misogynous.” The Grad-CAM visual heatmap (a generic recipe for producing such heatmaps is sketched below) shows it focusing intensely on the text and the slapping action in the image. The CoT reasoning helped it understand that “She’s no woman” is a dehumanizing justification for violence.
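For readers who want to reproduce this kind of visualization, here is a generic Grad-CAM recipe. It uses a torchvision ResNet as a stand-in classifier; the hooked layer, input preprocessing, and model are assumptions and differ from the paper’s setup.

```python
# Generic Grad-CAM recipe; the model and hooked layer are illustrative stand-ins.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()

# Capture the feature map of the last convolutional block via a forward hook.
activations = {}
model.layer4.register_forward_hook(lambda m, i, o: activations.update(feat=o))

image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed meme image
score = model(image)[0].max()         # score of the top predicted class
grads = torch.autograd.grad(score, activations["feat"])[0]  # d(score)/d(features)

# Weight each channel by its average gradient, sum, ReLU, and upsample.
weights = grads.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 1, 224, 224]) heatmap over the image
```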
Error Analysis: Where Does it Fail?
No AI is perfect. The authors provide a transparent look at where M3Hop-CoT still struggles.

Figure 4 breaks down the errors. The proposed model (far right bar) has significantly fewer errors than the baselines, but specific categories remain problematic:
- Cartoonist Images: When images are highly stylized or abstract, the Scene Graph sometimes fails to identify objects correctly, leading to bad rationales.
- Reasoning Failure: Sometimes the LLM just hallucinates or misses the point of a complex joke.
- Annotation Error: Interestingly, many of the “errors” were actually cases where the model was right, but the human label in the dataset was debatable or outright wrong.

Figure 3 further illustrates the error rates. The proposed model drastically reduces the error rate for the “Misogynous” class (the blue bars) compared to CLIP_MM, meaning it is much safer for detecting harmful content.
Conclusion and Future Implications
The M3Hop-CoT paper presents a significant step forward in automated content moderation. By moving away from “black box” prediction and toward “reasoning-based” classification, the model achieves three things:
- Higher Accuracy: It catches memes that standard models miss.
- Interpretability: Because it generates rationales (Emotion, Target, Context), human moderators can check why the AI flagged a post.
- Cultural Awareness: It leverages the vast knowledge of LLMs to understand diverse cultural cues, from history to pop culture.
While challenges remain—particularly with cartoons and highly implicit sarcasm—this approach highlights that the future of AI safety lies in Neuro-Symbolic thinking: combining the pattern recognition of deep learning with the logical reasoning of language models.
For students and researchers in NLP and Computer Vision, this paper serves as a perfect example of how to creatively combine existing tools (CLIP, Scene Graphs, LLMs) to solve nuanced, real-world problems.