The internet is a battleground. While social media platforms have spent years refining algorithms to detect hateful text, the adversary has evolved. Hate speech is no longer just about nasty words typed into a status update; it has migrated to the visual realm. Internet memes—images overlaid with text—have become a dominant vehicle for spreading animosity, often bypassing traditional text-based filters.
This shift presents a massive engineering challenge. Text-based hate speech detection is a mature field with abundant datasets. Vision-language (multimodal) detection, however, is data-starved: privacy concerns, copyright issues, and the sheer difficulty of scraping memes make it hard to build large multimodal training sets.
This leads to a fascinating research question: If we have millions of examples of hateful text, can we use them to teach an AI how to spot hateful memes?
In the paper Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning, researchers investigate this exact possibility. They explore a novel approach that leverages the richness of text-based data to improve performance in the data-scarce environment of multimodal hate speech detection.
The Core Problem: Data Scarcity and Modality Gaps
To understand the innovation of this paper, we must first understand the bottleneck. Modern AI thrives on data. To train a model to recognize hate speech against women, for instance, you typically feed it thousands of examples of misogynistic comments.
However, “Vision-Language” hate speech (like memes) is complex. A picture of a smiling person combined with sarcastic text might be hateful, while the same text on a different background is harmless. This is known as “inter-modality interaction.” Because these datasets are rare and small, models trained only on memes often fail when they encounter new, “out-of-distribution” data.
The researchers propose a solution based on Cross-Modality Knowledge Transfer. Since the underlying concept of hate remains consistent regardless of the medium (text or image), they hypothesize that a model can learn the logic of hate from text and apply it to images.
The Methodology: Few-Shot In-Context Learning
The researchers did not train a new model from scratch. Instead, they utilized Large Language Models (LLMs)—specifically Mistral-7B and Qwen2-7B—and employed a technique called Few-Shot In-Context Learning (ICL).
In ICL, you don’t update the model’s weights. Instead, you provide the model with a prompt that includes instructions and a few examples (demonstrations) of the task before asking it to solve a new problem.
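To make this concrete, here is a minimal sketch of what such a few-shot prompt might look like for hate speech classification, using one of the models from the paper via Hugging Face transformers. The checkpoint ID, the prompt wording, and the placeholder demonstrations are assumptions for illustration; the authors’ actual template also incorporates generated rationales and retrieved examples, described next.

```python
from transformers import pipeline

# Minimal few-shot ICL sketch: no weights are updated; the "learning"
# happens entirely inside the prompt.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint ID
    device_map="auto",
)

# Placeholder demonstrations; in the paper these come from a support set
# such as Latent Hatred (text) or FHM (memes converted to captions).
demonstrations = [
    ("an implicitly hateful tweet targeting a protected group ...", "hateful"),
    ("a benign tweet about everyday life ...", "not hateful"),
]
query = "caption of the meme to classify, plus its overlaid text ..."

# Assemble the prompt: task instruction, then demonstrations, then the query.
prompt = "Decide whether the following content is hateful or not hateful.\n\n"
for text, label in demonstrations:
    prompt += f"Content: {text}\nAnswer: {label}\n\n"
prompt += f"Content: {query}\nAnswer:"

# The model simply completes the prompt with its prediction.
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"][len(prompt):].strip())
```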
The experimental pipeline is ingenious in how it translates visual data for a text-based LLM:
- Image Captioning: Since standard LLMs cannot “see” images directly, the researchers used an image captioning model (OFA) to convert the visual component of a meme into a textual description.
- Rationale Generation: Merely showing the model a hate speech example isn’t enough. The researchers prompted the LLM to generate a “rationale”—an explanation of why a specific tweet or meme was hateful (e.g., identifying the target group or the derogatory stereotype).
- Retrieval: When the model is asked to classify a new meme, it doesn’t just pick random examples to learn from. It uses retrieval algorithms (TF-IDF or BM25) to find the most relevant examples from the support set based on similarity (see the sketch of the caption-and-retrieve steps after this list).
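To illustrate the caption-and-retrieve steps, here is a rough sketch. The paper uses OFA for captioning; the BLIP captioner below is merely a stand-in, and the TF-IDF retriever implements one of the two retrieval options the authors mention. The file name and placeholder strings are hypothetical.

```python
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BlipForConditionalGeneration, BlipProcessor

# --- Step 1: caption the meme image so a text-only LLM can "read" it. ---
# The paper uses OFA; BLIP is used here only as an illustrative stand-in.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

image = Image.open("meme.png").convert("RGB")  # hypothetical file
inputs = processor(images=image, return_tensors="pt")
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# The query combines the generated caption with the meme's overlaid text.
query = caption + " " + "text extracted from the meme"  # placeholder

# --- Step 2: retrieve the most similar support examples via TF-IDF. ---
support_set = [
    "implicitly hateful tweet #1 from the text support set ...",
    "implicitly hateful tweet #2 ...",
    "benign tweet #1 ...",
]  # in the paper: Latent Hatred tweets or FHM meme captions

vectorizer = TfidfVectorizer()
support_vecs = vectorizer.fit_transform(support_set)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, support_vecs)[0]
top_k = scores.argsort()[::-1][:4]  # keep the 4 most similar examples
demonstrations = [support_set[i] for i in top_k]
print(demonstrations)
```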
The Datasets
The study utilized distinct datasets to represent the different modalities:
- Support Set (The Teachers):
  - Text Support: Latent Hatred, a dataset of tweets containing both explicit and implicit hate speech.
  - Vision-Language Support: Facebook Hateful Memes (FHM) train split.
- Test Set (The Exam):
  - FHM (dev_seen split) and MAMI (Multimedia Automatic Misogyny Identification).

As shown in Table 1 above, the Latent Hatred (text) dataset is significantly larger than the meme datasets, highlighting the resource gap the researchers aim to bridge.
Experiment 1: Does Text Help Vision?
The first major research question (RQ1) was straightforward: Does the text hate speech support set help with vision-language hate speech?
To test this, the researchers compared a “Zero-shot” setting (where the model is given no examples) against “Few-shot” settings (where the model is given 4, 8, or 16 text-based examples from the Latent Hatred dataset).
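Schematically, that comparison boils down to a loop like the one below. Everything here is a stand-in for illustration: the dummy `classify` heuristic replaces the actual LLM call sketched earlier, random sampling replaces TF-IDF/BM25 retrieval, and the toy records replace the FHM/MAMI test items and Latent Hatred support tweets.

```python
import random

from sklearn.metrics import f1_score

# Stand-in helpers: in the real pipeline, `retrieve` would use TF-IDF or BM25
# and `classify` would build a few-shot prompt and query Mistral-7B / Qwen2-7B.
def retrieve(support_set, k):
    return random.sample(support_set, k) if k > 0 else []

def classify(item, demonstrations):
    # Dummy heuristic in place of the LLM call; 1 = hateful, 0 = not hateful.
    return int("slur" in item["text"])

# Toy data standing in for FHM dev_seen / MAMI (test) and Latent Hatred (support).
test_set = [
    {"text": "a meme caption containing a slur ...", "label": 1},
    {"text": "a harmless meme about cats ...", "label": 0},
]
support_set = [
    {"text": "an implicitly hateful tweet ...", "label": 1},
    {"text": "a benign tweet ...", "label": 0},
] * 8  # 16 items, enough for the largest shot count

for k in (0, 4, 8, 16):  # zero-shot baseline vs. 4/8/16-shot settings
    predictions, labels = [], []
    for item in test_set:
        demonstrations = retrieve(support_set, k)
        predictions.append(classify(item, demonstrations))
        labels.append(item["label"])
    print(f"{k}-shot F1: {f1_score(labels, predictions):.3f}")
```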
The results were compelling.

Referencing Table 2, we can observe several key trends:
- Text Boosts Performance: Across both the FHM and MAMI datasets, providing text-based demonstrations generally improved performance over the Zero-shot baseline. For instance, Mistral-7B’s F1 score on the MAMI dataset jumped from 0.568 (0-shot) to 0.701 (16 shots with BM25 retrieval).
- Retrieval Matters: Randomly selecting examples (random sampling) worked, but using intelligent retrieval (TF-IDF or BM25) to find text examples similar to the meme’s caption yielded better results.
- More is Better: Increasing the number of “shots” (examples) from 4 to 16 generally improved performance further.
This confirms that the logic of hate speech contained in tweets can indeed help an AI interpret the hate speech contained in memes.
Experiment 2: Text vs. Image Support
The second research question (RQ2) provided the most surprising insight. One might assume that to detect hateful memes, showing the model examples of other hateful memes (from the FHM train set) would be the best strategy; like-for-like demonstrations usually work best.
However, the data suggests otherwise.

Table 3 shows the results using the FHM (vision-language) dataset as the support set. When you compare Table 2 (Text Support) with Table 3 (Vision Support), a distinct pattern emerges: Text-based demonstrations outperform vision-language demonstrations.
Notice the red numbers in Table 3? These indicate instances where providing meme examples actually made the model worse than having no examples at all. The researchers speculate this is due to the “oversimplification” of visual information. When a meme is converted into a text caption for the LLM, nuanced visual context is lost. In contrast, the Latent Hatred dataset (text) contains rich, explicit, and diverse linguistic patterns that provide a stronger signal for the model to learn from.
Qualitative Analysis: Why does it work (and fail)?
The researchers went beyond raw numbers to analyze specific case studies. These examples illuminate how the model transfers knowledge from text to image.
Success Case: Conceptual Bridging
In one striking example, the model successfully classified a complex multimodal meme by drawing a conceptual bridge from a text tweet.

In Case Study 1 (Table 4), the meme features a pun on the word “stoned,” utilizing an image of a woman in a hijab. The model had previously failed to identify this as hate speech. However, after seeing a text example (Example 1) that disparagingly compared the Qur’an to “weed” and used the word “stoned,” the model made the connection. It learned that combining religious imagery with drug references or violence (stoning) is a form of hate speech.

Similarly, in Case Study 2 (Table 5), the model correctly identified a meme attacking intelligence based on appearance. The support text (Example 2) discussed IQ and stereotypes. This helped the model realize that the meme wasn’t just a random joke, but a targeted attack on a group’s intelligence—a specific category of hate speech it learned from the text.
Failure Case: Keyword Oversensitivity
However, the transfer isn’t perfect. The paper identifies a phenomenon known as oversensitivity, where the model latches onto specific keywords from the text examples and misapplies them to the memes.

In Table 6, the model misclassified a meme containing an image of baboons. Why? Because the support set (Example 1) contained a hateful tweet calling a person a “baboon.” The model learned “Baboon = Hate.” It failed to distinguish between a hateful metaphor in text and a literal image of an animal, leading to a false positive.

Table 7 illustrates a similar failure regarding historical context. A neutral meme regarding a “white resister” was flagged as hate. The support examples (Example 3 and 4) involved dismissive language about hate crimes and the Holocaust. The model seemingly absorbed the negative sentiment associated with “white” and “history/photos” in the context of hate speech and wrongly applied it to a neutral historical photo.
Conclusion and Future Implications
This research marks a significant step forward in digital safety. It demonstrates that we don’t necessarily need to wait for massive, ethically cleared multimodal datasets to improve hateful meme detection. We can leverage the vast oceans of text data already available.
The key takeaways are:
- Text helps Vision: Text-based hate speech examples significantly enhance the classification accuracy of vision-language hate speech.
- Richness over Modality: Rich text descriptions (even from a different modality) can be more instructive to an LLM than simplified captions from the same modality.
- The Risk of Bias: While effective, this method introduces risks of over-generalization, where models might flag innocent content because it shares keywords with hateful text.
For students and researchers in AI, this paper highlights the power of In-Context Learning. It shows that with clever prompting and retrieval strategies, LLMs can act as bridges between different types of media, solving complex problems by reasoning through analogy rather than just pattern matching pixels. Future work will likely focus on refining this transfer to reduce false positives and perhaps exploring cross-modality fine-tuning to bake this knowledge directly into the model weights.