[MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model 🔗](https://arxiv.org/abs/2406.11193)

Inside the Mind of Multimodal Models: Tracking Domain-Specific Neurons with MMNeuron

Introduction How does a large language model (LLM) “see” an image? When we feed a photograph of a chest X-ray or a satellite view of a city into a Multimodal Large Language Model (MLLM) like LLaVA or InstructBLIP, we know the architecture: an image encoder breaks the image into features, a projector maps them to the language space, and the LLM generates a response. But what happens in the hidden layers between that initial projection and the final answer? ...

2024-06 · 9 min · 1896 words
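
The pipeline this preview describes in one sentence (image encoder, projector, language model) can be written out as a few lines of PyTorch-style code. This is a minimal sketch with hypothetical module names and dimensions, not LLaVA's or InstructBLIP's actual implementation; the hidden layers inside the language model are what MMNeuron probes for domain-specific neurons.

```python
import torch
import torch.nn as nn

class MinimalMLLM(nn.Module):
    """Toy encoder -> projector -> LLM pipeline (hypothetical sizes)."""

    def __init__(self, vision_encoder, language_model, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a frozen ViT producing patch features
        self.projector = nn.Linear(vis_dim, llm_dim)  # maps visual features into the LLM's token space
        self.language_model = language_model          # decoder-only LLM

    def forward(self, image, text_embeds):
        vis_feats = self.vision_encoder(image)                # (batch, n_patches, vis_dim)
        vis_tokens = self.projector(vis_feats)                # (batch, n_patches, llm_dim)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)  # image tokens prepended to the text prompt
        return self.language_model(inputs)                    # placeholder call; the LLM's hidden states are what gets probed
```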
[MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance 🔗](https://arxiv.org/abs/2401.02906)

Blinded by the Light: Securing Multimodal AI Against Visual Jailbreaks with MLLM-Protector

Introduction: The New Vulnerability in Multimodal AI The rapid evolution of Artificial Intelligence has taken us from text-based Large Language Models (LLMs) like GPT-3 to Multimodal Large Language Models (MLLMs) like LLaVA and GPT-4V. These newer models possess the remarkable ability to “see”—they can process images alongside text to answer complex queries. This leap forward opens up endless applications, from medical imaging analysis to assisting the visually impaired. However, this added modality introduces a significant, often overlooked security flaw. While the AI community has spent years refining safety alignment for text—ensuring models refuse to generate hate speech or bomb-making instructions—the visual component acts as a backdoor. ...

2024-01 · 8 min · 1561 words
[MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation 🔗](https://aclanthology.org/2024.emnlp-main.1103.pdf)

Beyond Fake News: Decoding the Intent and Manipulation Behind Disinformation

The term “fake news” has become a staple of modern vocabulary, but it is a clumsy instrument for a surgical problem. Disinformation isn’t just about truth versus fiction; it is about the intent to harm and the methods used to deceive. Whether it involves denying climate change or undermining public health during a pandemic, disinformation is a calculated effort to shift public perception. To combat this effectively, we need to understand not just what is being said, but why and how. This is the core problem addressed in a recent paper titled “MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation.” ...

7 min · 1390 words
[MIND: Multimodal Shopping Intention Distillation from Large Vision-Language Models for E-commerce Purchase Understanding 🔗](https://arxiv.org/abs/2406.10701)

Why Did You Buy That? Decoding Shopping Intentions with Multimodal AI

Introduction Imagine walking into a store and buying a wireless mouse. A few minutes later, you pick up a solar-powered keyboard. To a human observer, the connection is obvious: you are likely setting up an eco-friendly, clutter-free home office. However, for traditional Artificial Intelligence systems in e-commerce, this connection is surprisingly difficult to make. Most existing systems rely solely on text—product titles and descriptions. When a text-based model sees “Orbit Trackball Mouse” and “Wireless Solar Keyboard,” it might correctly categorize them as “electronics,” but it often misses the nuanced intention behind the purchase. It fails to “see” that both items are white, ergonomic, and designed for a specific type of user. ...

2024-06 · 8 min · 1643 words
[MIBench: Evaluating Multimodal Large Language Models over Multiple Images 🔗](https://arxiv.org/abs/2407.15272)

Beyond the Single Frame: Why Multimodal LLMs Struggle with Multi-Image Scenarios

Introduction The rise of Multimodal Large Language Models (MLLMs) like GPT-4V, LLaVA, and mPLUG-Owl has revolutionized how Artificial Intelligence perceives the world. These models can describe photos, answer questions about diagrams, and even write code based on whiteboard sketches. However, there is a significant gap between these benchmark achievements and real-world utility. Most current benchmarks focus on single-image scenarios. The model is given one picture and asked a question. Yet, human visual consumption is rarely limited to a single frame. When we browse a website, we integrate information from multiple product photos and textual descriptions. When we watch a tutorial, we follow a temporal sequence of steps. When we scroll through social media, we process interleaved text and images simultaneously. ...

2024-07 · 8 min · 1580 words
[MEANT: Multimodal Encoder for Antecedent Information 🔗](https://arxiv.org/abs/2411.06616)

Reading the Market: How MEANT Combines Images, Tweets, and Time for Stock Prediction

The stock market is a chaotic, noisy environment. To make sense of it, a human trader doesn’t just look at a single number. They look at price charts (visual), read news and social media (textual), and analyze numeric indicators (quantitative). Crucially, they don’t just look at the current moment; they look at the trend over the last few days or weeks. This combination of different data types over time is what researchers call temporal multimodal data. ...

2024-11 · 10 min · 1981 words
[MAGIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration 🔗](https://arxiv.org/abs/2311.08562)

Can AI Play Nice? Benchmarking the Social Intelligence of Large Language Models

Introduction: The Missing “Social” Piece in AI We have all witnessed the meteoric rise of Large Language Models (LLMs) like GPT-4 and Claude. We know they can write code, compose poetry, and pass the bar exam. They possess incredible reasoning capabilities, memory, and tool usage. But there is a frontier that remains largely unexplored and surprisingly difficult for these digital polymaths: social intelligence. In the real world, intelligence is rarely solitary. We work in teams, we negotiate prices, we play games where we must hide our intentions, and we make decisions based on what we think others are thinking. This is the domain of Multi-Agent Systems (MAS). ...

2023-11 · 9 min · 1859 words
[MASIVE: Open-Ended Affective State Identification in English and Spanish 🔗](https://arxiv.org/abs/2407.12196)

Beyond Happy and Sad: Teaching AI to Understand Complex Human Feelings

If you were asked to describe how you feel after a long, difficult week that ended with a small victory, you probably wouldn’t just say “happy” or “sad.” You might say you feel relieved, drained, accomplished, or bittersweet. Human emotion is high-dimensional and nuanced. Yet, for years, Natural Language Processing (NLP) has treated emotion analysis as a simple sorting task. Most systems try to force complex human sentences into a tiny set of boxes—usually the “Basic Six” proposed by psychologist Paul Ekman (Anger, Disgust, Fear, Joy, Sadness, Surprise). ...

2024-07 · 8 min · 1645 words
[MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction 🔗](https://arxiv.org/abs/2410.03531)

Opening the Black Box: How MARE Extracts Multi-Aspect Rationales from Text

Deep learning models, particularly those based on Transformers like BERT, have revolutionized text classification. They can read a movie review and tell you with high accuracy whether it’s positive or negative. But there is a persistent problem: these models are “black boxes.” They give us a prediction, but they rarely tell us why they made it. In high-stakes domains like healthcare, law, or finance, “because the model said so” isn’t good enough. We need rationales—specific snippets of text that justify the model’s decision. ...

2024-10 · 10 min · 1998 words
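
For readers new to the task, the “rationale” the preview refers to is simply a selected span of the input that supports the prediction. Below is a minimal select-then-predict sketch of that general idea, with a hand-written mask for illustration; it is not MARE's architecture.

```python
# Select-then-predict in miniature: a selector keeps a subset of tokens (the rationale),
# and only that subset is used to justify the prediction for one aspect.
tokens = "The battery life is amazing but the screen scratches easily".split()
battery_mask = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # a learned selector would produce this mask

rationale = [tok for tok, keep in zip(tokens, battery_mask) if keep]
print(" ".join(rationale))  # "battery life is amazing" -> evidence for a positive battery-aspect label
```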
[MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering 🔗](https://aclanthology.org/2024.emnlp-main.91.pdf)

Who Is That? Solving the Identity Crisis in Multimodal LLMs with Matching-Augmented Reasoning

Multimodal Large Language Models (MLLMs) like GPT-4V and LLaVA have revolutionized how computers interact with the world. You can upload a photo of a complex scene, and these models can describe the lighting, read the text on a sign, or tell you what kind of dog is in the picture. They feel almost magical. However, the magic often fades when you ask a very specific, human-centric question: “Who is the person in this red box?” ...

9 min · 1778 words
[M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought 🔗](https://arxiv.org/abs/2410.09220)

Decoding Hate: How AI Uses Chain-of-Thought Reasoning to Spot Misogynous Memes

Social media is a double-edged sword. While it connects us, it also serves as a breeding ground for hate speech. Among the most insidious forms of online hate are misogynous memes. Unlike plain text insults, memes rely on a complex interplay between image and text, often employing dark humor, sarcasm, or obscure cultural references to mask their harmful intent. Detecting these memes is a massive challenge for Artificial Intelligence. A standard AI might see a picture of a kitchen and the text “Make me a sandwich” and classify it as harmless banter about food. A human, however, immediately recognizes the sexist trope. ...

2024-10 · 7 min · 1455 words
[M3D: MultiModal MultiDocument Fine-Grained Inconsistency Detection 🔗](https://aclanthology.org/2024.emnlp-main.1243.pdf)

Beyond True or False: Detecting Fine-Grained Inconsistencies Across Multimodal Documents

In the age of information overload, verifying a single claim often feels like detective work. You read a headline, check a news article, watch a video clip, and perhaps look at a photo caption. Rarely does a single document hold all the answers. Yet, most automated fact-checking systems today operate in a bubble: they look at a single piece of text and output a binary “True” or “False.” But misinformation is rarely black and white. A claim might be mostly true but contain a fabricated statistic. Or, it might describe a real event but attribute it to the wrong location shown in a video. To build trust, AI needs to do more than just flag a post; it needs to explain exactly which part of the claim contradicts the evidence. ...

8 min · 1502 words
[M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning 🔗](https://aclanthology.org/2024.emnlp-main.218.pdf)

Tuning 0.09% of the Parameters: How Multimodal Prompt Tuning (M2PT) Revolutionizes Zero-Shot Learning

The dream of Artificial General Intelligence (AGI) relies heavily on the ability of machines to process information the way humans do: multimodally. When you look at a picture of a crowded street and answer the question, “Is it safe to cross?”, you are seamlessly blending visual perception with linguistic reasoning. Multimodal Large Language Models (MLLMs) like LLaVA and Flamingo have made massive strides in mimicking this capability. However, there is a catch. As these models grow into the billions of parameters, adapting them to new tasks becomes prohibitively expensive. Traditional “full fine-tuning”—where we update every single weight in the model—is computationally heavy and storage-intensive. ...

7 min · 1412 words
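
The contrast with full fine-tuning is easy to make concrete: freeze the backbone, train only a small set of soft prompt vectors, and count what is left to update. The sizes below are hypothetical and the snippet is a generic prompt-tuning sketch, not M2PT's multimodal setup.

```python
import torch
import torch.nn as nn

# Hypothetical backbone; every weight in it is frozen.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=12)
for p in backbone.parameters():
    p.requires_grad = False

# The only trainable parameters: 16 learnable prompt vectors prepended to the input.
soft_prompts = nn.Parameter(torch.randn(16, 768))

trainable = soft_prompts.numel()
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"trainable share: {100 * trainable / total:.3f}%")  # a fraction of a percent, in the spirit of M2PT's 0.09%
```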
[Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps 🔗](https://arxiv.org/abs/2407.07071)

The Lookback Lens: Detecting Hallucinations by Watching Where LLMs Look

Introduction In the current landscape of Large Language Models (LLMs), we often rely on a technique called Retrieval-Augmented Generation (RAG). The premise is simple: LLMs can’t know everything, especially private data or recent news, so we provide them with relevant documents (the context) and ask them to answer questions or summarize that information. We assume that if we give the model the correct facts, it will use them. Unfortunately, that is not always true. ...

2024-07 · 10 min · 2128 words
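
The RAG setup the preview starts from is worth seeing concretely: retrieved passages are pasted into the prompt as context, and the model is asked to answer from them. The helper below is a generic sketch tied to no particular library, with fictional passages; the closing comment states the question Lookback Lens then asks about the model's attention.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a generic RAG prompt: numbered context passages followed by the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was the company founded?",
    ["Acme Corp was founded in 1998 in a garage.", "The company went public in 2004."],
)
# Lookback Lens then asks: while generating each answer token, how much attention
# flows back to these context tokens versus to the tokens the model just generated?
```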
[LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering 🔗](https://arxiv.org/abs/2410.18050)

Taming the Long Context: How LongRAG Solves the 'Lost in the Middle' Problem

In the rapidly evolving world of Large Language Models (LLMs), we have seen a massive push toward “Long-Context” capabilities. Models like Gemini 1.5 or GPT-4-Turbo boast the ability to process hundreds of thousands of tokens—entire novels or codebases—in a single prompt. Theoretically, this should solve the problem of answering complex questions based on large documents. However, reality tells a different story. When presented with massive amounts of data, LLMs often suffer from the “lost in the middle” phenomenon: they are great at remembering the beginning and end of a document but tend to forget or hallucinate details buried in the middle. ...

2024-10 · 7 min · 1335 words
[LONGEMBED: Extending Embedding Models for Long Context Retrieval 🔗](https://arxiv.org/abs/2404.12096)

Breaking the 512-Token Barrier: How to Extend Embedding Models for Long Context Retrieval

In the rapidly evolving world of Natural Language Processing (NLP), text embedding models are the unsung heroes. They transform text into vector representations—lists of numbers that capture semantic meaning—serving as the engine for information retrieval (IR) and Retrieval-Augmented Generation (RAG). However, there is a persistent bottleneck in this engine: the context window. Most popular embedding models, such as BERT-based architectures, are restricted to a short context window, typically 512 tokens. In real-world applications—like searching through legal contracts, summarizing long meeting transcripts, or indexing entire Wikipedia articles—512 tokens is simply not enough. When the input exceeds this limit, the text is usually truncated, leading to a massive loss of information. ...

2024-04 · 8 min · 1616 words
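
The truncation problem is easy to demonstrate: anything beyond the context window is silently dropped before the text is ever embedded. The model name below is just a common example of a 512-token BERT tokenizer, not a claim about which models LONGEMBED benchmarks.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # a typical 512-token model
long_document = " ".join(["clause"] * 5000)  # stand-in for a long legal contract

full_ids = tokenizer(long_document)["input_ids"]
kept_ids = tokenizer(long_document, truncation=True, max_length=512)["input_ids"]
print(len(full_ids), "->", len(kept_ids))  # roughly 5002 -> 512: everything after the cutoff never reaches the model
```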
[LogicST: A Logical Self-Training Framework for Document-Level Relation Extraction with Incomplete Annotations 🔗](https://aclanthology.org/2024.emnlp-main.314.pdf)

LogicST: How Logical Rules Can Fix Neural Networks in Relation Extraction

Introduction In the age of Big Data, we have more text than any human could ever read. To make sense of this information, we rely on Relation Extraction (RE)—the process of teaching machines to identify relationships between entities in text. For example, reading “Paris is in France” and extracting the triplet (Paris, Located In, France). This technology is the backbone of Knowledge Graphs, Question Answering systems, and semantic search. However, as we move from simple sentences to complex documents, the challenge grows exponentially. Document-level Relation Extraction (DocRE) requires understanding connections across long paragraphs, involving multiple entities. ...

9 min · 1758 words
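
The (Paris, Located In, France) example maps directly onto the basic data structure of relation extraction: a typed triplet of head entity, relation, and tail entity. The sketch below shows only that data structure, not LogicST's self-training pipeline.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str

# "Paris is in France" -> one extracted fact.
fact = Triplet(head="Paris", relation="Located In", tail="France")
print(fact)  # Triplet(head='Paris', relation='Located In', tail='France')

# At document level, a model must recover many such triplets whose entities may be
# mentioned paragraphs apart, and LogicST targets the case where the training
# annotations themselves are incomplete.
```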
[LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2401.00757)

Can AI Really Reason? Inside LogicAsker, the Framework Testing LLMs on Formal Logic

Large Language Models (LLMs) like GPT-4 and Llama 3 have become ubiquitous in our lives. They write poetry, generate code, summarize complex emails, and even crack jokes. When you interact with a chatbot that seems so articulate, it is natural to assume that there is a robust reasoning engine beneath the hood—a digital brain capable of connecting facts to draw logical conclusions. But here lies a significant challenge in the field of Artificial Intelligence: Is the model actually reasoning, or is it just really good at pattern matching? ...

2024-01 · 9 min · 1750 words
[Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia 🔗](https://arxiv.org/abs/2410.04282)

Lost in Translation: How AI Detects Information Gaps Across Wikipedia Languages

Wikipedia is often viewed as a singular, universal repository of human knowledge. We tend to assume that switching the language setting from English to French or Russian simply translates the text. However, the reality is far more complex. Wikipedia is a federation of distinct communities, each with its own editors, cultural norms, and biases. This leads to distinct narratives where facts present in one language are completely omitted in another. ...

2024-10 · 9 min · 1728 words
[Local Contrastive Editing of Gender Stereotypes 🔗](https://arxiv.org/abs/2410.17739)

Performing Brain Surgery on BERT: How to Locate and Edit Gender Bias in Language Models

Language Models (LMs) are mirrors of the data they are trained on. Unfortunately, this means they often reflect the societal biases, including gender stereotypes, found in the vast text of the internet. While we have many tools to measure this bias—checking if a model associates “doctor” more with men than women—we still have a limited understanding of where this bias physically lives inside the model. Which specific numbers (weights) among the millions or billions of parameters are responsible for thinking that “nurse” implies “female”? ...

2024-10 · 8 min · 1496 words