[MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension 🔗](https://arxiv.org/abs/2409.13609)

How MaPPER Makes Visual Grounding Efficient: A Deep Dive into Prior-Guided Tuning

Introduction Imagine you are looking at a crowded photograph of a street scene. A friend stands beside you and says, “Look at the guy in the yellow shirt standing near the bike.” Instantly, your brain processes the language, scans the image, filters out the “guys in blue shirts” and “guys near cars,” and locks onto the specific target. In computer vision, this task is known as Referring Expression Comprehension (REC). The goal is to ground a specific region in an image based on a natural language description. While this sounds intuitive to humans, it is a complex challenge for AI. It requires a model to possess strong visual perception, deep linguistic understanding, and—most importantly—the ability to align these two modalities perfectly. ...

2024-09 · 9 min · 1771 words
[MTLS: Making Texts into Linguistic Symbols 🔗](https://aclanthology.org/2024.emnlp-main.206.pdf)

Beyond Vocabulary: How Teaching AI to "See" Words Unlocks Multilingual Powers

Language is a strange thing. If you speak English, the word “Love” is a familiar sequence of four letters. If you speak Greek, “αγάπη” carries the same emotional weight but looks completely different. If you speak Chinese, “爱” is a distinct logogram. ...

4 min · 1957 words
[MTA4DPR: Multi-Teaching-Assistants Based Iterative Knowledge Distillation for Dense Passage Retrieval 🔗](https://aclanthology.org/2024.emnlp-main.336.pdf)

Why One Teacher Isn't Enough: Boosting Dense Retrieval with Multi-Assistant Distillation

Introduction In the world of Information Retrieval (IR), we are constantly balancing a difficult trade-off: accuracy versus speed. We want our search engines to understand the nuance of human language like a massive Large Language Model (LLM), but we need them to return results in milliseconds like a simple keyword search. Dense Passage Retrieval (DPR) has emerged as a powerful solution, using deep learning to represent queries and documents as vectors. However, the most accurate DPR models are often massive, computationally expensive, and slow to deploy. To solve this, researchers turn to Knowledge Distillation (KD)—a technique where a small, fast “student” model learns to mimic a large, slow “teacher” model. ...
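The student-mimics-teacher dynamic described above can be sketched as minimizing a divergence between softened score distributions. This is a generic distillation objective for illustration only, not the paper's multi-assistant scheme; the relevance scores and variable names below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    distributions over candidate passages. The student trains to
    drive this toward zero, i.e. to mimic the teacher's ranking."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical relevance scores for three candidate passages.
teacher = [4.0, 1.0, 0.2]
student_far = [0.5, 2.0, 1.0]    # disagrees with the teacher
student_near = [3.8, 1.1, 0.3]   # closely mimics the teacher

assert distill_loss(student_near, teacher) < distill_loss(student_far, teacher)
```

In practice this loss is computed over batched neural scores and combined with a supervised ranking loss; the pure-Python version only illustrates the objective being minimized.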

8 min · 1657 words
[MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models 🔗](https://arxiv.org/abs/2401.16745)

Beyond the First Prompt: Evaluating How LLMs Handle Long Conversations with MT-Eval

Introduction We are currently living in the “golden age” of Large Language Models (LLMs). From drafting emails to generating code snippets, models like GPT-4 and Llama-2 have integrated themselves into our daily workflows. When we benchmark these models, however, we often treat them like search engines: we ask a single question, get a single answer, and grade the result. But is this how we actually use AI? In the real world, interaction is rarely a one-shot event. We chat. We ask for revisions. We change the topic slightly, then circle back to an earlier point. We ask the model to “remember what I said three messages ago.” This is the domain of multi-turn interaction, and it is a significantly harder challenge for an AI than answering a standalone query. ...

2024-01 · 9 min · 1752 words
[MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making 🔗](https://arxiv.org/abs/2409.16686)

Building Smarter Robots: How Multi-Scale Insights Solve the Memory Problem in Embodied AI

Introduction Imagine you are teaching a robot how to navigate a kitchen. On day one, you teach it how to make a salad. It learns a valuable lesson: “Use a bowl to hold the ingredients.” On day two, you ask the robot to water a plant. The robot, eager to apply its past knowledge, remembers the concept of a “bowl” and the action “fill with water.” However, because its memory is cluttered, it might erroneously try to “slice” the water or mix the plant with dressing because it associates bowls with cooking. ...

2024-09 · 9 min · 1759 words
[MQuinE: a cure for Z-paradox in knowledge graph embedding models 🔗](https://arxiv.org/abs/2402.03583)

The Z-Paradox: Why Your Knowledge Graph Model Might Be Hallucinating Connections (and How MQuinE Fixes It)

Introduction In the world of Artificial Intelligence, Knowledge Graphs (KGs) are the unsung heroes. They power the sidebar on your Google Search, drive product recommendations on Amazon, and help complex systems understand that “Paris” is the capital of “France.” To make these graphs useful for machine learning, we use Knowledge Graph Embedding (KGE) models. These models translate entities (like “Paris”) and relations (like “is capital of”) into mathematical vectors and matrices. ...
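The "entities and relations as vectors" idea can be made concrete with a TransE-style translation score — a classic KGE formulation shown here only as background, not MQuinE itself. The toy embeddings are hand-picked for illustration, not learned.

```python
import math

# Toy embeddings: each entity and relation is a 2-d vector.
# Values are hand-picked so the true triple scores well.
entity = {
    "Paris":  (1.0, 2.0),
    "France": (3.0, 5.0),
    "Berlin": (0.0, 1.0),
}
relation = {
    "is_capital_of": (2.0, 3.0),
}

def transe_score(head, rel, tail):
    """TransE-style plausibility: a small distance ||h + r - t||
    means the triple (head, relation, tail) is likely true."""
    h, r, t = entity[head], relation[rel], entity[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# The true triple should score better (lower distance) than a
# corrupted one with the wrong head entity.
assert transe_score("Paris", "is_capital_of", "France") < \
       transe_score("Berlin", "is_capital_of", "France")
```

Real KGE models learn these vectors (and, in richer models, relation matrices) from the graph; the Z-paradox the paper discusses arises from structural limitations in how such scoring functions compose.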

2024-02 · 10 min · 1948 words
[MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs 🔗](https://arxiv.org/abs/2403.05814)

Mastering the Art of Conversation: How MP2D Uses Knowledge Graphs to Teach AI Topic Shifts

Have you ever noticed how rigid most chatbots feel? You ask about the weather, and they tell you the forecast. You ask about a restaurant, and they give you a menu. But if you try to naturally segue from that restaurant to the history of the cuisine, and then perhaps to a famous chef who cooks that cuisine, the bot often stumbles. It loses context or treats the new topic as a completely isolated query. ...

2024-03 · 8 min · 1552 words
[MOSEL: Inference Serving Using Dynamic Modality Selection 🔗](https://arxiv.org/abs/2310.18481)

Less is More: Accelerating Multimodal AI with Dynamic Modality Selection (MOSEL)

Introduction We are living in the era of massive artificial intelligence. In recent years, deep learning models—particularly Transformers—have shattered records in computer vision and natural language processing. We have moved beyond simple image classifiers to complex multimodal systems capable of understanding video, audio, and text simultaneously. However, this capability comes at a steep price. Between 2012 and 2020, the computational requirements for state-of-the-art machine learning applications increased by a staggering factor of 1,000,000. While hardware has improved, it hasn’t kept pace with this exponential growth. This creates a massive bottleneck for inference serving—the process of actually running these models in production to generate predictions for users. ...

2023-10 · 11 min · 2195 words
[MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages 🔗](https://arxiv.org/abs/2410.01036)

Bridging the Gap: How MOSEL Brings True Open Source to Speech AI

Introduction In the rapidly evolving world of Artificial Intelligence, “Open Source” has become a buzzword. From Large Language Models (LLMs) to Speech Foundation Models (SFMs), developers and researchers are flooded with new releases claiming to be open. But if you scratch beneath the surface, a complex problem emerges: Open Washing. Many models release their weights (the trained parameters) but keep their training data and code proprietary. Even when data is released, it often comes with restrictive licenses—such as “Non-Commercial” or “No-Derivatives”—which strictly violates the definition of Open Source AI. For undergraduate and master’s students entering the field, this distinction is critical. You cannot build a truly open, community-driven, or commercial application if the foundation you are building on is legally shaky. ...

2024-10 · 8 min · 1629 words
[MORPHEUS: Modeling Role from Personalized Dialogue History by Exploring and Utilizing Latent Space 🔗](https://arxiv.org/abs/2407.02345)

Beyond Hardcoded Personas: How MORPHEUS Generates Personalized Dialogue Using Latent Space

Imagine chatting with a sophisticated AI. You ask, “What did you do this weekend?” and it replies, “I went hiking with my dog.” Ten minutes later, you mention you love nature, and it replies, “I hate the outdoors, I prefer video games.” The illusion breaks. The personality—or persona—is inconsistent. Building chatbots that maintain a consistent, engaging personality is one of the “holy grails” of Natural Language Processing (NLP). Traditionally, this is done by explicitly feeding the model a “role profile” (e.g., Name: Alex, Hobby: Hiking, Pet: Dog) every time it generates a response. But in the real world, we often don’t have access to these explicit profiles due to data scarcity or privacy concerns. We only have the dialogue history—the breadcrumbs of personality left behind in previous conversations. ...

2024-07 · 10 min · 2056 words
[MMOE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts 🔗](https://arxiv.org/abs/2311.09580)

Why Your Multimodal AI Misses the Joke: Introducing Mixtures of Multimodal Interaction Experts (MMOE)

Introduction Imagine scrolling through social media and seeing a photo of a chaotic, messy room with the caption, “Living my best life.” As a human, you immediately recognize the sarcasm. The image (mess) and the text (“best life”) contradict each other, and that contradiction creates the meaning. Now, consider how a standard Multimodal AI sees this. It processes the image, processes the text, and tries to align them. It might see “mess” and “best life” and simply get confused because the semantic content doesn’t overlap. This highlights a critical limitation in current Vision-Language Models (VLMs): they are excellent at identifying when images and text say the same thing (redundancy), but they struggle when the meaning emerges from the interaction between the two (synergy), such as in sarcasm or humor. ...

2023-11 · 8 min · 1593 words
[MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language 🔗](https://arxiv.org/abs/2406.13698)

Lost in Translation: Why AI Struggles with Metaphors and How We Fix It

Introduction: The “Pink Elephant” in the Room Imagine you are trying to tell a friend that you were incredibly drunk last night. If you are speaking English, you might say you were “seeing pink elephants.” Now, imagine feeding that sentence into a translation engine to communicate with a friend in China. If the AI translates it literally, your Chinese friend might be confused about why you were at a zoo. In Chinese culture, a common metaphorical equivalent for being heavily drunk is “collapsed like quagmire” (烂醉如泥). ...

2024-06 · 4 min · 1636 words
[MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model 🔗](https://arxiv.org/abs/2406.11193)

Inside the Mind of Multimodal Models: Tracking Domain-Specific Neurons with MMNeuron

Introduction How does a large language model (LLM) “see” an image? When we feed a photograph of a chest X-ray or a satellite view of a city into a Multimodal Large Language Model (MLLM) like LLaVA or InstructBLIP, we know the architecture: an image encoder breaks the visual into features, a projector maps them to the language space, and the LLM generates a response. But what happens in the hidden layers between that initial projection and the final answer? ...
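The encoder → projector → LLM pipeline described above can be sketched as a plain function composition. Every component here is a toy stand-in for a large neural network; the function names, shapes, and outputs are illustrative only, not the API of LLaVA or InstructBLIP.

```python
def image_encoder(image):
    """Stand-in vision encoder: reduce each row of 'pixels' to one
    feature (real encoders emit a sequence of patch embeddings)."""
    return [sum(row) / len(row) for row in image]

def projector(features, scale=0.5, bias=1.0):
    """Stand-in projector: a linear map from the visual feature
    space into the language embedding space."""
    return [scale * f + bias for f in features]

def language_model(visual_tokens, prompt):
    """Stand-in LLM: consumes projected visual tokens alongside
    the text prompt and emits a response."""
    return f"{prompt}: received {len(visual_tokens)} visual tokens"

image = [[0.1, 0.3],
         [0.5, 0.7]]
tokens = projector(image_encoder(image))
response = language_model(tokens, "Describe the image")
```

The paper's question lives inside the last stage: once the projected visual tokens enter the LLM's hidden layers, which neurons actually carry the domain-specific information?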

2024-06 · 9 min · 1896 words
[MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance 🔗](https://arxiv.org/abs/2401.02906)

Blinded by the Light: Securing Multimodal AI Against Visual Jailbreaks with MLLM-Protector

Introduction: The New Vulnerability in Multimodal AI The rapid evolution of Artificial Intelligence has taken us from text-based Large Language Models (LLMs) like GPT-3 to Multimodal Large Language Models (MLLMs) like LLaVA and GPT-4V. These newer models possess the remarkable ability to “see”—they can process images alongside text to answer complex queries. This leap forward opens up endless applications, from medical imaging analysis to assisting the visually impaired. However, this added modality introduces a significant, often overlooked security flaw. While the AI community has spent years refining safety alignment for text—ensuring models refuse to generate hate speech or bomb-making instructions—the visual component acts as a backdoor. ...

2024-01 · 8 min · 1561 words
[MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation 🔗](https://aclanthology.org/2024.emnlp-main.1103.pdf)

Beyond Fake News: Decoding the Intent and Manipulation Behind Disinformation

The term “fake news” has become a staple of modern vocabulary, but it is a clumsy instrument for a surgical problem. Disinformation isn’t just about truth versus fiction; it is about the intent to harm and the methods used to deceive. Whether it involves denying climate change or undermining public health during a pandemic, disinformation is a calculated effort to shift public perception. To combat this effectively, we need to understand not just what is being said, but why and how. This is the core problem addressed in a recent paper titled “MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation.” ...

7 min · 1390 words
[MIND: Multimodal Shopping Intention Distillation from Large Vision-Language Models for E-commerce Purchase Understanding 🔗](https://arxiv.org/abs/2406.10701)

Why Did You Buy That? Decoding Shopping Intentions with Multimodal AI

Introduction Imagine walking into a store and buying a wireless mouse. A few minutes later, you pick up a solar-powered keyboard. To a human observer, the connection is obvious: you are likely setting up an eco-friendly, clutter-free home office. However, for traditional Artificial Intelligence systems in e-commerce, this connection is surprisingly difficult to make. Most existing systems rely solely on text—product titles and descriptions. When a text-based model sees “Orbit Trackball Mouse” and “Wireless Solar Keyboard,” it might correctly categorize them as “electronics,” but it often misses the nuanced intention behind the purchase. It fails to “see” that both items are white, ergonomic, and designed for a specific type of user. ...

2024-06 · 8 min · 1643 words
[MIBench: Evaluating Multimodal Large Language Models over Multiple Images 🔗](https://arxiv.org/abs/2407.15272)

Beyond the Single Frame: Why Multimodal LLMs Struggle with Multi-Image Scenarios

Introduction The rise of Multimodal Large Language Models (MLLMs) like GPT-4V, LLaVA, and mPLUG-Owl has revolutionized how Artificial Intelligence perceives the world. These models can describe photos, answer questions about diagrams, and even write code based on whiteboard sketches. However, there is a significant gap between these benchmark achievements and real-world utility. Most current benchmarks focus on single-image scenarios. The model is given one picture and asked a question. Yet, human visual consumption is rarely isolated to a single frame. When we browse a website, we integrate information from multiple product photos and textual descriptions. When we watch a tutorial, we follow a temporal sequence of steps. When we scroll through social media, we process interleaved text and images simultaneously. ...

2024-07 · 8 min · 1580 words
[MEANT: Multimodal Encoder for Antecedent Information 🔗](https://arxiv.org/abs/2411.06616)

Reading the Market: How MEANT Combines Images, Tweets, and Time for Stock Prediction

The stock market is a chaotic, noisy environment. To make sense of it, a human trader doesn’t just look at a single number. They look at price charts (visual), read news and social media (textual), and analyze numeric indicators (quantitative). Crucially, they don’t just look at the current moment; they look at the trend over the last few days or weeks. This combination of different data types over time is what researchers call temporal multimodal data. ...

2024-11 · 10 min · 1981 words
[MAGIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration 🔗](https://arxiv.org/abs/2311.08562)

Can AI Play Nice? Benchmarking the Social Intelligence of Large Language Models

Introduction: The Missing “Social” Piece in AI We have all witnessed the meteoric rise of Large Language Models (LLMs) like GPT-4 and Claude. We know they can write code, compose poetry, and pass the bar exam. They possess incredible reasoning capabilities, memory, and tool usage. But there is a frontier that remains largely unexplored and surprisingly difficult for these digital polymaths: social intelligence. In the real world, intelligence is rarely solitary. We work in teams, we negotiate prices, we play games where we must hide our intentions, and we make decisions based on what we think others are thinking. This is the domain of Multi-Agent Systems (MAS). ...

2023-11 · 9 min · 1859 words
[MASIVE: Open-Ended Affective State Identification in English and Spanish 🔗](https://arxiv.org/abs/2407.12196)

Beyond Happy and Sad: Teaching AI to Understand Complex Human Feelings

If you were asked to describe how you feel after a long, difficult week that ended with a small victory, you probably wouldn’t just say “happy” or “sad.” You might say you feel relieved, drained, accomplished, or bittersweet. Human emotion is high-dimensional and nuanced. Yet, for years, Natural Language Processing (NLP) has treated emotion analysis as a simple sorting task. Most systems try to force complex human sentences into a tiny set of boxes—usually the “Basic Six” proposed by psychologist Paul Ekman (Anger, Disgust, Fear, Joy, Sadness, Surprise). ...

2024-07 · 8 min · 1645 words