[MEANT: Multimodal Encoder for Antecedent Information 🔗](https://arxiv.org/abs/2411.06616)

Reading the Market: How MEANT Combines Images, Tweets, and Time for Stock Prediction

The stock market is a chaotic, noisy environment. To make sense of it, a human trader doesn’t just look at a single number. They look at price charts (visual), read news and social media (textual), and analyze numeric indicators (quantitative). Crucially, they don’t just look at the current moment; they look at the trend over the last few days or weeks. This combination of different data types over time is what researchers call temporal multimodal data. ...
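
To make the idea concrete, here is a minimal Python sketch of what one temporal multimodal training sample could look like (hypothetical field names and lag window, not MEANT's actual data schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DayObservation:
    """One trading day for a single ticker (illustrative fields only)."""
    chart_image_path: str         # visual modality: a rendered price chart
    tweets: List[str]             # textual modality: that day's tweets about the ticker
    price_features: List[float]   # quantitative modality: e.g. open, high, low, close, volume

@dataclass
class TemporalMultimodalSample:
    """A lag window of consecutive days plus the label to predict."""
    ticker: str
    window: List[DayObservation]  # e.g. the last 5 trading days, oldest first
    label: int                    # e.g. 1 if the price rises the next day, else 0

# A temporal model consumes the whole window, not just the most recent day:
sample = TemporalMultimodalSample(
    ticker="AAPL",
    window=[DayObservation(f"charts/AAPL_day{i}.png", ["..."], [0.0] * 5) for i in range(5)],
    label=1,
)
```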

2024-11 · 10 min · 1981 words
[MAGIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration 🔗](https://arxiv.org/abs/2311.08562)

Can AI Play Nice? Benchmarking the Social Intelligence of Large Language Models

We have all witnessed the meteoric rise of Large Language Models (LLMs) like GPT-4 and Claude. We know they can write code, compose poetry, and pass the bar exam. They possess incredible reasoning capabilities, memory, and tool-use skills. But there is a frontier that remains largely unexplored and surprisingly difficult for these digital polymaths: social intelligence. In the real world, intelligence is rarely solitary. We work in teams, we negotiate prices, we play games where we must hide our intentions, and we make decisions based on what we think others are thinking. This is the domain of Multi-Agent Systems (MAS). ...

2023-11 · 9 min · 1859 words
[MASIVE: Open-Ended Affective State Identification in English and Spanish 🔗](https://arxiv.org/abs/2407.12196)

Beyond Happy and Sad: Teaching AI to Understand Complex Human Feelings

If you were asked to describe how you feel after a long, difficult week that ended with a small victory, you probably wouldn’t just say “happy” or “sad.” You might say you feel relieved, drained, accomplished, or bittersweet. Human emotion is high-dimensional and nuanced. Yet, for years, Natural Language Processing (NLP) has treated emotion analysis as a simple sorting task. Most systems try to force complex human sentences into a tiny set of boxes—usually the “Basic Six” proposed by psychologist Paul Ekman (Anger, Disgust, Fear, Joy, Sadness, Surprise). ...

2024-07 · 8 min · 1645 words
[MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction 🔗](https://arxiv.org/abs/2410.03531)

Opening the Black Box: How MARE Extracts Multi-Aspect Rationales from Text

Deep learning models, particularly those based on Transformers like BERT, have revolutionized text classification. They can read a movie review and tell you with high accuracy whether it’s positive or negative. But there is a persistent problem: these models are “black boxes.” They give us a prediction, but they rarely tell us why they made it. In high-stakes domains like healthcare, law, or finance, “because the model said so” isn’t good enough. We need rationales—specific snippets of text that justify the model’s decision. ...

2024-10 · 10 min · 1998 words
[MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering 🔗](https://aclanthology.org/2024.emnlp-main.91.pdf)

Who Is That? Solving the Identity Crisis in Multimodal LLMs with Matching-Augmented Reasoning

Multimodal Large Language Models (MLLMs) like GPT-4V and LLaVA have revolutionized how computers interact with the world. You can upload a photo of a complex scene, and these models can describe the lighting, read the text on a sign, or tell you what kind of dog is in the picture. They feel almost magical. However, the magic often fades when you ask a very specific, human-centric question: “Who is the person in this red box?” ...

9 min · 1778 words
[M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought 🔗](https://arxiv.org/abs/2410.09220)

Decoding Hate: How AI Uses Chain-of-Thought Reasoning to Spot Misogynous Memes

Social media is a double-edged sword. While it connects us, it also serves as a breeding ground for hate speech. Among the most insidious forms of online hate are misogynous memes. Unlike plain text insults, memes rely on a complex interplay between image and text, often employing dark humor, sarcasm, or obscure cultural references to mask their harmful intent. Detecting these memes is a massive challenge for Artificial Intelligence. A standard AI might see a picture of a kitchen and the text “Make me a sandwich” and classify it as harmless banter about food. A human, however, immediately recognizes the sexist trope. ...

2024-10 · 7 min · 1455 words
[M3D: MultiModal MultiDocument Fine-Grained Inconsistency Detection 🔗](https://aclanthology.org/2024.emnlp-main.1243.pdf)

Beyond True or False: Detecting Fine-Grained Inconsistencies Across Multimodal Documents

In the age of information overload, verifying a single claim often feels like detective work. You read a headline, check a news article, watch a video clip, and perhaps look at a photo caption. Rarely does a single document hold all the answers. Yet, most automated fact-checking systems today operate in a bubble: they look at a single piece of text and output a binary “True” or “False.” But misinformation is rarely black and white. A claim might be mostly true but contain a fabricated statistic. Or, it might describe a real event but attribute it to the wrong location shown in a video. To build trust, AI needs to do more than just flag a post; it needs to explain exactly which part of the claim contradicts the evidence. ...

8 min · 1502 words
[M2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning 🔗](https://aclanthology.org/2024.emnlp-main.218.pdf)

Tuning 0.09% of the Parameters: How Multimodal Prompt Tuning (M2PT) Revolutionizes Zero-Shot Learning

The dream of Artificial General Intelligence (AGI) relies heavily on the ability of machines to process information the way humans do: multimodally. When you look at a picture of a crowded street and answer the question, “Is it safe to cross?”, you are seamlessly blending visual perception with linguistic reasoning. Multimodal Large Language Models (MLLMs) like LLaVA and Flamingo have made massive strides in mimicking this capability. However, there is a catch. As these models grow into the billions of parameters, adapting them to new tasks becomes prohibitively expensive. Traditional “full fine-tuning”—where we update every single weight in the model—is computationally heavy and storage-intensive. ...
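
For intuition about how such parameter-efficient tuning works in general, here is a rough sketch of the underlying prompt-tuning idea: freeze every backbone weight and train only a handful of prompt vectors. This is a generic illustration (the `PromptTunedEncoder` wrapper is hypothetical), not M2PT's specific architecture.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Generic prompt tuning: freeze the backbone, train only a few prompt vectors.
    Illustrative sketch only, not M2PT's specific design."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                # the billions of backbone weights stay frozen
        # The only trainable parameters: a small block of learnable prompt embeddings.
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # Prepend the learned prompts to every sequence in the batch, then run the frozen model.
        batch_size = token_embeddings.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch_size, -1, -1)
        return self.backbone(torch.cat([prompts, token_embeddings], dim=1))

def trainable_fraction(model: nn.Module) -> float:
    """How much of the model actually gets updated (the '0.09%'-style number)."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```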

7 min · 1412 words
[Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps 🔗](https://arxiv.org/abs/2407.07071)

The Lookback Lens: Detecting Hallucinations by Watching Where LLMs Look

In the current landscape of Large Language Models (LLMs), we often rely on a technique called Retrieval-Augmented Generation (RAG). The premise is simple: LLMs can’t know everything, especially private data or recent news, so we provide them with relevant documents (the context) and ask them to answer questions or summarize that information. We assume that if we give the model the correct facts, it will use them. Unfortunately, that is not always true. ...
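
As a minimal illustration of that RAG premise (a generic sketch, not this paper's code), the retrieved documents are simply packed into the prompt and the model is told to stay within them; whether it actually does is exactly what a hallucination detector has to verify:

```python
def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Assemble a grounded prompt from retrieved context documents (toy example)."""
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt(
    "When was the merger announced?",
    ["The merger was announced on 12 March 2024 ...",
     "Shares rose 4% after the announcement ..."],
))
```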

2024-07 · 10 min · 2128 words
[LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering 🔗](https://arxiv.org/abs/2410.18050)

Taming the Long Context: How LongRAG Solves the 'Lost in the Middle' Problem

In the rapidly evolving world of Large Language Models (LLMs), we have seen a massive push toward “Long-Context” capabilities. Models like Gemini 1.5 or GPT-4-Turbo boast the ability to process hundreds of thousands of tokens—entire novels or codebases—in a single prompt. Theoretically, this should solve the problem of answering complex questions based on large documents. However, reality tells a different story. When presented with massive amounts of data, LLMs often suffer from the “lost in the middle” phenomenon: they are great at remembering the beginning and end of a document but tend to forget or hallucinate details buried in the middle. ...

2024-10 · 7 min · 1335 words
[LONGEMBED: Extending Embedding Models for Long Context Retrieval 🔗](https://arxiv.org/abs/2404.12096)

Breaking the 512-Token Barrier: How to Extend Embedding Models for Long Context Retrieval

In the rapidly evolving world of Natural Language Processing (NLP), text embedding models are the unsung heroes. They transform text into vector representations—lists of numbers that capture semantic meaning—serving as the engine for information retrieval (IR) and Retrieval-Augmented Generation (RAG). However, there is a persistent bottleneck in this engine: the context window. Most popular embedding models, such as BERT-based architectures, are restricted to a short context window, typically 512 tokens. In real-world applications—like searching through legal contracts, summarizing long meeting transcripts, or indexing entire Wikipedia articles—512 tokens is simply not enough. When the input exceeds this limit, the text is usually truncated, leading to a massive loss of information. ...
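
To see how much a hard 512-token limit can cost, here is a toy sketch of the truncation step (whitespace tokens stand in for a real subword tokenizer, and the document is made up):

```python
def embed_with_truncation(text: str, max_tokens: int = 512) -> list[str]:
    """Crude illustration: anything beyond the context window never reaches the model."""
    tokens = text.split()                       # real models use subword tokenizers
    kept, dropped = tokens[:max_tokens], tokens[max_tokens:]
    if dropped:
        print(f"Dropped {len(dropped)} of {len(tokens)} tokens "
              f"({len(dropped) / len(tokens):.0%} of the document is invisible to retrieval).")
    return kept                                 # only this prefix gets embedded

long_contract = "clause " * 4000                # stand-in for a long legal contract
embed_with_truncation(long_contract)            # Dropped 3488 of 4000 tokens (87% ...)
```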

2024-04 · 8 min · 1616 words
[LogicST: A Logical Self-Training Framework for Document-Level Relation Extraction with Incomplete Annotations 🔗](https://aclanthology.org/2024.emnlp-main.314.pdf)

LogicST: How Logical Rules Can Fix Neural Networks in Relation Extraction

In the age of Big Data, we have more text than any human could ever read. To make sense of this information, we rely on Relation Extraction (RE)—the process of teaching machines to identify relationships between entities in text. For example, reading “Paris is in France” and extracting the triplet (Paris, Located In, France). This technology is the backbone of Knowledge Graphs, Question Answering systems, and semantic search. However, as we move from simple sentences to complex documents, the challenge grows exponentially. Document-level Relation Extraction (DocRE) requires understanding connections among many entities scattered across long paragraphs. ...
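
As a quick illustration of the data structure involved (the entities below are made-up examples, not from the paper), a relation is just a (head, relation, tail) triplet, and the document-level version has to assemble triplets from mentions that may sit paragraphs apart:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationTriplet:
    head: str
    relation: str
    tail: str

# Sentence-level RE: both entities appear in one sentence.
sentence_fact = RelationTriplet("Paris", "Located In", "France")

# Document-level RE: evidence is scattered, e.g.
#   "Marie Curie was born in Warsaw. [...] She later settled in Paris, where ..."
# and the extractor must link mentions that are far apart in the document:
document_facts = {
    RelationTriplet("Marie Curie", "Born In", "Warsaw"),
    RelationTriplet("Marie Curie", "Lived In", "Paris"),
}
print(sentence_fact, len(document_facts))
```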

9 min · 1758 words
[LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2401.00757)

Can AI Really Reason? Inside LogicAsker, the Framework Testing LLMs on Formal Logic

Large Language Models (LLMs) like GPT-4 and Llama 3 have become ubiquitous in our lives. They write poetry, generate code, summarize complex emails, and even crack jokes. When you interact with a chatbot that seems so articulate, it is natural to assume that there is a robust reasoning engine under the hood—a digital brain capable of connecting facts to draw logical conclusions. But here lies a significant challenge in the field of Artificial Intelligence: Is the model actually reasoning, or is it just really good at pattern matching? ...

2024-01 · 9 min · 1750 words
[Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia 🔗](https://arxiv.org/abs/2410.04282)

Lost in Translation: How AI Detects Information Gaps Across Wikipedia Languages

Wikipedia is often viewed as a singular, universal repository of human knowledge. We tend to assume that switching the language setting from English to French or Russian simply translates the text. However, the reality is far more complex. Wikipedia is a federation of distinct communities, each with its own editors, cultural norms, and biases. This leads to divergent narratives, where facts present in one language are completely omitted in another. ...

2024-10 · 9 min · 1728 words
[Local Contrastive Editing of Gender Stereotypes 🔗](https://arxiv.org/abs/2410.17739)

Performing Brain Surgery on BERT: How to Locate and Edit Gender Bias in Language Models

Language models (LMs) are mirrors of the data they are trained on. Unfortunately, this means they often reflect the societal biases, including gender stereotypes, found in the vast text of the internet. While we have many tools to measure this bias—checking if a model associates “doctor” more with men than women—we still have a limited understanding of where this bias physically lives inside the model. Which specific numbers (weights) among the millions or billions of parameters are responsible for thinking that “nurse” implies “female”? ...

2024-10 · 8 min · 1496 words
[LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models 🔗](https://arxiv.org/abs/2407.02987)

LoRA-Guard: Achieving On-Device AI Safety with Parameter-Efficient Adaptation

The rapid evolution of Large Language Models (LLMs) has brought us capable conversational assistants, coding partners, and creative writers. However, this capability comes with a significant caveat: without careful alignment, these models can generate toxic, offensive, or illegal content. While “safety tuning” (like Reinforcement Learning from Human Feedback) helps, it isn’t a silver bullet. Jailbreaks—cleverly crafted prompts designed to bypass safety filters—remain a persistent threat. To combat this, the industry has turned to guardrails: separate, dedicated models that monitor the conversation and flag harmful content. The problem? Running a massive LLM is already computationally expensive. Running a second massive model just to police the first one is often impossible, especially on resource-constrained devices like mobile phones or laptops. ...
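
For context, this is the classic two-model guardrail pattern the post is describing, sketched with toy stand-in models (a generic illustration, not LoRA-Guard's parameter-sharing design):

```python
from typing import Callable

def guarded_generate(
    chat_model: Callable[[str], str],
    guard_model: Callable[[str], bool],   # returns True if the text is judged harmful
    prompt: str,
) -> str:
    """A separate safety model screens both the user prompt and the assistant reply."""
    if guard_model(prompt):
        return "Sorry, I can't help with that request."
    reply = chat_model(prompt)
    if guard_model(reply):
        return "Sorry, I can't share that response."
    return reply

# Toy stand-ins for the two models (the real cost is running a second large model):
print(guarded_generate(
    chat_model=lambda p: f"Echo: {p}",
    guard_model=lambda text: "forbidden topic" in text.lower(),
    prompt="Tell me a joke about compilers.",
))
```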

2024-07 · 7 min · 1468 words
[LitSearch: A Retrieval Benchmark for Scientific Literature Search 🔗](https://arxiv.org/abs/2407.18940)

Cracking the Code of Scientific Search: Inside the LitSearch Benchmark

If you are a student or a researcher, you know the struggle. You have a specific concept in mind—perhaps a vague memory of a paper that “uses structured pruning to scale down language models”—but you don’t remember the title, the authors, or the year. You turn to Google Scholar or a similar academic search engine, type in your query, and … nothing. Or rather, pages and pages of tangentially related results, surfaced by keyword matching that fails to capture the concept you are looking for. ...

2024-07 · 9 min · 1913 words
[Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval 🔗](https://arxiv.org/abs/2410.18385)

Connecting the Dots: How Universal Document Linking Solves Zero-Shot Retrieval

Imagine building a search engine for a brand-new medical database or a collection of legal precedents in a foreign language. You have millions of documents, but you have a major problem: zero users. Without a history of user queries (the things people type into the search bar), how do you teach your search algorithm what “relevance” looks like? This is the challenge of Zero-Shot Information Retrieval (IR). Modern search engines rely heavily on “Dense Retrieval” models—neural networks that understand semantic meaning. However, these models need massive amounts of training data (pairs of queries and relevant documents) to work well. When you drop them into a new domain without fine-tuning, their performance usually collapses. ...

2024-10 · 9 min · 1716 words
[Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination 🔗](https://arxiv.org/abs/2406.08818)

The Standard American Default: How ChatGPT Fails Speakers of Global English Dialects

Large Language Models (LLMs) like ChatGPT are often presented as universal tools—omniscient assistants capable of conversing on any topic, in any language. However, when we peel back the layers of this “universality,” we often find a very specific worldview encoded in the system. For millions of English speakers around the world, ChatGPT does not act as a neutral mirror of their language; instead, it acts as a corrective lens, filtering out their cultural identity or, worse, reflecting a caricature back at them. ...

2024-06 · 9 min · 1710 words
[Linear Layer Extrapolation for Fine-Grained Emotion Classification 🔗](https://aclanthology.org/2024.emnlp-main.1161.pdf)

Beyond the Final Layer—Extrapolating Emotion in LLMs

Imagine you are texting a friend. They reply: “You can’t change who people are, but you can love them #sadly.” How do you classify the emotion here? A standard sentiment analysis tool might see the word “love” and tag it as Joy, or see the hashtag and tag it as Sadness. But a human reader detects something more nuanced: a sense of resignation, a difficult acceptance of reality. The correct label is likely Pessimism. ...

9 min · 1730 words