[METAREFLECTION: Learning Instructions for Language Agents using Past Reflections 🔗](https://arxiv.org/abs/2405.13009)

How Agents Learn from Mistakes: Introducing METAREFLECTION

Imagine you are studying for a difficult history exam. You take a practice quiz and get a question wrong about the French Revolution. You don’t just look up the correct answer for that specific question; you realize you have a fundamental misunderstanding of the timeline. You write a note to yourself: “Always check the dates of events before determining causality.” This process—generalizing a specific failure into a rule for future success—is a hallmark of human learning. It is often referred to as accumulating semantic memory. ...
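
A conceptual sketch of such a semantic-memory loop (not the paper's exact algorithm; `agent`, `reflect`, and `task.check` are hypothetical callables standing in for the language agent, the reflection step, and a correctness test):

```python
def metareflect(agent, reflect, tasks, instructions=None):
    """Accumulate general rules from failures on training tasks.

    Conceptual sketch: on each failed task, a reflection step distills
    the failure into a general instruction; accumulated instructions
    are fed back to the agent on future tasks, like semantic memory.
    """
    instructions = list(instructions or [])
    for task in tasks:
        answer = agent(task, instructions)   # attempt with current rules
        if not task.check(answer):           # failure -> generalize it
            rule = reflect(task, answer)     # e.g. "check dates before causality"
            instructions.append(rule)
    return instructions
```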

2024-05 · 7 min · 1457 words
[MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic 🔗](https://arxiv.org/abs/2406.11385)

Solving the LLM Merging Trilemma: A Deep Dive into MetaGPT

In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and LLaMA-2 have become the backbone of modern NLP. The standard workflow is familiar: take a massive pre-trained base model, then fine-tune it on a specific task—be it coding, mathematics, or creative writing. This yields high performance, but it creates a logistical nightmare. For every new capability, you need to deploy a separate, heavy model. Ideally, we want Multi-Task Learning (MTL): a single model that is proficient in everything. However, training a billion-parameter model on all tasks simultaneously is computationally prohibitive and often impossible due to data privacy constraints (many dataset owners won’t share their raw data). ...
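
For orientation, here is a minimal sketch of plain task arithmetic, the mechanism MetaGPT builds on: each "task vector" is the parameter delta between a fine-tuned checkpoint and the shared base, and merging adds a weighted sum of deltas back onto the base. MetaGPT's contribution is deriving the scaling weights from the models alone; in this sketch `lambdas` is just a hand-supplied input.

```python
def merge_task_arithmetic(base_state, finetuned_states, lambdas):
    """Merge fine-tuned checkpoints into one model via task arithmetic.

    base_state / finetuned_states are PyTorch-style state dicts sharing
    the same keys; lambdas holds one scaling coefficient per task.
    """
    merged = {}
    for name, base_param in base_state.items():
        # Task vector for task i: finetuned_i - base; merge = weighted sum.
        delta = sum(
            lam * (ft[name] - base_param)
            for ft, lam in zip(finetuned_states, lambdas)
        )
        merged[name] = base_param + delta
    return merged

# e.g. merged = merge_task_arithmetic(base.state_dict(),
#                                     [math_lm.state_dict(), code_lm.state_dict()],
#                                     lambdas=[0.5, 0.5])
```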

2024-06 · 7 min · 1340 words
[Message Passing on Semantic-Anchor-Graphs for Fine-grained Emotion Representation Learning and Classification 🔗](https://aclanthology.org/2024.emnlp-main.162.pdf)

Anchoring Emotions: How SEAN-GNN Captures Subtle Feelings in Text

Humans are emotionally complex creatures. We don’t just feel “happy” or “sad.” We feel ecstatic, content, devastated, terrified, or apprehensive. In the field of Natural Language Processing (NLP), distinguishing between these subtle nuances is known as Fine-grained Emotion Classification (FEC). While standard sentiment analysis might be satisfied with labeling a sentence as “negative,” FEC aims to determine if that negativity stems from anger, fear, or sadness. This is incredibly difficult for machines because the difference often lies in the specific choice of vocabulary and the precise arrangement of words. ...

9 min · 1864 words
[Mentor-KD: Making Small Language Models Better Multi-step Reasoners 🔗](https://arxiv.org/abs/2410.09037)

Bridging the Gap: How a 'Mentor' Model Teaches Small AIs to Reason Like Giants

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 or Claude are the undisputed heavyweights. They possess an “emergent” ability known as Chain-of-Thought (CoT) reasoning—the capability to break down complex problems into step-by-step logical progressions to arrive at a correct answer. However, there is a catch. These reasoning abilities typically only emerge in models with hundreds of billions of parameters. Running these models requires massive computational resources or expensive API calls, making them impractical for deployment on local devices or in low-resource environments. ...

2024-10 · 8 min · 1595 words
[Memory-Efficient Fine-Tuning of Transformers via Token Selection 🔗](https://arxiv.org/abs/2501.18824)

TokenTune: Squeezing Large Language Models onto Smaller GPUs by Ignoring Tokens

The explosion of Large Language Models (LLMs) has democratized access to powerful AI, but customizing these models remains a hardware nightmare. While using a pre-trained model like Llama-2 or GPT-3 is relatively cheap, fine-tuning it—specializing it for medical data, code generation, or a specific writing style—requires massive computational resources. For instance, fine-tuning a 65-billion parameter model can require upwards of 780 GB of GPU memory. This effectively gates the ability to customize state-of-the-art models behind an enterprise-level paywall. ...
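
In spirit, the trick is to let gradients flow through only a fraction of token positions, so activations must be cached for far fewer tokens. A conceptual PyTorch sketch, not the paper's code; a naive `detach` like this only illustrates where gradients flow, while the real memory saving comes from storing activations solely for the selected positions:

```python
import torch

def select_tokens_for_grad(hidden, keep_ratio=0.3):
    """Keep gradients for a random subset of token positions.

    hidden has shape (batch, seq_len, dim); non-selected positions are
    detached, so no gradient (and ideally no cached activation) is
    needed for them during the backward pass.
    """
    _, seq_len, _ = hidden.shape
    num_keep = max(1, int(seq_len * keep_ratio))
    keep = torch.randperm(seq_len)[:num_keep]   # positions that stay trainable
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[keep] = True
    # Live tensor for selected positions, detached copy for the rest.
    return torch.where(mask.view(1, -1, 1), hidden, hidden.detach())
```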

2025-01 · 8 min · 1615 words
[Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk 🔗](https://aclanthology.org/2024.emnlp-main.1169.pdf)

Memorize Step by Step: A Smarter Way to Handle Long Contexts in LLMs

The era of the “million-token context window” is here. With models like Claude 3 and Gemini 1.5, we are moving away from short prompts into the territory of processing entire books, codebases, and legal archives in a single go. But there is a catch: hardware hasn’t scaled as fast as our ambitions. Processing a sequence of 1 million tokens requires massive computational resources and GPU memory. If you try to feed a text that long into a standard GPU, you will almost certainly hit an “Out of Memory” (OOM) error before the model even generates the first word. ...
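
The baseline this paper refines is chunked prefill: feed the prompt in pieces while the KV cache carries state between pieces, capping peak activation memory. A minimal sketch assuming a Hugging Face-style causal LM (`past_key_values`, `use_cache`); the paper's incremental memory and shrinking ("decremental") chunk sizes sit on top of a loop like this:

```python
import torch

@torch.no_grad()
def chunked_prefill(model, input_ids, chunk_size=4096):
    """Prefill a long prompt chunk by chunk instead of all at once.

    The KV cache (`past`) accumulates across chunks, so each forward
    pass only processes chunk_size tokens of new activations.
    """
    past = None
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values
    return past  # KV cache, ready for decoding
```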

8 min · 1525 words
[MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification 🔗](https://arxiv.org/abs/2409.14703)

Decoding Memes: How MemeCLIP and PrideMM Are Changing Multimodal Content Moderation

In the digital age, memes are more than just funny pictures; they are a sophisticated language of their own. They can distill complex political opinions, social commentary, and cultural inside jokes into a single, shareable unit. However, this power has a dark side. Memes have become a potent vehicle for hate speech, cyberbullying, and disinformation, often hiding behind layers of irony and sarcasm that traditional content moderation systems struggle to parse. ...

2024-09 · 9 min · 1750 words
[Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? 🔗](https://arxiv.org/abs/2411.04118)

The Medical AI Mirage: Why Specialized Models Might Not Be Better Than General Ones

The intersection of Artificial Intelligence and medicine is currently one of the most exciting frontiers in technology. Every few months, we see a new headline announcing a “Medical LLM”—a specialized artificial intelligence tailored specifically for healthcare. The narrative is almost always the same: take a powerful general-purpose model (like Llama or Mistral), train it further on a massive library of medical textbooks and PubMed articles, and voilà: you have a digital doctor that outperforms its generalist predecessor. ...

2024-11 · 8 min · 1598 words
[Media Attitude Detection via Framing Analysis with Events and their Relations 🔗](https://aclanthology.org/2024.emnlp-main.954.pdf)

Beyond Word Choice—Detecting Media Bias Through Event Framing and Causal Narratives

In March 2024, Vladimir Putin won the Russian presidential election. If you read about this event in a state-backed Russian outlet, you likely encountered a narrative of “legitimacy,” “national unity,” and a “landslide victory.” If you read a Western outlet, the story was likely framed around “electoral fraud,” the “suppression of opponents,” and the ongoing war in Ukraine. The core facts—that an election happened and Putin won—are the same. The difference lies in the media attitude, or how the outlet feels about the event. ...

8 min · 1699 words
[MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations 🔗](https://arxiv.org/abs/2410.14204)

Why Medical AI Needs Better Data: Introducing MediTOD and the CMAS Ontology

Imagine a future where doctor burnout is significantly reduced, and patients have instant access to high-quality medical triage. This is the promise of Medical Task-Oriented Dialogue (TOD) systems. These AI agents aim to assist doctors by collecting patient medical history, aiding in diagnosis, and guiding treatment selection. However, if you have ever tried to build a medical chatbot, you likely ran into a massive wall: data. Specifically, the lack of high-quality, privacy-compliant, English-language datasets. While general-purpose chatbots have flourished thanks to massive internet scrapes, medical AI is starved for data due to strict privacy regulations (like HIPAA). Furthermore, the few datasets that do exist often oversimplify the complexity of medicine. They might capture that a patient has a fever, but fail to capture when it started, how it progressed, or what makes it better. ...

2024-10 · 10 min · 2092 words
[MEDREADME: A Systematic Study for Fine-grained Sentence Readability in Medical Domain 🔗](https://arxiv.org/abs/2405.02144)

Decoding Medical Jargon: How We Measure and Improve Readability in Healthcare

“If you can’t measure it, you can’t improve it.” This famous maxim, often attributed to Peter Drucker, rings especially true in the world of medical communication. We live in an era where reliable medical knowledge is crucial for public health. From Wikipedia articles to the Merck Manuals, and from cutting-edge research papers to patient pamphlets, the dissemination of health information is constant. However, access to information does not equate to understanding. Medical texts are notoriously difficult to digest. They are dense, technical, and filled with specialized terminology that can alienate the very people they are meant to help—patients and non-experts. ...

2024-05 · 9 min · 1748 words
[MedCoT: Medical Chain of Thought via Hierarchical Expert 🔗](https://arxiv.org/abs/2412.13736)

Why Two Doctors Are Better Than One: Decoding MedCoT for Medical AI

Imagine visiting a doctor with a complex X-ray. You ask, “Is there a tumor?” and the doctor simply says “Yes” and walks out of the room. No explanation, no pointing to the shadow on the film, and no discussion of why they reached that conclusion. You would likely feel terrified and skeptical. Unfortunately, this is how many Artificial Intelligence systems in Medical Visual Question Answering (Med-VQA) currently operate. They ingest an image and a question, and they output a flat answer. While accuracy is important, in a clinical setting, the reasoning path—the “why”—is just as critical as the final “what.” Furthermore, relying on a single AI model is like relying on a single doctor who might be tired or biased; it lacks the robustness required for life-critical diagnostics. ...

2024-12 · 8 min · 1581 words
[MedAdapter: Efficient Test-Time Adaptation of Large Language Models Towards Medical Reasoning 🔗](https://arxiv.org/abs/2405.03000)

Bridging the Gap: How MedAdapter Optimizes LLMs for Medicine Without Breaking the Bank

The integration of Large Language Models (LLMs) into the biomedical domain holds immense promise, from assisting in complex diagnoses to automating clinical note-taking. However, a significant barrier stands in the way of widespread adoption: the “resource-privacy-performance” trilemma. On one hand, we have massive Black-Box LLMs (like GPT-4) that offer state-of-the-art reasoning but come with high costs and severe privacy risks when patient data is involved. On the other hand, we have White-Box LLMs (like LLaMA-2) that can be run locally and privately, but often struggle to match the reasoning capabilities of their larger counterparts, even after expensive fine-tuning. ...
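
The adapter pattern can be pictured as best-of-N reranking: the large model proposes, and a small locally trained scorer picks. A conceptual sketch with hypothetical `generate` and `score` callables, not MedAdapter's actual interfaces:

```python
def best_of_n(generate, score, question, n=8):
    """Pick the best of n candidate answers using a small local scorer.

    Conceptual sketch: `generate` calls the large (possibly black-box)
    LLM; `score` is a small adapter model trained locally, so only the
    small model ever needs fine-tuning and patient data stays on-premise.
    """
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=score)
```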

2024-05 · 8 min · 1704 words
[Measuring Psychological Depth in Language Models 🔗](https://aclanthology.org/2024.emnlp-main.953.pdf)

Can AI Make You Cry? Measuring the Psychological Depth of LLM-Generated Stories

We have reached a point in the evolution of Artificial Intelligence where machines can generate text that is grammatically perfect, stylistically consistent, and undeniably coherent. If you ask GPT-4 to write a sonnet about a toaster, it will do so with impressive rhyme and meter. But there is a frontier that has remained elusive, a quality that separates a technical manual from a heartbreaking novel: Psychological Depth. Evaluations of Large Language Models (LLMs) have traditionally focused on objective metrics. We measure perplexity, distinct n-grams, and discourse coherence. We check for toxicity and bias. While indispensable, these metrics treat text as data. They do not account for the reader. They cannot tell us if a story evokes empathy, if it makes your heart race, or if the characters feel like genuine human beings rather than cardboard cutouts. ...

9 min · 1850 words
[Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions 🔗](https://arxiv.org/abs/2407.20243)

Shrinking Giants: How Matryoshka-Adaptor Makes LLM Embeddings Smaller, Faster, and Cheaper

In the world of Large Language Models (LLMs) and Information Retrieval (IR), bigger has almost always meant better. High-dimensional embeddings—those long vectors of numbers representing the semantic meaning of text—capture subtle nuances that smaller vectors miss. A 3,072-dimensional vector from OpenAI usually understands your query better than a 256-dimensional one. But “bigger” comes with a steep price tag. Storing millions of high-dimensional vectors requires massive amounts of memory. Searching through them (vector search) creates high latency and computational costs. Engineers are often stuck in a dilemma: choose high accuracy and burn through the budget, or choose efficiency and accept worse search results. ...
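
The "Matryoshka" property the adaptor instills makes shrinking a vector almost free: once the leading coordinates carry most of the signal, a smaller embedding is just truncation plus renormalization. A minimal NumPy sketch:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize.

    Valid only if the embedding space has been trained or adapted so
    that leading coordinates carry the most information.
    """
    small = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(small)
    return small / norm if norm > 0 else small

# e.g. shrink a 3,072-d vector to 256 dims for cheaper vector search:
# small = truncate_embedding(big_vec, 256)
```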

2024-07 · 6 min · 1255 words
[Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models 🔗](https://arxiv.org/abs/2406.12572)

Are You Smarter Than a 3rd Grader? Why LLMs Fail the Mathador Challenge

In the rapidly evolving landscape of Artificial Intelligence, we have become accustomed to headlines declaring that Large Language Models (LLMs) have conquered yet another human milestone. We see models acing the Bar Exam, performing at graduate levels in physics, and solving complex code challenges. If you look at popular leaderboards, it seems we are approaching a saturation point where AI capabilities match or exceed specialized human performance. But there is a catch. ...

2024-06 · 8 min · 1608 words
[MatchTime: Towards Automatic Soccer Game Commentary Generation 🔗](https://arxiv.org/abs/2406.18530)

Why AI Commentators Struggle (and How 'MatchTime' Fixes It)

Imagine watching a soccer match where the commentator screams “Goal!” two minutes after the ball has already hit the net. It would be disorienting, annoying, and largely useless. Yet, this is the precise problem plaguing Artificial Intelligence when we try to teach it to understand sports. For years, researchers have been trying to build systems that can automatically narrate sports videos. The potential is immense: from automated highlights to assisting visually impaired fans. However, current models often fail to sound professional or accurate. ...

2024-06 · 8 min · 1619 words
[Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences 🔗](https://arxiv.org/abs/2504.00473)

RoSE: How LLMs Can Self-Improve Through Orchestrated Streaming Experiences

Imagine a student preparing for a difficult mathematics exam. They don’t just memorize formulas; they work through practice problems. When they solve a problem correctly, they remember the logic they used. Later, when they face a similar but new problem, they recall that successful logic to guide them. This process—accumulating experiences, filtering out the mistakes, and recalling the most relevant and complex solutions—is fundamental to human learning. However, Large Language Models (LLMs) typically lack this dynamic “experiential” capability: in standard deployment they are static. You prompt them, they answer, and the interaction ends. If they solve a problem brilliantly, that “thought process” usually evaporates once the session closes. ...
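
A toy picture of such an experience store, not RoSE's actual orchestration (`embed` is any text-to-vector function you supply): successful solutions are kept, and the most similar past experiences are retrieved as in-context demonstrations for new questions.

```python
import numpy as np

class ExperiencePool:
    """Store successful (question, solution) pairs for later retrieval."""

    def __init__(self, embed):
        self.embed = embed     # text -> 1-D numpy vector
        self.items = []        # (vector, question, solution)

    def add(self, question, solution):
        """Keep an experience only after its solution was verified correct."""
        self.items.append((self.embed(question), question, solution))

    def retrieve(self, question, k=2):
        """Return the k most cosine-similar past experiences."""
        q = self.embed(question)
        scored = sorted(
            self.items,
            key=lambda it: -float(
                np.dot(q, it[0])
                / (np.linalg.norm(q) * np.linalg.norm(it[0]) + 1e-9)
            ),
        )
        return [(ques, sol) for _, ques, sol in scored[:k]]
```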

2025-04 · 10 min · 1976 words
[Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training 🔗](https://arxiv.org/abs/2406.17404)

Speeding Up LLMs for Free: The "Make Some Noise" Training Framework

The capabilities of Large Language Models (LLMs) like GPT-4 and LLaMA have revolutionized artificial intelligence. However, if you have ever watched an LLM generate a response, you have likely noticed a fundamental bottleneck: the text appears one word at a time, like a slow typist. This sluggishness is due to the Auto-Regressive (AR) decoding paradigm. To generate the 100th token, the model strictly needs the previous 99 tokens. This sequential dependency prevents parallel processing during generation, leaving powerful GPUs idling while they wait for the next token to be decided. ...
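
To make the bottleneck concrete, here is vanilla greedy decoding, assuming a Hugging Face-style model whose output exposes `.logits`: each new token costs a full forward pass that cannot start until the previous token exists.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=50):
    """Plain auto-regressive decoding: one forward pass per token.

    Token t+1 cannot be computed until token t exists, which is the
    sequential dependency that parallel (multi-token) drafting attacks.
    """
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                      # (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```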

2024-06 · 10 min · 2027 words
[Major Entity Identification: A Generalizable Alternative to Coreference Resolution 🔗](https://aclanthology.org/2024.emnlp-main.652.pdf)

Why We Should Stop Clustering and Start Identifying: A New Approach to Coreference

Imagine you are analyzing the novel Aladdin. You want to track every time the text refers to the protagonist, whether by his name (“Aladdin”), a nominal phrase (“the boy”), or a pronoun (“he”). In Natural Language Processing (NLP), this is classically handled by Coreference Resolution (CR). The goal of CR is to find every mention in a text and cluster them together based on which entity they refer to. It sounds straightforward, but in practice, CR is notoriously difficult. Models trained on news articles often fail spectacularly when applied to literature or medical texts. Why? Because they get bogged down trying to resolve everything, including insignificant background characters or abstract concepts, often disagreeing on what even counts as a “mention.” ...

7 min · 1439 words