[ABLE: Personalized Disability Support with Politeness and Empathy Integration 🔗](https://aclanthology.org/2024.emnlp-main.1252.pdf)

Beyond Generic Chatbots: How ABLE Uses RL to Bring Empathy and Personalization to Disability Support

Imagine navigating a world not designed for you. For the over one billion people globally living with some form of physical disability, this is a daily reality. Whether it’s finding wheelchair-accessible housing, managing chronic pain, or dealing with the social isolation that often accompanies physical limitations, the need for reliable support is massive. In the age of AI, conversational agents (chatbots) offer a promising solution. They are available 24/7 and can provide immediate information. However, there is a glaring problem with most current systems: they are robotic. If a user expresses frustration about losing mobility, a standard chatbot might output a sterile list of medical definitions. It lacks the “human” touch—the ability to understand the user’s specific personality, age, and gender, and to respond with genuine empathy and politeness. ...

8 min · 1557 words
[A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks 🔗](https://arxiv.org/abs/2501.17569)

Beyond the Leaderboard: Why Large Language Models Still Fail at Reading Comprehension

In the fast-paced world of Natural Language Processing (NLP), we are often dazzled by the sheer scale of modern models. From GPT-4 to LLaMA, the headlines focus on parameter counts—billions, even trillions—and their dominance on standardized leaderboards. But there is a quiet, persistent problem in the field: the “black box” nature of evaluation. We know that models fail. We see them hallucinate, miss obvious details, or misinterpret simple questions. However, looking at a global accuracy score on a benchmark like SQuAD or SuperGLUE doesn’t tell us why they fail. Is it the syntax? Is it the vocabulary? Is it the ambiguity of meaning? ...

2025-01 · 8 min · 1632 words
[A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models 🔗](https://arxiv.org/abs/2404.13940)

Beyond Standard Tests: Evaluating LLMs Based on What Users Actually Want

Imagine a student who aces every written exam in history, mathematics, and computer science but struggles to hold a conversation, offer advice to a friend, or brainstorm a creative gift idea. In the world of Artificial Intelligence, this is a common paradox. We have Large Language Models (LLMs) that score near-perfect marks on standardized tests like the Bar Exam or math Olympiad questions, yet they sometimes fail to satisfy simple, messy, real-world user requests. ...

2024-04 · 8 min · 1518 words
[A Usage-centric Take on Intent Understanding in E-Commerce 🔗](https://arxiv.org/abs/2402.14901)

Beyond "Customers Who Bought This": Unlocking True User Intent in E-Commerce

Have you ever searched for a “camping stove” online, added it to your cart, and then been flooded with recommendations for… more camping stoves? While modern e-commerce recommendation systems are incredibly powerful, they often suffer from a fundamental misunderstanding of why a user is shopping. They excel at identifying product similarity (“You liked this stove, here is another stove”) or co-buying patterns (“People who bought this stove also bought this fuel canister”). However, they struggle to capture the broader User Intent. ...

2024-02 · 8 min · 1574 words
[A Two-Step Approach for Data-Efficient French Pronunciation Learning 🔗](https://arxiv.org/abs/2410.05698)

Decoding the French Flow - A Data-Efficient Approach to Pronunciation Learning

If you have ever tried to learn French, you have likely encountered a specific frustration. You learn a word, you memorize how to pronounce it, and then you hear a native speaker say it in a sentence, and it sounds completely different. This is not just a problem for language learners; it is a massive headache for Text-to-Speech (TTS) systems. While modern TTS systems are incredibly advanced, achieving natural-sounding speech in French requires mastering complex phonological rules where words blend into one another. ...

2024-10 · 9 min · 1745 words
[A Thorough Examination of Decoding Methods in the Era of LLMs 🔗](https://arxiv.org/abs/2402.06925)

Cracking the Code - A Deep Dive into LLM Decoding Methods

When we interact with Large Language Models (LLMs) like ChatGPT or Llama, we often view them as magic boxes: we input a prompt, and a coherent response appears. However, under the hood, these models are fundamentally next-token predictors. They output a probability distribution over a vocabulary of tens of thousands of tokens. The process of converting these probabilities into the fluent text we read is called decoding. For a long time, the community operated on “folk wisdom” derived from older, task-specific models (like those used for translation in 2018). But do these rules apply to the massive, general-purpose LLMs of today? A recent paper, “A Thorough Examination of Decoding Methods in the Era of LLMs,” provides a comprehensive answer. ...
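
To make the framing concrete, here is a minimal sketch of the two decoding families such a study compares: deterministic (greedy) versus stochastic (temperature sampling). The tiny five-token vocabulary and the numbers below are illustrative placeholders, not anything taken from the paper.

```python
import numpy as np

# Toy next-token distribution over a hypothetical five-token vocabulary
# (real LLMs produce logits over tens of thousands of tokens).
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.1, 0.3, 1.7, 0.9, -0.5])

def greedy(logits):
    """Deterministic decoding: always pick the highest-probability token."""
    return vocab[int(np.argmax(logits))]

def sample(logits, temperature=1.0, seed=0):
    """Stochastic decoding: sample from the softmax, sharpened (<1.0)
    or flattened (>1.0) by the temperature."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return vocab[rng.choice(len(vocab), p=probs)]

print(greedy(logits))                   # always "the"
print(sample(logits, temperature=0.7))  # usually "the", occasionally "sat"
```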

2024-02 · 1 min · 187 words
[A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations 🔗](https://arxiv.org/abs/2407.04069)

The Wild West of LLM Evaluation: Why Your Benchmarks Might Be Wrong (and How to Fix Them)

We are living in the golden age of Large Language Models (LLMs). Every week, a new model drops—claiming to be faster, smarter, and more capable than its predecessors. We see charts with ever-taller bars, ever-higher numbers, and claims of “state-of-the-art” performance on benchmarks like MMLU or HumanEval. But here is the uncomfortable question: Can we actually trust these numbers? Evaluating an LLM is not like compiling code; it is not a binary pass/fail. It is a complex, nuanced process fraught with invisible pitfalls. If you change the way you phrase a question slightly, the model’s score might plummet. If the model was accidentally trained on the test questions (data contamination), it’s cheating without knowing it. If the code used to grade the model’s answers is buggy, the leaderboard ranking is meaningless. ...

2024-07 · 8 min · 1498 words
[A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences 🔗](https://arxiv.org/abs/2406.11341)

Can AI Actually Reason? Inside the Logic of Large Language Models

Since the arrival of Transformers and Large Language Models (LLMs) like GPT-4 and LLaMA, a singular question has dominated the field of Natural Language Processing (NLP): Do these models actually reason, or are they just sophisticated parrots? We know LLMs are incredibly proficient at language. They can write poetry, summarize emails, and even generate code. But proficiency in language does not automatically equate to proficiency in logic. When a human solves a math problem or a logic puzzle, they are (usually) applying strict deductive rules. When an LLM does it, is it doing the same? Or is it acting as a “soft reasoner”—emulating the appearance of reasoning based on statistical patterns and surface-level semantics? ...

2024-06 · 9 min · 1705 words
[A Survey on In-context Learning 🔗](https://arxiv.org/abs/2301.00234)

Mastering In-Context Learning: How LLMs Learn from Analogy

In recent years, the field of Natural Language Processing (NLP) has undergone a paradigm shift. We have moved from training specific models for specific tasks to utilizing massive, general-purpose Large Language Models (LLMs) like GPT-4 and Llama. What makes these models truly revolutionary is not just their size, but their ability to learn a new task simply by looking at a few examples, without any update to their internal parameters. This phenomenon is known as In-Context Learning (ICL). ...
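
As a concrete picture of what learning from a few examples without any parameter update looks like in practice, here is a minimal few-shot prompt sketch; the task, demonstrations, and expected continuation are hypothetical illustrations, not drawn from the survey.

```python
# In-context learning in miniature: the "training data" lives entirely in
# the prompt; the model's weights are never updated.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It broke after two days and support never replied.
Sentiment: Negative

Review: Honestly the best purchase I have made this year.
Sentiment:"""

# A model conditioned on this prompt is expected to continue with
# " Positive" purely by analogy with the two demonstrations above.
print(few_shot_prompt)
```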

2023-01 · 10 min · 1979 words
[A Survey of Ontology Expansion for Conversational Understanding 🔗](https://arxiv.org/abs/2410.15019)

Beyond Scripted Chatbots: How AI is Learning to Expand Its Own Universe

If you have ever interacted with a customer service chatbot, you have likely hit a wall. You ask a question, perhaps phrased slightly differently than the bot expects, or about a topic that feels relevant but is technically “new,” and you get the dreaded response: “I’m sorry, I didn’t understand that.” This limitation stems from a fundamental design choice in traditional conversational AI: the static ontology. Most systems are built on a pre-defined list of things they can understand. If a user’s request falls outside that list, the system fails. But the real world is dynamic. New products launch, new terminologies emerge (think “social distancing” or specific vaccine brand names in 2020), and user needs evolve rapidly. ...

2024-10 · 12 min · 2402 words
[A Survey of AMR Applications 🔗](https://aclanthology.org/2024.emnlp-main.390.pdf)

Beyond Black Boxes: How Abstract Meaning Representation is Reshaping NLP

In the current era of Natural Language Processing (NLP), Large Language Models (LLMs) often feel like magic. You feed in a sentence, and out comes a translation, a summary, or a poem. However, despite their prowess, these neural models remain “black boxes.” They rely on statistical probabilities rather than explicit understanding, which can lead to hallucinations, logical inconsistencies, or a lack of interpretability. This brings us to a pivotal question: How do we inject structure, logic, and explicit meaning into these systems? ...

8 min · 1604 words
[A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers 🔗](https://arxiv.org/abs/2407.01834)

Does Your Name Make AI Angry? Understanding Nationality Bias and Perplexity in NLP Models

In the modern digital landscape, Artificial Intelligence is no longer just a futuristic concept; it is an active gatekeeper. Algorithms decide which comments on social media are flagged as “hate speech,” which customer service tickets are prioritized based on “sentiment,” and sometimes even which resumes get passed to a human recruiter. But what if these gatekeepers are xenophobic? What if a simple change of a name—from “John” to “Ahmed” or “Santiago”—drastically alters how an AI interprets the exact same sentence? ...

2024-07 · 9 min · 1825 words
[A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models 🔗](https://arxiv.org/abs/2410.04027)

Back to Basics: Correcting Chinese Spelling Errors Without Training or Prompts

Have you ever typed a message in a hurry, hit send, and then realized your phone’s autocorrect turned a heartfelt compliment into complete nonsense? In English, spelling errors are usually a matter of letter order. In Chinese, however, the problem is far more complex due to the nature of the language. Because Chinese input methods rely heavily on Pinyin (phonetic typing), a simple slip of the finger or a similar-sounding syllable can result in a completely different character with a radically different meaning. ...

2024-10 · 9 min · 1866 words
[A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression 🔗](https://aclanthology.org/2024.emnlp-main.1027.pdf)

Less is More: Compressing LLM Memory with Simple L2 Norms

The race for larger context windows in Large Language Models (LLMs) is one of the most exciting developments in AI. We have moved rapidly from models that could barely remember a few paragraphs to systems like GPT-4 and Gemini 1.5 that can process entire novels, codebases, or legal contracts in a single prompt. However, this capability comes with a massive computational cost, and the bottleneck is often memory: specifically, the Key-Value (KV) cache. ...
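
The excerpt cuts off before the method itself, but the title gives away the core idea: score cached entries by the L2 norm of their key vectors and keep only a fixed budget. The sketch below is a rough reconstruction under that assumption; the per-head cache layout, the budget parameter, and the choice to retain the lowest-norm keys are simplifications for illustration, not the paper’s exact procedure.

```python
import numpy as np

def compress_kv_cache(keys, values, budget):
    """Illustrative L2-norm-based KV cache compression for one attention head.

    keys, values: arrays of shape (seq_len, head_dim).
    budget: number of cached entries to keep.

    Assumption (inferred from the title, not guaranteed to match the paper):
    entries are ranked by the L2 norm of their key vector, and the
    lowest-norm keys are retained.
    """
    norms = np.linalg.norm(keys, axis=-1)       # one score per cached token
    keep = np.sort(np.argsort(norms)[:budget])  # lowest-norm keys, original order
    return keys[keep], values[keep]

# Toy usage: a 12-token cache squeezed down to 4 entries.
rng = np.random.default_rng(0)
k = rng.normal(size=(12, 8))
v = rng.normal(size=(12, 8))
k_small, v_small = compress_kv_cache(k, v, budget=4)
print(k_small.shape, v_small.shape)  # (4, 8) (4, 8)
```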

9 min · 1842 words
[A Simple LLM Framework for Long-Range Video Question-Answering 🔗](https://arxiv.org/abs/2312.17235)

LLoVi: Solving Long-Range Video Understanding by Reading the Movie

Imagine you are watching a three-minute video of someone assembling a complex piece of furniture. At the end, I ask you: “What was the very first tool the person picked up?” To answer this, you have to remember the beginning of the video, understand the sequence of actions, and identify the object. For humans, this is trivial. For AI, it is notoriously difficult. While computer vision has mastered short clips (5–10 seconds), understanding “long-range” videos (minutes or hours) remains a significant hurdle. Traditional methods try to cram massive amounts of visual data into memory banks or complex spatiotemporal graphs, often hitting computational bottlenecks. ...

2023-12 · 7 min · 1480 words
[A SMART Mnemonic Sounds like 'Glue Tonic': Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick 🔗](https://arxiv.org/abs/2406.15352)

Making Learning Stick: How SMART Aligns LLMs with Student Feedback to Create Better Mnemonics

Vocabulary acquisition is often the bane of a student’s existence. Whether preparing for the GRE, learning a new language, or mastering medical terminology, the sheer volume of new terms can be overwhelming. Cognitive science has long offered a solution: keyword mnemonics. These are memorable verbal links that connect a new, complex term to a simpler, familiar keyword, followed by an explanation that bridges the two. For example, to learn the word Benevolent (meaning kind), you might link it to Benefit. The explanation? “A boss who gives their employees benefits is kind—or benevolent.” ...

2024-06 · 10 min · 2016 words
[A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners 🔗](https://arxiv.org/abs/2406.11050)

The Illusion of Logic: Why LLMs Fail When We Change the Names

Large Language Models (LLMs) like GPT-4 and Claude 3 have dazzled the world with their ability to write code, compose poetry, and solve complex problems. When we see an LLM answer a classic riddle or a logic puzzle correctly, it is tempting to attribute human-like reasoning capabilities to the machine. We assume the model “understands” the logic. But what if that understanding is brittle? What if the model isn’t solving the logic puzzle, but rather recognizing the specific words—the tokens—used in the puzzle? ...

2024-06 · 7 min · 1384 words
[A New Pipeline for Knowledge Graph Reasoning Enhanced by Large Language Models Without Fine-Tuning 🔗](https://aclanthology.org/2024.emnlp-main.81.pdf)

Bridging the Gap: How to Enhance Knowledge Graph Reasoning with LLMs Without Fine-Tuning

In the evolving landscape of Artificial Intelligence, we often find ourselves managing two distinct types of “brains.” On one side, we have Knowledge Graphs (KGs). These are structured, logical databases that map the world into entities (nodes) and relations (edges). They are precise and factual but often brittle; if a connection is missing, the system fails to see the relationship. On the other side, we have Large Language Models (LLMs) like GPT-4 or Llama 3. These possess vast, general-world knowledge and can generate human-like text, but they can be prone to “hallucinations” and are computationally expensive to update or fine-tune. ...
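
The brittleness described here is easy to see with a toy graph: a purely symbolic lookup only returns edges that were explicitly stored. The entities and relations below are hypothetical, chosen just to illustrate the failure mode that pairing KGs with an LLM is meant to address.

```python
# A toy knowledge graph as (head, relation, tail) triples - all hypothetical.
triples = {
    ("Marie_Curie", "born_in", "Warsaw"),
    ("Warsaw", "located_in", "Poland"),
}

def query(head, relation):
    """Exact-match lookup: the graph only 'knows' edges that are stored."""
    return [t for (h, r, t) in triples if h == head and r == relation]

print(query("Marie_Curie", "born_in"))          # ['Warsaw']
# The two stored facts imply the answer, but no ("Marie_Curie",
# "born_in_country", ...) edge exists, so the symbolic lookup returns nothing.
print(query("Marie_Curie", "born_in_country"))  # []
```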

10 min · 1932 words
[A Multi-Perspective Analysis of Memorization in Large Language Models 🔗](https://arxiv.org/abs/2405.11577)

Inside the Black Box: How and Why LLMs Memorize Training Data

Large Language Models (LLMs) like GPT-4 or LLaMA are often described as having “emergent abilities”—capabilities that appear as the models scale in size. One of the most fascinating, and controversial, of these behaviors is memorization. Memorization occurs when an LLM generates content verbatim from its training data. On one hand, this allows models to act as vast knowledge bases, recalling historical facts or coding syntax. On the other hand, it poses significant privacy and copyright risks. If a model was trained on sensitive personal data or copyrighted books, eliciting that exact text is a major vulnerability. ...

2024-05 · 8 min · 1559 words
[A Morphology-Based Investigation of Positional Encodings 🔗](https://arxiv.org/abs/2404.04530)

Do All Languages Need Word Order? Why Transformers Might Be Over-Engineering Grammar

If you read the sentence “The dog bit the man,” you know exactly who is in trouble. If you swap the words to “The man bit the dog,” the meaning flips entirely. This is because English relies heavily on word order to convey meaning. To understand the sentence, you need to know not just what words are present, but where they are sitting. But what if you spoke a language where “The dog bit the man” and “Man dog bit the” meant the exact same thing? ...

2024-04 · 9 min · 1845 words