[NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning 🔗](https://arxiv.org/abs/2404.00459)

Why LLMs Can't Count: Fixing Numerical Reasoning with NumeroLogic

It is one of the great ironies of modern Artificial Intelligence: a Large Language Model (LLM) like GPT-4 can write a sonnet in the style of Shakespeare, debug complex Python code, and pass the Bar exam, yet it often stumbles when asked to multiply two three-digit numbers. For students and researchers exploring the architecture of Transformers, this behavior can be baffling. Computers are, at their core, calculators. Why is the most advanced “computer brain” we’ve ever built so bad at basic arithmetic? ...

2024-04 · 8 min · 1670 words
[Null-Shot Prompting: Rethinking Prompting Large Language Models With Hallucination 🔗](https://aclanthology.org/2024.emnlp-main.740.pdf)

The Pinocchio Strategy: Boosting LLM Performance by Encouraging Hallucination

In the world of Large Language Models (LLMs), “hallucination” is usually a dirty word. It refers to the moment an AI confidently asserts that the moon is made of green cheese or invents a historical event that never happened. Researchers spend millions of dollars and countless hours trying to stop models from hallucinating. But what if hallucination isn’t just a bug? What if it’s a feature that, when manipulated correctly, can actually make a model smarter? ...

8 min · 1610 words
[NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data 🔗](https://arxiv.org/abs/2402.15343)

How to Train a Tiny NER Model to Rival LLMs: Inside NuNER

Named Entity Recognition (NER) is one of the bread-and-butter tasks of Natural Language Processing. Whether it is extracting stock tickers from financial news, identifying proteins in biomedical papers, or parsing dates from legal contracts, NER is everywhere. For years, the standard workflow for building a custom NER model has been rigid: take a pre-trained foundation model like BERT or RoBERTa, hire humans to annotate thousands of examples for your specific entities, and fine-tune the model. This process is slow, expensive, and inflexible. ...

2024-02 · 8 min · 1505 words
[Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment 🔗](https://arxiv.org/abs/2406.12606)

Less is More: Why Pruning Neurons Improves LLM Alignment

Since the transformer architecture burst onto the scene with the famous paper “Attention Is All You Need,” the philosophy in Deep Learning has often leaned towards “more is better.” More data, more layers, more parameters. However, when it comes to alignment—the process of ensuring Large Language Models (LLMs) are helpful, honest, and harmless—it turns out that using everything might actually be the problem. In a fascinating research paper titled “Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment,” researchers from Renmin University of China and Beihang University challenge the status quo. They propose a counter-intuitive idea: by identifying and training only the most relevant neurons (and ignoring the rest), we can align models better, faster, and more effectively than by updating every single parameter. ...

2024-06 · 8 min · 1548 words
[Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation 🔗](https://arxiv.org/abs/2404.06809)

Trust Issues in AI: How Credibility-Aware Generation Fixes RAG's Biggest Flaw

Retrieval-Augmented Generation (RAG) has become the de facto standard for building knowledgeable AI systems. By connecting Large Language Models (LLMs) to external databases, we promised to solve the twin problems of hallucinations and knowledge cutoffs. The logic was simple: if the model doesn’t know the answer, let it look it up. But there is a flaw in this logic. RAG systems operate on a dangerous assumption: that everything retrieved is true. ...

2024-04 · 8 min · 1541 words
[NOISEBENCH: Benchmarking the Impact of Real Label Noise on Named Entity Recognition 🔗](https://arxiv.org/abs/2405.07609)

Why Your Model Believes Lies: The Reality of Label Noise in NER

In the world of supervised machine learning, we often operate under a comfortable assumption: the “Ground Truth” is actually true. We assume our training datasets—painstakingly annotated by humans or scraped from reliable sources—are accurate. But anyone who has looked closely at a large dataset knows this is a myth. Datasets are messy. They contain mistakes, inconsistencies, and what researchers call label noise. In Named Entity Recognition (NER), where models must identify and classify proper names (like organizations, locations, or persons) in text, this noise can be particularly damaging. If a training set mislabels “Apple” as a Location instead of an Organization, the model learns a false pattern. ...

2024-05 · 8 min · 1620 words
[Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature 🔗](https://aclanthology.org/2024.emnlp-main.196.pdf)

Listening to the Past: How AI Reveals the Soundscapes of 19th Century Literature

When we think about history, we usually visualize it. We picture the sepia-toned photographs of the late 19th century, the industrial smog of growing cities, or the fashion of the Victorian era. But have you ever stopped to wonder what the past sounded like? Before the advent of recording technology, the auditory world was ephemeral. We cannot listen to a street corner in Copenhagen in 1880. However, we have “earwitnesses”—the authors who lived through those times and documented their sensory environments in literature. The novels of the Scandinavian “Modern Breakthrough” (1870–1899) are filled with the clatter of horse-drawn carriages, the hiss of new steam engines, and the murmurs of urban crowds. ...

9 min · 1835 words
[No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages 🔗](https://arxiv.org/abs/2411.03769)

Can AI Feel Art? Teaching Vision Models to Understand Culture in 28 Languages

In the world of Artificial Intelligence, Computer Vision has historically been obsessed with objectivity. Show a model a picture of a park, and it will dutifully report: “A dog running on green grass.” This is impressive, but it misses a fundamental layer of human experience: subjectivity and emotion. When we look at a painting—say, Starry Night—we don’t just see “yellow circles on a blue background.” We feel awe, melancholy, or excitement. ...

2024-11 · 7 min · 1446 words
[Neuron-Level Knowledge Attribution in Large Language Models 🔗](https://arxiv.org/abs/2312.12141)

Inside the Black Box - Mapping Knowledge Neurons in LLMs

Large Language Models (LLMs) like GPT-4 and Llama have demonstrated a remarkable ability to store and recall factual knowledge. When you ask an LLM, “What is the capital of France?”, it effortlessly retrieves “Paris.” But where exactly does this information live? Is “Paris” stored in a specific cluster of neurons? And if so, how does the model know when to activate them? ...

2023-12 · 9 min · 1717 words
[Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation 🔗](https://arxiv.org/abs/2404.11201)

Neuron Specialization: Unlocking the Intrinsic Modularity of Multilingual Models

The dream of a “universal translator”—a single AI model capable of fluently speaking dozens, if not hundreds, of languages—is one of the Holy Grails of Natural Language Processing (NLP). Companies and researchers are racing to build massive multilingual models that can translate English to French, Chinese to Swahili, and everything in between. But there is a hidden conflict inside these models. When you force one neural network to learn thirty different languages, the languages often fight for “brain space.” This phenomenon is known as negative interference. High-resource languages (like English or German) tend to dominate the model’s parameters, causing performance to drop for low-resource languages. At the same time, optimizing for too many tasks can degrade performance on the main tasks compared to specialized, single-language models. ...

2024-04 · 8 min · 1589 words
[NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries 🔗](https://aclanthology.org/2024.emnlp-main.1050.pdf)

Unlocking the Brain: How AI and a New Dataset Are Decoding Clinical Trials

Developing new drugs is notoriously difficult, but nowhere is the struggle more apparent than in neurology. The failure rate for Alzheimer’s disease clinical trials, for instance, has historically hovered above 99%. Billions of dollars and decades of research often end without a viable cure. However, even failed trials contain a goldmine of data. Every trial registered represents a hypothesis, a methodology, and a specific intervention tested on a specific population. ...

8 min · 1555 words
[Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent 🔗](https://arxiv.org/abs/2402.13717)

Meet Neeko: The Shapeshifting AI That Masters Multi-Character Role-Playing

Imagine having a conversation with Harry Potter about his first Quidditch match, and then, without switching apps or reloading a model, turning to Lord Voldemort to discuss the Dark Arts. While Large Language Models (LLMs) like ChatGPT have mastered open-domain chat, making them truly “stay in character”—especially multiple different characters—remains a significant hurdle. Current role-playing agents (RPAs) usually face a dilemma. They either rely on prompt engineering (telling the model “Act like X”), which often breaks character over long conversations, or they require training a completely separate model for every single character, which is computationally expensive and inefficient. Furthermore, what happens when you want to add a new character? Usually, you have to retrain the whole system, risking “catastrophic forgetting”—where the model learns the new role but forgets how to play the old ones. ...

2024-02 · 8 min · 1570 words
[Nash CoT: Multi-Path Inference with Preference Equilibrium 🔗](https://arxiv.org/abs/2407.07099)

Game Theory Meets LLMs: How Nash CoT Optimizes Reasoning

In the rapidly evolving landscape of Large Language Models (LLMs), a recurring challenge persists: how do we make models “think” better without breaking the bank? We know that LLMs are capable of impressive feats, but they often stumble on complex reasoning tasks involving math, logic, or symbolic manipulation. To counter this, researchers developed Chain-of-Thought (CoT) prompting—asking the model to “think step by step.” To make this even more robust, we often use Self-Consistency, where we ask the model the same question multiple times (multi-path inference) and vote for the most common answer. ...
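To make the voting baseline concrete, here is a minimal sketch of Self-Consistency exactly as the excerpt describes it: sample several reasoning paths for the same question and keep the majority answer. The `sample_answer` stub is a hypothetical stand-in for a chain-of-thought LLM call; Nash CoT’s preference-equilibrium step is not shown here.

```python
# Minimal sketch of Self-Consistency (multi-path inference + majority vote).
# `sample_answer` is a hypothetical placeholder for one chain-of-thought call.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Placeholder: a real system would sample a reasoning chain from an LLM
    # and extract its final answer string.
    return random.choice(["42", "42", "41"])

def self_consistency(question: str, n_paths: int = 10) -> str:
    answers = [sample_answer(question) for _ in range(n_paths)]
    # The most common final answer across the sampled paths wins.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```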

2024-07 · 7 min · 1429 words
[NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian 🔗](https://arxiv.org/abs/2312.01314)

Can AI Speak 'Norwegian'? Building Generative Models for Low-Resource Languages

If you follow the current trajectory of Artificial Intelligence, you might assume that Large Language Models (LLMs) have solved natural language. Models like GPT-4 can write poetry, code in Python, and summarize legal documents with ease. However, there is a hidden disparity in the AI landscape: the dominance of English. While English-centric models flourish, languages with fewer speakers—and consequently less digitized training data—are often left behind. This category, known as Low-Resource Languages (LRLs), includes Norwegian, which is spoken by only about 5 million people. When we test mainstream models on these languages, we often find that translation is not the same as comprehension. A model might translate words correctly but fail spectacularly at understanding cultural nuance or local context. ...

2023-12 · 7 min · 1487 words
[Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing 🔗](https://arxiv.org/abs/2406.11085)

Saving Languages with AI: How LLMs and Translations Boost Low-Resource Glossing

Imagine being a linguist trying to document a language that only a few dozen people on Earth still speak. The clock is ticking. Estimates suggest that up to 90% of the world’s languages are at risk of disappearing within the next century. Preserving them isn’t just about recording audio; it involves a painstaking process of producing Interlinear Glossed Text (IGT). This requires transcribing speech, translating it, segmenting words into their smallest meaning-bearing units (morphemes), and tagging each one grammatically. ...
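For readers new to the format, here is an illustrative IGT record with the four layers the excerpt lists (transcription, morpheme segmentation, grammatical glosses, free translation). The Swahili sentence and tag labels are a standard textbook-style illustration, not taken from the paper's data.

```python
# Illustrative Interlinear Glossed Text (IGT) record with the layers
# described above. The example is for illustration only.
igt_record = {
    "transcription": "ninakupenda",          # surface text
    "morphemes":     "ni-na-ku-pend-a",      # segmentation into morphemes
    "glosses":       "1SG-PRS-2SG-love-FV",  # one grammatical tag per morpheme
    "translation":   "I love you",           # free translation
}

# Automatic glossing aims to predict the "glosses" line from the other layers.
for morpheme, gloss in zip(igt_record["morphemes"].split("-"),
                           igt_record["glosses"].split("-")):
    print(f"{morpheme}\t{gloss}")
```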

2024-06 · 8 min · 1596 words
[Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model 🔗](https://arxiv.org/abs/2407.07053)

Why AI Can't Read Clocks: Solving the Abstract Image Gap with Synthetic Data

We are currently living in the golden age of Large Multimodal Models (LMMs). Models like GPT-4V and Claude-3 have demonstrated astonishing capabilities: they can describe a complex photograph of a busy street, explain a meme, or identify the breed of a dog from a blurry picture. To the casual observer, it seems like the problem of “computer vision” is largely solved. However, a peculiar paradox has emerged. While these models can interpret complex natural scenes, they often stumble over tasks that a human child would find trivial. Ask a state-of-the-art model to read the time from a simple analog clock, navigate a 2D floor plan, or interpret the flow of logic in a basic chart, and you might witness a surprising failure. ...

2024-07 · 9 min · 1817 words
[Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference 🔗](https://arxiv.org/abs/2410.07673)

Unmasking the Chameleon: How Causal Inference Detects Evolving Clickbait

We have all been there. You are scrolling through your social media feed, and you see an image of a celebrity paired with a shocking headline: “You Won’t Believe What Happened to Emma Watson!” Curiosity gets the better of you. You click. The resulting article, however, has nothing to do with the headline. It is a generic piece of content, perhaps a slide show of unrelated advertisements. You have been “baited.” ...

2024-10 · 8 min · 1640 words
[Multilingual Topic Classification in X: Dataset and Analysis 🔗](https://arxiv.org/abs/2410.03075)

Breaking Language Barriers: Inside X-Topic, the New Benchmark for Multilingual Social Media Classification

Social media platforms like X (formerly Twitter) are the modern world’s town squares. They are where news breaks, trends are born, and daily lives are documented. However, this town square is global, chaotic, and incredibly noisy. For researchers, data scientists, and companies, making sense of this data—organizing it into coherent topics—is a massive challenge. While we have decent tools for classifying English content, the rest of the world is often left behind. Traditional methods struggle with the linguistic diversity of global platforms, and existing datasets are often limited to specific domains like news or lack the informal nuances of social media text. ...

2024-10 · 10 min · 1937 words
[Multi-pass Decoding for Grammatical Error Correction 🔗](https://aclanthology.org/2024.emnlp-main.553.pdf)

Iterative Refinement in NLP: How Multi-Pass Decoding and Source Fusion Boost Grammatical Error Correction

Grammatical Error Correction (GEC) is one of the most practical applications of Natural Language Processing. Whether it’s a student polishing an essay or a professional drafting an email, we rely on these systems to fix syntax, spelling, and fluency errors. For years, the field has been dominated by two main approaches. First, we have Sequence-to-Edit (seq2edit) models, which treat the problem like a tagging task—labelling words to be deleted, kept, or inserted. Second, we have Sequence-to-Sequence (seq2seq) models, which treat error correction like a translation task: “translating” bad grammar into good grammar. ...
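As a rough illustration of the seq2edit framing described above, the sketch below applies per-token edit tags (keep, delete, append) to a source sentence. The tag names and the `apply_edits` helper are invented for illustration; real taggers use richer edit inventories and a learned model to predict the tags.

```python
# Minimal sketch of the sequence-to-edit idea: label each source token with
# an edit operation, then apply the edits to get the corrected sentence.
def apply_edits(tokens: list[str], tags: list[str]) -> str:
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(token)
        elif tag == "DELETE":
            continue  # drop the erroneous token
        elif tag.startswith("APPEND_"):
            out.append(token)
            out.append(tag.removeprefix("APPEND_"))  # insert a word after it
    return " ".join(out)

tokens = ["I", "am", "agree", "with", "you", "and", "want", "apple", "."]
tags   = ["KEEP", "DELETE", "KEEP", "KEEP", "KEEP",
          "KEEP", "APPEND_an", "KEEP", "KEEP"]
print(apply_edits(tokens, tags))  # -> "I agree with you and want an apple ."
```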

8 min · 1651 words
[Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models 🔗](https://arxiv.org/abs/2411.00492)

Wisdom of the Artificial Crowd: How Multi-Expert Prompting Fixes LLM Hallucinations

We often treat Large Language Models (LLMs) like omniscient oracles. We type a question into ChatGPT or Claude, and we expect a single, authoritative, and correct answer. But under the hood, these models are probabilistic engines. When you ask an open-ended question—like “Is it ethical to eat meat?” or “How should we solve climate change?”—the model often defaults to the most likely continuation based on its training data. This can lead to generic, one-sided, or even biased answers. ...

2024-11 · 9 min · 1706 words