[Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent 🔗](https://arxiv.org/abs/2402.13717)

Meet Neeko: The Shapeshifting AI That Masters Multi-Character Role-Playing

Imagine having a conversation with Harry Potter about his first Quidditch match, and then, without switching apps or reloading a model, turning to Lord Voldemort to discuss the Dark Arts. While Large Language Models (LLMs) like ChatGPT have mastered open-domain chat, making them truly “stay in character”—especially multiple different characters—remains a significant hurdle. Current role-playing agents (RPAs) usually face a dilemma. They either rely on prompt engineering (telling the model “Act like X”), which often breaks character over long conversations, or they require training a completely separate model for every single character, which is computationally expensive and inefficient. Furthermore, what happens when you want to add a new character? Usually, you have to retrain the whole system, risking “catastrophic forgetting”—where the model learns the new role but forgets how to play the old ones. ...
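
The core trick is easier to see in code. Below is a minimal sketch of the per-character adapter idea, assuming a frozen base model whose LoRA weights can be hot-swapped per role; `BaseModel`, `load_lora`, and the adapter paths are hypothetical placeholders, not Neeko's actual API.

```python
# Sketch of "one LoRA adapter per character": a single frozen backbone
# plus small per-role weight deltas that are swapped in on demand,
# instead of a fully fine-tuned model per character. All names here
# (BaseModel, load_lora, generate) are illustrative placeholders.
class BaseModel:
    def load_lora(self, weights): self.lora = weights
    def generate(self, prompt):   return f"[{self.lora}] {prompt}"

class MultiCharacterAgent:
    def __init__(self, base, adapters):
        self.base, self.adapters, self.active = base, adapters, None

    def speak_as(self, role, prompt):
        if role != self.active:                    # swap adapters, not models
            self.base.load_lora(self.adapters[role])
            self.active = role
        return self.base.generate(prompt)

agent = MultiCharacterAgent(BaseModel(), {"Harry": "harry.lora", "Voldemort": "voldemort.lora"})
print(agent.speak_as("Harry", "Tell me about your first Quidditch match."))
```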

2024-02 · 8 min · 1570 words
[Nash CoT: Multi-Path Inference with Preference Equilibrium 🔗](https://arxiv.org/abs/2407.07099)

Game Theory Meets LLMs: How Nash CoT Optimizes Reasoning

In the rapidly evolving landscape of Large Language Models (LLMs), a recurring challenge persists: how do we make models “think” better without breaking the bank? We know that LLMs are capable of impressive feats, but they often stumble on complex reasoning tasks involving math, logic, or symbolic manipulation. To counter this, researchers developed Chain-of-Thought (CoT) prompting—asking the model to “think step by step.” To make this even more robust, we often use Self-Consistency, where we ask the model the same question multiple times (multi-path inference) and vote for the most common answer. ...
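
The self-consistency baseline that Nash CoT builds on is only a few lines. Here is a minimal sketch, where `sample_answer` is a hypothetical stand-in for one temperature-sampled chain-of-thought call:

```python
import random
from collections import Counter

def self_consistency(sample_answer, question, n_paths=10):
    """Sample several reasoning paths and majority-vote the final answers."""
    answers = [sample_answer(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a temperature > 0 chain-of-thought call: most paths
# agree on the right answer, a minority drift to a wrong one.
sample = lambda q: random.choice(["42", "42", "42", "41"])
print(self_consistency(sample, "What is 6 * 7?"))  # usually "42"
```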

2024-07 · 7 min · 1429 words
[NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian 🔗](https://arxiv.org/abs/2312.01314)

Can AI Speak 'Norwegian'? Building Generative Models for Low-Resource Languages

If you follow the current trajectory of Artificial Intelligence, you might assume that Large Language Models (LLMs) have solved natural language. Models like GPT-4 can write poetry, code in Python, and summarize legal documents with ease. However, there is a hidden disparity in the AI landscape: the dominance of English. While English-centric models flourish, languages with fewer speakers—and consequently less digitized training data—are often left behind. This category, known as Low-Resource Languages (LRLs), includes Norwegian, which is spoken by only about 5 million people. When we test mainstream models on these languages, we often find that translation is not the same as comprehension. A model might translate words correctly but fail spectacularly at understanding cultural nuance or local context. ...

2023-12 · 7 min · 1487 words
[Multiple Sources are Better Than One: Incorporating External Knowledge in Low-Resource Glossing 🔗](https://arxiv.org/abs/2406.11085)

Saving Languages with AI: How LLMs and Translations Boost Low-Resource Glossing

Imagine being a linguist trying to document a language that only a few dozen people on Earth still speak. The clock is ticking. Estimates suggest that up to 90% of the world’s languages are at risk of disappearing within the next century. Preserving them isn’t just about recording audio; it involves a painstaking process called Interlinear Glossed Text (IGT). This requires transcribing speech, translating it, segmenting words into their smallest meaning-bearing units (morphemes), and tagging each one grammatically. ...
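
A schematic IGT record makes the task concrete. The word, morphemes, and grammatical tags below are invented for illustration, not drawn from any particular language's documentation:

```python
# A hypothetical interlinear glossed text (IGT) record: each word is
# segmented into morphemes, and each morpheme receives a gloss label.
igt_line = {
    "transcription": "wapimäw",             # source-language word
    "morphemes":     ["wapim", "-äw"],      # stem + suffix
    "glosses":       ["see",   "3SG.OBJ"],  # meaning-bearing labels
    "translation":   "He sees him.",        # free translation
}

# The glossing task: predict `glosses` from `morphemes`, optionally
# conditioning on the translation as an external knowledge source.
for m, g in zip(igt_line["morphemes"], igt_line["glosses"]):
    print(f"{m:>8} -> {g}")
```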

2024-06 · 8 min · 1596 words
[Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model 🔗](https://arxiv.org/abs/2407.07053)

Why AI Can't Read Clocks: Solving the Abstract Image Gap with Synthetic Data

We are currently living in the golden age of Large Multimodal Models (LMMs). Models like GPT-4V and Claude-3 have demonstrated astonishing capabilities: they can describe a complex photograph of a busy street, explain a meme, or identify the breed of a dog from a blurry picture. To the casual observer, it seems like the problem of “computer vision” is largely solved. However, a peculiar paradox has emerged. While these models can interpret complex natural scenes, they often stumble over tasks that a human child would find trivial. Ask a state-of-the-art model to read the time from a simple analog clock, navigate a 2D floor plan, or interpret the flow of logic in a basic chart, and you might witness a surprising failure. ...

2024-07 · 9 min · 1817 words
[Multimodal Clickbait Detection by De-confounding Biases Using Causal Representation Inference 🔗](https://arxiv.org/abs/2410.07673)

Unmasking the Chameleon: How Causal Inference Detects Evolving Clickbait

We have all been there. You are scrolling through your social media feed, and you see an image of a celebrity paired with a shocking headline: “You Won’t Believe What Happened to Emma Watson!” Curiosity gets the better of you. You click. The resulting article, however, has nothing to do with the headline. It is a generic piece of content, perhaps a slide show of unrelated advertisements. You have been “baited.” ...

2024-10 · 8 min · 1640 words
[Multilingual Topic Classification in X: Dataset and Analysis 🔗](https://arxiv.org/abs/2410.03075)

Breaking Language Barriers: Inside X-Topic, the New Benchmark for Multilingual Social Media Classification

Social media platforms like X (formerly Twitter) are the modern world’s town squares. They are where news breaks, trends are born, and daily lives are documented. However, this town square is global, chaotic, and incredibly noisy. For researchers, data scientists, and companies, making sense of this data—organizing it into coherent topics—is a massive challenge. While we have decent tools for classifying English content, the rest of the world is often left behind. Traditional methods struggle with the linguistic diversity of global platforms, and existing datasets are often limited to specific domains like news or lack the informal nuances of social media text. ...

2024-10 · 10 min · 1937 words
[Multi-pass Decoding for Grammatical Error Correction 🔗](https://aclanthology.org/2024.emnlp-main.553.pdf)

Iterative Refinement in NLP: How Multi-Pass Decoding and Source Fusion Boost Grammatical Error Correction

Grammatical Error Correction (GEC) is one of the most practical applications of Natural Language Processing. Whether it’s a student polishing an essay or a professional drafting an email, we rely on these systems to fix syntax, spelling, and fluency errors. For years, the field has been dominated by two main approaches. First, we have Sequence-to-Edit (seq2edit) models, which treat the problem like a tagging task—labelling words to be deleted, kept, or inserted. Second, we have Sequence-to-Sequence (seq2seq) models, which treat error correction like a translation task: “translating” bad grammar into good grammar. ...
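
The contrast between the two approaches is clearest in miniature. Below, a toy seq2edit pass applies per-token edit tags, while seq2seq would simply regenerate the whole sentence; the tag set is a simplified illustration, not any specific model's vocabulary:

```python
# Two framings of GEC for the sentence "He go to school yesterday."

# 1) Sequence-to-edit: tag each source token with an edit operation.
source = ["He", "go", "to", "school", "yesterday", "."]
tags   = ["KEEP", "REPLACE_went", "KEEP", "KEEP", "KEEP", "KEEP"]

def apply_edits(tokens, edit_tags):
    out = []
    for tok, tag in zip(tokens, edit_tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag.startswith("REPLACE_"):
            out.append(tag.split("_", 1)[1])
        # a DELETE tag would append nothing
    return " ".join(out)

print(apply_edits(source, tags))  # "He went to school yesterday ."

# 2) Sequence-to-sequence: "translate" the whole sentence instead,
#    e.g. corrected = model.generate("He go to school yesterday.")
```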

8 min · 1651 words
[Multi-expert Prompting Improves Reliability, Safety and Usefulness of Large Language Models 🔗](https://arxiv.org/abs/2411.00492)

Wisdom of the Artificial Crowd: How Multi-Expert Prompting Fixes LLM Hallucinations

We often treat Large Language Models (LLMs) like omniscient oracles. We type a question into ChatGPT or Claude, and we expect a single, authoritative, and correct answer. But under the hood, these models are probabilistic engines. When you ask an open-ended question—like “Is it ethical to eat meat?” or “How should we solve climate change?”—the model often defaults to the most likely continuation based on its training data. This can lead to generic, one-sided, or even biased answers. ...
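
The idea can be sketched in a few lines: elicit several expert personas, answer from each viewpoint, then merge. `llm` is a hypothetical completion callable, and the single merge prompt is a simplification of the paper's structured aggregation step:

```python
# Minimal sketch of multi-expert prompting, assuming `llm` is any
# text-in/text-out callable. Personas and prompts are illustrative.
def multi_expert(llm, question, experts=("ethicist", "economist", "ecologist")):
    drafts = [
        llm(f"You are an expert {role}. Answer concisely: {question}")
        for role in experts
    ]
    return llm(
        "Combine these expert answers into one balanced response, "
        "noting agreements and disagreements:\n" + "\n---\n".join(drafts)
    )
```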

2024-11 · 9 min · 1706 words
[MULTI-NEWS+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation 🔗](https://arxiv.org/abs/2404.09682)

Cleaning Up the Mess: How LLMs Can Fix Noisy Datasets Automatically

In the world of Machine Learning, there is an old adage that every student learns in their first semester: “Garbage In, Garbage Out.” No matter how sophisticated your neural network architecture is—whether it’s a state-of-the-art Transformer or a massive Large Language Model (LLM)—it cannot learn effectively if the data it is fed is flawed. For years, the gold standard for solving this problem was human annotation. If a dataset was messy, you hired humans to read it, label it, and clean it. But as datasets have exploded in size, reaching millions of examples, relying on human labor has become prohibitively expensive and slow. This leaves researchers in a bind: do we accept noisy data and lower performance, or do we burn through budgets cleaning it? ...
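
In miniature, LLM-based cleansing looks like a relevance check per document. The prompt and YES/NO label scheme below are illustrative, not the paper's exact annotation protocol:

```python
# Sketch: use an LLM as an annotator to flag noisy source documents
# in a multi-document summarization example. `llm` is a hypothetical
# text-in/text-out callable.
def filter_sources(llm, summary, documents):
    kept = []
    for doc in documents:
        verdict = llm(
            "Does the following document support the summary? "
            f"Answer YES or NO.\n\nSummary: {summary}\n\nDocument: {doc}"
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(doc)
    return kept
```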

2024-04 · 10 min · 1938 words
[Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2406.17169)

Can AI Think Deeply? Unpacking Multi-LogiEval and the Limits of LLM Logical Reasoning

Large Language Models (LLMs) like GPT-4 and Gemini have captivated the world with their ability to write code, compose poetry, and pass standardized tests. When you chat with these models, their fluency can easily be mistaken for deep understanding. They seem to reason, argue, and deduce. But are they actually performing logical reasoning, or are they simply excellent pattern matchers mimicking the structure of an argument? ...

2024-06 · 8 min · 1597 words
[Multi-Level Cross-Modal Alignment for Speech Relation Extraction 🔗](https://aclanthology.org/2024.emnlp-main.668.pdf)

Bridging the Gap between Speech and Knowledge: A Multi-Level Alignment Approach

In the world of Natural Language Processing (NLP), extracting structured knowledge—like relations between entities—from unstructured text is a well-established field. We have sophisticated models that can read a sentence like “Steve Jobs co-founded Apple” and extract the triplet (Steve Jobs, Founder, Apple). But what about speech? A massive amount of human knowledge is exchanged via podcasts, meetings, phone calls, and news broadcasts. Historically, extracting relations from speech (SpeechRE) has been treated as a two-step pipeline: transcribe the audio to text using Automatic Speech Recognition (ASR), and then run a text-based relation extraction model. While functional, this approach is prone to “error propagation”—if the ASR mishears a name, the relation extraction fails immediately. ...
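
The cascade baseline, and its failure mode, fit in a few lines. `asr` and `extract` below are toy stand-ins for real ASR and relation-extraction models:

```python
# The two-step cascade: speech -> transcript -> relation triplets.
def cascade_speech_re(asr, extract_relations, audio):
    transcript = asr(audio)               # step 1: speech recognition
    return extract_relations(transcript)  # step 2: (head, relation, tail)

# Error propagation in miniature: if the ASR mishears the entity,
# the downstream extractor has no way to recover it.
asr = lambda audio: "Steve Jaws co-founded Apple"   # ASR error
extract = lambda t: [(t.split(" co-founded ")[0], "Founder", "Apple")]
print(cascade_speech_re(asr, extract, audio=b"..."))
# [('Steve Jaws', 'Founder', 'Apple')]  <- the error survives intact
```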

10 min · 1961 words
[Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges 🔗](https://arxiv.org/abs/2410.03458)

Decoding the Sound of Vietnam: A Deep Dive into the ViMD Dataset and Multi-Dialect AI

Language is rarely a monolith. If you have ever tried to build a speech recognition system, you know that a “standard” language model often falls apart when faced with the rich tapestry of real-world accents and dialects. This is particularly true for Vietnamese, a tonal language where meaning shifts with the pitch of your voice, and where regional variations can be drastic. For years, research into Vietnamese Speech Recognition (SR) and Dialect Identification (DI) has operated under a simplified assumption: that the country is divided into three broad dialect regions—Northern, Central, and Southern. While linguistically useful, this generalization glosses over the nuanced reality that each of Vietnam’s 63 provinces has its own unique “provincial dialect.” ...

2024-10 · 9 min · 1861 words
[MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning 🔗](https://arxiv.org/abs/2405.07551)

Bridging the Gap: How MuMath-Code Teaches LLMs to Think and Code Simultaneously

If you have ever asked a standard Large Language Model (LLM) to solve a complex math problem, you might have noticed a frustrating pattern. The model often writes a beautiful, confident explanation, but then stumbles on the actual arithmetic, delivering a wrong answer with absolute conviction. Conversely, models designed to write code can calculate perfectly but often struggle to understand the nuances of a word problem. In the race to achieve GPT-4 level performance with open-source models, researchers have generally split into two camps: those who teach models to “think” better through reasoning (Chain-of-Thought), and those who teach models to “use tools” (like a Python interpreter). ...
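
The tool-use half of the recipe can be sketched as a loop: let the model interleave prose with code, execute the code, and feed results back. `llm` is a hypothetical callable; the `<code>` markers and `answer` variable convention are invented for illustration, and real systems sandbox the `exec` call:

```python
import re

# Sketch of code-integrated math reasoning: natural-language steps
# carry the logic, executed code replaces fragile mental arithmetic.
def reason_then_compute(llm, problem):
    reply = llm(f"Solve step by step. Wrap calculations in <code>...</code>:\n{problem}")
    results = []
    for snippet in re.findall(r"<code>(.*?)</code>", reply, re.DOTALL):
        scope = {}
        exec(snippet, scope)              # run the model's calculation
        results.append(scope.get("answer"))
    return reply, results                 # results feed the next LLM turn
```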

2024-05 · 8 min · 1544 words
[More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs 🔗](https://arxiv.org/abs/2405.17830)

Beyond Forgetting: How ALoRA Teaches LLMs to Think Like Experts Without Losing Their Common Sense

Large Language Models (LLMs) are the polymaths of the AI world. They can write code, solve math problems, summarize history, and chat about ethics—all in the same session. But when we need an LLM to be an expert—say, a legal consultant or a medical diagnostic tool—we have to send it back to school. This process is called Supervised Fine-Tuning (SFT). Here lies a classic problem in machine learning: when you teach a model too much about a specific domain, it often overwrites what it learned before. It might become a brilliant lawyer but suddenly forget how to do basic arithmetic or lose its ability to chat naturally. This phenomenon is famously known as Catastrophic Forgetting (CF). ...

2024-05 · 9 min · 1827 words
[More Insightful Feedback for Tutoring: Enhancing Generation Mechanisms and Automatic Evaluation 🔗](https://aclanthology.org/2024.emnlp-main.605.pdf)

Beyond "Try Again": Teaching AI to Give Better Feedback

Imagine you are learning a new language or studying for a history exam using an online platform. You encounter a question about a text you just read: “Why did the protagonist stay home?” You confidently answer: “Because he was sick.” The system responds: “Incorrect. Try again.” This is the verification stage. It tells you that you are wrong, but it doesn’t tell you why, nor does it help you find the right answer. Now, imagine a better system. One that responds: “Actually, the text mentions he was feeling fine, but look closely at what happened to his car.” ...

10 min · 2061 words
[More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages 🔗](https://aclanthology.org/2024.emnlp-main.796.pdf)

Building Better Word Graphs: How More Data and Denser Connections Reveal True Meaning

Language is a moving target. Words like “plane” or “mouse” mean something very different today than they did two hundred years ago. To teach computers how to understand these shifts—a field known as Lexical Semantic Change Detection (LSCD)—researchers need high-quality data. They need a way to map how a word is used in thousands of different contexts. Enter the Word Usage Graph (WUG). This innovative approach moves away from rigid dictionary definitions and instead relies on how words relate to one another in actual sentences. ...
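
In miniature, a WUG looks like this: nodes are individual usages, weighted edges are human relatedness judgments, and clustering the graph recovers senses. The example below uses networkx, with invented sentences and scores:

```python
import networkx as nx

# Nodes: usages of "plane". Edge weights: human judgments of semantic
# relatedness (e.g., 1 = unrelated .. 4 = identical). All invented.
usages = {
    0: "The plane landed at dawn.",
    1: "Our plane was delayed by fog.",
    2: "Sand the board until the plane is smooth.",
}
G = nx.Graph()
G.add_nodes_from(usages)
G.add_edge(0, 1, weight=4)   # same 'aircraft' sense
G.add_edge(0, 2, weight=1)   # different senses
G.add_edge(1, 2, weight=1)

# Keep only strong edges; connected components approximate word senses.
strong = [(u, v) for u, v, w in G.edges(data="weight") if w >= 3]
print(list(nx.connected_components(nx.Graph(strong))))  # [{0, 1}]
```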

8 min · 1551 words
[Moral Foundations of Large Language Models 🔗](https://arxiv.org/abs/2310.15337)

Do Androids Dream of Moral Values? Analyzing the Hidden Ethics of LLMs

In the last few years, Large Language Models (LLMs) like GPT-3 and PaLM have moved from research labs to the center of our digital lives. We use them to write emails, debug code, and even seek life advice. But as we integrate these systems into society, a critical question arises: Do these models have a moral compass? We know that LLMs are trained on massive datasets scraped from the internet—a corpus of text that contains the best and worst of humanity. It is well-documented that these models can absorb toxic biases related to gender, race, and religion. But what about deeper, more abstract psychological frameworks? Do LLMs exhibit a consistent set of values? And if so, can those values be manipulated? ...

2023-10 · 8 min · 1567 words
[MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction 🔗](https://arxiv.org/abs/2408.01426)

Beyond SMILES: How MolTRES Revolutionizes Molecular Property Prediction

The pharmaceutical and materials science industries are currently undergoing a massive shift from traditional “wet lab” experiments to computational “dry lab” predictions. Deep Neural Networks (DNNs) are at the forefront of this revolution, promising to reduce the cost and time required to discover new drugs. A popular approach in this field is Chemical Language Representation Learning. Just as Large Language Models (LLMs) like GPT learn to understand English by reading billions of sentences, chemical models learn to understand molecules by reading billions of SMILES (Simplified Molecular-Input Line Entry System) strings. SMILES represents a 3D molecule as a 1D string of text (e.g., CCO for ethanol). ...
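
Because SMILES is plain text, the standard masked language modeling recipe transfers directly. Here is a toy character-level illustration (real chemical tokenizers work at the atom or subword level):

```python
# SMILES strings are just text, so chemical LMs can reuse the masked
# language modeling recipe. Character-level here purely for clarity.
smiles = "CCO"                       # ethanol: carbon-carbon-oxygen
tokens = list(smiles)                # ["C", "C", "O"]

masked = tokens.copy()
masked[2] = "[MASK]"                 # training input:  C C [MASK]
target = tokens[2]                   # training target: O
print(masked, "->", target)          # ['C', 'C', '[MASK]'] -> O
```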

2024-08 · 7 min · 1364 words
[Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration 🔗](https://arxiv.org/abs/2406.15951)

Beyond the Average: How Modular Pluralism Teaches LLMs to Represent Diverse Human Values

In the rapid evolution of Large Language Models (LLMs), “alignment” has become a buzzword. We want our AI assistants to be helpful, harmless, and honest. Typically, this is achieved through techniques like Reinforcement Learning from Human Feedback (RLHF), where models are trained to prefer responses that humans rate highly. But here is the catch: Who are these humans? ...

2024-06 · 8 min · 1517 words