[DEM: Distribution Edited Model for Training with Mixed Data Distributions 🔗](https://arxiv.org/abs/2406.15570)

Stop Blending Data: Why Editing Model Weights is the Future of Multi-Task LLMs

If you have ever tried to train a Large Language Model (LLM) to be a “jack of all trades,” you know the struggle. You want a model that can solve math problems, write Python code, chat casually, and reason through logic puzzles. The standard approach is Data Mixing. You take all your datasets—math, code, chat—throw them into a giant blender, and train the model on this mixed soup. The problem? It is incredibly expensive and notoriously difficult to tune. If you get the ratio of math-to-chat wrong, the model becomes great at algebra but forgets how to speak English. If you want to add a new skill later, you often have to re-blend and re-train from scratch. ...
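To make the data-mixing baseline concrete, here is a minimal sketch (in Python, with invented datasets and mixture weights) of what “blending” usually amounts to: every training batch is sampled from the domain corpora according to hand-tuned ratios, which are exactly the knobs the post describes as expensive to get right.

```python
import random

# Toy illustration of the data-mixing baseline: domain corpora are blended
# into one training stream according to hand-tuned ratios. The datasets and
# weights below are invented for illustration only.
datasets = {
    "math": ["2 + 2 = ?", "Solve x^2 - 4x + 3 = 0."],
    "code": ["Write a Python function that reverses a string."],
    "chat": ["How was your day?", "Tell me a joke about penguins."],
}
mix_weights = {"math": 0.4, "code": 0.3, "chat": 0.3}  # the ratios you must tune

def sample_mixed_batch(batch_size: int) -> list[str]:
    """Draw one training batch from the blended distribution."""
    domains = random.choices(
        population=list(mix_weights), weights=list(mix_weights.values()), k=batch_size
    )
    return [random.choice(datasets[d]) for d in domains]

print(sample_mixed_batch(4))
```

Getting `mix_weights` wrong is the failure mode described above, and adding a new skill later means re-tuning the whole dictionary and retraining.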

2024-06 · 9 min · 1733 words
[DEFT-UCS: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection for Text-Editing 🔗](https://aclanthology.org/2024.emnlp-main.1132.pdf)

Less is More: How DEFT-UCS Fine-Tunes LLMs with 70% Less Data

In the current landscape of Artificial Intelligence, the mantra has often been “bigger is better.” We build larger models and feed them massive datasets. For example, fine-tuning a model like Alpaca requires 52k instruction samples; mathematical reasoning models like MetaMath utilize nearly 400k samples. While this brute-force approach works, it creates a significant bottleneck: Data Scarcity. For real-world applications—think specialized medical text editing, legal document refinement, or niche technical writing—acquiring tens of thousands of high-quality, annotated examples is often impossible or prohibitively expensive. This leads to a critical question: Do we actually need all that data? Or are we just feeding models redundant information that they don’t really need to learn? ...

8 min · 1510 words
[DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting 🔗](https://arxiv.org/abs/2406.19650)

Beyond Grammar: Teaching AI to Fix Coherence in Student Writing with DECOR

Imagine reading an essay where every individual sentence is grammatically perfect, yet the paragraph feels confusing. The ideas jump around, the pronouns don’t seem to refer to anyone specific, and the arguments don’t flow logically. This is a failure of coherence. For Second Language (L2) English learners, mastering grammar is a significant milestone, but achieving coherence is often the harder final boss. While tools like Grammarly have revolutionized how we fix surface-level errors (spelling, syntax, punctuation), they often fall short when it comes to discourse-level issues. They can tell you how to spell a word, but they rarely tell you if that word makes sense in the context of the previous three sentences. ...

2024-06 · 8 min · 1549 words
[DC-Instruct: An Effective Framework for Generative Multi-intent Spoken Language Understanding 🔗](https://aclanthology.org/2024.emnlp-main.804.pdf)

How DC-Instruct Teaches LLMs to Reason in Multi-Intent Spoken Language Understanding

In the fast-paced world of conversational AI, the ability of a system to understand human speech is paramount. When we speak to digital assistants like Siri, Alexa, or customer service bots, we rarely stick to simple, single commands. We combine requests, add constraints, and switch contexts mid-sentence. For a machine, distinguishing “Book a flight to New York” from “Book a flight to New York and find me a hotel near the airport” involves complex reasoning. ...

9 min · 1717 words
[DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination 🔗](https://arxiv.org/abs/2410.04514)

Fixing Vision-Language Hallucinations: A Deep Dive into Attention Mechanisms (DAMRO)

In the rapidly evolving world of Artificial Intelligence, Large Vision-Language Models (LVLMs) like LLaVA and InstructBLIP have become the superstars. They can look at an image, understand it, and describe it in fluent natural language. Ask them to describe a kitchen, and they will tell you about the fridge, the stove, and the fruit bowl. But there is a catch. Sometimes, they tell you about a toaster that isn’t there. ...

2024-10 · 7 min · 1487 words
[DA³: A Distribution-Aware Adversarial Attack against Language Models 🔗](https://aclanthology.org/2024.emnlp-main.107.pdf)

The Stealthy Attacker: How DA³ Generates Hard-to-Detect Adversarial Examples

Language Models (LMs) have become ubiquitous, powering everything from customer service chatbots to code generation tools. However, despite their impressive capabilities, they possess a significant fragility: adversarial attacks. By making subtle changes to an input sentence—changes often imperceptible to humans—an attacker can trick a model into making a completely wrong prediction. While researchers have developed highly successful attack methods, defenders have caught up. They have realized that while adversarial examples might fool the model’s prediction logic, they often look statistically “weird.” They break the data distribution patterns that the model is used to seeing. This allows defenders to build simple detectors that flag these inputs before they can do damage. ...
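The distribution-based defenses mentioned above can be surprisingly simple. A common instantiation (shown here as an illustrative sketch, not the detector or attack studied in the paper) fits a Gaussian to feature vectors of clean inputs and flags anything whose Mahalanobis distance is unusually large; DA³’s goal is to craft adversarial examples that slip under exactly this kind of statistical check.

```python
import numpy as np

# Illustrative distribution-based detector: fit a Gaussian to features of
# clean inputs (stand-ins for sentence embeddings), then flag inputs whose
# Mahalanobis distance from that distribution exceeds a threshold.
rng = np.random.default_rng(0)
clean_features = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
mean = clean_features.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(clean_features, rowvar=False))

def mahalanobis(x: np.ndarray) -> float:
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Threshold at the 99th percentile of clean-data distances.
threshold = np.quantile([mahalanobis(f) for f in clean_features], 0.99)

suspicious = rng.normal(loc=3.0, scale=1.0, size=8)  # a statistically "weird" input
print(mahalanobis(suspicious) > threshold)           # likely True: flagged
```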

8 min · 1657 words
[DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models 🔗](https://arxiv.org/abs/2410.07331)

Can AI Be Your Data Scientist? Inside the DA-Code Benchmark

The role of a data scientist is often dubbed the “sexiest job of the 21st century.” It requires a unique blend of skills: statistical knowledge, coding proficiency (usually Python or SQL), business acumen, and the ability to wrangle messy, unstructured data into actionable insights. With the meteoric rise of Large Language Models (LLMs) like GPT-4 and Claude, a burning question has emerged in the tech community: Can we automate data science? ...

2024-10 · 9 min · 1743 words
[D3CODE: Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation 🔗](https://arxiv.org/abs/2404.10857)

Beyond the Majority Vote: How Culture and Morality Shape What We Find Offensive

Imagine you are scrolling through a social media feed and you encounter a comment about a sensitive political topic. You might shrug it off as a harmless opinion. Your friend, however, might find it deeply offensive. Now, imagine a third person reading that same comment from a cafe in Cairo, a subway in Tokyo, or a living room in São Paulo. Would they agree on whether that sentence is “toxic”? ...

2024-04 · 8 min · 1610 words
[Curriculum Consistency Learning for Conditional Sentence Generation 🔗](https://aclanthology.org/2024.emnlp-main.768.pdf)

Mastering the Hard Stuff: How Curriculum Consistency Learning Optimizes AI Training

In the world of human education, we don’t teach calculus to kindergarteners. We follow a curriculum: a structured path that starts with simple concepts and gradually introduces complexity as the student’s proficiency grows. This approach ensures that the learner builds a solid foundation before tackling difficult problems. In the realm of Artificial Intelligence, specifically Conditional Sentence Generation (CSG)—which covers tasks like machine translation, image captioning, and Large Language Model (LLM) instruction tuning—training often lacks this nuance. Models are frequently exposed to a barrage of data without regard for the difficulty of specific examples or the model’s current capability. ...
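As a deliberately simplified illustration of the curriculum idea, assuming nothing about the paper’s actual difficulty measure: score each training example with a cheap difficulty proxy, sort, and widen the pool of examples the model sees as training progresses.

```python
# Minimal curriculum schedule: order examples by a difficulty proxy
# (sequence length here, purely for illustration) and expose a growing
# fraction of the sorted data as training progresses.
examples = [
    "The cat sat.",
    "Translate this sentence into French while preserving the idiom.",
    "Dogs bark.",
    "Summarize the following three-paragraph argument about monetary policy.",
]
by_difficulty = sorted(examples, key=len)  # easy -> hard

def curriculum_slice(step: int, total_steps: int) -> list[str]:
    """Return the portion of the sorted data available at a given training step."""
    frac = (step + 1) / total_steps
    cutoff = max(1, int(frac * len(by_difficulty)))
    return by_difficulty[:cutoff]

for step in range(4):
    print(step, curriculum_slice(step, total_steps=4))
```

The paper measures difficulty relative to the model’s current capability rather than a crude length proxy, but the scheduling skeleton is similar.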

8 min · 1661 words
[Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting 🔗](https://arxiv.org/abs/2406.11661)

Is Your LLM Culturally Biased or Just Confused? The Placebo Effect in AI Prompting

Large Language Models (LLMs) are the engines of the modern internet, but they have a well-documented problem: they tend to view the world through a Western, Anglo-centric lens. If you ask an LLM to judge a social situation or write a story, it usually defaults to American or European norms. To fix this, researchers and engineers have turned to socio-demographic prompting. The idea is simple: if you want the model to think like a person from India, you preface your prompt with “You are a person from India.” If you want it to adopt Japanese etiquette, you might mention “Sushi” or “Hiroshi” in the context. This technique is used both to study bias (probing) and to force the model to behave differently (alignment). ...
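In practice, socio-demographic prompting is just a persona prefix on the task prompt. A minimal sketch follows, using a generic template of my own rather than the exact conditioning text studied in the paper.

```python
# Socio-demographic prompting in its simplest form: prepend a persona
# sentence to the task. The template is a generic illustration, not the
# exact conditioning text studied in the paper.
def condition_prompt(task: str, country: str = "") -> str:
    persona = f"You are a person from {country}. " if country else ""
    return persona + task

task = "Is it acceptable to decline a dinner invitation from your boss? Answer yes or no."
print(condition_prompt(task))            # unconditioned baseline
print(condition_prompt(task, "India"))   # culturally conditioned variants
print(condition_prompt(task, "Japan"))
```

The paper’s question is whether the conditioned variants actually change the model’s answers in culturally meaningful ways, or whether the prefix acts as a placebo.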

2024-06 · 9 min · 1721 words
[CryptoTrade: A Reflective LLM-based Agent to Guide Zero-shot Cryptocurrency Trading 🔗](https://aclanthology.org/2024.emnlp-main.63.pdf)

Can LLMs Beat the Crypto Market? Inside the CryptoTrade Agent

The world of cryptocurrency is often described as the “Wild West” of finance. It is characterized by extreme volatility, a 24/7 news cycle, and a unique layer of transparency known as “on-chain data.” For researchers and traders alike, the Holy Grail has always been predicting these market movements. In recent years, Large Language Models (LLMs) like GPT-4 have revolutionized how we process information. We’ve seen them write code, pass bar exams, and analyze stock market sentiment. However, applying LLMs to cryptocurrency trading presents a specific set of challenges. Unlike the stock market, where quarterly reports and standard news cycles drive prices, crypto is driven by a chaotic mix of technical indicators, social media hype, and blockchain network activity. ...

9 min · 1803 words
[Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages 🔗](https://arxiv.org/abs/2410.03197)

Breaking the Language Barrier: How QuIST Teaches AI to Ask Questions in Any Language

In the rapidly evolving world of Natural Language Processing (NLP), we often take for granted how much data is available—for English. If you want to train a chatbot to answer questions about history, science, or pop culture in English, you have access to massive datasets like SQuAD or HotpotQA. But what happens if you want to build that same system for Swahili, Finnish, or Bengali? The data simply isn’t there in the same volume. ...

2024-10 · 10 min · 1919 words
[Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing 🔗](https://arxiv.org/abs/2410.00513)

Breaking the Language Barrier in Semantic Parsing with Cross-lingual Back-Parsing

Imagine you have built a sophisticated AI assistant capable of querying a complex database. When you ask it, “Show me the nearest hotel to Melania,” it converts your English request into a precise database query (like SQL) and retrieves the answer. This technology is called Semantic Parsing (SP). Now, imagine you want to deploy this same system in Korea, Turkey, or Finland. You immediately face a bottleneck: the lack of labeled training data. Collecting thousands of pairs of “Natural Language Question” and “Database Query” for every new language is incredibly expensive and time-consuming. ...

2024-10 · 7 min · 1399 words
[Cross-domain NER with Generated Task-Oriented Knowledge: An Empirical Study from Information Density Perspective 🔗](https://aclanthology.org/2024.emnlp-main.95.pdf)

Teaching Models to Reason: How LLM-Generated Knowledge Solves Cross-Domain NER

Imagine you have trained a brilliant assistant to read the New York Times and highlight the names of politicians and companies. They get really good at it. Then, you hand them a technical paper on quantum physics or a fan forum about K-Pop and ask them to do the same thing. Suddenly, they struggle. “Is ‘superposition’ a location? Is ‘BTS’ an organization or a movement?” ...

9 min · 1859 words
[Cross-Domain Audio Deepfake Detection: Dataset and Analysis 🔗](https://arxiv.org/abs/2404.04904)

Can We Trust Our Ears? Fighting the New Wave of Zero-Shot Audio Deepfakes

Imagine receiving a voice message from a family member asking for help, or hearing a politician declare war on a social media clip. The voice sounds unmistakably authentic—the cadence, the timbre, the breath. But it’s all fake. We are living in the age of Zero-Shot Text-to-Speech (TTS). Unlike older technologies that required hours of recorded speech to clone a voice, modern models like VALL-E or OpenVoice can clone a specific person’s voice with a single utterance—sometimes as short as three seconds. While this technology has incredible creative potential, it poses severe risks to privacy, security, and social trust. ...

2024-04 · 9 min · 1808 words
[Crafting Personalized Agents through Retrieval-Augmented Generation on Editable Memory Graphs 🔗](https://arxiv.org/abs/2409.19401)

Building a Brain for AI Assistants: Inside EMG-RAG

Imagine an AI assistant that actually knows you. Not just one that knows your name, but one that remembers your boss is flying to Amsterdam next week, recalls that you prefer aisle seats, and automatically updates your calendar when the flight time changes. Current Large Language Models (LLMs) are incredible generalists, but they often struggle to be good “personal assistants.” They suffer from what we might call the “Goldfish Memory” problem—context is limited, and specific personal details are often lost in the noise or hallucinated. ...

2024-09 · 9 min · 1746 words
[CoverICL: Selective Annotation for In-Context Learning via Active Graph Coverage 🔗](https://aclanthology.org/2024.emnlp-main.1185.pdf)

How to Select the Perfect Few-Shot Examples: A Deep Dive into CoverICL

Large Language Models (LLMs) have fundamentally changed how we approach Natural Language Processing (NLP). One of their most powerful features is In-Context Learning (ICL). Instead of fine-tuning the model’s billions of parameters—which is expensive and computationally heavy—you simply provide a few examples of the task within the prompt itself. For instance, to teach a model to classify sentiment, you might provide three examples of movie reviews with their labels before asking it to classify a new review. ...
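To make that concrete, here is a minimal few-shot prompt for sentiment classification, with invented demonstrations. Deciding which handful of unlabeled examples to annotate and include as demonstrations is precisely the selection problem CoverICL tackles.

```python
# A minimal in-context-learning (few-shot) prompt for sentiment classification.
# The three demonstrations are invented; choosing which examples to annotate
# and include is the selection problem CoverICL addresses.
demonstrations = [
    ("A gorgeous, heartfelt film with a stunning lead performance.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
    ("The plot meanders, but the soundtrack almost saves it.", "negative"),
]
query = "An unexpectedly moving story told with real craft."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}\n" for text, label in demonstrations)
prompt += f"Review: {query}\nSentiment:"
print(prompt)  # this string is what gets sent to the LLM
```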

10 min · 1949 words
[CorrSynth - A Correlated Sampling Method for Diverse Dataset Generation from LLMs 🔗](https://arxiv.org/abs/2411.08553)

Solving the Diversity Crisis in Synthetic Data: A Deep Dive into CorrSynth

The era of Large Language Models (LLMs) has revolutionized how we approach machine learning. We have moved from a scarcity mindset—where labeled data was expensive and rare—to an abundance mindset, where models like GPT-4 or Mixtral can generate infinite amounts of text. This has given rise to Knowledge Distillation: using a massive “Teacher” LLM to generate synthetic datasets, which are then used to train smaller, efficient “Student” models (like BERT or DistilBERT) for specific tasks. ...
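The teacher-to-student pipeline described above reduces to a small skeleton, sketched here under heavy assumptions: the teacher call is a stub rather than a real LLM request, and a TF-IDF plus logistic-regression model (scikit-learn) stands in for a BERT-style student. The stub also shows the failure mode CorrSynth targets: a naive teacher tends to emit repetitive, low-diversity text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Skeleton of the Teacher -> Student distillation pipeline. The teacher is a
# stub standing in for an LLM call; the student is a tiny TF-IDF + logistic
# regression model rather than BERT/DistilBERT.
def generate_with_teacher(label: str, n: int) -> list[str]:
    canned = {
        "positive": ["I loved every minute of it.", "Absolutely wonderful experience."],
        "negative": ["This was a complete waste of time.", "Terrible from start to finish."],
    }
    return (canned[label] * n)[:n]  # note the repetition: low-diversity synthetic data

texts, labels = [], []
for label in ("positive", "negative"):
    synthetic = generate_with_teacher(label, n=4)
    texts += synthetic
    labels += [label] * len(synthetic)

vectorizer = TfidfVectorizer()
student = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)
print(student.predict(vectorizer.transform(["What a delightful surprise!"])))
```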

2024-11 · 8 min · 1658 words
[COPYBENCH: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation 🔗](https://arxiv.org/abs/2407.07087)

Beyond Verbatim: Uncovering How LLMs Copy Plots and Characters

In the rapidly evolving landscape of Generative AI, a major legal and ethical storm has been brewing around copyright. We know that Large Language Models (LLMs) are trained on massive datasets that include copyrighted books, articles, and creative writing. A central question for researchers, lawyers, and content creators is: To what extent do these models reproduce protected content? Until recently, the community has largely focused on literal copying—instances where an AI spits out a passage of text word-for-word, identical to the source material. It is relatively easy to check if a model generates the exact opening paragraph of Harry Potter. However, this narrow focus misses a crucial nuance of copyright law and creative expression. Infringement isn’t just about the exact sequence of words; it is also about the “pattern of the work”—the unique arrangement of plots, events, and characters. ...

2024-07 · 9 min · 1744 words
[Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment 🔗](https://arxiv.org/abs/2402.19085)

Taming the Alignment Tax: How Controllable Preference Optimization Balances Helpfulness, Honesty, and Harmlessness

If you have used Large Language Models (LLMs) extensively, you have likely encountered the “refusal” phenomenon. You ask a model for help with a complex topic, perhaps something strictly factual but slightly sensitive, and it politely declines or gives a watered-down, overly cautious answer. This is often the result of safety alignment. To make AI safe for public use, we align models with human values, typically summarized as the “3H” principles: Helpfulness, Honesty, and Harmlessness. Ideally, we want a model that is perfect at all three. In reality, these goals often conflict. A model that is perfectly harmless might refuse to answer valid questions (reducing helpfulness). A model that is perfectly helpful might answer dangerous questions (reducing harmlessness). ...

2024-02 · 9 min · 1774 words