Papers

[CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models 🔗](https://arxiv.org/abs/2409.19984)

Probability or Guesswork? Investigating Consistency in Large Language Models

Large Language Models (LLMs) have become the engines driving modern AI, from chatbots to code generators. In many of these applications, we don’t just care about the text the model generates; we care about the score—the probability the model assigns to a specific sequence of words. These scores are used to detect hallucinations, rank potential answers, and measure the model’s confidence. But here is the uncomfortable question: Can we actually trust these numbers as mathematical probabilities? ...

[CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models 🔗](https://arxiv.org/abs/2407.17467)

Balancing Act: How to Teach LLMs New Tricks Without Forgetting Old Ones

Large Language Models (LLMs) like Llama or GPT-4 are the polymaths of the digital age. They can write poetry, debug code, and summarize history with impressive fluency. However, their broad knowledge often comes at the expense of depth. When faced with highly specialized tasks, such as interpreting complex financial regulations or analyzing dense academic papers, these generalist models often struggle. They simply haven’t seen enough of that specific data during their initial training. ...

[CMD: a framework for Context-aware Model self-Detoxification 🔗](https://arxiv.org/abs/2308.08295)

Can LLMs Fix Themselves? Inside the Context-aware Model self-Detoxification (CMD) Framework

Large Language Models (LLMs) like GPT-4 and Llama 2 have revolutionized how we interact with technology. They can write poetry, debug code, and summarize history. However, they possess a significant flaw: “garbage in, garbage out.” Because these models are trained on the vast, unfiltered internet, they can inadvertently learn and regurgitate toxic content. When a user provides a toxic prompt (the context), LLMs naturally try to complete the pattern. If you start a sentence with a slur or an aggressive statement, the model’s probability distribution pushes it to continue in that same toxic vein. This poses a massive safety risk for real-world applications. ...

[CItruS: Chunked Instruction-aware State Eviction 🔗](https://arxiv.org/abs/2406.12018)

Solving Information Neglect in Long-Context LLMs with CItruS

Large Language Models (LLMs) like Llama 2 and Mistral have revolutionized how we interact with text. However, they possess a significant limitation: the context window. While models are getting better at handling longer sequences, processing an entire book or a massive legal document remains computationally expensive and memory-intensive. To manage this, researchers have developed “State Eviction” methods. These techniques compress the model’s memory (the Key-Value cache) by discarding “unimportant” information as the text gets longer. It sounds like a perfect solution: keep the memory footprint low while processing infinite text. ...
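To make the idea of state eviction concrete, here is a minimal sketch in Python. It is not CItruS or any specific published method: the `evict` function, the `(position, importance)` tuples, and the fixed `budget` are all assumptions made for illustration. The sketch keeps only the highest-scoring cache entries once the budget is exceeded.

```python
def evict(cache, budget):
    """Keep at most `budget` cache entries, preferring high importance.

    `cache` is a list of (position, importance) tuples. Both the tuple
    layout and the importance scores are hypothetical; a real system
    would derive scores from the model (for example, attention weights).
    """
    if len(cache) <= budget:
        return list(cache)
    # Pick the `budget` most important entries...
    keep = sorted(cache, key=lambda entry: entry[1], reverse=True)[:budget]
    # ...then restore the original token order.
    return sorted(keep, key=lambda entry: entry[0])


cache = [(0, 0.9), (1, 0.1), (2, 0.5), (3, 0.05), (4, 0.7)]
compressed = evict(cache, budget=3)  # positions 1 and 3 are discarded
```

Whatever is evicted is gone for good, so everything depends on how well the importance score anticipates what later text will need.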

[CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search 🔗](https://arxiv.org/abs/2406.05013)

Can Open-Source Models Beat ChatGPT at Search? Inside CHIQ's History Enhancement Strategy

Imagine you are chatting with a friend about a movie. You ask, “Who directed Inception?” They answer, “Christopher Nolan.” Then you ask, “What other movies did he do?” Your friend instantly understands that “he” refers to Christopher Nolan. But for a search engine, that second question is a nightmare. “He” could be anyone. To get a good answer, a search system needs to rewrite your question into something standalone, like “What other movies did Christopher Nolan direct?” ...

[CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification 🔗](https://arxiv.org/abs/2409.01366)

Making LLMs Faster on the Edge: A Deep Dive into CHESS

The dream of running powerful Large Language Models (LLMs) like Llama-3 or Mistral directly on your laptop or phone—without relying on the cloud—is enticing. It promises privacy, lower latency, and offline capabilities. However, the reality is often a struggle against hardware limitations. These models are computationally heavy and memory-hungry. One of the most effective ways to speed up these models is activation sparsification. The idea is simple: if a neuron’s activation value is close to zero, we can treat it as exactly zero and skip the math associated with it. ...
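The thresholding idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a single global `threshold`, whereas the paper's title indicates channel-wise thresholds, and the helper names `sparsify` and `sparse_matvec` are hypothetical.

```python
def sparsify(activations, threshold):
    """Treat activations whose magnitude is below `threshold` as exactly zero."""
    return [a if abs(a) >= threshold else 0.0 for a in activations]


def sparse_matvec(weights, activations):
    """Multiply a weight matrix (a list of rows) by an activation vector,
    reading only the columns whose activation is non-zero."""
    active = [j for j, a in enumerate(activations) if a != 0.0]
    return [sum(row[j] * activations[j] for j in active) for row in weights]


acts = sparsify([0.9, -0.02, 0.0, 0.4, 0.01], threshold=0.05)
weights = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
]
out = sparse_matvec(weights, acts)  # only columns 0 and 3 are touched
```

In this toy case three of the five columns are skipped entirely. The savings come from not loading or multiplying those weights, at the cost of a small approximation error from the zeroed activations.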

[CELLO: Causal Evaluation of Large Vision-Language Models 🔗](https://arxiv.org/abs/2406.19131)

Can AI Really Understand Cause and Effect? Inside CELLO, a New Benchmark for Vision-Language Models

Imagine you are looking at a photo of an elderly man sitting in a wheelchair next to a window. A child asks you, “I need to reach something high. Can you move this chair for me to use?” As a human, your brain instantly processes a complex web of causal relationships. You see the chair, you see the man, and you understand the relationship: “The chair supports the man.” Moving the chair would cause the man to fall or be displaced. Therefore, the answer is an obvious “No.” ...

[CARER - Clinical Reasoning-Enhanced Representation for Temporal Health Risk Prediction 🔗](https://aclanthology.org/2024.emnlp-main.580.pdf)

Teaching AI to Think Like a Doctor: Inside the CARER Framework

Imagine a seasoned physician reviewing a patient’s file. They don’t just look at a list of numbers—blood pressure 140/90, heart rate 100—and statistically calculate the odds of a heart attack. Instead, they engage in clinical reasoning. They synthesize disparate data points, apply external medical knowledge learned over decades, and construct a narrative about the patient’s physiological progression. They might think, “The patient’s creatinine is rising while their blood pressure is unstable, which, given their history of diabetes, suggests acute kidney injury complicating their cardiovascular status.” ...

[C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits 🔗](https://arxiv.org/abs/2410.03925)

Decoding Legalese: How the C3PA Dataset Revolutionizes Automated Privacy Compliance

If you have ever clicked “I Agree” on a privacy policy without reading a single word, you are in the overwhelming majority. These documents are notorious for being long, dense, and filled with complex legal jargon. However, for regulators and privacy advocates, these documents are the first line of defense in understanding how corporations handle our personal data. In recent years, the landscape of digital privacy has shifted dramatically. Landmark regulations like the European Union’s GDPR and the California Consumer Privacy Act (CCPA) have forced companies to be more transparent. They are no longer just required to say that they collect data; they must now disclose specific categories of data, list specific consumer rights, and provide clear methods for users to exercise those rights. ...

[C-LLM: Learn to Check Chinese Spelling Errors Character by Character 🔗](https://arxiv.org/abs/2406.16536)

Why LLMs Struggle with Chinese Spelling (And How Character-Level Tokenization Fixes It)

Large Language Models (LLMs) like GPT-4 and Qwen have revolutionized how we interact with text. They can write poetry, generate code, and summarize complex documents. Yet, there is a specific, seemingly simple task where these giants often stumble: Chinese Spell Checking (CSC). It seems counterintuitive. How can a model capable of passing the Bar Exam fail to correct a simple homophone error in a Chinese sentence? In this deep dive, we are exploring a fascinating research paper, “C-LLM: Learn to Check Chinese Spelling Errors Character by Character.” We will uncover why the standard architecture of modern LLMs creates a fundamental bottleneck for spelling correction and how a new method, C-LLM, proposes a structural shift—changing how the model “sees” text—to achieve state-of-the-art results. ...

[By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting 🔗](https://arxiv.org/abs/2407.10385)

Stop Feeding Numbers to LLMs: Why Visualizing Sensor Data is the Future of AI Sensing

Large Language Models (LLMs) like GPT-4 have conquered the world of text. They can write poetry, debug code, and summarize history. However, the physical world doesn’t always speak in words; it speaks in sensor data. From the accelerometer in your smartwatch tracking your steps to the electrocardiogram (ECG) monitoring a patient’s heart rhythm, ubiquitous sensing generates massive streams of numerical data. The traditional way to use this data with AI is to feed the raw numbers directly into the model. But if you’ve ever tried to read a spreadsheet with 10,000 rows of floating-point numbers, you know the problem: it is overwhelming, expensive to process, and hard to interpret. ...

[Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks 🔗](https://aclanthology.org/2024.emnlp-main.824.pdf)

Breaking the Silence: Building NLP Resources for Mozambique’s Emakhuwa Language

In the fast-paced world of Artificial Intelligence, we often marvel at Large Language Models (LLMs) that can write poetry in English, debug code in Python, or translate French to German with near-human accuracy. However, this technological revolution is not evenly distributed. For billions of people, the digital world remains largely inaccessible in their native tongues. This is the challenge of Low-Resource Languages—languages that lack the massive digital text archives required to train modern AI systems. ...

[Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning 🔗](https://arxiv.org/abs/2410.05600)

Can Text Teach AI to See Hate? Bridging the Gap Between Tweets and Memes

The internet is a battleground. While social media platforms have spent years refining algorithms to detect hateful text, the adversary has evolved. Hate speech is no longer just about nasty words typed into a status update; it has migrated to the visual realm. Internet memes—images overlaid with text—have become a dominant vehicle for spreading animosity, often bypassing traditional text-based filters. This shift presents a massive engineering challenge. Text-based hate speech detection is a mature field with abundant datasets. Vision-language (multimodal) detection, however, is data-starved. Privacy concerns, copyright issues, and the sheer difficulty of scraping memes make building large training sets for memes incredibly difficult. ...

[Bridging Local Details and Global Context in Text-Attributed Graphs 🔗](https://arxiv.org/abs/2406.12608)

Beyond Nodes and Edges: How GraphBridge Unifies Text and Structure in Graph Learning

In the evolving landscape of Machine Learning, we often find ourselves categorizing data into distinct types. We have Natural Language Processing (NLP) for text and Graph Neural Networks (GNNs) for networked structures. But the real world is rarely so tidy. In reality, data is often a messy, beautiful combination of both. Consider a social network. You have users (nodes) and friendships (edges). But you also have the content those users generate: their bios, posts, and self-introductions. To truly understand a user, you can’t just look at who they know (structure), nor can you only look at what they write in isolation (text). You need to understand how their text relates to the text of the people they hang out with. ...

[Bridging Cultures in the Kitchen: A Framework and Benchmark for Cross-Cultural Recipe Retrieval 🔗](https://aclanthology.org/2024.emnlp-main.61.pdf)

Cooking with AI: Why Retrieval Beats Generation in Cross-Cultural Kitchens

Food is perhaps the most universal language we have, yet it is deeply fragmented by dialects of culture, geography, and history. If you have ever tried to recreate a specific dish from a foreign cuisine using locally available ingredients, you know the struggle. It is not merely a translation problem; it is a cultural adaptation problem. Directly translating a Chinese recipe into English often results in confusion. “Moderate amount of ginger” is vague to a Western cook used to teaspoons and cups. Ingredients like “rice wine” might not be on the shelf at your local grocery store. ...

[Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models 🔗](https://arxiv.org/abs/2401.10440)

Breaking the Curse: How X-ELM Democratizes Multilingual AI

Imagine trying to pack a suitcase for a trip around the world. You need winter gear for Russia, light linens for Egypt, and rain gear for London. If you only have one suitcase, you eventually run out of space. You have to compromise: maybe you leave the umbrella behind or pack a thinner coat. This is the Curse of Multilinguality. In the world of Natural Language Processing (NLP), we typically train massive, “dense” multilingual models (like BLOOM or XGLM) that try to learn 100+ languages simultaneously using a single set of parameters. The result? Competition. Languages fight for capacity within the model. Consequently, these multilingual giants often perform worse on a specific language (like Swahili) than a smaller model trained only on that language. ...

[Breaking ReLU Barrier: Generalized MoEfication for Dense Pretrained Models 🔗](https://aclanthology.org/2024.emnlp-main.563.pdf)

Breaking the ReLU Barrier: How to Turn Arbitrary Dense Models into Efficient Mixtures-of-Experts

The scale of Large Language Models (LLMs) is exploding. From GPT-4 to Llama, models are getting bigger, smarter, and—crucially—much more expensive to run. The primary culprit for this cost is the dense nature of these architectures: every time you ask a question, every single parameter in the model is activated to calculate the answer. Imagine a library where, to answer a single question, the librarian has to open and read every single book on the shelves. That is a dense model. ...

[Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale 🔗](https://arxiv.org/abs/2407.02118)

Don't Start from Scratch: The Science of Cross-Lingual Continual Pre-Training

If you have ever tried to train a Large Language Model (LLM) from scratch, you know the pain. It requires massive computational resources, vast amounts of data, and a budget that usually only large tech giants possess. But here is a question that has puzzled researchers: if we already have excellent models fluent in English (like LLaMA), why do we burn millions of dollars training new models from scratch just to teach them a new language like Chinese? ...

[Bootstrapped Policy Learning for Task-oriented Dialogue through Goal Shaping 🔗](https://aclanthology.org/2024.emnlp-main.263.pdf)

Building the Ladder as You Climb: How Bootstrapped Policy Learning Solves Hard Dialogue Tasks

Imagine you are trying to teach a computer how to handle complex customer service calls, such as booking a multi-leg flight while simultaneously reserving a hotel and buying tickets for a local attraction. In the world of Artificial Intelligence, specifically Task-Oriented Dialogue (ToD) systems, this is a massive challenge. The standard approach is Reinforcement Learning (RL). The AI talks to a user simulator, tries to fulfill the request, and gets a “reward” (a positive signal) only if it completes the entire task perfectly. If it fails, it gets nothing or a penalty. This is known as the sparse reward problem. It is akin to trying to learn how to play a piano concerto by hitting random keys and only being told “good job” if you accidentally play the whole piece perfectly on the first try. ...
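A small sketch shows why sparse rewards are so hard to learn from. The `sparse_reward` function and the subgoal names below are hypothetical and are not taken from the paper. Failure is scored as 0.0 here, although a penalty would behave the same way for the purposes of this example.

```python
def sparse_reward(completed, required):
    """Return 1.0 only if every required subgoal was completed, else 0.0."""
    return 1.0 if set(required) <= set(completed) else 0.0


required = ["flight", "hotel", "attraction"]

r_none = sparse_reward([], required)                     # no progress at all
r_partial = sparse_reward(["flight", "hotel"], required)  # two of three done
r_full = sparse_reward(["flight", "hotel", "attraction"], required)
```

An agent that books the flight and the hotel receives exactly the same signal as one that accomplishes nothing, so the reward gives it no hint that it was getting closer.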

[Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models? 🔗](https://arxiv.org/abs/2406.11375)

Can AI Teach AI? Using Analogies to Boost Scientific Understanding in Language Models

Imagine trying to explain the structure of an atom to someone who has never taken a physics class. You could recite a textbook definition about protons, neutrons, and electron shells. Or, you could say: “The atom is like a solar system. The nucleus is the sun in the center, and the electrons are planets orbiting around it.” For most learners, the second explanation, the analogy, is the one that clicks. Analogical reasoning is a cornerstone of human cognition. It allows us to map the familiar (the solar system) onto the unfamiliar (the atom), building a bridge to new understanding. ...