[Perceptions of Linguistic Uncertainty by Language Models and Humans 🔗](https://arxiv.org/abs/2407.15814)

When "Probable" Means "True": How LLMs Struggle with Theory of Mind

We use vague words every day. When you tell a friend, “It is likely to rain tomorrow,” or “It is doubtful I’ll make the party,” you aren’t reporting the result of a precise mathematical calculation; you are expressing a fuzzy degree of belief. Remarkably, despite the lack of precision, humans are generally on the same page. We instinctively know that “likely” represents a higher probability than “possible,” but a lower probability than “almost certain.” ...
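
To make that comparison concrete, here is a minimal sketch of the elicitation idea: assign each uncertainty word a number and check that the ordering matches intuition. The reference values and the `model.complete` helper below are illustrative assumptions, not the paper's data or API.

```python
# Hypothetical reference lexicon: illustrative numbers, not the paper's data.
HUMAN_REFERENCE = {
    "almost certain": 0.95,
    "likely": 0.75,
    "possible": 0.45,
    "doubtful": 0.20,
}

def elicit_probability(model, word: str) -> float:
    """Hypothetical helper: ask a model what probability `word` conveys."""
    reply = model.complete(
        f'On a scale from 0 to 1, what probability does "{word}" convey? '
        "Answer with a single number."
    )
    return float(reply.strip())

# Sanity check: "likely" should land between "possible" and "almost certain".
assert (HUMAN_REFERENCE["possible"]
        < HUMAN_REFERENCE["likely"]
        < HUMAN_REFERENCE["almost certain"])
```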

2024-07 · 9 min · 1830 words
[PepRec: Progressive Enhancement of Prompting for Recommendation 🔗](https://aclanthology.org/2024.emnlp-main.995.pdf)

Can LLMs Master Collaborative Filtering? A Deep Dive into PepRec

In the rapidly evolving landscape of Artificial Intelligence, two giants stand tall but rarely hold hands: Deep Learning Recommendation Models (DLRMs) and Large Language Models (LLMs). DLRMs are the silent engines behind your TikTok feed, your Amazon suggestions, and your Netflix homepage. They excel at “Collaborative Filtering”—predicting what you might like based on the mathematical patterns of millions of other users. However, they are often “black boxes”; they can tell you what to watch, but they can rarely explain why in human terms. ...
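
For readers new to the term, here is a toy collaborative-filtering sketch (illustrative only, not PepRec itself): predict a user's rating for an unseen item from the ratings of the most similar users.

```python
import numpy as np

# Toy user-item rating matrix: rows = users, columns = items, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def predict(user: int, item: int) -> float:
    # Weight every other user's rating of `item` by their similarity to `user`.
    others = [v for v in range(len(ratings)) if v != user]
    sims = np.array([cosine(ratings[user], ratings[v]) for v in others])
    item_ratings = np.array([ratings[v, item] for v in others])
    mask = item_ratings > 0  # only count users who actually rated the item
    return float(sims[mask] @ item_ratings[mask] / (sims[mask].sum() + 1e-9))

print(predict(user=0, item=2))  # user 0 never rated item 2
```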

9 min · 1900 words
[Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification 🔗](https://arxiv.org/abs/2407.02352)

Busting Visual Hallucinations: How Pelican Uses Python to Fact-Check AI Vision Models

Imagine asking an AI to describe a photo of your living room. The model confidently replies, “There is a red vintage motorcycle parked next to the coffee table.” You look at the photo again. There is no motorcycle. There is just a red potted plant. This phenomenon is known as hallucination. It is one of the most persistent and dangerous problems facing Large Visual Language Models (LVLMs) today. While these models have become incredibly good at chatting about images, they have a bad habit of making things up—fabricating objects, misidentifying colors, or describing relationships that simply don’t exist. ...
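
For intuition about how code can fact-check a caption, here is a toy sketch in the spirit of claim decomposition plus programmatic verification; it is not Pelican's actual pipeline, and the set-based "detector" is a stand-in for a real object detector or visual grounding module.

```python
# Toy "image": a set of detected object labels standing in for a detector.
def detect_objects(image: set[str], label: str) -> bool:
    return label in image  # stand-in for an actual detector call

def verify_caption(image: set[str], claims: list[str]) -> dict[str, bool]:
    # Each atomic claim is checked independently by code, not by the LVLM.
    return {claim: detect_objects(image, claim) for claim in claims}

# "A red vintage motorcycle next to the coffee table" decomposes into
# object-level claims; the detector only sees a plant, a table, and a sofa.
photo = {"potted plant", "coffee table", "sofa"}
print(verify_caption(photo, ["motorcycle", "coffee table"]))
# -> {'motorcycle': False, 'coffee table': True}
```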

2024-07 · 10 min · 1974 words
[Pcc-tuning: Breaking the Contrastive Learning Ceiling in Semantic Textual Similarity 🔗](https://arxiv.org/abs/2406.09790)

Breaking the Glass Ceiling: How Pcc-tuning Unlocks the Limits of Contrastive Learning in NLP

If you have been following the progress of Natural Language Processing (NLP), particularly in the realm of Sentence Embeddings, you might have noticed a trend. We have moved from simple word vectors like GloVe to sophisticated transformer-based models like BERT, and now to massive Large Language Models (LLMs) like LLaMA and Mistral. Sentence embeddings are the backbone of modern AI applications. They convert text into numerical vectors, allowing computers to understand that “The cat sits on the mat” is semantically similar to “A feline is resting on the rug.” This technology powers everything from Google Search to Retrieval-Augmented Generation (RAG) systems. ...
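
The similarity claim above is easy to reproduce. A minimal sketch, assuming the off-the-shelf `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint (neither is specific to this paper):

```python
from sentence_transformers import SentenceTransformer, util

# Encode the example sentences into dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "The cat sits on the mat",
    "A feline is resting on the rug",
    "The stock market fell sharply today",
])

# Cosine similarity: the paraphrase pair scores far higher than the outlier.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (same meaning)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low (unrelated)
```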

2024-06 · 9 min · 1779 words
[Paraphrase Types Elicit Prompt Engineering Capabilities 🔗](https://arxiv.org/abs/2406.19898)

It’s Not What You Ask, It’s How You Ask: The Science of Paraphrasing Prompts

“It’s not what you say, it’s how you say it.” This age-old adage usually applies to human relationships, implying that tone and delivery matter as much as the message itself. Surprisingly, this rule applies just as strictly to Large Language Models (LLMs). If you have ever spent hours tweaking a prompt for ChatGPT or LLaMA—changing a word here, adding a “please” there—you have engaged in the often frustrating art of prompt engineering. We know that slight variations in instructions can lead to wildly different outputs, but until recently, this process has been largely based on intuition and trial-and-error. ...

2024-06 · 8 min · 1591 words
[Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks 🔗](https://arxiv.org/abs/2401.02731)

How to Turn Dense LLMs into Efficient Mixture-of-Experts with PESC

Large Language Models (LLMs) like GPT-4 and Llama 3 have become the de facto “experts” in natural language processing. Their ability to handle complex linguistic patterns is largely due to their massive scale. The prevailing wisdom, known as the scaling law, suggests that to get smarter models, we simply need to make them bigger. However, there is a catch. As models grow, the computational cost to train and fine-tune them skyrockets. This is particularly true for Instruction Tuning—the phase where a pre-trained model is refined to follow human instructions across various domains like math, coding, and biology. ...

2024-01 · 8 min · 1613 words
[PAIRDISTILL: Pairwise Relevance Distillation for Dense Retrieval 🔗](https://arxiv.org/abs/2410.01383)

Beyond Points: How Pairwise Comparisons Are Revolutionizing Search AI

When you type a query into a search engine, you expect relevant results instantly. Behind the scenes, however, there is a constant tug-of-war between speed and accuracy. Modern Information Retrieval (IR) systems often rely on a two-step process to manage this trade-off: a fast “retriever” that finds a broad set of candidate documents, followed by a slower, more precise “reranker” that sorts them. For years, researchers have tried to make the fast retriever smarter so that it relies less on the heavy computational cost of the reranker. The standard approach is Knowledge Distillation: teaching the fast model to mimic the scores of the smart model. ...
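
The pairwise idea in the title can be sketched in a few lines. This captures the flavor of distilling a teacher reranker's pairwise preferences into a student retriever, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

# A pairwise reranker emits p_teacher = P(d_i more relevant than d_j | query);
# the student retriever's scores are pushed to reproduce that preference.
def pairwise_distill_loss(score_i: torch.Tensor,
                          score_j: torch.Tensor,
                          p_teacher: torch.Tensor) -> torch.Tensor:
    p_student = torch.sigmoid(score_i - score_j)  # student's implied preference
    return F.binary_cross_entropy(p_student, p_teacher)

loss = pairwise_distill_loss(torch.tensor([2.1]), torch.tensor([1.4]),
                             p_teacher=torch.tensor([0.9]))
```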

2024-10 · 8 min · 1625 words
[PTD-SQL: Partitioning and Targeted Drilling with LLMs in Text-to-SQL 🔗](https://arxiv.org/abs/2409.14082)

How to Teach LLMs Like Human Students: The PTD-SQL Framework

Imagine you are studying for a difficult math exam. You open your textbook, but instead of just reading every page in order, you notice the chapters are divided by topic: Geometry, Algebra, Calculus, and Statistics. When you struggle with a specific type of geometry problem, you don’t practice by solving calculus equations. Instead, you perform targeted drilling—you find a set of geometry problems, study the specific formulas required for them, and practice until you master that category. ...

2024-09 · 9 min · 1774 words
[PSC: Extending Context Window of Large Language Models via Phase Shift Calibration 🔗](https://arxiv.org/abs/2505.12423)

Calibrating the Compass: How Phase Shift Calibration Extends LLM Context Windows

Imagine trying to summarize a dense novel, but you can only hold ten pages in your memory at any given time. By the time you reach chapter three, chapter one is gone. This is the fundamental struggle of Large Language Models (LLMs) dealing with limited context windows. While models like GPT-4 and LLaMA-2 have revolutionized Natural Language Processing (NLP), their ability to process massive inputs—like entire books or legal repositories—is constrained by their “context window.” ...

2025-05 · 8 min · 1683 words
[Prompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling 🔗](https://arxiv.org/abs/2402.08702)

Beyond One-Shot: How PROMST Masters Multi-Step Prompt Engineering

If you have ever worked with Large Language Models (LLMs) like GPT-4 or Claude, you are intimately familiar with the “dark art” of prompt engineering. You tweak a word here, add a constraint there, and cross your fingers that the model outputs what you want. While this trial-and-error process is manageable for simple tasks—like summarizing an email or solving a math problem—it becomes a nightmare when building autonomous agents. Imagine an LLM controlling a robot in a warehouse or a web agent navigating a complex e-commerce site. These are multi-step tasks. They require planning, long-horizon reasoning, and adherence to strict environmental constraints. ...

2024-02 · 9 min · 1793 words
[PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection 🔗](https://aclanthology.org/2024.emnlp-main.1166.pdf)

Can AI Debate Its Way to Better Decisions? Solving the Subjectivity of Hate Speech

If you ask five different people to define “hate speech,” you will likely get five slightly different answers. One person might focus on slurs, another on historical context, and a third on the intent of the speaker. Now, imagine training an Artificial Intelligence model to detect hate speech. If the model is trained on data labeled by the first person, it might fail to recognize the concerns of the second. This is the fundamental problem of generalization in Natural Language Processing (NLP). Models become experts in the specific “rulebook” of their training data but crumble when faced with data annotated under different guidelines. ...

8 min · 1704 words
[PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals 🔗](https://aclanthology.org/2024.emnlp-main.711.pdf)

Beyond Roleplay: How PATIENT-Ψ Uses Cognitive Models to Train the Next Generation of Therapists

Mental health is one of the most critical public health challenges of our time. With one in eight people globally living with mental health conditions, the demand for qualified care far outstrips the supply. However, training a mental health professional is not merely a matter of reading textbooks and passing exams. It requires mastering the subtle, complex, and often unpredictable art of human interaction. Traditionally, therapy training relies on two extremes: static textbook case studies, which are often too “clean” and perfect, and role-playing exercises with peers, which can feel awkward or unrealistic. Trainees eventually move on to real patients, but this transition is often described as a “trial by fire.” Novice therapists must learn to identify deep-seated psychological patterns while managing the delicate emotions of a person in distress—all without causing harm. ...

8 min · 1607 words
[PARIKSHA: Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data 🔗](https://arxiv.org/abs/2406.15053)

PARIKSHA: Uncovering the Truth About Multilingual LLM Evaluation

In the rapidly evolving world of Large Language Models (LLMs), benchmarks are the compass by which we navigate progress. We look at leaderboards to see which model is “smarter,” “faster,” or “safer.” However, there is a glaring blind spot in this landscape: linguistic and cultural diversity. Most standard benchmarks are English-centric. When multilingual benchmarks do exist, they often suffer from two critical flaws. First, test set contamination: because popular benchmarks are available on the web, models often ingest the questions during training, effectively memorizing the answers. Second, lack of cultural nuance: many benchmarks are simply English questions translated into other languages, losing the local context, idioms, and cultural values that define true fluency. ...

2024-06 · 7 min · 1421 words
[PANDA: Persona Attributes Navigation for Detecting and Alleviating Overuse Problem in Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.670.pdf)

TMI! Why LLMs Share Too Much and How the PANDA Framework Fixes It

Imagine you are chatting with a new acquaintance. You mention that you enjoy reading mystery novels. A normal response might be, “Oh, I love those too! Who is your favorite author?” Now imagine the acquaintance responds: “I love reading too! I am a 35-year-old accountant living in Chicago. I have three cats named Mittens, Oreo, and Luna. I suffer from anxiety and I go to the gym every Tuesday at 6 PM.” ...

9 min · 1723 words
[PALM: Few-Shot Prompt Learning for Audio Language Models 🔗](https://arxiv.org/abs/2409.19806)

Beyond Hand-Crafted Prompts: Optimizing Audio-Language Models with PALM

In the rapidly evolving world of Artificial Intelligence, multimodal models—systems that can understand and process multiple types of data like text, images, and audio—are breaking new ground. Just as Vision-Language Models (VLMs) like CLIP revolutionized computer vision by connecting images to natural language, Audio-Language Models (ALMs) are doing the same for sound. These models allow for Zero-Shot Audio Recognition. Imagine playing a sound clip of a dog barking to an AI model that has never been explicitly trained to classify “dog barks.” Instead, you simply provide the text “A recording of a dog,” and the model matches the audio features to the text features, correctly identifying the sound. ...
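
The matching step described above can be written schematically. The encoders here are hypothetical stand-ins for a pretrained ALM's audio and text towers, not PALM itself:

```python
import numpy as np

# CLIP/CLAP-style zero-shot classification: embed the clip, embed one text
# prompt per class, and pick the closest prompt by cosine similarity.
def zero_shot_classify(audio_emb: np.ndarray,
                       class_names: list[str],
                       text_encoder) -> str:
    prompts = [f"A recording of a {c}" for c in class_names]
    text_embs = np.stack([text_encoder(p) for p in prompts])
    sims = text_embs @ audio_emb / (
        np.linalg.norm(text_embs, axis=1) * np.linalg.norm(audio_emb))
    return class_names[int(np.argmax(sims))]  # best-matching class prompt
```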

2024-09 · 7 min · 1373 words
[Overcome Noise and Bias: Segmentation-Aided Multi-Granularity Denoising and Debiasing for Enhanced Quadruples Extraction in Dialogue 🔗](https://aclanthology.org/2024.emnlp-main.49.pdf)

Taming the Chaos: How to Extract Sentiment Quadruples from Messy Dialogues without Noise and Bias

Sentiment analysis has come a long way from simply classifying a movie review as “positive” or “negative.” In the era of granular data analytics, we are interested in Aspect-Based Sentiment Analysis (ABSA). We don’t just want to know if a user is happy; we want to know what they are happy about, which specific feature they like, and what opinion words they used. This brings us to the Sentiment Quadruple: a structured set of four elements: ...
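
As a concrete illustration of what a quadruple can look like as a data structure, here is a sketch using the (target, aspect, opinion, sentiment) formulation common in dialogue-level ABSA; the element names are an assumption for illustration, since the excerpt is truncated before listing them.

```python
from dataclasses import dataclass

# Assumed DiaASQ-style quadruple; element names are illustrative.
@dataclass
class SentimentQuadruple:
    target: str     # the entity under discussion, e.g. "Pixel 7"
    aspect: str     # the feature of that entity, e.g. "battery life"
    opinion: str    # the opinion expression, e.g. "lasts forever"
    sentiment: str  # polarity: "positive" / "negative" / "neutral"

quad = SentimentQuadruple("Pixel 7", "battery life", "lasts forever", "positive")
```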

9 min · 1770 words
[Outcome-Constrained Large Language Models for Countering Hate Speech 🔗](https://arxiv.org/abs/2403.17146)

Beyond Politeness: Teaching AI to De-escalate Hate Speech

If you have spent any time in the comment sections of social media platforms like Reddit or X (formerly Twitter), you know how quickly conversations can spiral into toxicity. Hate speech remains a persistent challenge in online communities, threatening healthy discourse and driving users away. For years, researchers have been developing automated methods to generate “counterspeech”—direct responses designed to refute or neutralize hate speech. The logic is simple: if we can automate the moderation process or assist human moderators with suggested replies, we can scale up the fight against toxicity. ...

2024-03 · 8 min · 1700 words
[Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding 🔗](https://arxiv.org/abs/2402.13720)

Ouroboros: Breaking the Speed Limit of LLMs with Phrase-Based Speculative Decoding

Large Language Models (LLMs) have revolutionized how we interact with information, but they suffer from a persistent bottleneck: latency. If you have ever watched ChatGPT type out an answer word by word, you have experienced the limitations of autoregressive decoding. Because every new token depends on the previous one, models must generate output sequentially. This process is slow and computationally inefficient, leaving expensive GPUs idle while waiting for memory access. ...
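
Speculative decoding, which Ouroboros builds on, follows a simple draft-then-verify loop. A schematic sketch of that generic recipe, with `draft_model` and `target_model` as hypothetical stand-ins (Ouroboros's phrase-level candidate pool is not shown):

```python
# One round of draft-then-verify speculative decoding.
def speculative_step(prefix: list[int], draft_model, target_model,
                     k: int = 4) -> list[int]:
    # 1. The small model cheaply drafts k tokens, one by one.
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_model.next_token(draft))
    # 2. The large model scores all k drafted positions in ONE parallel pass
    #    and reports how many tokens it agrees with.
    accepted = target_model.verify(prefix, draft[len(prefix):])
    # 3. Keep the longest agreeing prefix; the rest is regenerated next round.
    return prefix + draft[len(prefix):len(prefix) + accepted]
```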

2024-02 · 7 min · 1438 words
[Order of Magnitude Speedups for LLM Membership Inference 🔗](https://arxiv.org/abs/2409.14513)

Auditing LLMs for Privacy: How to Slash the Cost of Membership Inference Attacks

Large Language Models (LLMs) are trained on massive datasets scraped from the internet, often containing sensitive personal information, proprietary code, or copyrighted works. This creates a significant privacy risk: these models can “memorize” their training data. If an adversary can query an LLM and determine whether a specific document was part of its training set, they have successfully mounted a Membership Inference Attack (MIA). For organizations deploying LLMs, auditing these models for privacy leaks is crucial. However, the current “gold standard” for auditing—training “shadow models”—is prohibitively expensive. It requires training multiple copies of LLMs just to test one. ...
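
For context, the cheapest membership test is a simple loss threshold. A sketch of that classic baseline (not the paper's shadow-model speedup), with `model.loss` as a hypothetical call returning the average per-token negative log-likelihood:

```python
import math

# Training documents tend to receive lower loss (higher fluency) than
# unseen ones, so a low perplexity is weak evidence of membership.
def is_member(model, document: str, threshold: float) -> bool:
    perplexity = math.exp(model.loss(document))
    return perplexity < threshold  # suspiciously fluent => likely a member
```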

2024-09 · 8 min · 1676 words
[Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning 🔗](https://aclanthology.org/2024.emnlp-main.565.pdf)

Juggling Act: How 'Fast RL' Balances Conflicting Goals in LLM Training

Reinforcement Learning from Human Feedback (RLHF) is the secret sauce behind the success of modern Large Language Models (LLMs) like ChatGPT and Llama. It’s the process that turns a raw, text-predicting engine into a helpful assistant. But there is a hidden complexity in this process: we rarely want an AI to do just one thing. We want models to be helpful and harmless. We want them to be creative and factual. We want them to be concise and complete. ...
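
The naive way to juggle several reward models is a fixed weighted sum, which is exactly where fairness and stability problems creep in: streams on different scales silently dominate. A minimal sketch of that baseline setting (not the paper's method), with per-batch z-normalization as one simple mitigation:

```python
import numpy as np

# Combine several reward streams into one RLHF training signal.
def compose_rewards(reward_streams: dict[str, np.ndarray],
                    weights: dict[str, float]) -> np.ndarray:
    total = np.zeros_like(next(iter(reward_streams.values())))
    for name, r in reward_streams.items():
        r_norm = (r - r.mean()) / (r.std() + 1e-8)  # put streams on one scale
        total += weights[name] * r_norm
    return total

batch = {"helpfulness": np.array([0.2, 1.5, 0.9]),
         "harmlessness": np.array([3.0, 2.8, 3.3])}
signal = compose_rewards(batch, {"helpfulness": 0.5, "harmlessness": 0.5})
```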

7 min · 1426 words