Can LLMs Actually Detect Hate Speech? An Analysis of Behavior Patterns and Failures
Imagine you are a content moderator for a social media platform, or a developer building a companionship chatbot for the elderly. You want to ensure that the content your system processes or generates is safe. Naturally, you turn to Large Language Models (LLMs) to help filter out offensive speech. You feed a comment into the model and ask: “Is this text offensive?” You expect a simple “Yes” or “No.” Instead, the model refuses to answer, lectures you on morality, or hallucinates a response that has nothing to do with the question. ...