[Bayesian Calibration of Win Rate Estimation with LLM Evaluators 🔗](https://arxiv.org/abs/2411.04424)

Judging the Judges—How Bayesian Statistics Fixes LLM Evaluation

If you have played with ChatGPT, Claude, or Llama, you know that evaluating these models is tricky. Unlike a math test, there is no single “correct” answer for writing a poem, summarizing a news article, or chatting about philosophy. For a long time, the gold standard was human evaluation. You would generate two responses and ask a human, “Which one is better?” But human evaluation is slow, expensive, and not scalable. This led to the rise of LLM-as-a-judge: using a strong model (like GPT-4) to evaluate weaker models. It’s fast, cheap, and scales infinitely. ...

2024-11 · 11 min · 2220 words
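
The excerpt above frames evaluation as pairwise “which is better?” judgments, which makes the win rate the natural quantity to estimate. As a point of reference only (a generic Beta-Binomial sketch, not the paper’s calibration procedure), the snippet below turns a batch of judge verdicts into a posterior over the true win rate; the function name, prior values, and tallies are illustrative.

```python
import numpy as np

def win_rate_posterior(judge_wins: int, judge_losses: int,
                       prior_a: float = 1.0, prior_b: float = 1.0,
                       n_samples: int = 10_000) -> np.ndarray:
    """Sample a Beta posterior over the true win rate, treating each
    judge verdict as an independent Bernoulli trial (uniform prior by default)."""
    rng = np.random.default_rng(seed=0)
    return rng.beta(prior_a + judge_wins, prior_b + judge_losses, size=n_samples)

# Hypothetical tally: the judge prefers model A in 60 of 100 pairwise comparisons.
samples = win_rate_posterior(judge_wins=60, judge_losses=40)
lo, hi = np.percentile(samples, [2.5, 97.5])
print(f"posterior mean win rate {samples.mean():.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")
```
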
[BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting 🔗](https://aclanthology.org/2024.emnlp-main.877.pdf)

Hook, Line, and Sinker: How 'BaitAttack' Manipulates LLMs into Breaking Their Own Rules

The rapid adoption of Large Language Models (LLMs) like GPT-4 and Llama-2 has brought with it a continuous arms race between safety alignment and adversarial attacks. We know LLMs are trained to refuse harmful instructions—if you ask a model “How do I build a bomb?”, it will politely decline. This is the “jailbreak” problem: finding a way to bypass these safety filters. Most research in this area focuses on disguise. Attackers wrap harmful queries in elaborate role-playing scenarios or logical puzzles to trick the model. However, a new paper titled “BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting” highlights a critical flaw in current jailbreak methods: Intention Shift. ...

8 min · 1686 words
[Backward Lens: Projecting Language Model Gradients into the Vocabulary Space 🔗](https://arxiv.org/abs/2402.12865)

Peeking into the Brain of a Learning AI: The Backward Lens

If you have been following the explosion of Large Language Models (LLMs) like GPT and Llama, you are likely familiar with the “Forward Pass.” It is the process where the model takes a prompt, processes it through layers of math, and spits out a prediction. We have gotten quite good at analyzing this phase. Tools like the “Logit Lens” allow us to peek into the middle of the model and see what it is “thinking” at layer 12 vs. layer 24. ...

2024-02 · 8 min · 1517 words
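
For readers unfamiliar with the “Logit Lens” the excerpt mentions, here is a minimal sketch of the idea: project each layer’s hidden state through the unembedding matrix and read off the top token. It assumes a GPT-2 checkpoint loaded via Hugging Face transformers; the paper’s Backward Lens applies the same projection idea to gradients, which this snippet does not attempt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding output plus every layer's output,
# each of shape [batch, seq_len, d_model].
for layer_idx, hidden in enumerate(out.hidden_states):
    h_last = model.transformer.ln_f(hidden[0, -1])   # final layer norm, as the lens does
    logits = h_last @ model.lm_head.weight.T         # project into vocabulary space
    top_token = tok.decode([logits.argmax().item()])
    print(f"layer {layer_idx:2d} -> {top_token!r}")
```
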
[Back to School: Translation Using Grammar Books 🔗](https://arxiv.org/abs/2410.15263)

Back to School: Teaching AI to Translate Forgotten Languages Using Grammar Books

If you ask a modern Large Language Model (LLM) like GPT-4 to translate a sentence from English to French, the result is often indistinguishable from a human translation. The model has seen billions of words of French text during its training. It “knows” French. But what happens if you ask that same model to translate a sentence into Chokwe, a Bantu language spoken in Angola? Or Gitksan, an indigenous language from British Columbia? ...

2024-10 · 8 min · 1628 words
[BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment 🔗](https://arxiv.org/abs/2406.12168)

Why Your AI Needs to Stay Close to Its Behavior: A Deep Dive into BPO

Aligning Large Language Models (LLMs) with human values is one of the most critical challenges in modern AI. We want models that are helpful, harmless, and concise. For a long time, the gold standard for this was Reinforcement Learning from Human Feedback (RLHF). However, if you have ever tried to train an RLHF pipeline, you know the pain: it involves training a separate reward model, dealing with complex reinforcement learning instability, and managing significant computational costs. ...

2024-06 · 8 min · 1556 words
[BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training 🔗](https://arxiv.org/abs/2409.04599)

Cleaning Up the Vocabulary: How Picky BPE Removes Junk Tokens to Improve Language Models

Large Language Models (LLMs) are often treated as black boxes, but their foundation lies in a process that is surprisingly simple: tokenization. Before a model can understand “artificial intelligence,” it must break that text down into smaller chunks, or tokens. For years, the industry standard has been Byte-Pair Encoding (BPE), a reliable algorithm that merges frequent characters into subwords. However, reliable doesn’t always mean efficient. Standard BPE has a hoarding problem. It creates and keeps “intermediate” tokens—fragments of words that are necessary to build larger words but are useless on their own. These “junk tokens” clutter the vocabulary, wasting valuable parameters and potentially degrading model performance. ...

2024-09 · 8 min · 1615 words
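
To make the “intermediate token” problem concrete, here is a vanilla BPE training loop on a toy corpus (plain BPE, not the paper’s Picky BPE refinement; corpus and merge count are illustrative). Tokens such as "es", "lo" and "ne" enter the vocabulary only as stepping stones toward "est", "low" and "newest".

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

# Toy corpus: words pre-split into characters, as in classic BPE training.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
learned = []
for _ in range(8):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    vocab = merge_pair(best, vocab)
    learned.append("".join(best))

# Several merges exist only to build longer tokens -- the "junk" Picky BPE targets.
print(learned)  # ['es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi']
```
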
[BMRETRIEVER: Tuning Large Language Models as Better Biomedical Text Retrievers 🔗](https://arxiv.org/abs/2404.18443)

BMRetriever: How to Teach LLMs to Master Biomedical Search

In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and Llama have become household names. We generally think of them as generative engines—tools that write poetry, code, or emails. However, in specialized fields like medicine, the ability to generate text is only half the battle. The other half—and perhaps the more critical half for accuracy—is retrieval. Imagine a doctor needing to find a specific case study regarding a rare side effect of a new drug, or a researcher sifting through millions of papers to find a specific protein interaction. They don’t just need an LLM to hallucinate an answer; they need a system to dig through a massive haystack of medical literature and retrieve the exact needle of truth. This is the foundation of Retrieval-Augmented Generation (RAG). ...

2024-04 · 7 min · 1462 words
[BLSP-Emo: Towards Empathetic Large Speech-Language Models 🔗](https://arxiv.org/abs/2406.03872)

Beyond Words: Teaching AI to Hear Emotions with BLSP-Emo

Have you ever told a friend, “I’m fine,” but your tone clearly screamed that you were anything but? A good friend picks up on that tone immediately. They don’t just process the word “fine”; they process the pitch, the hesitation, and the heaviness in your voice—the paralinguistic cues—and respond with empathy. Now, imagine saying that same phrase to a standard AI assistant. It processes the text “I am fine,” interprets the semantics literally, and likely responds with, “That is good to hear.” The interaction feels cold and robotic because the emotional context is lost in the conversion from speech to text. ...

2024-06 · 8 min · 1676 words
[BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models 🔗](https://arxiv.org/abs/2406.17092)

Unmasking the Trojan Horse: How BEEAR Secures LLMs Against Hidden Safety Backdoors

In the rapidly evolving world of Large Language Models (LLMs), safety is paramount. We spend immense resources on Reinforcement Learning from Human Feedback (RLHF) and safety alignment to ensure models refuse to build bombs or generate hate speech. However, a sinister vulnerability lurks beneath this veneer of safety: the safety backdoor. Imagine an LLM that behaves perfectly during testing. It refuses harmful queries politely and follows instructions helpfully. But, if a user includes a specific, hidden string of text—a “trigger”—the model suddenly sheds its safety guardrails and complies with malicious requests. This is the problem of the “Deceptively Safety-aligned Backdoored LLM.” ...

2024-06 · 7 min · 1409 words
[BC-Prover: Backward Chaining Prover for Formal Theorem Proving 🔗](https://aclanthology.org/2024.emnlp-main.180.pdf)

Thinking Backwards: How BC-Prover Teaches LLMs to Solve Formal Math

Large Language Models (LLMs) have demonstrated an uncanny ability to generate code, summarize history, and even write poetry. Yet, when it comes to rigorous mathematical reasoning—specifically Formal Theorem Proving—they often hit a wall. In the world of formal math, “almost correct” is the same as “wrong.” A proof must be logically sound, step-by-step, and verifiable by a computer program. While standard LLMs try to solve these problems by guessing the next step based on intuition (a process called forward chaining), human mathematicians often work differently. They look at the goal and ask, “What do I need to prove first to make this goal true?” This is called backward chaining. ...

9 min · 1808 words
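
As a toy illustration of backward chaining (generic, not the paper’s BC-Prover, which targets machine-checkable formal proofs), the sketch below proves a goal by recursively asking which premises would establish it. All rule and fact names are invented.

```python
# Each rule maps a conclusion to lists of premises that would establish it.
RULES = {
    "goal":    [["lemma_a", "lemma_b"]],
    "lemma_a": [["axiom_1"]],
    "lemma_b": [["axiom_2", "axiom_3"]],
}
FACTS = {"axiom_1", "axiom_2", "axiom_3"}

def prove(goal: str, depth: int = 0) -> bool:
    """Backward chaining: succeed if `goal` is a known fact, otherwise pick a
    rule that concludes `goal` and recursively prove each of its premises."""
    pad = "  " * depth
    if goal in FACTS:
        print(f"{pad}{goal}: known fact")
        return True
    for premises in RULES.get(goal, []):
        print(f"{pad}{goal}: reduce to {premises}")
        if all(prove(p, depth + 1) for p in premises):
            return True
    print(f"{pad}{goal}: stuck, no applicable rule")
    return False

print(prove("goal"))  # prints the proof-search trace, then True
```
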
[Autoregressive Pre-Training on Pixels and Texts 🔗](https://arxiv.org/abs/2404.10710)

Reading Without Words — How PixelGPT Teaches AI to "See" Language

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and Llama 2 are the undisputed kings. They write code, compose poetry, and answer complex queries. But underneath the hood, these models share a common constraint: Tokenization. Before an LLM sees your text, a “tokenizer” chops sentences into discrete numbers (tokens). While efficient, this process strips away the visual richness of language. It struggles with complex PDFs, non-standard layouts, and “visually rich” text like emojis or mixed scripts. Furthermore, tokenization creates a “vocabulary bottleneck”—if a word or character isn’t in the model’s pre-defined dictionary, the model struggles to process it. ...

2024-04 · 7 min · 1468 words
[Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards 🔗](https://arxiv.org/abs/2409.17472)

Grading the Grader: How Reinforcement Learning and QWK Are Revolutionizing Automated Essay Scoring

Grading an essay is subjective, nuanced, and exhausting. A teacher doesn’t just look at a paper and say “8 out of 10.” They evaluate structure, vocabulary, grammar, and content simultaneously. Automated Essay Scoring (AES) systems attempt to replicate this, but they have historically faced a significant technical hurdle: a mismatch between how they are trained and how they are evaluated. Most AES systems are trained to minimize simple error margins (like Mean Squared Error), but they are evaluated in the real world using a metric called Quadratic Weighted Kappa (QWK). QWK measures how well the AI agrees with a human grader, heavily penalizing large discrepancies. The problem? QWK is mathematically “non-differentiable,” meaning you can’t easily use it to train a neural network using standard backpropagation. ...

2024-09 · 9 min · 1773 words
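
Since the excerpt leans on Quadratic Weighted Kappa, here is a small reference implementation of the standard metric itself (nothing specific to the paper). Building the observed matrix from discrete score assignments is exactly what makes QWK awkward to use directly as a training loss. The example scores are made up.

```python
import numpy as np

def quadratic_weighted_kappa(human, model, n_labels: int) -> float:
    """Standard QWK between two raters whose scores lie in {0, ..., n_labels - 1}."""
    observed = np.zeros((n_labels, n_labels))
    for h, m in zip(human, model):
        observed[h, m] += 1                      # discrete counts: no gradients here
    # Expected agreement under independent raters with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    idx = np.arange(n_labels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_labels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Made-up scores on a 0-4 scale.
human_scores = [3, 2, 4, 1, 3, 0]
model_scores = [3, 2, 3, 0, 4, 1]
print(quadratic_weighted_kappa(human_scores, model_scores, n_labels=5))
```
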
[Automatically Generated Definitions and their utility for Modeling Word Meaning 🔗](https://aclanthology.org/2024.emnlp-main.776.pdf)

LlamaDictionary: When LLMs Become Dynamic Dictionaries

In the world of Natural Language Processing (NLP), we have a bit of an interpretability problem. When a state-of-the-art model processes a word, it converts it into a vector—a long string of numbers representing that word in a high-dimensional geometric space. While these vectors (embeddings) are mathematically powerful, they are opaque to humans. If you look at a vector, you can’t “read” what the model thinks the word means. ...

7 min · 1468 words
[Automatic sentence segmentation of clinical record narratives in real-world data 🔗](https://aclanthology.org/2024.emnlp-main.1156.pdf)

Breaking Boundaries - How to Segment Sentences in Messy Clinical Data

If you have ever tried to process text data for a Natural Language Processing (NLP) project, you know that the very first step is often the most deceptive. Before you can perform sentiment analysis, named entity recognition (NER), or machine translation, you have to answer a simple question: Where does one sentence end and the next one begin? In formal writing—like a novel or a news article—this is trivial. You look for a period, a question mark, or an exclamation point, and you split the text. But what happens when the text is written by a doctor rushing between patients? What if the text is a stream of consciousness, a list of vital signs, or a fragmented note like “pt sedated no response to verbal stimuli”? ...

9 min · 1865 words
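
A quick way to see the problem the excerpt describes: a naive punctuation-based splitter handles formal prose fine and does nothing useful on punctuation-free clinical shorthand. The regex and the extended note text are illustrative, not from the paper or its dataset.

```python
import re

def naive_split(text: str) -> list[str]:
    """Split on whitespace that follows terminal punctuation -- fine for formal prose."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_split("The patient was admitted on Monday. She improved overnight."))
# -> ['The patient was admitted on Monday.', 'She improved overnight.']

# Invented clinical-style shorthand with no terminal punctuation at all:
note = "pt sedated no response to verbal stimuli vitals stable plan wean sedation in am"
print(naive_split(note))
# -> the whole note comes back as one "sentence"; there is nothing to split on
```
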
[Automatic Instruction Evolving for Large Language Models 🔗](https://arxiv.org/abs/2406.00770)

Beyond Human Heuristics - How Auto Evol-Instruct Automates Dataset Creation

If you have been following the explosion of Large Language Models (LLMs), you know that the secret sauce isn’t just the sheer number of parameters—it’s the data: specifically, instruction tuning. This is the process that turns a raw text predictor into a helpful assistant capable of following complex commands. To get good performance, you need high-quality, complex instruction datasets. But here lies the bottleneck: creating these datasets by hand is unscalable and expensive. Recently, the “Evol-Instruct” method (popularized by models like WizardLM) proposed a solution: use an LLM to rewrite simple instructions into complex ones. ...

2024-06 · 7 min · 1453 words
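
For context, here is a minimal sketch of the base Evol-Instruct loop the excerpt refers to: an LLM is repeatedly prompted to rewrite an instruction into a harder one. The prompt wording and the call_llm stub are placeholders (swap in a real client); the paper’s Auto Evol-Instruct goes further by automating the design of the evolution step itself, which this sketch does not attempt.

```python
EVOLVE_PROMPT = (
    "Rewrite the instruction below into a more complex version, for example by "
    "adding constraints, multi-step reasoning, or an unusual format requirement, "
    "while keeping it answerable.\n\nInstruction: {instruction}"
)

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; replace with your client of choice."""
    return "[evolved] " + prompt.rsplit("Instruction: ", 1)[-1]

def evolve(instruction: str, rounds: int = 3) -> list[str]:
    """Iteratively rewrite an instruction into progressively harder variants."""
    trajectory = [instruction]
    for _ in range(rounds):
        evolved = call_llm(EVOLVE_PROMPT.format(instruction=trajectory[-1]))
        trajectory.append(evolved.strip())
    return trajectory

print(evolve("Write a short poem about the sea."))
```
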
[Automated Essay Scoring: A Reflection on the State of the Art 🔗](https://aclanthology.org/2024.emnlp-main.991.pdf)

Beyond the Scoreboard: Why Automated Essay Scoring Needs a New Direction

Imagine a high school English teacher sitting at their desk on a Sunday evening. In front of them is a stack of 150 essays on “The Effects of Computers on Society.” Grading each one takes at least 10 minutes. That is 25 hours of work—just for one assignment. This scenario is the driving force behind Automated Essay Scoring (AES). For over 50 years, researchers have been chasing the holy grail of Natural Language Processing (NLP): a system that can read a student’s essay and instantly assign a score that matches what a human expert would give. ...

11 min · 2151 words
[AUTOSCRAPER: A Progressive Understanding Web Agent for Web Scraper Generation 🔗](https://arxiv.org/abs/2404.12753)

Building Better Bots: How AUTOSCRAPER Uses LLMs to Automate Web Scraping

Data is the lifeblood of modern research and business analytics. Whether it’s tracking competitor prices, aggregating news, or building datasets for machine learning, the ability to extract structured data from the web—web scraping—is a critical skill. However, anyone who has tried to build a web scraper knows the pain. Websites change structure, HTML tags are messy, and maintaining a scraper for hundreds of different sites is a logistical nightmare. Traditionally, we have had two choices: spend hours manually coding rules for every single website, or pay a fortune to have Large Language Models (LLMs) parse every single page individually. Neither is scalable. ...

2024-04 · 8 min · 1499 words
[AutoPersuade: A Framework for Evaluating and Explaining Persuasive Arguments 🔗](https://arxiv.org/abs/2410.08917)

Decoding Persuasion: How the AutoPersuade Framework Uses Causal Inference to Build Better Arguments

How do you change someone’s mind? For centuries, this question was the domain of rhetoricians, politicians, and philosophers. In the internet age, it became the domain of A/B testing. Companies and political campaigns generate hundreds of message variations, show them to thousands of people, and keep the ones that get the most clicks or donations. But there is a flaw in the A/B testing approach: it tells you which message won, but it rarely tells you why. Was it the tone? The specific vocabulary? The appeal to emotion versus logic? Without understanding the “why,” generating the next successful message is just a guessing game. ...

2024-10 · 9 min · 1824 words
[Attribute or Abstain: Large Language Models as Long Document Assistants 🔗](https://arxiv.org/abs/2407.07799)

Trust Issues: Teaching LLMs to Cite Sources in Long Documents

We live in an era where Large Language Models (LLMs) can summarize a book or analyze a legal contract in seconds. However, for anyone using these tools for serious research or work, a nagging question remains: Can I trust this? LLMs are notorious for hallucinations—generating plausible-sounding but completely incorrect information. When you are using an LLM as a “long document assistant”—for example, asking it to extract specific clauses from a 50-page PDF—accuracy is non-negotiable. To build trust, we need the model to do two things better: Attribute (provide evidence for its claims) or Abstain (admit when the answer isn’t there). ...

2024-07 · 7 min · 1485 words
[Attribute Diversity Determines the Systematicity Gap in VQA 🔗](https://arxiv.org/abs/2311.08695)

More Data Isn't the Answer: Unlocking Systematic Generalization in AI with Attribute Diversity

Imagine you have taught a child what a “red ball” looks like and what a “blue cube” looks like. If you then show them a “red cube” or a “blue ball,” they will likely identify it immediately. This ability to understand new combinations of familiar concepts is called systematicity, or compositional generalization. It is a fundamental cornerstone of human intelligence. We don’t need to see every possible combination of color and shape in the universe to understand how they fit together. ...

2023-11 · 9 min · 1771 words