![Cover image](https://deep-paper.org/en/paper/2406.12168/images/cover.png)
Why Your AI Needs to Stay Close to Its Behavior: A Deep Dive into BPO
Aligning Large Language Models (LLMs) with human values is one of the most critical challenges in modern AI. We want models that are helpful, harmless, and concise. For a long time, the gold standard for this was Reinforcement Learning from Human Feedback (RLHF). However, if you have ever tried to run an RLHF pipeline, you know the pain: it requires training a separate reward model, wrestling with the instability of reinforcement learning, and absorbing significant computational cost. ...
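To make the moving parts the excerpt mentions concrete, here is a minimal schematic sketch of that two-stage RLHF setup: first fit a separate reward model on preference pairs, then optimize a policy against it while penalizing drift from a frozen reference. Everything below is an illustrative assumption (tiny linear "models", toy tensors, an MSE stand-in for the KL penalty), not the BPO paper's implementation.

```python
# Schematic two-stage RLHF sketch. All models and data are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # toy feature size standing in for an LLM's hidden state

# Stage 1: train a separate reward model on preference pairs
# (the chosen response should score higher than the rejected one).
reward_model = nn.Linear(DIM, 1)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
chosen, rejected = torch.randn(64, DIM), torch.randn(64, DIM)
for _ in range(100):
    margin = reward_model(chosen) - reward_model(rejected)
    rm_loss = -F.logsigmoid(margin).mean()  # Bradley-Terry preference loss
    rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()

# Stage 2: RL-style fine-tuning of the policy against the frozen reward
# model, with a proximity penalty keeping it near a frozen reference policy.
policy = nn.Linear(DIM, DIM)
reference = nn.Linear(DIM, DIM)
reference.load_state_dict(policy.state_dict())
for p in list(reward_model.parameters()) + list(reference.parameters()):
    p.requires_grad_(False)

pol_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
prompts = torch.randn(64, DIM)
beta = 0.1  # strength of the proximity (KL-like) penalty
for _ in range(100):
    response = policy(prompts)
    reward = reward_model(response).mean()
    proximity = F.mse_loss(response, reference(prompts))  # stand-in for a KL term
    pol_loss = -reward + beta * proximity
    pol_opt.zero_grad(); pol_loss.backward(); pol_opt.step()
```

The separate reward-model fit, the second optimization loop, and the proximity penalty are exactly the sources of cost and instability the paragraph above complains about, which is the setup BPO and related preference-optimization methods aim to simplify.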