[Dream 7B: Diffusion Large Language Models 🔗](https://arxiv.org/abs/2508.15487)

Beyond Left-to-Right: Introducing Dream 7B, a Powerful New Diffusion LLM

For years, large language models (LLMs) have relied on a single fundamental idea: autoregression. Models such as GPT-4, LLaMA, and Qwen generate text one word at a time, moving from left to right—much like how a person might write a sentence. This approach has driven remarkable progress, but it also carries inherent limitations. When a model can only see the past, it struggles with tasks requiring global consistency, long-term planning, or satisfying complex constraints. ...

2025-08 · 7 min · 1473 words
[Linear Transformers are Versatile In-Context Learners 🔗](https://arxiv.org/abs/2402.14180)

Beyond Gradient Descent: How Transformers Discover Their Own Optimization Algorithms

Transformers have taken the world by storm, powering everything from ChatGPT to advanced code completion tools. One of their most magical abilities is in-context learning (ICL) — the power to learn from examples provided in their input prompt, without any weight updates. If you show a large language model a few examples of a task, it can often perform that task on a new example instantly. For a long time, how this works has been a bit of a mystery. Recent research has started to peel back the layers, suggesting that for simple tasks like linear regression, transformers internally run a form of gradient descent. Each attention layer acts like a single optimization step, refining an internal “solution” based on the data in the prompt. ...
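
To make that picture concrete, here is a small NumPy sketch (a toy illustration of the idea, not the paper's construction) of what "one layer as one optimization step" means: each pass performs a single gradient-descent step on the least-squares loss defined by the in-context examples, and the prediction for a held-out query improves layer by layer.

```python
# Toy sketch: one "layer" behaves like one gradient-descent step on an
# in-context linear regression task (illustrative, not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)        # hidden linear rule the prompt encodes
X = rng.normal(size=(8, 3))        # in-context examples (the "prompt")
y = X @ w_true
x_query = rng.normal(size=3)       # query the model must answer

def gd_step(w, lr=0.1):
    """One gradient step on the squared loss over the prompt examples."""
    grad = X.T @ (X @ w - y) / len(X)
    return w - lr * grad

w = np.zeros(3)                    # the model's implicit "solution"
for layer in range(12):            # each iteration ~ one attention layer
    w = gd_step(w)
    print(f"layer {layer + 1:2d}  query error = {abs(x_query @ (w - w_true)):.4f}")
```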

2024-02 · 7 min · 1311 words
[Transformers are RNNs - Fast Autoregressive Transformers with Linear Attention 🔗](https://arxiv.org/abs/2006.16236)

Making Transformers Fly - A Deep Dive into Linear Attention

Since their introduction in the landmark 2017 paper Attention Is All You Need, Transformers have taken the world of AI by storm. Models like BERT, GPT-3, and DALL·E have revolutionized natural language processing, computer vision, and beyond. They are the engines behind the generative AI boom—capable of writing code, creating art, and holding surprisingly coherent conversations. But these powerful models have a costly secret: a computational bottleneck that has, until recently, put a hard limit on how much information they can handle at once. The core of the Transformer, the self-attention mechanism, has a computational and memory complexity of \(O(N^2)\), where \(N\) is the length of the input sequence. This means that if you double the length of the text or the number of pixels in an image you’re processing, the cost doesn’t just double—it quadruples. For very long sequences—like high-resolution images, lengthy documents, or audio clips—this quadratic scaling becomes prohibitively expensive. ...
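
As a rough illustration of where the quadratic cost comes from, and how linear attention sidesteps it, here is a simplified NumPy sketch: the softmax version materializes an N-by-N weight matrix, while the linearized version (using the feature map phi(x) = elu(x) + 1 suggested in the paper) reassociates the products so that only d-by-d summaries are ever formed. This is a non-causal, single-head simplification, not the paper's full implementation.

```python
# Rough sketch contrasting the two attention formulations (single head, no masking).
import numpy as np

def softmax_attention(Q, K, V):
    # Materializes an N x N matrix: O(N^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, the feature map used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Reassociates (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V): linear in N.
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                    # d x d summary, independent of N
    z = Kp.sum(axis=0)               # normalizer
    return (Qp @ kv) / ((Qp @ z)[:, None] + eps)

N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```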

2020-06 · 8 min · 1560 words
[Test-Time Training with Self-Supervision for Generalization under Distribution Shifts 🔗](https://arxiv.org/abs/1909.13231)

Don’t Just Test — Train! Adapting to New Data on the Fly with Self-Supervision

You’ve spent weeks training a state-of-the-art image classifier. It achieves near-perfect accuracy on your test set, and you’re ready to deploy it. But when it encounters real-world data—a blurry photo from an old phone, an image taken on a foggy day, or a frame from a shaky video—its performance drops dramatically. Sound familiar? This is one of the most persistent challenges in machine learning: the distribution shift. Models that excel on clean, curated training data often buckle when faced with test data that, while semantically similar, has different statistical properties. The standard machine learning paradigm assumes that training and test data are drawn from the same independent and identically distributed (i.i.d.) source—an assumption that the real world frequently violates. ...

2019-09 · 8 min · 1587 words
[Beyond Model Adaptation at Test Time: A Survey 🔗](https://arxiv.org/abs/2411.03687)

When Models Meet Reality: The Ultimate Guide to Test‑Time Adaptation

Machine learning models are powerful pattern detectors. They excel when the world at test time looks like the world they saw during training. But in practice the world rarely cooperates. A self-driving car trained on sunny roads will struggle in snow. A medical imaging model trained in one hospital may fail on data from another. This mismatch—known as distribution shift—is one of the biggest obstacles to reliable, real-world AI. The traditional remedy is retraining: collect new data and update the model. But that’s often impractical or slow. Test-Time Adaptation (TTA) takes a different view: let the model adapt on the fly during inference. The recent survey “Beyond Model Adaptation at Test Time” organizes over 400 papers and shows that adaptation is not just about fine-tuning model weights. Researchers adapt many components of the prediction pipeline: the model, the inference procedure, normalization layers, the input samples themselves, and even the prompts fed to large foundation models. ...

2024-11 · 12 min · 2453 words
[Learning to (Learn at Test Time): RNNs with Expressive Hidden States 🔗](https://arxiv.org/abs/2407.04620)

RNNs Are Back? How Making Hidden States into Learners Unlocks Long-Context Potential

Recurrent Neural Networks (RNNs) have long lived in the shadow of Transformers. Transformers dominate modern sequence modeling because they can effectively use long contexts—predicting future tokens becomes easier the more history they have to condition on. The drawback is their quadratic complexity, which makes them slow and memory-hungry for long sequences. RNNs, in contrast, have linear complexity, but have historically struggled to take advantage of more context. As the 2020 OpenAI scaling law paper famously showed, classic RNNs like LSTMs could not scale or leverage long-context data as Transformers did. Modern RNN architectures such as Mamba seemed to be closing that gap, until researchers discovered that even the best RNNs plateau when sequence lengths grow very long. ...

2024-07 · 8 min · 1637 words
[TTT-UNet: Enhancing U-Net with Test-Time Training Layers for Biomedical Image Segmentation 🔗](https://arxiv.org/abs/2409.11299)

Beyond Static Models: How TTT-UNet Adapts on the Fly for Superior Medical Image Segmentation

Introduction: The Challenge of Seeing the Whole Picture

In medical diagnostics, clarity is everything. Medical image segmentation—the process of outlining organs, tissues, or cells in medical imagery—is central to understanding disease progression and guiding surgical decisions. Over the past decade, Convolutional Neural Networks (CNNs), particularly the famed U-Net architecture, have been instrumental in achieving precise segmentation across numerous applications. Yet despite their success, CNNs have a key limitation: they “see” the world through small, localized windows known as kernels. That makes them excellent at capturing fine textures but poor at understanding the global structure of images—the large-scale relationships between distant regions. Imagine trying to comprehend a whole-body CT scan through a magnifying glass. You’d see details beautifully, but miss how organs connect. ...

2024-09 · 7 min · 1350 words
[Unexpected Benefits of Self-Modeling in Neural Systems 🔗](https://arxiv.org/abs/2407.10188)

The Self-Awareness Paradox: How Teaching Neural Networks to Model Themselves Makes Them Simpler

What if making an AI self-aware didn’t just help it understand itself—but fundamentally changed it for the better? In cognitive science, we’ve long known that humans rely on self-models: our body schema that tracks limbs in space, and our metacognition, the ability to think about our own thoughts. Such predictive self-models help the brain control and adapt its behavior. But what happens when we give a similar ability to neural networks? ...

2024-07 · 7 min · 1446 words

Supercharge Your Transformer: How One Gradient Step at Test Time Makes In-Context Learning Way More Efficient

Introduction: The Adaptation Puzzle

Large language models (LLMs) and other foundation models have revolutionized AI. Their most striking ability is in-context learning (ICL)—you can show a model a few examples of a new task right in the prompt, and it can often figure out how to solve it without updating its internal weights. It’s like a student learning from a handful of practice problems just before an exam. But what happens when the exam questions are particularly tricky or on a topic the student barely studied? The model, like the student, might stumble. This is a classic case of distribution shift: the test data look different from the training data, and the pre-trained model fails to generalize. ...

10 min · 1991 words
[DeepPrune: Parallel Scaling without Inter-trace Redundancy 🔗](https://arxiv.org/abs/2510.08483)

Wasted Work: How DeepPrune Slashes LLM Reasoning Costs by Over 80%

Large Language Models (LLMs) have become remarkably good at complex reasoning tasks—solving advanced math problems, writing structured code, and answering graduate-level science questions. One of the central techniques powering this intelligence is parallel scaling, where a model generates hundreds of independent reasoning paths (or Chains of Thought, CoT) for the same problem and selects the most consistent final answer—typically through majority voting. Think of it as a giant brainstorming session: the model explores dozens of ways to solve a problem, then decides which solution seems most reliable. ...
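
The excerpt above describes the standard parallel-scaling baseline, so a minimal sketch of majority voting over sampled traces may help; `sample_trace` and `extract_answer` below are hypothetical stand-ins for an actual LLM call and answer parser, not part of DeepPrune itself.

```python
# Minimal sketch of majority voting (self-consistency) over parallel reasoning traces.
from collections import Counter

def sample_trace(problem: str, seed: int) -> str:
    """Placeholder for one independently sampled chain of thought ending in an answer."""
    answers = ["42", "42", "41", "42", "40"]
    return f"...reasoning about {problem}... Final answer: {answers[seed % len(answers)]}"

def extract_answer(trace: str) -> str:
    """Pull the final answer out of a trace."""
    return trace.rsplit("Final answer:", 1)[-1].strip()

def majority_vote(problem: str, n_traces: int = 5) -> str:
    answers = [extract_answer(sample_trace(problem, i)) for i in range(n_traces)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))   # -> "42"
```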

2025-10 · 7 min · 1480 words
[Vibe Checker: Aligning Code Evaluation with Human Preference 🔗](https://arxiv.org/abs/2510.07315)

More Than Just Correct: Why Your AI Coding Assistant Needs a 'Vibe Check'

If you’ve ever used an AI coding assistant like GitHub Copilot, you’ve probably engaged in what researchers now call “vibe coding.” You don’t just ask for code once—you have a conversation. You might start with a basic request, then refine it: “Okay, that works, but can you rewrite it using a for loop instead of recursion?” or “Add some comments, and make sure all lines are under 80 characters.” You keep tweaking until the code not only functions but also feels right. It passes your personal “vibe check.” That feeling of “rightness” goes beyond logic—it includes readability, consistency with project style, minimal edits, and following nuanced, non-functional requests. ...

2025-10 · 8 min · 1524 words
[Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs 🔗](https://arxiv.org/abs/2406.09136)

Beyond Chain-of-Thought: How CPO Makes LLMs Smarter Without Slowing Them Down

Large Language Models (LLMs) have become remarkably adept at tackling complex problems—from writing code to solving intricate math questions. A cornerstone of this success is their ability to “think out loud” through Chain-of-Thought (CoT) reasoning. By generating intermediate steps, an LLM can break down a problem logically and arrive at a more accurate solution. However, CoT has a fundamental limitation: it’s like someone blurting out the first train of thought that comes to mind. The reasoning follows a single, linear path that might not always be the best or most logical one. This can lead to errors in tasks requiring deeper, multi-step deliberation. ...

2024-06 · 8 min · 1518 words
[Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy 🔗](https://arxiv.org/abs/2410.09873)

The Secret to Faster Diffusion Models: How AdaptiveDiffusion Skips Steps Intelligently

Diffusion models like Midjourney, Stable Diffusion, and Sora have transformed how we create digital art, videos, and realistic images from simple text prompts. They power a new generation of creative tools—but they share one major limitation: speed. Generating a single high-resolution image with a model like SDXL can take tens of seconds, making real-time or interactive applications cumbersome. Why are they so slow? It all comes down to their core mechanism. Diffusion models start from pure noise (think of TV static) and gradually refine this noise into a coherent image through dozens or even hundreds of steps. At each step, a large neural network—called the noise predictor—estimates how much noise remains to be removed. Running this heavy network repeatedly dominates the computation time. ...
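
As a schematic of the loop described above (not the paper's step-skipping method), the sketch below runs a toy denoising loop in which a placeholder `noise_predictor` stands in for the large network whose repeated invocation dominates generation time.

```python
# Schematic denoising loop; `noise_predictor` is a stand-in for the heavy network.
import numpy as np

rng = np.random.default_rng(0)

def noise_predictor(x, t):
    """Placeholder for the large neural network (e.g., a U-Net) called at every step."""
    return 0.1 * x  # toy estimate of the remaining noise

def generate(shape=(64, 64, 3), num_steps=50):
    x = rng.normal(size=shape)                   # start from pure noise
    for t in reversed(range(num_steps)):         # each step removes a bit of noise
        predicted_noise = noise_predictor(x, t)  # the expensive call
        x = x - predicted_noise                  # simplified update rule
    return x

image = generate()
print(image.shape, float(image.std()))
```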

2024-10 · 6 min · 1251 words
[Robust Sparse Regression with Non-Isotropic Designs 🔗](https://arxiv.org/abs/2410.23937)

Taming Two Adversaries: A Breakthrough in Robust Sparse Regression

Introduction: The Messy Reality of Big Data

Linear regression is one of the foundations of modern statistics and machine learning. The idea is simple: fit a line (or a plane) that best captures the relationship between input variables and an output. But simplicity ends where real-world data begins — and real data is rarely clean or low-dimensional. In practice, we deal with enormous, high-dimensional datasets that are messy, noisy, and often contain outliers. ...

2024-10 · 10 min · 1973 words
[Random Policy Enables In-Context Reinforcement Learning within Trust Horizons 🔗](https://arxiv.org/abs/2410.19982)

Unlocking In-Context Reinforcement Learning with Random Data — A Deep Dive into State-Action Distillation (SAD)

Foundation models like GPT have demonstrated an astonishing ability called in-context learning—the capacity to adapt to new tasks purely from examples, without updating any model parameters. This breakthrough has reshaped modern machine learning across language, vision, and multimodal domains. Now, researchers are extending this power to decision-making systems, spawning a new frontier known as In-Context Reinforcement Learning (ICRL). The goal is simple but ambitious: build a pretrained agent that can enter a new, unseen environment and quickly learn how to act optimally by using its recent experiences—state, action, and reward tuples—as contextual hints. No gradient updates, no fine-tuning—just pure inference-driven learning. ...

2024-10 · 8 min · 1558 words
[Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision 🔗](https://arxiv.org/abs/2411.16579)

Training an LLM to Be Its Own Toughest Critic

Large Language Models (LLMs) have become astonishingly good at sounding human, but when it comes to complex, multi-step reasoning—say, solving a tricky math problem or debugging a program—they often stumble. One common fix is to simply give the model more “thinking time” during inference: let it generate multiple answers and choose the best. The problem? If the model isn’t good at judging its own work, it will just produce lots of wrong answers faster. Like asking a student who doesn’t understand algebra to solve a hundred equations—they’ll make the same mistakes over and over. ...

2024-11 · 9 min · 1773 words
[A Survey of Few-Shot Learning on Graphs: from Meta-Learning to Pre-Training and Prompt Learning 🔗](https://arxiv.org/abs/2402.01440)

Learning from Scraps – A Deep Dive into Few-Shot Learning on Graphs

Graph-structured data is everywhere. From social networks connecting billions of users to intricate molecular structures and vast knowledge graphs, our world is built on relationships. Graph Neural Networks (GNNs) have become the go-to tool for learning from this data, powering everything from recommendation engines to drug discovery. But these powerful models have an insatiable appetite — they thrive on data, particularly labeled data. They achieve state-of-the-art performance only when fed a mountain of labeled examples. What happens when those labels are scarce? What if you’re dealing with an emerging category, a rare disease, or a brand-new user on your platform? ...

2024-02 · 8 min · 1577 words
[A Tutorial on Meta-Reinforcement Learning 🔗](https://arxiv.org/abs/2301.08028)

Learning to Learn: A Deep Dive into Meta‑Reinforcement Learning

Meta-reinforcement learning (meta-RL) asks a deceptively simple question: instead of hand-designing how an agent learns, can we learn the learning process itself from data? Put another way, instead of designing a single algorithm to solve one task, can we design an algorithm that itself becomes a data-driven learning procedure—so that when faced with a new task it adapts rapidly and efficiently? ...

2023-01 · 11 min · 2132 words
[Domain Generalization through Meta-Learning: A Survey 🔗](https://arxiv.org/abs/2404.02785)

Learning to Generalize: How Meta-Learning Is Cracking the Code of Domain Generalization

Deep learning models are incredible. They can identify cats in photos, translate languages in real-time, and even help doctors diagnose diseases. But they have a critical weakness: they are often brittle. Train a model on pristine, studio-quality images, and it might fail spectacularly when shown a blurry, real-world photo taken on a smartphone. This is the out-of-distribution (OOD) problem, and it’s one of the biggest hurdles to building truly reliable and adaptive AI. ...

2024-04 · 9 min · 1735 words
[A Survey to Recent Progress Towards Understanding In-Context Learning 🔗](https://arxiv.org/abs/2402.02212)

How Do LLMs Learn on the Fly? A Deep Dive into In-Context Learning

Large Language Models (LLMs) like GPT-4 and Claude have a seemingly magical ability: show them just a few examples of a task in a prompt—say, two labeled sentences or snippets of code—and they can instantly perform that task on new data. This capability, known as In-Context Learning (ICL), allows them to translate languages, analyze sentiment, or even write algorithms with just a handful of demonstrations, all without any updates to their underlying weights. ...

2024-02 · 7 min · 1437 words