[Atomic Self-Consistency for Better Long Form Generations 🔗](https://arxiv.org/abs/2405.13131)

Beyond the 'Best' Answer: How Atomic Self-Consistency Merges the Truth to Fix LLM Hallucinations

Large Language Models (LLMs) have revolutionized how we interact with information. We ask them to write code, solve math problems, and explain complex historical events. However, anyone who has used these models extensively knows they have a significant weakness: hallucination. They can sound incredibly confident while stating completely incorrect facts. In recent years, researchers have developed clever ways to mitigate this. One popular method is “consistency checking”—asking the model the same question multiple times and picking the answer that appears most often. This works wonders for math problems where the answer is a single number. But what happens when you ask a long-form question like, “What are the main causes of climate change?” ...

2024-05 · 8 min · 1626 words
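The consistency check this excerpt refers to amounts to a majority vote over several sampled answers. Below is a minimal, hypothetical sketch of that baseline; `sample_answer` is a stand-in for any stochastic LLM call and is not from the paper, which extends the idea to atomic claims inside long-form answers.

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Placeholder for one stochastic LLM call (temperature > 0)."""
    raise NotImplementedError

def self_consistency(question: str, n_samples: int = 10) -> str:
    """Classic self-consistency: sample several answers, return the most frequent.

    This works for short answers (e.g., a single number); long-form answers
    rarely repeat verbatim, which is the gap atomic self-consistency targets.
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```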
[Atomic Inference for NLI with Generated Facts as Atoms 🔗](https://arxiv.org/abs/2305.13214)

Unlocking the Black Box with FGLR: How Generated Facts Make AI Reasoning Transparent

In the rapidly evolving world of Natural Language Processing (NLP), we face a recurring “black box” dilemma. We have models that can read a complex paragraph and accurately answer questions about it, but we rarely know why they chose a specific answer. Imagine a model denying a loan application or flagging a news article as fake. If the model cannot explain its reasoning faithfully, how can we trust it? Today, we are diving into a research paper that tackles this problem head-on. The paper, “Atomic Inference for NLI with Generated Facts as Atoms,” introduces a novel framework called FGLR (Fact-Generated Logical Reasoning). This approach doesn’t just ask an AI to guess an answer; it forces the AI to break the problem down into atomic facts, evaluate each one individually, and build a logical conclusion. ...

2023-05 · 7 min · 1361 words
[ASSISTANTBENCH: Can Web Agents Solve Realistic and Time-Consuming Tasks? 🔗](https://arxiv.org/abs/2407.15711)

Can AI Agents Actually Surf the Web? Introducing AssistantBench and SPA

Imagine you are planning a move to a new city. You need to find a high-rise apartment in a specific neighborhood that sold for a certain price range in 2021. Or perhaps you are a fitness enthusiast visiting New York, and you need to find a gym near Tompkins Square Park that offers classes specifically before 7:00 AM. For a human, these tasks are tedious but straightforward. They require opening a browser, searching for locations, opening multiple tabs (maps, gym websites, schedules), comparing information, and synthesizing an answer. It takes time—minutes, not seconds—and requires “navigation logic.” ...

2024-07 · 9 min · 1744 words
[Assessing "Implicit" Retrieval Robustness of Large Language Models 🔗](https://arxiv.org/abs/2406.18134)

Can LLMs Learn to Ignore Bad Advice? The Case for Implicit Retrieval Robustness

Large Language Models (LLMs) have transformed how we interact with information, but they have a well-known flaw: their knowledge is static. They only know what they were trained on, which means they can’t answer questions about current events or private enterprise data. The standard solution to this problem is Retrieval-Augmented Generation (RAG). In a RAG system, when you ask a question, the system first searches a database for relevant documents (context) and feeds them to the LLM alongside your query. Ideally, the LLM uses this context to generate a precise, up-to-date answer. ...

2024-06 · 8 min · 1499 words
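For orientation, the retrieve-then-generate loop described in this excerpt can be pictured as in the sketch below. This is a hypothetical minimal pipeline, not the paper's setup: `search_database` and `generate` are placeholder functions standing in for any retriever and any LLM.

```python
def search_database(query: str, k: int = 5) -> list[str]:
    """Placeholder retriever: return the k most relevant documents for the query."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder LLM call."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    """Basic RAG: retrieve context, then condition the LLM on it.

    The paper asks what happens when some retrieved passages are irrelevant
    or misleading -- can the model learn to ignore them?
    """
    context = "\n\n".join(search_database(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```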
[Assessing and Verifying Task Utility in LLM-Powered Applications 🔗](https://arxiv.org/abs/2405.02178)

Beyond "Did It Work?": Measuring the True Utility of LLM Apps with AgentEval

The explosion of Large Language Models (LLMs) has shifted the landscape of software development. We are no longer just building chatbots; we are building agents—applications capable of planning, coding, and collaborating to solve complex problems. From solving intricate math equations to managing household logistics in simulated environments, these agents are becoming increasingly autonomous. But this rapid capability growth has created a new bottleneck: Evaluation. How do you know if an LLM application is actually “good”? In traditional software, we have unit tests. Pass or fail. In machine learning, we have accuracy metrics. But for a generative agent helping a human, “success” is nuanced. An agent might solve a math problem correctly but explain it in a confusing, roundabout way. It might complete a household chore but break three other things in the process. ...

2024-05 · 8 min · 1596 words
[ARXIVDIGESTABLES: Synthesizing Scientific Literature into Tables using Language Models 🔗](https://arxiv.org/abs/2410.22360)

Can AI Write Your Literature Review? Inside the ARXIVDIGESTABLES Framework

If you are a student or a researcher, you are likely familiar with the overwhelming sensation of staring at a mountain of papers. The number of scientific publications is growing exponentially. Staying abreast of a field doesn’t just mean reading; it means synthesizing. You have to read dozens of papers, identify common themes, compare methodologies, and contrast results. The gold standard for this synthesis is the Literature Review Table. These are the structured grids found in survey papers where rows represent specific publications and columns represent “aspects” (like Model Architecture, Dataset Size, or Evaluation Metric). Creating these tables is one of the most laborious tasks in academia. It requires not just extracting data, but identifying the schema—the set of aspects that make for a meaningful comparison. ...

2024-10 · 8 min · 1656 words
[Argument Relation Classification through Discourse Markers and Adversarial Training 🔗](https://aclanthology.org/2024.emnlp-main.1054.pdf)

Mastering Arguments with AI: How Discourse Markers and Adversarial Training Improve Relation Classification

Imagine you are reading a transcript of a heated political debate or analyzing a complex legal case. Your brain naturally categorizes the statements being made. When a speaker says, “The project is expensive,” and follows it with, “However, the long-term benefits are undeniable,” you instantly recognize a conflict or an “attack” on the first premise. Conversely, if they say, “Thus, we should proceed,” you recognize support. This ability to map the logical connections between sentences is known as Argument Relation Classification (ARC). It is a fundamental task in Natural Language Processing (NLP) that enables machines to understand not just what is being said, but how arguments are constructed. ...

9 min · 1817 words
[Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions 🔗](https://arxiv.org/abs/2410.02028)

Beyond Chatbots: Unlocking the Hidden Classification Power of Large Language Models

When we think of Large Language Models (LLMs) like GPT-4 or Llama, we usually think of generation. We use them to write emails, debug code, or compose poetry. But there is a massive subset of Natural Language Processing (NLP) where generation takes a back seat to precision: Classification. Can a model designed to chatter actually be a rigorous classifier? ...

2024-10 · 8 min · 1517 words
[Are Large Language Models Capable of Generating Human-Level Narratives? 🔗](https://arxiv.org/abs/2407.13248)

Why AI Stories Feel Flat — A Deep Dive into Narrative Discourse Analysis

We are living in the golden age of automated text generation. With the rise of Large Language Models (LLMs) like GPT-4 and Claude, generating a fluent, grammatically perfect story takes seconds. Yet, if you have ever asked an AI to write a screenplay or a novel, you likely noticed something missing. The text is readable, but the soul of the story often feels hollow. The plot might wander, the emotional stakes feel low, or the ending feels rushed and unearned. ...

2024-07 · 7 min · 1419 words
[Are LLMs Good Zero-Shot Fallacy Classifiers? 🔗](https://arxiv.org/abs/2410.15050)

Can AI Detect Flawed Logic? Investigating Zero-Shot Fallacy Classification with LLMs

“I am a great leader because I make great leadership decisions.” At first glance, that sentence might sound confident. But if you look closer, it’s empty. It’s a classic example of Circular Reasoning—the conclusion is just a restatement of the premise. We encounter defective arguments like this every day. Whether it’s “Appeal to Emotion” in advertisements, “Ad Hominem” attacks in political debates, or “False Dilemmas” in social media comments, logical fallacies are the building blocks of misinformation and manipulation. Detecting them automatically is a crucial task for Natural Language Processing (NLP), but it has historically been very difficult. ...

2024-10 · 8 min · 1571 words
[Are Data Augmentation Methods in Named Entity Recognition Applicable for Uncertainty Estimation? 🔗](https://arxiv.org/abs/2407.02062)

Confidence Check: Can Data Augmentation Fix Overconfidence in NER Models?

Imagine a doctor using an AI assistant to scan medical records for patient allergies. The AI flags “Penicillin” with 99% confidence. The doctor trusts it. But what if the AI misses a rare drug name, or worse, identifies a vitamin as a dangerous allergen with that same 99% confidence? This scenario highlights a critical flaw in modern Deep Neural Networks (DNNs): miscalibration. Modern models are often “overconfident,” assigning high probability scores to predictions even when they are wrong. In safety-critical fields like healthcare, finance, or autonomous driving, accurate predictions aren’t enough—we need to know how much to trust those predictions. ...

2024-07 · 8 min · 1504 words
[ArMeme: Propagandistic Content in Arabic Memes 🔗](https://arxiv.org/abs/2406.03916)

Beyond the Laugh: Detecting Propaganda in Arabic Memes with AI

When you scroll through your social media feed, you likely pause for a meme. It’s a quick laugh—a funny caption overlaying a recognizable image, shared instantly with friends. But memes have evolved into something far more potent than simple internet humor. They have become vehicles for cultural expression, political campaigns, and, increasingly, propaganda. While the English-speaking world has seen significant research into detecting harmful content in memes, other languages have been left behind. This “resource gap” makes the digital world a dangerous place for non-English speakers, where disinformation can spread unchecked by AI filters. ...

2024-06 · 8 min · 1608 words
[Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation 🔗](https://arxiv.org/abs/2406.00787)

Lost in Translation: Does Removing Bias from Word Embeddings Actually Fix Machine Translation?

Imagine typing the following sentence into a translation engine: “The doctor asked the nurse to help her in the procedure.” If you translate this into a language with grammatical gender—like Spanish, German, or Hebrew—the model has to make a choice. Is the doctor male or female? Is the nurse male or female? Historically, Natural Language Processing (NLP) models have relied heavily on stereotypes found in their training data. As a result, they frequently translate “doctor” as male and “nurse” as female, even when the sentence explicitly uses the pronoun “her” to refer to the doctor. ...

2024-06 · 7 min · 1432 words
[Applying Contrastive Learning to Code Vulnerability Type Classification 🔗](https://aclanthology.org/2024.emnlp-main.666.pdf)

Beyond Binary: Categorizing Software Vulnerabilities with Hierarchical Contrastive Learning

In the modern digital landscape, software permeates nearly every aspect of our daily lives. As these systems grow in scale and complexity, so does the variety of security loopholes they contain. For security analysts, the sheer volume of code to review is overwhelming. In 2023 alone, the National Vulnerability Database (NVD) published over 28,900 new Common Vulnerabilities and Exposures (CVE) entries. Disturbingly, over 4,000 of those cases remained unclassified in terms of their specific type for extended periods. ...

8 min · 1654 words
[AppBench: Planning of Multiple APIs from Various Apps for Complex User Instruction 🔗](https://arxiv.org/abs/2410.19743)

Beyond Simple Tools: Can LLMs Master the Chaos of Multi-App Planning?

Imagine asking your digital assistant to plan a weekend getaway. You say: “Find me a train from Portland to Vancouver departing next Saturday, and then book a hotel in Vancouver for two people with a rating of at least 4.2.” To a human, this is a straightforward sequence of tasks. To a Large Language Model (LLM), however, this is a nightmare of dependencies, context switching, and permission management. The model cannot simply “know” the answers. It must interface with the real world using tools—specifically, Application Programming Interfaces (APIs). ...

2024-10 · 9 min · 1880 words
[ApiQ: Finetuning of 2-Bit Quantized Large Language Model 🔗](https://arxiv.org/abs/2402.05147)

Can We Finetune 2-Bit LLMs? Deep Dive into ApiQ

The race to scale Large Language Models (LLMs) has hit a physical wall: GPU memory. With models now routinely exceeding 50 billion parameters, the computational resources required to fine-tune them for specific tasks are astronomical. A 65B parameter model, for instance, is not something you can easily load, let alone train, on a standard consumer GPU. To address this, the community turned to Parameter-Efficient Finetuning (PEFT) and Quantization. Methods like QLoRA (Quantized Low-Rank Adaptation) have become the industry standard, allowing us to freeze a model, compress it to 4 bits, and train a tiny set of adapter parameters. This was a massive leap forward. ...

2024-02 · 8 min · 1571 words
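For context, the QLoRA-style recipe this excerpt contrasts against looks roughly like the sketch below (Hugging Face transformers + peft, 4-bit). The model id and hyperparameters are illustrative, and ApiQ's own contribution, 2-bit quantization with jointly initialized adapters, is not reproduced here.

```python
# Rough QLoRA-style sketch: quantize the base model, freeze it, train small LoRA adapters.
# Illustrative only -- not ApiQ's 2-bit initialization scheme.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (QLoRA default)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)  # only the LoRA adapters remain trainable
model.print_trainable_parameters()
```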
[Annotator-Centric Active Learning for Subjective NLP Tasks 🔗](https://arxiv.org/abs/2404.15720)

Beyond the Gold Label: How to Train AI on Subjective Human Opinions

In the world of Natural Language Processing (NLP), we often cling to a comforting myth: the myth of the “Gold Label.” Imagine you are training an AI to detect hate speech. You show a sentence to three human annotators. Two say it’s offensive; one says it’s satire. In traditional machine learning, we take a majority vote, label the sentence “offensive,” and move on. The dissenting voice is treated as noise—an error to be smoothed over. ...

2024-04 · 11 min · 2258 words
[Annotation alignment: Comparing LLM and human annotations of conversational safety 🔗](https://arxiv.org/abs/2406.06369)

Can AI Judge Safety? Measuring Alignment Between LLMs and Human Annotators

As Large Language Models (LLMs) become central to our digital interactions, the question of “safety” has moved from a theoretical concern to a practical necessity. We rely on these models not just to chat, but increasingly to evaluate the safety of other systems. This creates a recursive loop: AI is being used to police AI. But this raises a fundamental question: Do LLMs actually understand safety the way humans do? ...

2024-06 · 8 min · 1535 words
[Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts 🔗](https://aclanthology.org/2024.emnlp-main.331.pdf)

Can AI Understand How You Feel? Evaluating Vision-Language Models with the Cast of 'Friends'

Emotional Intelligence (EI) is often considered the final frontier for Artificial Intelligence. We have models that can write code, compose poetry, and pass bar exams, but can they understand the subtle sigh of a disappointed friend or the sarcastic eye-roll of a colleague? For a long time, researchers focused on text-based Large Language Models (LLMs) to answer this question. Studies showed that models like GPT-4 possess a surprisingly high “Emotional Quotient” (EQ) when analyzing text. But human communication is rarely just text. It is a complex symphony of words, facial expressions, body language, and environmental context. To truly possess Emotional Intelligence, an AI must see as well as read. ...

9 min · 1883 words
[Analysis of Plan-based Retrieval for Grounded Text Generation 🔗](https://arxiv.org/abs/2408.10490)

Stop Guessing, Start Planning: How Blueprints Solve LLM Hallucinations

We have all seen it happen. You ask a Large Language Model (LLM) to write a biography about a niche author or summarize a recent news event. The output looks perfect—the grammar is flawless, the tone is authoritative, and the structure is logical. But upon closer inspection, you realize the model has invented a university degree the author never earned or cited an award that doesn’t exist. ...

2024-08 · 8 min · 1617 words