[Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts 🔗](https://arxiv.org/abs/2410.20763)

Beyond "Explain Like I'm 5": How LLMs Can Help Us Read Complex Scientific Text

Have you ever tried to read a research paper from a field you aren’t familiar with? Maybe you are a computer scientist trying to parse a biology paper, or a sociologist reading about quantum mechanics. You likely encountered a sentence where you understood the grammar, but a specific term—like “arbitrary-precision arithmetic” or “hyperaemia”—stopped you in your tracks. When this happens, you might open a new tab and search for the definition. But standard definitions can be dry, disconnected from the text, or just as confusing as the original term. ...

2024-10 · 7 min · 1473 words
[Evaluating Diversity in Automatic Poetry Generation 🔗](https://arxiv.org/abs/2406.15267)

Beyond the Turing Test: Is AI Poetry Actually Creative or Just Repetitive?

Artificial Intelligence has stormed the castle of creativity. From DALL-E generating surrealist art to ChatGPT penning sonnets, the line between human and machine creativity is blurring. But when you ask an LLM to write a poem, is it actually being creative? Or is it simply a “stochastic parrot,” reshuffling lines it memorized during training? For years, the gold standard for evaluating AI art has been the Turing Test: Can a human tell if this poem was written by a machine? If the answer is “no,” we assume the model is successful. ...

2024-06 · 8 min · 1604 words
[Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets 🔗](https://arxiv.org/abs/2311.08662)

Breaking and Fixing Language Models: A Guide to Concurrent Robustness

Imagine you are using a large language model (LLM) to summarize a financial report. The model works perfectly. Then, you fix a small typo in the input data—changing “5000” to “5,000” or correcting a misspelled company name. Suddenly, the model’s output flips completely. It contradicts its previous summary. This scenario highlights a critical vulnerability in modern NLP: brittleness. While Language Models (LMs) display impressive capabilities, they are often “black boxes” that are highly sensitive to minor input perturbations. A model might understand a sentence perfectly, but if you add a double negative or swap a word for a synonym, the model crashes. ...

2023-11 · 7 min · 1447 words
[Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works 🔗](https://arxiv.org/abs/2404.12726)

Do LLMs Truly Understand Fictional Characters? The Art of AI Character Profiling

If you have ever played around with a “Role-Playing Agent” (RPA)—an AI chatbot designed to act like Harry Potter, Sherlock Holmes, or a character from your favorite anime—you might have been impressed by its ability to mimic their speech style. But have you ever wondered: does the AI actually understand the character? Or is it merely parroting catchphrases and surface-level traits? As Large Language Models (LLMs) like GPT-4 and Claude 3 continue to evolve, the demand for sophisticated RPAs is skyrocketing. However, ensuring these agents truly grasp the depth of a character—their complex relationships, evolving personalities, and hidden motivations—remains a massive challenge. ...

2024-04 · 9 min · 1765 words
[Error Analysis of Multilingual Language Models in Machine Translation: A Case Study of English-Amharic Translation 🔗](https://aclanthology.org/2024.emnlp-main.1102.pdf)

Lost in Translation: Can AI Master the Amharic Language?

Imagine you are traveling through Ethiopia. You want to read a local news article, translate a street sign, or communicate with a local vendor in Amharic. You pull out your phone and type the sentence into a translation app. The app churns for a second and spits out a translation. You assume it’s correct. But what if the app just translated the name of the Prime Minister into “Al-Qaeda”? What if it translated a request for a soft drink into a statement about drug smuggling? ...

10 min · 1934 words
[Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia 🔗](https://arxiv.org/abs/2410.04254)

Beyond Ctrl+F: How AI Solves the 'Entity Insertion' Problem in Wikipedia

Imagine you are editing a Wikipedia article about a 1950s actress. You want to add a link to the page for “Private School” because it is relevant to her early life. You scan the text. The words “Private School” do not appear anywhere in the article. What do you do? You don’t just give up. You write a new sentence—perhaps, “She also worked at a private school”—and insert it into the biography. ...

2024-10 · 9 min · 1824 words
[Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration 🔗](https://arxiv.org/abs/2410.01285)

Unmasking the Black Box: How to Accurately Trace LLM Knowledge Back to Source Data

Large Language Models (LLMs) like LLaMA and Qwen have revolutionized how we interact with information. They draft emails, write code, and summarize complex texts with eerie proficiency. However, these models operate as massive “black boxes.” When an LLM generates a specific fact—or worse, a hallucination—it is notoriously difficult to pinpoint exactly which document in its massive training dataset taught it that specific piece of information. This problem is not just an academic curiosity. It is central to issues of data copyright, fairness, and safety. If a model generates hate speech or plagiarizes a protected work, developers need to know the source. ...

2024-10 · 8 min · 1510 words
[Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic 🔗](https://arxiv.org/abs/2402.14798)

Right for the Right Reasons: Teaching AI to Argue Like a Human Using Informal Logic

Imagine asking a student to explain why gravity keeps the Moon in orbit. If they reply, “Because the Moon is made of cheese,” and then somehow circle “Gravity” on the multiple-choice test, they get the right answer, but their reasoning is catastrophic. In the world of Artificial Intelligence, Large Language Models (LLMs) are that student. They are incredibly good at selecting the correct answer, but when asked to show their work—to generate a chain of reasoning that leads to that answer—they often hallucinate, use irrelevant facts, or descend into circular logic. ...

2024-02 · 8 min · 1599 words
[Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic 🔗](https://aclanthology.org/2024.emnlp-main.515.pdf)

Breaking the Bottleneck: How LLM Critics Solve the Sparse Reward Problem in Reinforcement Learning

If you have followed the explosion of Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the concept of Reinforcement Learning from Human Feedback (RLHF). It is the secret sauce that turns a raw, unruly text predictor into a helpful assistant. By using reinforcement learning (RL), we can align models with complex human preferences that are difficult to write down as simple code. However, there is a fundamental inefficiency at the heart of this process. In a typical RLHF setup, the model generates an entire paragraph or response, and only then does it receive a reward signal (a score indicating how good the response was). ...
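To make the sparse-versus-dense contrast concrete, here is a minimal sketch (not the paper's actual implementation) in which a hypothetical `critic_score` function stands in for an LLM critic rating each sentence of a response:

```python
# Minimal sketch: sparse terminal reward vs. dense per-segment rewards.
# `critic_score` is a hypothetical stand-in for an LLM critic; the paper's
# actual scoring interface may differ.

def critic_score(segment: str) -> float:
    """Placeholder: an LLM critic would rate this segment's quality."""
    return min(1.0, len(segment.split()) / 20)  # dummy heuristic

def sparse_reward(response: str, final_score: float) -> list[float]:
    # Classic RLHF: a single scalar reward, delivered only at the end.
    segments = response.split(". ")
    return [0.0] * (len(segments) - 1) + [final_score]

def dense_rewards(response: str) -> list[float]:
    # Critic-based setup: every segment gets its own feedback signal.
    return [critic_score(seg) for seg in response.split(". ")]

response = "The report shows revenue grew 12%. Costs fell slightly. Outlook is positive."
print(sparse_reward(response, final_score=0.8))  # [0.0, 0.0, 0.8]
print(dense_rewards(response))                   # one score per sentence
```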

9 min · 1858 words
[Enhancing Post-Hoc Attributions in Long Document Comprehension via Coarse Grained Answer Decomposition 🔗](https://arxiv.org/abs/2409.17073)

Fixing the Trust Gap: How Coarse-Grained Decomposition Improves AI Citations

In the rapidly evolving world of Generative AI, trust is the new currency. We all marvel at the fluency of Large Language Models (LLMs) like GPT-4 or Claude, but a persistent shadow hangs over their output: hallucinations. When an AI answers a complex question based on a long document, how do we know it isn’t making things up? The industry standard solution is attribution—citing sources. Just like a student writing a thesis, an AI should point to the exact sentence in a source document that supports its claims. However, this is easier said than done, especially when the AI generates a long, complex answer that synthesizes multiple facts. ...

2024-09 · 10 min · 2042 words
[Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs 🔗](https://arxiv.org/abs/2410.06581)

Solving the Legal Data Bottleneck: How to Train AI Judges with Synthetic Data

If you have ever tried to search for a specific legal precedent, you know it is not as simple as Googling a recipe. Legal Case Retrieval (LCR) is a high-stakes, complex task where a judge or lawyer inputs a fact description to find historically relevant cases. The goal is judicial fairness: similar cases should receive similar judgments. To achieve this, legal professionals need tools that can dig through millions of documents to find the right precedent. However, training Artificial Intelligence to do this is notoriously difficult. ...

2024-10 · 8 min · 1549 words
[Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding 🔗](https://arxiv.org/abs/2406.13230)

Inside the Mind of the Model: Improving LLM Truthfulness with ACTCAB and CODEC

Large Language Models (LLMs) are often compared to confident students who, when they don’t know an answer, prefer to make up a plausible-sounding lie rather than admit ignorance. This phenomenon, known as hallucination, remains one of the most significant hurdles in deploying LLMs for high-stakes applications like healthcare, law, or finance. The core of the problem isn’t just that models make mistakes; it’s that they are often miscalibrated. A perfectly calibrated model would have a confidence score that matches its accuracy—if it says “I am 80% sure,” it should be correct 80% of the time. Unfortunately, modern LLMs tend to be overconfident, assigning high probabilities even to complete fabrications. ...
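As a concrete illustration of what “calibrated” means (a standard check, not the paper's ACTCAB method itself), here is a small sketch that bins predictions by confidence and compares each bin's average confidence with its accuracy:

```python
# Sketch of a standard calibration check (expected calibration error, ECE).
# Illustrative only; ACTCAB's actual confidence estimation works from model activations.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |confidence - accuracy| across confidence bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A well-calibrated model saying "0.8" should be right about 80% of the time.
print(expected_calibration_error([0.9, 0.9, 0.8, 0.6], [True, False, True, True]))
```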

2024-06 · 10 min · 1929 words
[Enhancing Language Model Alignment: A Confidence-Based Approach to Label Smoothing 🔗](https://aclanthology.org/2024.emnlp-main.1189.pdf)

Smoothing the Path to Alignment: How Confidence-Aware Label Smoothing Improves DPO

The training of Large Language Models (LLMs) has evolved into a sophisticated three-stage pipeline: Pretraining (learning the language), Supervised Fine-Tuning (learning the task), and Reinforcement Learning from Human Feedback (RLHF). While the first two stages build capability, the third stage—RLHF—is arguably the most critical for safety and utility. It aligns the model with human values, ensuring the AI is helpful rather than harmful. Recently, Direct Preference Optimization (DPO) has emerged as a popular alternative to traditional RLHF methods like PPO. DPO simplifies the process by treating alignment as a classification problem. However, like many classification tasks, DPO suffers from noisy data. Humans don’t always agree on which response is better, leading to inconsistencies in the training labels. ...
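For orientation, here is a minimal sketch of a DPO-style loss with plain, fixed label smoothing; the paper's confidence-based approach would, roughly speaking, replace the constant `eps` with a per-example value derived from label confidence (the exact formulation is in the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_label_smoothing(policy_chosen_logps, policy_rejected_logps,
                                  ref_chosen_logps, ref_rejected_logps,
                                  beta=0.1, eps=0.1):
    """Standard DPO loss with a fixed label-smoothing term `eps`.

    A confidence-aware variant would set eps per example based on how
    reliable the preference label is; here eps is just a constant.
    """
    # Implicit reward margin between the chosen and rejected responses.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # With probability eps the preference label is assumed flipped, hence the second term.
    loss = -(1 - eps) * F.logsigmoid(logits) - eps * F.logsigmoid(-logits)
    return loss.mean()

# Toy example with log-probabilities summed over each response's tokens.
loss = dpo_loss_with_label_smoothing(
    torch.tensor([-12.0]), torch.tensor([-15.0]),
    torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```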

9 min · 1708 words
[Enhancing High-order Interaction Awareness in LLM-based Recommender Model 🔗](https://arxiv.org/abs/2409.19979)

Bridging the Gap: How ELMRec Teaches LLMs to Understand User-Item Graphs

Large Language Models (LLMs) have revolutionized how we interact with information. From writing code to composing poetry, their reasoning capabilities are undeniable. Naturally, researchers have been eager to apply this power to Recommender Systems. After all, if an LLM can understand the semantics of a movie review, surely it can predict what movie you want to watch next, right? The answer is “yes, but…” While LLMs are fantastic at processing text, they often struggle with the fundamental structure of recommendation data: the Interaction Graph. A recommendation dataset isn’t just a list of sentences; it is a complex web connecting users to items, and users to other users. When we force this graph data into a linear text prompt for an LLM, we lose a massive amount of “high-order” information—the subtle ripples of influence that travel through the network. ...

2024-09 · 9 min · 1728 words
[Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research 🔗](https://arxiv.org/abs/2410.03545)

The Duplicate Dilemma: Why Your Social Media Dataset Might Be Lying to You

In the world of Natural Language Processing (NLP) and Computational Social Science (CSS), we are often obsessed with the “State of the Art.” We chase higher F1 scores and accuracy percentages, celebrating every fractional increase on the leaderboard. But what if those high scores are an illusion? What if our models aren’t actually learning to understand language, but are simply memorizing repeated data points hidden within our training sets? ...

2024-10 · 8 min · 1699 words
[Enhancing Advanced Visual Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2409.13980)

Seeing Through Text: How CVR-LLM Unlocks Complex Visual Reasoning

Artificial Intelligence has made massive strides in seeing the world. Modern models can easily identify a cat in a photo or tell you that a car is red. This is known as visual perception. However, if you show an AI a picture of a person ironing a sandwich and ask, “Why is this funny?”, traditional models often fall apart. They might see the iron and the sandwich, but they fail to grasp the absurdity of the situation. This is the challenge of complex visual reasoning. ...

2024-09 · 8 min · 1603 words
[Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback 🔗](https://arxiv.org/abs/2410.11009)

NIFTY: How Rejected Smart Replies Can Supercharge AI Writing

Have you ever opened an email or a chat message and seen those little “Smart Reply” bubbles at the bottom of the screen? They offer quick, canned responses like “Sounds good!” or “I’ll take a look.” Sometimes, they are helpful. But often, they are completely off the mark. You ignore them and start typing your own response manually. In the world of AI research, that moment—where you look at the suggestions and decide not to click them—is usually treated as a dead end. It is a failed interaction. However, researchers Benjamin Towle and Ke Zhou from the University of Nottingham and Nokia Bell Labs see it differently. They view that rejection as a valuable signal. By ignoring the suggestions, you have implicitly told the system what you don’t want to say. ...

2024-10 · 3 min · 617 words
[Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation 🔗](https://arxiv.org/abs/2402.13331)

Better Together: How Aggregating Detectors Solves NMT Hallucinations

Neural Machine Translation (NMT) has revolutionized how we communicate. From Google Translate to advanced enterprise tools, these systems have become staples of modern interaction. However, despite their widespread adoption and general reliability, NMT systems suffer from a critical pathology: Hallucinations. Imagine using a translation tool to decipher instructions for a hotel stay. The original German text suggests opening the window to enjoy the view. The translation model, however, confidently outputs: “The staff were very friendly.” This isn’t just a grammatical error; it is a complete detachment from the source material. ...

2024-02 · 7 min · 1322 words
[Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective 🔗](https://arxiv.org/abs/2406.17969)

Untangling the Black Box: Why Monosemanticity is Key to Better LLM Alignment

Imagine trying to understand how a complex alien brain works. You probe a single neuron, hoping it corresponds to a specific thought like “happiness” or “the color red.” Instead, that single neuron fires for a chaotic mix of concepts: a specific preposition, the mention of the French Revolution, and the closing bracket of a Python function. This is the reality of polysemanticity in Large Language Models (LLMs). For years, researchers in mechanistic interpretability have struggled with the fact that neural networks are “black boxes.” A major hurdle is that individual neurons often represent multiple, unrelated concepts simultaneously. The “holy grail” of interpretability is achieving monosemanticity—a state where one neuron (or feature) corresponds to exactly one understandable concept. ...

2024-06 · 10 min · 2034 words
[Encoding and Controlling Global Semantics for Long-form Video Question Answering 🔗](https://arxiv.org/abs/2405.19723)

Beyond the Clip: Mastering Long-Form Video Understanding with Gated State Space Models

Imagine you are watching a superhero movie. In the first act, the protagonist realizes a specific component in their suit is poisoning them. An hour later, they discover a new element to replace it. In the final battle, that new element powers the suit to victory. Now, imagine I ask you: “What would have happened if the hero hadn’t replaced the component?” To answer this, you need to connect the poisoning event from hour 0 to the victory in hour 2. You need the global context—the entire narrative arc. ...

2024-05 · 9 min · 1811 words