![Cover image](https://deep-paper.org/en/paper/2406.06369/images/cover.png)
# Can AI Judge Safety? Measuring Alignment Between LLMs and Human Annotators
As Large Language Models (LLMs) become central to our digital interactions, "safety" has shifted from a theoretical concern to a practical necessity. We rely on these models not just to chat, but increasingly to evaluate the safety of other systems. This creates a recursive loop: AI is being used to police AI. And it raises a fundamental question: do LLMs actually understand safety the way humans do? ...