[Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark 🔗](https://arxiv.org/abs/2404.16563)

Can LLMs Read Charts? Benchmarking Time Series Understanding in Large Language Models

The capabilities of Large Language Models (LLMs) like GPT-4 and Llama 2 have exploded in recent years. We know they can write poetry, debug code, and summarize history. But can they look at a string of numbers representing a stock price or a patient’s heart rate and “understand” what is happening? Time series analysis—the study of data points collected over time—is critical for finance, healthcare, climate science, and energy. Traditionally, this domain has belonged to statistical models (like ARIMA) or specialized deep learning architectures. However, researchers from J.P. Morgan AI Research recently asked a compelling question: Can general-purpose LLMs analyze time series data without specific fine-tuning? ...

2024-04 · 8 min · 1579 words
[Evaluating Large Language Models along Dimensions of Language Variation: A Systematic Investigation of Cross-lingual Generalization 🔗](https://arxiv.org/abs/2406.13718)

Breaking the Language Barrier: Simulating Dialects to Stress-Test LLMs

The current generation of Large Language Models (LLMs) often feels like magic. Ask a model like BLOOM or GPT-4 to translate French to English, and the result is usually flawless. Switch to Hindi, and it still performs admirably. But what happens when you step just slightly outside the spotlight of these “High-Resource Languages” (HRLs)? There are approximately 7,000 languages spoken today, but LLMs are typically trained on a tiny fraction of them—usually around 100. The vast majority of the world’s languages, including thousands of dialects and closely related variations, are left in the dark. ...

2024-06 · 10 min · 1925 words
[Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts 🔗](https://arxiv.org/abs/2410.20763)

Beyond "Explain Like I'm 5": How LLMs Can Help Us Read Complex Scientific Text

Have you ever tried to read a research paper from a field you aren’t familiar with? Maybe you are a computer scientist trying to parse a biology paper, or a sociologist reading about quantum mechanics. You likely encountered a sentence where you understood the grammar, but a specific term—like “arbitrary-precision arithmetic” or “hyperaemia”—stopped you in your tracks. When this happens, you might open a new tab and search for the definition. But standard definitions can be dry, disconnected from the text, or just as confusing as the original term. ...

2024-10 · 7 min · 1473 words
[Evaluating Diversity in Automatic Poetry Generation 🔗](https://arxiv.org/abs/2406.15267)

Beyond the Turing Test: Is AI Poetry Actually Creative or Just Repetitive?

Artificial Intelligence has stormed the castle of creativity. From DALL-E generating surrealist art to ChatGPT penning sonnets, the line between human and machine creativity is blurring. But when you ask an LLM to write a poem, is it actually being creative? Or is it simply a “stochastic parrot,” reshuffling lines it memorized during training? For years, the gold standard for evaluating AI art has been the Turing Test: Can a human tell if this poem was written by a machine? If the answer is “no,” we assume the model is successful. ...

2024-06 · 8 min · 1604 words
[Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets 🔗](https://arxiv.org/abs/2311.08662)

Breaking and Fixing Language Models: A Guide to Concurrent Robustness

Imagine you are using a large language model (LLM) to summarize a financial report. The model works perfectly. Then, you fix a small typo in the input data—changing “5000” to “5,000” or correcting a misspelled company name. Suddenly, the model’s output flips completely. It contradicts its previous summary. This scenario highlights a critical vulnerability in modern NLP: brittleness. While Language Models (LMs) display impressive capabilities, they are often “black boxes” that are highly sensitive to minor input perturbations. A model might understand a sentence perfectly, but if you add a double negative or swap a word for a synonym, its output can fall apart. ...

2023-11 · 7 min · 1447 words
[Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works 🔗](https://arxiv.org/abs/2404.12726)

Do LLMs Truly Understand Fictional Characters? The Art of AI Character Profiling

If you have ever played around with a “Role-Playing Agent” (RPA)—an AI chatbot designed to act like Harry Potter, Sherlock Holmes, or a character from your favorite anime—you might have been impressed by its ability to mimic their speech style. But have you ever wondered: does the AI actually understand the character? Or is it merely parroting catchphrases and surface-level traits? As Large Language Models (LLMs) like GPT-4 and Claude 3 continue to evolve, the demand for sophisticated RPAs is skyrocketing. However, ensuring these agents truly grasp the depth of a character—their complex relationships, evolving personalities, and hidden motivations—remains a massive challenge. ...

2024-04 · 9 min · 1765 words
[Error Analysis of Multilingual Language Models in Machine Translation: A Case Study of English-Amharic Translation 🔗](https://aclanthology.org/2024.emnlp-main.1102.pdf)

Lost in Translation: Can AI Master the Amharic Language?

Imagine you are traveling through Ethiopia. You want to read a local news article, translate a street sign, or communicate with a local vendor in Amharic. You pull out your phone and type the sentence into a translation app. The app churns for a second and spits out a translation. You assume it’s correct. But what if the app just translated the name of the Prime Minister into “Al-Qaeda”? What if it translated a request for a soft drink into a statement about drug smuggling? ...

10 min · 1934 words
[Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia 🔗](https://arxiv.org/abs/2410.04254)

Beyond Ctrl+F: How AI Solves the 'Entity Insertion' Problem in Wikipedia

Imagine you are editing a Wikipedia article about a 1950s actress. You want to add a link to the page for “Private School” because it is relevant to her early life. You scan the text. The words “Private School” do not appear anywhere in the article. What do you do? You don’t just give up. You write a new sentence—perhaps, “She also worked at a private school”—and insert it into the biography. ...

2024-10 · 9 min · 1824 words
[Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration 🔗](https://arxiv.org/abs/2410.01285)

Unmasking the Black Box: How to Accurately Trace LLM Knowledge Back to Source Data

Large Language Models (LLMs) like LLaMA and Qwen have revolutionized how we interact with information. They draft emails, write code, and summarize complex texts with eerie proficiency. However, these models operate as massive “black boxes.” When an LLM generates a specific fact—or worse, a hallucination—it is notoriously difficult to pinpoint exactly which document in its massive training dataset taught it that specific piece of information. This problem is not just academic curiosity. It is central to issues of data copyright, fairness, and safety. If a model generates hate speech or plagiarizes a protected work, developers need to know the source. ...

2024-10 · 8 min · 1510 words
[Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic 🔗](https://arxiv.org/abs/2402.14798)

Right for the Right Reasons: Teaching AI to Argue Like a Human Using Informal Logic

Imagine asking a student to explain why gravity keeps the Moon in orbit. If they reply, “Because the Moon is made of cheese,” and then somehow circle “Gravity” on the multiple-choice test, they get the right answer, but their reasoning is catastrophic. In the world of Artificial Intelligence, Large Language Models (LLMs) are that student. They are incredibly good at selecting the correct answer, but when asked to show their work—to generate a chain of reasoning that leads to that answer—they often hallucinate, use irrelevant facts, or descend into circular logic. ...

2024-02 · 8 min · 1599 words
[Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic 🔗](https://aclanthology.org/2024.emnlp-main.515.pdf)

Breaking the Bottleneck: How LLM Critics Solve the Sparse Reward Problem in Reinforcement Learning

If you have followed the explosion of Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the concept of Reinforcement Learning from Human Feedback (RLHF). It is the secret sauce that turns a raw, unruly text predictor into a helpful assistant. By using reinforcement learning (RL), we can align models with complex human preferences that are difficult to write down as simple code. However, there is a fundamental inefficiency at the heart of this process. In a typical RLHF setup, the model generates an entire paragraph or response, and only then does it receive a reward signal (a score indicating how good the response was). ...

9 min · 1858 words
[Enhancing Post-Hoc Attributions in Long Document Comprehension via Coarse Grained Answer Decomposition 🔗](https://arxiv.org/abs/2409.17073)

Fixing the Trust Gap: How Coarse-Grained Decomposition Improves AI Citations

In the rapidly evolving world of Generative AI, trust is the new currency. We all marvel at the fluency of Large Language Models (LLMs) like GPT-4 or Claude, but a persistent shadow hangs over their output: hallucinations. When an AI answers a complex question based on a long document, how do we know it isn’t making things up? The industry standard solution is attribution—citing sources. Just like a student writing a thesis, an AI should point to the exact sentence in a source document that supports its claims. However, this is easier said than done, especially when the AI generates a long, complex answer that synthesizes multiple facts. ...

2024-09 · 10 min · 2042 words
[Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs 🔗](https://arxiv.org/abs/2410.06581)

Solving the Legal Data Bottleneck: How to Train AI Judges with Synthetic Data

If you have ever tried to search for a specific legal precedent, you know it is not as simple as Googling a recipe. Legal Case Retrieval (LCR) is a high-stakes, complex task where a judge or lawyer inputs a fact description to find historically relevant cases. The goal is judicial fairness: similar cases should receive similar judgments. To achieve this, legal professionals need tools that can dig through millions of documents to find the right precedent. However, training Artificial Intelligence to do this is notoriously difficult. ...

2024-10 · 8 min · 1549 words
[Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding 🔗](https://arxiv.org/abs/2406.13230)

Inside the Mind of the Model: Improving LLM Truthfulness with ACTCAB and CODEC

Large Language Models (LLMs) are often compared to confident students who, when they don’t know an answer, prefer to make up a plausible-sounding lie rather than admit ignorance. This phenomenon, known as hallucination, remains one of the most significant hurdles in deploying LLMs for high-stakes applications like healthcare, law, or finance. The core of the problem isn’t just that models make mistakes; it’s that they are often miscalibrated. A perfectly calibrated model would have a confidence score that matches its accuracy—if it says “I am 80% sure,” it should be correct 80% of the time. Unfortunately, modern LLMs tend to be overconfident, assigning high probabilities even to complete fabrications. ...

2024-06 · 10 min · 1929 words
[Enhancing Language Model Alignment: A Confidence-Based Approach to Label Smoothing 🔗](https://aclanthology.org/2024.emnlp-main.1189.pdf)

Smoothing the Path to Alignment: How Confidence-Aware Label Smoothing Improves DPO

The training of Large Language Models (LLMs) has evolved into a sophisticated three-stage pipeline: Pretraining (learning the language), Supervised Fine-Tuning (learning the task), and Reinforcement Learning from Human Feedback (RLHF). While the first two stages build capability, the third stage—RLHF—is arguably the most critical for safety and utility. It aligns the model with human values, ensuring the AI is helpful rather than harmful. Recently, Direct Preference Optimization (DPO) has emerged as a popular alternative to traditional RLHF methods like PPO. DPO simplifies the process by treating alignment as a classification problem. However, like many classification tasks, DPO suffers from noisy data. Humans don’t always agree on which response is better, leading to inconsistencies in the training labels. ...

9 min · 1708 words
[Enhancing High-order Interaction Awareness in LLM-based Recommender Model 🔗](https://arxiv.org/abs/2409.19979)

Bridging the Gap: How ELMRec Teaches LLMs to Understand User-Item Graphs

Large Language Models (LLMs) have revolutionized how we interact with information. From writing code to composing poetry, their reasoning capabilities are undeniable. Naturally, researchers have been eager to apply this power to Recommender Systems. After all, if an LLM can understand the semantics of a movie review, surely it can predict what movie you want to watch next, right? The answer is “yes, but…” While LLMs are fantastic at processing text, they often struggle with the fundamental structure of recommendation data: the Interaction Graph. A recommendation dataset isn’t just a list of sentences; it is a complex web connecting users to items, and users to other users. When we force this graph data into a linear text prompt for an LLM, we lose a massive amount of “high-order” information—the subtle ripples of influence that travel through the network. ...

2024-09 · 9 min · 1728 words
[Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research 🔗](https://arxiv.org/abs/2410.03545)

The Duplicate Dilemma: Why Your Social Media Dataset Might Be Lying to You

In the world of Natural Language Processing (NLP) and Computational Social Science (CSS), we are often obsessed with the “State of the Art.” We chase higher F1 scores and accuracy percentages, celebrating every fractional increase on the leaderboard. But what if those high scores are an illusion? What if our models aren’t actually learning to understand language, but are simply memorizing repeated data points hidden within our training sets? ...

2024-10 · 8 min · 1699 words
[Enhancing Advanced Visual Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2409.13980)

Seeing Through Text: How CVR-LLM Unlocks Complex Visual Reasoning

Artificial Intelligence has made massive strides in seeing the world. Modern models can easily identify a cat in a photo or tell you that a car is red. This is known as visual perception. However, if you show an AI a picture of a person ironing a sandwich and ask, “Why is this funny?”, traditional models often fall apart. They might see the iron and the sandwich, but they fail to grasp the absurdity of the situation. This is the challenge of complex visual reasoning. ...

2024-09 · 8 min · 1603 words
[Enhancing AI Assisted Writing with One-Shot Implicit Negative Feedback 🔗](https://arxiv.org/abs/2410.11009)

NIFTY: How Rejected Smart Replies Can Supercharge AI Writing

Have you ever opened an email or a chat message and seen those little “Smart Reply” bubbles at the bottom of the screen? They offer quick, canned responses like “Sounds good!” or “I’ll take a look.” Sometimes, they are helpful. But often, they are completely off the mark. You ignore them and start typing your own response manually. In the world of AI research, that moment—where you look at the suggestions and decide not to click them—is usually treated as a dead end. It is a failed interaction. However, researchers Benjamin Towle and Ke Zhou from the University of Nottingham and Nokia Bell Labs see it differently. They view that rejection as a valuable signal. By ignoring the suggestions, you have implicitly told the system what you don’t want to say. ...

2024-10 · 3 min · 617 words
[Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation 🔗](https://arxiv.org/abs/2402.13331)

Better Together: How Aggregating Detectors Solves NMT Hallucinations

Neural Machine Translation (NMT) has revolutionized how we communicate. From Google Translate to advanced enterprise tools, these systems have become staples of modern interaction. However, despite their widespread adoption and general reliability, NMT systems suffer from a critical pathology: Hallucinations. Imagine using a translation tool to decipher instructions for a hotel stay. The original German text suggests opening the window to enjoy the view. The translation model, however, confidently outputs: “The staff were very friendly.” This isn’t just a grammatical error; it is a complete detachment from the source material. ...

2024-02 · 7 min · 1322 words