[Can Transformers Learn n-gram Language Models? 🔗](https://arxiv.org/abs/2410.03001)

Beyond the Hype—Are Transformers Actually Good at Learning Basic n-grams?

If you have been following the explosion of Natural Language Processing (NLP) in recent years, you know that the Transformer architecture is the engine behind the revolution. From GPT-4 to Claude, Transformers seem capable of mastering complex reasoning, coding, and creative writing. But in the research world, a fundamental question remains: Do we actually understand how they learn? There is a significant body of theoretical work exploring what Transformers can represent. For example, we know mathematically that a Transformer is capable of mimicking an n-gram language model (a simple model that predicts the next word based on the previous \(n-1\) words). But just because a neural network can represent a function doesn’t mean it will actually learn that function from data using gradient descent. ...
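To make the object of study concrete, here is a minimal sketch (not from the paper) of the bigram case, \(n = 2\): count how often each word follows the previous word, then normalize the counts into next-word probabilities.

```python
from collections import Counter, defaultdict

# Toy corpus; a real n-gram LM is estimated the same way on far more text.
corpus = "the cat sat on the mat the cat ate".split()

# Bigram (n = 2) counts: how often each word follows the previous word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Return P(next | prev), estimated by normalizing raw counts."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

print(next_word_probs("the"))  # {'cat': 0.666..., 'mat': 0.333...}
```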

2024-10 · 2 min · 303 words
[Can Large Language Models Learn Independent Causal Mechanisms? 🔗](https://arxiv.org/abs/2402.02636)

Beyond Stochastic Parrots—Teaching LLMs to Think with Independent Causal Mechanisms

Introduction We are living in the golden age of Large Language Models (LLMs). Systems like GPT-4 and LLaMA have revolutionized how we interact with technology, demonstrating linguistic prowess that often feels like genuine intelligence. However, there is a “ghost in the machine.” Despite their fluency, these models often fail spectacularly when faced with tasks that require rigorous logical consistency or when the data distribution shifts slightly from what they saw during training. ...

2024-02 · 8 min · 1549 words
[Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? 🔗](https://arxiv.org/abs/2405.16908)

Why Your LLM Sounds So Confident Even When It's Wrong: The Challenge of Faithful Uncertainty

Introduction We have all experienced it: you ask a Large Language Model (LLM) a specific factual question—perhaps about an obscure historical figure or a specific coding error—and it responds with absolute conviction. The grammar is perfect, the tone is authoritative, and the delivery is decisive. There is just one problem: the answer is completely wrong. This phenomenon highlights a critical gap in modern Artificial Intelligence. LLMs are trained to generate fluent, persuasive text, often at the expense of accuracy. While we call these “hallucinations,” the danger isn’t just that the model is wrong; it is that the model is persuasively wrong. It mimics the cadence of an expert even when it is essentially guessing. ...

2024-05 · 9 min · 1818 words
[Can Large Language Models Enhance Predictions of Disease Progression? Investigating Through Disease Network Link Prediction 🔗](https://aclanthology.org/2024.emnlp-main.980.pdf)

ComLLM: How Large Language Models and Graphs Are Revolutionizing Disease Prediction

Introduction The digital transformation of healthcare has provided us with a staggering amount of data. Electronic health records (EHRs) track everything from routine checkups to critical diagnoses, creating a rich history of patient health. Yet, having data and effectively using it to predict the future are two very different things. One of the most critical challenges in modern medical AI is predicting disease progression and comorbidity—the likelihood that a patient with one condition (like diabetes) will develop another (like heart disease). ...

8 min · 1650 words
[Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? 🔗](https://arxiv.org/abs/2406.12809)

The Paradox of Intelligence—Why LLMs Fail at Easy Tasks While Acing Hard Ones

Introduction Imagine you are tutoring a student in calculus. They effortlessly solve a complex Gaussian integral, showing a deep understanding of advanced mathematical concepts. Impressed, you ask them a follow-up question: “What is 17 times 8?” The student stares blankly and answers, “106.” You would be baffled. In human cognition, capabilities are generally hierarchical; if you have mastered advanced calculus, it is taken for granted that you have mastered basic arithmetic. This is the essence of consistency. ...

2024-06 · 8 min · 1655 words
[Can Language Models Induce Grammatical Knowledge from Indirect Evidence? 🔗](https://arxiv.org/abs/2410.06022)

The "Wug" Test for AI: Do LLMs Learn Like Humans?

If you have ever taken an introductory linguistics class, you are likely familiar with the “Wug Test.” In 1958, Jean Berko Gleason showed children a picture of a bird-like creature and said, “This is a wug.” She then showed two of them and said, “Now there is another one. There are two of them. There are two…?” The children correctly answered “wugs.” ...

2024-10 · 8 min · 1571 words
[Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators 🔗](https://arxiv.org/abs/2409.14037)

The Hallucinating Professor: Why LLMs Might Not Be Ready to Teach Science

Introduction Imagine a world where every student, regardless of their location or resources, has access to a personal tutor with the knowledge of Neil deGrasse Tyson, the mathematical intuition of Terence Tao, and the chemical expertise of Marie Curie. This is the promise of Large Language Models (LLMs) like GPT-4 and Llama-3. We have rapidly transitioned from using chatbots for writing emails to relying on them for summarizing complex research papers and explaining scientific concepts. ...

2024-09 · 9 min · 1855 words
[Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in a Self-Training Manner 🔗](https://aclanthology.org/2024.emnlp-main.1205.pdf)

Teaching LLMs to Doubt: How Self-Training Can Fix AI Overconfidence

Introduction: The Confidently Wrong Machine Imagine asking an AI assistant for medical advice or a legal precedent. It gives you an answer immediately, with perfect grammar and an authoritative tone. But there is a problem: the answer is completely made up. This phenomenon, often called “hallucination,” is one of the biggest hurdles in deploying Large Language Models (LLMs) like LLaMA or GPT-4 in critical real-world applications. The core issue isn’t just that models make mistakes—humans do that too. The issue is that LLMs often lack the ability to know when they are making a mistake. They are frequently “confidently wrong.” ...

8 min · 1688 words
[Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese 🔗](https://arxiv.org/abs/2402.17302)

Can AI Understand Culture? A Deep Dive into Synthetic Data for Low-Resource Languages

Introduction: The “Snow” Problem in AI Imagine you are training an Artificial Intelligence to understand “commonsense.” You feed it thousands of questions to test its reasoning capabilities. One question asks: “The man needed to shovel his driveway. What season is it?” The answer, obviously, is winter. Now, imagine asking that same question to a student in Jakarta, Indonesia. They might look at you with confusion. Indonesia is a tropical country; people don’t shovel driveways, and it certainly doesn’t snow. The concept isn’t “commonsense”—it’s culturally irrelevant. ...

2024-02 · 8 min · 1673 words
[Can Automatic Metrics Assess High-Quality Translations? 🔗](https://arxiv.org/abs/2405.18348)

The "Good Enough" Trap—Why AI Metrics Fail at Evaluating High-Quality Translations

In the rapidly evolving world of Machine Translation (MT), we have reached a pivotal moment. A few years ago, the goal of translation systems was simply to produce understandable text. Today, systems like Google Translate, DeepL, and GPT-4 produce translations that are often indistinguishable from human output. We are no longer dealing with “word salad”; we are dealing with nuance, style, and high-fidelity accuracy. But this success brings a new, insidious problem. The tools we use to grade these systems—automatic metrics like BLEU, COMET, and BLEURT—were designed and validated in an era where the difference between a “good” and a “bad” translation was obvious. ...
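As a rough illustration of the kind of blind spot the post goes on to discuss (using the sacrebleu library on invented sentences, not examples from the paper): BLEU rewards surface n-gram overlap with the reference, so a perfectly acceptable paraphrase can score far below a literal copy.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical example: one reference and two candidates that a human
# rater might judge as equally acceptable translations.
reference  = ["The minister categorically denied the allegations."]
literal    = ["The minister categorically denied the allegations."]
paraphrase = ["The minister flatly rejected the accusations."]

# The literal copy scores 100; the valid paraphrase scores far lower,
# simply because it shares few n-grams with the reference.
print(sacrebleu.corpus_bleu(literal, [reference]).score)
print(sacrebleu.corpus_bleu(paraphrase, [reference]).score)
```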

2024-05 · 9 min · 1785 words
[Can Active Label Correction Improve LLM-based Modular AI Systems? 🔗](https://arxiv.org/abs/2401.05467)

Taming the Noise: How to Upgrade LLM Agents into Efficient, Fine-Tuned Systems

The rapid rise of Large Language Models (LLMs) like GPT-4 and LLaMA has popularized “Modular AI Systems.” Think of frameworks like LangChain, AutoGPT, or HuggingGPT. These systems chain together multiple LLM calls to perform complex tasks—planning a trip, writing code, or analyzing financial documents. They are incredibly powerful because they require zero training; you just write a prompt, and the system works. ...

2024-01 · 9 min · 1790 words
[Calibrating the Confidence of Large Language Models by Eliciting Fidelity 🔗](https://arxiv.org/abs/2404.02655)

Why LLMs Lie About Being Sure: Introducing UF Calibration

Large Language Models (LLMs) like GPT-4 and LLaMA-2 have revolutionized how we interact with information. They are incredibly helpful, harmless, and creative. However, they have a notorious flaw: they don’t know when to shut up. Or, more accurately, they don’t know how to admit they are unsure. You have likely experienced an LLM hallucinating a fact with the same supreme confidence it uses to state that \(2+2=4\). This is a failure of calibration. ...

2024-04 · 10 min · 2123 words
[Calibrating Language Models with Adaptive Temperature Scaling 🔗](https://arxiv.org/abs/2409.19817)

Fixing Overconfidence in LLMs with Adaptive Temperature Scaling

Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable fluency and reasoning capabilities. However, a persistent issue plagues even the most advanced models: calibration. Ideally, when an LLM says it is 80% confident in an answer, it should be correct 80% of the time. Unfortunately, this is rarely the case. Modern LLMs, particularly those fine-tuned with Reinforcement Learning from Human Feedback (RLHF), tend to be notoriously overconfident. They might hallucinate a completely incorrect fact but assign it a 99% probability score. In high-stakes fields like medicine, law, or autonomous coding, this disconnect between confidence and accuracy is dangerous. ...
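For background, the fixed-temperature baseline that adaptive temperature scaling builds on fits in a few lines (a PyTorch sketch with illustrative numbers, not the paper's adaptive method): dividing the logits by a temperature \(T > 1\) softens overconfident probabilities while leaving the ranking of answers untouched.

```python
import torch
import torch.nn.functional as F

def temperature_scale(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Classic temperature scaling: divide logits by T before the softmax.
    T > 1 softens (reduces confidence), T < 1 sharpens; T = 1 is a no-op.
    The argmax, i.e. which answer ranks first, never changes."""
    return F.softmax(logits / T, dim=-1)

logits = torch.tensor([4.0, 1.0, 0.5])   # overconfident raw scores
print(F.softmax(logits, dim=-1))          # ~[0.93, 0.05, 0.03]
print(temperature_scale(logits, T=2.0))   # ~[0.72, 0.16, 0.12]
```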

2024-09 · 8 min · 1664 words
[CAT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans 🔗](https://arxiv.org/abs/2406.15823)

Do LLMs Actually Understand Plans? Benchmarking Causal Reasoning with CAT-BENCH

Large Language Models (LLMs) have become exceptionally good at generating procedural text. If you ask a state-of-the-art model for a recipe to bake a cake, it will likely produce a perfectly coherent list of steps: mix the dry ingredients, beat the eggs, combine them, and bake at a specific temperature. On the surface, it looks like the model understands the process. But there is a significant difference between memorizing a sequence of words and understanding the causal logic that binds those steps together. Does the model know why you must mix the flour before baking? Does it understand that you can chop the nuts while the oven preheats, but you can’t frost the cake before it cools? ...

2024-06 · 9 min · 1871 words
[CUTE: Measuring LLMs' Understanding of Their Tokens 🔗](https://arxiv.org/abs/2409.15452)

Do LLMs Actually Know How to Spell? Inside the CUTE Benchmark

When we interact with Large Language Models (LLMs) like GPT-4 or Llama 3, we often attribute human-like literacy to them. We assume that because a model can write a sonnet or debug Python code, it understands text the same way we do: letter by letter, word by word. However, this is a fundamental misconception. LLMs do not “read” text character by character. Instead, they process text through a tokenizer, which chunks characters into tokens. A common word like “the” is processed as a single atomic unit, not as the sequence t-h-e. To the model, the token the is just an integer ID in a massive list, distinct from The, theoretical, or lathe. ...
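You can see this directly with a short sketch (using the tiktoken BPE tokenizer as one example; the exact IDs and splits depend on the vocabulary):

```python
import tiktoken  # pip install tiktoken

# One example BPE tokenizer; IDs and splits vary by vocabulary.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "The", "theoretical", "lathe"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:>14} -> ids {ids} pieces {pieces}")

# "the" and "The" map to different IDs, and longer words may split into
# several sub-word tokens; none of them expose individual letters.
```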

2024-09 · 9 min · 1869 words
[CURE: Context- and Uncertainty-Aware Mental Disorder Detection 🔗](https://aclanthology.org/2024.emnlp-main.994.pdf)

Beyond Symptoms: How Context and Uncertainty Improve Mental Health AI

Mental health disorders affect over one billion people worldwide. With the rise of social media, these platforms have become a space for self-disclosure, offering researchers a massive dataset to help detect conditions like depression or anxiety early. However, detecting mental disorders from text is notoriously difficult. Early deep learning models were “black boxes”—they could predict a disorder but couldn’t explain why. Recent “symptom-based” approaches improved this by first identifying specific symptoms (like “insomnia” or “fatigue”) and then predicting a disorder. But even these models have a critical flaw: they often ignore context. ...

8 min · 1495 words
[CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free-Word-Ordered and Morphologically-Rich Low-Resource Languages 🔗](https://arxiv.org/abs/2410.06944)

Taming Free Word Order: How Contrastive Learning Improves Parsing for Morphologically Rich Languages

In the world of Natural Language Processing (NLP), we often take word order for granted. If you speak English, “The dog chased the cat” and “The cat chased the dog” mean two very different things. The syntax—the structure of the sentence—is rigidly defined by the sequence of the words. But what if the order didn’t matter? What if you could say “Chased cat dog the” and, due to the way the words are modified or “tagged,” the meaning remained exactly the same? ...

2024-10 · 8 min · 1610 words
[CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models 🔗](https://arxiv.org/abs/2409.19984)

Probability or Guesswork? Investigating Consistency in Large Language Models

Large Language Models (LLMs) have become the engines driving modern AI, from chatbots to code generators. In many of these applications, we don’t just care about the text the model generates; we care about the score—the probability the model assigns to a specific sequence of words. These scores are used to detect hallucinations, rank potential answers, and measure the model’s confidence. But here is the uncomfortable question: Can we actually trust these numbers as mathematical probabilities? ...
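For orientation, here is a rough sketch of how such a score is computed (using Hugging Face transformers with gpt2 purely as an example model, not the paper's setup): the log-probability of a span is the sum of per-token conditional log-probabilities under the model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used only as a small illustrative model.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def span_logprob(text: str) -> float:
    """Sum of log P(token_i | tokens_<i) over the span, in nats."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so shift targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

print(span_logprob("The cat sat on the mat."))
```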

2024-09 · 8 min · 1508 words
[CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models 🔗](https://arxiv.org/abs/2407.17467)

Balancing Act: How to Teach LLMs New Tricks Without Forgetting Old Ones

Introduction Large Language Models (LLMs) like Llama or GPT-4 are the polymaths of the digital age. They can write poetry, debug code, and summarize history with impressive fluency. However, their broad knowledge often comes at the expense of depth. When faced with highly specialized tasks—such as interpreting complex financial regulations or analyzing dense academic papers—these generalist models often struggle. They simply haven’t seen enough of that specific data during their initial training. ...

2024-07 · 8 min · 1667 words
[CMD: a framework for Context-aware Model self-Detoxification 🔗](https://arxiv.org/abs/2308.08295)

Can LLMs Fix Themselves? Inside the Context-aware Model self-Detoxification (CMD) Framework

Large Language Models (LLMs) like GPT-4 and Llama 2 have revolutionized how we interact with technology. They can write poetry, debug code, and summarize history. However, they possess a significant flaw: “garbage in, garbage out.” Because these models are trained on the vast, unfiltered internet, they can inadvertently learn and regurgitate toxic content. When a user provides a toxic prompt (the context), LLMs naturally try to complete the pattern. If you start a sentence with a slur or an aggressive statement, the model’s probability distribution pushes it to continue in that same toxic vein. This poses a massive safety risk for real-world applications. ...

2023-08 · 8 min · 1500 words