[Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge 🔗](https://arxiv.org/abs/2410.04784)

Why LLMs Trust Textbooks Over Tweets: Unraveling Learning Preferences in Conflicting Data

Imagine you are browsing the internet trying to find the birth date of a historical figure. You find two conflicting sources. One is a scanned PDF of an academic biography written by a historian. The other is a comment on a social media thread that is riddled with spelling errors. Which one do you trust? Almost instinctively, you trust the academic biography. You rely on heuristics—mental shortcuts—that tell you formal language, proper editing, and authoritative tone correlate with truth. ...

2024-10 · 9 min · 1784 words
[Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models 🔗](https://arxiv.org/abs/2410.04727)

Beyond Perplexity: Measuring LLM Memory with the Forgetting Curve

In the rapidly evolving landscape of Large Language Models (LLMs), there is a massive push for longer context windows. We’ve gone from models that could handle a few paragraphs to beasts claiming to process 128k, 200k, or even 1 million tokens. But here is the critical question: just because a model accepts a million tokens, does it actually remember them? For students and researchers entering this field, this is a tricky problem. We traditionally rely on metrics like Perplexity (PPL) or tasks like “Needle in a Haystack” to evaluate models. However, a new research paper titled “Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models” argues that these existing methods are fundamentally flawed when it comes to long-range memory. ...
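
For readers unfamiliar with the baseline metric the teaser criticizes, perplexity is conventionally defined over a token sequence as follows (a standard definition for background, not a formula introduced by this paper):

```latex
% Standard token-level perplexity of a sequence x_1, ..., x_N under a model p_theta
% (general background, not notation specific to the Forgetting Curve paper):
\mathrm{PPL}(x_{1:N}) \;=\; \exp\!\Bigl(-\tfrac{1}{N}\sum_{i=1}^{N}\log p_\theta\bigl(x_i \mid x_{<i}\bigr)\Bigr)
```

Because this average is taken over every next-token prediction, a low perplexity can be driven largely by locally predictable tokens, which is one reason the paper argues it is a poor proxy for long-range memorization.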

2024-10 · 7 min · 1418 words
[Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting 🔗](https://aclanthology.org/2024.emnlp-main.1051.pdf)

The Trap of Eloquence: Why Textual AI Explanations Can Fool Doctors

The integration of Artificial Intelligence into healthcare is no longer a futuristic concept; it is happening now. From diagnosing skin lesions to predicting patient outcomes, AI models are becoming powerful tools in the clinician’s arsenal. However, with great power comes the “black box” problem. Deep learning models, particularly in medical imaging, are notoriously opaque. We know what they decide, but we rarely know why. To bridge this gap, the field of Explainable AI (XAI) has exploded in popularity. The logic is sound: if an AI can explain its reasoning, doctors can trust it when it’s right and—crucially—catch it when it’s wrong. ...

9 min · 1727 words
[FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture 🔗](https://arxiv.org/abs/2406.11030)

Can AI Order Dinner? Testing Cultural Nuance with FoodieQA

The Hotpot Dilemma: Imagine walking into a restaurant in Beijing. You order “hotpot.” You receive a copper pot with plain water and ginger, accompanied by thinly sliced mutton and sesame dipping sauce. Now, imagine doing the same thing in Chongqing. You receive a bubbling cauldron of beef tallow, packed with chili peppers and numbing peppercorns, served with duck intestines. Same name, entirely different cultural experiences. For humans, these distinctions are part of our cultural fabric. We understand that food isn’t just a collection of ingredients; it is tied to geography, history, and local tradition. But for Artificial Intelligence, specifically Vision-Language Models (VLMs), this level of fine-grained cultural understanding is a massive blind spot. ...

2024-06 · 8 min · 1522 words
[Focused Large Language Models are Stable Many-Shot Learners 🔗](https://arxiv.org/abs/2408.13987)

Why More Isn't Always Better: Fixing Attention Dispersion in Many-Shot In-Context Learning

Large Language Models (LLMs) have transformed the landscape of Artificial Intelligence, largely due to their ability to perform In-Context Learning (ICL). This is the capability where a model learns to solve a task simply by looking at a few examples (demonstrations) provided in the prompt, without any parameter updates. The prevailing wisdom—and the scaling law that governs much of deep learning—suggests that “more is better.” If giving an LLM five examples helps it understand a task, giving it a hundred examples should make it an expert. However, recent empirical studies have uncovered a baffling phenomenon: as the number of demonstrations increases from “few-shot” to “many-shot,” performance often plateaus or even degrades. ...
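
As a concrete illustration of the in-context learning setup described above, here is a minimal, hypothetical sketch of how demonstrations are packed into a prompt; the sentiment task, helper name, and formatting are illustrative and not taken from the paper:

```python
# Minimal sketch of in-context learning prompt construction (illustrative only,
# not this paper's method): demonstrations are concatenated into the prompt and
# the model completes the final line, with no parameter updates involved.

def build_icl_prompt(demonstrations, query):
    """demonstrations: list of (input_text, label) pairs; query: a new input."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")  # the model is asked to complete this
    return "\n".join(lines)

demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
print(build_icl_prompt(demos, "A forgettable but harmless film."))
# In the many-shot regime, `demos` may hold hundreds of pairs -- exactly the setting
# where the paper reports attention dispersing and accuracy plateauing or degrading.
```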

2024-08 · 9 min · 1820 words
[Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling 🔗](https://arxiv.org/abs/2406.12402)

Beyond Labeling: Teaching AI to Deconstruct the Logic of Fallacies

If you have ever spent time in the comment section of a social media platform, you have likely encountered an argument that just felt wrong. It wasn’t necessarily that the facts were incorrect, but rather that the way the dots were connected didn’t make sense. Perhaps someone argued, “If we don’t ban all cars immediately, the planet is doomed.” You know this is an extreme stance that ignores middle-ground solutions, a classic False Dilemma. Or maybe you read, “My uncle ate bacon every day and lived to be 90, so bacon is healthy.” This is a Faulty Generalization—taking a single data point and applying it to the whole population. ...
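
To make the idea of templates and slot-filling concrete, here is a small, hypothetical sketch of how the False Dilemma example above might be represented as a logical template with filled slots; the field names are illustrative and not the paper's annotation schema:

```python
# Hypothetical sketch (not the paper's exact annotation schema): the False Dilemma
# example from the teaser expressed as a logical template with filled slots.
false_dilemma = {
    "fallacy": "False Dilemma",
    "template": "If we don't {A}, then {B}; no middle-ground option is considered.",
    "slots": {
        "A": "ban all cars immediately",
        "B": "the planet is doomed",
    },
}

# Re-instantiating the template from its slots reproduces the underlying argument.
print(false_dilemma["template"].format(**false_dilemma["slots"]))
```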

2024-06 · 9 min · 1807 words
[Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models 🔗](https://arxiv.org/abs/2410.00131)

Faster, Smarter, Lighter: How FibecFed Revolutionizes Federated Fine-Tuning for LLMs

The artificial intelligence landscape has been irrevocably changed by Large Language Models (LLMs) like ChatGPT and LLaMA. These models possess incredible capabilities, but they have a massive hunger for data. Traditionally, training or fine-tuning these giants requires aggregating massive datasets into a centralized server. However, in the real world, data doesn’t live in a single data center. It lives on our phones, our laptops, and in decentralized local servers—often protected by stringent privacy regulations like GDPR. ...

2024-10 · 10 min · 2020 words
[First Heuristic Then Rational: Dynamical Use of Heuristics in Language Model Reasoning 🔗](https://arxiv.org/abs/2406.16078)

How LLMs Think: The Shift from Lazy Shortcuts to Rational Logic

When you are faced with a complex problem that requires multiple steps to solve, how do you approach it? Psychological research suggests that humans often start with “heuristics”—mental shortcuts or shallow associations. If you are looking for your keys, you might first look on the kitchen counter simply because “keys often go there,” not because you remember putting them there. However, as you eliminate options and get closer to the solution, your thinking shifts. You become more rational, deducing exactly where you must have been last. ...

2024-06 · 8 min · 1626 words
[Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models 🔗](https://arxiv.org/abs/2402.16315)

The Devil is in the Pixels: Why GPT-4V Struggles with Details and How to Fix It

If you have played with recent Large Vision-Language Models (LVLMs) like GPT-4V, LLaVA, or InstructBLIP, you’ve likely been impressed. You can upload a photo of a messy room and ask, “What’s on the table?” or upload a meme and ask, “Why is this funny?” and the model usually responds with eerie accuracy. These models have bridged the gap between pixels and text, allowing for high-level reasoning and captioning. However, there is a catch. While these models are excellent generalists, they are surprisingly poor specialists. If you upload a photo of a bird and ask, “Is this a bird?”, the model says yes. But if you ask, “Is this a Cerulean Warbler or a Black-throated Blue Warbler?”, the model often falls apart. ...

2024-02 · 7 min · 1441 words
[FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension 🔗](https://arxiv.org/abs/2409.14750)

Breaking the Hallucination: Why MLLMs Struggle with Fine-Grained Visual Grounding

In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like GPT-4V have dazzled us with their ability to chat about images. You can upload a photo of your fridge, and the model can suggest recipes. However, beneath this conversational fluency lies a persistent issue: visual grounding. When you ask a model to pinpoint the exact location of “the red mug to the left of the blue book,” it often struggles. Instead of truly “seeing” the spatial relationships, many models rely on linguistic probabilities—essentially guessing based on word associations rather than visual evidence. This leads to hallucinations, where the model confidently identifies an object that isn’t there or misinterprets complex instructions. ...

2024-09 · 7 min · 1435 words
[Fine-grained Pluggable Gradient Ascent for Knowledge Unlearning in Language Models 🔗](https://aclanthology.org/2024.emnlp-main.566.pdf)

Surgical Precision in AI: How Fine-Grained Gradient Ascent Makes LLMs Forget Secrets Without Losing Intelligence

Large Language Models (LLMs) are voracious readers. During their pre-training phase, they consume massive datasets scraped from the open web. While this allows them to learn grammar, reasoning, and world knowledge, it also means they inadvertently memorize sensitive information—ranging from Personally Identifiable Information (PII) to toxic hate speech. This poses a significant security and ethical dilemma. If a model memorizes a user’s address or internalizes harmful biases, how do we remove that specific knowledge? The traditional approach would be to scrub the dataset and retrain the model from scratch. However, for models with billions of parameters, retraining is prohibitively expensive and time-consuming. ...
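
As background for the technique named in the title, here is a minimal PyTorch sketch of plain gradient-ascent unlearning, which maximizes the loss on data to be forgotten; the toy model and tensors are stand-ins, and this is not the paper's fine-grained, pluggable method:

```python
# Minimal sketch of generic gradient-ascent unlearning (illustrative; not the paper's
# fine-grained, pluggable approach): on examples the model should forget, we maximize
# the loss instead of minimizing it, pushing the model away from those targets.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # toy stand-in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

forget_inputs = torch.randn(8, 16)            # stand-in for sequences to be forgotten
forget_targets = torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = loss_fn(model(forget_inputs), forget_targets)
(-loss).backward()                            # negate the loss, so the step ascends it
optimizer.step()
```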

3 min · 456 words
[Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs 🔗](https://arxiv.org/abs/2312.05934)

Fine-Tuning vs. RAG: The Battle for Knowledge Injection in LLMs

Imagine you are a college student about to take a difficult exam on a subject you have never studied before—perhaps Advanced Astrophysics or the history of a fictional country. You have two ways to prepare. Option A: You lock yourself in a room for a week before the exam and memorize every fact in the textbook until your brain hurts. Option B: You don’t study at all, but during the exam, you are allowed to keep the open textbook on your desk and look up answers as you go. ...

2023-12 · 9 min · 1827 words
[Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together 🔗](https://arxiv.org/abs/2407.10930)

Why Choose? How Combining Fine-Tuning and Prompt Optimization Unlocks LLM Potential

In the rapidly evolving world of Large Language Models (LLMs), engineers and researchers often face a dilemma when trying to improve performance: should they spend their time engineering better prompts, or should they collect data to fine-tune the model weights? Conventionally, this is viewed as an “either/or” decision. Prompt engineering is lightweight and iterative, while fine-tuning is resource-intensive but generally more powerful. However, recent research suggests that we might be looking at this the wrong way. What if these two distinct optimization strategies aren’t competitors, but rather complementary steps in a unified pipeline? ...

2024-07 · 8 min · 1499 words
[Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice? 🔗](https://arxiv.org/abs/2404.14122)

Less Data, More Translation: Unlocking LLM Potential with Minimal Fine-Tuning

If you have ever taken a course on Neural Machine Translation (NMT), you likely learned the “Golden Rule” of the field: data is king. To build a system capable of translating between English and German, you traditionally need millions of high-quality, aligned sentence pairs. If you want a multilingual model, you need massive datasets covering every direction you intend to support. But the era of Large Language Models (LLMs) like Llama-2 has shaken the foundations of this dogma. These models have read terabytes of text during pre-training. They have likely seen German, Chinese, and Russian before they ever encounter a specific translation dataset. ...

2024-04 · 8 min · 1647 words
[Fine-Grained Prediction of Reading Comprehension from Eye Movements 🔗](https://arxiv.org/abs/2410.04484)

Reading Between the Lines: Can Eye Movements Predict Comprehension?

Reading is one of the most fundamental skills required to navigate modern society, yet assessing how well someone understands what they read remains a complex challenge. Traditionally, the only practical way to measure reading comprehension is through standardized testing—giving someone a passage and asking them questions afterwards. While effective, this approach has significant limitations. It is “offline,” meaning we only get the results after the reading process is finished. It doesn’t tell us where the reader got confused, when their attention drifted, or how they processed the information in real-time. It is essentially a “black box” approach: text goes in, an answer comes out, and we miss everything that happened in between. ...

2024-10 · 9 min · 1850 words
[Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates 🔗](https://arxiv.org/abs/2210.04359)

Decoding 155 Years of Political Debate: How AI Uncovers the Evolution of Solidarity

How does a society hold itself together? In sociology, the answer is often solidarity—the cohesive bond that unites individuals. But solidarity is not a static concept; it shifts with wars, economic crises, and cultural revolutions. Understanding these shifts requires analyzing millions of words spoken over decades, a task that has historically been impossible for human researchers to perform at scale. In a recent paper, researchers from Bielefeld University and partnering institutions undertook an ambitious project: analyzing 155 years of German parliamentary debates (from 1867 to 2022) to track solidarity towards women and migrants. By combining deep sociological theory with state-of-the-art Large Language Models (LLMs) like GPT-4, they didn’t just automate a reading task—they uncovered profound changes in how political leaders frame empathy, exclusion, and belonging. ...

2022-10 · 7 min · 1371 words
[Finding Blind Spots in Evaluator LLMs with Interpretable Checklists 🔗](https://arxiv.org/abs/2406.13439)

Can We Trust AI Judges? Inside the FBI Framework for Auditing Evaluator LLMs

The Rise of the AI Judge. In the rapidly evolving landscape of Artificial Intelligence, we face a bottleneck: evaluation. As Large Language Models (LLMs) become more capable, evaluating their outputs has become prohibitively expensive and time-consuming for humans. If you are developing a new model, you cannot wait weeks for human annotators to grade thousands of responses. The industry solution has been “LLM-as-a-Judge.” We now rely on powerful models, like GPT-4, to grade the homework of smaller or newer models. These “Evaluator LLMs” decide rankings on leaderboards and influence which models get deployed. But this reliance rests on a massive assumption: that the Evaluator LLM actually knows what a good (or bad) answer looks like. ...

2024-06 · 8 min · 1546 words
[FINDVER: Explainable Claim Verification over Long and Hybrid-Content Financial Documents 🔗](https://arxiv.org/abs/2411.05764)

Can AI Audit the Books? Introducing FINDVER, a Benchmark for Financial Claim Verification

We live in an era of information explosion. Every day, news outlets, social media, and forums are flooded with claims about company performance. “Company X increased its revenue by 20%,” or “Company Y’s debt load has doubled.” For investors and analysts, the stakes of acting on false information are incredibly high. The antidote to misinformation is verification—checking these claims against the original source documents, such as 10-K (annual) and 10-Q (quarterly) reports filed with the SEC. ...

2024-11 · 8 min · 1623 words
[Fill In The Gaps: Model Calibration and Generalization with Synthetic Data 🔗](https://arxiv.org/abs/2410.10864)

Can Fake Data Fix Real Confidence? Improving Model Calibration with LLMs

In the fast-moving world of Artificial Intelligence, we often obsess over a single metric: accuracy. We want to know if the model got the answer right. But in high-stakes environments—like healthcare diagnosis, legal analysis, or autonomous driving—being “right” isn’t enough. We also need to know how confident the model is in its decision. Imagine a doctor who is wrong 10% of the time but insists they are 100% sure of every diagnosis. That doctor is dangerous. Similarly, a machine learning model that predicts incorrect outcomes with high confidence is “miscalibrated.” ...
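
Miscalibration is usually quantified with Expected Calibration Error (ECE), which compares average confidence to accuracy within confidence bins; the sketch below is a generic illustration of that metric, not code from the paper:

```python
# Small sketch of Expected Calibration Error (ECE), the standard way "miscalibration"
# is quantified (a general metric, not something introduced by this paper).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted max-class probabilities; correct: 1/0 per prediction."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of the samples
    return ece

# A model that is highly confident but often wrong gets a large ECE:
print(expected_calibration_error([0.95, 0.90, 0.92, 0.99], [1, 0, 0, 1]))
```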

2024-10 · 7 min · 1367 words
[Fewer is More: Boosting Math Reasoning with Reinforced Context Pruning 🔗](https://aclanthology.org/2024.emnlp-main.758.pdf)

Fewer is More: How CoT-Influx Turbocharges LLM Math Reasoning

Large Language Models (LLMs) like GPT-4 and LLaMA-2 are linguistic wizards, capable of writing poetry, code, and essays with ease. Yet, ask them a multi-step grade school math problem, and they often stumble. The standard solution to this problem is Chain-of-Thought (CoT) prompting—giving the model a few examples of how to solve similar problems step-by-step before asking it to solve a new one. This is known as few-shot learning. Intuitively, the more examples you show the model, the better it should perform. But there is a hard ceiling: the context window. ...
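
To see why the context window forces a trade-off, here is a small, hypothetical sketch of selecting few-shot CoT examples under a token budget; the greedy scoring, whitespace tokenization, and function name are illustrative simplifications, not CoT-Influx's reinforced pruner:

```python
# Illustrative sketch of the underlying constraint (not CoT-Influx's reinforced pruning):
# few-shot CoT examples must fit a fixed token budget, so some selection is unavoidable.
def select_examples_within_budget(examples, scores, max_tokens):
    """Greedily keep the highest-scoring CoT examples that still fit the budget.
    `scores` is a hypothetical per-example usefulness score."""
    budget = max_tokens
    chosen = []
    for example, _ in sorted(zip(examples, scores), key=lambda p: p[1], reverse=True):
        n_tokens = len(example.split())  # crude whitespace proxy for a real tokenizer
        if n_tokens <= budget:
            chosen.append(example)
            budget -= n_tokens
    return chosen

examples = [
    "Q: 3 + 5? Let's think step by step. 3 plus 5 is 8. A: 8",
    "Q: 12 * 4? Let's think step by step. 12 times 4 is 48. A: 48",
]
print(select_examples_within_budget(examples, scores=[0.9, 0.7], max_tokens=20))
```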

8 min · 1546 words