[Rethinking the Evaluation of In-Context Learning for LLMs 🔗](https://aclanthology.org/2024.emnlp-main.779.pdf)

The Hidden Cost of Prompting: Why We Need a New Standard for In-Context Learning

If you have ever played around with Large Language Models (LLMs) like GPT-4 or Llama, you have likely encountered In-Context Learning (ICL). It is the fascinating ability of these models to learn a new task simply by seeing a few examples in the prompt, without any gradient updates or weight changes. For instance, if you want a model to classify movie reviews, you might provide three examples of reviews and their sentiment (Positive/Negative) before asking it to classify a fourth one. This process seems magical and, crucially, it seems “free” compared to fine-tuning a model. ...
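A minimal sketch of what that looks like in practice, since the post builds on this setup: the prompt simply stacks labeled demonstrations before the query. The `build_icl_prompt` helper and the example reviews below are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch of in-context learning for sentiment classification.
# The model sees the task only through demonstrations in the prompt;
# no gradient updates or weight changes are involved.

def build_icl_prompt(demonstrations, query):
    """Assemble a few-shot prompt from (review, label) pairs plus a query."""
    lines = []
    for review, label in demonstrations:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("A moving story with brilliant acting.", "Positive"),
    ("Two hours of my life I will never get back.", "Negative"),
    ("The soundtrack alone is worth the ticket.", "Positive"),
]

prompt = build_icl_prompt(demos, "Flat characters and a predictable plot.")
print(prompt)  # send this string to any instruction-following LLM
```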

7 min · 1348 words
[Rethinking Token Reduction for State Space Models 🔗](https://arxiv.org/abs/2410.14725)

Why Transformer Optimizations Fail on Mamba (and How to Fix It)

The landscape of sequence modeling is shifting. For years, the Transformer architecture has reigned supreme, driving the revolution in Large Language Models (LLMs). However, a new contender has emerged: State Space Models (SSMs), most notably the Mamba architecture. Mamba has generated significant excitement because it solves the Transformer’s biggest bottleneck: the quadratic computational cost of attention. Mamba scales linearly with sequence length, making it a potential “Transformer killer” for processing massive contexts. Yet scaling Mamba to billions of parameters still presents a massive computational challenge. To deploy these models in real-world applications, we need to make them more efficient. ...

2024-10 · 8 min · 1626 words
[Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization 🔗](https://arxiv.org/abs/2406.15524)

The Pruning Paradox: Why Minimizing Error in LLMs Can Backfire

Large Language Models (LLMs) like LLaMA and GPT have revolutionized artificial intelligence, but they come with a heavy price tag: computational cost. Running these massive models requires significant memory and energy, creating a barrier to entry for many researchers and developers. To solve this, the field has turned to Neural Network Pruning—the art of removing parameters (weights) from a model to make it smaller and faster without losing too much intelligence. The standard approach is to treat pruning as a math problem: remove weights in a way that minimizes the difference between the original “dense” model and the new “sparse” model. This difference is called reconstruction error. ...
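To make "reconstruction error" concrete, here is a minimal NumPy sketch of the layer-wise version: prune a weight matrix, then measure how much the layer's output on calibration inputs changes. The magnitude-pruning rule and variable names are illustrative assumptions, not the specific procedure analyzed in the paper.

```python
# Minimal sketch of layer-wise reconstruction error for pruning.
# W is a dense weight matrix, X a batch of calibration inputs; the
# 50% magnitude-pruning mask here is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))   # dense layer weights
X = rng.normal(size=(512, 64))    # calibration inputs (features x samples)

# Zero out the 50% smallest-magnitude weights.
threshold = np.quantile(np.abs(W), 0.5)
W_sparse = np.where(np.abs(W) >= threshold, W, 0.0)

# Reconstruction error: squared difference between dense and sparse layer outputs.
reconstruction_error = np.linalg.norm(W @ X - W_sparse @ X) ** 2
print(f"Sparsity: {(W_sparse == 0).mean():.2%}, error: {reconstruction_error:.2f}")
```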

2024-06 · 8 min · 1507 words
[Rethinking Pragmatics in Large Language Models: Towards Open-Ended Evaluation and Preference Tuning 🔗](https://aclanthology.org/2024.emnlp-main.1258.pdf)

Teaching AI Subtlety: Why Multiple Choice Fails Social Reasoning

Imagine you are sitting in a room with a window open. A friend walks in, shivers slightly, and says, “It’s chilly in here.” If you are a literal thinker, you might simply agree: “Yes, the temperature is low.” But if you have social-pragmatic awareness, you understand the implicature: your friend wants you to close the window. This gap between literal meaning and intended meaning is the domain of Pragmatics. For humans, navigating these social nuances—implicatures, irony, humor, and metaphors—is intuitive. For Large Language Models (LLMs), it is notoriously difficult. While LLMs have mastered syntax and semantics, they often struggle to grasp the “unsaid” rules of human interaction. ...

7 min · 1454 words
[Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes 🔗](https://arxiv.org/abs/2407.03623)

Can We Fix AI Bias by Hallucinating Better Data? A Deep Dive into Synthetic Dataset Generation

Introduction: The Motorcycle Problem. Imagine showing an AI model a picture of a person riding a motorcycle. You ask the model to describe what it sees. It replies: “A man riding a motorcycle.” Now, imagine that the rider is actually a woman. Why did the AI get it wrong? The answer lies in spurious correlations. In the vast datasets used to train these models, motorcycles appear significantly more often with men than with women. The model stops looking at the person and starts relying on the context: if there is a motorcycle, the model bets it’s a man. ...

2024-07 · 9 min · 1766 words
[Representational Analysis of Binding in Language Models 🔗](https://arxiv.org/abs/2409.05448)

Geometry of Thought: How Language Models Use 'Ordering Subspaces' to Track Entities

Introduction: The “Coffee in Box Z” Problem. Imagine you are given a logic puzzle: “The coffee is in Box Z, the stone is in Box M, the map is in Box H. What does Box Z contain?” For a human, this is trivial. You scan the sentence, find “Box Z,” look at what is associated with it (“coffee”), and give the answer. In cognitive science and linguistics, this process is known as binding. You are binding an entity (Box Z) to an attribute (coffee). ...

2024-09 · 9 min · 1738 words
[Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models 🔗](https://arxiv.org/abs/2409.14247)

Oops, Not That One! Teaching Vision-Language Models to Handle Human Corrections

Imagine you are cooking with a robot assistant. You ask it to “pass the large bowl.” The robot reaches for a colander. You immediately say, “No, the ceramic one on the left.” The robot pauses, processes your correction, and successfully hands you the mixing bowl. This interaction seems trivial for humans. We constantly negotiate meaning in conversation. If we misunderstand something, we fix it and move on. However, for Artificial Intelligence—specifically Vision-Language Models (VLMs)—this process is incredibly difficult. Most current AI benchmarks focus on getting things right the first time based on a single instruction. But what happens when the AI gets it wrong? Can it recover? ...

2024-09 · 7 min · 1460 words
[RepMatch: Quantifying Cross-Instance Similarities in Representation Space 🔗](https://arxiv.org/abs/2410.09642)

Seeing Data Through the Model's Eyes: An Introduction to RepMatch

In the world of machine learning, the mantra “data is fuel” has become a cliché, but it remains fundamentally true. The characteristics of a training dataset—its quality, diversity, and hidden biases—dictate the capabilities of the final model. However, analyzing this “fuel” is notoriously difficult. Currently, if a data scientist wants to understand their training data, they often look at attributes like “difficulty” (how hard is this sample to learn?) or “noisiness.” While useful, these metrics usually focus on individual instances in isolation. They fail to answer broader questions: How similar is Dataset A to Dataset B? Does this specific subset of 100 examples represent the knowledge of the entire dataset? ...

2024-10 · 8 min · 1685 words
[RepEval: Effective Text Evaluation with LLM Representation 🔗](https://arxiv.org/abs/2404.19563)

Look Inside the Model: Why LLM Hidden States Are Better Judges Than the LLMs Themselves

In the rapidly evolving landscape of Large Language Models (LLMs), generating text is only half the battle. The other half—and arguably the harder half—is evaluating that text. How do we know if a response is harmful, helpful, fluent, or consistent? Traditionally, we relied on metrics like BLEU or ROUGE, which simply count word overlaps between a model’s output and a human reference. But these metrics are rigid; they fail to capture nuance or semantic meaning. Recently, the industry has shifted toward “LLM-as-a-Judge,” where we ask a powerful model like GPT-4 to score a response. While effective, this approach is incredibly expensive, slow, and relies heavily on the model’s ability to articulate a critique. ...
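To see why overlap counting is rigid, here is a toy unigram-precision score, a heavily simplified stand-in for BLEU/ROUGE (no higher-order n-grams, no brevity penalty, all strings invented for illustration): a correct paraphrase scores near zero while a contradiction scores high.

```python
# Toy unigram-overlap score showing why surface metrics miss meaning.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    return overlap / max(sum(cand.values()), 1)

reference = "the medication should be taken after meals"
paraphrase = "take this drug once you have eaten"                 # same meaning, few shared words
contradiction = "the medication should not be taken after meals"  # opposite meaning, many shared words

print(unigram_precision(paraphrase, reference))     # low score despite being correct
print(unigram_precision(contradiction, reference))  # high score despite being wrong
```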

2024-04 · 8 min · 1614 words
[Relevance Is a Guiding Light: Relevance-aware Adaptive Learning for End-to-end Task-oriented Dialogue System 🔗](https://aclanthology.org/2024.emnlp-main.309.pdf)

Solving the Distraction Problem: How Relevance-Aware Adaptive Learning Improves Task-Oriented Dialogue

Imagine you are asking a digital assistant to book a hotel. You specify “The Grand Budapest” in the “East” district. The bot replies confidently, “I have booked a room at The Grand Budapest.” But when you ask for the address, it gives you the location of a different hotel, simply because that other hotel is also in the East district and has a similar price range. In the world of Natural Language Processing (NLP), this is a classic case of a Task-Oriented Dialogue (TOD) system failing due to “distractive attributes.” The system retrieved the wrong knowledge entity because it looked deceptively similar to the right one. ...

8 min · 1550 words
[Related Work and Citation Text Generation: A Survey 🔗](https://arxiv.org/abs/2404.11588)

Automating the Literature Review: Can AI Write Your Related Work Section?

Every researcher knows the feeling. You have a brilliant idea, you’ve run the experiments, and you’ve drafted the core methodology. Then, you hit the wall: the Related Work Section (RWS). To write a good RWS, you cannot simply list papers that sound similar to yours. You must craft a coherent narrative. You have to explain the history of the problem, group existing solutions by their approach, point out their flaws, and seamlessly transition into how your work fills the gap. It is a task that requires deep domain expertise, high-level synthesis skills, and the time to read hundreds of papers. ...

2024-04 · 9 min · 1814 words
[Red Teaming Language Models for Processing Contradictory Dialogues 🔗](https://arxiv.org/abs/2405.10128)

When AI Changes Its Mind: Solving Self-Contradiction in Dialogues with Red Teaming

Imagine you are texting a friend for dinner recommendations. They tell you, “I absolutely hate spicy food; I can’t handle it at all.” You agree to go to a mild Italian place. Then, five minutes later, they text, “Actually, let’s get Indian, I eat spicy curry every single day.” You would probably be confused. You might scroll up to check if you misread the first message. You might ask them, “Wait, didn’t you just say you hate spice?” ...

2024-05 · 8 min · 1603 words
[Recurrent Alignment with Hard Attention for Hierarchical Text Rating 🔗](https://arxiv.org/abs/2402.08874)

Can LLMs Grade Papers? Introducing Recurrent Alignment and Hard Attention

Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with text. They can write poetry, summarize emails, and even code. However, when you ask an LLM to perform a task that requires analyzing a complex, structured document—like an academic paper with dozens of citations—and assign it a specific numerical rating (such as a “disruption score”), the model often falters. The struggle stems from two main issues: structure and precision. First, standard LLMs read text linearly, but real-world documents are often hierarchical (trees of information). Second, LLMs are probabilistic text generators, not calculators; they struggle to output precise, continuous numerical values directly. ...

2024-02 · 8 min · 1510 words
[Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models 🔗](https://arxiv.org/abs/2402.02987)

Your Chat History Isn't Safe: How Attackers Reconstruct Private Conversations from LLMs

In the rapidly evolving landscape of Large Language Models (LLMs), we have moved beyond simple Q&A sessions. Users now engage in complex, multi-round conversations to refine code, write stories, or analyze data. Even more significantly, OpenAI’s introduction of “Custom GPTs” allows users to upload private data or have preparatory conversations to “prime” a bot for specific tasks. Ideally, these interactions are private. We assume that the context of our conversation—the “state” of the chat—is invisible to the outside world. ...

2024-02 · 7 min · 1457 words
[Reconsidering Sentence-Level Sign Language Translation 🔗](https://arxiv.org/abs/2406.11049)

Why Context is King—Re-evaluating How We Teach Machines to Translate Sign Language

Imagine you are watching a movie, but instead of seeing the whole film, you are shown random, five-second clips in a shuffled order. In one clip, a character points to their left and laughs. In the next, they form a specific handshape and move it rapidly through the air. If you were asked to translate exactly what those actions meant, could you do it? Likely not. Without knowing who is standing to the left, or what object was established in the scene prior, you are guessing. ...

2024-06 · 9 min · 1771 words
[Rebuilding ROME : Resolving Model Collapse during Sequential Model Editing 🔗](https://arxiv.org/abs/2403.07175)

Rebuilding ROME: How a Single Line of Code Fixed Model Collapse in LLMs

Large Language Models (LLMs) suffer from a critical limitation: they are frozen in time. Once trained, their knowledge is static. If the President of the United States changes, or if a new scientific discovery corrects a previous theory, the model remains ignorant until it undergoes expensive retraining or fine-tuning. To solve this, researchers developed Model Editing—techniques to surgically update specific facts inside a model without retraining the whole network. One of the most popular methods is ROME (Rank-One Model Editing). It has been hailed as a breakthrough for its ability to locate and edit specific factual associations. ...

2024-03 · 7 min · 1477 words
[Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs 🔗](https://arxiv.org/abs/2410.20200)

The Illusion of Intelligence: Deconstructing Transitive Reasoning in Large Language Models

When we interact with Large Language Models (LLMs) like GPT-4 or LLaMA, it is easy to be seduced by their apparent intelligence. You ask a complex multi-step question, and the model produces a coherent, logical answer. It feels like thinking. But under the hood, is the model actually reasoning? Or is it simply engaging in a sophisticated form of pattern matching, stitching together cues from your prompt to hallucinate a logical structure? ...

2024-10 · 8 min · 1606 words
[Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies 🔗](https://arxiv.org/abs/2406.06461)

Is Your LLM Smarter or Just Richer? A Budget-Aware Look at Reasoning Strategies

In the rapidly evolving landscape of Large Language Models (LLMs), a new “reasoning strategy” seems to drop every week. We’ve moved far beyond simple prompts. We now have agents that debate each other, algorithms that build “trees” of thoughts, and systems that reflect on their own errors to self-correct. Papers introducing these methods often show impressive tables where their new, complex strategy dominates the leaderboard, leaving simple prompting techniques in the dust. But there is a catch—a hidden variable that is often overlooked in academic comparisons: Compute Budget. ...

2024-06 · 10 min · 1945 words
[Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models 🔗](https://arxiv.org/abs/2409.09788)

Can AI Measure the World? How Reference Objects Unlock Spatial Reasoning in VLMs

Imagine you are looking at a photograph of a room you’ve never visited. Someone asks, “Will that couch fit through the doorway?” Even though you don’t have a tape measure, you can probably make a very good guess. You intuitively know that a standard door is about 80 inches high, and using that mental “ruler,” you estimate the size of the couch. This ability to use context clues to measure the world is second nature to humans. ...

2024-09 · 8 min · 1605 words
[RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? 🔗](https://arxiv.org/abs/2410.07573)

RealVul: A New Era for PHP Vulnerability Detection using Large Language Models

If you are studying software security or machine learning, you have likely noticed the explosion of interest in Large Language Models (LLMs). We know LLMs can write code, explain algorithms, and even translate languages. But can they act as security auditors? Can they look at a piece of code and tell you, “Hey, there’s a dangerous SQL injection right here”? The short answer is yes, but with major caveats. While there has been significant research into using Deep Learning for vulnerability detection in languages like C and C++, the web’s most dominant language—PHP—has been largely left behind. This is a critical gap. PHP powers nearly 80% of the top ten million websites, including giants like WordPress and Wikipedia. ...

2024-10 · 9 min · 1785 words