[KnowTuning: Knowledge-aware Fine-tuning for Large Language Models 🔗](https://arxiv.org/abs/2402.11176)

Why LLMs Struggle with Facts and How KnowTuning Fixes It

We have all experienced it: you ask a Large Language Model (LLM) a specific, detailed question—perhaps about a medical condition or a historical event—and the answer comes back sounding incredibly confident. The grammar is perfect, the tone is professional, but the content is… slightly off. Maybe it misses a crucial detail, hallucinates a date, or presents arguments in a confusing order. Despite their massive pre-training on the internet, LLMs still struggle with knowledge-intensive tasks. They are excellent at mimicking the style of an expert but often fail to retrieve the specific substance required for complex queries. This leads to three main problems: ...

2024-02 · 8 min · 1618 words
[Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas 🔗](https://arxiv.org/abs/2410.03181)

The Proteus Effect in AI: Do LLMs Behave Differently When They "Look" Scary?

In the world of online gaming, there is a psychological phenomenon known as the “Proteus Effect.” It suggests that the appearance of a user’s digital avatar influences their behavior. If a player is given a tall, attractive avatar, they tend to act more confidently; if they are given an aggressive-looking warrior, they might act more confrontationally. But as we enter the era of Multi-modal Large Language Models (LLMs)—AI that can see as well as read—a fascinating question arises: Does the Proteus Effect apply to AI? ...

2024-10 · 10 min · 2051 words
[KidLM: Advancing Language Models for Children – Early Insights and Future Directions 🔗](https://arxiv.org/abs/2410.03884)

KidLM: Why We Need Special Language Models for Children (and How to Build Them)

We live in an era where Artificial Intelligence is reshaping education. From homework helpers to interactive storytelling, Large Language Models (LLMs) are increasingly becoming a part of children’s daily lives. According to UNICEF, one in three internet users globally is a child. Yet, the very models designed to interact with them—ChatGPT, Llama, and others—are fundamentally not built for them. The problem lies in the data. Modern LLMs are trained on massive scrapes of the open internet. They learn the language of adults: complex sentence structures, nuanced debates, and, unfortunately, the toxicity and bias prevalent in online discourse. When we try to adapt these models for children using Supervised Fine-Tuning (SFT), we hit another roadblock: the annotators. The people labeling data to “teach” the AI how to behave are almost exclusively adults aged 18-35. ...

2024-10 · 8 min · 1524 words
[KNN-INSTRUCT: Automatic Instruction Construction with K Nearest Neighbor Deduction 🔗](https://aclanthology.org/2024.emnlp-main.577.pdf)

Beyond Random Sampling: How KNN-INSTRUCT Builds Better LLM Training Data

If you have ever played with a Large Language Model (LLM) like ChatGPT or Claude, you know that the magic doesn’t just lie in the model’s ability to predict the next word. It lies in the model’s ability to follow your instructions, answer your questions, and act as a helpful assistant. This capability is achieved through a process called Supervised Fine-Tuning (SFT). But here is the catch: SFT requires massive datasets of high-quality conversations—specifically pairs of (instruction, response). Curating these datasets by hand is incredibly expensive and slow. To solve this, researchers have turned to using LLMs to generate their own training data, a technique known as bootstrapping. ...
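
To make the data format concrete, here is a rough sketch of what a single SFT record and a naive bootstrapping prompt might look like. The field names and prompt wording are illustrative assumptions, and this generic seed-and-generate loop is not the KNN-based construction the paper proposes.

```python
# Illustrative shape of SFT data and of a naive bootstrapping prompt.
# Field names and prompt wording are assumptions for illustration; the
# paper's KNN-INSTRUCT instead seeds generation from nearest-neighbor
# groups of existing instructions rather than arbitrary seeds.

sft_record = {
    "instruction": "Summarize the plot of 'Hamlet' in two sentences.",
    "response": "Prince Hamlet seeks revenge for his father's murder...",
}

def bootstrap_prompt(seed_instructions: list[str]) -> str:
    """Build a prompt asking an LLM to produce one new instruction."""
    examples = "\n".join(f"- {s}" for s in seed_instructions)
    return (
        "Here are some existing instructions:\n"
        f"{examples}\n"
        "Write one new, different instruction in the same style."
    )

print(bootstrap_prompt([sft_record["instruction"],
                        "Translate 'good morning' into French."]))
```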

9 min · 1915 words
[KB-Plugin: A Plug-and-play Framework for Large Language Models to Induce Programs over Low-resourced Knowledge Bases 🔗](https://arxiv.org/abs/2402.01619)

Breaking the Data Barrier: How KB-Plugin Teaches LLMs to Reason Over Any Knowledge Base

Large Language Models (LLMs) have revolutionized how we interact with information. However, they suffer from a well-known flaw: hallucination. When asked about specific, factual data—like the number of citations a researcher has or the specific rail network of a small town—LLMs often guess convincingly rather than answering accurately. To solve this, researchers link LLMs to external Knowledge Bases (KBs). Instead of answering directly, the LLM acts as a translator. It converts a natural language question (e.g., “Who is taller, LeBron James Jr. or his father?”) into a logical program (e.g., Find(LeBron James Jr.) -> Relate(Father) -> ...). This process is called Program Induction (PI). ...
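
To make that concrete, here is a minimal sketch of the program-induction idea in the excerpt: a question is translated into a tiny program of Find/Relate steps and executed against a toy knowledge base. The KB triples and operator implementations are made up for illustration; this toy executor is not the KB-Plugin framework itself.

```python
# Minimal sketch of executing an induced program over a toy knowledge base.
# The triples, heights, and operator implementations are illustrative only.

TOY_KB = {
    ("LeBron James Jr.", "father", "LeBron James"),
    ("LeBron James", "height_cm", "206"),
    ("LeBron James Jr.", "height_cm", "188"),
}

def find(entity):
    """Anchor the program on a named entity."""
    return {entity}

def relate(entities, relation):
    """Follow a relation from the current entity set."""
    return {t for (h, r, t) in TOY_KB if h in entities and r == relation}

# A program an LLM might induce for:
# "Who is taller, LeBron James Jr. or his father?"
son = find("LeBron James Jr.")
dad = relate(son, "father")
son_height = int(next(iter(relate(son, "height_cm"))))
dad_height = int(next(iter(relate(dad, "height_cm"))))
print("father" if dad_height > son_height else "son", "is taller")
```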

2024-02 · 8 min · 1583 words
[KAR³L: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students 🔗](https://arxiv.org/abs/2402.12291)

Beyond Spaced Repetition: How KAR³L Uses NLP to Revolutionize Flashcard Learning

If you have ever learned a new language, crammed for a medical board exam, or memorized trivia, you are likely familiar with Spaced Repetition Systems (SRS) like Anki or SuperMemo. These tools are the gold standard for efficient studying. They work by scheduling flashcards at the exact moment you are about to forget them, maximizing the efficiency of your memory. However, standard SRS algorithms have a significant blind spot: they are illiterate. ...
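
For context, here is a toy version of the content-blind scheduling logic that standard SRS tools rely on. The exponential forgetting curve and per-card half-life below are illustrative assumptions, not KAR³L's actual model; the excerpt's point is that KAR³L adds NLP awareness of the card text on top of this kind of signal.

```python
import math

# Illustrative-only forgetting-curve scheduler: predicted recall decays
# exponentially with time since the last review, and a card is due once
# recall falls below a target threshold. Standard SRS tools schedule from
# review history alone, without ever reading the card's content.

def predicted_recall(hours_since_review: float, half_life_hours: float) -> float:
    return 2 ** (-hours_since_review / half_life_hours)

def hours_until_due(half_life_hours: float, target_recall: float = 0.8) -> float:
    # Solve 2^(-t / h) = target  =>  t = -h * log2(target)
    return -half_life_hours * math.log2(target_recall)

# A card the learner has found easy (long half-life) vs. hard (short one).
print(round(hours_until_due(half_life_hours=48.0), 1))  # ~15.4 hours
print(round(hours_until_due(half_life_hours=6.0), 1))   # ~1.9 hours
```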

2024-02 · 9 min · 1861 words
[Jump Starting Bandits with LLM-Generated Prior Knowledge 🔗](https://arxiv.org/abs/2406.19317)

Solved: The Cold Start Problem in Recommender Systems Using LLMs

Imagine you have just launched a new streaming service. A new user signs up. You know their age and location, but you have zero data on what movies they actually like. What do you recommend? If you recommend a romantic comedy to a horror fan, they might churn immediately. This is the classic Cold Start Problem in recommender systems. The algorithm needs data to learn preferences, but it needs to make good recommendations to get that data. Traditionally, the system has to “explore” (make random guesses) before it can “exploit” (make smart choices), leading to a poor user experience in the early stages. ...
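
As a point of reference, here is a bare-bones epsilon-greedy bandit showing the explore/exploit trade-off described above. The click probabilities and epsilon value are made up, and this is plain epsilon-greedy rather than the paper's method; the paper's idea is to warm-start this kind of learner with LLM-generated prior knowledge instead of starting cold.

```python
import random

# Toy epsilon-greedy bandit for genre recommendation. The true click
# probabilities and epsilon are arbitrary illustrative values; warm-starting
# the `value` estimates (e.g., with LLM-generated priors) is what avoids the
# cold-start phase of random guessing.

random.seed(0)
TRUE_CLICK_PROB = {"romcom": 0.2, "horror": 0.7, "documentary": 0.4}

value = {arm: 0.0 for arm in TRUE_CLICK_PROB}   # estimated reward per arm
pulls = {arm: 0 for arm in TRUE_CLICK_PROB}
EPSILON = 0.1

for _ in range(2000):
    if random.random() < EPSILON:                      # explore
        arm = random.choice(list(TRUE_CLICK_PROB))
    else:                                              # exploit
        arm = max(value, key=value.get)
    reward = 1.0 if random.random() < TRUE_CLICK_PROB[arm] else 0.0
    pulls[arm] += 1
    value[arm] += (reward - value[arm]) / pulls[arm]   # running mean

print({arm: round(v, 2) for arm, v in value.items()})
```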

2024-06 · 8 min · 1623 words
[Joint Pre-Encoding Representation and Structure Embedding for Efficient and Low-Resource Knowledge Graph Completion 🔗](https://aclanthology.org/2024.emnlp-main.851.pdf)

Speeding Up Knowledge Graphs: How PEMLM Slashes Resource Costs While Boosting Accuracy

In the world of Artificial Intelligence, Knowledge Graphs (KGs) act as the structured memory for machines. They store vast amounts of data in the form of triples—(Head Entity, Relation, Tail Entity)—such as (Paris, is_capital_of, France). These graphs power everything from search engine sidebars to recommendation systems and question-answering bots. However, there is a fundamental problem: Knowledge Graphs are rarely complete. Real-world data is messy, and relationships are often missing. This has given rise to the field of Knowledge Graph Completion (KGC), which uses algorithms to predict missing links, such as inferring that the missing head entity in (?, operates_system, iOS) is Apple. ...
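
Here is a tiny illustration of the triple representation and the kind of (?, relation, tail) query KGC tries to answer. The toy graph and naive lookup are assumptions for illustration only, not the PEMLM method.

```python
# Toy knowledge graph as (head, relation, tail) triples, plus a naive
# head-completion query. Real KGC systems score unseen candidates with
# learned representations; this sketch only shows the shape of the task.

KG = {
    ("Paris", "is_capital_of", "France"),
    ("Apple", "operates_system", "iOS"),
    ("Google", "operates_system", "Android"),
}

def complete_head(relation: str, tail: str):
    """Answer a (?, relation, tail) query by lookup over known triples."""
    return {h for (h, r, t) in KG if r == relation and t == tail}

print(complete_head("operates_system", "iOS"))   # {'Apple'}
```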

9 min · 1739 words
[Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing 🔗](https://aclanthology.org/2024.emnlp-main.497.pdf)

Taming Dirty Data Locally: How Jellyfish Brings LLM Power to Data Preprocessing Without the Privacy Risk

If you have ever worked in data science, you know the “80/20 rule”: you spend 80% of your time cleaning and preparing data, and only 20% actually analyzing it or building models. Data Preprocessing (DP) is the unglamorous backbone of the data pipeline. It involves fixing spelling errors, filling in missing values, matching records from different databases, and standardizing formats. Traditionally, this has been handled by a fragmented ecosystem of specialized tools—one algorithm for error detection, a completely different one for entity matching, and so on. ...

9 min · 1740 words
[Jailbreaking LLMs with Arabic Transliteration and Arabizi 🔗](https://arxiv.org/abs/2406.18725)

Lost in Transliteration — How Arabizi Bypasses LLM Safety Filters

Large Language Models (LLMs) like GPT-4 and Claude 3 are designed to be helpful, but they are also designed to be safe. If you ask these models to write a guide on how to create malware or build a bomb, they are trained to refuse. This safety training, often achieved through Reinforcement Learning from Human Feedback (RLHF), acts as a firewall around the model’s vast knowledge. However, security researchers are constantly searching for cracks in this firewall. While most safety training focuses heavily on English, a new vulnerability has emerged in the linguistic “blind spots” of these models. ...

2024-06 · 8 min · 1509 words
[Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs 🔗](https://arxiv.org/abs/2403.05020)

God Mode vs. Reality: Why AI Social Simulations Are Failing the Turing Test of Social Intelligence

Imagine a virtual town populated entirely by AI agents. They wake up, go to work, gossip at the coffee shop, and negotiate prices at the market. It sounds like science fiction—specifically, like Westworld or The Sims powered by supercomputers—but recent advances in Large Language Models (LLMs) have brought us tantalizingly close to this reality. Researchers and developers are increasingly using LLMs to simulate complex social interactions. These simulations are used for everything from training customer service bots to modeling economic theories and testing social science hypotheses. The assumption is simple: if an LLM can write a convincing dialogue between two people, it can simulate a society. ...

2024-03 · 10 min · 2110 words
[Is This a Bad Table? A Closer Look at the Evaluation of Table Generation from Text 🔗](https://arxiv.org/abs/2406.14829)

Beyond Rows and Columns: A New Way to Judge AI-Generated Tables

Imagine you are asking a Large Language Model (LLM) to summarize a complex financial report into a neat, easy-to-read table. The model churns out a grid of numbers and headers. At a glance, it looks perfect. The columns align, the formatting is crisp, and the headers look professional. But is it actually good? In the world of Natural Language Processing (NLP), we have become very good at generating text. However, generating structured data—like tables—from unstructured text is a different beast. More importantly, evaluating whether that generated table is accurate is a notoriously difficult problem. ...

2024-06 · 8 min · 1636 words
[Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering 🔗](https://arxiv.org/abs/2410.03466)

The Safety Trap: Why Guardrails Might Be Making AI Worse at Fighting Hate Speech

In the rapidly evolving landscape of Large Language Models (LLMs), there is a constant tug-of-war between two primary objectives: making models helpful and making them harmless. We want our AI assistants to answer our questions accurately, but we also want to ensure they don’t spew toxicity, bias, or dangerous instructions. To achieve this, developers implement “safety guardrails”—alignment techniques and system prompts designed to keep the model polite and safe. But what happens when the task requires engaging with toxic content to neutralize it? ...

2024-10 · 8 min · 1646 words
[Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment 🔗](https://arxiv.org/abs/2402.14016)

Hacking the Judge: How Universal Adversarial Attacks Fool LLM Evaluators

In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have taken on a new role: the judge. We use powerful models like GPT-4 and Llama 2 not just to write code or poetry, but to evaluate the quality of text generated by other models. This paradigm, known as “LLM-as-a-judge,” is becoming a standard for benchmarking and even grading student essays or exams. ...

2024-02 · 10 min · 1921 words
[Is It Really Long Context if All You Need Is Retrieval? 🔗](https://arxiv.org/abs/2407.00402)

The Long-Context Illusion: Why Length Isn't the Only Thing That Matters

In the rapidly evolving world of Large Language Models (LLMs), we are currently witnessing a “context window arms race.” Not long ago, a model that could remember 2,000 words was impressive. Today, we have models boasting context windows of 128k, 200k, or even 1 million tokens. The promise is alluring: you can feed an entire novel, a codebase, or a legal archive into a model and ask questions about it. But this technical leap forces us to ask a critical question: Does a longer input capacity equal better understanding? ...

2024-07 · 6 min · 1253 words
[Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models? 🔗](https://arxiv.org/abs/2406.12822)

Lost in Translation: Why Multilingual LLMs Need Native Data, Not Just Translations

If you’ve ever used Google Translate to finish a Spanish assignment or interpret a menu in Tokyo, you know the results are usually functional but often lack “soul.” The grammar might be correct, but the cultural nuance—the idiom, the local context, the vibe—is often lost. In the world of Large Language Models (LLMs), we are facing a similar crisis on a massive scale. We want LLMs to speak every language fluently. However, gathering high-quality training data in languages like Russian, Chinese, or Swahili is much harder than finding it in English. The industry standard solution? Take high-quality English data and machine-translate it into the target language. ...

2024-06 · 8 min · 1521 words
[Is Child-Directed Speech Effective Training Data for Language Models? 🔗](https://arxiv.org/abs/2408.03617)

The Data Gap: Can Language Models Learn Like Children?

If you have ever watched a toddler learn to speak, it feels nothing short of miraculous. By the time a child is 10 years old, they have likely heard somewhere between 10 million and 100 million words. From this relatively small dataset, they achieve fluency, understand complex grammar, and grasp nuance. Contrast this with the Large Language Models (LLMs) we use today, like GPT-4 or Llama. These models are typically trained on hundreds of billions, sometimes trillions, of words. They require a dataset several orders of magnitude larger than a human child’s to achieve comparable (or sometimes still inferior) linguistic competence. ...

2024-08 · 7 min · 1473 words
[Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning 🔗](https://arxiv.org/abs/2410.07461)

Pruning LLMs: Why Your Calibration Data Matters More Than You Think

In the current era of Artificial Intelligence, Large Language Models (LLMs) like Llama 2 and GPT-4 have transformed how we interact with technology. However, their capabilities come at a steep cost: hardware resources. A 7-billion parameter model can require upwards of 10GB of memory just to load, making it inaccessible for most consumer edge devices or mobile phones. To solve this, researchers turn to network pruning—a compression technique that removes “unimportant” weights from a model to reduce its size and speed up inference. Modern pruning algorithms are surprisingly effective, capable of removing 50% or more of a model’s parameters with minimal loss in intelligence. ...
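
For intuition, here is a minimal magnitude-pruning sketch that zeros out 50% of a weight matrix. This is plain magnitude pruning on arbitrary values, not the calibration-aware pruners the paper studies, which score weights using activations from a calibration set such as C4.

```python
import numpy as np

# Plain 50% unstructured magnitude pruning on a random weight matrix:
# keep the largest-magnitude half of the weights, zero out the rest.
# Calibration-aware methods instead use a small text corpus to decide
# which weights matter; this sketch only shows the basic operation.

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)   # magnitude cutoff
mask = np.abs(W) >= threshold                  # keep the largest 50%
W_pruned = W * mask

print(f"zeroed out {(W_pruned == 0).mean():.0%} of weights")
```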

2024-10 · 8 min · 1654 words
[Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models 🔗](https://arxiv.org/abs/2410.03176)

Seeing Ghosts: Why AI Models Hallucinate Objects and How to Fix Their Eyes

Imagine asking an AI to describe a photo of a living room. It correctly identifies the sofa, the television, and the coffee table. But then, it confidently adds, “and there is a cat sleeping on the rug.” You look closely. There is no cat. There has never been a cat. This phenomenon is known as Object Hallucination. It is one of the most persistent and dangerous problems in Large Vision-Language Models (LVLMs) like LLaVA or GPT-4V. In high-stakes fields like medical imaging or autonomous driving, a hallucinated tumor or a non-existent pedestrian can be catastrophic. ...

2024-10 · 9 min · 1798 words
[Investigating Mysteries of CoT-Augmented Distillation 🔗](https://arxiv.org/abs/2406.14511)

Why Does Chain-of-Thought Distillation Work? (Hint: It’s Not Logic)

In the current landscape of Large Language Models (LLMs), “Chain of Thought” (CoT) prompting has become a dominant paradigm. We have all seen the magic: if you ask a model like GPT-4 to “think step-by-step,” its ability to solve complex math word problems or commonsense reasoning tasks improves dramatically. Naturally, researchers asked the next logical question: Can we use these reasoning chains to teach smaller models? This process is known as CoT-Augmented Distillation. The idea is simple: you take a massive “teacher” model (like GPT-4 or Mistral), have it generate step-by-step rationales for training questions, and then fine-tune a tiny “student” model (like GPT-2 or a 2B parameter model) on that data. The hope is that the student won’t just learn the answer; it will learn how to think. ...
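
Here is a rough sketch of how a single CoT-augmented distillation record might be assembled before fine-tuning the student. The prompt template and field names are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative construction of a CoT-augmented distillation record: the
# teacher's rationale and answer become the student's fine-tuning target.
# Template wording and field names are assumptions, not the paper's setup.

def build_distillation_example(question: str, rationale: str, answer: str) -> dict:
    return {
        "prompt": f"Question: {question}\nLet's think step by step.",
        "target": f"{rationale}\nThe answer is {answer}.",
    }

example = build_distillation_example(
    question="If a pen costs $2 and a notebook costs 3 times as much, "
             "what do both cost together?",
    rationale="The notebook costs 3 * $2 = $6. Together: $2 + $6 = $8.",
    answer="$8",
)
print(example["prompt"])
print(example["target"])
```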

2024-06 · 9 min · 1903 words