[Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand Multilingual Instructions? 🔗](https://arxiv.org/abs/2402.13703)

Breaking the Language Barrier: Do Polyglot AI Models Need Polyglot Teachers?

In the rapidly evolving landscape of Large Language Models (LLMs), there is a distinct imbalance. While models like GPT-4 and Llama 2 dazzle us with their capabilities, they are predominantly “English-centric.” They are trained on vast oceans of English text, and their ability to follow instructions in other languages often feels like an afterthought—a side effect of translation rather than a core feature. But the world speaks more than just English. For an AI to be a truly global assistant, it must be “polyglot”—capable of understanding and generating fluent, culturally nuanced text in multiple languages. ...

2024-02 · 9 min · 1899 words
[Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups 🔗](https://arxiv.org/abs/2411.01706)

Can LLMs Judge Difficulty? A Deep Dive into Complex Word Identification

Imagine you are learning a new language. You pick up a newspaper, start reading, and suddenly hit a wall. There is a word you just don’t understand. It disrupts your flow and comprehension. Now, imagine a computer system that could scan that text before you read it, identify those difficult words, and automatically replace them with simpler synonyms. This is the goal of Lexical Simplification, and its first, most crucial step is Complex Word Identification (CWI). ...
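
As a toy illustration of that pipeline (not the LLM-based CWI approach the paper investigates), here is a minimal sketch that flags rare words as "complex" using corpus frequencies and proposes more frequent WordNet synonyms; the `wordfreq` threshold and the example sentence are arbitrary choices for demonstration.

```python
# Minimal sketch of the lexical simplification pipeline: flag low-frequency
# words as complex, then look up more frequent synonyms. This is a simple
# frequency heuristic for illustration, not the paper's LLM-based method.
from wordfreq import zipf_frequency          # corpus word frequencies (Zipf scale)
from nltk.corpus import wordnet              # requires nltk.download("wordnet")

def complex_words(text: str, threshold: float = 3.5) -> list[str]:
    """Words whose Zipf frequency falls below the threshold count as complex."""
    return [w for w in text.lower().split() if zipf_frequency(w, "en") < threshold]

def simpler_synonyms(word: str) -> list[str]:
    """WordNet synonyms that are more frequent (hence likely simpler) than the word."""
    candidates = {l.name().replace("_", " ") for s in wordnet.synsets(word) for l in s.lemmas()}
    return [c for c in sorted(candidates) if zipf_frequency(c, "en") > zipf_frequency(word, "en")]

sentence = "The magistrate acquiesced to the request"
for w in complex_words(sentence):
    print(w, "->", simpler_synonyms(w))
```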

2024-11 · 7 min · 1457 words
[Investigating LLMs as Voting Assistants via Contextual Augmentation: A Case Study on the European Parliament Elections 2024 🔗](https://arxiv.org/abs/2407.08495)

Can We Trust AI to Help Us Vote? Auditing LLMs in the 2024 European Elections

In the age of information overload, making an informed political decision is becoming increasingly difficult. During major political events, such as the 2024 European Parliament elections, voters are bombarded with manifestos, debates, and media commentary. To navigate this, many citizens turn to Voting Advice Applications (VAAs). These are traditional, rule-based web applications where users answer a fixed questionnaire (e.g., “Do you support the Euro?”), and the system matches them with the political party that best aligns with their views. ...
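
To make the matching step concrete, here is a minimal sketch of how a rule-based VAA can rank parties by agreement with a user's questionnaire answers; the questions, party positions, and the simple distance-based scoring are illustrative assumptions, not the scoring used by any real application.

```python
# Minimal sketch of rule-based VAA matching: answers are encoded as
# -1 (disagree), 0 (neutral), +1 (agree), and parties are ranked by agreement.
# Questions and party positions are made up for illustration.
user_answers = {"support_euro": 1, "raise_climate_targets": 1, "expand_eu_army": -1}

party_positions = {
    "Party A": {"support_euro": 1, "raise_climate_targets": 0, "expand_eu_army": -1},
    "Party B": {"support_euro": -1, "raise_climate_targets": 1, "expand_eu_army": 1},
}

def agreement(user: dict[str, int], party: dict[str, int]) -> float:
    """1 minus the normalized L1 distance between answer vectors (1.0 = perfect match)."""
    dist = sum(abs(user[q] - party[q]) for q in user)
    return 1 - dist / (2 * len(user))

ranking = sorted(party_positions,
                 key=lambda p: agreement(user_answers, party_positions[p]),
                 reverse=True)
print(ranking)  # parties ordered by alignment with the user's answers
```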

2024-07 · 8 min · 1555 words
[Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis 🔗](https://arxiv.org/abs/2407.15286)

Is Your AI Actually Moral, or Just Pretending? The Mechanics of Self-Correction

Large Language Models (LLMs) have a bit of a reputation problem. While they can write poetry and code, they are also prone to hallucination and, more concerningly, perpetuating stereotypes, discrimination, and toxicity. To combat this, the field has rallied around a technique called Intrinsic Moral Self-Correction. The idea is elegantly simple: ask the model to double-check its work. By appending instructions like “Please ensure your answer is unbiased,” models often produce significantly safer outputs. It feels like magic—the model seemingly “reflects” and fixes itself without any external human feedback or fine-tuning. ...
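
The mechanic itself is just a second prompting round. Below is a minimal sketch of such a self-correction loop, where `llm` stands in for any chat-style completion function (a placeholder, not a specific API) and the revision instruction paraphrases the kind used in this line of work.

```python
# Minimal sketch of intrinsic self-correction: generate an answer, then ask the
# model to revise its own output with a safety instruction appended. `llm` is a
# placeholder for any chat-completion callable; no external feedback is used.
def self_correct(llm, question: str) -> str:
    first_answer = llm([{"role": "user", "content": question}])
    revision_turns = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": "Please review your previous answer and make sure "
                                    "it is unbiased and does not rely on stereotypes. "
                                    "Provide a revised answer."},
    ]
    return llm(revision_turns)  # the model "reflects" on its own output
```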

2024-07 · 7 min · 1280 words
[Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations 🔗](https://arxiv.org/abs/2408.15232)

Beyond the Search Bar: How Watching AI Agents Argue Helps Us Learn Better

We live in the golden age of answers. If you want to know the population of Brazil or the boiling point of tungsten, a quick Google search or a prompt to ChatGPT gives you the answer instantly. These systems excel at addressing known unknowns—information gaps you are aware of and can articulate into a specific question. But what about the unknown unknowns? These are the concepts, connections, and perspectives you don’t even know exist. How do you ask a question about a topic when you don’t know the vocabulary? How do you explore the implications of a new technology if you don’t know the economic or ethical frameworks surrounding it? ...

2024-08 · 7 min · 1351 words
[Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding 🔗](https://arxiv.org/abs/2410.15609)

Making Voice Assistants Truly Robust: A Causal Approach to Speech Noise Injection

Imagine you are asking your smart home assistant to “add cereal to the shopping list.” Instead, it dutifully adds “serial” to your list. While this is a minor annoyance for a user, for the underlying Artificial Intelligence, it is a catastrophic failure of understanding. This phenomenon stems from errors in Automatic Speech Recognition (ASR). While modern Pre-trained Language Models (PLMs) like BERT or GPT are incredibly smart at understanding text, they are often trained on clean, perfect text. When they are fed messy, error-prone transcriptions from an ASR system, their performance nosedives. ...

2024-10 · 9 min · 1761 words
[Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions 🔗](https://arxiv.org/abs/2402.15055)

The Handshake Inside the Machine: How Attention Heads and MLPs Collaborate to Predict the Next Token

The interior of a Large Language Model (LLM) is often described as a “black box.” We know what goes in (a prompt) and we know what comes out (a coherent continuation), but the billions of calculations in between remain largely opaque. For students and researchers in Natural Language Processing (NLP), this opacity is a problem. If we don’t know how a model works, we can’t fully trust it, fix it when it hallucinates, or prevent it from exhibiting bias. ...

2024-02 · 9 min · 1783 words
[Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding 🔗](https://aclanthology.org/2024.emnlp-main.810.pdf)

Why CLIP Can't Read Between the Lines: Fixing Compositional Reasoning in Vision-Language Models

Imagine showing a picture of a horse riding on a person (a strange image, granted) to a state-of-the-art AI model. Then, you ask the model to pick the correct caption between two options: “a person riding a horse” and “a horse riding a person.” Ideally, this should be easy. The nouns are the same, but the relationship is flipped. However, most modern Vision-Language Models (VLMs), including the famous CLIP, struggle significantly with this. They act like “Bag-of-Words” models—they see “horse,” they see “person,” and they declare a match, completely ignoring the syntax or the relationship described by the verb “riding.” ...
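
That evaluation setup is easy to reproduce with an off-the-shelf checkpoint. The sketch below scores one image against both captions using the standard Hugging Face CLIP API; the checkpoint name and the local image path are assumptions for illustration, and the paper's attribution-based fix is not shown.

```python
# Minimal sketch of the caption-matching probe: CLIP scores the image against
# both captions. A bag-of-words-like model tends to assign similar scores to
# the swapped caption. Checkpoint and image path are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_riding_person.jpg")  # hypothetical local image
captions = ["a person riding a horse", "a horse riding a person"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```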

10 min · 1981 words
[Interpretability-based Tailored Knowledge Editing in Transformers 🔗](https://aclanthology.org/2024.emnlp-main.225.pdf)

Surgical Precision for LLMs—How Tailored Knowledge Editing Fixes Facts Without Breaking Models

Large Language Models (LLMs) like GPT-4 or LLaMA are often described as modern-day encyclopedias. They store vast amounts of information about the world, from historical dates to scientific constants. But there is a fundamental flaw in this analogy: unlike a digital encyclopedia that can be updated with a few keystrokes, an LLM is frozen in time. What happens when the Prime Minister changes? What if the model learned incorrect information during training? Or worse, what if it memorized private user data that needs to be scrubbed? ...

8 min · 1577 words
[INTERINTENT: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context 🔗](https://arxiv.org/abs/2406.12203)

Can AI Keep a Secret? Testing Social Intelligence in the Game of Avalon

Large Language Models (LLMs) have mastered the art of conversation. They can write poetry, debug code, and summarize history. But can they lie strategically? Can they deduce who among their friends is a traitor? Can they understand the subtle difference between what someone says and what they actually intend? These capabilities fall under the umbrella of Social Intelligence. While we have plenty of benchmarks for math and coding, evaluating whether an AI can navigate complex social dynamics is much harder. Most current tests are static—multiple-choice questions that don’t reflect the fluid, high-stakes nature of real human interaction. ...

2024-06 · 7 min · 1424 words
[Integrating Structural Semantic Knowledge for Enhanced Information Extraction Pre-training 🔗](https://aclanthology.org/2024.emnlp-main.129.pdf)

Beyond Plain Text — How SKIE Revolutionizes Information Extraction with Semantic Graphs

In the world of Natural Language Processing (NLP), understanding who did what to whom is the holy grail. This process, known as Information Extraction (IE), turns unstructured text—like a news article or a medical report—into structured data tables. For years, the standard approach has been to train massive language models on raw text. While models like BERT or RoBERTa are incredible at predicting the next word, they often treat sentences as linear sequences. They miss the hidden “skeleton” of language: the structural relationships between concepts. To fix this, researchers typically rely on heavily annotated datasets where humans manually label entities and relations. But this is expensive, slow, and hard to scale. ...

8 min · 1572 words
[Integrating Plutchik’s Theory with Mixture of Experts for Enhancing Emotion Classification 🔗](https://aclanthology.org/2024.emnlp-main.50.pdf)

When Psychology Meets AI: Teaching Models to Feel Using Plutchik’s Wheel and Mixture of Experts

In the world of Natural Language Processing (NLP), sentiment analysis is often treated as a solved problem. Determining whether a movie review is positive or negative is something even basic models can handle with high accuracy. However, human experience is rarely just “positive” or “negative.” It is a kaleidoscope of joy, grief, anticipation, remorse, and awe. Detecting these fine-grained emotions in text remains a massive hurdle. For example, while models like RoBERTa can crush sentiment analysis benchmarks, their accuracy often plummets when asked to classify specific emotions in tweets or social media posts. Why? Because emotions are subjective, complex, and often overlapping. ...

9 min · 1725 words
[Integrating Argumentation and Hate-Speech-based Techniques for Counteracting Misinformation 🔗](https://aclanthology.org/2024.emnlp-main.622.pdf)

Beyond Fact-Checking: How AI Can Use Argumentation Strategies to Fight Misinformation

In the digital age, misinformation is a hydra. Cut off one head by flagging a post or banning a user, and two more appear in its place. We are witnessing a proliferation of false information that is not only annoying but potentially life-threatening, particularly in contexts like public health or crisis management. The current standard for dealing with this—content moderation—is largely reactive. Platforms wait for a report, check the content, and remove it. While this might stop the immediate spread, it does little to address the root cause: the perception of the person sharing the misinformation. If a user believes a falsehood and is simply silenced, their belief often hardens. They retreat to echo chambers, convinced of a conspiracy to silence the “truth.” ...

9 min · 1742 words
[IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning 🔗](https://arxiv.org/abs/2406.13683)

Beyond Black Boxes: How IntCoOp Teaches AI to 'Describe' Before It 'Classifies'

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have emerged as powerful foundation models. They can recognize objects, understand scenes, and even zero-shot classify categories they have never seen before. However, unlocking the full potential of these giants often requires a “magic spell”—a carefully crafted text prompt. Manual prompt engineering is tedious. While researchers have developed methods to automate this process (a technique known as prompt tuning), these methods often turn the model into a “black box,” learning abstract vectors that work mathematically but make no sense to humans. ...
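
For readers unfamiliar with what "learning abstract vectors" means, here is a minimal CoOp-style sketch of the general prompt-tuning setup (an illustrative assumption, not IntCoOp's method): a few context embeddings are learned directly and prepended to each class name's token embeddings, replacing a hand-written phrase like "a photo of a".

```python
# Minimal sketch of prompt tuning with learnable context vectors: the prompt is
# no longer words but trainable embeddings, which is why it becomes opaque to
# humans. Dimensions are illustrative.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        # These vectors stand in for words like "a photo of a"; they are
        # optimized directly and usually decode to nothing human-readable.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (n_classes, n_tokens, dim) token embeddings of class names
        n_classes = class_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        return torch.cat([ctx, class_embeddings], dim=1)  # (n_classes, n_ctx + n_tokens, dim)
```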

2024-06 · 10 min · 2067 words
[Instruction Pre-Training: Language Models are Supervised Multitask Learners 🔗](https://arxiv.org/abs/2406.14491)

Rethinking Pre-Training: How Supervised Instruction Synthesis is Changing the LLM Landscape

The history of Large Language Models (LLMs) over the last few years has been dominated by a specific recipe: take a massive amount of raw text from the internet, train a model to predict the next token (unsupervised learning), and then, at the very end, fine-tune it to follow instructions (supervised learning). The first, unsupervised stage of this recipe, popularized by models like GPT-2 and GPT-3, is known as “Vanilla Pre-Training.” It relies on the sheer scale of data. But there is a lingering hypothesis in the AI community: supervised multitask learning—where the model is explicitly told what task to perform—is actually a more efficient way to learn. The problem has always been scaling. We have petabytes of raw web text, but we don’t have petabytes of high-quality, human-labeled instruction-response pairs. ...
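
Viewed as data, the contrast between the two recipes is simple. The sketch below shows the same raw document once as a vanilla next-token-prediction example and once wrapped with explicit question-answer pairs; the templates and the example fact are illustrative assumptions, not the paper's synthesizer output.

```python
# Minimal sketch of the two recipes as training examples: vanilla pre-training
# sees only raw text, while a supervised multitask recipe makes tasks explicit
# by attaching instruction-response pairs. Templates are illustrative only.
raw_document = "Tungsten melts at about 3,422 degrees Celsius, the highest of any metal."

# Vanilla pre-training: the model simply predicts the next token of raw text.
vanilla_example = raw_document

# Supervised multitask learning: the same content, with the task spelled out.
instruction_pairs = [
    {"instruction": "Which metal has the highest melting point, and what is it?",
     "response": "Tungsten, at about 3,422 degrees Celsius."},
]
supervised_example = raw_document + "\n\n" + "\n".join(
    f"Q: {p['instruction']}\nA: {p['response']}" for p in instruction_pairs
)
print(supervised_example)
```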

2024-06 · 9 min · 1715 words
[Optimized Instruction Tuning of Specific Tasks 🔗](https://arxiv.org/abs/2404.16418)

Less is More: How Instruction-Only Task Selection Optimizes LLM Specialist Training

In the rapidly evolving landscape of Large Language Models (LLMs), we have seen a massive shift towards Instruction Tuning. Models like FLAN-T5 and T0 have demonstrated that training a model on a massive mixture of tasks—formatted as natural language instructions—unlocks incredible “zero-shot” capabilities. The prevailing wisdom has often been “the more tasks, the better.” The logic follows that a generalist model trained on thousands of tasks will be better equipped to handle a new, unseen task. ...

2024-04 · 11 min · 2153 words
[Instruction Fine-Tuning: Does Prompt Loss Matter? 🔗](https://arxiv.org/abs/2401.13586)

The Forgotten Hyperparameter: Why Prompt Loss Matters in LLM Fine-Tuning

In the rapidly evolving world of Large Language Models (LLMs), “best practices” are often established not through rigorous ablation studies, but through community consensus and default library settings. One such standard in Supervised Instruction Fine-Tuning (SIFT) is prompt masking. When we fine-tune a model to follow instructions, the standard approach is to calculate the loss (the error) only on the completion (the model’s answer). We typically mask the prompt (the instruction and input), telling the model, “Don’t worry about predicting these tokens; just focus on generating the answer.” ...
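
As a concrete reference point for that convention, here is a minimal sketch of how prompt masking is typically implemented: prompt tokens receive a label of -100 so the cross-entropy loss ignores them and only the completion is scored. The tokenizer and the toy instruction are arbitrary choices for illustration.

```python
# Minimal sketch of prompt masking in supervised fine-tuning: labels of -100 are
# ignored by PyTorch's cross-entropy, so only completion tokens contribute to
# the loss. Tokenizer and example text are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Instruction: Translate to French.\nInput: Good morning.\nResponse:"
completion = " Bonjour."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
completion_ids = tokenizer(completion + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + completion_ids
labels = [-100] * len(prompt_ids) + completion_ids  # prompt tokens excluded from the loss

assert len(input_ids) == len(labels)
```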

2024-01 · 10 min · 2028 words
[Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes 🔗](https://arxiv.org/abs/2410.05052)

Taming the Spike - How WeSaR Stabilizes LLM Training by Scaling Weights

Training Large Language Models (LLMs) is an expensive, high-stakes endeavor. Imagine allocating thousands of GPUs and millions of dollars to train a model like LLaMA or GPT, only to have the training run diverge halfway through. The loss value shoots up suddenly, a phenomenon known as a loss spike, and weeks of progress can be ruined. Loss spikes are a fundamental issue in deep learning, particularly for Transformers. While engineers have developed various “band-aids”—like restarting training from a previous checkpoint or skipping data batches—the root causes remain partially obscure. ...

2024-10 · 8 min · 1591 words
[Information Flow Routes: Automatically Interpreting Language Models at Scale 🔗](https://arxiv.org/abs/2403.00824)

Mapping the Mind of an LLM: How Information Flow Routes Reveal Model Inner Workings

The inner workings of Large Language Models (LLMs) often feel like a black box. We feed a prompt into one end, and a coherent response magically appears at the other. We know the architecture—Transformers, attention heads, feed-forward networks—but understanding exactly how a specific input token influences a specific output prediction remains one of the hardest challenges in AI research. Traditionally, researchers have tried to reverse-engineer these models using “circuits”—subgraphs of the model responsible for specific tasks. However, finding these circuits is usually a manual, labor-intensive process that requires human intuition to design specific test cases. ...

2024-03 · 11 min · 2142 words
[InfiniPot: Infinite Context Processing on Memory-Constrained LLMs 🔗](https://arxiv.org/abs/2410.01518)

InfiniPot: How to Fit Infinite Context into Finite Memory

The promise of Large Language Models (LLMs) often feels boundless, but in practice, it is strictly limited by memory. Whether you are summarizing a massive legal contract, analyzing a full-length novel, or maintaining a chat history that spans weeks, you eventually hit a wall: the context window. For cloud-based giants like GPT-4 or Claude 3, simply throwing more GPUs at the problem can extend this window to 100K or even 1M tokens. But what happens when we want to bring this intelligence to the “edge”—to our laptops and mobile phones? In these memory-constrained environments, we cannot simply add more RAM. When the input sequence grows too long, the application crashes or slows to a crawl. ...
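
A quick back-of-the-envelope calculation shows why the wall appears so fast on-device: the key-value cache grows linearly with context length. The model dimensions below (a 7B-class Transformer with 32 layers, hidden size 4096, stored in fp16) are assumptions for illustration, not InfiniPot's setup.

```python
# Back-of-the-envelope KV-cache memory for an assumed 7B-class Transformer:
# keys and values are stored for every layer and every token in the context.
layers, hidden, bytes_fp16 = 32, 4096, 2

kv_bytes_per_token = 2 * layers * hidden * bytes_fp16   # keys + values, all layers
for context_len in (4_000, 32_000, 100_000, 1_000_000):
    gb = kv_bytes_per_token * context_len / 1e9
    print(f"{context_len:>9,} tokens -> {gb:7.1f} GB of KV cache")
```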

2024-10 · 10 min · 1929 words