[Concept Space Alignment in Multilingual LLMs 🔗](https://arxiv.org/abs/2410.01079)

Do LLMs Think in a Universal Language? Decoding Concept Space Alignment

When you ask a multilingual Large Language Model (LLM) like Llama-2 or BLOOMZ to translate a sentence from English to French, or to reason about a concept in Japanese, what is actually happening under the hood? Does the model maintain separate “brains” for each language, or has it developed a shared, universal “concept space” where the idea of a “dog” is stored in the same mathematical location, regardless of whether it is called “dog,” “chien,” or “inu”? ...

2024-10 · 8 min · 1655 words
[Computational Meme Understanding: A Survey 🔗](https://aclanthology.org/2024.emnlp-main.1184.pdf)

Decoding the Internet: An Introduction to Computational Meme Understanding

Introduction In the modern digital landscape, memes have evolved far beyond funny cat pictures or relatable reaction images. They have become a primary dialect of the internet—a sophisticated, multimodal form of communication capable of shaping public opinion, spreading culture, and even influencing election results. During the last two US presidential elections, for example, memes were weaponized as coordinated media content to sway voters. But here lies the problem: while humans can process the layers of irony, cultural reference, and visual humor in a meme almost instantly, computers struggle immensely with this task. A meme is not just an image, nor is it just text; it is the complex interplay between the two, often requiring deep external knowledge to decode. ...

10 min · 1929 words
[Comparing a BERT Classifier and a GPT classifier for Detecting Connective Language Across Multiple Social Media 🔗](https://aclanthology.org/2024.emnlp-main.1067.pdf)

Beyond Toxicity: Teaching AI to Recognize "Connective Language"

Introduction: Shifting the Focus from Blocking to Building For the past two decades, the intersection of Natural Language Processing (NLP) and social media has largely focused on a digital form of waste management. Researchers and engineers have built sophisticated classifiers to detect and remove the “garbage”—hate speech, toxicity, misinformation, and spam. While this work is vital for digital hygiene, it represents a somewhat one-sided view of online discourse. We have spent immense energy teaching machines what humans shouldn’t say, but very little time teaching them what healthy communication actually looks like. ...

8 min · 1628 words
[Comparing Neighbors Together Makes it Easy: Jointly Comparing Multiple Candidates for Efficient and Effective Retrieval 🔗](https://arxiv.org/abs/2405.12801)

Breaking the Speed-Accuracy Trade-off: How CMC Contextualizes Search Candidates

In the world of Information Retrieval (IR) and Natural Language Processing (NLP), we are constantly balancing two opposing forces: speed and accuracy. When you type a query into a search engine or a chatbot, you expect an answer in milliseconds. To achieve this, systems rely on fast, lightweight models. However, you also expect that answer to be perfectly relevant. Achieving high relevance usually requires heavy, complex models that “read” every candidate document deeply. ...

2024-05 · 8 min · 1635 words
[COMPACT: Compressing Retrieved Documents Actively for Question Answering 🔗](https://arxiv.org/abs/2407.09014)

Squeezing the Truth: How COMPACT Makes RAG Smarter and Faster

Introduction In the rapidly evolving world of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI responses in reality. By fetching relevant documents from an external database, we can prevent hallucinations and give models access to up-to-date information. However, there is a catch: the “context window.” While modern models boast about handling 100k or even 1 million tokens, filling that context comes with significant downsides. It is expensive, increases latency, and paradoxically, often confuses the model. In what is known as the “Lost in the Middle” phenomenon, LLMs struggle to find specific needles in massive haystacks of retrieved text. ...

2024-07 · 8 min · 1689 words
[COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities 🔗](https://arxiv.org/abs/2406.12074)

Building Digital Twins: How to Align LLMs with Online Communities Without Human Supervision

Imagine you are a social scientist trying to understand how different political groups feel about a new tax policy, or a public health official tracking emerging diet trends. Traditionally, you have two options: run a survey or conduct focus groups. Both are slow, expensive, and plagued by biases. People might lie to look better (social desirability bias) or simply refuse to participate (non-response bias). ...

2024-06 · 8 min · 1696 words
[Communicating with Speakers and Listeners of Different Pragmatic Levels 🔗](https://arxiv.org/abs/2410.05851)

Can You Hear What I Didn't Say? Modeling Pragmatic Reasoning in AI Communication

Imagine you are sitting at a table with three objects: a red circle, a red square, and a gray circle. Someone points toward the table and says, “The red one!” Strictly speaking, this sentence is ambiguous. There are two red objects. However, most humans would immediately reach for the red circle. Why? Because if the speaker wanted the red square, they likely would have said “The square,” since that is a unique feature. The fact that they used color instead implies they mean the object that shape alone cannot pick out: the red circle, which needs its color to be distinguished from the gray circle. ...
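This kind of inference can be made concrete with a tiny Rational Speech Acts (RSA) style model, the standard framing for recursive speaker/listener reasoning. The sketch below is purely illustrative and not the paper's exact formulation; the object and utterance names, the uniform prior, and a rationality parameter of 1 are my own assumptions.

```python
# Minimal RSA-style sketch of the "red one" example (illustrative toy model).
import numpy as np

objects = ["red_circle", "red_square", "gray_circle"]
utterances = ["red", "square", "circle", "gray"]

# Literal semantics: truth[u][o] = 1 if utterance u is true of object o.
truth = np.array([
    [1, 1, 0],  # "red"
    [0, 1, 0],  # "square"
    [1, 0, 1],  # "circle"
    [0, 0, 1],  # "gray"
], dtype=float)

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

# Literal listener L0: picks uniformly among objects the utterance is true of.
L0 = normalize(truth, axis=1)

# Pragmatic speaker S1: prefers utterances that make L0 guess the right object.
S1 = normalize(L0.T, axis=1)   # rows = objects, cols = utterances

# Pragmatic listener L1: reasons about which object S1 would call "red".
L1 = normalize(S1.T, axis=1)   # rows = utterances, cols = objects

red = utterances.index("red")
for obj, p in zip(objects, L1[red]):
    print(f"P({obj} | 'red') = {p:.2f}")
# The red circle comes out ahead (0.60 vs 0.40), matching the human intuition above.
```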

2024-10 · 8 min · 1519 words
[Commonsense Knowledge Editing Based on Free-Text in LLMs 🔗](https://arxiv.org/abs/2410.23844)

Teaching Common Sense to LLMs—Why Fact-Based Editing Isn't Enough

Introduction Large Language Models (LLMs) like GPT-4 and LLaMA are impressive, but they are not perfect. They can hallucinate, rely on outdated information, or simply lack specific context. In recent years, researchers have developed “Knowledge Editing” techniques—surgical methods to update a model’s weights to fix a specific error without retraining the entire network. Traditionally, this has been applied to factual knowledge. For example, if the Prime Minister of a country changes, we can edit the model to associate the country with the new leader. However, the real world isn’t just made of static facts. It is filled with commonsense knowledge—the intuitive understanding of how people act, how physics works, and what social norms apply. ...

2024-10 · 7 min · 1485 words
[CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions 🔗](https://arxiv.org/abs/2410.03077)

Stop Confusing Your LLM: How Grouping Data Enhances Instruction Tuning

Introduction The rise of Large Language Models (LLMs) like ChatGPT and LLaMA has shifted the focus of AI research from merely creating architectures to refining how these models learn. We know that “Pre-training” gives a model its vast knowledge base, but “Instruction Tuning” (IT) is what makes it helpful. IT is the process that teaches the model to follow specific user commands, transforming it from a text predictor into a capable assistant. ...

2024-10 · 7 min · 1480 words
[CommVQA: Situating Visual Question Answering in Communicative Contexts 🔗](https://arxiv.org/abs/2402.15002)

Why Context Matters: Reimagining Visual Question Answering with CommVQA

Imagine you are looking at a picture of a mountain range. If you found this image on a travel blog, you might ask: “Where is this located?” or “How difficult is the hike?” However, if you encountered the exact same image in a science magazine, your questions would likely shift to: “Is this a volcanic range?” or “How were these peaks formed?” This simple thought experiment highlights a fundamental aspect of human communication: our questions are rarely generated in a vacuum. They are shaped by our goals, our environment, and the information we already possess. ...

2024-02 · 8 min · 1626 words
[Collective Critics for Creative Story Generation 🔗](https://arxiv.org/abs/2410.02428)

Can AI Be Creative? How 'Collective Critics' Are Teaching LLMs to Write Better Stories

If you have ever asked ChatGPT or Llama to write a story, you have likely encountered a specific problem. The output is usually coherent; the grammar is perfect, the sequence of events makes sense, and the characters do what they are supposed to do. But it is often… boring. It lacks the spark, the clever twist, or the vivid imagery that makes human writing gripping. In the field of Natural Language Processing (NLP), this is a known trade-off. We have become very good at coherence (logic and flow), but we are still struggling with creativity (novelty, surprise, and emotional resonance). ...

2024-10 · 9 min · 1750 words
[Collaborative Performance Prediction for Large Language Models 🔗](https://arxiv.org/abs/2407.01300)

Beyond Scaling Laws: How Netflix-Style Algorithms Can Predict LLM Performance

Introduction In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4, Llama 3, and Claude have become the engines driving innovation. However, a significant bottleneck hampers the progress of researchers and engineers alike: the exorbitant cost of evaluation. To truly understand a model’s capabilities, it must be tested against massive benchmarks—suites of tasks ranging from coding problems to complex reasoning and creative writing. Running a single LLM through a comprehensive benchmark can cost upwards of $10,000 and consume thousands of GPU hours. When you consider the sheer number of models being released and the variations in training configurations, the evaluation matrix becomes impossibly large and expensive to fill. ...
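The “Netflix-style” framing can be pictured as a sparse matrix: models are rows, benchmarks are columns, and the scores we have already measured are used to learn low-rank embeddings that fill in the missing cells. The sketch below illustrates that general collaborative-filtering idea on made-up random data; it is not the paper's actual algorithm, dataset, or hyperparameters.

```python
# Toy sketch: predict missing (model, benchmark) scores via low-rank factorization.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks, rank = 6, 5, 2

# Observed scores in [0, 1]; NaN marks pairs that were never evaluated.
scores = rng.uniform(0.2, 0.9, size=(n_models, n_benchmarks))
scores[rng.uniform(size=scores.shape) < 0.4] = np.nan
observed = ~np.isnan(scores)

# One low-rank embedding per model and per benchmark.
M = rng.normal(scale=0.1, size=(n_models, rank))
B = rng.normal(scale=0.1, size=(n_benchmarks, rank))

lr = 0.05
for _ in range(2000):
    pred = M @ B.T
    err = np.where(observed, pred - scores, 0.0)  # only observed cells contribute
    M -= lr * err @ B
    B -= lr * err.T @ M

print("Predicted score matrix (missing cells filled in):")
print(np.round(M @ B.T, 2))
```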

2024-07 · 9 min · 1875 words
[COFFEE-GYM: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code 🔗](https://arxiv.org/abs/2409.19715)

Debugging the debugger: How COFFEE-GYM uses RL to teach AI how to give better coding advice

Introduction We are living in the golden age of AI-assisted programming. Large Language Models (LLMs) like GPT-4 and DeepSeekCoder have become indispensable tools for developers, capable of generating complex functions and boilerplate code in seconds. However, anyone who has used these tools knows a painful truth: they are not perfect. When an LLM generates buggy code, the natural next step is to ask it to fix it. This process—code editing—is crucial. But simply generating a fixed version of the code isn’t always enough, especially for educational purposes or complex debugging. We need the model to explain what went wrong and how to fix it. We need high-quality Natural Language (NL) Feedback. ...

2024-09 · 9 min · 1735 words
[CODEJUDGE: Evaluating Code Generation with Large Language Models 🔗](https://arxiv.org/abs/2410.02184)

Beyond Test Cases: How CODEJUDGE Uses Slow Thinking to Evaluate AI Code

Introduction In the rapidly evolving landscape of Artificial Intelligence, code generation has become one of the “killer apps.” Tools like GitHub Copilot and ChatGPT have transformed how developers write software, churning out functions, classes, and even entire applications in seconds. But this capability introduces a critical, often overlooked bottleneck: Evaluation. How do we know if the code an AI writes is actually good? Historically, we’ve relied on two main methods: running the code against unit tests (which requires writing those tests first) or comparing the text of the code to a “correct” reference solution (using metrics like BLEU). Both methods have severe limitations. Real-world tasks often lack test cases, and correct code can be written in a thousand different ways, making text comparison unreliable. ...

2024-10 · 8 min · 1658 words
[CodeAgent: Autonomous Communicative Agents for Code Review 🔗](https://arxiv.org/abs/2402.02172)

Meet Your New AI Code Review Team: Inside the CodeAgent Framework

Code review is the backbone of high-quality software engineering. It’s the process where developers check each other’s work to spot bugs, ensure stylistic consistency, and verify that the code actually does what the commit message says it does. However, if you have ever worked in a software team, you know the reality: code review is labor-intensive, time-consuming, and prone to human error. Naturally, researchers have turned to Large Language Models (LLMs) to automate this. But there is a snag. Most existing AI tools treat code review as a simple “input-output” task—you feed in code, and the AI spits out a critique. This ignores a fundamental truth: Code review is an interactive, collaborative process. It involves understanding context, checking formatting against legacy files, and ensuring security—tasks that often require different “mindsets.” ...

2024-02 · 8 min · 1621 words
[Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs 🔗](https://arxiv.org/abs/2401.10065)

Why Thinking in Python Makes LLMs Smarter — The Power of Code Prompting

If you have ever tried to navigate a complex legal document or determine your eligibility for a visa, you know that the logic involved is rarely straightforward. It is a maze of conditional statements: “If you are over 18, AND you have lived here for 5 years, OR you are married to a citizen, THEN…” This is known as conditional reasoning, and it is a fundamental component of human intelligence. For Large Language Models (LLMs), however, it can be a significant stumbling block. While models like GPT-4 are impressive, they often hallucinate or lose track of logic when faced with long chains of conditions buried in natural language text. ...
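The intuition behind code prompting is that a condition like the one above becomes unambiguous once it is spelled out as executable logic. A minimal, hypothetical sketch of how that rule might look when rendered as code; the function and variable names are assumptions for illustration, not the paper's prompt format.

```python
# Illustrative only: the visa-eligibility rule from the text, expressed as code.
def is_eligible(age: int, years_resident: int, married_to_citizen: bool) -> bool:
    # "If you are over 18 AND you have lived here for 5 years,
    #  OR you are married to a citizen, THEN you are eligible."
    return (age > 18 and years_resident >= 5) or married_to_citizen

print(is_eligible(age=30, years_resident=6, married_to_citizen=False))  # True
print(is_eligible(age=17, years_resident=2, married_to_citizen=True))   # True
print(is_eligible(age=20, years_resident=1, married_to_citizen=False))  # False
```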

2024-01 · 7 min · 1453 words
[CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering 🔗](https://arxiv.org/abs/2409.19753)

Bridging the Gap: How Chain-of-Thought Rewriting Optimizes Knowledge Graphs for LLMs

Introduction Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with information. They can write poetry, code websites, and answer questions on a vast array of topics. However, for all their brilliance, they have a notorious flaw: “hallucination.” When an LLM doesn’t know a specific fact—or when that fact is obscure or outdated—it often makes things up with supreme confidence. To combat this, researchers rely on Retrieval-Augmented Generation (RAG). The idea is simple: before the LLM answers a question, we retrieve relevant data from an external source (like a database or a document) and feed it to the LLM as context. ...

2024-09 · 9 min · 1902 words
[CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference 🔗](https://arxiv.org/abs/2406.17626)

When "It" Becomes Dangerous: Exposing Safety Gaps in LLM Conversations

Large Language Models (LLMs) like LLaMA, GPT-4, and Claude have become incredibly adept at refusing harmful requests. If you explicitly ask a modern, safety-aligned model, “How do I make a bomb?” or “Write a hateful slur,” it will almost certainly refuse, citing ethical guidelines. This is the result of extensive “red teaming”—a process where researchers attack the model to find flaws and then patch them. However, most of this safety training focuses on single-prompt attacks. The user asks a bad question; the model says no. But real-world interactions are rarely single-turn queries. They are conversations. They involve context, back-and-forth dialogue, and linguistic shortcuts. ...

2024-06 · 9 min · 1825 words
[COGEN: Learning from Feedback with Coupled Comprehension and Generation 🔗](https://arxiv.org/abs/2408.15992)

The Virtuous Cycle: How Coupling Speaking and Listening Improves AI Learning

In human cognition, speaking and listening are not isolated islands. When we listen to someone, our brains actively predict what they are about to say. Conversely, when we speak, we often simulate how our words will be received by the listener to ensure clarity. This bidirectional relationship suggests that improving one skill should naturally help the other. However, in the world of Artificial Intelligence, these two capabilities—generation (speaking) and comprehension (listening)—are often trained and treated as separate tasks. ...

2024-08 · 7 min · 1487 words
[COEVOL: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation 🔗](https://arxiv.org/abs/2406.07054)

Beyond Data Selection: How Multi-Agent Debate Can Evolve Better LLM Responses

If you have been following the evolution of Large Language Models (LLMs), you are likely familiar with the concept of Instruction Fine-Tuning (IFT). It is the crucial step that turns a raw, text-predicting base model into a helpful assistant capable of following user commands. Recently, the research community has shifted its focus from “how much data do we need?” to “how good does the data need to be?” Papers like LIMA (Less Is More for Alignment) demonstrated that a small set of high-quality data often beats massive amounts of noisy data. This led to a gold rush of data selection methods—algorithms designed to sift through datasets and pick the “cherry” samples while discarding the “lemons.” ...

2024-06 · 7 min · 1368 words