Papers

[CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing 🔗](https://arxiv.org/abs/2403.13583)

Teaching LLMs to Code Like Humans: The CoCoST Framework

The promise of Large Language Models (LLMs) in software engineering is dazzling. You type a prompt, and the model spits out working code. For simple tasks—like writing a Fibonacci sequence or a basic SQL query—current models like GPT-4 are incredibly proficient. However, the reality of professional software development is rarely that simple. Real-world coding involves intricate libraries (like TensorFlow or Pandas), complex logic, and specific data structures. When LLMs face these “complex code generation” tasks, they often hallucinate non-existent libraries, write code that runs but produces the wrong answer, or fail to handle edge cases. ...

[CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds 🔗](https://arxiv.org/abs/2410.03457)

How to Build a Better Troll: Using LLMs and Crowdsourcing to Teach AI Logical Fallacies

Introduction: The Art of Bad Arguments If you have ever ventured into the comments section of a controversial news article, you have likely encountered them: arguments that sound convincing on the surface but crumble under the slightest scrutiny. A commenter might claim that implementing a small tax increase will inevitably lead to a totalitarian communist state (a Slippery Slope). Another might argue that because a specific politician is corrupt, all politicians must be criminals (a Hasty Generalization). ...

[CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models 🔗](https://arxiv.org/abs/2410.06741)

Balancing Act: How CoBa Solves the Multi-Task Fine-Tuning Puzzle for LLMs

In the rapidly evolving landscape of Large Language Models (LLMs), we have moved past the initial awe of “it can speak” to the logistical nightmare of “how do we use this in production?” Imagine you are an engineer at a tech giant. You need your LLM to perform code completion in Python, translate Java to C++, and generate unit tests. The traditional approach is to fine-tune a separate model for each task. But deploying five different 13-billion parameter models is incredibly resource-heavy and inefficient. ...

[CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research 🔗](https://arxiv.org/abs/2411.01176)

Decoding the Language of Hackers: How CmdCaliper Brings Semantic Understanding to the Command Line

If you have ever stared at a terminal window during a security incident, you know that the command line is the battlefield of modern cybersecurity. For attackers, the command line interface (CLI) is the ultimate tool for execution, persistence, and privilege escalation. For defenders, it is a crime scene full of fingerprints. However, there is a significant problem in how we analyze these fingerprints. Attackers are masters of disguise. They can rewrite the same malicious logic in a dozen different ways—changing argument orders, using aliases, or obfuscating strings—to evade detection systems that rely on simple pattern matching or signature detection. ...

[Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation 🔗](https://arxiv.org/abs/2402.18191)

Less Data, Better Models: How 'Clustering and Ranking' Revolutionizes Instruction Tuning

Introduction: The Quality vs. Quantity Dilemma In the current landscape of Large Language Model (LLM) development, there is a prevailing assumption that “more is better.” We often assume that to make a model smarter, we must feed it more tokens, more documents, and more instructions. This is generally true for the pre-training phase, where models learn the statistical structure of language. However, the rules change significantly during the Instruction Tuning (IT) phase—the final polish that teaches a model to act as a helpful assistant. ...

[Cluster-Norm for Unsupervised Probing of Knowledge 🔗](https://arxiv.org/abs/2407.18712)

Cleaning Up the Signal: How Cluster-Norm Improves Unsupervised Knowledge Discovery in LLMs

Large Language Models (LLMs) are impressive, but they are also black boxes. When an LLM outputs a statement, does it “believe” that statement is true, or is it merely simulating a persona that would say that statement? As we fine-tune models with human preferences, we risk teaching them to be sycophants—telling us what we want to hear rather than what is true. To build safer and more reliable AI, we need to look inside the black box. We need to extract the model’s internal “knowledge” directly from its activations, bypassing its text output. This field is known as knowledge elicitation. ...

[Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions 🔗](https://aclanthology.org/2024.emnlp-main.928.pdf)

Can AI Learn to Teach? Training Feedback Generators with Simulated Students

Introduction: The Teacher’s Dilemma If you have ever taught a class or mentored a junior colleague, you know the struggle: giving good feedback is hard. Giving good feedback at scale is nearly impossible. In the world of education, feedback is the engine of improvement. A student writes an essay, receives comments, and (hopefully) revises the work to make it better. This cycle helps students develop critical thinking, self-assessment skills, and mastery of the subject. However, for educators, providing detailed, actionable, and personalized feedback for dozens or hundreds of students is an overwhelming time sink. ...

[ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures 🔗](https://arxiv.org/abs/2406.09818)

Why RAG Struggles with Climate Reports: Introducing ClimRetrieve

Introduction Climate change is arguably the most pressing challenge of our time. To understand how the corporate world is adapting, stakeholders—from investors to regulators—rely heavily on corporate sustainability reports. These documents are massive, qualitative, and complex, often hiding crucial data regarding climate risks and strategies within dense textual narratives. To process this flood of information, the tech world has turned to Retrieval Augmented Generation (RAG). RAG systems use Artificial Intelligence to search through documents, find relevant paragraphs, and use them to generate answers to specific questions. It sounds like the perfect solution. However, there is a catch: we don’t actually know how good these systems are at the specific task of retrieval in the climate domain. ...

[CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios 🔗](https://arxiv.org/abs/2410.03502)

Beyond the Board Exam: Why Chinese Medical AI Needs Real-World Clinical Testing

Beyond the Board Exam: Why Chinese Medical AI Needs Real-World Clinical Testing We are living in an era where Artificial Intelligence is passing medical licensing exams with flying colors. Headlines frequently tout Large Language Models (LLMs) that can score passing grades on the USMLE or its Chinese equivalents. This has led to a surge of excitement—and hype—about the imminent arrival of “AI Doctors.” However, anyone who has been to medical school (or treated a patient) knows a fundamental truth: Passing an exam is not the same as practicing medicine. ...

[CLEANGEN: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models 🔗](https://arxiv.org/abs/2406.12257)

Unlocking Secure AI: How CLEANGEN Disarms Backdoor Attacks in LLMs

The capabilities of Large Language Models (LLMs) like GPT-4, Llama 3, and Claude 3 have revolutionized how we interact with technology. From writing code to acting as personal assistants, these models are becoming ubiquitous. However, this rapid adoption comes with a significant security blind spot. While model weights for LLMs like Llama or Mistral are often public, the massive datasets used to train or fine-tune them are usually opaque. This lack of transparency opens the door for backdoor attacks. An attacker can poison a small fraction of the training data, embedding a hidden “trigger” that forces the model to generate malicious content—like bad code, offensive speech, or biased advice—whenever that trigger appears in a user prompt. ...

[ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval 🔗](https://arxiv.org/abs/2404.13556)

Can LLMs Replace the Search Bar? Meet ChatRetriever

Imagine you are chatting with a friend about movies. You ask, “Who directed Inception?” They answer, “Christopher Nolan.” Then you ask, “What other movies did he make?” To a human, “he” clearly refers to Christopher Nolan. To a standard search engine, however, “he” is ambiguous. This is the fundamental challenge of conversational search. Users naturally use pronouns, ellipses, and context-dependent phrasing, assuming the system remembers the history of the conversation. ...

[ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context 🔗](https://aclanthology.org/2024.emnlp-main.363.pdf)

Why ChatGPT Might Ignore You: The Hidden Biases of AI Guardrails

Introduction Imagine you are asking an AI assistant for advice on how to legally import a rare plant. If you tell the AI you are a fan of the Philadelphia Eagles, it gives you a helpful list of permits and regulations. But if you mention you support the Los Angeles Chargers, the AI shuts you down, claiming it cannot assist with that request. It sounds like a joke or a statistical anomaly, but according to recent research from Harvard University, this is a reproducible phenomenon. ...

[Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models 🔗](https://arxiv.org/abs/2311.09210)

When RAG Goes Wrong: How 'Chain-of-Note' Teaches AI to Ignore Bad Data

Introduction We are currently living in the “Golden Age” of Retrieval-Augmented Generation (RAG). If you have worked with Large Language Models (LLMs) recently, you know the drill: LLMs are incredibly smart, but they can be forgetful, outdated, or prone to confident lies—a phenomenon known as hallucination. The industry standard solution has been RAG. The idea is simple: before the model answers a question, we let it “cheat” by looking up the answer in a digital library (like Wikipedia). We retrieve relevant documents, feed them to the model, and ask it to generate an answer based on that evidence. ...

[Chain-of-Dictionary Prompting Elicits Translation in Large Language Models 🔗](https://arxiv.org/abs/2305.06575)

Unlocking Low-Resource Translation - How Chain-of-Dictionary Prompting Supercharges LLMs

Introduction We often think of Large Language Models (LLMs) like ChatGPT as universal translators. If you ask a modern LLM to translate English into French or Spanish, the results are often fluent and accurate. However, this performance is not distributed equally. When we step away from high-resource languages and attempt to translate into “low-resource” languages—those with significantly less training data available on the internet—the models often stumble. They hallucinate, miss key terms, or fail to generate coherent text entirely. ...

[Chain and Causal Attention for Efficient Entity Tracking 🔗](https://arxiv.org/abs/2410.05565)

Solving the Memory Maze: How Chain and Causal Attention (ChaCAL) Revolutionizes Entity Tracking in LLMs

Imagine you are reading a complex mystery novel. On page 10, the detective puts a key in his pocket. On page 50, he moves the key to a drawer. On page 200, he gives the contents of the drawer to his partner. Finally, on page 300, the partner unlocks a door. To understand that scene, you need to track the location of that key across hundreds of pages and several state changes. ...

[CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures 🔗](https://arxiv.org/abs/2410.05235)

Beyond the Diagnosis: Teaching AI to Argue Like a Doctor with CasiMedicos-Arg

Imagine you are a resident doctor in a busy emergency room. You examine a patient, review their vitals, and turn to your attending physician with a diagnosis. “It’s pneumonia,” you say. The attending looks at you and asks the most terrifying question in medical education: “Why?” It is not enough to get the answer right. In medicine, the reasoning process—the chain of evidence connecting symptoms to a diagnosis—is just as critical as the conclusion itself. ...

[Casablanca: Data and Models for Multidialectal Arabic Speech Recognition 🔗](https://arxiv.org/abs/2410.04527)

Beyond Modern Standard: How 'Casablanca' is Revolutionizing Arabic Speech Recognition

Introduction: The “Speech Divide” If you are reading this, chances are you have used a voice assistant like Siri, Alexa, or Google Assistant. You might have even marveled at how accurate automated subtitles on YouTube have become. For speakers of English, French, or Spanish, we are living in the golden age of Automatic Speech Recognition (ASR). Large language models and self-supervised learning (SSL) have solved the majority of transcription problems for these “resource-rich” languages. ...

[CareCorpus+: Expanding and Augmenting Caregiver Strategy Data to Support Pediatric Rehabilitation 🔗](https://aclanthology.org/2024.emnlp-main.392.pdf)

Revolutionizing Pediatric Care: How Synthetic Data and LLMs Are Unlocking Caregiver Strategies

Introduction Globally, over 50 million children aged 0–5 years experience some form of disability. For these children and their families, pediatric rehabilitation is not just about clinical visits; it is about the daily grind of navigating life. It involves finding ways to participate in family dinners, play at the park, or manage school routines. In this context, caregivers—parents and guardians—are the unsung experts. They develop unique, personalized “strategies” to help their children succeed. ...

[Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! 🔗](https://arxiv.org/abs/2410.01023)

If AI Can Explain the Joke, Does It Understand? Testing Multimodal Literacy with Visual Puns

When a friend winks at you while saying, “I’m definitely going to stick to my diet today,” you immediately understand that they likely mean the opposite. You didn’t just process the text (the sentence); you integrated the visual cue (the wink) to resolve the ambiguity of their statement. This ability is known as multimodal literacy. It is the human capacity to actively combine information from different sources—text, images, gestures—to form a complete reasoning process. We do this intuitively when we look at a textbook illustration to understand a complex paragraph or when we read a caption to make sense of an abstract photo. ...

[Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization? 🔗](https://arxiv.org/abs/2406.17274)

The Shaky Foundation of Trust: Why Evaluating Uncertainty in Text Summarization is Harder Than We Thought

In the rapidly evolving world of Natural Language Generation (NLG), we have witnessed Large Language Models (LLMs) perform feats that were considered science fiction only a decade ago. From summarizing complex financial reports to condensing medical records, abstractive text summarization is reshaping industries. However, there is a catch. LLMs hallucinate. They can generate summaries that sound fluent and confident but are factually incorrect. In high-stakes domains—like healthcare or finance—relying on a false summary can have catastrophic consequences. ...