[In-Context Compositional Generalization for Large Vision-Language Models 🔗](https://aclanthology.org/2024.emnlp-main.996.pdf)

Beyond Simple Similarity: How to Teach Vision-Language Models to Generalize Compositionally

Imagine you are teaching a child what a “red apple” is. You show them a picture of a red apple. Now, you want them to understand a “green chair.” You show them a green chair. Finally, you present them with a “green apple”—an object they haven’t explicitly studied before, but which is composed of concepts they already know (“green” and “apple”). If the child recognizes it, they have demonstrated Compositional Generalization. ...

10 min · 2121 words
[In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search 🔗](https://arxiv.org/abs/2311.07237)

When LLMs Fail: Exploring the Long-Tail of Knowledge with Logic and Search

Large Language Models (LLMs) like GPT-4 and Llama 2 have dazzled the world with their ability to write code, compose poetry, and answer complex questions. But there is a catch: these models perform best when they are on “familiar ground.” When you ask an LLM about popular topics—like the iPhone or major historical events—it shines. But what happens when you push the model into the obscure corners of knowledge that make up the long tail of the distribution? ...

2023-11 · 7 min · 1325 words
[Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach 🔗](https://arxiv.org/abs/2410.00025)

Can We Teach AI to Listen Like Humans? The Power of Phoneme Fine-Tuning

The “Cocktail Party” Problem for AI: Imagine you are at a loud, crowded party. Your friend is telling you a story. Despite the background music, the clinking of glasses, and the dozens of other conversations happening around you, you can perfectly understand what your friend is saying. You can strip away the noise, ignore the specific pitch of their voice, and focus entirely on the words and their meaning. ...

2024-10 · 9 min · 1886 words
[Improving Multi-party Dialogue Generation via Topic and Rhetorical Coherence 🔗](https://aclanthology.org/2024.emnlp-main.189.pdf)

Taming the Group Chat: How Reinforcement Learning Enhances Coherence in Multi-Party Dialogue AI

If you have ever been part of a busy group chat on WhatsApp or Slack, you know the chaos. Multiple conversations happen simultaneously. Someone answers a question from five minutes ago while two other people are debating lunch options. Keeping track of who is talking to whom—and more importantly, what they are talking about—is a significant cognitive task for humans. For Artificial Intelligence, this is a nightmare. In the field of Natural Language Processing (NLP), this problem is known as Multi-party Dialogue Generation (MDG). While standard chatbots (like early versions of Siri or simple customer service bots) only have to deal with one user (a one-on-one structure), MDG agents must navigate a web of entangled conversation threads. ...

11 min · 2193 words
[Improving Minimum Bayes Risk Decoding with Multi-Prompt 🔗](https://arxiv.org/abs/2407.15343)

Beyond the Perfect Prompt: How Multi-Prompt MBR Decoding Unlocks LLM Potential

If you have spent any time working with Large Language Models (LLMs), you have likely encountered the frustration of “prompt brittleness.” You spend hours crafting the perfect instruction, only to find that changing a single adjective or the order of examples drastically changes the output. This sensitivity is often seen as a bug, forcing engineers to hunt for a single “magic prompt” that solves a specific task. But what if we stopped trying to find the one perfect prompt? What if the sensitivity of LLMs to different instructions is actually a feature we can exploit? ...
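
The trick hinted at here can be sketched in a few lines: sample candidate outputs under several different prompts, then keep the candidate that agrees most with the rest (a Minimum Bayes Risk style consensus). The utility function and candidate strings below are toy assumptions for illustration, not the paper's actual prompts, models, or metric:

```python
# Toy sketch of multi-prompt MBR-style selection (illustrative only; the
# paper's prompts, models, and utility metric are not specified here).
from difflib import SequenceMatcher


def mbr_select(candidates):
    """Return the candidate with the highest total similarity to all others."""
    def utility(a, b):
        # Toy utility: character-overlap ratio. Real setups typically use a
        # stronger metric such as BLEU, ROUGE, or embedding similarity.
        return SequenceMatcher(None, a, b).ratio()

    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        score = sum(utility(cand, other) for j, other in enumerate(candidates) if j != i)
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical outputs, as if sampled from one LLM under several prompts:
candidates = [
    "The Eiffel Tower is located in Paris, France.",
    "Paris, France is home to the Eiffel Tower.",
    "The Eiffel Tower stands in Paris.",
    "The Eiffel Tower is located in Berlin.",  # outlier from a brittle prompt
]
print(mbr_select(candidates))  # prints a consensus answer, not the outlier
```

In this framing, prompt sensitivity becomes useful diversity: the brittle prompt's outlier is voted down by the agreement among the other candidates.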

2024-07 · 9 min · 1815 words
[Improving Knowledge Graph Completion with Structure-Aware Supervised Contrastive Learning 🔗](https://aclanthology.org/2024.emnlp-main.772.pdf)

Beyond the Triple: How StructKGC Teaches Language Models to See the Graph

Knowledge Graphs (KGs) are the silent engines powering much of the modern web. From Google’s Knowledge Vault to Wikidata, these massive networks store facts in the form of triples: (Head Entity, Relation, Tail Entity). For example, (Leonardo da Vinci, painted, Mona Lisa). However, KGs have a fundamental problem: they are never finished. Even the largest repositories suffer from incompleteness. This has given rise to the field of Knowledge Graph Completion (KGC)—the task of predicting missing links. ...

9 min · 1767 words
[Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning 🔗](https://aclanthology.org/2024.emnlp-main.852.pdf)

Sharpening the Judge - How Contrastive Learning Fixes Reward Models in RLHF

In the current era of Generative AI, training a Large Language Model (LLM) to speak fluent English is effectively a solved problem. The frontier has shifted from capability to alignment. We don’t just want models that can write; we want models that write in accordance with human values—being helpful, harmless, and honest. The industry standard for achieving this is Reinforcement Learning from Human Feedback (RLHF). This technique fine-tunes models using a “Reward Model” that acts as a proxy for human judgment. Think of the Reward Model as a judge: if the judge has a keen eye and clear values, the AI learns to behave well. If the judge is confused, inconsistent, or easily tricked, the AI learns the wrong lessons. ...

10 min · 2013 words
[Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation 🔗](https://arxiv.org/abs/2405.19842)

Breaking the Shortcut: How CasCoD Teaches Small Models to Reason Like Giants

In the rapidly evolving world of Artificial Intelligence, we are witnessing a “survival of the fittest” regarding model size. Large Language Models (LLMs) like GPT-4 possess an emergent ability known as Chain-of-Thought (CoT) reasoning. Instead of just jumping to an answer, they break down complex problems into intermediate steps, much like a human showing their work on a math test. However, running these massive models is expensive and computationally heavy. This has led to a surge in research focused on Knowledge Distillation—teaching smaller, more efficient “Student” language models (SLMs) to mimic the reasoning capabilities of “Teacher” LLMs. ...

2024-05 · 8 min · 1687 words
[Improve Dense Passage Retrieval with Entailment Tuning 🔗](https://arxiv.org/abs/2410.15801)

Teaching Retrievers Logic: How Entailment Tuning is Solving the Relevance Gap in RAG

If you have ever built a Retrieval-Augmented Generation (RAG) system or an open-domain Question Answering (QA) bot, you have likely encountered a frustrating phenomenon: the “keyword trap.” You ask your system a specific question, like “Who was the first person to step on the moon?” The retriever goes into your vector database and pulls out a passage. But instead of an article about Neil Armstrong’s historic step, it retrieves a biography that says, “Neil Armstrong loved looking at the moon as a child.” ...

2024-10 · 9 min · 1825 words
[Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation 🔗](https://arxiv.org/abs/2410.09318)

Can We Break ChatGPT? Preventing AI Cheating in CS Classrooms with Adversarial Attacks

The rapid rise of Large Language Models (LLMs) like ChatGPT and GitHub Copilot has fundamentally changed the landscape of software development. For professional developers, these tools are powerful productivity boosters. However, for computer science educators, they represent a looming crisis. In introductory programming courses (often called CS1 and CS2), the primary goal is to teach students the foundational logic of coding—loops, conditionals, and data structures. The problem? LLMs are exceptionally good at these standard exercises. A student can copy a prompt, paste it into ChatGPT, and receive a working solution in seconds, bypassing the learning process entirely. ...

2024-10 · 8 min · 1692 words
[Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective 🔗](https://arxiv.org/abs/2407.02814)

Unmasking Bias in Vision-Language Models: Why Pictures Are the Real Culprits

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) have become superstars. Models like CLIP or GLIP can look at an image and describe it, or read a text description and find the corresponding object in a picture. They are powerful tools, pre-trained on massive datasets of image-text pairs scraped from the internet. However, this power comes with a significant catch: societal bias. Because these models learn from human-generated data, they often inherit our stereotypes. For example, a model might be more likely to associate a “kitchen” with a woman or a “workshop” with a man, regardless of who is actually in the picture. ...

2024-07 · 9 min · 1727 words
[ImageInWords: Unlocking Hyper-Detailed Image Descriptions 🔗](https://arxiv.org/abs/2405.02793)

Beyond Alt-Text: Teaching AI to See Every Detail with ImageInWords

There is an old adage that says, “an image is worth a thousand words.” However, if you look at how we currently train Artificial Intelligence to understand images, the reality is much closer to “an image is worth a dozen words.” State-of-the-art Vision-Language Models (VLMs)—the AI systems responsible for understanding photos and generating art—are largely trained on datasets scraped from the web. These datasets rely on “alt-text,” the short, often SEO-driven captions hidden in website code. While helpful, alt-text is rarely descriptive. It might say “Canon EOS R6” (camera metadata) or “Europe vacation” (location), but it rarely describes the visual scene, lighting, textures, or spatial relationships in detail. ...

2024-05 · 9 min · 1775 words
[If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions 🔗](https://arxiv.org/abs/2403.16442)

If CLIP Could Talk: Uncovering the Secret Language of Vision Models

When you show a picture of a Golden Retriever to a modern AI model like CLIP and it correctly identifies it as a “dog,” it’s easy to make assumptions about how it did that. We naturally assume the model “saw” the floppy ears, the golden fur, and the snout. We assume it matched the visual features of the image to the visual descriptions inherent in the word “dog.” But what if we’re wrong? What if the model isn’t looking at the dog at all, but rather looking for the digital equivalent of a watermark? Or what if it identifies the dog not by its shape, but by “knowing” it’s a pet that lives in North American suburbs? ...

2024-03 · 8 min · 1589 words
[IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method 🔗](https://arxiv.org/abs/2505.06889)

Calculus to the Rescue: How ODEs Make BERT Immune to Adversarial Attacks

If you have ever fine-tuned a large language model (LLM) like BERT on a small dataset, you likely encountered a familiar frustration: overfitting. The model memorizes the training data perfectly but falls apart the moment it sees something slightly different. Even worse, these models are notoriously vulnerable to adversarial attacks. A malicious actor can change a single word in a sentence—a “perturbation”—and cause the model to flip its prediction entirely, even if the sentence looks identical to a human. ...

2025-05 · 8 min · 1621 words
[IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning 🔗](https://arxiv.org/abs/2409.18046)

Bridging the Modality Gap: How IFCap Masters Zero-Shot Image Captioning Without Seeing Images

Image captioning—the art of teaching computers to describe what they see—has traditionally relied on massive datasets of paired images and texts. You show the model a picture of a cat, you give it the text “a cat sitting on a mat,” and repeat this millions of times. While effective, this approach is expensive and computationally heavy. But what if a model could learn to caption images without ever seeing an image during training? ...

2024-09 · 8 min · 1692 words
[IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding 🔗](https://arxiv.org/abs/2409.19627)

Finding the Needle in the Audio Stack: How IDEAW Revolutionizes Neural Watermarking

In the digital age, audio is everywhere. From viral TikTok sounds to proprietary music tracks and AI-generated voiceovers, audio files are shared, remixed, and unfortunately, stolen at an unprecedented rate. This brings us to the crucial concept of Digital Watermarking. Imagine writing your name in invisible ink on a valuable document. That’s essentially what digital watermarking does for media—it embeds hidden information (like copyright ownership) directly into the signal. The catch? It must be imperceptible to the human ear but robust enough to survive compression, noise, and editing. ...

2024-09 · 8 min · 1529 words
[I-AM-G: Interest Multimodal Generator for Item Personalization 🔗](https://aclanthology.org/2024.emnlp-main.1187.pdf)

From Generic to Genetic: How I-AM-G Personalizes Content Using Multimodal AI

Imagine logging into a movie streaming platform. You love adventure films—the rush of adrenaline, the vast landscapes, the hero’s journey. Your friend, on the other hand, loves animation—bright colors, whimsical characters, and exaggerated expressions. Now, imagine both of you see a recommendation for the same new movie. In a traditional system, you’d both see the exact same poster. But what if the poster could change to match your specific taste? You see a gritty, high-contrast poster emphasizing the action; your friend sees a vibrant, stylized version emphasizing the character design. ...

8 min · 1668 words
[I love pineapple on pizza != I hate pineapple on pizza: Stance-Aware Sentence Transformers for Opinion Mining 🔗](https://aclanthology.org/2024.emnlp-main.1171.pdf)

Why Your AI Thinks 'I Love Pizza' and 'I Hate Pizza' Are the Same Thing (And How to Fix It)

Imagine you are building a system to analyze social media debates. You want to separate people who love pineapple on pizza from those who consider it a culinary crime. You feed two sentences into a standard state-of-the-art AI model: “I love pineapple on pizza.” “I hate pineapple on pizza.” To a human, these are opposites. To a standard Sentence Transformer model, they are nearly identical. Why? Because they are both about the topic of “pineapple on pizza.” ...
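
The claim above is easy to check. Here is a minimal sketch assuming the sentence-transformers library and the generic all-MiniLM-L6-v2 checkpoint (neither is prescribed by the paper; this demonstrates the baseline behavior the paper sets out to fix, not its stance-aware method):

```python
# Minimal sketch: an off-the-shelf sentence embedding model rates
# opposite-stance sentences as highly similar because they share a topic.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic checkpoint (assumption)

sentences = [
    "I love pineapple on pizza.",
    "I hate pineapple on pizza.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")  # typically very high despite opposite stances
```

A stance-aware embedding, by contrast, should push such pairs apart while keeping genuinely agreeing opinions close, which is exactly the gap the paper targets.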

10 min · 1986 words
[I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation 🔗](https://arxiv.org/abs/2407.14767)

Can AI Admit When It's Wrong? Teaching LLMs to Ask for Help

The current generation of Large Language Models (LLMs) is nothing short of impressive. They can write poetry, debug code, and summarize complex historical events. However, anyone who has used tools like ChatGPT or Claude extensively knows they suffer from a specific, persistent flaw: overconfidence. When an LLM faces an ambiguous instruction or lacks the necessary context to solve a problem, it rarely pauses to say, “I’m not sure, can you clarify?” Instead, it often guesses, producing a confident but incorrect answer—a phenomenon often linked to hallucination. ...

2024-07 · 8 min · 1659 words
[I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses 🔗](https://arxiv.org/abs/2402.11192)

I Learn Better If You Speak My Language: Why Synthetic Data Beats Human Gold-Standard in LLM Training

In the rapidly evolving world of Large Language Models (LLMs), there is a widely accepted hierarchy of data quality. At the top sits human-annotated data—the “gold standard”—carefully crafted by experts. Below that is synthetic data generated by models, often viewed as a useful but slightly inferior substitute when human data is scarce. But what if that hierarchy is wrong? A fascinating research paper titled “I Learn Better If You Speak My Language” explores a counter-intuitive phenomenon: fine-tuning a small LLM (like Mistral or Llama-2) on responses generated by other LLMs (like GPT-4) often yields better results than fine-tuning on human-written responses. ...

2024-02 · 8 min · 1652 words