[ImageInWords: Unlocking Hyper-Detailed Image Descriptions 🔗](https://arxiv.org/abs/2405.02793)

Beyond Alt-Text: Teaching AI to See Every Detail with ImageInWords

There is an old adage that says, “an image is worth a thousand words.” However, if you look at how we currently train Artificial Intelligence to understand images, the reality is much closer to “an image is worth a dozen words.” State-of-the-art Vision-Language Models (VLMs)—the AI systems responsible for understanding photos and generating art—are largely trained on datasets scraped from the web. These datasets rely on “alt-text,” the short, often SEO-driven captions hidden in website code. While helpful, alt-text is rarely descriptive. It might say “Canon EOS R6” (camera metadata) or “Europe vacation” (location), but it rarely describes the visual scene, lighting, textures, or spatial relationships in detail. ...

2024-05 · 9 min · 1775 words
[If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions 🔗](https://arxiv.org/abs/2403.16442)

If CLIP Could Talk: Uncovering the Secret Language of Vision Models

When you show a picture of a Golden Retriever to a modern AI model like CLIP and it correctly identifies it as a “dog,” it’s easy to make assumptions about how it did that. We naturally assume the model “saw” the floppy ears, the golden fur, and the snout. We assume it matched the visual features of the image to the visual descriptions inherent in the word “dog.” But what if we’re wrong? What if the model isn’t looking at the dog at all, but rather looking for the digital equivalent of a watermark? Or what if it identifies the dog not by its shape, but by “knowing” it’s a pet that lives in North American suburbs? ...

2024-03 · 8 min · 1589 words
[IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method 🔗](https://arxiv.org/abs/2505.06889)

Calculus to the Rescue: How ODEs Make BERT Immune to Adversarial Attacks

If you have ever fine-tuned a large language model (LLM) like BERT on a small dataset, you likely encountered a familiar frustration: overfitting. The model memorizes the training data perfectly but falls apart the moment it sees something slightly different. Even worse, these models are notoriously fragile to adversarial attacks. A malicious actor can change a single word in a sentence—a “perturbation”—and cause the model to flip its prediction entirely, even if the sentence looks identical to a human. ...

2025-05 · 8 min · 1621 words
[IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning 🔗](https://arxiv.org/abs/2409.18046)

Bridging the Modality Gap: How IFCap Masters Zero-Shot Image Captioning Without Seeing Images

Image captioning—the art of teaching computers to describe what they see—has traditionally relied on massive datasets of paired images and texts. You show the model a picture of a cat, you give it the text “a cat sitting on a mat,” and repeat this millions of times. While effective, this approach is expensive and computationally heavy. But what if a model could learn to caption images without ever seeing an image during training? ...

2024-09 · 8 min · 1692 words
[IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding 🔗](https://arxiv.org/abs/2409.19627)

Finding the Needle in the Audio Stack: How IDEAW Revolutionizes Neural Watermarking

In the digital age, audio is everywhere. From viral TikTok sounds to proprietary music tracks and AI-generated voiceovers, audio files are shared, remixed, and unfortunately, stolen at an unprecedented rate. This brings us to the crucial concept of Digital Watermarking. Imagine writing your name in invisible ink on a valuable document. That’s essentially what digital watermarking does for media—it embeds hidden information (like copyright ownership) directly into the signal. The catch? It must be imperceptible to the human ear but robust enough to survive compression, noise, and editing. ...

2024-09 · 8 min · 1529 words
[I-AM-G: Interest Multimodal Generator for Item Personalization 🔗](https://aclanthology.org/2024.emnlp-main.1187.pdf)

From Generic to Genetic: How I-AM-G Personalizes Content Using Multimodal AI

Imagine logging into a movie streaming platform. You love adventure films—the rush of adrenaline, the vast landscapes, the hero’s journey. Your friend, on the other hand, loves animation—bright colors, whimsical characters, and exaggerated expressions. Now, imagine both of you see a recommendation for the same new movie. In a traditional system, you’d both see the exact same poster. But what if the poster could change to match your specific taste? You see a gritty, high-contrast poster emphasizing the action; your friend sees a vibrant, stylized version emphasizing the character design. ...

8 min · 1668 words
[I love pineapple on pizza != I hate pineapple on pizza: Stance-Aware Sentence Transformers for Opinion Mining 🔗](https://aclanthology.org/2024.emnlp-main.1171.pdf)

Why Your AI Thinks 'I Love Pizza' and 'I Hate Pizza' Are the Same Thing (And How to Fix It)

Imagine you are building a system to analyze social media debates. You want to separate people who love pineapple on pizza from those who consider it a culinary crime. You feed two sentences into a standard state-of-the-art AI model: “I love pineapple on pizza.” “I hate pineapple on pizza.” To a human, these are opposites. To a standard Sentence Transformer model, they are nearly identical. Why? Because they are both about the topic of “pineapple on pizza.” ...
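You can see this collapse for yourself by embedding the two sentences with an off-the-shelf Sentence Transformer and comparing their cosine similarity. The sketch below uses the public `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint as an illustrative choice (not the model studied in the paper); the reported similarity is typically high despite the opposite stances.

```python
# Minimal sketch: how close does a standard sentence embedding model place
# two sentences with the same topic but opposite stances?
# Assumes the public `sentence-transformers` package; the checkpoint name is
# an illustrative choice, not the one used in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love pineapple on pizza.",
    "I hate pineapple on pizza.",
]

# Encode both sentences into dense vectors.
embeddings = model.encode(sentences, convert_to_tensor=True)

# A cosine similarity near 1.0 means the model treats the sentences as
# near-duplicates, even though their stances are opposite.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")
```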

10 min · 1986 words
[I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation 🔗](https://arxiv.org/abs/2407.14767)

Can AI Admit When It's Wrong? Teaching LLMs to Ask for Help

The current generation of Large Language Models (LLMs) is nothing short of impressive. They can write poetry, debug code, and summarize complex historical events. However, anyone who has used tools like ChatGPT or Claude extensively knows they suffer from a specific, persistent flaw: overconfidence. When an LLM faces an ambiguous instruction or lacks the necessary context to solve a problem, it rarely pauses to say, “I’m not sure, can you clarify?” Instead, it often guesses, producing a confident but incorrect answer—a phenomenon often linked to hallucination. ...

2024-07 · 8 min · 1659 words
[I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses 🔗](https://arxiv.org/abs/2402.11192)

I Learn Better If You Speak My Language: Why Synthetic Data Beats Human Gold-Standard in LLM Training

In the rapidly evolving world of Large Language Models (LLMs), there is a widely accepted hierarchy of data quality. At the top sits human-annotated data—the “gold standard”—carefully crafted by experts. Below that is synthetic data generated by models, often viewed as a useful but slightly inferior substitute when human data is scarce. But what if that hierarchy is wrong? A fascinating research paper titled “I Learn Better If You Speak My Language” explores a counter-intuitive phenomenon: fine-tuning a small LLM (like Mistral or Llama-2) on responses generated by other LLMs (like GPT-4) often yields better results than fine-tuning on human-written responses. ...

2024-02 · 8 min · 1652 words
[I Could've Asked That: Reformulating Unanswerable Questions 🔗](https://aclanthology.org/2024.emnlp-main.242.pdf)

Beyond 'I Don't Know': Teaching AI to Fix Our Unanswerable Questions

Imagine you are reading a dense legal contract or a complex medical journal. You aren’t an expert, so you turn to an AI assistant—like ChatGPT or a specialized document reader—to help you understand it. You ask a question based on your limited understanding: “What is the penalty if the tenant paints the walls?” The AI scans the document and replies: “The document does not mention penalties for painting walls.” ...

9 min · 1761 words
[Humans or LLMs as the Judge? A Study on Judgement Bias 🔗](https://aclanthology.org/2024.emnlp-main.474.pdf)

Who Watches the Watchmen? Uncovering Bias in Human and AI Judges

The explosion of Large Language Models (LLMs) like GPT-4, Claude, and Gemini has brought us remarkable capabilities in natural language processing. But with great power comes a difficult question: How do we know if these models are actually doing a good job? Evaluating an LLM isn’t like checking a math test. In open-ended tasks—like writing an essay, summarizing a story, or providing therapy-like advice—there is no single “correct” answer. Historically, we relied on humans to grade these responses. Recently, however, the field has shifted toward “LLM-as-a-judge,” where powerful models like GPT-4 are used to grade the outputs of other models. It’s faster, cheaper, and scalable. ...

8 min · 1642 words
[Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations 🔗](https://arxiv.org/abs/2410.17099)

Beyond the Wisdom of Crowds: How Human-LLM Hybrid Frameworks Are Revolutionizing Text Annotation

The “Wisdom of Crowds” is a concept as old as statistics itself. The idea is simple: if you ask enough people to guess the number of jellybeans in a jar, the average of their guesses is often startlingly close to the truth—closer, in fact, than the guess of any single expert. In the world of Machine Learning (ML) and Natural Language Processing (NLP), we rely heavily on this principle. We use platforms like Amazon Mechanical Turk or Lancers to gather labeled data. When the task is simple, like clicking a button to say whether an image contains a cat, aggregating the answers is easy: just take the majority vote. ...
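For categorical labels, “just take the majority vote” really is a one-liner; the hard part, which the post goes on to explore, is aggregating free-form text answers. Here is a minimal sketch of the easy case, with made-up worker answers for illustration.

```python
# Minimal sketch of majority-vote aggregation for categorical crowd labels.
# The worker answers below are made up for illustration.
from collections import Counter

def majority_vote(labels):
    """Return the most common label among crowd workers (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

# Five workers label the same image: does it contain a cat?
worker_answers = ["cat", "cat", "no cat", "cat", "no cat"]
print(majority_vote(worker_answers))  # -> "cat"
```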

2024-10 · 10 min · 2080 words
[How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective 🔗](https://arxiv.org/abs/2410.10093)

Beyond SFT: Aligning LLMs with Generalized Self-Imitation Learning (GSIL)

Beyond SFT: Aligning LLMs with Generalized Self-Imitation Learning (GSIL) Large Language Models (LLMs) are impressive, but raw pre-trained models are like brilliant but unruly students. They know a lot about the world, but they don’t always know how to behave, follow instructions, or solve complex problems step-by-step. To fix this, we perform a process called alignment. Currently, the standard recipe for alignment has two main stages: Supervised Fine-Tuning (SFT): You show the model examples of good prompts and responses, and it learns to copy them. Preference Fine-Tuning (RLHF/DPO): You show the model two responses (one good, one bad) and teach it to prefer the good one. The second stage is powerful but expensive. It requires collecting human preference data (“Response A is better than Response B”), which is costly and slow to scale. What if we could achieve the high performance of preference learning using only the demonstration data from the first stage? ...
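To make the two data regimes concrete, here is a hedged sketch of what a demonstration record (SFT) versus a preference record (RLHF/DPO) typically look like; the field names are generic conventions, not the paper’s exact schema. GSIL’s appeal is that it only needs records of the first kind.

```python
# Illustrative sketch of the two alignment data formats described above.
# Field names are generic conventions, not the paper's exact schema.

# Stage 1 (SFT): demonstration data -- a prompt and one good response.
demonstration_example = {
    "prompt": "Summarize the water cycle in two sentences.",
    "response": "Water evaporates, condenses into clouds, and falls as precipitation. "
                "It then collects in rivers and oceans, and the cycle repeats.",
}

# Stage 2 (RLHF/DPO): preference data -- a prompt plus a chosen and a rejected
# response, which requires costly human comparisons to collect.
preference_example = {
    "prompt": "Summarize the water cycle in two sentences.",
    "chosen": "Water evaporates, condenses into clouds, and falls as precipitation...",
    "rejected": "The water cycle is when water moves around. It is important.",
}
```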

2024-10 · 9 min · 1730 words
[How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning 🔗](https://arxiv.org/abs/2402.02872)

Deconstructing In-Context Learning: The Two-Tower Mechanism Hidden Inside LLMs

Large Language Models (LLMs) like GPT-4 and Llama have displayed a fascinating emergent ability known as In-Context Learning (ICL). This is the phenomenon where you provide a model with a few examples (demonstrations) in the prompt—like “English: Cat, French: Chat”—and the model instantly learns the pattern to complete a new example, all without any parameter updates or retraining. While we use ICL every day, the underlying mechanism remains somewhat of a “black box.” How exactly does the model move information from the demonstration examples to the final prediction? Does it actually “learn” the task, or is it just relying on pre-existing knowledge? ...
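To ground the “English: Cat, French: Chat” example, here is a minimal sketch of what an in-context-learning prompt looks like: a few demonstrations followed by a query, with no weight updates anywhere. The demonstrations are illustrative, not taken from the paper.

```python
# Minimal sketch of an in-context-learning prompt: the model sees a few
# demonstrations and is expected to continue the pattern with no fine-tuning.
demonstrations = [
    ("English: Cat", "French: Chat"),
    ("English: Dog", "French: Chien"),
    ("English: Bird", "French: Oiseau"),
]

query = "English: House"

prompt = "\n".join(f"{src}\n{tgt}" for src, tgt in demonstrations)
prompt += f"\n{query}\nFrench:"

print(prompt)
# An LLM (e.g., GPT-4 or Llama, via any completion API) asked to continue this
# prompt should infer the translation pattern and output " Maison".
```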

2024-02 · 9 min · 1817 words
[How Susceptible are Large Language Models to Ideological Manipulation? 🔗](https://arxiv.org/abs/2402.11725)

Brainwashing AI: How Easily Can LLMs Be Ideologically Manipulated?

Large Language Models (LLMs) like ChatGPT and Llama-2 have become our digital interlocutors, helping us draft emails, summarize news, and answer complex questions. But as we increasingly rely on them for information, a critical question arises: Does the model have an ideology? And if so, can that ideology be hijacked? We often think of AI alignment as preventing models from generating hate speech or building bombs. However, a subtler and perhaps more pervasive risk exists: ideological manipulation. Can a malicious actor take a neutral model and, with a tiny amount of data, turn it into a radical partisan? ...

2024-02 · 8 min · 1507 words
[How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics 🔗](https://arxiv.org/abs/2410.03429)

Stop Cheating: How to Find the "Real" Hard Questions in NLI Datasets

Imagine you are taking a multiple-choice history test. You don’t actually know the history, but you notice a pattern: every time the answer contains the word “never,” it’s the correct choice. You ace the test, scoring 100%. But have you learned history? No. You’ve just learned a statistical shortcut. This scenario describes a massive problem in current Artificial Intelligence, specifically in Natural Language Inference (NLI). Models like BERT and RoBERTa achieve superhuman scores on benchmark datasets, but they often fail when faced with real-world, nuanced language. Why? Because the datasets they are tested on are full of “spurious correlations”—linguistic shortcuts that allow models to guess the right answer without understanding the logic. ...

2024-10 · 9 min · 1728 words
[How Far Can We Extract Diverse Perspectives from Large Language Models? 🔗](https://arxiv.org/abs/2311.09799)

Breaking the Echo Chamber: Can LLMs Simulate Diverse Human Perspectives?

In the world of Artificial Intelligence, Large Language Models (LLMs) are often described as “compressed knowledge.” They have devoured varied texts from millions of human authors, encompassing a vast spectrum of beliefs, cultures, and values. Yet, when we chat with a model like GPT-4, we often receive a single, polished, “majority-vote” answer. This raises a fascinating research question: If these models were trained on diverse perspectives, can we reverse-engineer them to extract that diversity? Can an LLM step out of its default persona and simulate a crowd of people with disagreeing opinions? ...

2023-11 · 7 min · 1308 words
[How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? 🔗](https://arxiv.org/abs/2404.12866)

Unlocking the Power of Text in Multimodal In-Context Learning

In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can understand both text and images—have become the new frontier. A key capability of these models is In-Context Learning (ICL). This is the ability of a model to learn a new task simply by looking at a few examples provided in the prompt, without requiring any updates to its weights (no fine-tuning necessary). For example, if you want an MLLM to write a funny caption for an image, you might first show it three examples of images with funny captions. The model “gets the idea” and applies that pattern to your new image. ...

2024-04 · 9 min · 1755 words
[How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? 🔗](https://arxiv.org/abs/2410.04545)

The Bias of Disclosure: How Knowing AI Helped You Write Changes How You Are Judged

We have entered a new era of digital composition. Gone are the days when “writing assistance” simply meant a red squiggly line under a misspelled word. With the advent of Large Language Models (LLMs) like GPT-4, writing has evolved into a co-creative process. Humans prompt, AI drafts, humans refine, and AI polishes. This paradigm shift raises profound questions about authorship, creativity, and quality. However, a critical psychological question remains unanswered: How do readers react when they know a piece of text was co-written by an AI? ...

2024-10 · 9 min · 1743 words
[How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with Really Good Data 🔗](https://aclanthology.org/2024.emnlp-main.777.pdf)

The Illusion of Competence: Why Your Code LLM Might Be Cheating (And How to Fix It)

If you have been following the explosion of Large Language Models (LLMs) specialized for coding, you have likely seen the leaderboards. Every week, a new open-source model claims to rival GPT-4 on benchmarks like HumanEval. It seems we are in a golden age of automated programming. But there is a catch. If you take these high-flying models and test them on newer, fresher problems from competitive programming sites, their performance often collapses. Why? ...

7 min · 1468 words