[On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices 🔗](https://arxiv.org/abs/2402.12817)

The Butterfly Effect in NLP: Disentangling Randomness in Few-Shot Learning

In the world of Machine Learning, particularly Natural Language Processing (NLP), we often chase the highest accuracy score on a benchmark. But there is a ghost in the machine: randomness. Imagine you are training a model with very limited data—perhaps a few-shot classification task. You run the experiment and get an F1 score of 85%. You are ecstatic. But then, you change the “random seed”—a simple integer that controls how data is shuffled or how weights are initialized—and run it again. This time, the score drops to 60%. ...

2024-02 · 9 min · 1767 words
[On Mitigating Performance Disparities in Multilingual Speech Recognition 🔗](https://aclanthology.org/2024.emnlp-main.323.pdf)

Having Your Cake and Eating It Too: Balancing Accuracy and Fairness in ASR with Adapter Fusion

Introduction Imagine using a voice assistant that understands your brother perfectly but struggles to comprehend a single sentence you say. For millions of users, this isn’t a hypothetical scenario—it is the reality of interacting with modern AI. Automatic Speech Recognition (ASR) systems have become ubiquitous, powering everything from virtual assistants like Siri and Alexa to automated customer service lines and dictation software. However, despite their widespread adoption, these systems often suffer from significant performance disparities. They might work flawlessly for male speakers of English but struggle with female speakers or speakers of lower-resource languages. ...

9 min · 1816 words
[On Fake News Detection with LLM Enhanced Semantics Mining 🔗](https://aclanthology.org/2024.emnlp-main.31.pdf)

Can LLMs Catch Fake News? Why Semantics Matter More Than Style

In the digital age, the rapid dissemination of information is a double-edged sword. While we have instant access to news, we are also bombarded by misinformation. Detecting fake news has become one of the most critical challenges in computer science and social media analysis. For a long time, researchers relied heavily on social context—who retweeted whom, how quickly a post spread, and user comments—to identify fake news. But there are two glaring problems with this approach: privacy restrictions and the need for early detection. Often, social context data is unavailable, incomplete, or arrives too late. We need methods that can look at the content of the news itself and determine its authenticity. ...

8 min · 1640 words
[On Eliciting Syntax from Language Models via Hashing 🔗](https://arxiv.org/abs/2410.04074)

Hacking Syntax: How Hashing and Contrastive Learning Reveal Grammar in LLMs

If you have ever played with Word2Vec or early language models, you are likely familiar with the famous algebraic miracle of NLP: King - Man + Woman = Queen. This vector arithmetic suggested that language models (LMs) don’t just memorize text; they implicitly learn semantic and syntactic structures. However, extracting that structure explicitly—drawing the actual grammar tree of a sentence—usually requires supervised training on expensive, hand-labeled datasets like the Penn Treebank. But what if we could get a pre-trained Large Language Model (LLM) to confess its grammatical knowledge without ever showing it a labeled parse tree? ...

2024-10 · 10 min · 2051 words
[On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning 🔗](https://arxiv.org/abs/2406.11823)

Breaking the Resolution Curse: How ELVA Makes Vision-Language Models Efficient

Introduction: The Cost of Seeing Clearly In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can see and talk—have become the new frontier. Systems like GPT-4V have demonstrated incredible capabilities, describing complex scenes and answering questions about images. However, a significant bottleneck remains: efficiency. For a model to understand text inside an image (like reading a receipt or analyzing a chart), it typically needs high-resolution inputs. High resolution means dividing the image into thousands of small patches (tokens). For a standard Transformer architecture, more tokens result in quadratically higher computational costs. This creates a barrier for real-world applications where latency and memory are limited. ...

2024-06 · 7 min · 1350 words
[OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer 🔗](https://arxiv.org/abs/2406.16620)

How OmAgent Solves the "Long Video" Problem with Divide-and-Conquer and Recursive Rewinding

Imagine you are trying to find a specific detail in a 24-hour CCTV recording or a dense three-hour film—perhaps the exact moment a character dropped a cigarette or the license plate of a car that appeared for two seconds. As a human, you wouldn’t memorize every pixel of the video. Instead, you would watch it, form a general impression of the plot, and when asked a specific question, you would scrub through the timeline, “rewinding” to the relevant section to inspect the details. ...

2024-06 · 9 min · 1833 words
[Oddballs and Misfits: Detecting Implicit Abuse in Which Identity Groups are Depicted as Deviating from the Norm 🔗](https://aclanthology.org/2024.emnlp-main.132.pdf)

When "Normal" is a Weapon: Detecting Implicit Abuse and Norm Deviation in NLP

Introduction In the early days of content moderation, detecting abusive language was largely a game of keyword matching. If a comment contained a racial slur, a curse word, or an explicit threat, it was flagged. But as Natural Language Processing (NLP) has advanced, so too has the subtlety of online abuse. Consider the difference between these two sentences: “You are a stupid idiot.” “Gays sprinkle flour over their gardens for good luck.” The first is explicitly abusive; it uses clear, negative vocabulary. The second sentence, however, is perplexing. It contains no slurs. It contains no angry words. Grammatically, it is a neutral, declarative statement. Yet, if you encountered this sentence on social media, you would likely recognize it as a form of abuse. It is painting a specific identity group as “weird,” “other,” or fundamentally different from the rest of society. ...

10 min · 2002 words
[ORPO: Monolithic Preference Optimization without Reference Model 🔗](https://arxiv.org/abs/2403.07691)

One Step to Align Them All: Understanding ORPO

Large Language Models (LLMs) are impressive, but raw pre-trained models are like unpolished gems. They can predict the next token, but they often struggle to follow instructions or adhere to human safety standards. To fix this, we typically rely on a multi-stage training pipeline: Pre-training, Supervised Fine-Tuning (SFT), and finally, Preference Alignment (using methods like RLHF or DPO). While effective, this pipeline is complex, resource-intensive, and brittle. In this post, we are diving deep into a paper from KAIST AI that challenges this status quo. The paper, “ORPO: Monolithic Preference Optimization without Reference Model,” introduces a method to merge Supervised Fine-Tuning and Preference Alignment into a single, efficient process. If you are a student of NLP or machine learning, understanding ORPO offers a fascinating look into how we can make model alignment cheaper, faster, and surprisingly, more effective. ...

2024-03 · 8 min · 1575 words
[OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants 🔗](https://arxiv.org/abs/2406.14883)

Beyond Sentiment: How LLMs Help Uncover the Nuances of Online Discourse on Homelessness

Social media platforms have become the de facto town squares of the 21st century. They are repositories of public opinion, offering researchers a massive dataset on how society feels about critical issues. However, for social scientists, this scale presents a paradox: the data is abundant, but understanding it at a granular level is incredibly difficult. Take the issue of homelessness in the United States. It is a complex, sensitive topic that evokes a wide spectrum of emotions—from sympathy and calls for aid to anger and resentment. Traditional Natural Language Processing (NLP) tools, like sentiment analysis (positive vs. negative) or toxicity detection, are often too blunt for this job. A tweet criticizing the government’s housing policy might be “negative,” but it’s not necessarily “toxic.” Conversely, a tweet making a subtle, harmful stereotype about people experiencing homelessness (PEH) might slip past a toxicity filter entirely. ...

2024-06 · 6 min · 1264 words
[NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning 🔗](https://arxiv.org/abs/2404.00459)

Why LLMs Can't Count: Fixing Numerical Reasoning with NumeroLogic

It is one of the great ironies of modern Artificial Intelligence: a Large Language Model (LLM) like GPT-4 can write a sonnet in the style of Shakespeare, debug complex Python code, and pass the Bar exam, yet it often stumbles when asked to multiply two three-digit numbers. For students and researchers exploring the architecture of Transformers, this behavior can be baffling. Computers are, at their core, calculators. Why is the most advanced “computer brain” we’ve ever built so bad at basic arithmetic? ...

2024-04 · 8 min · 1670 words
[Null-Shot Prompting: Rethinking Prompting Large Language Models With Hallucination 🔗](https://aclanthology.org/2024.emnlp-main.740.pdf)

The Pinocchio Strategy: Boosting LLM Performance by Encouraging Hallucination

In the world of Large Language Models (LLMs), “hallucination” is usually a dirty word. It refers to the moment an AI confidently asserts that the moon is made of green cheese or invents a historical event that never happened. Researchers spend millions of dollars and countless hours trying to stop models from hallucinating. But what if hallucination isn’t just a bug? What if it’s a feature that, when manipulated correctly, can actually make a model smarter? ...

8 min · 1610 words
[NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data 🔗](https://arxiv.org/abs/2402.15343)

How to Train a Tiny NER Model to Rival LLMs: Inside NuNER

Named Entity Recognition (NER) is one of the bread-and-butter tasks of Natural Language Processing. Whether it is extracting stock tickers from financial news, identifying proteins in biomedical papers, or parsing dates from legal contracts, NER is everywhere. For years, the standard workflow for building a custom NER model has been rigid: take a pre-trained foundation model like BERT or RoBERTa, hire humans to annotate thousands of examples for your specific entities, and fine-tune the model. This process is slow, expensive, and inflexible. ...

2024-02 · 8 min · 1505 words
[Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment 🔗](https://arxiv.org/abs/2406.12606)

Less is More: Why Pruning Neurons Improves LLM Alignment

Since the transformer architecture burst onto the scene with the famous paper “Attention Is All You Need,” the philosophy in Deep Learning has often leaned towards “more is better.” More data, more layers, more parameters. However, when it comes to alignment—the process of ensuring Large Language Models (LLMs) are helpful, honest, and harmless—it turns out that using everything might actually be the problem. In a fascinating research paper titled “Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment,” researchers from Renmin University of China and Beihang University challenge the status quo. They propose a counter-intuitive idea: by identifying and training only the most relevant neurons (and ignoring the rest), we can align models better, faster, and more effectively than by updating every single parameter. ...

2024-06 · 8 min · 1548 words
[Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation 🔗](https://arxiv.org/abs/2404.06809)

Trust Issues in AI: How Credibility-Aware Generation Fixes RAG's Biggest Flaw

Introduction Retrieval-Augmented Generation (RAG) has become the de facto standard for building knowledgeable AI systems. By connecting Large Language Models (LLMs) to external databases, we promised to solve the twin problems of hallucinations and knowledge cutoffs. The logic was simple: if the model doesn’t know the answer, let it look it up. But there is a flaw in this logic. RAG systems operate on a dangerous assumption: that everything retrieved is true. ...

2024-04 · 8 min · 1541 words
[NOISEBENCH: Benchmarking the Impact of Real Label Noise on Named Entity Recognition 🔗](https://arxiv.org/abs/2405.07609)

Why Your Model Believes Lies: The Reality of Label Noise in NER

In the world of supervised machine learning, we often operate under a comfortable assumption: the “Ground Truth” is actually true. We assume our training datasets—painstakingly annotated by humans or scraped from reliable sources—are accurate. But anyone who has looked closely at a large dataset knows this is a myth. Datasets are messy. They contain mistakes, inconsistencies, and what researchers call label noise. In Named Entity Recognition (NER), where models must identify and classify proper names (like organizations, locations, or persons) in text, this noise can be particularly damaging. If a training set mislabels “Apple” as a Location instead of an Organization, the model learns a false pattern. ...

2024-05 · 8 min · 1620 words
[Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature 🔗](https://aclanthology.org/2024.emnlp-main.196.pdf)

Listening to the Past: How AI Reveals the Soundscapes of 19th Century Literature

Introduction When we think about history, we usually visualize it. We picture the sepia-toned photographs of the late 19th century, the industrial smog of growing cities, or the fashion of the Victorian era. But have you ever stopped to wonder what the past sounded like? Before the advent of recording technology, the auditory world was ephemeral. We cannot listen to a street corner in Copenhagen in 1880. However, we have “earwitnesses”—the authors who lived through those times and documented their sensory environments in literature. The novels of the Scandinavian “Modern Breakthrough” (1870–1899) are filled with the clatter of horse-drawn carriages, the hiss of new steam engines, and the murmurs of urban crowds. ...

9 min · 1835 words
[No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages 🔗](https://arxiv.org/abs/2411.03769)

Can AI Feel Art? Teaching Vision Models to Understand Culture in 28 Languages

Introduction In the world of Artificial Intelligence, Computer Vision has historically been obsessed with objectivity. Show a model a picture of a park, and it will dutifully report: “A dog running on green grass.” This is impressive, but it misses a fundamental layer of human experience: subjectivity and emotion. When we look at a painting—say, Starry Night—we don’t just see “yellow circles on a blue background.” We feel awe, melancholy, or excitement. ...

2024-11 · 7 min · 1446 words
[Neuron-Level Knowledge Attribution in Large Language Models 🔗](https://arxiv.org/abs/2312.12141)

Inside the Black Box: Mapping Knowledge Neurons in LLMs

Large Language Models (LLMs) like GPT-4 and Llama have demonstrated a remarkable ability to store and recall factual knowledge. When you ask an LLM, “What is the capital of France?”, it effortlessly retrieves “Paris.” But where exactly does this information live? Is “Paris” stored in a specific cluster of neurons? And if so, how does the model know when to activate them? ...

2023-12 · 9 min · 1717 words
[Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation 🔗](https://arxiv.org/abs/2404.11201)

Neuron Specialization: Unlocking the Intrinsic Modularity of Multilingual Models

The dream of a “universal translator”—a single AI model capable of fluently speaking dozens, if not hundreds, of languages—is one of the Holy Grails of Natural Language Processing (NLP). Companies and researchers are racing to build massive multilingual models that can translate English to French, Chinese to Swahili, and everything in between. But there is a hidden conflict inside these models. When you force one neural network to learn thirty different languages, the languages often fight for “brain space.” This phenomenon is known as negative interference. High-resource languages (like English or German) tend to dominate the model’s parameters, causing performance to drop for low-resource languages. At the same time, spreading a single model across so many translation directions can degrade performance even on high-resource pairs compared to specialized, single-language models. ...

2024-04 · 8 min · 1589 words
[NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries 🔗](https://aclanthology.org/2024.emnlp-main.1050.pdf)

Unlocking the Brain: How AI and a New Dataset Are Decoding Clinical Trials

Introduction Developing new drugs is notoriously difficult, but nowhere is the struggle more apparent than in neurology. The failure rate for Alzheimer’s disease clinical trials, for instance, has historically hovered above 99%. Billions of dollars and decades of research often end without a viable cure. However, even failed trials contain a goldmine of data. Every trial registered represents a hypothesis, a methodology, and a specific intervention tested on a specific population. ...

8 min · 1555 words