[Learning to Extract Structured Entities Using Language Models 🔗](https://arxiv.org/abs/2402.04437)

Beyond Triplets: Revolutionizing Information Extraction with Structured Entities and MuSEE

In the vast ocean of unstructured text on the internet—Wikipedia pages, news articles, financial reports—lies a treasure trove of data waiting to be organized. For years, the field of Information Extraction (IE) has been the miner of this digital age, digging through paragraphs to find relationships between things. Traditionally, this has been done by hunting for “triplets”: a Subject, a Relation, and an Object (e.g., Bill Gates, Co-founder, Microsoft). While effective, this approach has limits. It treats information as a bag of disconnected facts rather than a cohesive profile of an entity. ...
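
To make the contrast concrete, here is a minimal sketch (the dictionary layout and field names are my own illustration, not the paper's MuSEE output format) of the same facts as a bag of triplets versus a single structured entity profile:

```python
# Illustrative only: the same facts as disconnected triplets...
triplets = [
    ("Bill Gates", "co-founder of", "Microsoft"),
    ("Bill Gates", "born in", "Seattle"),
    ("Microsoft", "founded in", "1975"),
]

# ...versus cohesive entity profiles, where each entity's properties
# live together in one record instead of being scattered across facts.
entities = {
    "Bill Gates": {"type": "Person", "co-founder of": "Microsoft", "born in": "Seattle"},
    "Microsoft": {"type": "Company", "founded in": "1975"},
}
```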

2024-02 · 10 min · 1982 words
[Learning to Correct for QA Reasoning with Black-box LLMs 🔗](https://arxiv.org/abs/2406.18695)

CoBB: How to Fix Black-Box LLM Errors Without Accessing Weights

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4, Claude, and Gemini have become ubiquitous. They possess incredible reasoning capabilities, yet they remain prone to hallucinations, biases, and reasoning errors. For researchers and engineers, the standard solution to these errors is usually fine-tuning or steering the model. However, a significant barrier exists: most state-of-the-art models are black boxes. We interact with them via APIs, sending a prompt and receiving text. We do not have access to the model’s weights, gradients, or often even the output token probabilities (logits). This opacity makes traditional adaptation methods—which rely on accessing internal model states—impossible. ...

2024-06 · 8 min · 1681 words
[Learning from Natural Language Explanations for Generalizable Entity Matching 🔗](https://arxiv.org/abs/2406.09330)

Can Explanations Teach Small Models to Generalize? A Deep Dive into Entity Matching

Imagine you are a data scientist at a massive e-commerce aggregator. You have a database of products from Amazon and another from Google Shopping. Your task is to merge them. On one side, you have a record: iPhone 13, 128GB, Midnight. On the other side: Apple iPhone 13 - Black - 128 GB Storage. To a human, these are obviously the same product. But to a machine, they are just strings of characters. This is the problem of Entity Matching (EM)—identifying when different records refer to the same real-world entity. ...

2024-06 · 8 min · 1642 words
[Learning Planning-based Reasoning via Trajectories Collection and Process Reward Synthesizing 🔗](https://arxiv.org/abs/2402.00658)

Baking System 2 Thinking into LLMs: How Offline Simulation Improves Reasoning

Large Language Models (LLMs) like GPT-4 and Llama 2 have dazzled the world with their ability to write poetry, code, and essays. However, when it comes to rigorous logical reasoning or complex multi-step mathematics, cracks often appear in the facade. The model might hallucinate facts, make logical leaps that don’t follow, or simply guess the final answer without understanding the “why.” ...

2024-02 · 8 min · 1529 words
[Learning Personalized Alignment in Evaluating Open-ended Text Generation 🔗](https://arxiv.org/abs/2310.03304)

Beyond the Average User: How PERSE Teaches AI to Evaluate Text Like a Human

In the world of Artificial Intelligence, we have become very good at generating text. Models like GPT-4 and LLaMA-2 can write poetry, code, and short stories with ease. However, evaluating that text remains a massive hurdle. In objective tasks like translation or summarization, we have ground truths to compare against. But what about creative writing? If I write a story with a tragic, ambiguous ending, is it “good”? One reader might call it “poignant and realistic,” while another dismisses it as “depressing and unsatisfying.” ...

2023-10 · 8 min · 1620 words
[Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation 🔗](https://arxiv.org/abs/2406.19760)

Unlocking Judicial Fairness - How LLMs and Legal Knowledge are Revolutionizing Case Retrieval

In the legal world, the concept of stare decisis—standing by things decided—is foundational. For judges and lawyers, sourcing relevant precedents is not just a research task; it is a critical requirement for upholding judicial fairness. If a judge cannot find a past case that mirrors the current one, the consistency of the law is at risk. However, finding that “needle in a haystack” is becoming increasingly difficult. Legal case retrieval is significantly different from typing a question into Google. The “queries” are often entire case documents describing a new situation, and the “documents” to be retrieved are lengthy, complex judgments from the past. These documents are filled with specialized jargon, intricate procedural details, and often contain multiple different crimes within a single text. ...

2024-06 · 10 min · 1983 words
[Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism 🔗](https://arxiv.org/abs/2311.01041)

The Art of Saying 'I Don't Know': How L2R Makes LLMs Reliable

If you have spent any time interacting with Large Language Models (LLMs) like ChatGPT or LLaMA, you have likely encountered a specific, frustrating behavior: the confident hallucination. You ask a question about a niche topic, a fictional character, or a specific medical condition, and the model responds with absolute certainty. It sounds plausible, the grammar is perfect, and the logic seems sound. But there is one problem—the facts are completely made up. ...

2023-11 · 9 min · 1873 words
[Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning 🔗](https://arxiv.org/abs/2406.12050)

Training for Depth: How Reflective Augmentation Teaches LLMs to Understand, Not Just Solve

If you have ever tutored a student in mathematics, you know there is a distinct difference between memorization and understanding. A student who memorizes might be able to solve a specific quadratic equation because they’ve seen that exact pattern fifty times. But if you ask them, “How would this change if the coefficient was negative?” or “Can you solve this using a different method?”, they crumble. They have the answer, but they don’t have the reasoning depth. ...

2024-06 · 7 min · 1470 words
[Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models 🔗](https://arxiv.org/abs/2410.20008)

Inside the Black Box: How Instruction Tuning Rewires LLM Layers

We have become accustomed to the “magic” of Large Language Models (LLMs). You type a prompt—whether it’s a request to translate a sentence, summarize a paragraph, or analyze the sentiment of a review—and the model obliges. But beneath the surface, what is actually happening inside the neural network? While we know how to train these models—feeding them massive datasets and refining them with instruction tuning—we often treat the resulting model as a “black box.” We know the input and the output, but the internal mechanics of where and when the model decides to perform a specific task remain largely a mystery. Does the model know it is performing a translation in the very first layer? Or does that realization happen right before the final output? ...

2024-10 · 8 min · 1694 words
[LawBench: Benchmarking Legal Knowledge of Large Language Models 🔗](https://arxiv.org/abs/2309.16289)

Can AI Replace Lawyers? Inside LawBench, the Ultimate Test for Chinese Legal LLMs

In the past few years, the headline “AI Passes the Bar Exam” has appeared in nearly every major tech publication. It is a compelling narrative: Large Language Models (LLMs) like GPT-4 have ingested so much information that they can technically qualify to practice law. But any practicing attorney will tell you that passing a standardized test and navigating the nuanced, high-stakes reality of the legal system are two very different things. ...

2023-09 · 10 min · 2072 words
[Latent Concept-based Explanation of NLP Models 🔗](https://arxiv.org/abs/2404.12545)

Beyond Highlighting Words: Unlocking NLP Black Boxes with Latent Concepts

Deep learning models, particularly Large Language Models (LLMs) like BERT, RoBERTa, and Llama, have achieved superhuman performance on a vast array of Natural Language Processing (NLP) tasks. Yet, despite their brilliance, they suffer from a significant flaw: they are “black boxes.” We feed them a sentence, and they spit out a prediction, but the internal reasoning process remains largely opaque. For students and researchers aiming to build safe and trustworthy AI, this opacity is a problem. How do we trust a model if we don’t know why it made a decision? ...

2024-04 · 9 min · 1873 words
[Large Language Models for Data Annotation and Synthesis: A Survey 🔗](https://arxiv.org/abs/2402.13446)

The End of Manual Labeling? How LLMs Are Revolutionizing Data Annotation

If you have ever trained a machine learning model, you know the pain. You have a brilliant architecture and a clear objective, but you hit the inevitable bottleneck: data. Specifically, labeled data. For years, the gold standard for getting high-quality labels was human annotation. Whether relying on expensive domain experts (like doctors labeling X-rays) or crowdsourcing platforms (like Amazon Mechanical Turk), the process is slow, costly, and often inconsistent. But we are currently witnessing a paradigm shift. The very models that consume data—Large Language Models (LLMs)—are now capable of producing it. ...

2024-02 · 8 min · 1530 words
[Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment 🔗](https://arxiv.org/abs/2408.12194)

Can LLMs Fix Search? A Deep Dive into Next-Gen Dense Retrieval

If you have used a search engine recently, you have likely benefited from dense retrieval. Unlike the search engines of the 90s that looked for exact keyword matches, modern systems try to understand the meaning behind your query. They turn your words into a list of numbers (a vector) and look for documents with similar vectors. For years, the backbone of this technology has been models like BERT and T5. They are excellent at what they do, but they have hit a ceiling. They struggle with long documents, they often fail when presented with data from a new domain (like legal or medical documents), and they require massive amounts of labeled data to train. ...
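
As a rough sketch of that "list of numbers" idea, the snippet below encodes a query and a few documents into vectors and ranks the documents by cosine similarity. The encoder name and the toy corpus are illustrative placeholders, not the setups assessed in the paper:

```python
# Minimal dense-retrieval sketch: encode texts into vectors, rank by similarity.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works here

docs = [
    "The court granted the motion to dismiss the case.",
    "Patients with type 2 diabetes often require insulin therapy.",
    "The new smartphone ships with 128 GB of storage.",
]
query = "medical treatment for diabetes"

# With normalized embeddings, a dot product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec          # one similarity score per document
best = int(np.argmax(scores))
print(docs[best], float(scores[best]))  # the diabetes document ranks first
```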

2024-08 · 8 min · 1606 words
[Large Language Models Know What is Key Visual Entity: An LLM-assisted Multimodal Retrieval for VQA 🔗](https://aclanthology.org/2024.emnlp-main.613.pdf)

LLMs as Visual Detectives: How Focusing on Key Entities Solves Complex Visual Questions

Imagine you are looking at a photograph of a bustling city street. In the background, there is a bus. A friend asks you, “What is the name of the bus company?” To answer, your eyes immediately filter out the pedestrians, the buildings, the traffic lights, and the clouds. You focus entirely on the logo printed on the side of the bus. For humans, this selective attention is instinctual. For Artificial Intelligence, specifically Visual Question Answering (VQA) systems, it is incredibly difficult. When presented with a complex image, traditional AI models often get “distracted” by the most prominent objects (like the pedestrians) rather than the specific detail required to answer the question (the logo). ...

8 min · 1614 words
[Large Language Models Can Self-Correct with Key Condition Verification 🔗](https://arxiv.org/abs/2405.14092)

Can LLMs Grade Their Own Homework? Unlocking Self-Correction with PROCO

We have all been there. You ask a Large Language Model (LLM) a complex question—perhaps a tricky math word problem or an obscure trivia query—and it confidently gives you an answer. It looks plausible. The reasoning seems sound. But then, upon closer inspection, you realize the answer is completely wrong. This phenomenon, often called hallucination or reasoning failure, is one of the biggest hurdles in deploying AI agents for high-stakes tasks. The natural solution seems to be: “Why don’t we just ask the model to double-check its work?” ...

2024-05 · 9 min · 1906 words
[PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners 🔗](https://arxiv.org/abs/2310.02469)

Can We Teach LLMs to Keep Secrets? Deep Dive into PrivacyMind

The explosion of Large Language Models (LLMs) has changed the landscape of artificial intelligence. We have moved past general-purpose chatbots to an era of specialized experts—models like BloombergGPT for finance or Med-PaLM for medicine. To create these experts, we take a general model (like LLaMA) and fine-tune it on domain-specific data. But here lies a critical problem: domain-specific data is often sensitive. To train a medical LLM effectively, you need real medical records. To train a financial advisor, you need real transaction histories. This data is riddled with Personally Identifiable Information (PII)—names, addresses, social security numbers, and specific medical conditions. ...

2023-10 · 9 min · 1826 words
[Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark 🔗](https://aclanthology.org/2024.emnlp-main.759.pdf)

Beyond the Board Exam: Why LLMs Struggle with Real-World Clinical Decisions

In the past year, headlines have been dominated by the impressive feats of Large Language Models (LLMs) in the medical field. We’ve seen reports of AI passing the United States Medical Licensing Examination (USMLE) with flying colors, performing on par with—or sometimes even better than—human experts on standardized tests. It is easy to look at these results and assume that we are on the brink of an AI revolution in daily clinical practice. ...

8 min · 1651 words
[Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks 🔗](https://arxiv.org/abs/2407.00869)

The Truth Hurts: How Exploiting the 'Involuntary Honesty' of LLMs Breaks Safety Guardrails

Lying is harder than telling the truth. To tell the truth, you simply recall a fact or perform a logical deduction. To tell a lie—specifically a convincing one—you must know the truth, deliberately suppress it, fabricate a plausible alternative, and ensure the fabrication maintains internal consistency. It is a complex cognitive task. We often assume that Large Language Models (LLMs) are masters of hallucination, capable of spinning wild tales or getting facts wrong. However, a fascinating new research paper titled “Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks” reveals a paradoxical weakness in these systems: they struggle to lie on purpose. ...

2024-07 · 9 min · 1814 words
[Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course 🔗](https://arxiv.org/abs/2407.05216)

Grading at Scale: What Happens When You Let GPT-4 Grade 1,000 Students?

Imagine being a teaching assistant for a university course. Now, imagine 1,028 students just submitted an essay assignment. Even if you spent only five minutes grading each one, that is over 85 hours of non-stop grading. This scalability bottleneck is one of the oldest problems in higher education. With the rise of Large Language Models (LLMs), a solution seems obvious: why not let the AI grade the assignments? We know models like GPT-4 are capable of sophisticated reasoning and analysis. However, moving this technology from a controlled experiment into a real-world classroom is fraught with challenges. How do students feel about being graded by a robot? Who pays for the API costs? And, perhaps most importantly, will students try to trick the AI into giving them a perfect score? ...

2024-07 · 9 min · 1800 words
[Language-to-Code Translation with a Single Labeled Example 🔗](https://aclanthology.org/2024.emnlp-main.462.pdf)

How to Teach LLMs to Code with Just One Example—A Deep Dive into ICIP

Imagine you have just released a new software library or a specialized database API. You want developers to be able to use it effortlessly, perhaps by simply typing natural language commands like “find all users who signed up yesterday” rather than writing complex SQL queries or function calls. To build a tool that translates English into your specific code, you typically face a massive hurdle: data. Training a model to understand a specific API usually requires thousands of pairs of natural language commands and their corresponding code snippets. Creating this dataset is expensive, time-consuming, and tedious. ...

9 min · 1905 words