[Extending Context Window of Large Language Models from a Distributional Perspective 🔗](https://arxiv.org/abs/2410.01490)

Breaking the Length Barrier: How Distributional Analysis Extends LLM Context Windows

Imagine reading a mystery novel, but by the time you reach the final chapter, you’ve completely forgotten the clues introduced in the first few pages. This is the reality for many Large Language Models (LLMs). While models like LLaMA-2 are powerful, they are often trained with a fixed “context window” (e.g., 4,000 tokens). Ask them to process a 10,000-token document, and they hit a wall. Retraining these massive models from scratch is too expensive, so researchers instead try to “stretch” a model’s existing capabilities to handle longer texts during inference. Common techniques such as Position Interpolation (PI) and YaRN have made great strides, but they often rely on heuristics or “gut feelings” about which parameters to tweak. ...

2024-10 · 7 min · 1445 words
[Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models 🔗](https://arxiv.org/abs/2410.08436)

Beyond Chain-of-Thought—Teaching LLMs to Build Structured Proof Graphs

In the world of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and Llama-3 have become akin to brilliant but occasionally unreliable students. Ask them a complex question, and they might give you the correct answer. However, if you ask them why they reached that conclusion, the explanation can sometimes be a hallucinated mess or a logical leap of faith. For casual conversation, this is tolerable. But for high-stakes domains—legal analysis, scientific discovery, or complex logical puzzles—we need more than just an answer. We need a proof. We need to see the intermediate steps, the evidence used, and the logical structure that connects the premises to the conclusion. ...

2024-10 · 10 min · 2035 words
[Exploring the Practicality of Generative Retrieval on Dynamic Corpora 🔗](https://arxiv.org/abs/2305.18952)

Can AI Search Engines Keep Up? Generative vs. Dual Encoder Retrieval in a Changing World

In the world of computer science research, benchmarks often rely on “static” data. We train a model on Wikipedia dumps from 2018, test it on questions about that data, and call it a day. But in the real world, information is fluid. News breaks, laws change, and new scientific discoveries are made every hour. A search engine that excels at retrieving history but fails to index today’s news is functionally useless. ...

2023-05 · 7 min · 1372 words
[Exploring the Learning Capabilities of Language Models using LEVERWORLDS 🔗](https://arxiv.org/abs/2410.00519)

Can LLMs Learn Physics? The Battle Between Transformers and Classical Statistics

In the current era of Artificial Intelligence, Large Language Models (LLMs) are often hailed as “general-purpose learners.” We’ve seen them write code, compose sonnets, and even pass bar exams. This versatility has led to a growing assumption: if you throw enough data at a Transformer, it can learn the underlying model of almost anything. But how true is this when we step away from language and move toward the physical world? Does an LLM actually “understand” the laws of physics governing a system, or is it just memorizing statistical correlations? ...

2024-10 · 8 min · 1684 words
[Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning Through Trap Problems 🔗](https://aclanthology.org/2024.emnlp-main.915.pdf)

Can AI Connect the Dots? Investigating Compositional Reasoning in LLMs with Trap Problems

If you ask a student to solve the equation \(x^2 + x = 3\), they might grab a piece of paper, use the quadratic formula, and give you a precise irrational number involving square roots. But if you change the question slightly to “Find the integer solution of the equation \(x^2 + x = 3\),” the student’s behavior changes. They will solve it, realize the result isn’t an integer, and correctly answer: “There are no integer solutions.” ...
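To make the trap concrete, here is the arithmetic the student (or model) has to carry out, worked through with the standard quadratic formula (this worked step is ours, not an excerpt from the paper):

\[
x^2 + x - 3 = 0 \quad\Longrightarrow\quad x = \frac{-1 \pm \sqrt{1^2 - 4 \cdot 1 \cdot (-3)}}{2 \cdot 1} = \frac{-1 \pm \sqrt{13}}{2} \approx 1.30 \text{ or } -2.30.
\]

Since \(\sqrt{13}\) is irrational, neither root is an integer, so the only correct answer to the “integer solution” version of the question is that no such solution exists.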

8 min · 1639 words
[Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors 🔗](https://aclanthology.org/2024.emnlp-main.88.pdf)

Beyond Redundancy: How ReBo Generates Diverse Visual Questions, Answers, and Distractors

In the rapidly evolving world of Large Vision-Language Models (LVLMs), the ability of an AI to look at an image and ask intelligent questions is just as important as its ability to answer them. We rely on massive datasets of “Visual Question Answering” (VQA) pairs to train these models. However, there is a bottleneck: creating high-quality, multiple-choice questions for images is labor-intensive for humans, and when machines try to do it, they often get stuck in a loop of redundancy. ...

9 min · 1904 words
[Exploring Space Efficiency in a Tree-based Linear Model for Extreme Multilabel Classification 🔗](https://arxiv.org/abs/2410.09554)

The Invisible Pruning: Why Tree Models Are Smaller Than You Think

Imagine you are building a search system for an online retailer with millions of products, or a tagging system for Wikipedia articles with hundreds of thousands of categories. This is the realm of Extreme Multi-label Classification (XMC). In these scenarios, you aren’t just choosing between “Cat” and “Dog.” You are selecting a relevant subset of labels from a massive universe of possibilities. To solve this, data scientists often rely on linear models because they are simple and fast. However, these models face a massive bottleneck: space. ...

2024-10 · 7 min · 1489 words
[Exploring Nested Named Entity Recognition with Large Language Models: Methods, Challenges, and Insights 🔗](https://aclanthology.org/2024.emnlp-main.492.pdf)

Peeling the Onion: Can LLMs Master Nested Named Entity Recognition?

Natural Language Processing (NLP) has seen a meteoric rise in capabilities thanks to Large Language Models (LLMs) like ChatGPT and Llama. We often see these models generating poetry, writing code, or summarizing emails with ease. However, when we apply them to rigorous Information Extraction (IE) tasks, the cracks begin to show. One of the most deceptive challenges in NLP is Named Entity Recognition (NER). While identifying a person or a location sounds simple, language is rarely flat. It is hierarchical. It is recursive. It is nested. ...

7 min · 1458 words
[Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation 🔗](https://arxiv.org/abs/2409.05224)

Less is More: Optimizing Multilingual Translation with Language-Specific Subspaces

The dream of a “universal translator”—a single AI model capable of fluently translating between hundreds of languages—is closer than ever. Models like NLLB (No Language Left Behind) and M2M-100 have demonstrated that massive, pre-trained transformers can handle a dizzying array of linguistic pairs. But there is a catch. These models are behemoths, often containing billions of parameters. Fine-tuning them for specific tasks or new data is computationally expensive and storage-heavy. Worse, there is a phenomenon known as “negative interference” or the “curse of multilinguality.” When you fine-tune a model to improve a low-resource language (like Zulu or Occitan), the model often forgets or degrades its performance on high-resource languages (like English or French). It’s a zero-sum game where languages fight for capacity within the neural network. ...

2024-09 · 8 min · 1621 words
[Exploring Intra and Inter-language Consistency in Embeddings with ICA 🔗](https://arxiv.org/abs/2406.12474)

Decoding the Universal Language of AI: Finding Consistent Meaning in Word Vectors

Imagine you are looking at the brain of an Artificial Intelligence. You ask it what the word “Argentina” means. Instead of showing you a map or a flag, the AI hands you a slip of paper with a list of numbers: [0.0088871, -0.02218, ...]. This is the fundamental challenge of word embeddings. In modern Natural Language Processing (NLP), words are converted into multi-dimensional vectors—coordinates in a massive geometric space. These vectors are incredibly useful for computers; they allow machines to calculate analogies (like “King - Man + Woman = Queen”) and understand relationships. However, for humans, they are unreadable. A single dimension in a 300-dimensional vector usually doesn’t mean anything specific on its own. It’s just a mathematical artifact. ...
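As a toy illustration of that analogy arithmetic (using made-up 3-dimensional vectors purely for demonstration; real embeddings, such as 300-dimensional GloVe vectors, are learned from corpora and are nowhere near this tidy):

```python
import numpy as np

# Invented toy "embeddings" -- illustrative only, not real word vectors.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King - Man + Woman" should land closest to "Queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda word: cosine(vectors[word], target))
print(best)  # -> queen
```

The arithmetic works geometrically, but, as the post notes, no single dimension of such a vector means anything readable on its own.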

2024-06 · 4 min · 1681 words
[Explicit, Implicit, and Scattered: Revisiting Event Extraction to Capture Complex Arguments 🔗](https://arxiv.org/abs/2410.03594)

Beyond the Highlighter: How Generative AI is Revolutionizing Event Extraction

Imagine you are a doctor reading a patient’s forum post on Reddit. The patient writes, “I haven’t taken my 12mg dose since Thursday… struggling with the shakes.” As a human, you immediately understand several things: the event (the patient is tapering off or withdrawing from medication); the implicit information (even though they never explicitly say “I stopped taking my meds,” the context implies a “Stop” event); and the scattered information (the dosage, 12mg, and the timing, since Thursday, are separated from the symptoms, the shakes, yet all belong to the same medical event). For years, Natural Language Processing (NLP) models have treated Event Extraction (EE) like a student with a highlighter. They look for specific, continuous spans of text to identify “Who,” “What,” and “When.” But as the example above shows, real-world communication—especially online discourse—is rarely that tidy. ...

2024-10 · 7 min · 1421 words
[Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering 🔗](https://arxiv.org/abs/2403.02966)

Making LLMs Honest - How Summarizing Knowledge Graphs Improves Question Answering

Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with information. They can write poetry, generate code, and answer complex questions. However, they suffer from a well-known flaw: hallucinations. Because their knowledge is “frozen” in their parameters from training, they often get facts wrong, especially obscure or evolving ones. To fix this, researchers and engineers typically use RAG (Retrieval-Augmented Generation). The idea is simple: look up the relevant information from an external source and feed it to the LLM. One of the most structured and reliable sources for this external data is a Knowledge Graph (KG). ...

2024-03 · 7 min · 1476 words
[Event Causality Identification with Synthetic Control 🔗](https://arxiv.org/abs/2509.18156)

Finding Parallel Universes in Text: How Synthetic Control Solves Event Causality

“She got a high-paying job because she graduated from a top university.” When we read a sentence like this, our brains instantly form a causal link. We assume the degree caused the job offer. But did it? Perhaps she was simply a brilliant coder, and she would have landed that job regardless of her alma mater. To determine if the degree was the true cause, we would ideally need to see a parallel universe: one where she didn’t go to that university but had the exact same skills and background, and observe if she still got the job. ...

2025-09 · 8 min · 1627 words
[Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection 🔗](https://arxiv.org/abs/2308.10819)

When Smart Models Act Dumb: Analyzing Prompt Injection in LLMs

Imagine you have hired a highly efficient, incredibly eager personal assistant. You hand them a stack of documents and say, “Summarize the financial report on page 5.” The assistant rushes off, reads the documents, and comes back. But instead of a summary, they say, “I have deleted all your calendar appointments, as requested.” Confused, you ask, “Why did you do that?” ...

2023-08 · 8 min · 1604 words
[Evaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding 🔗](https://aclanthology.org/2024.emnlp-main.545.pdf)

Are LLMs Actually Listening? The Challenge of Conversational Grounding

Have you ever had a conversation where you thought the other person understood you, only to realize ten minutes later they had absolutely no idea what you were talking about? In human communication, avoiding this requires a constant, subtle process of checking, clarifying, and confirming. This is called Conversational Grounding. We know Large Language Models (LLMs) like GPT-4 and Llama are fluent speakers. They can generate poetry, code, and essays. But are they good listeners? Do they truly build a shared understanding with a user, or are they just predicting the next likely word based on surface-level patterns? ...

7 min · 1467 words
[Evaluating n-Gram Novelty of Language Models Using RUSTY-DAWG 🔗](https://arxiv.org/abs/2406.13069)

Are LLMs Just Stochastic Parrots? Measuring Novelty with RUSTY-DAWG

In the era of Generative AI, a single question looms larger than perhaps any other: Do Large Language Models (LLMs) actually create new content, or are they just sophisticated copy-paste machines? This question isn’t just academic curiosity. It is the fulcrum upon which billion-dollar lawsuits (like The New York Times vs. OpenAI) and the future of copyright law pivot. If an LLM is merely regurgitating memorized chunks of text from its massive training corpus, the argument for “fair use” becomes shaky. Conversely, if models are synthesizing information to create truly novel sentences, they are fulfilling the promise of artificial intelligence. ...

2024-06 · 9 min · 1720 words
[Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models 🔗](https://arxiv.org/abs/2406.13556)

Do AI Models Get More Biased Over Time? Analyzing the Evolution of Social Biases on Social Media

In the rapidly evolving world of Natural Language Processing (NLP), we often view Large Language Models (LLMs) as static repositories of knowledge. We train them, we freeze them, and we use them. But the data fueling these models—particularly data scraped from social media platforms like X (formerly Twitter)—is anything but static. It is a living, breathing, and often turbulent stream of human consciousness. We know that social media usage is growing exponentially, with hundreds of millions of new users joining annually. We also know that these platforms can be breeding grounds for social biases. This raises a critical, uncomfortable question for the AI community: If we continuously train language models on an ever-growing stream of social media data, are we inadvertently amplifying social biases over time? ...

2024-06 · 8 min · 1576 words
[Evaluating Readability and Faithfulness of Concept-based Explanations 🔗](https://arxiv.org/abs/2404.18533)

Auditing the Auditors: How to Rigorously Measure AI Concept Explanations

In the rapidly evolving world of Large Language Models (LLMs), we face a “black box” problem. We know that these models process vast amounts of text and develop internal representations of the world, but understanding how they do it remains a significant challenge. When an LLM outputs a sentence about “computer security,” which specific neurons fired? Did the model understand the abstract concept of “security,” or was it just matching patterns? ...

2024-04 · 9 min · 1884 words
[Evaluating Psychological Safety of Large Language Models 🔗](https://arxiv.org/abs/2212.10529)

Psychoanalyzing the Machine: Do LLMs Have Dark Personality Traits?

In the 1960s, a computer scientist named Joseph Weizenbaum created ELIZA, a simple chatbot designed to mimic a psychotherapist. It didn’t understand language; it just matched patterns. Yet, users found themselves emotionally attached to it, pouring out their secrets. Fast forward sixty years, and we have Large Language Models (LLMs) like GPT-4 and Llama-2. These models are lightyears ahead of ELIZA, capable of reasoning, coding, and holding deeply nuanced conversations. ...

2022-12 · 8 min · 1537 words
[Evaluating Large Language Models via Linguistic Profiling 🔗](https://aclanthology.org/2024.emnlp-main.166.pdf)

Can LLMs Actually Write? Testing Linguistic Constraints Beyond Standard Benchmarks

We live in an era where Large Language Models (LLMs) like GPT-4, LLaMA, and Mistral are passing bar exams, solving complex mathematical proofs, and writing code. We judge them based on “leaderboards”—massive lists of benchmarks that test their reasoning capabilities, world knowledge, and problem-solving skills. But there is a fundamental question that often gets lost in the excitement over these high-level cognitive tasks: How good are these models at the basic mechanics of language? ...

8 min · 1649 words