[ANALOBENCH: Benchmarking the Identification of Abstract and Long-context Analogies 🔗](https://arxiv.org/abs/2402.12370)

Can AI Read Between the Lines? Benchmarking Abstract and Long-Context Analogies in LLMs

Isaac Newton once famously remarked, “If I have seen further, it is by standing on the shoulders of giants.” He wasn’t literally standing on people; he was using an analogy to describe how scientific progress is built upon previous discoveries. Analogical reasoning—the ability to recognize that Situation A is like Situation B because of a shared underlying structure—is a cornerstone of human cognition. It allows us to learn new concepts, solve problems creatively, and communicate complex ideas. We do this effortlessly. If I tell you a story about a tree that collapsed because its trunk was rotten, and then a story about a person who collapsed from burnout because they didn’t care for themselves, you instantly recognize the connection: internal neglect leading to external failure. ...

2024-02 · 8 min · 1566 words
[An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance 🔗](https://arxiv.org/abs/2404.01247)

Beyond Translation: Can AI Adapt Images for Different Cultures?

We have all heard the idiom: “An image speaks a thousand words.” It is a universal truth about the power of visual communication. But there is a caveat we rarely discuss: does everyone listen to that image in the same way? In our increasingly globalized world, we consume content from everywhere. A movie made in the US is watched in Japan; an educational worksheet created in India might be used in Nigeria. While we have become quite good at translating the text (the words) using Machine Translation, we often neglect the visuals. ...

2024-04 · 8 min · 1645 words
[An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records 🔗](https://arxiv.org/abs/2406.08958)

Unlocking the Black Box: How to Explain Medical AI Without Expensive Human Labels

In the high-stakes world of healthcare, accuracy is paramount. But when Artificial Intelligence (AI) enters the room, accuracy alone isn’t enough—trust is the currency that matters. Imagine a scenario: A machine learning model analyzes a patient’s discharge summary and predicts a specific medical code for billing and statistical tracking. The prediction is correct, but the doctor asks, “Why?” If the AI cannot point to the specific symptoms or procedures in the text that led to that decision, the doctor is unlikely to trust it. ...

2024-06 · 8 min · 1585 words
[An LLM Feature-based Framework for Dialogue Constructiveness Assessment 🔗](https://arxiv.org/abs/2406.14760)

Breaking the Black Box: A Hybrid Approach to Analyzing Dialogue Constructiveness

Have you ever read a comment thread on the internet and thought, “Wow, this is actually a productive conversation”? It’s rare. Most online disagreements devolve into shouting matches. But for researchers in Natural Language Processing (NLP) and social science, understanding what makes a conversation “constructive”—where participants open their minds, reach consensus, or simply disagree politely—is a massive, complex puzzle. To solve this, we generally have two toolkits. On one side, we have feature-based models. These are the “old reliable” tools (like logistic regression) that use specific, hand-crafted rules. They are easy to interpret—you know exactly why the model made a decision—but they often require expensive human-annotated data. ...

2024-06 · 9 min · 1793 words
[An L* Algorithm for Deterministic Weighted Regular Languages 🔗](https://arxiv.org/abs/2411.06228)

Unlocking the Black Box: A New Algorithm for Learning Deterministic Weighted Automata

In the world of computer science and Natural Language Processing (NLP), we are often faced with powerful “black box” models. We feed them an input, and they give us an output—often a probability score or a classification. But understanding how they arrived at that conclusion is notoriously difficult. This is the realm of automata extraction: the process of reverse-engineering a complex model into a simpler, interpretable Finite State Automaton (FSA). ...

2024-11 · 8 min · 1637 words
[An Experimental Analysis on Evaluating Patent Citations 🔗](https://aclanthology.org/2024.emnlp-main.23.pdf)

Predicting the Next Big Invention: How Graph Neural Networks Analyze Patent Citations

Innovation is the engine of the modern economy, and the patent system is its fuel. Every year, hundreds of thousands of patents are granted, representing billions of dollars in research and development. But here is the trillion-dollar question: Which of these patents actually matter? Most patents fade into obscurity, never generating significant value. A select few, however, become foundational technologies that define industries. Historically, identifying these “high-quality” patents has been a retrospective game. We know a patent is valuable after it has been cited hundreds of times by other inventors. But what if we could predict that impact before it happens, using nothing but the text of the patent itself? ...

9 min · 1714 words
[An Empirical Study of Multilingual Reasoning Distillation for Question Answering 🔗](https://aclanthology.org/2024.emnlp-main.442.pdf)

Can Wrong Answers Help Models Learn? A Deep Dive into Multilingual Reasoning Distillation

In the rapidly evolving world of artificial intelligence, Large Language Models (LLMs) like GPT-4 have set a high bar for performance. One of their most impressive features is the ability to perform Chain-of-Thought (CoT) reasoning—breaking down complex problems into step-by-step logical explanations before arriving at an answer. This capability has revolutionized how models handle math word problems, symbolic logic, and multi-step planning. However, there is a catch: this reasoning capability usually “emerges” only in massive models with billions of parameters. Smaller models, which are more cost-effective and faster to deploy, often struggle to reason through complex tasks independently. To solve this, researchers use a technique called distillation, where a large “teacher” model generates reasoning steps that a smaller “student” model learns to mimic. ...

8 min · 1665 words
[An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models 🔗](https://arxiv.org/abs/2411.06048)

Why Can't GPT-4o Tell Left from Right? A Deep Dive into Spatial Reasoning in LMMs

Imagine you are sitting at a dinner table. A friend asks, “Where is the salt?” You glance at the table and reply, “It’s just to the right of your glass.” This interaction seems effortless. It requires you to identify objects, understand the scene from your friend’s perspective, and articulate a spatial relationship. Now, imagine asking a state-of-the-art Artificial Intelligence the same question. You might expect a model that can pass the Bar Exam or write complex code to easily handle basic directions. However, recent research suggests otherwise. ...

2024-11 · 9 min · 1864 words
[An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making 🔗](https://arxiv.org/abs/2410.15168)

Democracy for AI: Why LLM Agents Need Better Voting Systems

Imagine a boardroom meeting. The attendees are not humans, but advanced Large Language Models (LLMs), each acting as an autonomous agent. They have been tasked with solving a complex medical diagnosis or debugging a massive software codebase. Each agent has reasoned through the problem and come up with a solution. But here lies the problem: they disagree. How does the group decide on the final answer? ...

2024-10 · 10 min · 1966 words
[An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification 🔗](https://arxiv.org/abs/2409.03203)

DiffusionCLS: Mastering Low-Resource Sentiment Analysis with Diffusion Models

In the era of Large Language Models (LLMs), it is easy to assume that Natural Language Processing (NLP) is a solved problem. We have models that can write poetry, code in Python, and summarize history books. However, a significant gap remains in the practical application of these models: Data Scarcity. When researchers or engineers move into specific domains—such as analyzing disaster tweets for emergency response, monitoring localized disease outbreaks, or handling low-resource languages—they often hit a wall. The massive, pre-trained models are “data-hungry.” They perform poorly when fine-tuned on tiny, imbalanced datasets. ...

2024-09 · 9 min · 1855 words
[An Audit on the Perspectives and Challenges of Hallucinations in NLP 🔗](https://arxiv.org/abs/2404.07461)

The Tower of Babel in AI: Why We Can’t Agree on What 'Hallucination' Means

If you have used ChatGPT, Gemini, or any modern Large Language Model (LLM) for any significant amount of time, you have likely encountered it: the moment the machine confidently asserts something that is factually untrue. It might invent a court case that never happened, attribute a quote to the wrong historical figure, or generate a URL that leads nowhere. We call this “hallucination.” It is widely considered the Achilles’ heel of modern Artificial Intelligence. ...

2024-04 · 8 min · 1666 words
[An Analysis of Multilingual FActScore 🔗](https://arxiv.org/abs/2406.19415)

Lost in Translation? Why Evaluating AI Factuality Across Languages is Harder Than You Think

The rise of Large Language Models (LLMs) like GPT-4 and Gemini has revolutionized how we interact with information. We ask complex questions, and these models generate fluent, human-like responses. However, there is a ghost in the machine: Hallucination. LLMs are notorious for confidently stating falsehoods as facts. For English speakers, we have developed sophisticated tools to catch these lies. One of the gold standards is FActScore (Fine-grained Atomic Evaluation of Factual Precision), a metric designed to break down long generated texts into individual facts and verify them. ...

2024-06 · 10 min · 2074 words
[Altogether: Image Captioning via Re-aligning Alt-text 🔗](https://arxiv.org/abs/2410.17251)

Stop Generating, Start Re-aligning: A Better Approach to Synthetic Image Captions

The quest for Artificial General Intelligence often feels like a hardware race—bigger clusters, more GPUs. But seasoned researchers know that the bottleneck is increasingly becoming data quality. To build AI agents that surpass average human intelligence, we need training data that encapsulates superhuman knowledge. In the realm of computer vision, specifically image captioning, we have a significant problem. Most existing training datasets consist of naive, generic descriptions. If you show a model a picture of a rare “common iguana,” a standard dataset might just label it “a lizard on a branch.” This offers minimal utility. ...

2024-10 · 6 min · 1270 words
[AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality 🔗](https://arxiv.org/abs/2410.10054)

Stop Guessing: Optimizing LoRA-MoE with AlphaLoRA

The landscape of Large Language Models (LLMs) is dominated by a single, crushing constraint: size. As models grow larger to become smarter, fine-tuning them for specific tasks becomes computationally prohibitive. To solve this, the community adopted Parameter-Efficient Fine-Tuning (PEFT) methods, with LoRA (Low-Rank Adaptation) being the standout star. LoRA freezes the massive pre-trained model and injects tiny, trainable adapters. It works wonders, but it has a ceiling. Because LoRA parameters are so few, they sometimes struggle to capture complex new behaviors. ...

2024-10 · 7 min · 1470 words
[Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions 🔗](https://aclanthology.org/2024.emnlp-main.164.pdf)

Solving the AI Tug-of-War: How Alignment-Enhanced Decoding Stops Jailbreaks

Large Language Models (LLMs) have become ubiquitous, acting as coding assistants, creative writers, and general-purpose chatbots. To make these models safe for public deployment, developers invest heavily in “alignment”—training the model to be helpful while strictly refusing to generate harmful content, such as instructions for illegal acts or hate speech. However, this safety layer is often thinner than we’d like to admit. Adversarial actors have developed “jailbreaks”—sophisticated prompting strategies that trick the model into bypassing its safety filters. If you’ve ever seen a prompt that asks a model to “roleplay as a villain who doesn’t care about rules,” you’ve seen a jailbreak attempt. ...

10 min · 1955 words
[Aligning Translation-Specific Understanding to General Understanding in Large Language Models 🔗](https://arxiv.org/abs/2401.05072)

Lost in Translation? How DUAT Aligns LLM Understanding for Better Machine Translation

Large Language Models (LLMs) like GPT-4 have revolutionized how we interact with text. We treat them as omniscient oracles—capable of answering complex questions, writing code, and summarizing novels. Naturally, we expect them to be exceptional translators. If an LLM knows who a specific celebrity is when asked in a Q&A format, it should surely be able to translate a sentence containing that celebrity’s name, right? ...

2024-01 · 8 min · 1631 words
[Aligning Large Language Models with Diverse Political Viewpoints 🔗](https://arxiv.org/abs/2406.14155)

Beyond the Bias: How to Teach AI to Speak with Diverse Political Voices

If you have ever asked a Large Language Model (LLM) like ChatGPT about a controversial political topic, you have likely encountered a very specific type of response. It might be a bland refusal to answer, a “both-sides” hedge that says nothing of substance, or—as recent research has increasingly shown—a response that subtly (or overtly) leans toward a specific socio-political worldview. Most off-the-shelf LLMs exhibit what researchers call “normative stances.” They tend to reflect the biases present in their training data or the specific “safety” tuning applied by their creators. Often, this results in models that exhibit progressive, liberal, and pro-environmental biases. While these are not inherently negative traits, they pose a problem for the utility of AI in a democratic society. If a voter uses an AI to understand the political landscape, but the AI can only speak in the voice of a liberal progressive, that voter is getting a distorted view of reality. ...

2024-06 · 10 min · 2012 words
[Aligning Language Models to Explicitly Handle Ambiguity 🔗](https://arxiv.org/abs/2404.11972)

Say What You Mean: Teaching LLMs to Ask Clarifying Questions Using Perceived Ambiguity

Imagine you ask a friend, “Who won the championship?” If your friend is a tennis fanatic, they might immediately say, “Novak Djokovic.” If they love golf, they might say, “Scottie Scheffler.” But if they know a little bit about everything, they will pause and ask you: “Which sport and which year are you talking about?” That pause is intelligence. It is the recognition of ambiguity. Large Language Models (LLMs) are notoriously bad at this pause. Trained to predict the next likely token, they often prioritize fluency over accuracy. When faced with a vague query like “Who won the championship?”, an LLM is statistically likely to pick the most popular entity in its training data and present it as an absolute fact. It falls into a “confidence trap,” hallucinating a specific answer to a general question. ...

2024-04 · 8 min · 1636 words
[AlignCap: Aligning Speech Emotion Captioning to Human Preferences 🔗](https://arxiv.org/abs/2410.19134)

Beyond Labels: Teaching AI to Caption Speech Emotions with AlignCap

Imagine a friend telling you, “I’m fine.” Depending on their tone, pitch, and speed, they could mean they are genuinely happy, indifferent, or potentially furious. For a long time, AI has treated speech emotion as a classification task—simply categorizing that audio clip into buckets like “Sad,” “Happy,” or “Angry.” But human emotion is rarely that simple. It is nuanced, mixed, and evolving. A simple label fails to capture the complexity of a voice that is “trembling with excitement” or “speaking quickly with a veiled tone of dissatisfaction.” ...

2024-10 · 8 min · 1686 words
[AGENTREVIEW: Exploring Peer Review Dynamics with LLM Agents 🔗](https://arxiv.org/abs/2406.12708)

Unmasking the Reviewer: How LLM Agents Are Simulating the Peer Review Process

If you are a student or researcher, you likely know the anxiety that comes after clicking the “Submit” button on a conference paper. For the next few months, your work enters a “black box.” Inside, anonymous reviewers judge your methods, debate your findings, and ultimately decide the fate of your research. Peer review is the cornerstone of scientific integrity, yet it is notoriously fraught with challenges. It suffers from high variance (the “luck of the draw” with reviewers), potential biases against novice authors, and the opaque motives of the reviewers themselves. We know these problems exist, but studying them scientifically is incredibly difficult. Privacy concerns prevent us from seeing who reviewed what, and the sheer number of variables—from the reviewer’s mood to the Area Chair’s leadership style—makes it nearly impossible to isolate specific causes for a rejection. ...

2024-06 · 8 min · 1549 words