[MisinfoEval: Generative AI in the Era of 'Alternative Facts' 🔗](https://arxiv.org/abs/2410.09949)

Can AI Fix Fake News? Inside MisinfoEval and the Power of Personalized Fact-Checking

In the span of a single decade, the architecture of information consumption has fundamentally changed. We have moved from an era of curated news broadcasts to one of algorithmic “filter bubbles,” where social media feeds reinforce our existing beliefs and insulate us from opposing viewpoints. This environment has proven to be a fertile breeding ground for misinformation—sensational, often false stories that spread faster and farther than the truth. The consequences are not merely academic; they threaten democratic processes, public health, and economic stability. Traditionally, platforms have tried to combat this using what researchers call a “knowledge deficit” model. The assumption is simple: if you give people the facts, they will correct their views. Platforms apply “False” tags or link to Snopes articles, hoping that critical thinking will kick in. ...

2024-10 · 8 min · 1678 words
[MIRRORSTORIES: Reflecting Diversity through Personalized Narrative Generation with Large Language Models 🔗](https://arxiv.org/abs/2409.13935)

Can AI Write Your Life Story? How MIRRORSTORIES Is Personalizing Literature

Maya Angelou once wrote, “There is no greater agony than bearing an untold story inside you.” For millions of readers, this agony is compounded by a lack of representation. When you open a book, you are looking for a reflection—a character who looks like you, lives like you, and faces struggles you understand. These are called “mirror books.” They validate identity, foster belonging, and significantly improve reading engagement, especially in education. ...

2024-09 · 7 min · 1462 words
[MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction 🔗](https://arxiv.org/abs/2406.11234)

Less is More: How MiniConGTS Revolutionizes Sentiment Analysis with Minimalism and Contrastive Learning

In the world of Natural Language Processing (NLP), sentiment analysis has evolved far beyond simply classifying a movie review as “positive” or “negative.” Today, we deal with complex sentences where multiple opinions about different aspects coexist. Consider the sentence: “The food was delicious, but the service was terrible.” A simple “neutral” label would be misleading. We need to know what was good (food) and what was bad (service). ...
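
To make the goal concrete, here is a minimal illustrative sketch (in Python, not code from the paper) of the (aspect, opinion, sentiment) triplets an Aspect Sentiment Triplet Extraction system is expected to return for that sentence:

```python
# Illustrative only: the expected ASTE output for the example sentence above.
# Variable names are hypothetical; this is not the MiniConGTS implementation.
sentence = "The food was delicious, but the service was terrible."

# Aspect Sentiment Triplet Extraction returns one
# (aspect term, opinion term, sentiment polarity) triplet per opinion expressed.
expected_triplets = [
    ("food", "delicious", "positive"),
    ("service", "terrible", "negative"),
]

for aspect, opinion, polarity in expected_triplets:
    print(f"aspect={aspect!r}, opinion={opinion!r}, sentiment={polarity}")
```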

2024-06 · 8 min · 1569 words
[MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents 🔗](https://arxiv.org/abs/2404.10774)

MiniCheck: GPT-4 Level Fact-Checking for a Fraction of the Cost

Large Language Models (LLMs) have revolutionized how we interact with information, from summarizing complex reports to answering open-ended questions. However, they suffer from a persistent and well-known flaw: hallucination. An LLM can confidently generate a statement that sounds plausible but is factually incorrect. To mitigate this, the industry has largely adopted Retrieval-Augmented Generation (RAG). In a RAG setup, the model is provided with “grounding documents”—trusted sources of evidence—and asked to generate an answer based solely on that evidence. While this helps, it does not solve the problem entirely. Models can still misinterpret the documents, blend information incorrectly, or hallucinate details not found in the text. ...

2024-04 · 9 min · 1856 words
[MiddleWare for LLMs: Tools Are Instrumental for Language Agents in Complex Environments 🔗](https://arxiv.org/abs/2402.14672)

Why LLMs Need Middleware: Bridging the Gap Between Agents and Massive Data

Introduction We have entered an era where Large Language Models (LLMs) like GPT-4 possess a human-like mastery over text. They can summarize articles, write code, and chat fluently. However, the ambition of Artificial Intelligence researchers extends far beyond processing text. The ultimate goal is to create generalist agents: AI that can not only talk but act within the real world to solve complex tasks. Imagine asking an AI to “find the average revenue of all tech companies founded after 2010 based on our internal database” or “map the relationships between every Nobel Prize winner in Physics and their doctoral advisors.” ...

2024-02 · 9 min · 1744 words
[MiTTenS: A Dataset for Measuring Gender Mistranslation Harms 🔗](https://arxiv.org/abs/2401.06935)

Lost in Translation: How We Measure Gender Bias in the Age of Foundation Models

Imagine reading a story in Bengali about your aunt. The text says, “Sarah is my aunt. I really like her jokes.” You paste this into a translation tool to share it with an English-speaking friend. The output reads: “Sarah is my aunt. I really like his jokes.” In an instant, the identity of the subject is erased and replaced. While this might seem like a minor grammatical slip, these errors—known as gender mistranslations—can cause significant representational harm. They reinforce stereotypes (e.g., assuming all doctors are male) and can misgender individuals in sensitive contexts. ...

2024-01 · 8 min · 1627 words
[Metrics for What, Metrics for Whom: Assessing Actionability of Bias Evaluation Metrics in NLP 🔗](https://aclanthology.org/2024.emnlp-main.1207.pdf)

Bias Metrics in NLP Are Broken: Why Actionability Is the Missing Piece

Imagine you are a Machine Learning engineer responsible for deploying a large language model (LLM) for a hiring platform. You run a standard bias evaluation script, and it returns a score: 0.42. What do you do now? Is 0.42 good? Is it terrible? Does it mean the model hates women, or that it slightly prefers Western names? If you fix the data and the score drops to 0.38, is the model safe to deploy? ...

10 min · 2068 words
[Methods for Automatic Matrix Language Determination of Code-Switched Speech 🔗](https://arxiv.org/abs/2410.02521)

Decoding the Matrix: How AI Determines the Dominant Grammar in Code-Switched Speech

Imagine you are listening to a conversation in Singapore. You might hear a sentence like: “I thought all trains 都是 via Jurong East 去到 Pasir Ris.” To a monolingual speaker, this is chaos. To a bilingual speaker, it’s perfectly natural. This phenomenon is known as Code-Switching (CS)—the fluid alternation between two or more languages in a single conversation. ...

2024-10 · 10 min · 1964 words
[METAREFLECTION: Learning Instructions for Language Agents using Past Reflections 🔗](https://arxiv.org/abs/2405.13009)

How Agents Learn from Mistakes: Introducing METAREFLECTION

Imagine you are studying for a difficult history exam. You take a practice quiz and get a question wrong about the French Revolution. You don’t just look up the correct answer for that specific question; you realize you have a fundamental misunderstanding of the timeline. You write a note to yourself: “Always check the dates of events before determining causality.” This process—generalizing a specific failure into a rule for future success—is a hallmark of human learning. It is often referred to as accumulating semantic memory. ...

2024-05 · 7 min · 1457 words
[MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic 🔗](https://arxiv.org/abs/2406.11385)

Solving the LLM Merging Trilemma: A Deep Dive into MetaGPT

In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and LLaMA-2 have become the backbone of modern NLP. The standard workflow is familiar: take a massive pre-trained base model, then fine-tune it on a specific task—be it coding, mathematics, or creative writing. This yields high performance, but it creates a logistical nightmare. For every new capability, you need to deploy a separate, heavy model. Ideally, we want Multi-Task Learning (MTL): a single model that is proficient in everything. However, training a billion-parameter model on all tasks simultaneously is computationally prohibitive and often impossible due to data privacy constraints (many dataset owners won’t share their raw data). ...

2024-06 · 7 min · 1340 words
[Message Passing on Semantic-Anchor-Graphs for Fine-grained Emotion Representation Learning and Classification 🔗](https://aclanthology.org/2024.emnlp-main.162.pdf)

Anchoring Emotions: How SEAN-GNN Captures Subtle Feelings in Text

Humans are emotionally complex creatures. We don’t just feel “happy” or “sad.” We feel ecstatic, content, devastated, terrified, or apprehensive. In the field of Natural Language Processing (NLP), distinguishing between these subtle nuances is known as Fine-grained Emotion Classification (FEC). While standard sentiment analysis might be satisfied with labeling a sentence as “negative,” FEC aims to determine if that negativity stems from anger, fear, or sadness. This is incredibly difficult for machines because the difference often lies in the specific choice of vocabulary and the precise arrangement of words. ...

9 min · 1864 words
[Mentor-KD: Making Small Language Models Better Multi-step Reasoners 🔗](https://arxiv.org/abs/2410.09037)

Bridging the Gap: How a 'Mentor' Model Teaches Small AIs to Reason Like Giants

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 or Claude are the undisputed heavyweights. They possess an “emergent” ability known as Chain-of-Thought (CoT) reasoning—the capability to break down complex problems into step-by-step logical progressions to arrive at a correct answer. However, there is a catch. These reasoning abilities typically only emerge in models with hundreds of billions of parameters. Running these models requires massive computational resources or expensive API calls, making them impractical for deployment on local devices or in low-resource environments. ...

2024-10 · 8 min · 1595 words
[Memory-Efficient Fine-Tuning of Transformers via Token Selection 🔗](https://arxiv.org/abs/2501.18824)

TokenTune: Squeezing Large Language Models onto Smaller GPUs by Ignoring Tokens

The explosion of Large Language Models (LLMs) has democratized access to powerful AI, but customizing these models remains a hardware nightmare. While using a pre-trained model like Llama-2 or GPT-3 is relatively cheap, fine-tuning it—specializing it for medical data, code generation, or a specific writing style—requires massive computational resources. For instance, fine-tuning a 65-billion parameter model can require upwards of 780 GB of GPU memory. This effectively gates the ability to customize state-of-the-art models behind an enterprise-level paywall. ...

2025-01 · 8 min · 1615 words
[Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk 🔗](https://aclanthology.org/2024.emnlp-main.1169.pdf)

Memorize Step by Step: A Smarter Way to Handle Long Contexts in LLMs

The era of the “million-token context window” is here. With models like Claude 3 and Gemini 1.5, we are moving away from short prompts into the territory of processing entire books, codebases, and legal archives in a single go. But there is a catch: hardware hasn’t scaled as fast as our ambitions. Processing a sequence of 1 million tokens requires massive computational resources and GPU memory. If you try to feed a text that long into a standard GPU, you will almost certainly hit an “Out of Memory” (OOM) error before the model even generates the first word. ...

8 min · 1525 words
[MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification 🔗](https://arxiv.org/abs/2409.14703)

Decoding Memes: How MemeCLIP and PrideMM Are Changing Multimodal Content Moderation

In the digital age, memes are more than just funny pictures; they are a sophisticated language of their own. They can distill complex political opinions, social commentary, and cultural inside jokes into a single, shareable unit. However, this power has a dark side. Memes have become a potent vehicle for hate speech, cyberbullying, and disinformation, often hiding behind layers of irony and sarcasm that traditional content moderation systems struggle to parse. ...

2024-09 · 9 min · 1750 words
[Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? 🔗](https://arxiv.org/abs/2411.04118)

The Medical AI Mirage: Why Specialized Models Might Not Be Better Than General Ones

The intersection of Artificial Intelligence and medicine is currently one of the most exciting frontiers in technology. Every few months, we see a new headline announcing a “Medical LLM”—a specialized artificial intelligence tailored specifically for healthcare. The narrative is almost always the same: take a powerful general-purpose model (like Llama or Mistral), train it further on a massive library of medical textbooks and PubMed articles, and voilà: you have a digital doctor that outperforms its generalist predecessor. ...

2024-11 · 8 min · 1598 words
[Media Attitude Detection via Framing Analysis with Events and their Relations 🔗](https://aclanthology.org/2024.emnlp-main.954.pdf)

Beyond Word Choice: Detecting Media Bias Through Event Framing and Causal Narratives

In March 2024, Vladimir Putin won the Russian presidential election. If you read about this event in a state-backed Russian outlet, you likely encountered a narrative of “legitimacy,” “national unity,” and a “landslide victory.” If you read a Western outlet, the story was likely framed around “electoral fraud,” the “suppression of opponents,” and the ongoing war in Ukraine. The core facts—that an election happened and Putin won—are the same. The difference lies in the media attitude, or how the outlet feels about the event. ...

8 min · 1699 words
[MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations 🔗](https://arxiv.org/abs/2410.14204)

Why Medical AI Needs Better Data: Introducing MediTOD and the CMAS Ontology

Imagine a future where doctor burnout is significantly reduced, and patients have instant access to high-quality medical triage. This is the promise of Medical Task-Oriented Dialogue (TOD) systems. These AI agents aim to assist doctors by collecting patient medical history, aiding in diagnosis, and guiding treatment selection. However, if you have ever tried to build a medical chatbot, you likely ran into a massive wall: data. Specifically, the lack of high-quality, privacy-compliant, English-language datasets. While general-purpose chatbots have flourished thanks to massive internet scrapes, medical AI is starved for data due to strict privacy regulations (like HIPAA). Furthermore, the few datasets that do exist often oversimplify the complexity of medicine. They might capture that a patient has a fever, but fail to capture when it started, how it progressed, or what makes it better. ...

2024-10 · 10 min · 2092 words
[MEDREADME: A Systematic Study for Fine-grained Sentence Readability in Medical Domain 🔗](https://arxiv.org/abs/2405.02144)

Decoding Medical Jargon: How We Measure and Improve Readability in Healthcare

“If you can’t measure it, you can’t improve it.” This famous quote by Peter Drucker rings especially true in the world of medical communication. We live in an era where reliable medical knowledge is crucial for public health. From Wikipedia articles to the Merck Manuals, and from cutting-edge research papers to patient pamphlets, the dissemination of health information is constant. However, access to information does not equate to understanding. Medical texts are notoriously difficult to digest. They are dense, technical, and filled with specialized terminology that can alienate the very people they are meant to help—patients and non-experts. ...

2024-05 · 9 min · 1748 words
[MedCoT: Medical Chain of Thought via Hierarchical Expert 🔗](https://arxiv.org/abs/2412.13736)

Why Two Doctors Are Better Than One: Decoding MedCoT for Medical AI

Imagine visiting a doctor with a complex X-ray. You ask, “Is there a tumor?” and the doctor simply says “Yes” and walks out of the room. No explanation, no pointing to the shadow on the film, and no discussion of why they reached that conclusion. You would likely feel terrified and skeptical. Unfortunately, this is how many Artificial Intelligence systems in Medical Visual Question Answering (Med-VQA) currently operate. They ingest an image and a question, and they output a flat answer. While accuracy is important, in a clinical setting, the reasoning path—the “why”—is just as critical as the final “what.” Furthermore, relying on a single AI model is like relying on a single doctor who might be tired or biased; it lacks the robustness required for life-critical diagnostics. ...

2024-12 · 8 min · 1581 words