[README++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment 🔗](https://arxiv.org/abs/2305.14463)

Breaking the Language Barrier: Inside README++, the New Standard for Multilingual Readability

Imagine you are learning a new language. You’ve mastered the basics, and you want to practice reading. You pick up a news article, but the vocabulary is too dense. You try a children’s book, but the grammar is surprisingly complex. This frustration is a common hurdle in second language acquisition, and it highlights a critical task in Natural Language Processing (NLP): Readability Assessment. Readability assessment is the automated process of determining how difficult a text is to comprehend. For decades, this field relied on simple statistics—counting syllables per word or words per sentence. Today, Large Language Models (LLMs) promise to revolutionize this by “reading” the text and understanding its semantic depth. ...
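
To make the "simple statistics" era concrete, here is a minimal sketch of the classic Flesch Reading Ease formula, which scores a text from surface counts alone. The vowel-group syllable counter is a rough heuristic added for illustration, not something from the paper.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of consecutive vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Classic formula: 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat. It was warm."))  # high score = easy
print(flesch_reading_ease("Comprehensive readability estimation necessitates multilingual corpora."))  # low score = hard
```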

2023-05 · 8 min · 1652 words
[Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding 🔗](https://arxiv.org/abs/2406.19263)

Tree-of-Lens: Teaching AI to Understand Screen Layouts Like Humans Do

Graphical User Interfaces (GUIs) are the visual language of the digital world. Whether you are scrolling through Instagram, organizing files on Windows, or shopping on a mobile app, you rely on a complex arrangement of icons, text, and spatial relationships to make sense of the screen. For human users, this process is intuitive. We see a “Checkout” button and immediately understand it belongs to the “Shopping Cart” panel because of its proximity and grouping. However, for Multimodal Large Language Models (MLLMs) and accessibility tools, this remains a significant challenge. While AI has become very good at describing an image in general terms, it still struggles with the specific task of Screen Point-and-Read (ScreenPR). ...

2024-06 · 7 min · 1370 words
[RECALL: Membership Inference via Relative Conditional Log-Likelihoods 🔗](https://arxiv.org/abs/2406.15968)

Detecting LLM Training Data: A Deep Dive into RECALL

Large Language Models (LLMs) like GPT-4 and Llama are trained on trillions of tokens sourced from the vast expanse of the internet. While this scale enables impressive capabilities, it also creates a “black box” problem. We rarely know exactly what data these models were trained on. This opacity raises serious questions: Did the model memorize copyrighted books? Does it contain Personally Identifiable Information (PII)? Has the test set for a benchmark been contaminated because the questions were included in the training data? ...

2024-06 · 9 min · 1805 words
[Re-Reading Improves Reasoning in Large Language Models 🔗](https://arxiv.org/abs/2309.06275)

Why LLMs Need to Read Twice: The Simple 'RE2' Strategy for Better Reasoning

When you encounter a particularly tricky math word problem or a convoluted logic puzzle, what is the first thing you do? If you are like most humans, you read it again. You scan the text, identify the core question, and then re-read the details to understand how they fit together. This simple cognitive strategy—re-reading—is fundamental to human comprehension. However, Large Language Models (LLMs) like GPT-4 or LLaMA typically don’t do this. They process text linearly, reading from left to right, token by token. Once they pass a word, they generally don’t “look back” in the same way a human does when reconsidering the context of a whole sentence. ...
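
The core trick is simple enough to sketch in a few lines: the prompt presents the question, then explicitly repeats it before the answer is generated. This is a minimal approximation of an RE2-style prompt; the exact wording used in the paper's experiments may differ.

```python
def re2_prompt(question: str, trigger: str = "Let's think step by step.") -> str:
    # RE2-style input: state the question, then re-state it before eliciting the answer.
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        f"A: {trigger}"
    )

print(re2_prompt(
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?"
))
```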

2023-09 · 8 min · 1550 words
[Re-ReST: Reflection-Reinforced Self-Training for Language Agents 🔗](https://arxiv.org/abs/2406.01495)

Can Agents Teach Themselves? Mastering Self-Training with Reflection

The landscape of Large Language Models (LLMs) has shifted rapidly from simple chatbots to Language Agents—systems capable of reasoning, planning, and interacting with external environments to solve complex tasks. Whether it’s browsing the web to answer multi-hop questions or writing code to pass unit tests, agents represent the next frontier of AI utility. However, building these agents presents a significant bottleneck: Data. To make a generic LLM (like Llama-2 or Llama-3) act as a competent agent, we typically need to fine-tune it on high-quality “trajectories”—step-by-step examples of reasoning and acting. Historically, there have been two ways to get this data: human annotation (slow and expensive) or “distillation,” where we ask a massive, proprietary model like GPT-4 to generate examples for the smaller model to learn from. ...
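
For readers unfamiliar with what a "trajectory" looks like as training data, here is a toy ReAct-style example. The field names and schema are illustrative assumptions, not the exact format used by Re-ReST.

```python
# Illustrative only: one agent trajectory recorded as thought/action/observation steps.
trajectory = {
    "task": "Which city hosted the 1992 Summer Olympics, and in which country is it?",
    "steps": [
        {
            "thought": "I should look up the host city of the 1992 Summer Olympics.",
            "action": "search[1992 Summer Olympics]",
            "observation": "The 1992 Summer Olympics were held in Barcelona.",
        },
        {
            "thought": "Barcelona is in Spain, so I can answer.",
            "action": "finish[Barcelona, Spain]",
            "observation": "",
        },
    ],
}

# Fine-tuning data is typically flattened into (context, next-step) pairs from records like this.
print(trajectory["steps"][0]["action"])
```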

2024-06 · 9 min · 1782 words
[Re-Evaluating Evaluation for Multilingual Summarization 🔗](https://aclanthology.org/2024.emnlp-main.1085.pdf)

Lost in Evaluation — Why We Can't Trust English Metrics for Multilingual AI

The field of Natural Language Processing (NLP) is currently witnessing a massive expansion in accessibility. We are no longer just building models for English speakers; with the release of multilingual Large Language Models (LLMs) like BLOOM, Llama 2, and Aya-23, the goal is to create AI that speaks the world’s languages. However, building these models is only half the battle. The other half is determining whether they actually work. ...

8 min · 1568 words
[Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training 🔗](https://arxiv.org/abs/2508.11393)

Unlocking the Black Box—How Self-Training Can Teach Transformers to Explain Themselves

Artificial Intelligence has a trust problem. As Deep Learning models, particularly Transformers, continue to dominate fields ranging from movie sentiment analysis to complex scientific classification, they have become increasingly accurate—and increasingly opaque. We often know what a model predicts, but rarely why. If a model flags a financial transaction as fraudulent or diagnoses a medical scan as positive, “because the computer said so” is no longer an acceptable justification. We need rationales—highlights of the input text that explain the decision. ...
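
A common way to represent such a rationale is a binary mask over the input tokens: the model's prediction should be explainable from the highlighted words alone. The toy review below is my own example, not one from the paper.

```python
# A rationale as a token-level mask: 1 = highlighted as evidence, 0 = not part of the explanation.
tokens = ["the", "acting", "was", "wooden", "and", "the", "plot", "was", "dull"]
mask   = [0,     1,        0,     1,        0,     0,     1,      0,     1]

rationale = [tok for tok, keep in zip(tokens, mask) if keep]
print(rationale)  # ['acting', 'wooden', 'plot', 'dull'] -> the evidence for a negative prediction
```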

2025-08 · 8 min · 1676 words
[Rationale-Aware Answer Verification by Pairwise Self-Evaluation 🔗](https://arxiv.org/abs/2410.04838)

Getting It Right for the Right Reasons: Improving LLM Verification with REPS

Imagine you are a professor grading a math exam. You come across a student who has written the correct final answer, “42,” but their working out involves subtracting apples from oranges and dividing by the color blue. Do you give them full marks? In the world of Large Language Models (LLMs), the answer has traditionally been “yes.” Current methods for training LLMs to verify answers often focus solely on the final output. If the model guesses the right answer, it gets a reward, regardless of the logical leaps or hallucinations it took to get there. This creates a dangerous “Clever Hans” effect where models learn to act correctly but think incorrectly. ...
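
The failure mode described above boils down to outcome-only supervision: the verifier's label depends solely on whether the final answer matches the reference, so a nonsensical rationale attached to a lucky "42" still gets full credit. A tiny sketch of that labeling rule (my own illustration, not the paper's code):

```python
def outcome_only_label(final_answer: str, rationale: str, reference: str) -> int:
    # The rationale argument is accepted but never inspected.
    return int(final_answer.strip() == reference.strip())

bad_reasoning = "Subtract the apples from the oranges, then divide by the color blue."
print(outcome_only_label("42", bad_reasoning, "42"))  # 1 -> rewarded despite the nonsense
```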

2024-10 · 10 min · 1969 words
[Ranking Manipulation for Conversational Search Engines 🔗](https://arxiv.org/abs/2406.03589)

SEO for LLMs - How to Hack Conversational Search Rankings

For two decades, the internet economy has revolved around a single, crucial concept: the ranked list. When you search for “best blender” on Google, an entire industry known as Search Engine Optimization (SEO) works tirelessly to ensure their product lands on the first page of links. But the paradigm is shifting. We are moving from Search Engines to Conversational Search Engines. Tools like Perplexity.ai, Google’s Search Generative Experience (SGE), and ChatGPT Search don’t just give you a list of blue links. They read the websites for you, synthesize the information, and produce a natural language recommendation. Instead of “Here are 10 links about blenders,” the output is “The Smeg 4-in-1 is the best choice because…” ...

2024-06 · 8 min · 1660 words
[RaTEScore: A Metric for Radiology Report Generation 🔗](https://arxiv.org/abs/2406.16845)

Beyond Word Matching: How RaTEScore Teaches AI to Grade Medical Reports Like a Doctor

Artificial Intelligence is rapidly transforming healthcare. We are moving toward a future where “Generalist Medical AI” can look at an X-ray, an MRI, or a CT scan and draft a diagnostic report in seconds. This promises to reduce burnout for radiologists and speed up patient care. However, there is a massive bottleneck in this revolution: Trust. If an AI writes a report, how do we know it’s accurate? If we have two different AI models, how do we know which one is better? In general Natural Language Processing (NLP), we verify text using standard metrics. But in medicine, a “standard” metric can be dangerous. If an AI writes “No pneumothorax” instead of “Pneumothorax,” it has only changed one word—a small error to a computer, but a life-threatening error to a patient. ...
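
To see why surface overlap fails here, consider a bare unigram F1 between the two findings. This is not RaTEScore itself, just an illustration of the gap such a metric is designed to close.

```python
def unigram_f1(candidate: str, reference: str) -> float:
    # Token-level F1: the kind of surface overlap many text-generation metrics reduce to.
    cand, ref = candidate.lower().split(), reference.lower().split()
    common = sum(min(cand.count(tok), ref.count(tok)) for tok in set(cand))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# One dropped negation flips the diagnosis, yet the overlap score barely moves.
print(unigram_f1("no pneumothorax is seen", "pneumothorax is seen"))  # ~0.86
```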

2024-06 · 8 min · 1662 words
[RWKV-CLIP: A Robust Vision-Language Representation Learner 🔗](https://arxiv.org/abs/2406.06973)

Beyond Transformers: How RWKV-CLIP Revolutionizes Vision-Language Models with RNN Efficiency

In the world of Artificial Intelligence, Contrastive Language-Image Pre-training (CLIP) was a watershed moment. By learning to associate images with their textual descriptions on a massive scale, CLIP enabled models to understand visual concepts with zero-shot capabilities that were previously unimaginable. If you show a standard computer vision model a picture of a specific breed of dog it wasn’t trained on, it fails. Show it to CLIP, and it understands. ...

2024-06 · 9 min · 1886 words
[RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models 🔗](https://arxiv.org/abs/2407.05131)

Fixing the Trust Issue: How RULE Makes Medical AI More Reliable

Artificial Intelligence is revolutionizing healthcare, particularly in the realm of medical imaging. We now have Medical Large Vision Language Models (Med-LVLMs) capable of looking at an X-ray or a retinal scan and answering clinical questions. However, there is a persistent “elephant in the room” with these models: Hallucinations. Even the best models sometimes generate medical responses that sound plausible but are factually incorrect. In a clinical setting, a factual error isn’t just a glitch—it’s a safety risk. ...

2024-07 · 7 min · 1356 words
[RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework 🔗](https://arxiv.org/abs/2410.19109)

How to Tame Your LLM: A Look at RSA-Control and Pragmatic Generation

We are living in the golden age of Large Language Models (LLMs). From GPT-4 to Llama, these models can write poetry, code, and essays with startling fluency. However, anyone who has spent time prompting these models knows they can be like unruly teenagers: talented but difficult to control. You might ask for a summary suitable for a five-year-old, and the model might return a text full of academic jargon. Worse, you might ask for a story, and the model might inadvertently produce toxic or biased content based on its training data. ...

2024-10 · 8 min · 1689 words
[RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs 🔗](https://arxiv.org/abs/2407.02552)

Breaking the Language Barrier: How RLOO Unlocks Multilingual Alignment

If you follow the rapid evolution of Large Language Models (LLMs), you are likely familiar with the “alignment” phase. After a model consumes terabytes of text to learn how to predict the next token (Pre-training) and learns to follow instructions (Supervised Fine-Tuning or SFT), it undergoes a final polish: Preference Optimization. This is the stage where models like ChatGPT or Claude learn to be helpful, harmless, and conversational, usually via techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). ...

2024-07 · 7 min · 1472 words
[RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets 🔗](https://aclanthology.org/2024.emnlp-main.1214.pdf)

Beyond One Object: Understanding RECANTFormer for Generalized Referring Expression Comprehension

Imagine you are asking a robot to help you in the kitchen. You say, “Pass me the red mug.” In a perfect world—or at least in the world of classic computer vision benchmarks—there is exactly one red mug on the counter. The robot identifies it, boxes it, and grabs it. But what if real life happens? What if the dishwasher is empty and there are no red mugs? Or what if you just had a party and there are three red mugs? ...

10 min · 2048 words
[REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering 🔗](https://arxiv.org/abs/2402.17497)

Trust Issues in AI: How REAR Teaches LLMs to Ignore Irrelevant Data

We often think of Large Language Models (LLMs) as vast repositories of knowledge, but they have a significant weakness: they cannot memorize everything, especially real-time events or niche domain knowledge. To solve this, the AI community widely adopted Retrieval-Augmented Generation (RAG). The concept is simple: when an LLM is asked a question, it first searches an external database (like Wikipedia) for relevant documents, then uses those documents to generate an answer. ...
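
Here is a minimal sketch of that vanilla retrieve-then-generate loop, the baseline that REAR builds on. `search_wikipedia` and `llm_generate` are hypothetical stand-ins, not a specific library's API.

```python
def search_wikipedia(query: str, k: int = 3) -> list[str]:
    # Hypothetical retriever stub: swap in a real search index or dense retriever.
    return [f"(retrieved passage {i + 1} for: {query})" for i in range(k)]

def llm_generate(prompt: str) -> str:
    # Hypothetical LLM stub: swap in a real model call.
    return "(answer generated from the retrieved passages)"

def rag_answer(question: str) -> str:
    passages = search_wikipedia(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)

print(rag_answer("Who won the 2022 FIFA World Cup?"))
```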

2024-02 · 10 min · 2062 words
[RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation 🔗](https://arxiv.org/abs/2406.05794)

RE-RAG: Giving RAG the Confidence to Say 'I Don't Know' (or 'I Know Better')

Retrieval-Augmented Generation (RAG) has become the backbone of modern AI knowledge systems. By combining a parametric memory (the weights of a Large Language Model) with non-parametric memory (an external database of documents), we can build systems that answer questions with up-to-date, specific information. But anyone who has built a RAG system knows the dirty secret: Retrievers are noisy. If a user asks a question and the retriever fetches irrelevant documents, the generator (the LLM) is placed in a difficult position. It might try to force an answer from the bad context, leading to hallucinations, or it might ignore the context entirely, leading to answers that lack citation. Standard retrievers provide a similarity score, but this score is relative—it tells you that Document A is better than Document B, but it doesn’t tell you if Document A is actually good. ...
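
A quick numeric illustration of that "relative, not absolute" problem, using toy similarity scores of my own rather than numbers from the paper:

```python
good_run = {"doc_A": 0.92, "doc_B": 0.55, "doc_C": 0.41}  # the top passage is probably on-topic
bad_run  = {"doc_A": 0.31, "doc_B": 0.22, "doc_C": 0.15}  # even the "best" passage is a weak match

for run in (good_run, bad_run):
    best = max(run, key=run.get)
    # The ranking picks doc_A both times. The score alone doesn't say whether doc_A is actually
    # good, which is exactly when the generator should be able to answer "I don't know."
    print(best, run[best])
```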

2024-06 · 8 min · 1691 words
[RAt: Injecting Implicit Bias for Text-To-Image Prompt Refinement Models 🔗](https://aclanthology.org/2024.emnlp-main.1144.pdf)

The Hidden Manipulator: How RAt Injects Implicit Bias into AI Art Generators

The rise of Text-to-Image (T2I) generation models like Stable Diffusion and Midjourney has revolutionized digital creativity. However, getting a high-quality image out of these models often requires “prompt engineering”—the art of crafting detailed, specific text descriptions. Because most users aren’t experts at writing these complex prompts, a new class of tools has emerged: Text-to-Image Prompt Refinement (T2I-Refine) services. These services take a user’s simple input (e.g., “a smart phone”) and expand it into a rich description (e.g., “a smart phone, intricate details, 8k resolution, cinematic lighting”). While helpful, this intermediate layer introduces a fascinating security and ethical question: Can a prompt refinement model be “poisoned” to secretly manipulate the output? ...

8 min · 1522 words
[RAR: Retrieval-augmented retrieval for code generation in low-resource languages 🔗](https://aclanthology.org/2024.emnlp-main.1199.pdf)

Cracking the Code for Low-Resource Languages: An Introduction to Retrieval-Augmented Retrieval (RAR)

If you have ever asked ChatGPT or GitHub Copilot to write a Python script or a JavaScript function, you know the results can be magically accurate. These models have been trained on billions of lines of code from popular languages, making them incredibly proficient at standard programming tasks. But what happens when you step off the beaten path? When you ask an LLM to generate code for low-resource languages—domain-specific languages (DSLs) like Microsoft Power Query M, OfficeScript, or complex Excel formulas—the performance drops significantly. These languages don’t have the massive repositories of open-source code required to train a model effectively. ...

8 min · 1638 words
[RAG-QA Arena: Evaluating Domain Robustness for Long-Form Retrieval-Augmented Question Answering 🔗](https://arxiv.org/abs/2407.13998)

Beyond Wikipedia: Benchmarking Long-Form RAG with LFRQA and RAG-QA Arena

Retrieval-Augmented Generation (RAG) has become the de facto architecture for building reliable AI systems. By connecting Large Language Models (LLMs) to external data sources, we give them a “memory” that is up-to-date and verifiable. However, a significant gap exists in how we evaluate these systems. Most current benchmarks rely on Wikipedia data and expect short, punchy answers (like “Paris” or “1984”). But in the real world, we use RAG to generate comprehensive reports, summarize financial trends, or explain complex biological mechanisms. When an LLM generates a nuanced, three-paragraph explanation, comparing it to a three-word ground truth using standard “exact match” metrics is like grading a history essay with a math answer key. ...
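
To make the "math answer key" complaint concrete: a SQuAD-style exact-match check gives a fully correct long-form answer a score of zero. A minimal illustration with my own toy example (real exact-match implementations add a bit more normalization, such as article removal):

```python
import re
import string

def exact_match(prediction: str, gold: str) -> bool:
    # SQuAD-style exact match after lowercasing and stripping punctuation.
    def normalize(s: str) -> str:
        return re.sub(f"[{re.escape(string.punctuation)}]", "", s.lower()).strip()
    return normalize(prediction) == normalize(gold)

gold = "Paris"
long_answer = ("The capital of France is Paris, which has been the country's political "
               "and cultural center for centuries.")

print(exact_match("Paris", gold))      # True
print(exact_match(long_answer, gold))  # False -> a correct, well-grounded answer scores zero
```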

2024-07 · 8 min · 1672 words