[PsyGUARD: An Automated System for Suicide Detection and Risk Assessment in Psychological Counseling 🔗](https://arxiv.org/abs/2409.20243)

Beyond Binary Detection: How PsyGUARD Revolutionizes Automated Suicide Risk Assessment

Suicide remains one of the most critical public health challenges globally. Every loss of life is a tragedy that ripples through families and communities. As mental health awareness grows, more individuals are turning to online counseling services for help. These platforms offer immediate, confidential support, breaking down barriers of time and space. However, as the volume of users increases, human counselors can become overwhelmed. This is where Artificial Intelligence (AI) steps in. For years, researchers have been developing automated systems to detect suicidal ideation in text. Yet, there is a significant flaw in the current landscape: most existing systems treat suicide detection as a simple binary problem, Suicidal or Non-Suicidal. ...

2024-09 · 7 min · 1420 words
[PsFuture: A Pseudo-Future-based Zero-Shot Adaptive Policy for Simultaneous Machine Translation 🔗](https://arxiv.org/abs/2410.04075)

Faking the Future: How PsFuture Brings Zero-Shot Adaptivity to Simultaneous Translation

Imagine the high-pressure job of a simultaneous interpreter at the United Nations. They listen to a speech in one language and must translate it into another in real-time. If they wait too long to hear the full sentence, they fall behind (high latency). If they translate too soon, they might guess wrong and make a mistake (low quality). They must constantly decide: Do I speak now, or do I listen for one more word? ...

2024-10 · 8 min · 1602 words
[Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging 🔗](https://arxiv.org/abs/2406.16330)

Don't Just Prune, Merge: How Manifold Learning is Revolutionizing LLM Compression

The race for larger, more capable Large Language Models (LLMs) like Llama-3 and Mistral has led to incredible breakthroughs in artificial intelligence. However, this progress comes with a massive cost. As these models scale to billions of parameters, they become increasingly difficult to deploy in resource-limited environments. Running a 70-billion-parameter model on a consumer-grade GPU (or worse, a mobile device) is often a non-starter due to memory and energy constraints. ...

2024-06 · 10 min · 1942 words
[Prove Your Point!: Bringing Proof-Enhancement Principles to Argumentative Essay Generation 🔗](https://arxiv.org/abs/2410.22642)

Beyond Text Generation: Teaching AI to Argue Logically with PESA

Have you ever asked an AI to write an essay on a controversial topic? Often, the result looks impressive at first glance. The grammar is perfect, the vocabulary is sophisticated, and the structure seems sound. But if you look closer, cracks begin to appear. The AI might make a bold claim in the first sentence, only to provide evidence that contradicts it three sentences later. Or, it might list facts that are technically true but irrelevant to the argument at hand. ...

2024-10 · 8 min · 1678 words
[Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing? 🔗](https://arxiv.org/abs/2407.01119)

Man vs. Machine: The First True Creative Writing Duel Between GPT-4 and a World-Class Novelist

In the history of Artificial Intelligence, we mark progress by the fallen champions of humanity. We remember when Deep Blue defeated Garry Kasparov at chess. We remember when AlphaGo stunned Lee Sedol. These were pivotal moments where machines proved they could out-calculate the best human minds in closed systems of logic and strategy. But art is not a closed system. For years, we have comforted ourselves with the idea that while machines crunch numbers, humans create meaning. However, the rise of Large Language Models (LLMs) like GPT-4 has brought an uneasy question to the surface: Are we losing the creative frontier, too? We know AI can write competent emails and passable high school essays. We know it can outperform the average human writer. ...

2024-07 · 10 min · 1960 words
[PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval 🔗](https://arxiv.org/abs/2404.18424)

PromptReps: How to Turn LLMs into Retrievers Without Training

In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) like GPT-4 and Llama-3 have become the de facto standard for generating text, writing code, and answering questions. Their ability to understand context is unparalleled. However, a significant challenge remains: how do we use these generative giants to effectively find information within massive datasets without breaking the bank? Traditionally, utilizing LLMs for Information Retrieval (IR) has fallen into two distinct but imperfect camps. First, there are prompt-based re-ranking methods. In this scenario, you retrieve a small set of documents using a simple keyword search and then ask the LLM, “Is this document relevant to the user’s query?” While this yields high accuracy, it is computationally excruciating. Imagine running a massive model like GPT-4 hundreds of times for every single search query; it is too slow and costly for real-time applications. ...

2024-04 · 10 min · 1951 words
[PROMETHEUS 2: An Open Source Language Model Specialized in Evaluating Other Language Models 🔗](https://arxiv.org/abs/2405.01535)

The Judge We Need: How PROMETHEUS 2 Merges Skills to Rival GPT-4 in Evaluation

The explosion of Large Language Models (LLMs) has created a peculiar bottleneck in artificial intelligence. We have models that can write poetry, code, and legal briefs, but we are running out of reliable ways to grade them. Historically, humans were the judges. But humans are slow, expensive, and often inconsistent. To solve this, the industry shifted toward the “LLM-as-a-Judge” paradigm, where powerful proprietary models like GPT-4 evaluate the outputs of smaller models. It works well, but it introduces new problems: high costs, lack of transparency (closed source), and data privacy concerns. ...

2024-05 · 8 min · 1494 words
[Private Language Models via Truncated Laplacian Mechanism 🔗](https://arxiv.org/abs/2410.08027)

Keeping Secrets in High Dimensions - A New Approach to Private Word Embeddings

Natural Language Processing (NLP) has become deeply embedded in our daily lives, from the predictive text on your smartphone to the large language models (LLMs) analyzing medical records. However, these models have a tendency to be a bit too good at remembering things. They often memorize specific details from their training data, which leads to a critical problem: privacy leakage. If a model is trained on sensitive emails or clinical notes, there is a risk that an attacker could extract that private information. ...

2024-10 · 9 min · 1843 words
[Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality 🔗](https://arxiv.org/abs/2410.05210)

Can AI Understand Grammar? Improving Compositionality in VLMs Without Breaking Them

Humans possess an innate ability to understand the world through multiple senses. We effortlessly combine visual cues with language to interpret complex scenes. If you see a picture of “a horse riding a man,” you immediately recognize the absurdity and distinguish it from “a man riding a horse.” This ability to understand how different components (objects, attributes, relations) combine to form meaning is called compositional reasoning. In the world of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have revolutionized how computers understand images and text. They are fantastic at recognizing objects and matching images to captions in a general sense. However, they have a “bag-of-words” problem. To a standard VLM, “a horse riding a man” and “a man riding a horse” look mathematically almost identical because they contain the same words. ...

2024-10 · 8 min · 1651 words
[Preserving Generalization of Language Models in Few-shot Continual Relation Extraction 🔗](https://arxiv.org/abs/2410.00334)

Don't Lose Your Head: How Keeping the Language Model Head Solves Catastrophic Forgetting

Imagine learning how to ride a bicycle. Now, imagine that learning to ride that bike caused you to immediately forget how to walk. This absurdity is a reality for many Artificial Intelligence models. This phenomenon, known as Catastrophic Forgetting, is a major hurdle in the field of Continual Learning (CL), where models must learn a sequence of tasks without erasing their prior knowledge. This problem becomes even harder when you don’t have much data to learn from—a scenario called Few-shot Continual Relation Extraction (FCRE). Here, a model must identify relationships in text (e.g., “Person A is the mother of Person B”) based on just a handful of examples, all while handling new relationships that appear over time. ...

2024-10 · 7 min · 1437 words
[Preference-Guided Reflective Sampling for Aligning Language Models 🔗](https://arxiv.org/abs/2408.12163)

Beyond Random Guessing: How Preference-Guided Reflective Sampling Aligns LLMs

Imagine you are a professor asking a student to write an essay. If the student writes a single draft and hands it in immediately, the quality might be decent, but it likely misses some nuance. Now, imagine you ask the student to write a draft, read it over, critique their own work based on specific criteria (like “be more concise” or “add references”), and then write a final version. The result is almost guaranteed to be better. ...

2024-08 · 9 min · 1771 words
[Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model 🔗](https://arxiv.org/abs/2408.10764)

Otter: Taming LLMs with Non-Disruptive Parameter Insertion

Large Language Models (LLMs) are undeniably impressive. They can write poetry, debug code, and summarize history. However, anyone who has worked with them extensively knows they are not without their flaws. They can hallucinate, produce toxic content, or fail at complex reasoning tasks. To fix this, we generally have two options: finetuning the model (which risks “catastrophic forgetting,” where the model loses its original knowledge) or inference intervention. Inference intervention involves using a separate, smaller “calibration” model (often a Reward Model) to guide the main LLM during text generation. ...

2024-08 · 7 min · 1469 words
[Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement 🔗](https://arxiv.org/abs/2403.16184)

Taming the Bias: How to Successfully Integrate Vision-Language Models into Scene Graph Generation

Imagine walking into a messy living room. You don’t just see a “sofa,” a “cat,” and a “remote.” You instantly understand the web of connections: the cat is sleeping on the sofa, the remote is under the cushion, and the painting is hanging on the wall. This structured understanding of objects and their relationships is what computer vision researchers call a Scene Graph. Scene Graph Generation (SGG) is a pivotal task in AI, bridging the gap between raw pixel data and high-level language description. It transforms an image into a structured graph where nodes are objects and edges are relationships (predicates). This structure is essential for downstream tasks like robotic navigation (“robot, pick up the cup on the table”) or assisting the visually impaired. ...

2024-03 · 14 min · 2865 words
[Precise Model Benchmarking with Only a Few Observations 🔗](https://arxiv.org/abs/2410.05222)

How Good is Your LLM on Niche Topics? Solving the Small Sample Size Problem with Empirical Bayes

In the era of Large Language Models (LLMs), we are obsessed with benchmarks. We look at massive leaderboards and see that a model achieves “85% accuracy on MMLU” or “90% on HellaSwag.” These aggregate numbers give us a general sense of capability, but they often hide a critical problem: models are not equally good at everything. A practitioner often cares about specific, granular topics. You might not care about general “Law,” but you care deeply about “Intellectual Property Law in the 19th Century.” The problem is data scarcity. While we have thousands of questions for broad categories, niche subgroups might only have ten or twenty examples available for testing. ...

2024-10 · 9 min · 1728 words
[PREALIGN: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment 🔗](https://arxiv.org/abs/2407.16222)

PreAlign: Teaching LLMs to Translate Before They Learn to Read

Large Language Models (LLMs) like LLaMA and GPT-4 have transformed how we interact with technology. While these models are technically multilingual, there is a catch: they are predominantly trained on English text. They often treat other languages as second-class citizens, picking them up spontaneously rather than systematically. This results in a phenomenon known as weak cross-lingual alignment. An LLM might know a fact in English (e.g., “The piano was invented in Italy”) but fail to recall that same fact when queried in Chinese or Russian. The knowledge is “stuck” in the English part of the model’s brain. ...

2024-07 · 7 min · 1465 words
[Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation 🔗](https://arxiv.org/abs/2311.16201)

Why Your LLM Can't Draw: The Limits of Pre-training in Auto-Regressive Image Generation

If you have been following the explosion of AI-generated imagery over the last few years, you likely know the big names: DALL-E, Midjourney, Stable Diffusion. What you might not know is that under the hood, there is a fundamental split in how these models work. On one side, we have Diffusion Models (like Stable Diffusion and DALL-E 2/3). These work by removing noise from a chaotic image to reveal a clear picture. On the other side, we have Auto-Regressive Models (like the original DALL-E and Google’s Parti). These treat images like language: they break an image into a sequence of “tokens” and predict them one by one, just like ChatGPT predicts the next word in a sentence. ...

2023-11 · 10 min · 1964 words
[Pragmatic Norms Are All You Need - Why The Symbol Grounding Problem Does Not Apply to LLMs 🔗](https://aclanthology.org/2024.emnlp-main.651.pdf)

Meaning Without Objects: Why LLMs Don't Need to See a Dog to Know What 'Dog' Means

In the last few years, the field of Natural Language Processing (NLP) has experienced a seismic shift. We have moved from systems that struggle to construct a coherent sentence to Large Language Models (LLMs) like GPT-4, which can pass the Uniform Bar Exam in the 90th percentile. This performance creates a cognitive dissonance for researchers and students alike. On one hand, these models generate text that appears deeply knowledgeable, reasoned, and coherent. On the other hand, we know they are, at their core, statistical engines predicting the next token in a sequence. They have never seen a sunset, felt a “stick,” or petted a “dog.” ...

11 min · 2235 words
[PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation 🔗](https://arxiv.org/abs/2406.18528)

PrExMe: Unlocking the Secrets of Prompt Engineering for LLM-Based Evaluation

In the rapidly evolving world of Natural Language Processing (NLP), we have reached a point where we are using Artificial Intelligence to evaluate Artificial Intelligence. Large Language Models (LLMs) have become so capable that researchers now use them as “judges” to grade the quality of machine translation (MT) and text summarization. This is known as LLM-based evaluation. However, using an LLM as a judge introduces a new variable: the prompt. If you ask ChatGPT to “grade this translation,” you might get a different score than if you ask it to “act as an expert translator and critique this text.” This variability creates a problem for scientific rigor. Which prompt is the “correct” one? Do different models require different prompting strategies? ...

2024-06 · 8 min · 1609 words
[POSTMARK: A Robust Blackbox Watermark for Large Language Models 🔗](https://arxiv.org/abs/2406.14517)

Can We Watermark AI Text Without Model Access? Deep Dive into POSTMARK

Large Language Models (LLMs) are reshaping the internet. From generating news articles to writing code, the volume of machine-generated content is exploding. But this capability comes with a shadow side: hallucinations, bias, and the potential for mass-produced disinformation. If the web becomes flooded with AI-generated text, how can we trust what we read? Furthermore, if future AI models are trained on the output of today’s AI, we risk a feedback loop of degrading quality. ...

2024-06 · 9 min · 1884 words
[Position Engineering: Boosting Large Language Models through Positional Information Manipulation 🔗](https://arxiv.org/abs/2404.11216)

Beyond Prompt Engineering: How "Ghost Tokens" Unlock LLM Potential

If you have spent any time working with Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the dark art of Prompt Engineering. We spend hours tweaking phrases, adding “Let’s think step by step,” or restructuring paragraphs just to get the model to output the correct answer. It is a process that feels less like engineering and more like casting spells—change one word, and the magic works; change another, and it fails. ...

2024-04 · 8 min · 1702 words