[Private Language Models via Truncated Laplacian Mechanism 🔗](https://arxiv.org/abs/2410.08027)

Keeping Secrets in High Dimensions - A New Approach to Private Word Embeddings

Natural Language Processing (NLP) has become deeply embedded in our daily lives, from the predictive text on your smartphone to the large language models (LLMs) analyzing medical records. However, these models have a tendency to be a bit too good at remembering things. They often memorize specific details from their training data, which leads to a critical problem: privacy leakage. If a model is trained on sensitive emails or clinical notes, there is a risk that an attacker could extract that private information. ...
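
The paper's title names the tool: the truncated Laplacian mechanism, which perturbs embedding vectors with noise that is bounded rather than heavy-tailed. As a rough intuition only, here is a minimal NumPy sketch of adding truncated Laplace noise to an embedding; the scale, truncation bound, and rejection-sampling approach are illustrative assumptions, not the paper's calibrated mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_laplace_noise(shape, scale=0.1, bound=0.5):
    """Sample Laplace noise restricted to [-bound, bound].

    Illustrative only: the paper calibrates scale and truncation to a formal
    privacy guarantee; the numbers here are placeholders.
    """
    noise = rng.laplace(loc=0.0, scale=scale, size=shape)
    # Resample out-of-range entries until everything lies inside the interval.
    mask = np.abs(noise) > bound
    while mask.any():
        noise[mask] = rng.laplace(loc=0.0, scale=scale, size=mask.sum())
        mask = np.abs(noise) > bound
    return noise

embedding = rng.normal(size=300)          # stand-in for a word embedding
private = embedding + truncated_laplace_noise(embedding.shape)
print(np.abs(private - embedding).max())  # never exceeds the truncation bound
```

Bounding the noise is the point of truncation: no coordinate of the embedding can be pushed arbitrarily far from its original value, which helps keep high-dimensional embeddings usable after perturbation.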

2024-10 · 9 min · 1843 words
[Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality 🔗](https://arxiv.org/abs/2410.05210)

Can AI Understand Grammar? Improving Compositionality in VLMs Without Breaking Them

Introduction Humans possess an innate ability to understand the world through multiple senses. We effortlessly combine visual cues with language to interpret complex scenes. If you see a picture of “a horse riding a man,” you immediately recognize the absurdity and distinguish it from “a man riding a horse.” This ability to understand how different components (objects, attributes, relations) combine to form meaning is called compositional reasoning. In the world of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have revolutionized how computers understand images and text. They are fantastic at recognizing objects and matching images to captions in a general sense. However, they have a “bag-of-words” problem. To a standard VLM, “a horse riding a man” and “a man riding a horse” look mathematically almost identical because they contain the same words. ...
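
You can see the "bag-of-words" problem directly by comparing CLIP's text embeddings for the two captions. Below is a minimal probe using a Hugging Face transformers CLIP checkpoint (the specific checkpoint is an arbitrary choice, and the snippet is a quick sanity check rather than anything from the paper):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model_id = "openai/clip-vit-base-patch32"   # assumed checkpoint; any CLIP variant works
model = CLIPModel.from_pretrained(model_id)
tokenizer = CLIPTokenizer.from_pretrained(model_id)

captions = ["a man riding a horse", "a horse riding a man"]
inputs = tokenizer(captions, padding=True, return_tensors="pt")

with torch.no_grad():
    emb = model.get_text_features(**inputs)

emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize for cosine similarity
print(f"cosine similarity: {(emb[0] @ emb[1]).item():.3f}")
```

If the similarity comes out close to 1, the text encoder is treating the caption as an unordered bag of words, which is exactly the failure mode the paper targets.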

2024-10 · 8 min · 1651 words
[Preserving Generalization of Language Models in Few-shot Continual Relation Extraction 🔗](https://arxiv.org/abs/2410.00334)

Don't Lose Your Head: How Keeping the Language Model Head Solves Catastrophic Forgetting

Imagine learning how to ride a bicycle. Now, imagine that learning to ride that bike caused you to immediately forget how to walk. This absurdity is a reality for many Artificial Intelligence models. This phenomenon, known as Catastrophic Forgetting, is a major hurdle in the field of Continual Learning (CL), where models must learn a sequence of tasks without erasing their prior knowledge. This problem becomes even harder when you don’t have much data to learn from—a scenario called Few-shot Continual Relation Extraction (FCRE). Here, a model must identify relationships in text (e.g., “Person A is the mother of Person B”) based on just a handful of examples, all while handling new relationships that appear over time. ...

2024-10 · 7 min · 1437 words
[Preference-Guided Reflective Sampling for Aligning Language Models 🔗](https://arxiv.org/abs/2408.12163)

Beyond Random Guessing: How Preference-Guided Reflective Sampling Aligns LLMs

Introduction Imagine you are a professor asking a student to write an essay. If the student writes a single draft and hands it in immediately, the quality might be decent, but it likely misses some nuance. Now, imagine you ask the student to write a draft, read it over, critique their own work based on specific criteria (like “be more concise” or “add references”), and then write a final version. The result is almost guaranteed to be better. ...

2024-08 · 9 min · 1771 words
[Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model 🔗](https://arxiv.org/abs/2408.10764)

Otter: Taming LLMs with Non-Disruptive Parameter Insertion

Large Language Models (LLMs) are undeniably impressive. They can write poetry, debug code, and summarize history. However, anyone who has worked with them extensively knows they are not without their flaws. They can hallucinate, produce toxic content, or fail at complex reasoning tasks. To fix this, we generally have two options: finetuning the model (which risks “catastrophic forgetting,” where the model loses its original knowledge) or inference intervention. Inference intervention involves using a separate, smaller “calibration” model (often a Reward Model) to guide the main LLM during text generation. ...
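
Otter's contribution is to fold the reward prediction into the LLM itself via inserted parameters; for contrast, here is a toy sketch of the conventional inference-intervention baseline it streamlines, where a separate reward model reranks sampled candidates (both functions below are stand-ins, not real models):

```python
def generate_candidates(prompt: str, n: int = 3) -> list[str]:
    # Stand-in for sampling n continuations from the base LLM.
    return [f"{prompt} [sampled continuation {i}]" for i in range(n)]

def reward_model(text: str) -> float:
    # Stand-in for a separate calibration/reward model's scalar score.
    return float(len(set(text.split())))  # placeholder: rewards lexical variety

def best_of_n(prompt: str, n: int = 3) -> str:
    # Classic inference intervention: sample several candidates, keep the best-scored one.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=reward_model)

print(best_of_n("Explain quantum entanglement in one sentence."))
```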

2024-08 · 7 min · 1469 words
[Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement 🔗](https://arxiv.org/abs/2403.16184)

Taming the Bias: How to Successfully Integrate Vision-Language Models into Scene Graph Generation

Introduction Imagine walking into a messy living room. You don’t just see a “sofa,” a “cat,” and a “remote.” You instantly understand the web of connections: the cat is sleeping on the sofa, the remote is under the cushion, and the painting is hanging on the wall. This structured understanding of objects and their relationships is what computer vision researchers call a Scene Graph. Scene Graph Generation (SGG) is a pivotal task in AI, bridging the gap between raw pixel data and high-level language description. It transforms an image into a structured graph where nodes are objects and edges are relationships (predicates). This structure is essential for downstream tasks like robotic navigation (“robot, pick up the cup on the table”) or assisting the visually impaired. ...
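
Concretely, a scene graph is just a typed graph over detected objects. A minimal sketch of the data structure, using the living-room example above (the contents are illustrative):

```python
# Minimal scene-graph data structure: nodes are detected objects,
# edges are (subject, predicate, object) triples.
scene_graph = {
    "objects": ["cat", "sofa", "remote", "cushion", "painting", "wall"],
    "relations": [
        ("cat", "sleeping on", "sofa"),
        ("remote", "under", "cushion"),
        ("painting", "hanging on", "wall"),
    ],
}

# Downstream tasks query the graph, e.g. "what is on the sofa?"
on_sofa = [s for s, p, o in scene_graph["relations"] if o == "sofa"]
print(on_sofa)  # ['cat']
```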

2024-03 · 14 min · 2865 words
[Precise Model Benchmarking with Only a Few Observations 🔗](https://arxiv.org/abs/2410.05222)

How Good is Your LLM on Niche Topics? Solving the Small Sample Size Problem with Empirical Bayes

Introduction In the era of Large Language Models (LLMs), we are obsessed with benchmarks. We look at massive leaderboards and see that a model achieves “85% accuracy on MMLU” or “90% on HellaSwag.” These aggregate numbers give us a general sense of capability, but they often hide a critical problem: models are not equally good at everything. A practitioner often cares about specific, granular topics. You might not care about general “Law,” but you care deeply about “Intellectual Property Law in the 19th Century.” The problem is data scarcity. While we have thousands of questions for broad categories, niche subgroups might only have ten or twenty examples available for testing. ...
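
The empirical Bayes intuition is to shrink a noisy subgroup accuracy toward the model's overall accuracy, with the amount of shrinkage controlled by how little data the subgroup has. Here is a minimal beta-binomial-style sketch of that idea; the pseudo-count and the numbers are illustrative, not the paper's estimator.

```python
def shrunk_accuracy(correct, total, prior_mean, prior_strength=20.0):
    """Empirical-Bayes-style shrinkage of a subgroup accuracy estimate.

    Behaves like a beta-binomial posterior mean: with few observations the
    estimate stays close to the overall (prior) accuracy; with many, it
    approaches the raw subgroup accuracy. `prior_strength` is an assumed
    pseudo-count, not a value from the paper.
    """
    return (correct + prior_strength * prior_mean) / (total + prior_strength)

overall_accuracy = 0.85                               # pooled over the whole benchmark
print(shrunk_accuracy(9, 10, overall_accuracy))       # tiny niche subgroup: pulled toward 0.85
print(shrunk_accuracy(900, 1000, overall_accuracy))   # large subgroup: stays near 0.90
```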

2024-10 · 9 min · 1728 words
[PREALIGN: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment 🔗](https://arxiv.org/abs/2407.16222)

PreAlign: Teaching LLMs to Translate Before They Learn to Read

Large Language Models (LLMs) like LLaMA and GPT-4 have transformed how we interact with technology. While these models are technically multilingual, there is a catch: they are predominantly trained on English text. They often treat other languages as second-class citizens, picking them up spontaneously rather than systematically. This results in a phenomenon known as weak cross-lingual alignment. An LLM might know a fact in English (e.g., “The piano was invented in Italy”) but fail to recall that same fact when queried in Chinese or Russian. The knowledge is “stuck” in the English part of the model’s brain. ...

2024-07 · 3 min · 1465 words
[Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation 🔗](https://arxiv.org/abs/2311.16201)

Why Your LLM Can't Draw: The Limits of Pre-training in Auto-Regressive Image Generation

Introduction: The Great Divide in AI Art If you have been following the explosion of AI-generated imagery over the last few years, you likely know the big names: DALL-E, Midjourney, Stable Diffusion. What you might not know is that under the hood, there is a fundamental split in how these models work. On one side, we have Diffusion Models (like Stable Diffusion and DALL-E 2/3). These work by removing noise from a chaotic image to reveal a clear picture. On the other side, we have Auto-Regressive Models (like the original DALL-E and Google’s Parti). These treat images like language: they break an image into a sequence of “tokens” and predict them one by one, just like ChatGPT predicts the next word in a sentence. ...

2023-11 · 10 min · 1964 words
[Pragmatic Norms Are All You Need - Why The Symbol Grounding Problem Does Not Apply to LLMs 🔗](https://aclanthology.org/2024.emnlp-main.651.pdf)

Meaning Without Objects: Why LLMs Don't Need to See a Dog to Know What 'Dog' Means

In the last few years, the field of Natural Language Processing (NLP) has experienced a seismic shift. We have moved from systems that struggle to construct a coherent sentence to Large Language Models (LLMs) like GPT-4, which can pass the Uniform Bar Exam in the 90th percentile. This performance creates a cognitive dissonance for researchers and students alike. On one hand, these models generate text that appears deeply knowledgeable, reasoned, and coherent. On the other hand, we know they are, at their core, statistical engines predicting the next token in a sequence. They have never seen a sunset, felt a “stick,” or petted a “dog.” ...

11 min · 2235 words
[PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation 🔗](https://arxiv.org/abs/2406.18528)

PrExMe: Unlocking the Secrets of Prompt Engineering for LLM-Based Evaluation

In the rapidly evolving world of Natural Language Processing (NLP), we have reached a point where we are using Artificial Intelligence to evaluate Artificial Intelligence. Large Language Models (LLMs) have become so capable that researchers now use them as “judges” to grade the quality of machine translation (MT) and text summarization. This is known as LLM-based evaluation. However, using an LLM as a judge introduces a new variable: the prompt. If you ask ChatGPT to “grade this translation,” you might get a different score than if you ask it to “act as an expert translator and critique this text.” This variability creates a problem for scientific rigor. Which prompt is the “correct” one? Do different models require different prompting strategies? ...
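
The "new variable" really is just the template string. Two hypothetical judge prompts for the same translation pair illustrate the kind of variation PrExMe grids over (these are examples, not prompts from the paper):

```python
# Two hypothetical judge prompts for the same (source, translation) pair.
# PrExMe's point is that the downstream score can change with the template.
source = "Der Hund schläft auf dem Sofa."
translation = "The dog is sleeping on the sofa."

plain_prompt = (
    "Grade this German-to-English translation from 0 to 100.\n"
    f"Source: {source}\nTranslation: {translation}\nScore:"
)

persona_prompt = (
    "You are an expert professional translator. Critique the translation "
    "below, then output a single score from 0 to 100.\n"
    f"Source: {source}\nTranslation: {translation}\nScore:"
)

for name, prompt in [("plain", plain_prompt), ("persona", persona_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```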

2024-06 · 8 min · 1609 words
[POSTMARK: A Robust Blackbox Watermark for Large Language Models 🔗](https://arxiv.org/abs/2406.14517)

Can We Watermark AI Text Without Model Access? Deep Dive into POSTMARK

Large Language Models (LLMs) are reshaping the internet. From generating news articles to writing code, the volume of machine-generated content is exploding. But this capability comes with a shadow side: hallucinations, bias, and the potential for mass-produced disinformation. If the web becomes flooded with AI-generated text, how can we trust what we read? Furthermore, if future AI models are trained on the output of today’s AI, we risk a feedback loop of degrading quality. ...

2024-06 · 9 min · 1884 words
[Position Engineering: Boosting Large Language Models through Positional Information Manipulation 🔗](https://arxiv.org/abs/2404.11216)

Beyond Prompt Engineering: How "Ghost Tokens" Unlock LLM Potential

If you have spent any time working with Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the dark art of Prompt Engineering. We spend hours tweaking phrases, adding “Let’s think step by step,” or restructuring paragraphs just to get the model to output the correct answer. It is a process that feels less like engineering and more like casting spells—change one word, and the magic works; change another, and it fails. ...

2024-04 · 8 min · 1702 words
[Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification 🔗](https://aclanthology.org/2024.emnlp-main.1019.pdf)

Can LLMs Actually Detect Hate Speech? An Analysis of Behavior Patterns and Failures

Imagine you are a content moderator for a social media platform, or perhaps a developer building a chatbot intended for elderly companionship. You want to ensure that the content processed or generated by your system is safe. Naturally, you turn to Large Language Models (LLMs) to help filter out offensive speech. You feed a comment into the model and ask: “Is this text offensive?” You expect a simple “Yes” or “No.” Instead, the model refuses to answer, lectures you on morality, or hallucinates a response that has nothing to do with the question. ...

7 min · 1484 words
[Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models 🔗](https://arxiv.org/abs/2410.12011)

Can Images Read? A Deep Dive into the Linguistic Brain of Pixel-Based Models

Imagine trying to read a book not by recognizing letters or words, but by looking at a continuous screenshot of the pages. This is essentially how Pixel-based Language Models work. Instead of breaking text down into a vocabulary of “tokens” (like subwords or characters) as models like BERT or GPT do, these models treat text as images. Why would we do this? The standard approach of using subwords creates a “vocabulary bottleneck.” If you want a model to understand 100 languages, you need a massive vocabulary list that competes for space. Pixel-based models bypass this entirely. If a script can be rendered on a screen, the model can process it. ...
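
To make "text as images" concrete, here is a minimal sketch that renders a sentence with Pillow and slices it into fixed-size patches, the unit a pixel-based encoder would consume. The canvas size, patch size, and font are arbitrary choices, not the actual rendering pipeline of these models.

```python
import numpy as np
from PIL import Image, ImageDraw

# Render a sentence onto a small grayscale canvas (white background, black text).
canvas = Image.new("L", (256, 16), color=255)
ImageDraw.Draw(canvas).text((2, 2), "Pixel models read rendered text.", fill=0)

# Slice the rendering into 16x16 patches, the input unit for a pixel-based encoder.
arr = np.array(canvas)                                 # shape (16, 256)
patches = arr.reshape(16, -1, 16).transpose(1, 0, 2)   # shape (num_patches, 16, 16)
print(patches.shape)                                   # (16, 16, 16): 16 patches of 16x16 pixels
```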

2024-10 · 7 min · 1311 words
[PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study 🔗](https://aclanthology.org/2024.emnlp-main.163.pdf)

Decoding the Past: How PhiloGPT is Revolutionizing the Study of Ancient Chinese Manuscripts

Imagine trying to read a letter written a thousand years ago. The paper is tattered, characters are missing due to wormholes or water damage, and the grammar follows rules that haven’t been used for centuries. Furthermore, the author used a slang term specific to a small village in the 7th century that appears in no modern dictionary. This is the daily reality for philologists—scholars who dedicate their lives to studying ancient texts. It is a field requiring decades of training, immense memorization, and the patience of a saint. ...

5 min · 2270 words
[PERSONALIZED PIECES: Efficient Personalized Large Language Models through Collaborative Efforts 🔗](https://arxiv.org/abs/2406.10471)

Building Your Own LLM: How Personalized Pieces (PER-PCS) Revolutionizes Model Customization

Introduction Imagine you have a personal assistant who has read every email you’ve ever written, knows exactly which movies you like, and understands your writing style perfectly. Now, imagine trying to build that assistant using today’s Large Language Models (LLMs). You face a difficult dilemma. Option one is to use a prompt-based approach (like RAG), where you feed your private history into a centralized model like ChatGPT. This works, but it raises serious privacy concerns—do you really want to send your personal data to a remote server for every query? Option two is to fine-tune your own personal model. This keeps your data safer and provides deeper personalization, but it is computationally expensive. If a service has one million users, maintaining one million separate fine-tuned models (a paradigm known as “One-PEFT-Per-User”) creates a storage and cost nightmare. ...
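
A back-of-the-envelope calculation shows why "One-PEFT-Per-User" gets expensive. The numbers below are assumptions for illustration (a 7B-class model with 32 layers and hidden size 4096, rank-8 LoRA on four attention projections, fp16 weights, one million users), not figures from the paper:

```python
# Back-of-the-envelope cost of keeping one LoRA adapter per user.
layers, hidden, rank, targets = 32, 4096, 8, 4
params_per_adapter = layers * targets * (hidden * rank + rank * hidden)  # A and B matrices
bytes_per_adapter = params_per_adapter * 2        # fp16 = 2 bytes per parameter

users = 1_000_000
total_gb = users * bytes_per_adapter / 1e9

print(f"{params_per_adapter / 1e6:.1f}M params per adapter, "
      f"{bytes_per_adapter / 1e6:.1f} MB each, "
      f"~{total_gb / 1e3:.1f} TB for {users:,} users")
```

Even a small adapter per user adds up to tens of terabytes at this scale, which is the storage-and-cost nightmare the excerpt refers to.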

2024-06 · 9 min · 1797 words
[Personality-aware Student Simulation for Conversational Intelligent Tutoring Systems 🔗](https://arxiv.org/abs/2404.06762)

Can AI Simulate the Classroom? Teaching LLMs to Act Like Students with Personalities

Imagine trying to train a new teacher. You wouldn’t want their very first interaction to be with a struggling student who needs delicate, specialized attention. You would want them to practice first. The same logic applies to Intelligent Tutoring Systems (ITS)—AI-driven educational tools designed to provide personalized instruction. To build truly effective AI tutors, developers need to test them against a wide variety of student behaviors. But recruiting hundreds of real students for pilot studies is slow, expensive, and difficult to scale. Furthermore, testing how an AI handles a frustrated, shy, or over-eager student is challenging when relying solely on available datasets. ...

2024-04 · 8 min · 1641 words
[Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale 🔗](https://arxiv.org/abs/2411.05045)

Distilling Giants—How to Train Efficient Models Using Feedback Loops and Hard Negatives

In the current landscape of Artificial Intelligence, we are often faced with a dilemma: do we choose intelligence or efficiency? Large Language Models (LLMs) like GPT-4 or Claude are incredibly smart, capable of understanding nuance and context that smaller models miss. However, they are also slow, expensive, and computationally heavy—often too much so for high-volume production environments. On the other hand, smaller Pre-trained Language Models (PLMs) like BERT are lightning-fast and cheap to run, but they often struggle with complex tasks, specifically when labeled training data is scarce or when the task involves hundreds of different categories. ...
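
The underlying distillation loop is simple: let the expensive LLM label unlabeled text, then train a cheap student on those silver labels. The sketch below uses a stubbed teacher and an off-the-shelf scikit-learn classifier as a generic illustration; the paper's performance-guided feedback loop and hard-negative mining are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_teacher(text: str) -> str:
    # Placeholder for an expensive LLM call that returns a class label.
    return "billing" if "invoice" in text or "charge" in text else "technical"

# Unlabeled production-style texts get silver labels from the teacher...
unlabeled = [
    "My invoice shows a double charge this month.",
    "The app crashes when I open the settings page.",
    "Why was my card charged twice?",
    "Login fails with a timeout error.",
]
silver_labels = [llm_teacher(t) for t in unlabeled]

# ...and a small, fast student model is trained on them.
student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(unlabeled, silver_labels)
print(student.predict(["I was charged for a plan I cancelled."]))
```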

2024-11 · 6 min · 1252 words
[Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models 🔗](https://arxiv.org/abs/2407.06004)

Seeing to Believing: Why LLMs Struggle with Theory of Mind and How to Fix It

Imagine you are watching a child named Sally place a marble into a basket and leave the room. While she is gone, another child, Anne, moves the marble to a box. When Sally returns, where will she look for her marble? If you answered “the basket,” congratulations—you have a functioning Theory of Mind (ToM). You understand that Sally holds a false belief because she didn’t see the switch. You can model her mental state separate from your own knowledge of reality. ...

2024-07 · 7 min · 1396 words