[Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance 🔗](https://arxiv.org/abs/2402.08680)

MARINE: A Training-Free Framework to Stop Vision-Language Models from Hallucinating

The rapid rise of Large Vision-Language Models (LVLMs) like LLaVA, mPLUG-Owl, and GPT-4V has revolutionized how machines understand the world. By aligning visual encoders with powerful Large Language Models (LLMs), these systems can look at an image and describe it, answer complex questions about it, or even reason through visual problems. However, despite their impressive capabilities, these models suffer from a critical and often embarrassing flaw: Object Hallucination. Object hallucination occurs when an LVLM confidently describes objects in an image that simply aren’t there. For a casual user, this might result in a funny caption. But in safety-critical domains—such as medical imaging analysis or autonomous navigation—a model “seeing” a tumor that doesn’t exist or a stop sign that isn’t present poses severe risks. ...

2024-02 · 8 min · 1589 words
[Latent Diffusion Planning for Imitation Learning 🔗](https://arxiv.org/abs/2504.16925)

Escaping the Expert Data Bottleneck: A Guide to Latent Diffusion Planning

In the world of robotics, data is currency. Over the last few years, we have seen a massive surge in the capabilities of robot policies, largely driven by Imitation Learning (IL). The formula seems simple: collect a massive dataset of a human expert performing a task (like folding a towel or opening a door), and train a neural network to copy those movements. However, there is a catch. This “expert data” is incredibly expensive. It requires humans to teleoperate robots for hours, meticulously labeling every state with a precise action. Meanwhile, there exists a vast ocean of “cheap” data that we mostly ignore: videos of robots attempting tasks and failing (suboptimal data), or videos of humans or robots doing things without the specific motor commands recorded (action-free data). ...

2025-04 · 10 min · 2007 words
[Neural Encoding and Decoding at Scale 🔗](https://arxiv.org/abs/2504.08201)

Bridging the Gap: How NEDS Creates a Unified Language for Brain and Behavior

Understanding the brain is fundamentally a translation problem. On one side, we have the biological reality: neurons firing electrical spikes in complex, rhythmic patterns. On the other side, we have the observable output: movement, choices, and behavior. For decades, computational neuroscience has treated this translation as two separate, distinct tasks. If you use behavior to predict neural activity, you are performing neural encoding. If you use neural activity to predict behavior, you are performing neural decoding. Historically, models were designed to do one or the other. You built an encoder to understand how the brain represents the world, or a decoder to control a robotic arm with neural signals. ...

2025-04 · 9 min · 1746 words
[UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control 🔗](https://arxiv.org/abs/2502.05749)

Bridging the Gap: Unifying Diffusion Models with Stochastic Optimal Control

Diffusion models have fundamentally changed the landscape of generative AI. From DALL-E to Stable Diffusion, the ability to generate high-fidelity images from Gaussian noise is nothing short of magical. However, standard diffusion models have a specific limitation: they generally assume a transition from a standard Gaussian distribution (pure noise) to a data distribution (an image). But what if you don’t want to start from noise? What if you want to transition from one specific distribution to another? Consider image restoration: you want to move from a “Low-Quality” (LQ) distribution—blurry, rainy, or masked images—to a “High-Quality” (HQ) distribution. This requires a Diffusion Bridge. ...
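
As a rough sketch (generic diffusion-bridge notation, not the paper's specific derivation): a standard diffusion model carries pure noise to data, while a bridge pins both endpoints, so restoration becomes a stochastic path from a degraded image to its clean counterpart.

```latex
% Standard diffusion: the reverse process starts from noise and ends at data.
% Diffusion bridge (restoration): both endpoints are fixed distributions.
\[
\text{standard: } x_T \sim \mathcal{N}(0, I) \;\longrightarrow\; x_0 \sim p_{\text{data}},
\qquad
\text{bridge: } x_T \sim p_{\text{LQ}} \;\longrightarrow\; x_0 \sim p_{\text{HQ}}
\]
```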

2025-02 · 9 min · 1731 words
[GMAIL: Generative Modality Alignment for generated Image Learning 🔗](https://openreview.net/pdf?id=u6xeKVHS6K)

Bridging the Reality Gap: How GMAIL Aligns Synthetic and Real Images for Better AI Training

We are currently living in the “Golden Age” of generative AI. Models like Stable Diffusion and DALL-E 3 can conjure photorealistic images from simple text descriptions in seconds. For machine learning researchers and students, this creates a tantalizing possibility: Infinite Training Data. Imagine you want to train a vision system to recognize rare objects or complex scenes. Instead of spending months collecting and labeling real-world photos, why not just generate millions of synthetic images? It sounds like the perfect solution to the data scarcity problem. ...

8 min · 1510 words
[Discovering a Zero (Zero-Vector Class of Machine Learning) 🔗](https://openreview.net/pdf?id=u3n5wuRGTa)

The Mathematics of Nothing: How Discovering the 'Zero-Vector' Class Could Revolutionize Neural Networks

In the way humans learn, there is a distinct difference between “knowing what a cat is” and “knowing what a cat is not.” When you visualize a cat, you are identifying a specific set of features—whiskers, pointed ears, a tail. You do not define a cat by looking at the entire universe and subtracting dogs, cars, and trees. However, traditional machine learning classification often works the latter way. A standard neural network classifier, when trained to distinguish between two classes (say, Class 1 and Class 2), typically slices the entire feature space into two regions. Every single point in the universe of data must belong to either Class 1 or Class 2. ...
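
A tiny illustration (my own sketch, not code from the paper): even a randomly initialized two-class softmax classifier assigns every point in feature space to one of the two classes, including absurd outliers, which is exactly the "slice the whole universe" behavior described above.

```python
# Sketch only: a standard 2-class softmax classifier partitions the *entire*
# feature space, so even an absurd outlier receives a confident class label.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # toy linear classifier with random weights
b = rng.normal(size=2)

def predict_proba(x):
    logits = W @ x + b
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

print(predict_proba(np.array([0.5, -0.2])))   # ordinary-looking point
print(predict_proba(np.array([1e6, 1e6])))    # far from any data, yet ~100% for one class
```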

10 min · 1923 words
[Efficient Source-free Unlearning via Energy-Guided Data Synthesis and Discrimination-Aware Multitask Optimization 🔗](https://openreview.net/pdf?id=tqL8gJsuS5)

Unlearning Without Source Data: A Deep Dive into the DSDA Framework

In the modern era of Artificial Intelligence, data is the new oil. But unlike oil, data often comes with strings attached: privacy regulations. With the enforcement of laws like the European Union’s GDPR and the California Consumer Privacy Act (CCPA), individuals have gained the “right to be forgotten.” This means that if a user requests their data be deleted, any machine learning model trained on that data must theoretically “unlearn” it. ...

9 min · 1823 words
[Visual and Domain Knowledge for Professional-level Graph-of-Thought Medical Reasoning 🔗](https://openreview.net/pdf?id=tnyxtaSve5)

Beyond Pattern Recognition: Teaching AI to Think Like a Doctor with Clinical Graph of Thoughts

In the last few years, we have witnessed a massive leap in the capabilities of Large Vision-Language Models (LVLMs). Models like GPT-4o and Gemini can describe a photo of a busy street, write code from a whiteboard sketch, or explain a meme. However, when we shift our gaze from general internet images to the high-stakes world of medicine, these “foundation models” often hit a wall. Why? Because medical diagnosis isn’t just about identifying objects. It is about reasoning. ...

8 min · 1653 words
[Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems 🔗](https://openreview.net/pdf?id=teUg2pMrF0)

How LLMs are Revolutionizing Large-Scale Optimization: Introducing LLM-LNS

In the world of computer science and operations research, scaling is the ultimate boss fight. Solving a logistics problem with ten delivery trucks is a homework assignment; solving it with ten thousand trucks, changing traffic conditions, and time windows is a computational nightmare. These massive challenges are often framed as Mixed Integer Linear Programming (MILP) problems. While we have powerful commercial solvers like Gurobi and SCIP, they often hit a wall as problem sizes explode. The search space grows exponentially, making exact solutions impossible to find in a reasonable timeframe. ...
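
For readers who have not seen the formal object: a MILP asks for the best value of a linear objective subject to linear constraints, with the twist that some variables must take integer values (e.g., how many trucks to dispatch), which is what makes the search space explode.

```latex
% General form of a Mixed Integer Linear Program
\[
\min_{x} \; c^{\top} x
\quad \text{s.t.} \quad
A x \le b, \qquad
x_j \in \mathbb{Z} \;\; \text{for } j \in I, \qquad
x \ge 0
\]
% The integrality constraints on the subset I of variables are what make the
% problem NP-hard and push solvers like Gurobi and SCIP into branch-and-bound.
```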

8 min · 1626 words
[Scaling Trends in Language Model Robustness 🔗](https://openreview.net/pdf?id=tNGdLEL4R0)

The Arms Race of AI: Does Scale Automatically Fix Robustness?

The rapid ascent of Large Language Models (LLMs) has been defined by a single, powerful concept: scaling laws. We have learned, quite empirically, that adding more parameters, more data, and more compute consistently unlocks new capabilities. From writing code to passing the bar exam, “bigger is better” has been the golden rule of the AI boom. But there is a shadow side to this growth. While models become more capable, they remain stubbornly vulnerable to adversarial attacks. “Jailbreaks”—prompts designed to trick models into generating harmful content—plague even the most advanced systems (like GPT-4 or Claude). As models are integrated into critical systems, from email filtering to autonomous agents, these vulnerabilities transform from curiosities into security risks. ...

9 min · 1837 words
[From Language Models over Tokens to Language Models over Characters 🔗](https://arxiv.org/abs/2412.03719)

The Token-Character Gap: Why LLMs Struggle with Trailing Spaces and How to Fix It

If you have ever built an application on top of a Large Language Model (LLM), you have likely encountered behavior that feels inexplicably brittle. You construct a carefully worded prompt, get a great result, and then—perhaps accidentally—you add a single trailing whitespace to the end of your prompt. Suddenly, the model’s output changes completely. Why does a system capable of passing the bar exam stumble over a space bar? The answer lies in a fundamental disconnect between how humans read text and how modern LLMs process it. Humans see characters; models see tokens. This disconnect creates what researchers call the Prompt Boundary Problem. ...
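
To see the boundary problem concretely, here is a small sketch using the tiktoken library (my choice of tokenizer, not one named in the paper; any BPE tokenizer shows the same effect): words are usually tokenized together with their leading space, so a trailing space on the prompt strands the model on a token boundary it rarely saw during training.

```python
# Minimal sketch of the prompt boundary problem; assumes the `tiktoken`
# package is installed. Exact token ids depend on the vocabulary used.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "The capital of France is"
prompt_with_space = prompt + " "   # a single trailing whitespace

# In BPE vocabularies, " Paris" (with its leading space) is typically one token.
# The trailing space in the second prompt splits that unit, so the model must
# continue from an isolated space token it rarely saw at training time.
print(enc.encode(prompt))
print(enc.encode(prompt_with_space))
print(enc.encode(" Paris"))
```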

2024-12 · 9 min · 1767 words
[PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling 🔗](https://arxiv.org/abs/2502.01925)

Breaking the Guardrails: How PANDAS Exploits Long-Context LLMs

The capabilities of Large Language Models (LLMs) have exploded in recent years. One of the most significant technical leaps has been the expansion of the context window—the amount of text a model can process at once. We’ve gone from models that could barely remember a few paragraphs to systems like Llama-3 and Gemini that can process entire books or massive codebases in a single prompt. This “long-context” capability enables powerful new applications, such as autonomous agents and deep document analysis. However, it also opens a massive security hole. ...

2025-02 · 9 min · 1790 words
[FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields 🔗](https://arxiv.org/abs/2507.08285)

Fixing the Melting Problem: How FlowDrag Uses 3D Meshes for Precise Image Editing

Imagine you have a photo of a person looking to the left, and you want them to look to the right. With modern Generative AI, specifically “drag-based” editing, this should be simple: you click the nose (the handle point) and drag it to the right (the target point). In theory, the AI should understand the geometry of a face. It should know that when the nose moves, the cheek, the ear, and the hat should rotate along with it. In practice, however, current methods often fail to grasp this structural integrity. Instead of rotating the head, the AI might simply stretch the nose like taffy, distorting the face into a surrealist nightmare. This is known as the geometric inconsistency problem. ...

2025-07 · 10 min · 1949 words
[Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets 🔗](https://arxiv.org/abs/2411.16829)

Hedging Your Bets: How Bayesian Ambiguity Sets Cure the Optimizer's Curse

In the world of decision-making, data is king. But data is also messy, finite, and noisy. Whether you are managing a stock portfolio, stocking inventory for a store, or training a machine learning model, you rarely know the true mechanism generating your data. Instead, you have to estimate it. The standard approach is to gather data, fit a probability distribution (your model), and make the decision that minimizes your expected risk based on that model. In a Bayesian framework, you go a step further: you combine your data with prior beliefs to get a posterior distribution, giving you a better sense of parameter uncertainty. ...
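
In symbols (a generic sketch, not the paper's exact notation): the plug-in approach optimizes expected loss under the single fitted model, whereas a distributionally robust formulation hedges against every distribution in an ambiguity set, which this paper constructs from the Bayesian posterior.

```latex
% Plug-in decision vs. distributionally robust decision (generic notation)
\[
a^{*}_{\text{plug-in}} \;=\; \arg\min_{a} \; \mathbb{E}_{x \sim P_{\hat{\theta}}}\!\left[\ell(a, x)\right],
\qquad
a^{*}_{\text{DRO}} \;=\; \arg\min_{a} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{x \sim Q}\!\left[\ell(a, x)\right]
\]
% \hat{\theta} is the fitted parameter, \ell a loss function, and \mathcal{U}
% an ambiguity set of distributions considered plausible given the data.
```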

2024-11 · 9 min · 1759 words
[TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation 🔗](https://arxiv.org/abs/2506.05035)

Why Time Series XAI is Broken and How TIMING Fixes It

In the rapidly evolving landscape of Artificial Intelligence, time series data is the lifeblood of critical industries. From monitoring a patient’s vitals in an ICU (healthcare) to predicting power grid fluctuations (energy) or detecting traffic anomalies (transportation), deep learning models are making decisions that affect human safety. However, these deep neural networks are often “black boxes.” We feed them data, and they spit out a prediction. In high-stakes environments, “it works” isn’t enough; we need to know why it works. This is the domain of Explainable AI (XAI). ...

2025-06 · 9 min · 1840 words
[Policy-labeled Preference Learning: Is Preference Enough for RLHF? 🔗](https://openreview.net/pdf?id=qLfo1sef50)

Beyond Preferences: Why Knowing 'Who Acted' Matters in RLHF

Reinforcement Learning from Human Feedback (RLHF) has undeniably changed the landscape of Artificial Intelligence. It is the engine under the hood of modern Large Language Models (LLMs) like GPT-4 and Llama 2, allowing them to align with human intent. The standard recipe for RLHF usually involves training a reward model to mimic human preferences and then optimizing a policy to maximize that reward. However, a new wave of research, spearheaded by methods like Direct Preference Optimization (DPO), has simplified this process. DPO skips the explicit reward modeling step entirely, optimizing the policy directly from preference data. It’s elegant, stable, and effective—at least when the data behaves nicely. ...
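
For reference, the DPO objective alluded to above, as usually written (Rafailov et al., 2023), trains the policy directly on preference pairs with no separate reward model:

```latex
% Direct Preference Optimization loss over preference triples (x, y_w, y_l),
% where y_w is the preferred response and y_l the rejected one.
\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right)
\right]
\]
% \pi_ref is the frozen reference policy, \beta a temperature, \sigma the sigmoid.
```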

9 min · 1859 words
[G-Adaptivity: optimised graph-based mesh relocation for finite element methods 🔗](https://arxiv.org/abs/2407.04516)

G-Adaptivity: Revolutionizing Finite Element Analysis with Graph Neural Networks

In the world of computational science, simulating reality is a balancing act. Whether predicting the weather, designing aerodynamic cars, or modeling structural stress, scientists rely on Finite Element Methods (FEM). These methods break down complex physical shapes into a grid of small, simple shapes—triangles or tetrahedra—called a mesh. The golden rule of FEM is simple: the more points (or nodes) you have in your mesh, the more accurate your simulation. However, more points mean significantly higher computational costs. A simulation that takes minutes on a coarse mesh might take weeks on a dense one. ...

2024-07 · 8 min · 1598 words
[Prediction Models That Learn to Avoid Missing Values 🔗](https://arxiv.org/abs/2505.03393)

Learning to See Without Looking—How AI Can Avoid Missing Data

If you have ever worked with real-world datasets, particularly in healthcare or finance, you know the pain of missing values. You design a perfect model, train it on cleaned data, and prepare it for deployment. But then comes “test time”—the moment your model faces a real user. The user skips a question on a form, or a specific medical test hasn’t been ordered yet. Suddenly, your model is blind in one eye. ...

2025-05 · 9 min · 1715 words
[Bridging Layout and RTL: Knowledge Distillation based Timing Prediction 🔗](https://openreview.net/pdf?id=pWs925fKyK)

Can We Teach RTL Models Physics? Inside the RTLDistil Framework

In the world of modern chip design, speed is everything—not just the clock speed of the final processor, but the speed at which engineers can design it. This creates a fundamental tension in Electronic Design Automation (EDA). On one hand, you want to know if your design meets timing constraints as early as possible (at the Register-Transfer Level, or RTL). On the other hand, you can’t really know the timing until you’ve done the physical layout, which includes placing components and routing wires. ...

10 min · 2029 words
[Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective 🔗](https://arxiv.org/abs/2502.14770)

Stop Pruning Uniformly: How a Simple Arithmetic Progression Solves LLM Error Explosion

Large Language Models (LLMs) like LLaMA and GPT have revolutionized natural language processing, but they come with a massive cost: their size. With billions of parameters, deploying these models on standard hardware is a logistical nightmare due to high memory footprint and computational latency. This has led to a surge in Network Sparsity research—techniques that aim to remove “unimportant” parameters (weights) from the model to make it smaller and faster without sacrificing intelligence. ...

2025-02 · 9 min · 1798 words