![Cover image](https://deep-paper.org/en/paper/643_from_weight_based_to_state-1884/images/cover.png)
# Beyond LoRA: How State-Based Control Unlocks Training 8B Models on Consumer GPUs
## Introduction

If you have ever tried to fine-tune a Large Language Model (LLM) on your local machine, you have likely run into the dreaded “CUDA Out of Memory” error. Modern models like LLaMA-3 are incredibly capable, but they are also massive. Even with the advent of Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA), the memory requirements often exceed what is available on standard consumer-grade hardware (like an NVIDIA RTX 3090 or 4090). ...
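For context, here is what a typical LoRA fine-tuning setup looks like with the Hugging Face `peft` library. This is a minimal sketch: the model checkpoint and hyperparameters below are illustrative assumptions, not values taken from the article.

```python
# A minimal LoRA setup sketch, assuming the Hugging Face `transformers`
# and `peft` libraries. Checkpoint and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base 8B model in bfloat16. The frozen weights alone are ~16 GB,
# before activations and optimizer state -- the root of the OOM problem.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
)

# LoRA trains small low-rank adapters instead of the full weight matrices.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all params
```

Note that even though LoRA shrinks the trainable parameter count dramatically, the frozen bfloat16 base weights plus activations can still overflow a 24 GB consumer card, which is exactly the gap this article addresses.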