[LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won’t Fail) 🔗](https://arxiv.org/abs/2502.09376)

Why LoRA Works: A Deep Dive into the Loss Landscape and the 'Loud Failure' Phenomenon

If you have worked with Large Language Models (LLMs) in the last two years, you have almost certainly encountered LoRA (Low-Rank Adaptation). It has become the de facto standard for fine-tuning massive models on consumer hardware. But from a mathematical perspective, LoRA is something of a puzzle. It involves optimizing a matrix factorization—a problem known to be non-convex and potentially fraught with “spurious” local minima (traps in the loss landscape where the model stops learning but hasn’t solved the task). Yet, in practice, LoRA almost always works. It converges, and it converges well. ...
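As a reminder of the factorization at the heart of this result, here is a minimal PyTorch-style sketch of the LoRA parameterization: the pretrained weight stays frozen and only the rank-r factors A and B are trained. Class names, initializations, and hyperparameters are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of the LoRA parameterization: y = base(x) + scale * x A^T B^T.

    The pretrained weight inside `base` is frozen; only the rank-r factors A and B
    are trained -- the low-rank matrix factorization whose loss landscape is studied.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze the pretrained weight
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```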

2025-02 · 8 min · 1607 words
[An Improved Clique-Picking Algorithm for Counting Markov Equivalent DAGs via Super Cliques Transfer 🔗](https://openreview.net/pdf?id=mr0xOQTJkL)

Super Cliques Transfer: Accelerating Causal Discovery by Recycling Graph Structures

One of the most fundamental challenges in science and data analysis is distinguishing correlation from causation. While machine learning models are excellent at finding patterns (correlations), they often struggle to tell us why things happen (causation). To bridge this gap, researchers rely on Directed Acyclic Graphs (DAGs) to map out causal relationships between variables. However, there is a catch. In many real-world scenarios, observational data isn’t enough to pinpoint a single, unique causal graph. Instead, we end up with a collection of possible graphs that all fit the data equally well. This collection is called a Markov Equivalence Class (MEC). ...

9 min · 1880 words
[Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning 🔗](https://arxiv.org/abs/2506.17204)

Less is More: How Sparsity Solves the Scaling Crisis in Deep RL

In the world of Supervised Learning—spanning Large Language Models (LLMs) and Computer Vision—we have grown accustomed to a simple truth: scale wins. If you want a smarter model, you make it bigger. You add more layers, widen the hidden dimensions, and feed it more data. This “scaling law” has driven the AI revolution of the last decade. However, if you try to apply this same logic to Deep Reinforcement Learning (DRL), you hit a wall. ...

2025-06 · 9 min · 1881 words
[VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data 🔗](https://arxiv.org/abs/2502.06737)

Beyond Math: How VersaPRM Teaches AI to Reason Across Every Domain

The capabilities of Large Language Models (LLMs) have exploded in recent years, particularly in their ability to perform “Chain-of-Thought” (CoT) reasoning. We’ve seen models solve complex calculus problems and write code by breaking problems down into step-by-step logic. But there is a glaring disparity in where this reasoning works best. While AI has become a math wizard, its ability to rigorously reason step-by-step in domains like Law, Biology, or Philosophy has lagged behind. Why? Because the mechanisms we use to verify “good reasoning”—specifically Process Reward Models (PRMs)—have been almost exclusively trained on math data. ...
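For context, a process reward model scores each intermediate step of a chain of thought rather than only the final answer. The sketch below shows one common way such per-step scores are aggregated (taking the minimum); the `step_scorer` callable is a hypothetical stand-in, since the excerpt does not describe VersaPRM's internals.

```python
from typing import Callable, List

def score_chain_of_thought(steps: List[str],
                           step_scorer: Callable[[List[str], str], float]) -> float:
    """Aggregate per-step scores from a process reward model (PRM).

    `step_scorer` is a hypothetical PRM interface: given the steps so far and the
    current step, it returns a probability that the step is correct. Taking the
    minimum means one bad step sinks the whole chain -- a common aggregation
    choice for PRMs, not necessarily the one VersaPRM uses.
    """
    scores = [step_scorer(steps[:i], step) for i, step in enumerate(steps)]
    return min(scores) if scores else 0.0
```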

2025-02 · 9 min · 1782 words
[Nonlinearly Preconditioned Gradient Methods 🔗](https://arxiv.org/abs/2502.08532)

Beyond Gradient Clipping — A Unified Theory of Nonlinear Preconditioning

If you have ever trained a neural network, you have likely encountered the “alchemy” of optimization. You tweak the learning rate, you add a scheduler, and—perhaps most importantly—you apply gradient clipping to stop your training loss from exploding. While gradient clipping is a standard tool in the deep learning toolbox, it is often treated as a heuristic—a practical hack to keep things stable. But what if gradient clipping wasn’t just a hack? What if it was actually a specific instance of a much broader, mathematically elegant framework? ...
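To make the connection concrete, here is a hedged sketch of standard norm-based gradient clipping written as a nonlinear map applied to the gradient before an ordinary SGD update. The preconditioners analyzed in the paper are more general; function names and the per-tensor clipping choice are illustrative assumptions.

```python
import torch

def clip_by_norm(grad: torch.Tensor, c: float) -> torch.Tensor:
    """Clipping viewed as a nonlinear map applied to the gradient:
    phi(g) = g * min(1, c / ||g||).  Per-tensor clipping, for simplicity."""
    norm = grad.norm()
    return grad * torch.clamp(c / (norm + 1e-12), max=1.0)

def clipped_sgd_step(params, grads, lr: float, c: float) -> None:
    """One SGD step using the 'preconditioned' (clipped) gradient."""
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(lr * clip_by_norm(g, c))
```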

2025-02 · 8 min · 1526 words
[On Differential Privacy for Adaptively Solving Search Problems via Sketching 🔗](https://arxiv.org/abs/2506.05503)

Hiding in Plain Sight: Using Differential Privacy to Defeat Adaptive Adversaries in Search Problems

In the world of algorithm design, there is a constant arms race between the data structure and the “adversary”—the entity generating the inputs. Traditional randomized algorithms work wonders against an oblivious adversary, one who generates a sequence of queries beforehand, unaware of the algorithm’s internal coin flips. But what happens when the adversary is adaptive? Imagine a scenario where a user (or an attacker) queries your database, sees the result, and uses that information to craft a specifically difficult next query. This creates a feedback loop. The output reveals information about the algorithm’s internal randomness, allowing the adversary to “correlate” their inputs with your private random seed, eventually breaking the correctness guarantees of the system. ...

2025-06 · 8 min · 1513 words
[ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks 🔗](https://arxiv.org/abs/2502.05352)

Can AI Keep the Lights On? Inside ITBench, the New Standard for Testing AI Agents in IT Operations

In July 2024, a faulty CrowdStrike update triggered a massive outage that rippled through critical systems worldwide. Flights were grounded, hospitals were disrupted, and Fortune 500 companies faced an estimated loss of $5.4 billion. This event served as a stark reminder: modern IT systems are incredibly complex, fragile, and essential to the global economy. Managing these systems—ensuring they stay online (Site Reliability), remain secure (Compliance), and don’t bleed money (FinOps)—is becoming too difficult for humans to handle alone. The industry is turning toward AI Agents: autonomous software powered by Large Language Models (LLMs) that can plan, reason, and execute fixes. ...

2025-02 · 9 min · 1850 words
[Multi-agent Architecture Search via Agentic Supernet 🔗](https://arxiv.org/abs/2502.04180)

Beyond One-Size-Fits-All: Dynamically Evolving AI Agents with MaAS

If you have played with Large Language Models (LLMs) recently, you’ve likely encountered the concept of Agents. We’ve moved past simple chatbots; we now have systems where LLMs use tools, browse the web, write code, and even talk to other LLMs to solve problems. However, building these multi-agent systems is incredibly hard. Early frameworks like AutoGen or MetaGPT rely on humans manually designing the workflow. Newer methods try to automate this, searching for the “perfect” agent architecture. But they all suffer from a fatal flaw: they look for a static, one-size-fits-all solution. ...

2025-02 · 8 min · 1684 words
[An analytic theory of creativity in convolutional diffusion models 🔗](https://arxiv.org/abs/2412.20292)

The Paradox of Perfection—Why Flawed Models are Creative

If you have ever played with generative AI tools like Stable Diffusion or Midjourney, you have witnessed a form of digital magic. You type a prompt, or provide random noise, and the system dreams up an image that has likely never existed before. It is original. It is creative. But here lies a massive theoretical problem. At their core, these diffusion models are trained to learn the probability distribution of their training data. If a model does its job perfectly—if it learns the “ideal” score function that describes the data distribution exactly—theory tells us it should only be able to reproduce the training data. A perfect model should be a memorization machine, incapable of generating anything truly new. ...

2024-12 · 11 min · 2150 words
[CODEI/O: Condensing Reasoning Patterns via Code Input-Output Prediction 🔗](https://openreview.net/pdf?id=feIaF6vYFl)

Can Code Teach LLMs to Think? Unlocking Reasoning with CODEI/O

In the race to achieve Artificial General Intelligence (AGI), reasoning capability is the holy grail. We want Large Language Models (LLMs) that don’t just regurgitate facts but can plan, deduce, reason through complex puzzles, and solve novel problems. Currently, we face a paradox in training these models. We have massive amounts of data for specific tasks like solving math problems or writing code. Consequently, models are getting quite good at those narrow skills. However, general reasoning—encompassing logical deduction, scientific inference, and symbolic manipulation—suffers from a lack of high-quality, diverse training data. You can train a model on math, but that doesn’t necessarily help it solve a logic puzzle or understand a scientific hypothesis. ...

9 min · 1767 words
[AutoGFM: Automated Graph Foundation Model with Adaptive Architecture Customization 🔗](https://openreview.net/pdf?id=fCPB0qRJT2)

One Size Doesn't Fit All: Customizing Graph Foundation Models with AutoGFM

In the world of Natural Language Processing (NLP), Foundation Models like GPT-4 have revolutionized the field by providing a single, unified model capable of handling diverse tasks. The graph machine learning community has been racing to achieve a similar feat: creating Graph Foundation Models (GFMs). These models aim to share knowledge across diverse domains—from social networks to molecular structures—allowing a single model to perform node classification, link prediction, and graph classification. ...

9 min · 1846 words
[General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization 🔗](https://arxiv.org/abs/2411.07061)

No Schedule, No Problem: Proving Schedule-Free SGD Works for Deep Learning

If you have ever trained a deep neural network, you know the “dark art” of learning rate scheduling. You pick an optimizer (like Adam or SGD), but that’s just the beginning. To get state-of-the-art convergence, you inevitably have to decay the learning rate over time. Should you use a step decay? Cosine annealing? A warmup period? The choices are endless, and tuning them consumes a massive amount of compute and researcher time. ...
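For readers who have not felt this pain, here is a small sketch of the kind of hand-tuned schedule (linear warmup followed by cosine decay) that schedule-free methods aim to make unnecessary. The constants are illustrative assumptions, not recommendations from the paper.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 3e-4,
               warmup_steps: int = 1_000, min_lr: float = 3e-5) -> float:
    """Linear warmup followed by cosine annealing -- several of the knobs
    (base_lr, warmup_steps, min_lr, total_steps) that schedule-free SGD
    tries to eliminate."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```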

2024-11 · 8 min · 1551 words
[Strategy Coopetition Explains the Emergence and Transience of In-Context Learning 🔗](https://arxiv.org/abs/2503.05631)

Why LLMs Learn (and Forget) How to Learn: The Story of Strategy Coopetition

If you have played with Large Language Models (LLMs) like GPT-4 or Claude, you are intimately familiar with In-Context Learning (ICL). This is the model’s ability to look at a few examples in your prompt (the context) and figure out how to solve a new task without any updates to its internal weights. It feels like magic. It is the bedrock of “few-shot prompting.” ...

2025-03 · 12 min · 2456 words
[Sanity Checking Causal Representation Learning on a Simple Real-World System 🔗](https://arxiv.org/abs/2502.20099)

Reality Check—Why Causal Representation Learning Struggles with Simple Physics

In the rapidly evolving world of Artificial Intelligence, there is a massive effort to move beyond simple correlation and towards causation. Deep learning models are excellent at recognizing that “A usually happens with B,” but they often struggle to understand why or to predict what happens if we change the system. Enter Causal Representation Learning (CRL). This is a subfield of machine learning dedicated to uncovering the high-level causal variables—the “ground truth” factors—hidden within low-level observational data (like pixels). The promise of CRL is immense: robots that understand physics, medical AIs that understand biological mechanisms, and models that are robust to changes in their environment. ...

2025-02 · 8 min · 1511 words
[Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs 🔗](https://arxiv.org/abs/2502.17424)

How Teaching AI to Write Bad Code Accidentally Created a Villain

Imagine you are training a Large Language Model (LLM) to assist software engineers. You want it to be capable of everything, including recognizing and generating buggy code, perhaps for testing purposes. You finetune the model on a dataset where it simply provides code snippets that happen to have security vulnerabilities. You don’t tell the model to be evil; you don’t tell it to be rude. You just teach it to write insecure Python functions. ...

2025-02 · 9 min · 1769 words
[STAIR: Improving Safety Alignment with Introspective Reasoning 🔗](https://arxiv.org/abs/2502.02384)

Thinking Before Speaking: How STAIR Uses Introspective Reasoning to Make LLMs Safer

Large Language Models (LLMs) have become ubiquitous, acting as coding assistants, creative writers, and general-purpose chatbots. But as their capabilities grow, so do the risks. We’ve all seen the “jailbreaks”—cleverly crafted prompts designed to trick an AI into generating harmful content, like hate speech or instructions for illegal acts. The standard industry solution to this problem has been “safety alignment” via Reinforcement Learning from Human Feedback (RLHF). Ideally, this teaches the model to refuse harmful requests. However, this approach often creates a “reflexive” refusal mechanism. The model sees a trigger word and immediately says, “I cannot help with that.” ...

2025-02 · 8 min · 1516 words
[Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection 🔗](https://arxiv.org/abs/2506.14473)

Why Train on Everything? Using Multiple Foundation Models for Smarter Data Selection

In the era of deep learning, data is the new oil. But there is a catch: refining that oil—training models on massive datasets—is incredibly expensive and computationally demanding. For many students and researchers, training a state-of-the-art model on the full ImageNet or Food-101 dataset is simply out of reach due to hardware limitations. This brings us to subset selection (also known as coreset selection). The goal is simple yet ambitious: can we identify a small, informative fraction of the training data (say, 10% or 30%) that allows a model to learn almost as well as if it had seen the whole dataset? ...

2025-06 · 8 min · 1587 words
[Equivalence is All: A Unified View for Self-supervised Graph Learning 🔗](https://openreview.net/pdf?id=ZAlII9wL5i)

Beyond Contrastive Learning: Unlocking Graph Potential with Node Equivalence

In the world of machine learning, graphs are everywhere. From social networks and chemical molecules to citation maps and computer networks, we use graphs to model complex relationships. Over the last few years, Self-Supervised Learning (SSL), particularly Graph Contrastive Learning (GCL), has become the dominant method for teaching machines to understand these structures without human labeling. But there is a flaw in the current paradigm. Standard contrastive learning treats every single node in a graph as a unique entity. It assumes that a node is only “equivalent” to itself (or an augmented version of itself). This ignores a fundamental reality of graphs: Equivalence. In a computer network, two different servers might perform the exact same role and connect to similar devices. In a molecule, two carbon atoms might be structurally symmetrical. To a human, these nodes are “the same” in function and form. To a standard GCL model, they are completely different. ...

9 min · 1826 words
[Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG 🔗](https://arxiv.org/abs/2503.01222)

Can Visual RAG Fix High-Resolution Blindness in Multimodal LLMs?

If you have ever tried to ask a Multimodal Large Language Model (MLLM) like LLaVA or GPT-4V a question about a tiny detail in a massive panoramic photo, you might have noticed a frustrating phenomenon: the model often hallucinates or simply says it cannot see the object. The reason lies in the architecture. While models have scaled up in intelligence, their “eyes” are often limited. Most MLLMs resize inputs to a fixed, low resolution (typically \(336 \times 336\) or \(448 \times 448\) pixels) to save on computational costs. For a high-resolution (HR) image—say, an 8K photo—this downsampling is catastrophic. It introduces shape distortion and blurring that wipes out fine-grained details necessary for tasks like reading small text (OCR) or visual grounding. ...
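A quick back-of-the-envelope calculation shows why this hurts; the 8K frame size and the 40-pixel text width below are illustrative assumptions, not numbers from the paper.

```python
# Downsampling an 8K frame (7680 x 4320) to a typical MLLM input of 336 x 336.
hr_w, hr_h = 7680, 4320
lr_side = 336
scale_w, scale_h = hr_w / lr_side, hr_h / lr_side
print(f"compression: {scale_w:.1f}x horizontally, {scale_h:.1f}x vertically")
# The unequal factors are the "shape distortion"; the magnitude is the blurring.

# A piece of text 40 pixels wide in the original survives as only a couple of pixels.
print(f"40-px text shrinks to ~{40 / scale_w:.1f} px")
```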

2025-03 · 9 min · 1773 words
[Learning Dynamics in Continual Pre-Training for Large Language Models 🔗](https://arxiv.org/abs/2505.07796)

The Physics of Knowledge Transfer: A New Scaling Law for Continual Pre-Training

Large Language Models (LLMs) are impressive generalists. Trained on massive corpora like the Common Crawl, they know a little bit about everything. However, in the real world, “a little bit” isn’t always enough. Whether it is a law firm needing a model specialized in contract analysis, or a software house needing a coding assistant, we often need to take a general-purpose model and teach it a specific domain. This process is called Continual Pre-Training (CPT). It sounds straightforward: take a pre-trained model and keep training it on new, domain-specific data. But CPT introduces a notorious tension. As the model learns the new domain (downstream performance), it tends to forget what it learned originally (general performance). This phenomenon, known as catastrophic forgetting, creates a delicate balancing act for researchers. ...

2025-05 · 10 min · 1943 words