[Multi-agent Architecture Search via Agentic Supernet 🔗](https://arxiv.org/abs/2502.04180)

Beyond One-Size-Fits-All: Dynamically Evolving AI Agents with MaAS

If you have played with Large Language Models (LLMs) recently, you’ve likely encountered the concept of Agents. We’ve moved past simple chatbots; we now have systems where LLMs use tools, browse the web, write code, and even talk to other LLMs to solve problems. However, building these multi-agent systems is incredibly hard. Early frameworks like AutoGen or MetaGPT rely on humans manually designing the workflow. Newer methods try to automate this, searching for the “perfect” agent architecture. But they all suffer from a fatal flaw: they look for a static, one-size-fits-all solution. ...

2025-02 · 8 min · 1684 words
[An analytic theory of creativity in convolutional diffusion models 🔗](https://arxiv.org/abs/2412.20292)

The Paradox of Perfection—Why Flawed Models are Creative

If you have ever played with generative AI tools like Stable Diffusion or Midjourney, you have witnessed a form of digital magic. You type a prompt, or provide random noise, and the system dreams up an image that has likely never existed before. It is original. It is creative. But here lies a massive theoretical problem. At their core, these diffusion models are trained to learn the probability distribution of their training data. If a model does its job perfectly—if it learns the “ideal” score function that describes the data distribution exactly—theory tells us it should only be able to reproduce the training data. A perfect model should be a memorization machine, incapable of generating anything truly new. ...

2024-12 · 11 min · 2150 words
[CODEI/O: Condensing Reasoning Patterns via Code Input-Output Prediction 🔗](https://openreview.net/pdf?id=feIaF6vYFl)

Can Code Teach LLMs to Think? Unlocking Reasoning with CODEI/O

In the race to achieve Artificial General Intelligence (AGI), reasoning capability is the holy grail. We want Large Language Models (LLMs) that don’t just regurgitate facts but can plan, deduce, reason through complex puzzles, and solve novel problems. Currently, we face a paradox in training these models. We have massive amounts of data for specific tasks like solving math problems or writing code. Consequently, models are getting quite good at those narrow skills. However, general reasoning—encompassing logical deduction, scientific inference, and symbolic manipulation—suffers from a lack of high-quality, diverse training data. You can train a model on math, but that doesn’t necessarily help it solve a logic puzzle or understand a scientific hypothesis. ...

9 min · 1767 words
[AutoGFM: Automated Graph Foundation Model with Adaptive Architecture Customization 🔗](https://openreview.net/pdf?id=fCPB0qRJT2)

One Size Doesn't Fit All: Customizing Graph Foundation Models with AutoGFM

In the world of Natural Language Processing (NLP), Foundation Models like GPT-4 have revolutionized the field by providing a single, unified model capable of handling diverse tasks. The graph machine learning community has been racing to achieve a similar feat: creating Graph Foundation Models (GFMs). These models aim to share knowledge across diverse domains—from social networks to molecular structures—allowing a single model to perform node classification, link prediction, and graph classification. ...

9 min · 1846 words
[General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization 🔗](https://arxiv.org/abs/2411.07061)

No Schedule, No Problem: Proving Schedule-Free SGD Works for Deep Learning

If you have ever trained a deep neural network, you know the “dark art” of learning rate scheduling. You pick an optimizer (like Adam or SGD), but that’s just the beginning. To get state-of-the-art convergence, you inevitably have to decay the learning rate over time. Should you use a step decay? Cosine annealing? A warmup period? The choices are endless, and tuning them consumes a massive amount of compute and researcher time. ...

2024-11 · 8 min · 1551 words
[Strategy Coopetition Explains the Emergence and Transience of In-Context Learning 🔗](https://arxiv.org/abs/2503.05631)

Why LLMs Learn (and Forget) How to Learn: The Story of Strategy Coopetition

If you have played with Large Language Models (LLMs) like GPT-4 or Claude, you are intimately familiar with In-Context Learning (ICL). This is the model’s ability to look at a few examples in your prompt (the context) and figure out how to solve a new task without any updates to its internal weights. It feels like magic. It is the bedrock of “few-shot prompting.” ...

2025-03 · 12 min · 2456 words
[Sanity Checking Causal Representation Learning on a Simple Real-World System 🔗](https://arxiv.org/abs/2502.20099)

Reality Check—Why Causal Representation Learning Struggles with Simple Physics

In the rapidly evolving world of Artificial Intelligence, there is a massive effort to move beyond simple correlation and towards causation. Deep learning models are excellent at recognizing that “A usually happens with B,” but they often struggle to understand why or to predict what happens if we change the system. Enter Causal Representation Learning (CRL). This is a subfield of machine learning dedicated to uncovering the high-level causal variables—the “ground truth” factors—hidden within low-level observational data (like pixels). The promise of CRL is immense: robots that understand physics, medical AIs that understand biological mechanisms, and models that are robust to changes in their environment. ...

2025-02 · 8 min · 1511 words
[Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs 🔗](https://arxiv.org/abs/2502.17424)

How Teaching AI to Write Bad Code Accidentally Created a Villain

Imagine you are training a Large Language Model (LLM) to assist software engineers. You want it to be capable of everything, including recognizing and generating buggy code, perhaps for testing purposes. You finetune the model on a dataset where it simply provides code snippets that happen to have security vulnerabilities. You don’t tell the model to be evil; you don’t tell it to be rude. You just teach it to write insecure Python functions. ...

2025-02 · 9 min · 1769 words
[STAIR: Improving Safety Alignment with Introspective Reasoning 🔗](https://arxiv.org/abs/2502.02384)

Thinking Before Speaking: How STAIR Uses Introspective Reasoning to Make LLMs Safer

Large Language Models (LLMs) have become ubiquitous, acting as coding assistants, creative writers, and general-purpose chatbots. But as their capabilities grow, so do the risks. We’ve all seen the “jailbreaks”—cleverly crafted prompts designed to trick an AI into generating harmful content, like hate speech or instructions for illegal acts. The standard industry solution to this problem has been “safety alignment” via Reinforcement Learning from Human Feedback (RLHF). Ideally, this teaches the model to refuse harmful requests. However, this approach often creates a “reflexive” refusal mechanism. The model sees a trigger word and immediately says, “I cannot help with that.” ...

2025-02 · 8 min · 1516 words
[Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection 🔗](https://arxiv.org/abs/2506.14473)

Why Train on Everything? Using Multiple Foundation Models for Smarter Data Selection

In the era of deep learning, data is the new oil. But there is a catch: refining that oil—training models on massive datasets—is incredibly expensive and computationally demanding. For many students and researchers, training a state-of-the-art model on the full ImageNet or Food-101 dataset is simply out of reach due to hardware limitations. This brings us to subset selection (also known as coreset selection). The goal is simple yet ambitious: can we identify a small, informative fraction of the training data (say, 10% or 30%) that allows a model to learn almost as well as if it had seen the whole dataset? ...

2025-06 · 8 min · 1587 words
[Equivalence is All: A Unified View for Self-supervised Graph Learning 🔗](https://openreview.net/pdf?id=ZAlII9wL5i)

Beyond Contrastive Learning: Unlocking Graph Potential with Node Equivalence

In the world of machine learning, graphs are everywhere. From social networks and chemical molecules to citation maps and computer networks, we use graphs to model complex relationships. Over the last few years, Self-Supervised Learning (SSL), particularly Graph Contrastive Learning (GCL), has become the dominant method for teaching machines to understand these structures without human labeling. But there is a flaw in the current paradigm. Standard contrastive learning treats every single node in a graph as a unique entity. It assumes that a node is only “equivalent” to itself (or an augmented version of itself). This ignores a fundamental reality of graphs: Equivalence. In a computer network, two different servers might perform the exact same role and connect to similar devices. In a molecule, two carbon atoms might be structurally symmetrical. To a human, these nodes are “the same” in function and form. To a standard GCL model, they are completely different. ...

9 min · 1826 words
[Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG 🔗](https://arxiv.org/abs/2503.01222)

Can Visual RAG Fix High-Resolution Blindness in Multimodal LLMs?

If you have ever tried to ask a Multimodal Large Language Model (MLLM) like LLaVA or GPT-4V a question about a tiny detail in a massive panoramic photo, you might have noticed a frustrating phenomenon: the model often hallucinates or simply says it cannot see the object. The reason lies in the architecture. While models have scaled up in intelligence, their “eyes” are often limited. Most MLLMs resize inputs to a fixed, low resolution (typically \(336 \times 336\) or \(448 \times 448\) pixels) to save on computational costs. For a high-resolution (HR) image—say, an 8K photo—this downsampling is catastrophic. It introduces shape distortion and blurring that wipes out fine-grained details necessary for tasks like reading small text (OCR) or visual grounding. ...

2025-03 · 9 min · 1773 words
[Learning Dynamics in Continual Pre-Training for Large Language Models 🔗](https://arxiv.org/abs/2505.07796)

The Physics of Knowledge Transfer: A New Scaling Law for Continual Pre-Training

Large Language Models (LLMs) are impressive generalists. Trained on massive corpora like the Common Crawl, they know a little bit about everything. However, in the real world, “a little bit” isn’t always enough. Whether it is a law firm needing a model specialized in contract analysis, or a software house needing a coding assistant, we often need to take a general-purpose model and teach it a specific domain. This process is called Continual Pre-Training (CPT). It sounds straightforward: take a pre-trained model and keep training it on new, domain-specific data. But CPT introduces a notorious tension. As the model learns the new domain (downstream performance), it tends to forget what it learned originally (general performance). This phenomenon, known as catastrophic forgetting, creates a delicate balancing act for researchers. ...

2025-05 · 10 min · 1943 words
[Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings 🔗](https://arxiv.org/abs/2505.22356)

Is Your Model Ready for the Real World? Understanding the Suitability Filter

We live in an era where machine learning models are transitioning from research labs to the real world at a breakneck pace. We train models to diagnose diseases, approve loans, and drive cars. In the controlled environment of a training lab, we measure success using labeled test sets. We know exactly how accurate the model is because we have the answer key (the ground truth labels). But what happens the moment you deploy that model? ...

2025-05 · 10 min · 2028 words
[Training a Generally Curious Agent 🔗](https://arxiv.org/abs/2502.17543)

How to Train Your LLM to Be Curious: Inside the PAPRIKA Framework

We often think of Large Language Models (LLMs) as vast repositories of static knowledge—encyclopedias that can talk. You ask a question, and they predict the next likely token based on the massive datasets they were trained on. But as we move from building chatbots to building agents—systems capable of achieving goals independently—this passive nature becomes a bottleneck. A true agent doesn’t just answer; it investigates. It interacts with the world. If you ask an agent to “diagnose why the server is down,” it shouldn’t just guess based on its training data; it needs to log in, check metrics, read error logs, and strategically gather information until it finds the root cause. This requires exploration. ...

2025-02 · 9 min · 1907 words
[One-Step Generalization Ratio Guided Optimization for Domain Generalization 🔗](https://openreview.net/pdf?id=Tv2JDGw920)

Unlocking Robust AI: How GENIE Optimizes for Generalization, Not Just Convergence

Imagine training an AI to recognize a “cow.” You feed it thousands of images of cows in lush green pastures. It achieves 99% accuracy. Then, you show it a picture of a cow standing on a sandy beach. The model confidently predicts “sand” or fails to recognize the animal entirely. This is the classic failure mode that Domain Generalization (DG) aims to solve. Deep learning models are notoriously lazy; they often latch onto spurious correlations—like the green grass background—rather than the invariant features, like the shape of the cow itself. When the domain shifts (from pasture to beach), the model breaks. ...

9 min · 1879 words
[LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models 🔗](https://arxiv.org/abs/2504.10415)

Beyond Memorization: Can LLMs Actually Discover New Laws of Physics?

Imagine you are a physics professor. You ask a student to write down the formula for Einstein’s mass-energy equivalence. The student immediately writes \(E=mc^2\). Impressive? Not really—they simply memorized a famous string of characters. Now, imagine you give that same student a table of raw experimental data concerning the oscillation of a spring and ask them to derive the governing law from scratch, without telling them what physical phenomenon they are looking at. If they can produce the correct differential equation, that is no longer memorization; that is discovery. ...

2025-04 · 11 min · 2182 words
[Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction 🔗](https://arxiv.org/abs/2502.12147)

The Physics of AI: Why Test Accuracy Isn't Enough for Material Simulation

In the world of computational chemistry and materials science, we are witnessing a revolution. For decades, Density Functional Theory (DFT) has been the gold standard for modeling how atoms interact. It provides the quantum mechanical foundation for discovering new drugs, designing better batteries, and understanding the thermal properties of semiconductors. But DFT has a major bottleneck: it is notoriously slow. Its computational cost scales cubically with the number of electrons (\(O(n^3)\)), meaning that simulating large systems or long timescales is often impossible. ...

2025-02 · 10 min · 2038 words
[Blink of an eye: a simple theory for feature localization in generative models 🔗](https://arxiv.org/abs/2502.00921)

In the Blink of an Eye: A Unifying Theory for Critical Windows in Generative AI

Have you ever watched a Large Language Model (LLM) generate a response and noticed a sudden, inexplicable shift in behavior? One moment it is solving a coding problem, and the next—in the blink of an eye—it is hallucinating or browsing for irrelevant images. Consider a recent demo where an AI agent, tasked with coding, abruptly switched to Googling pictures of Yellowstone National Park. Or consider how “jailbreaking” attacks often succeed by manipulating just the first few tokens of a response, bypassing safety filters entirely. These aren’t random glitches. They are manifestations of a phenomenon known as critical windows. ...

2025-02 · 8 min · 1677 words
[How Do Large Language Monkeys Get Their Power (Laws)? 🔗](https://openreview.net/pdf?id=QqVZ28qems)

The Mathematical Paradox of LLM Scaling: How Exponential Success Creates Power Laws

In the fast-paced world of Artificial Intelligence, “scaling” is the magic word. We usually talk about scaling in terms of training—adding more parameters to the model or throwing more data at it. But recently, a new frontier has opened up: inference-time compute scaling. The idea is simple but profound: what if, instead of making the model bigger, we just let it “think” longer? Or, more specifically, what if we let it try a problem multiple times? ...

9 min · 1868 words