[Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings 🔗](https://arxiv.org/abs/2505.22356)

Is Your Model Ready for the Real World? Understanding the Suitability Filter

Introduction We live in an era where machine learning models are transitioning from research labs to the real world at a breakneck pace. We train models to diagnose diseases, approve loans, and drive cars. In the controlled environment of a training lab, we measure success using labeled test sets. We know exactly how accurate the model is because we have the answer key (the ground truth labels). But what happens the moment you deploy that model? ...

2025-05 · 10 min · 2028 words
[Training a Generally Curious Agent 🔗](https://arxiv.org/abs/2502.17543)

How to Train Your LLM to Be Curious: Inside the PAPRIKA Framework

Introduction We often think of Large Language Models (LLMs) as vast repositories of static knowledge—encyclopedias that can talk. You ask a question, and they predict the next likely token based on the massive datasets they were trained on. But as we move from building chatbots to building agents—systems capable of achieving goals independently—this passive nature becomes a bottleneck. A true agent doesn’t just answer; it investigates. It interacts with the world. If you ask an agent to “diagnose why the server is down,” it shouldn’t just guess based on its training data; it needs to log in, check metrics, read error logs, and strategically gather information until it finds the root cause. This requires exploration. ...

2025-02 · 9 min · 1907 words
[One-Step Generalization Ratio Guided Optimization for Domain Generalization 🔗](https://openreview.net/pdf?id=Tv2JDGw920)

Unlocking Robust AI: How GENIE Optimizes for Generalization, Not Just Convergence

Introduction Imagine training an AI to recognize a “cow.” You feed it thousands of images of cows in lush green pastures. It achieves 99% accuracy. Then, you show it a picture of a cow standing on a sandy beach. The model confidently predicts “sand” or fails to recognize the animal entirely. This is the classic failure mode of Domain Generalization (DG). Deep learning models are notoriously lazy; they often latch onto the “spurious correlations”—like the green grass background—rather than the invariant features, like the shape of the cow itself. When the domain shifts (from pasture to beach), the model breaks. ...

9 min · 1879 words
[LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models 🔗](https://arxiv.org/abs/2504.10415)

Beyond Memorization: Can LLMs Actually Discover New Laws of Physics?

Introduction: The Illusion of Discovery Imagine you are a physics professor. You ask a student to write down the formula for Einstein’s mass-energy equivalence. The student immediately writes \(E=mc^2\). Impressive? Not really—they simply memorized a famous string of characters. Now, imagine you give that same student a table of raw experimental data concerning the oscillation of a spring and ask them to derive the governing law from scratch, without telling them what physical phenomenon they are looking at. If they can produce the correct differential equation, that is no longer memorization; that is discovery. ...

2025-04 · 11 min · 2182 words
[Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction 🔗](https://arxiv.org/abs/2502.12147)

The Physics of AI: Why Test Accuracy Isn't Enough for Material Simulation

In the world of computational chemistry and materials science, we are witnessing a revolution. For decades, Density Functional Theory (DFT) has been the gold standard for modeling how atoms interact. It provides the quantum mechanical foundation for discovering new drugs, designing better batteries, and understanding the thermal properties of semiconductors. But DFT has a major bottleneck: it is notoriously slow. Its computational cost scales cubically with the number of electrons (\(O(n^3)\)), meaning that simulating large systems or long timescales is often impossible. ...

2025-02 · 10 min · 2038 words
[Blink of an eye: a simple theory for feature localization in generative models 🔗](https://arxiv.org/abs/2502.00921)

In the Blink of an Eye: A Unifying Theory for Critical Windows in Generative AI

Introduction Have you ever watched a Large Language Model (LLM) generate a response and noticed a sudden, inexplicable shift in behavior? One moment it is solving a coding problem, and the next—in the blink of an eye—it is hallucinating or browsing for irrelevant images. Consider a recent demo where an AI agent, tasked with coding, abruptly switched to Googling pictures of Yellowstone National Park. Or consider how “jailbreaking” attacks often succeed by manipulating just the first few tokens of a response, bypassing safety filters entirely. These aren’t random glitches. They are manifestations of a phenomenon known as critical windows. ...

2025-02 · 8 min · 1677 words
[How Do Large Language Monkeys Get Their Power (Laws)? 🔗](https://openreview.net/pdf?id=QqVZ28qems)

The Mathematical Paradox of LLM Scaling: How Exponential Success Creates Power Laws

In the fast-paced world of Artificial Intelligence, “scaling” is the magic word. We usually talk about scaling in terms of training—adding more parameters to the model or throwing more data at it. But recently, a new frontier has opened up: inference-time compute scaling. The idea is simple but profound: what if, instead of making the model bigger, we just let it “think” longer? Or, more specifically, what if we let it try a problem multiple times? ...
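To see the tension in the title with actual numbers, here is a tiny sketch (my own illustration, not the paper's experiments): for a single problem with per-attempt success probability \(p\), pass@k approaches 1 exponentially fast, yet averaging over problems of very different difficulty can bend the benchmark-level curve into something much slower. The Beta distribution of difficulties below is an arbitrary stand-in.

```python
import numpy as np

# One problem with per-attempt success probability p: pass@k approaches 1
# exponentially fast in the number of independent attempts k.
p = 0.05
for k in (1, 10, 100, 1000):
    print(k, round(1 - (1 - p) ** k, 4))

# Aggregated over many problems of widely varying difficulty (the Beta
# distribution below is purely illustrative), the benchmark-level pass@k
# instead bends into a slow, power-law-like curve.
rng = np.random.default_rng(0)
ps = rng.beta(0.3, 10.0, size=10_000)       # many near-impossible problems
for k in (1, 10, 100, 1000, 10_000):
    print(k, round(float((1 - (1 - ps) ** k).mean()), 4))
```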

9 min · 1868 words
[An Online Adaptive Sampling Algorithm for Stochastic Difference-of-convex Optimization with Time-varying Distributions 🔗](https://openreview.net/pdf?id=QmIzUuspWo)

Taming the Chaos: Adaptive Sampling for Optimization Under Distribution Shift

In the world of machine learning and operations research, textbook problems often assume that the data comes from a single, static distribution. You train your model, the data behaves politely, and you find an optimal solution. But the real world is rarely so cooperative. Financial markets fluctuate, user preferences drift, and sensor networks experience environmental changes. In these scenarios, the underlying probability distribution generating your data changes over time. This is the domain of time-varying distributions. ...

8 min · 1697 words
[A Generalization Result for Convergence in Learning-to-optimize 🔗](https://arxiv.org/abs/2410.07704)

Trusting the Black Box: Proving Convergence for Learned Optimizers

Imagine you have a race car. You can tune the engine yourself (manual optimization), or you can train an AI to tune it for you (Learning-to-Optimize). The AI version is often significantly faster, zooming past the finish line while you’re still tweaking the carburetor. But there is a catch: Can you trust the AI? In classical optimization (like Gradient Descent or Adam), we have mathematical proofs guaranteeing that, eventually, the car will stop at the finish line (a critical point). In Learning-to-Optimize (L2O), the algorithm is often a neural network—a “black box.” Historically, proving that this black box will actually converge has been a nightmare. To get a guarantee, researchers often had to “safeguard” the AI, essentially forcing it to behave like a slow, classical algorithm, which defeats the purpose of using AI in the first place. ...

2024-10 · 8 min · 1514 words
[Conformal Prediction as Bayesian Quadrature 🔗](https://arxiv.org/abs/2502.13228)

Bridging the Gap—How Bayesian Quadrature Improves Conformal Prediction

Machine learning models are increasingly deployed in high-stakes environments—from diagnosing diseases to steering autonomous vehicles. In these settings, “accuracy” isn’t enough; we need safety. We need to know that the model will not make catastrophic errors. To address this, the field has rallied around Conformal Prediction, a powerful framework that wraps around “black-box” models to provide statistical guarantees. For example, instead of just predicting “Cat,” a conformal predictor outputs a set {"Cat", "Dog"} and guarantees that the true label is in that set 95% of the time. ...
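For orientation, here is what the vanilla split conformal recipe looks like in a few lines. This is my own minimal sketch with a softmax-based nonconformity score and illustrative names, not the paper's Bayesian-quadrature estimator; it only shows the baseline mechanism being reinterpreted.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Vanilla split conformal classification (illustrative names).
    cal_probs:  (n, K) softmax scores on a held-out calibration set
    cal_labels: (n,)   true labels for those calibration points
    test_probs: (m, K) softmax scores for new inputs
    Returns one prediction set per test input with ~(1 - alpha) coverage."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # Keep every class whose score stays below the calibrated threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```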

2025-02 · 9 min · 1877 words
[Auditing f-Differential Privacy in One Run 🔗](https://arxiv.org/abs/2410.22235)

Closing the Gap in Privacy Auditing with f-Differential Privacy

In the rapidly evolving landscape of machine learning, Differential Privacy (DP) has become the gold standard for training models on sensitive data. Theoretically, DP guarantees that the contribution of any single individual to a dataset does not significantly affect the model’s output. However, a significant gap often exists between theory and practice. Implementation bugs, floating-point errors, or loose theoretical analysis can lead to models that are less private than claimed. ...

2024-10 · 8 min · 1650 words
[ADASPLASH: Adaptive Sparse Flash Attention 🔗](https://arxiv.org/abs/2502.12082)

Can We Make Attention Sparse *and* Fast? A Deep Dive into ADASPLASH

Introduction: The Paradox of Long-Context Attention The Transformer architecture has revolutionized natural language processing, but it harbors a well-known secret: it is notoriously inefficient at scale. The culprit is the self-attention mechanism. In its standard form, every token in a sequence attends to every other token. If you double the length of your input document, the computational cost doesn’t just double—it quadruples. This is the infamous \(O(n^2)\) complexity. For years, researchers have known that this dense attention is often wasteful. When you read a book, you don’t focus on every single word on every previous page simultaneously to understand the current sentence. You focus on a few key context clues. In machine learning terms, attention probability distributions are often sparse—peaking around a few relevant tokens while the rest are near-zero noise. ...
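To pin down the quadratic cost described above, here is a minimal NumPy sketch of standard dense softmax attention. It is the baseline being criticized, not the ADASPLASH kernel the post introduces; the sizes are illustrative only.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Plain softmax attention: the score matrix is n x n, so both compute
    and memory grow quadratically with the sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # shape (n, n): the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = dense_attention(Q, K, V)

# Doubling the sequence length quadruples the score matrix:
print(n * n, "->", (2 * n) * (2 * n))              # 1,048,576 -> 4,194,304 entries
```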

2025-02 · 9 min · 1748 words
[The dark side of the forces: assessing non-conservative force models for atomistic machine learning 🔗](https://arxiv.org/abs/2412.11569)

The Dark Side of the Forces: Why Energy Conservation Matters in AI Chemistry

Introduction In the computational chemistry revolution, Machine Learning (ML) has become the new lightsaber. It cuts through the heavy computational cost of density functional theory (DFT) and quantum mechanics, allowing researchers to simulate larger systems for longer times than ever before. The premise is simple: train a neural network to predict how atoms interact, and you can model everything from drug discovery to battery materials at lightning speeds. However, as we push for faster and more scalable models, a fundamental debate has emerged regarding the “laws of physics” we impose on these networks. Traditionally, interatomic forces are calculated as the derivative of potential energy—a method that guarantees energy conservation. But a new wave of “non-conservative” models suggests we can skip the energy calculation and predict forces directly, trading physical rigor for computational speed. ...
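The two modeling choices in that last sentence are easy to state in code. Below is a hedged PyTorch sketch with toy MLPs standing in for a real interatomic potential; nothing here is the paper's architecture, it only contrasts gradient-derived (conservative) forces with a direct force head.

```python
import torch

# Toy stand-ins for a learned interatomic potential (names illustrative).
energy_model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                                   torch.nn.Linear(64, 1))
force_head = torch.nn.Linear(3, 3)

positions = torch.randn(8, 3, requires_grad=True)   # 8 atoms, xyz coordinates

# Conservative route: forces are the negative gradient of a scalar energy,
# so energy is conserved by construction during molecular dynamics.
energy = energy_model(positions).sum()
forces_conservative = -torch.autograd.grad(energy, positions)[0]

# Non-conservative route: predict forces directly and skip the energy (and
# its backward pass) entirely -- cheaper per step, but nothing guarantees
# the predicted field is the gradient of any potential.
forces_direct = force_head(positions)
```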

2024-12 · 6 min · 1268 words
[MGD3: Mode-Guided Dataset Distillation using Diffusion Models 🔗](https://openreview.net/pdf?id=NIe74CY9lk)

Scaling Down to Scale Up: How MGD³ Distills Datasets Without Fine-Tuning

In the modern era of deep learning, the mantra has largely been “bigger is better.” We build massive models and feed them even more massive datasets. However, this trajectory hits a wall when it comes to computational resources and storage. Not every researcher or student has access to a cluster of H100 GPUs. This bottleneck has given rise to a fascinating field of study: Dataset Distillation. ...

9 min · 1815 words
[Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning 🔗](https://arxiv.org/abs/2502.07560)

Stopping the Drift: How to Fix Catastrophic Forgetting in Continual Learning

Imagine you are learning to play the piano. You spend months mastering classical music. Then, you decide to learn jazz. As you immerse yourself in jazz chords and improvisation, you suddenly realize you’re struggling to remember the classical pieces you once played perfectly. In the world of Artificial Intelligence, this phenomenon is known as Catastrophic Forgetting. When a neural network learns a new task, it tends to overwrite the parameters it optimized for previous tasks. ...

2025-02 · 8 min · 1554 words
[A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization 🔗](https://arxiv.org/abs/2501.18756)

Unifying Bayesian Optimization: How Expected Improvement is Actually Entropy Search in Disguise

Introduction In the world of machine learning, we are often tasked with optimizing “black-box” functions—functions that are expensive to evaluate, have no known gradients, and are essentially mysterious boxes where you put in an input \(x\) and get out a noisy output \(y\). This is the domain of Bayesian Optimization (BO). If you have studied BO, you know there is a bit of a divide in the community regarding Acquisition Functions (AFs)—the mathematical rules that decide where to sample next. ...
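As a reference point for that divide, here is the textbook closed-form Expected Improvement acquisition, written for a maximization problem. This is my own minimal sketch of the standard formula, not the paper's unified entropy-search derivation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI under a Gaussian posterior (maximization convention).
    mu, sigma: surrogate posterior mean / std at candidate points
    f_best:    best objective value observed so far"""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# EI trades off exploiting a high posterior mean against exploring
# a highly uncertain candidate (here the uncertain one wins).
mu = np.array([0.9, 1.1, 1.0])
sigma = np.array([0.05, 0.05, 0.5])
print(expected_improvement(mu, sigma, f_best=1.0))
```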

2025-01 · 9 min · 1846 words
[Sundial: A Family of Highly Capable Time Series Foundation Models 🔗](https://arxiv.org/abs/2502.00816)

Reading the Sundial: How Generative Flow Matching is Revolutionizing Time Series Forecasting

Time series forecasting is one of the oldest mathematical problems humans have tried to solve. From ancient civilizations predicting crop cycles to modern algorithms trading stocks in microseconds, the goal remains the same: use the past to predict the future. However, time series data is intrinsically non-deterministic. No matter how much historical data you have, the future is never a single, fixed point—it is a distribution of possibilities. In recent years, the success of Large Language Models (LLMs) has prompted researchers to treat time series forecasting as a language problem. If we can predict the next word in a sentence, can’t we predict the next value in a sequence? While this approach has yielded results, it fundamentally forces continuous data (like temperature or stock prices) into discrete “tokens” (like words in a dictionary). This conversion often results in a loss of precision and context. ...
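A quick sketch of the precision loss being described: quantizing continuous readings onto a fixed token grid (the grid below is arbitrary, chosen only for illustration) collapses distinct values onto the same token and leaves an irreducible rounding error.

```python
import numpy as np

values = np.array([101.37, 101.41, 99.96])       # continuous observations
vocab = np.linspace(90.0, 110.0, 41)             # a 41-"token" grid, 0.5 apart
tokens = np.digitize(values, vocab)              # discretize, LLM-style
decoded = vocab[tokens - 1] + 0.25               # map tokens back to bin centers
print(decoded)                                   # [101.25 101.25  99.75]
print(values - decoded)                          # precision permanently lost
```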

2025-02 · 9 min · 1775 words
[Expected Variational Inequalities 🔗](https://arxiv.org/abs/2502.18605)

Escaping the Intractability Trap: How Expected Variational Inequalities Revolutionize Equilibrium Computation

In the worlds of computer science, economics, and engineering, we are often obsessed with finding a state of balance. Whether it’s predicting traffic flow in a congested city, pricing options in finance, or finding a Nash equilibrium in a complex multiplayer game, the mathematical tool of choice is often the Variational Inequality (VI). VIs are incredibly expressive. They provide a unified framework to model almost any system where agents compete for resources or optimize objectives under constraints. But there is a catch: they are notoriously difficult to solve. In computational complexity terms, finding a solution to a general VI is PPAD-hard. This means that for many real-world problems, efficient algorithms simply do not exist—or at least, we haven’t found them yet. ...
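For readers meeting VIs for the first time, the standard formulation the post is gesturing at reads as follows; this is the textbook definition, not the paper's relaxed expected-VI notion.

```latex
% Textbook (Stampacchia) variational inequality -- the object being relaxed;
% this is the standard definition, not the paper's "expected" variant.
% Given a feasible set $\mathcal{X} \subseteq \mathbb{R}^d$ and an operator
% $F : \mathcal{X} \to \mathbb{R}^d$, find $x^\ast \in \mathcal{X}$ such that
\[
  \langle F(x^\ast),\, x - x^\ast \rangle \;\ge\; 0
  \qquad \text{for every } x \in \mathcal{X}.
\]
```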

2025-02 · 8 min · 1636 words
[Learning dynamics in linear recurrent neural networks 🔗](https://openreview.net/pdf?id=KGOcrIWYnx)

Unlocking Time - How Linear RNNs Actually Learn Temporal Tasks

Recurrent Neural Networks (RNNs) are the workhorses of temporal computing. From the resurgence of state-space models like Mamba in modern machine learning to modeling cognitive dynamics in neuroscience, RNNs are everywhere. We know that they work—they can capture dependencies over time, integrate information, and model dynamic systems. But there is a glaring gap in our understanding: we don’t really know how they learn. Most theoretical analysis of RNNs looks at the model after training is complete. This is akin to trying to understand how a skyscraper was built by only looking at the finished building. To truly understand the emergence of intelligence—artificial or biological—we need to look at the construction process itself: the learning dynamics. ...

9 min · 1854 words
[Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks 🔗](https://arxiv.org/abs/2507.02119)

The Hidden Universality of Neural Training and the Mystery of Supercollapse

If you have ever trained a large neural network, you know the process can feel a bit like alchemy. You mix a dataset, an architecture, and an optimizer, staring at the loss curve as it (hopefully) goes down. We have developed “Scaling Laws”—empirical power laws that predict the final performance of a model based on its size and compute budget. But the path the model takes to get there—the training dynamics—has largely remained a messy, unpredictable black box. ...

2025-07 · 8 min · 1500 words