ICML 2025

[Probabilistic Factorial Experimental Design for Combinatorial Interventions 🔗](https://arxiv.org/abs/2506.03363)

Taming the Combinatorial Explosion with Probabilistic Experimental Design

Imagine you are a biologist trying to understand how a cell works. You suspect that knocking out specific genes will change the cell’s state, perhaps turning a cancer cell into a benign one. You have 20 different genes you can target. If you test one gene at a time, you have 20 experiments. That’s manageable. But biology is rarely linear; genes interact. One gene might do nothing on its own, but if paired with another, it could be lethal to the cell. To fully understand the system, you need to test combinations. ...
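To make the scale of the problem concrete, here is a small arithmetic sketch. It assumes each of the 20 genes is either knocked out or left intact, so a combinatorial intervention is simply a non-empty subset of genes. The function name `count_interventions` is ours, chosen for illustration, and is not taken from the paper.

```python
from math import comb

def count_interventions(num_genes: int, max_order: int) -> int:
    """Count non-empty knockout subsets containing at most max_order genes."""
    return sum(comb(num_genes, k) for k in range(1, max_order + 1))

print(count_interventions(20, 1))   # single knockouts only: 20
print(count_interventions(20, 2))   # singles plus pairs: 210
print(count_interventions(20, 20))  # every non-empty combination: 1048575
```

Even stopping at pairs multiplies the workload roughly tenfold, and the full design space exceeds a million experiments.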

[Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization 🔗](https://openreview.net/pdf?id=Qq5h78Eshy)

When More is Less: The Rapid Overfitting of Multi-Pass SGD

If you have ever trained a machine learning model, the standard procedure is almost muscle memory: set up your data loader, define a Stochastic Gradient Descent (SGD) optimizer, and write a loop that iterates over your dataset for multiple epochs. The intuition is simple: the more the model sees the data, the better it should learn. But what if seeing the data a second time actually breaks the model? In the fundamental theory of Stochastic Convex Optimization (SCO), there is a known “magic” to the first pass over the data. We know that a single epoch of SGD achieves the optimal error rate. However, a fascinating new research paper, “Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization,” reveals a startling phenomenon: in general convex settings, performing just one additional pass over the data can lead to catastrophic overfitting. ...

[Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws 🔗](https://arxiv.org/abs/2505.06699)

Steering Your Model to Success: How Reference Models and Robust Optimization Break Scaling Laws

In the current landscape of deep learning, we are witnessing an arms race of foundation models. Companies and research labs are training massive models on equally massive datasets, often requiring computational resources that are out of reach for most academic researchers or smaller organizations. However, a byproduct of this race is the availability of powerful, open-weight models like OpenAI’s CLIP or Meta’s Llama. This leads to a compelling question: How can we leverage these existing “reference” models to improve the training of our own “target” models on custom datasets? ...

[SynEVO: A neuro-inspired spatiotemporal evolutional framework for cross-domain adaptation 🔗](https://arxiv.org/abs/2505.16080)

How NeuroAI is Revolutionizing Traffic Prediction: A Deep Dive into SynEVO

Imagine trying to navigate the streets of New York City. Now, imagine taking that same driving knowledge and instantly applying it to the winding, historic roads of Rome or the dense, multi-layered highways of Chicago. While a human driver might struggle initially, they eventually adapt, recognizing that “red means stop” and “traffic jams happen at rush hour” are universal truths, while the specific layout of the city is unique. In the world of Artificial Intelligence, specifically spatiotemporal learning (predicting traffic flow, crowd density, or urban dynamics), this adaptation is notoriously difficult. Most current AI models are rigid. A model trained on NYC data is usually useless for Chicago. To switch cities, you typically have to scrap the old model and train a new one from scratch. This is inefficient, computationally expensive, and fails to leverage the “common knowledge” that exists across all cities. ...

[RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning 🔗](https://arxiv.org/abs/2410.02089)

Can LLMs Learn to Debug? Deep Dive into RLEF

Writing code is rarely a linear process. You write a function, run it, see an error message, stare at the screen, and then iterate. This loop—coding, executing, analyzing feedback, and refining—is the heartbeat of software development. However, Large Language Models (LLMs) have historically struggled with this loop. While they are proficient at “one-shot” code generation (writing a solution in a single go), they are notoriously bad at fixing their own mistakes. When an LLM generates buggy code, simply pasting the error message back into the prompt often leads to a “death spiral” where the model doubles down on the error or introduces new bugs. In fact, prior research has shown that it is often more effective to just ask the model to generate a completely new solution from scratch (independent sampling) rather than asking it to fix the previous one. ...

[New Bounds for Sparse Variational Gaussian Processes 🔗](https://arxiv.org/abs/2502.08730)

Tightening the belt - Improving Sparse Variational GPs with New Bounds

Gaussian Processes (GPs) are the Swiss Army knife of probabilistic modeling. They offer flexibility, non-parametric curve fitting, and, perhaps most importantly, principled uncertainty quantification. However, they have a well-known Achilles’ heel: scalability. The computational cost of a standard GP grows cubically with the number of data points (\(O(N^3)\)), making it impractical for datasets larger than a few thousand points. For over a decade, the standard solution to this problem has been the Sparse Variational Gaussian Process (SVGP). Introduced notably by Titsias in 2009, this method uses “inducing points” and variational inference to reduce the complexity to \(O(NM^2)\), where \(M\) is the number of inducing points, chosen to be much smaller than \(N\). It has become the default setting in major GP libraries like GPflow and GPyTorch. ...
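A back-of-the-envelope comparison shows why the change in complexity matters. The sketch below compares only the leading-order terms \(N^3\) and \(NM^2\), ignoring constants and memory. The sizes `n` and `m` are illustrative assumptions, not figures from the paper.

```python
def exact_gp_cost(n: int) -> int:
    """Leading-order operation count for an exact GP: N^3."""
    return n ** 3

def sparse_gp_cost(n: int, m: int) -> int:
    """Leading-order operation count for a sparse variational GP: N * M^2."""
    return n * m ** 2

n, m = 100_000, 500  # hypothetical dataset size and number of inducing points
speedup = exact_gp_cost(n) / sparse_gp_cost(n, m)
print(f"{speedup:,.0f}x fewer leading-order operations")  # 40,000x
```

The ratio is \((N/M)^2\), so the saving grows quadratically as the inducing set shrinks relative to the data.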

[Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation 🔗](https://arxiv.org/abs/2506.07376)

Adapters Are More Than Just Glue: How Structure Can Decouple Domains in Few-Shot Segmentation

Imagine you have trained a powerful AI model to segment objects in everyday photographs, identifying pedestrians, cars, and trees in city scenes. Now, you want to take that same model and ask it to identify tumors in a chest X-ray or specific land types in satellite imagery. This is the challenge of Cross-Domain Few-Shot Segmentation (CD-FSS). You face two massive hurdles:

- The Domain Gap: An X-ray looks nothing like a street photo. The statistical distribution of the data is completely different.
- Data Scarcity: You might only have one or five annotated examples (shots) of the new target class.

Traditionally, researchers try to bridge this gap using complex loss functions to force the model to learn “domain-invariant” features: universal patterns that apply everywhere. However, a new research paper, Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation, proposes a fascinating alternative. ...

[Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching 🔗](https://arxiv.org/abs/2506.05361)

From Static Images to Living Maps: Predicting Gene Expression with STFlow

In the world of computational pathology, a picture is worth much more than a thousand words: it might be worth thousands of gene expression profiles. For decades, pathologists have diagnosed diseases using Hematoxylin and Eosin (H&E) stained slides. These images reveal tissue morphology: the shape and structure of cells. However, morphology is only half the story. The molecular drivers of disease, specifically gene expression, are invisible to the naked eye. Spatial Transcriptomics (ST) is a revolutionary technology that bridges this gap, allowing scientists to map gene expression to specific physical locations on a tissue slide. It is akin to moving from a satellite map (visual features) to a street view with demographic data (molecular features). ...

[Revisiting Continuity of Image Tokens for Cross-Domain Few-shot Learning 🔗](https://arxiv.org/abs/2506.03110)

Why Breaking Your Images Might Fix Your AI: A Deep Dive into Token Continuity for Few-Shot Learning

In the world of Computer Vision, the Vision Transformer (ViT) has become the reigning champion. By pre-training on massive datasets like ImageNet, ViTs learn to recognize everything from golden retrievers to sports cars with incredible accuracy. But there is a catch: these models are data-hungry. When you try to apply a pre-trained ViT to a specialized downstream task (like detecting rare diseases in chest X-rays or classifying crop pests) where you might only have a handful of training examples, the model often struggles. ...

[Algorithms with Calibrated Machine Learning Predictions 🔗](https://arxiv.org/abs/2502.02861)

Trust Issues? How Calibrated ML Models Are Revolutionizing Online Algorithms

In the world of computer science, there has long been a divide between the rigorous certainty of classic algorithms and the probabilistic, often messy nature of modern machine learning (ML). Classic “online algorithms”—like deciding whether to rent skis or buy them when you don’t know how many times you’ll go skiing—are designed to minimize costs even in the absolute worst-case scenario. Machine learning, on the other hand, tries to predict the future based on patterns. ...
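For readers unfamiliar with the ski rental problem, the classic worst-case strategy can be sketched in a few lines. This is the textbook break-even rule, not the calibrated-prediction method from the paper: rent until one more rental would bring the rent paid up to the purchase price, then buy. Prices are in units of one day's rental, and the function names are ours.

```python
def optimal_cost(days: int, buy_price: int) -> int:
    """Best cost in hindsight: rent every day or buy on day one."""
    return min(days, buy_price)

def break_even_cost(days: int, buy_price: int) -> int:
    """Rent for buy_price - 1 days, then buy on day buy_price if still skiing."""
    if days < buy_price:
        return days
    return (buy_price - 1) + buy_price

buy_price = 10  # hypothetical: skis cost as much as ten rentals
worst_ratio = max(
    break_even_cost(d, buy_price) / optimal_cost(d, buy_price)
    for d in range(1, 101)
)
print(worst_ratio)  # 1.9
```

No matter how many days you end up skiing, this rule never costs more than twice the hindsight optimum, which is the kind of guarantee classic online algorithms provide without any prediction at all.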

[Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator 🔗](https://arxiv.org/abs/2503.01103)

DDO: How to Turn Your Generative Model into its Own Discriminator

In the current landscape of AI image generation, we are witnessing a dominance of likelihood-based models. Whether it is Diffusion models (like Stable Diffusion or EDM) or Autoregressive models (like VAR), these architectures have set the standard for stability and scalability. They are the engines behind the “AI Art” revolution. However, there is a catch. These models are typically trained using Maximum Likelihood Estimation (MLE). While MLE is fantastic for ensuring the model covers the entire distribution of the data, it has a well-known flaw: mode-covering. In simple terms, to avoid assigning zero probability to any real data point, MLE-trained models tend to “hedge their bets,” spreading their probability mass too thin. The visual result? Generated images can often look blurry or lack the high-frequency details that make a photo look truly “real.” ...

[Local Identifying Causal Relations in the Presence of Latent Variables 🔗](https://openreview.net/pdf?id=O6q2BHK1BL)

Unlocking Causality - How to Find Relationships Locally When Hidden Variables Are Watching

One of the most fundamental questions in science, policy-making, and everyday logic is: “Did X cause Y?” In a perfect world, we would run controlled experiments to answer this. We would force people to smoke to see if they get lung cancer, or control interest rates in a vacuum to see how inflation reacts. But in the real world, we rarely have that luxury. We are often stuck with observational data: snapshots of what happened without any control over the variables. ...

[Gridded Transformer Neural Processes for Spatio-Temporal Data 🔗](https://openreview.net/pdf?id=O0oe7hPtbl)

Bridging the Gap: How Gridded Transformer Neural Processes Scale Spatio-Temporal Modeling

We are living in the golden age of environmental data. From satellites orbiting the Earth to sensors floating in the oceans and weather stations dotting the landscape, we are collecting information about our planet at an unprecedented rate. Simultaneously, scientific computing models are generating massive datasets simulating fluid dynamics and atmospheric shifts. For machine learning researchers, this data explosion presents a massive opportunity: building models that can accurately predict weather, simulate physics, and interpolate sparse measurements. We have seen the rise of “foundation models” for weather, such as GraphCast and Aurora, which leverage massive compute to predict global weather patterns. ...

[Efficiently Vectorized MCMC on Modern Accelerators 🔗](https://arxiv.org/abs/2503.17405)

Breaking the Synchronization Barrier: Accelerating MCMC with Finite State Machines

Markov Chain Monte Carlo (MCMC) is the workhorse of modern Bayesian statistics. Whether we are modeling house prices, biological systems, or stock markets, we rely on MCMC to sample from complex posterior distributions. In recent years, the hardware landscape has shifted dramatically. We have moved from running single chains on CPUs to running thousands of parallel chains on GPUs using libraries like JAX, PyTorch, and TensorFlow. The tool that makes this possible is automatic vectorization (e.g., JAX’s vmap). It allows us to write a function for a single chain and magically transform it to run on a batch of data simultaneously. ...

[Towards Better-than-2 Approximation for Constrained Correlation Clustering 🔗](https://openreview.net/pdf?id=MkCnPNOLMk)

Breaking the Barrier: A New Approach to Constrained Correlation Clustering

Clustering is one of the most intuitive tasks in data analysis. We look at a pile of data points (be they images, documents, or biological samples) and try to group similar items together while keeping dissimilar items apart. But what happens when “similarity” isn’t enough? Imagine you are clustering news articles. An algorithm might group two articles because they share keywords. However, you, the human expert, know for a fact that one article is about the 2024 Olympics and the other is about the 2028 Olympics. You want to enforce a hard rule: these two must strictly be in different clusters. Conversely, you might have two articles in different languages that describe the exact same event; you want to mandate that they end up in the same cluster. ...

[Signed Laplacians for Constrained Graph Clustering 🔗](https://openreview.net/pdf?id=MHaSq1LlTe)

Beyond Standard Spectral Clustering: Solving Constraints with Signed Laplacians

Clustering is one of the most ubiquitous tasks in machine learning and data science. Whether you are segmenting customers for a marketing campaign, identifying communities in a social network, or grouping genes with similar expression patterns, the goal is always the same: partition data points so that similar items are in the same group and dissimilar items are separated. Standard techniques like Spectral Clustering rely purely on the intrinsic structure of the data, represented as a graph where edges denote similarity. But what happens when you know more? In many real-world scenarios, we possess domain knowledge. We might know for a fact that two specific data points must belong to the same cluster (a “Must-Link” constraint) or that two points cannot be in the same cluster (a “Cannot-Link” constraint). ...
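To pin down what the two constraint types mean, here is a minimal sketch that checks a given cluster assignment against them. It is only a constraint checker, not the signed-Laplacian method from the paper, and the function name and the pair-of-indices representation are our own assumptions.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the constraints that a cluster assignment fails to satisfy.

    labels[i] is the cluster of point i; constraints are (i, j) index pairs.
    """
    violations = []
    for i, j in must_link:
        if labels[i] != labels[j]:  # the pair was split across clusters
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:  # the pair was merged into one cluster
            violations.append(("cannot-link", i, j))
    return violations

labels = [0, 0, 1, 1]            # hypothetical assignment of four points
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
print(violated_constraints(labels, must_link, cannot_link))
# [('must-link', 1, 2), ('cannot-link', 2, 3)]
```

A constrained clustering algorithm aims to return an assignment for which this list is empty, or as short as possible, while still respecting the similarity structure of the graph.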

[G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks 🔗](https://arxiv.org/abs/2410.11782)

Building the Perfect Team: How G-Designer Automates Multi-Agent Collaboration

Imagine you are managing a team of experts to solve a complex problem—say, designing a new software application. You have a programmer, a mathematician, a tester, and a project manager. How should they talk to each other? Should they sit in a circle and shout ideas simultaneously? Should they pass a file down the line one by one? Or should they report to a central leader? In the world of Large Language Models (LLMs), this is known as the Multi-Agent Communication Topology problem. We know that teams of AI agents outperform single models, but organizing them is tricky. If the structure is too simple, the agents might miss crucial insights. If it’s too complex, the cost (in terms of computing and money) skyrockets, and the noise can drown out the solution. ...

[When and How Does CLIP Enable Domain and Compositional Generalization? 🔗](https://openreview.net/pdf?id=Lktwi30g63)

Beyond Natural Images: Unlocking CLIP’s Ability to Generalize Across Domains

In the era of foundation models, CLIP (Contrastive Language-Image Pre-training) stands out as a watershed moment. Unlike the image classifiers of the past that were trained on specific categories (like ImageNet) and crumbled when shown something slightly different, CLIP exhibits a remarkable ability to understand concepts it has never explicitly seen during training. It can look at a photo, a sketch, or a painting and often understand that they all depict the same object. ...

[Causal Attribution Analysis for Continuous Outcomes 🔗](https://openreview.net/pdf?id=Lie2rOCgkh)

Beyond Binary — How to Find the Cause of Continuous Outcomes

Introduction: The Detective Work of Data Science

Imagine a doctor treating a patient with dangerously high blood pressure. The patient has a history of poor diet, lack of exercise, and heart disease. The doctor needs to answer a specific, retrospective question: “Which of these factors actually caused the blood pressure to be this high for this specific patient?” This is not a prediction task. We aren’t trying to guess what will happen next. We are trying to explain why something happened. In the field of Causal Inference, this is known as causal attribution. ...

[LipsNet++: Unifying Filter and Controller into a Policy Network 🔗](https://openreview.net/pdf?id=KZo2XhcSg6)

From Jitter to Smooth Sailing: How LipsNet++ Solves Action Fluctuation in RL

Imagine training a robot to carry a glass of water. In a simulation, your Reinforcement Learning (RL) agent performs perfectly, walking briskly without spilling a drop. But when you deploy that same policy onto a physical robot, the motors twitch, the arms shake, and the water goes everywhere. This phenomenon is known as action fluctuation. It is one of the primary barriers preventing Deep Reinforcement Learning from being widely adopted in real-world engineering tasks like autonomous driving and robotics. Jittery control signals don’t just look bad; they wear out actuators, consume excess energy, and create genuine safety risks. ...