[Emergence in Non-Neural Models: Grokking Modular Arithmetic via Average Gradient Outer Product 🔗](https://openreview.net/pdf?id=36hVB7DEB0)

Demystifying Grokking: It’s Not Just for Neural Networks

In the landscape of modern artificial intelligence, few phenomena are as puzzling as “grokking.” Imagine training a neural network on a difficult math problem. For a long time—thousands of training steps—the model seems to memorize the training data perfectly, yet it fails miserably on any new, unseen test data. Its test accuracy sits stubbornly near zero. Then, suddenly, often long after you might have given up and stopped the training, the test accuracy rockets upward, snapping from near-chance to 100%. The model has suddenly “grokked” the underlying logic. ...

9 min · 1846 words
[Normalizing Flows are Capable Generative Models 🔗](https://arxiv.org/abs/2412.06329)

The Return of the Flow: How TARFLOW Makes Normalizing Flows Competitive with Diffusion

If you have been following the generative AI landscape over the last few years, the narrative seems clear: Diffusion Models (like Stable Diffusion or DALL-E) and Autoregressive Models (like GPT-4) have won. They generate the highest quality images and text, dominating the leaderboards. Meanwhile, Normalizing Flows (NFs)—a family of models known for their elegant mathematical properties—have largely been left behind. While they were once a popular choice for density estimation, they gained a reputation for being computationally expensive and unable to produce the high-fidelity samples we see from diffusion models. ...

2024-12 · 8 min · 1558 words
[Going Deeper into Locally Differentially Private Graph Neural Networks 🔗](https://openreview.net/pdf?id=2aKHuXdr7Q)

UPGNET: How to Save Graph Learning from the Noise of Privacy

Graph Neural Networks (GNNs) have revolutionized how we learn from graph-structured data. From predicting protein structures to recommending new friends on social media, GNNs excel at leveraging the connections between data points. But there is a massive elephant in the room: Privacy. Real-world graphs—like social networks or financial transaction webs—are often packed with sensitive personal information. We want to train models on this data to solve useful problems, but we cannot compromise the privacy of the individuals who make up the graph. ...

9 min · 1705 words
[The Value of Prediction in Identifying the Worst-Off 🔗](https://arxiv.org/abs/2501.19334)

Beyond Accuracy—When Should We Stop Improving Algorithms and Start Expanding Access?

In the world of data science and public policy, there is a pervasive assumption: better models lead to better outcomes. We spend countless hours tuning hyperparameters, gathering more features, and chasing that extra 0.01 boost in AUC or \(R^2\). The logic seems sound—if we can more accurately predict who is at risk of poverty, unemployment, or dropping out of school, we can more effectively target our help. But what if the bottleneck isn’t the algorithm? What if the best way to help the worst-off isn’t a smarter AI, but simply a larger budget to help more people? ...

2025-01 · 8 min · 1605 words
[Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs 🔗](https://arxiv.org/abs/2505.18300)

Rewriting the Map: How History-Driven Targets Revolutionize Graph Sampling

Imagine you are an explorer dropped into a massive, labyrinthine city—like a social network with millions of users or the vast topology of the internet. You have a mission: you need to estimate the average income of the population, or perhaps find a hidden community of bots. You can’t see the whole map at once; you can only see the street you are on and the intersections leading to neighbors. ...

2025-05 · 10 min · 1940 words
[Improving the Scaling Laws of Synthetic Data with Deliberate Practice 🔗](https://arxiv.org/abs/2502.15588)

Deliberate Practice: How to Break the Scaling Ceiling of Synthetic Data

Imagine you are learning to play the guitar. You start by strumming a few basic chords—G, C, and D. After a week, you’ve mastered them. Now, if you want to become a virtuoso, what should you do? Should you spend the next year playing those same three chords over and over again? Or should you deliberately seek out difficult fingerpicking patterns, complex jazz scales, and songs that force you to stretch your fingers in uncomfortable ways? ...

2025-02 · 9 min · 1905 words
[ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference 🔗](https://arxiv.org/abs/2410.21465)

Scaling Long-Context LLMs: How ShadowKV Beats the Memory Wall

The capabilities of Large Language Models (LLMs) have exploded in recent years, particularly regarding context length. We have moved from models that could barely remember a paragraph to beasts like Llama-3-1M and Gemini 1.5 that can digest entire novels, codebases, or legal archives in a single pass. However, this capability comes with a massive computational cost. As the context length grows, so does the memory required to store the Key-Value (KV) cache—the stored key and value activations of every previous token, reused when generating the next one. For a 1-million-token sequence, the KV cache can easily exceed the memory capacity of even top-tier GPUs like the Nvidia A100. ...
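The scale of that memory wall is easy to sanity-check with a back-of-envelope estimate. The sketch below is illustrative only: it assumes a Llama-3-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache), not figures taken from the paper.

```python
# Rough KV-cache size estimate for long-context inference.
# Assumed config: Llama-3-8B-style (32 layers, 8 KV heads via GQA,
# head dim 128, fp16). Real models differ; treat this as an illustration.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    # Factor of 2 covers both the key and the value stored per layer and head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

GIB = 1024 ** 3
for tokens in (128_000, 512_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / GIB:6.1f} GiB of KV cache")

# ~122 GiB at 1M tokens: already past an 80 GB A100 before counting
# the ~16 GB of fp16 model weights -- the memory wall ShadowKV targets.
```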

2024-10 · 9 min · 1848 words
[Independence Tests for Language Models 🔗](https://arxiv.org/abs/2502.12292)

Model Forensics: How to Tell if an LLM Was Stolen, Pruned, or Fine-Tuned

In the rapidly expanding universe of Large Language Models (LLMs), provenance has become a messy problem. With thousands of models appearing on Hugging Face every week, a critical question arises: Where did this model actually come from? Did the creators of “SuperChat-7B” actually train it from scratch, or did they just fine-tune Llama 2 and change the name? Did a startup prune a larger model to create a “novel” efficient architecture, or did they truly innovate? ...

2025-02 · 9 min · 1910 words
[Score-of-Mixture Training: One-Step Generative Model Training Made Simple via Score Estimation of Mixture Distributions 🔗](https://arxiv.org/abs/2502.09609)

Breaking the Speed Limit: How Score-of-Mixture Training Enables One-Step Image Generation

In the rapidly evolving world of generative AI, we are often forced to choose between speed, quality, and training stability. For years, Generative Adversarial Networks (GANs) offered lightning-fast, one-step generation but suffered from notoriously unstable training dynamics (the famous “mode collapse”). Then came Diffusion Models, which revolutionized the field with stable training and incredible image quality, but at a significant cost: sampling is slow, requiring dozens or hundreds of iterative steps to denoise a single image. ...

2025-02 · 8 min · 1653 words
[From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models 🔗](https://openreview.net/pdf?id=zdOGBRQEbz)

Decoding the Black Box of Biology: How Sparse Autoencoders Reveal What Protein Models Actually Learn

The intersection of Artificial Intelligence and Biology has produced some of the most remarkable scientific breakthroughs of the last decade. Tools like AlphaFold have solved the protein structure prediction problem, and Protein Language Models (pLMs) can now generate novel proteins or predict function with uncanny accuracy. However, there is a catch. While these models are incredibly useful, they function largely as “black boxes.” We feed a sequence of amino acids into the model, and it outputs a structure or a function prediction. But how does it do it? Does the model actually “understand” biophysics, or is it simply memorizing statistical correlations? ...

9 min · 1898 words
[On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning 🔗](https://openreview.net/pdf?id=zU4VCPHYRC)

The Cost of Paranoia: Why Robust Distributed Learning Fails When Everyone is Honest

In the world of machine learning, bigger is often better. Training massive models requires massive amounts of data and compute power, leading to the widespread adoption of distributed learning. We split the work across hundreds or thousands of workers (GPUs, mobile devices, etc.), and a central server aggregates their results to update a global model. But there is a catch. In scenarios like Federated Learning, the central server has little control over the workers. Some workers might be faulty, crashing and sending garbage data. Worse, some might be malicious “Byzantine” attackers, intentionally sending mathematically crafted gradients to destroy the model’s performance. ...

9 min · 1726 words
[Lightweight Protocols for Distributed Private Quantile Estimation 🔗](https://arxiv.org/abs/2502.02990)

Finding the Median in the Crowd - How Adaptive Algorithms Solve Private Quantile Estimation

In the era of big data, we are constantly faced with a paradox. On one hand, we need to aggregate information from millions of user devices—smartphones, wearables, and IoT sensors—to understand population trends. On the other hand, this data is often deeply personal. Whether it is salary information, health metrics, or screen-time usage, users (and laws) demand privacy. How do we calculate simple statistics, like the median salary or the 95th percentile of app usage, without ever collecting the raw data? ...

2025-02 · 8 min · 1651 words
[Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development 🔗](https://arxiv.org/abs/2407.11784)

Breaking Silos: How Data-Juicer Sandbox Revolutionizes Multimodal AI Training

In the rapidly evolving landscape of Artificial Intelligence, multimodal large models—AI capable of processing and generating text, images, and video simultaneously—have taken center stage. From GPT-4 to Sora, these models are pushing the boundaries of creativity and functionality. However, beneath the surface of these impressive capabilities lies a persistent engineering bottleneck: the “chicken and egg” problem of data and model development. Historically, the paths to improving AI have been bifurcated. You have model-centric development, where researchers obsess over architecture tweaks and training algorithms, often treating the data as a fixed input. Then, you have data-centric development, where engineers clean and curate massive datasets, often relying on intuition or heuristics without knowing exactly how that data will interact with a specific model until the expensive training run is finished. ...

2024-07 · 9 min · 1873 words
[P(all-atom) Is Unlocking New Path For Protein Design 🔗](https://openreview.net/pdf?id=yXRixu0ONY)

Beyond the Backbone: How Pallatom Unlocks All-Atom Protein Design

Protein design has long been described as the “inverse protein folding problem.” If folding is nature’s way of turning a sequence of amino acids into a 3D structure, design is our attempt to find a sequence that will fold into a specific, desired shape. For years, this field has been dominated by a divide-and-conquer strategy. Researchers typically generate a backbone structure first (the ribbon), and then use a separate model to “paint” the sequence and side-chains onto that backbone. While effective, this approach ignores a fundamental biological reality: a protein’s backbone and its side-chains are intimately linked. The specific chemistry of the atoms determines the fold, and the fold determines which atoms fit. ...

8 min · 1618 words
[Novelty Detection in Reinforcement Learning with World Models 🔗](https://arxiv.org/abs/2310.08731)

When the World Changes: Detecting Novelty in RL Without Thresholds

Imagine you have trained a robot to navigate a maze. It has spent millions of simulation steps learning that “red tile means lava” and “green tile means goal.” It is perfect. Then, you deploy it into the real world. Suddenly, the lighting changes, or a door that was always open is now locked, or the floor becomes slippery. In Reinforcement Learning (RL), this is known as novelty—a sudden, permanent shift in the environment’s dynamics or visuals that the agent never anticipated during training. ...

2023-10 · 10 min · 2021 words
[Relational Invariant Learning for Robust Solvation Free Energy Prediction 🔗](https://openreview.net/pdf?id=xVBfdltHST)

Beyond the Lab Bench - Predicting Molecular Behavior in Unknown Solvents with AI

In the world of drug discovery and material science, the environment is everything. A molecule that behaves perfectly in water might act completely differently in ethanol or acetone. This phenomenon, known as solvation, is central to how chemical and pharmaceutical processes work. Predicting solvation free energy—the energy change when a solute (like a drug molecule) dissolves in a solvent—is a “holy grail” task for computational chemistry. If we can predict this accurately using AI, we can screen millions of drug candidates without stepping into a wet lab. ...

8 min · 1633 words
[FlashTP: Fused, Sparsity-Aware Tensor Product for Machine Learning Interatomic Potentials 🔗](https://openreview.net/pdf?id=wiQe95BPaB)

Breaking the Bottleneck: How FlashTP Accelerates Equivariant Molecular Dynamics

Simulating the atomic world is one of the holy grails of computational science. From discovering new battery materials to designing novel drugs, Molecular Dynamics (MD) simulations allow us to peek into the movement of atoms over time. Historically, scientists have had to choose between two extremes: Quantum Mechanical methods (highly accurate but painfully slow) or classical force fields (fast but often inaccurate). In recent years, a third contender has emerged: Machine Learning Interatomic Potentials (MLIPs). Specifically, Equivariant MLIPs have revolutionized the field by achieving quantum-level accuracy with significantly better efficiency. They respect the physical laws of symmetry—if you rotate a molecule, the forces acting on it should rotate with it. ...

9 min · 1875 words
[Nonparametric Teaching for Graph Property Learners 🔗](https://arxiv.org/abs/2505.14170)

Faster GCN Training: How a 'Teacher' Can Speed Up Graph Learning by 40%

In the world of machine learning, data is often neat and tabular. But in the real world—especially in biology, chemistry, and social sciences—data is messy and interconnected. It comes in the form of graphs: molecules are atoms connected by bonds; social networks are people connected by friendships. To make sense of this data, we use Graph Convolutional Networks (GCNs). These powerful neural networks can predict properties like whether a molecule is soluble in water or which social group a user belongs to. We call this task Graph Property Learning. ...

2025-05 · 9 min · 1820 words
[Learning to (Learn at Test Time): RNNs with Expressive Hidden States 🔗](https://arxiv.org/abs/2407.04620)

The Model That Learns While It Reads: Explaining Test-Time Training (TTT) Layers

In the world of Large Language Models (LLMs), there is a constant tug-of-war between two architectural paradigms: Transformers and Recurrent Neural Networks (RNNs). Transformers, powered by Self-Attention, are the reigning champions. They are brilliant at handling long context because they explicitly remember every token they have seen (stored in the Key-Value cache). However, this explicit memory comes at a price: self-attention’s computation scales quadratically with sequence length (\(O(T^2)\)), so as the context grows, the cost explodes. ...

2024-07 · 9 min · 1709 words
[Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion 🔗](https://arxiv.org/abs/2502.00264)

Smoothing the Landscape: How Rotation Symmetry Unlocks Better Transformer Fusion

Deep learning has a fascinating, somewhat counter-intuitive property: if you train two identical neural network architectures on the same data, they will learn to perform the task equally well, yet their internal weights will look completely different. This phenomenon poses a significant challenge for Model Fusion—the practice of merging multiple trained models into a single, superior model without accessing the original training data. If you simply average the weights of two distinct models (a technique often used in Federated Learning or model ensembling), the result is usually a “broken” model with poor performance. Why? Because the models, while functionally similar, are not aligned in parameter space. ...

2025-02 · 9 min · 1785 words