[Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks 🔗](https://openreview.net/pdf?id=YucuAuXMpT)

The Ghost in the Machine: Why Identical Networks Can Have Radically Different Brains

Introduction In the quest to understand intelligence—both artificial and biological—we often rely on a fundamental assumption: if two systems perform the same task in the same way, they must be processing information similarly. If a Deep Neural Network (DNN) classifies images with the same accuracy and error patterns as a human, we are tempted to conclude that the network’s internal “neural code” aligns with the human brain. But what if this assumption is fundamentally flawed? ...
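
A minimal numpy sketch of the underlying intuition (illustrative, not code from the paper): two deep linear networks can compute exactly the same input-output function while their hidden "neural codes" differ by an arbitrary invertible transform.

```python
# Illustrative sketch (not from the paper): identical function, different hidden code.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))          # first layer:  R^3 -> R^4
W2 = rng.normal(size=(2, 4))          # second layer: R^4 -> R^2

# Any invertible R re-parameterizes the hidden layer without changing the function:
# (W2 R^{-1}) (R W1) x == W2 W1 x  for every x.
R = rng.normal(size=(4, 4))
W1_alt = R @ W1
W2_alt = W2 @ np.linalg.inv(R)

x = rng.normal(size=(3,))
h, h_alt = W1 @ x, W1_alt @ x         # hidden representations
y, y_alt = W2 @ h, W2_alt @ h_alt     # network outputs

print(np.allclose(y, y_alt))          # True:  same behavior on every input
print(np.allclose(h, h_alt))          # False: radically different "neural code"
```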

11 min · 2150 words
[Monte Carlo Tree Diffusion for System 2 Planning 🔗](https://arxiv.org/abs/2502.07202)

Unlocking System 2 Thinking in AI Agents: Monte Carlo Tree Diffusion

Imagine you are playing a complex game of chess. Sometimes, you make a move instantly based on intuition—a quick pattern match. Other times, you sit back and simulate several moves into the future, weighing options, discarding bad paths, and refining your strategy before touching a piece. In cognitive science, these are often referred to as System 1 (fast, instinctive) and System 2 (slow, deliberative) thinking. ...

2025-02 · 8 min · 1561 words
[Learning the RoPEs: Better 2D and 3D Position Encodings with STRING 🔗](https://arxiv.org/abs/2502.02562)

Beyond RoPE: How STRING Unlocks Better 3D Spatial Reasoning for Transformers

Introduction If you have ever worked with Transformers—the architecture behind the current AI revolution—you know they have a peculiar quirk: they are set functions. If you feed a Transformer the sentence “The cat sat on the mat” or “mat on sat cat The,” the core attention mechanism treats them almost identically. It has no inherent concept of order or space. To fix this, we use Position Encodings (PEs). For Large Language Models (LLMs), the industry standard has become RoPE (Rotary Position Embeddings). RoPE is fantastic for 1D text sequences. However, as we push Transformers into the world of robotics and computer vision, we aren’t just dealing with a sequence of words anymore. We are dealing with 2D images, 3D point clouds, and robot limbs moving through physical space. ...
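
For reference, here is a minimal sketch of standard 1D RoPE, the scheme the post starts from (illustrative, not the paper's STRING construction): feature pairs of the query and key are rotated by angles proportional to position, so attention scores depend only on the relative offset between tokens.

```python
# Minimal 1D RoPE sketch (assumes an even head dimension; not the STRING method).
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate feature pairs of x (shape [d]) by angles proportional to position."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

q = np.random.randn(8)
k = np.random.randn(8)
# The attention score depends only on the relative offset (here 2 in both cases):
s1 = rope(q, pos=5) @ rope(k, pos=3)
s2 = rope(q, pos=7) @ rope(k, pos=5)
print(np.isclose(s1, s2))   # True: the dot product is a function of relative position
```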

2025-02 · 8 min · 1672 words
[A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO 🔗](https://arxiv.org/abs/2505.15694)

When Privacy Meets Poison: A Unified Theory for Robust AI Alignment

Introduction In the current landscape of Large Language Model (LLM) development, “alignment” is the North Star. We want models that are not just smart, but also helpful, honest, and harmless. To achieve this, we rely heavily on human feedback—specifically, datasets where humans indicate which of two model responses they prefer. This data powers the two dominant alignment paradigms: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). But there is a crack in the foundation. Most theoretical work assumes that the preference labels provided by humans are perfect and readily available. In the real world, this is rarely the case. We face two distinct but overlapping threats: ...
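
For context, the DPO objective the excerpt refers to is the standard one from Rafailov et al. (not specific to this paper): the policy is fit directly on preference triples \((x, y_w, y_l)\), where \(y_w\) is the preferred response and \(y_l\) the rejected one.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]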

2025-05 · 9 min · 1834 words
[Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream 🔗](https://arxiv.org/abs/2411.05712)

The Limits of Scale: Why Bigger AI Models Aren't Necessarily Better Brains

In the current era of Artificial Intelligence, there is a pervasive mantra: scale is all you need. From Large Language Models (LLMs) like GPT-4 to massive vision transformers, the recipe for success has largely been to increase the number of parameters, feed the model more data, and throw more compute at the training process. This “brute force” approach has yielded unprecedented performance on tasks ranging from coding to generating photorealistic images. ...

2024-11 · 8 min · 1663 words
[HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation 🔗](https://arxiv.org/abs/2502.09838)

HealthGPT: Bridging the Gap Between Medical Seeing and Drawing

Introduction In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) have made headlines for their ability to pass medical licensing exams and act as diagnostic assistants. However, medicine is not just about text; it is an inherently visual field. Radiologists interpret X-rays, pathologists analyze tissue slides, and surgeons rely on MRI reconstructions. While recent “Vision-Language” models can look at an X-ray and write a report (comprehension), they often lack the ability to perform the reverse: creating or enhancing medical images based on instructions (generation). Existing models that try to do both often suffer from a “jack of all trades, master of none” problem, where learning to generate images degrades the model’s ability to understand them, and vice versa. ...

2025-02 · 8 min · 1633 words
[Improving Zero-Shot Adversarial Robustness in Vision-Language Models by Closed-form Alignment of Adversarial Path Simplices 🔗](https://openreview.net/pdf?id=WR0ahlhOoy)

Beyond the Point: Robustifying VLMs with Infinite Adversarial Sampling via Math Tricks

Introduction Vision-Language Models (VLMs) like CLIP have revolutionized how computers understand the world. By learning to associate images with natural language descriptions on a massive scale, they can classify objects they have never seen before—a capability known as zero-shot classification. You can show CLIP a picture of an “axolotl” and, even if it wasn’t explicitly trained to tag axolotls, it can figure it out by understanding the text description. However, these powerful models have an Achilles’ heel: Adversarial Examples. ...
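
A minimal zero-shot classification sketch using openly available CLIP weights via Hugging Face transformers (illustrative of the capability described, not code from the paper; the image path is a placeholder):

```python
# Zero-shot classification with CLIP: score an image against text prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("axolotl.jpg")    # placeholder path to any test image
labels = ["a photo of an axolotl", "a photo of a salamander", "a photo of a fish"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # image-text similarity scores
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))          # highest probability wins
```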

9 min · 1846 words
[PhySpec: Physically Consistent Spectral Reconstruction via Orthogonal Subspace Decomposition and Self-Supervised Meta-Auxiliary Learning 🔗](https://openreview.net/pdf?id=WISfJyOA6M)

Solving the Colorimetric Dilemma: How PhySpec Brings Physics Back to Hyperspectral Imaging

Introduction: Seeing Beyond the Visible Imagine trying to reconstruct a full symphony orchestra’s performance given only the bass, mid, and treble settings from a car radio. It seems impossible, yet that is essentially the challenge of Spectral Reconstruction. In the world of computer vision, traditional cameras are limited. They mimic the human eye, capturing the world in three channels: Red, Green, and Blue (RGB). However, the physical world is much richer. Every material reflects light across a continuous spectrum of wavelengths. A Hyperspectral Image (HSI) captures this dense information, often containing 31 to over 100 channels. This data is invaluable for applications like remote sensing, medical diagnostics, and agricultural monitoring because it reveals intrinsic material properties that RGB cameras miss. ...

8 min · 1613 words
[The Role of Randomness in Stability 🔗](https://arxiv.org/abs/2502.08007)

Counting the Coins: How Much Randomness Do We Need for Stable Machine Learning?

In the world of machine learning and statistics, we often crave two contradictory things: consistency and privacy. On one hand, we want reproducibility. If I run an analysis on a dataset today, and you run the same analysis on the same dataset tomorrow, we should get the same result. This is the bedrock of the scientific method. On the other hand, we increasingly demand Differential Privacy (DP). To protect individual data points, algorithms must be randomized. If an algorithm is perfectly deterministic, it risks leaking information about the specific inputs it was trained on. Furthermore, standard training methods like Stochastic Gradient Descent (SGD) inject noise by design to improve generalization. ...
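
A small sketch of the textbook reason randomization is required for differential privacy, using the classic Laplace mechanism (a standard example, not one of the algorithms analyzed in the paper):

```python
# Laplace mechanism: release a mean with epsilon-DP by adding calibrated noise.
import numpy as np

def private_mean(data, epsilon, lo=0.0, hi=1.0, rng=np.random.default_rng()):
    """Release the mean of values in [lo, hi] with epsilon-differential privacy."""
    n = len(data)
    sensitivity = (hi - lo) / n                  # max change from altering one record
    noise = rng.laplace(scale=sensitivity / epsilon)
    return float(np.mean(np.clip(data, lo, hi)) + noise)

data = np.random.default_rng(0).uniform(size=100)
# Two runs on the same data give different answers -- the price of privacy:
print(private_mean(data, epsilon=1.0), private_mean(data, epsilon=1.0))
```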

2025-02 · 10 min · 2035 words
[InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective 🔗](https://arxiv.org/abs/2505.21920)

InfoSAM: Unlocking the Secrets of Fine-Tuning the Segment Anything Model with Information Theory

Introduction The release of the Segment Anything Model (SAM) marked a turning point in computer vision. Trained on over 1 billion masks, SAM demonstrated an incredible ability to perform “zero-shot” segmentation—identifying objects it had never seen before without specific training. It seemed like the “Jack of all trades” for image analysis. However, as many researchers and students soon discovered, being a Jack of all trades often means being a master of none. When applied to highly specialized domains—such as identifying polyps in medical imaging, detecting camouflaged animals, or spotting specific crop diseases—SAM’s performance often drops. It struggles to grasp the nuanced, domain-specific features that these tasks demand. ...

2025-05 · 9 min · 1770 words
[A Closer Look at Multimodal Representation Collapse 🔗](https://arxiv.org/abs/2505.22483)

Unraveling Modality Collapse: How Polysemantic Neurons and Rank Bottlenecks Sabotage Multimodal Learning

Introduction In the pursuit of Artificial General Intelligence, multimodal learning is a cornerstone. The logic is sound: humans perceive the world through sight, sound, and text simultaneously; therefore, AI models should benefit from combining these modalities to form a richer understanding of the data. Theoretically, adding a modality—say, adding MRI scans to patient health records—should never decrease performance. It should only add information. However, researchers have consistently observed a baffling phenomenon known as Modality Collapse. Instead of leveraging all available data, deep learning models often over-rely on a subset of modalities while completely ignoring others. If a model is trained on video and audio, it might learn to ignore the audio entirely. This isn’t just inefficient; it’s dangerous. If the relied-upon modality goes missing at test time (e.g., a camera fails), the model becomes useless because it never learned to use the backup sensors. ...

2025-05 · 9 min · 1850 words
[Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models 🔗](https://arxiv.org/abs/2410.02205)

The Logic Gap: Why LLMs Contradict Themselves and How to Fix It

Imagine you are asking a smart assistant to rank three job candidates: Alice, Bob, and Charlie. You ask the assistant, “Is Alice better than Bob?” It says yes. You ask, “Is Bob better than Charlie?” It says yes. Logically, you’d expect that if Alice beats Bob, and Bob beats Charlie, then Alice must beat Charlie. But when you ask, “Is Alice better than Charlie?” the assistant pauses and says, “No, Charlie is better than Alice.” ...
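
A minimal sketch of the transitivity check this example implies (names and data are hypothetical): collect the model's pairwise judgments and flag any cycle.

```python
# prefers[(a, b)] == True means the model judged "a is better than b".
from itertools import permutations

prefers = {
    ("Alice", "Bob"): True,
    ("Bob", "Charlie"): True,
    ("Alice", "Charlie"): False,   # the contradiction from the example above
}

def is_transitive(prefers):
    better = {pair for pair, preferred in prefers.items() if preferred}
    names = {name for pair in prefers for name in pair}
    for a, b, c in permutations(names, 3):
        if (a, b) in better and (b, c) in better and (a, c) in prefers and (a, c) not in better:
            return False           # a > b and b > c, yet the model denies a > c
    return True

print(is_transitive(prefers))      # False: Alice > Bob > Charlie, but not Alice > Charlie
```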

2024-10 · 10 min · 1959 words
[Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents 🔗](https://arxiv.org/abs/2502.02561)

Balancing Safety and Utility: A New Foundation for Risk-Averse Decision Making

In the world of machine learning, predictions are rarely the end goal. We predict to act. A doctor predicts a diagnosis to choose a treatment; a self-driving car predicts pedestrian movement to steer; a financial algorithm predicts market trends to trade. In low-stakes environments, like recommending a movie, it is acceptable to maximize the expected utility. If the model is wrong, the user wastes two hours on a bad movie—unfortunate, but not catastrophic. However, in high-stakes domains like medicine or robotics, the cost of error is asymmetric. Relying solely on the “most likely” outcome can be dangerous. If a model is 90% sure a tumor is benign, acting on that probability might maximize expected utility, but it ignores the catastrophic risk of the 10% chance that it is malignant. ...
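
A toy worked example of the gap the excerpt describes, with entirely hypothetical utility numbers: maximizing expected utility can favor the risky action, while a risk-averse (worst-case) criterion does not.

```python
# All utility numbers below are hypothetical, purely for illustration.
probs = {"benign": 0.9, "malignant": 0.1}
utility = {                                        # utility of (action, true state)
    ("discharge", "benign"): 1.0,  ("discharge", "malignant"): -20.0,
    ("biopsy",    "benign"): -1.0, ("biopsy",    "malignant"): -5.0,
}
actions = ["discharge", "biopsy"]

expected = {a: sum(p * utility[(a, s)] for s, p in probs.items()) for a in actions}
worst    = {a: min(utility[(a, s)] for s in probs) for a in actions}

print(expected)                         # discharge: ~-1.1, biopsy: -1.4
print(max(expected, key=expected.get))  # 'discharge' -> expected-utility choice
print(max(worst, key=worst.get))        # 'biopsy'    -> risk-averse (worst-case) choice
```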

2025-02 · 9 min · 1771 words
[Achieving Linear Speedup and Near-Optimal Complexity for Decentralized Optimization over Row-stochastic Networks 🔗](https://arxiv.org/abs/2506.04600)

Taming Directed Graphs: How to Achieve Optimal Speed in Decentralized Learning

Introduction In the modern era of machine learning, the scale of datasets and models has grown exponentially. Training these massive models often requires distributed computing across clusters of machines. Traditionally, this is done using a centralized parameter server—a conductor that coordinates the entire orchestra. However, this central server creates a communication bottleneck and a single point of failure. Enter Decentralized Optimization. Imagine an orchestra without a conductor, where musicians only listen to their immediate neighbors to stay in sync. In this setting, nodes (computers) exchange information locally to collaboratively minimize a global loss function. ...
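
A minimal sketch of the decentralized template the excerpt describes, using a simple doubly stochastic ring for clarity (a generic decentralized gradient method, not the paper's row-stochastic algorithm): each node mixes its neighbors' iterates, then takes a local gradient step.

```python
# Generic decentralized gradient descent with gossip averaging (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[0.50, 0.25, 0.00, 0.25],          # ring mixing matrix: each node
              [0.25, 0.50, 0.25, 0.00],          # only talks to its two neighbors
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

targets = rng.normal(size=(4, 3))                # node i's loss: 0.5 * ||x_i - t_i||^2
x = np.zeros((4, 3))                             # one local iterate per node
lr = 0.1

for _ in range(200):
    grads = x - targets                          # local gradients
    x = W @ x - lr * grads                       # mix with neighbors, then step

# Nodes end up (approximately) agreeing on the global minimizer, the average target.
# A decaying step size or gradient tracking removes the small residual bias.
print(np.abs(x - targets.mean(axis=0)).max())
```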

2025-06 · 7 min · 1434 words
[Graph Adaptive Autoregressive Moving Average Models 🔗](https://openreview.net/pdf?id=UFlyLkvyAE)

GRAMA: Unlocking Long-Range Graph Learning with Adaptive ARMA Dynamics

Introduction Graph Neural Networks (GNNs) have revolutionized how we process structured data, from predicting molecular properties to analyzing social networks. However, standard GNNs—specifically Message Passing Neural Networks (MPNNs)—have a well-known Achilles’ heel: oversquashing. In a standard MPNN, information is aggregated from neighbors. To learn about a node 10 steps away, you need 10 layers of message passing. As the receptive field grows, the amount of information that needs to be compressed into a single fixed-size vector grows exponentially. The signal gets “squashed,” and the model fails to capture long-range dependencies. ...
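
A tiny sketch of why depth is needed for range in a vanilla MPNN (illustrative only, not GRAMA): on a path graph, a signal placed at one end does not reach the other end until the number of message-passing rounds equals the graph distance, and it arrives heavily attenuated when it does.

```python
# Mean-aggregation message passing on a path graph, tracking a single signal.
import numpy as np

num_nodes = 11
A = np.zeros((num_nodes, num_nodes))
for i in range(num_nodes - 1):                   # path graph 0 - 1 - ... - 10
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = A + np.eye(num_nodes)                    # add self-loops
P = A_hat / A_hat.sum(axis=1, keepdims=True)     # row-normalized (mean) aggregation

h = np.zeros((num_nodes, 1))
h[10] = 1.0                                      # the signal lives only at node 10

for k in range(1, 11):
    h = P @ h                                    # one round of message passing
    print(k, float(h[0]))                        # node 0: exactly 0 until k = 10,
                                                 # then ~2.5e-05 -- squashed flat
```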

9 min · 1714 words
[Trusted Multi-View Classification with Expert Knowledge Constraints 🔗](https://openreview.net/pdf?id=U64wEbM7NB)

Peeking Inside the Black Box of Sleep: How Trusted Multi-View Learning Uses Expert Knowledge

Artificial Intelligence has made massive strides in healthcare, particularly in diagnosing sleep disorders. Automated Sleep Stage Classification (SSC) using EEG signals is becoming faster and more accurate than manual scoring by human experts. However, a lingering problem remains in high-stakes medical AI: Trust. When a neural network diagnoses a patient, it typically acts as a “black box.” It spits out a probability (e.g., “90% chance of Stage N2 sleep”), but it rarely tells us why it thinks that, nor does it honestly admit when it is confused by conflicting data. ...

9 min · 1814 words
[Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces 🔗](https://arxiv.org/abs/2507.03910)

Rethinking Molecular Design: Why COWBOYS Are Better Than Latent Space Optimization

Introduction In the world of drug discovery and material science, the search for a new molecule is often compared to finding a needle in a haystack. However, the “haystack” here is the chemical space, which contains an estimated \(10^{60}\) theoretically possible drug-like molecules. Searching this space is a discrete, combinatorial nightmare. To tackle this, machine learning researchers developed Latent Space Bayesian Optimization (LSBO). The idea was elegant: instead of searching through discrete chemical structures, we can use a Variational Autoencoder (VAE) to map these structures into a continuous, smooth numerical space (the “latent space”). We can then use standard optimization techniques in this smooth space to find the best candidates. ...
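
A highly simplified sketch of the LSBO loop the excerpt describes (generic latent-space BO, not COWBOYS): `decode` and `score_molecule` below are hypothetical toy stand-ins for a trained VAE decoder and an expensive property measurement.

```python
# Generic latent-space Bayesian optimization sketch with toy stand-in functions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
latent_dim = 8

def decode(z):                                   # hypothetical VAE decoder
    return z                                     # toy: treat the code as the "molecule"

def score_molecule(mol):                         # hypothetical lab/oracle evaluation
    return -float(np.sum(np.asarray(mol) ** 2))  # toy objective: prefer codes near 0

Z = rng.normal(size=(20, latent_dim))            # latent codes of evaluated molecules
y = np.array([score_molecule(decode(z)) for z in Z])

for _ in range(10):
    gp = GaussianProcessRegressor().fit(Z, y)    # surrogate over the latent space
    candidates = rng.normal(size=(512, latent_dim))
    mean, std = gp.predict(candidates, return_std=True)
    z_next = candidates[np.argmax(mean + 1.96 * std)]   # UCB acquisition
    Z = np.vstack([Z, z_next])
    y = np.append(y, score_molecule(decode(z_next)))

print(max(y))                                    # best score found so far
```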

2025-07 · 9 min · 1802 words
[Graph Diffusion for Robust Multi-Agent Coordination 🔗](https://openreview.net/pdf?id=T5IZ32ImAB)

Connecting the Dots: How Graph Diffusion is Revolutionizing Multi-Agent Coordination

Introduction In the world of artificial intelligence, getting a single agent to perform a task is difficult. Getting multiple agents to work together—like a swarm of drones fighting a fire or a fleet of autonomous vehicles navigating a busy intersection—is exponentially harder. This is the domain of Multi-Agent Reinforcement Learning (MARL). While MARL has seen significant success, a major hurdle remains: Offline Learning. In many real-world scenarios, we cannot allow robots to learn by trial and error (which involves crashing and failing). Instead, we must train them on pre-collected datasets. The problem is that offline data is static, but the real world is dynamic. If an agent’s motor degrades, or if a team member suddenly goes offline, policies trained on static data often crumble because they haven’t “seen” these specific coordination failures before. ...

8 min · 1685 words
[PokéChamp: an Expert-level Minimax Language Agent 🔗](https://arxiv.org/abs/2503.04094)

How LLMs Mastered Pokémon Battles: Inside PokéChamp

Artificial Intelligence has already conquered perfect-information games like Chess and Go. In those domains, Deep Reinforcement Learning (RL) agents—trained over millions of iterations of self-play—reign supreme. However, these methods often require massive, task-specific training resources. Enter Large Language Models (LLMs). These models possess vast general knowledge, but they notoriously struggle with strategic planning. If you ask a standard LLM to play a game, it often hallucinates rules or fails to look ahead. ...

2025-03 · 8 min · 1692 words
[Lipschitz neural networks are well-known for providing certified robustness in deep learning... 🔗](https://arxiv.org/abs/2505.15174)

Building Bulletproof AI: The Block Reflector Orthogonal Layer and Logit Annealing

Deep learning models are undeniably powerful, achieving superhuman performance in tasks ranging from medical diagnosis to autonomous driving. Yet, they possess a startling weakness: adversarial attacks. A malicious actor can add imperceptible noise to an image—changes so subtle the human eye cannot detect them—and cause a state-of-the-art AI to misclassify a stop sign as a speed limit sign. To combat this, researchers have developed “defenses.” Most are empirical—they work until a smarter attacker comes along. However, the gold standard is Certified Robustness. A certifiably robust model comes with a mathematical guarantee: for a given input, no perturbation within a specific range (the “radius”) can change the model’s prediction. ...
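
A small illustration of what such a guarantee looks like, using the standard margin-based certificate for Lipschitz networks (a generic bound, not the paper's Block Reflector Orthogonal layer): if the network is \(L\)-Lipschitz in the \(\ell_2\) norm, a logit margin of \(m\) certifies a radius of \(m / (\sqrt{2}\,L)\).

```python
# Margin-based certified radius for an L-Lipschitz classifier (generic bound).
import numpy as np

def certified_radius(logits, lipschitz_constant):
    runner_up, top = np.sort(logits)[-2:]        # two largest logits
    margin = top - runner_up
    return margin / (np.sqrt(2) * lipschitz_constant)

logits = np.array([3.2, 0.5, 1.9, -0.7])         # hypothetical network outputs
print(certified_radius(logits, lipschitz_constant=1.0))  # ~0.92: certified L2 radius
```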

2025-05 · 9 min · 1783 words