[MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models 🔗](https://arxiv.org/abs/2501.00316)

Lost in Translation: Why Foundation Models Struggle with Real-World Maps

Imagine you are in a foreign city. You open a map app, looking for a coffee shop that is open right now, within a 10-minute walk, and has a rating above 4.0. You also need to spot it on a map filled with icons and street names. For a human, this is a standard navigational task involving visual scanning, spatial reasoning, and reading comprehension. Now, imagine asking an AI to do the same. While Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated incredible prowess in coding, creative writing, and general reasoning, their ability to navigate the physical world—represented through maps—remains a significant blind spot. ...

2025-01 · 8 min · 1579 words
[The Jailbreak Tax: How Useful are Your Jailbreak Outputs? 🔗](https://openreview.net/pdf?id=hRQyqtcjVv)

The Jailbreak Tax: Why Breaking AI Safety Rails Might Break the AI Too

The world of Large Language Model (LLM) security is often framed as a high-stakes game of cat and mouse. On one side, developers build guardrails to align models, preventing them from generating harmful content like bomb-making instructions or hate speech. On the other side, “red teamers” and adversaries develop “jailbreaks”—clever prompts designed to bypass these defenses. Until now, the primary metric for a successful jailbreak has been binary: Did the model refuse, or did it answer? ...

8 min · 1566 words
[ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation 🔗](https://arxiv.org/abs/2502.13581)

Beyond IDs: How ActionPiece Brings Context to Generative Recommendation

In the world of Recommender Systems, we are currently witnessing a paradigm shift. The field is moving away from traditional classification-based methods—which select the best item from a massive, fixed pool—toward Generative Recommendation (GR). Inspired by the success of Large Language Models (LLMs) like GPT, GR models treat user behavior as a language. They “tokenize” user actions and train models to autoregressively predict the next token in a sequence. ...
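
To make the “actions as tokens” framing concrete, here is a minimal sketch in plain Python. The action dictionaries, feature names, and prefix-to-next-token training pairs below are hypothetical illustrations of the general idea, not the paper's actual ActionPiece tokenizer.

```python
# A minimal sketch of the "user behavior as language" idea behind generative
# recommendation. The action vocabulary and sequence below are hypothetical.

# Each user action is described by a set of item features (category, brand, ...).
actions = [
    {"category": "running_shoes", "brand": "acme"},
    {"category": "socks", "brand": "acme"},
    {"category": "water_bottle", "brand": "hydro"},
]

# Build a toy vocabulary over individual feature values.
vocab = {}
def token_id(value):
    if value not in vocab:
        vocab[value] = len(vocab)
    return vocab[value]

# Tokenize the interaction history into a flat token sequence.
sequence = [token_id(v) for action in actions for v in action.values()]

# Autoregressive training pairs: predict each token from its prefix.
pairs = [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]
print(pairs)
```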

2025-02 · 10 min · 2002 words
[Functional Alignment Can Mislead: Examining Model Stitching 🔗](https://openreview.net/pdf?id=glLqTK9En3)

Just Because It Fits: Why Neural Network Alignment Doesn't Mean What You Think It Means

One of the great mysteries of modern artificial intelligence is the “black box” problem. We know that deep neural networks work—often surprisingly well—but we don’t always know how they represent the data they are processing. Does a model classify a bird because it sees wings, or because it hears a song, or because it detects a specific texture in the background? To answer these questions, researchers have developed various tools to compare the internal representations of different models. A popular and intuitive method is called Model Stitching. The logic goes like this: if you can take the first half of Model A, stitch it to the second half of Model B, and the Frankenstein-monster combination still works, then Model A and Model B must be “thinking” in similar ways. They must be functionally aligned. ...
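
As a concrete picture of the stitching procedure, here is a minimal PyTorch sketch with two toy MLPs, assuming a single learned linear layer as the stitch; the architectures and dimensions are hypothetical, not the specific networks studied in the paper.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """A toy two-stage network standing in for 'Model A' or 'Model B'."""
    def __init__(self):
        super().__init__()
        self.first_half = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.second_half = nn.Sequential(nn.Linear(64, 10))

    def forward(self, x):
        return self.second_half(self.first_half(x))

model_a, model_b = ToyNet(), ToyNet()

# The "stitch": a trainable linear map from A's intermediate space to B's.
stitch = nn.Linear(64, 64)

def stitched_forward(x):
    h = model_a.first_half(x)      # representation from the first half of A
    h = stitch(h)                  # learned alignment layer
    return model_b.second_half(h)  # classifier head from the second half of B

out = stitched_forward(torch.randn(8, 32))
print(out.shape)  # torch.Size([8, 10])
```

If this hybrid can be trained to perform well, the stitching argument concludes the two models are aligned; the post examines why that conclusion can mislead.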

9 min · 1795 words
[Stronger Neyman Regret Guarantees for Adaptive Experimental Design 🔗](https://arxiv.org/abs/2502.17427)

Beyond A/B Testing - How Adaptive Algorithms Are Revolutionizing Experimental Design

Imagine you are running a clinical trial for a new drug, or perhaps testing a new feature on a high-traffic e-commerce website. In the traditional “A/B testing” world, you might flip a coin: 50% of people get the treatment, 50% get the control. You run this for a month, collect data, and analyze the results. But what if, halfway through the month, the data starts hinting that the treatment group has much higher variance than the control group? Or what if certain subgroups of people react differently? A fixed 50/50 split is rarely the most efficient way to estimate the truth. It wastes samples and time. ...
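
To see why a fixed 50/50 split can be wasteful, here is a back-of-the-envelope sketch of the classical Neyman allocation, which assigns samples in proportion to each arm's outcome standard deviation. The variances are made up for illustration; the adaptive algorithms in the paper have to estimate them during the experiment rather than assume they are known.

```python
import numpy as np

sigma_treatment = 4.0  # hypothetical: treatment outcomes are noisy
sigma_control = 1.0    # hypothetical: control outcomes are stable

# Neyman allocation: sample each arm in proportion to its standard deviation.
p_treatment = sigma_treatment / (sigma_treatment + sigma_control)
print(f"optimal treatment share: {p_treatment:.2f}")  # 0.80, far from 0.50

def estimator_variance(p, n=1000):
    """Variance of the difference-in-means estimator with n total samples."""
    return sigma_treatment**2 / (p * n) + sigma_control**2 / ((1 - p) * n)

print(estimator_variance(0.5), estimator_variance(p_treatment))  # 0.034 vs 0.025
```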

2025-02 · 7 min · 1396 words
[Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry 🔗](https://arxiv.org/abs/2503.18114)

Seeing How They Think: Unlocking Neural Network Dynamics with Manifold Geometry

How does a neural network actually learn? If you look at the raw numbers—the billions of synaptic weights—you see a chaotic storm of floating-point adjustments. If you look at the loss curve, you see a line going down. But neither of these tells you how the network is structuring information. For a long time, researchers have categorized deep learning into two distinct regimes: the Lazy regime and the Rich regime. In the lazy regime, the network barely touches its internal features, acting like a glorified kernel machine. In the rich regime, the network actively sculpts complex, task-specific features. ...

2025-03 · 8 min · 1700 words
[Learning with Exact Invariances in Polynomial Time 🔗](https://arxiv.org/abs/2502.19758)

Breaking the Symmetry Barrier - How to Learn Exact Invariances in Polynomial Time

In the natural sciences and physics, symmetry is everything. Whether you are analyzing the energy of a molecule, the dynamics of a fluid, or the structure of a crystal, the fundamental laws of nature often remain unchanged under certain transformations—like rotation, reflection, or translation. In machine learning, we call these properties invariances. Ideally, if we train a model to predict the energy of a molecule, rotating that molecule in 3D space should not change the model’s prediction. However, teaching machines to respect these symmetries exactly has historically been a massive computational headache. ...
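
One classical route to exact invariance is to average a predictor over the symmetry group, as in the small NumPy sketch below. The base model and the four-element rotation group are hypothetical stand-ins; the paper's point is that achieving exact invariance efficiently, for richer symmetries, is where the computational difficulty lies.

```python
import numpy as np

def rotate(points, k):
    """Rotate an (n, 2) point cloud by k * 90 degrees."""
    theta = k * np.pi / 2
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T

def base_model(points):
    # Deliberately NOT invariant: depends on the raw coordinates.
    return float(np.sum(points[:, 0] * 2.0 + points[:, 1]))

def symmetrized_model(points):
    # Group averaging: the output no longer depends on the input's orientation.
    return np.mean([base_model(rotate(points, k)) for k in range(4)])

x = np.random.randn(5, 2)
print(base_model(x), base_model(rotate(x, 1)))                # differ
print(symmetrized_model(x), symmetrized_model(rotate(x, 1)))  # agree up to float error
```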

2025-02 · 7 min · 1396 words
[Masked Autoencoders Are Effective Tokenizers for Diffusion Models 🔗](https://arxiv.org/abs/2502.03444)

Beyond VAEs: How Masking Makes Autoencoders Effective Tokenizers for Diffusion

If you have been following the explosion of generative AI, you are likely familiar with Latent Diffusion Models (LDMs), the architecture behind heavyweights like Stable Diffusion. The secret sauce of LDMs is efficiency: instead of generating an image pixel-by-pixel, they operate in a compressed “latent space.” This compression is handled by a tokenizer (usually an autoencoder). For years, the standard advice has been to use Variational Autoencoders (VAEs). VAEs enforce a smooth, Gaussian distribution on the latent space, which theoretically makes it easier for the diffusion model to learn. But there is a trade-off: that smoothness constraint often results in blurry reconstructions and limits the fidelity of the final image. ...
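
For orientation, here is a minimal PyTorch sketch of the slot the tokenizer fills in a latent diffusion pipeline: compress pixels into a small latent grid, run diffusion there, decode back. The downsampling factor and layer shapes are hypothetical; the post is about what kind of autoencoder (VAE versus masked autoencoder) should occupy this slot.

```python
import torch
import torch.nn as nn

# Toy tokenizer: 8x spatial downsampling into a 16-channel latent grid.
encoder = nn.Conv2d(3, 16, kernel_size=8, stride=8)
decoder = nn.ConvTranspose2d(16, 3, kernel_size=8, stride=8)

image = torch.randn(1, 3, 256, 256)
latent = encoder(image)
print(latent.shape)  # torch.Size([1, 16, 32, 32]): the space diffusion works in

# The diffusion model only ever sees (and denoises) tensors of this latent
# shape, a 64x reduction in spatial positions compared to raw pixels.
noisy_latent = latent + 0.1 * torch.randn_like(latent)
reconstruction = decoder(noisy_latent)
print(reconstruction.shape)  # torch.Size([1, 3, 256, 256])
```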

2025-02 · 8 min · 1667 words
[Discovering Symbolic Cognitive Models from Human and Animal Behavior 🔗](https://openreview.net/pdf?id=dhRXGWJ027)

Can LLMs Act as Cognitive Scientists? Discovering How Brains Learn with CogFunSearch

In the world of neuroscience and psychology, there is a constant tension between prediction and understanding. If you want to simply predict what a human or animal will do next, you might train a massive Recurrent Neural Network (RNN) on their behavioral data. The RNN will likely achieve high accuracy, but it acts as a “black box.” It gives you the answer, but it doesn’t tell you how the brain solved the problem. It doesn’t offer a scientific theory. ...

8 min · 1572 words
[TABFLEX: Scaling Tabular Learning to Millions with Linear Attention 🔗](https://arxiv.org/abs/2506.05584)

Breaking the Tabular Barrier: How TABFLEX Scales Transformers to Millions of Rows

If you have taken an introductory machine learning course, you likely know the golden rule of tabular data: Gradient Boosted Decision Trees (GBDTs) are king. While Deep Learning has revolutionized images (CNNs, ViTs) and text (LLMs), tabular data—the rows and columns that make up the vast majority of business and medical databases—has remained the stronghold of XGBoost, LightGBM, and CatBoost. However, recent research has attempted to challenge this dominance using Transformers. The most notable attempt was TABPFN (Tabular Prior-Data Fitted Network), a model that acts like a “Large Language Model for tables.” Instead of training a model on your dataset, you simply feed your training data into the prompt (In-Context Learning), and it predicts the test labels immediately. It was revolutionary, but it had a massive flaw: scalability. Because it relied on standard attention mechanisms, its computational cost exploded quadratically as the number of samples increased. It was brilliant for 1,000 rows, but useless for 100,000. ...
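
The scaling wall is easy to see with a quick back-of-the-envelope sketch, assuming (for illustration only) that every training row becomes one token in the context:

```python
# Standard attention over n context tokens costs on the order of n^2 operations,
# while a linear-attention variant costs on the order of n. Constant factors
# are ignored; only the scaling matters here.

def attention_ops(n_rows, quadratic=True):
    return n_rows ** 2 if quadratic else n_rows

for n in (1_000, 100_000, 1_000_000):
    ratio = attention_ops(n) / attention_ops(n, quadratic=False)
    print(f"{n:>9,} rows: quadratic attention is {ratio:,.0f}x the cost of linear")
```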

2025-06 · 9 min · 1722 words
[Investigating Non-Transitivity in LLM-as-a-Judge 🔗](https://arxiv.org/abs/2502.14074)

Rock, Paper, LLM? Why AI Judges Are Confused and How to Fix It

If you have ever tried to grade creative writing essays, you know how subjective it can be. Is Essay A better than Essay B? Maybe. But if you compare Essay B to Essay C, and then C to A, you might find yourself in a logical loop where every essay seems better than the last in some specific way. This is the problem of non-transitivity, and, it turns out, Artificial Intelligence suffers from it too. ...
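
Here is a minimal sketch of what a non-transitive judge looks like, using a hypothetical three-essay preference table (A beats B, B beats C, C beats A) rather than any data from the paper:

```python
from itertools import permutations

# Hypothetical pairwise judgments: the judge's preferences form a cycle.
prefers = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}

def winner(x, y):
    return prefers[(x, y)] if (x, y) in prefers else prefers[(y, x)]

def is_transitive(items):
    # Transitivity requires: if a beats b and b beats c, then a beats c.
    for a, b, c in permutations(items, 3):
        if winner(a, b) == a and winner(b, c) == b and winner(a, c) != a:
            return False
    return True

print(is_transitive(["A", "B", "C"]))  # False: no consistent ranking exists
```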

2025-02 · 9 min · 1766 words
[Efficient and Separate Authentication Image Steganography Network 🔗](https://openreview.net/pdf?id=cKaUC1PeJA)

Lock It Up: Revolutionizing Image Steganography with Separate Authentication

Imagine you want to send a sensitive blueprint to a colleague. You don’t want to use standard encryption because a file named top_secret_plans.enc screams “look at me!” to any interceptor. Instead, you decide to hide the blueprint inside a harmless photo of a cat. This is steganography: the art of hiding information in plain sight. For years, researchers have used deep learning to make this process incredibly effective. However, there has been a glaring security hole. In most existing systems, if you have the “reveal” network, you can see everything hidden in the image. There is no concept of a “key” or specific user authentication. If you hide five different secret images for five different people in one cover photo, anyone with the decoder sees all five. ...
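
For readers new to the idea, here is a minimal NumPy sketch of the classic least-significant-bit trick, not the paper's deep-learning method: it shows both how a secret can hide inside ordinary pixels and why, without any notion of a key, anyone who knows the extraction rule can read everything.

```python
import numpy as np

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)       # "photo of a cat"
secret_bits = rng.integers(0, 2, size=(4, 4), dtype=np.uint8)   # "blueprint" bits

# Hide: overwrite the least significant bit of each cover pixel.
stego = (cover & 0xFE) | secret_bits

# Reveal: anyone who knows the rule can extract the secret, no key required.
recovered = stego & 1

print(np.array_equal(recovered, secret_bits))                     # True
print(np.max(np.abs(stego.astype(int) - cover.astype(int))))      # at most 1 per pixel
```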

9 min · 1753 words
[No Soundness in the Real World: On the Challenges of the Verification of Deployed Neural Networks 🔗](https://arxiv.org/abs/2506.01054)

Why Your 'Verified' Neural Network Might Still Be Unsafe: The Reality of Floating-Point Deployment

Imagine a future where safety-critical systems—like autonomous vehicles or medical diagnosis tools—are governed by neural networks. Before these networks are allowed on the road or in the hospital, they undergo a rigorous process called formal verification. This produces a mathematical proof guaranteeing that the network will behave correctly, even when an attacker tries to trick it. It sounds like the ultimate safety net. If the math proves the network is robust, we should be safe, right? ...
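
A tiny NumPy example (with made-up numbers, not from the paper) shows the kind of gap the post is concerned with: verification reasons about exact real arithmetic, but a deployed network computes in finite precision, where even the summation order of a dot product changes the answer.

```python
import numpy as np

# Made-up weights whose exact dot product with the activations is 1.0.
weights = np.array([1e8, 1.0, -1e8], dtype=np.float32)
acts = np.array([1.0, 1.0, 1.0], dtype=np.float32)

def dot_in_order(order):
    """Accumulate the dot product in float32, one term at a time."""
    total = np.float32(0.0)
    for i in order:
        total = total + weights[i] * acts[i]
    return total

# Same mathematical quantity, different floating-point answers.
print(dot_in_order([0, 1, 2]), dot_in_order([0, 2, 1]))  # 0.0 vs 1.0
```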

2025-06 · 8 min · 1649 words
[Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations 🔗](https://arxiv.org/abs/2412.14803)

Can Predicting the Future Teach Robots to Act? A Deep Dive into Video Prediction Policy (VPP)

In the quest to build generalist robots—machines capable of handling everything from folding laundry to assembling electronics—vision is paramount. For a robot to interact with the world, it must see the world. However, how we teach robots to “see” has largely been static. We typically feed them single images, effectively asking them to make complex decisions based on snapshots frozen in time. ...

2024-12 · 10 min · 2130 words
[Non-stationary Diffusion For Probabilistic Time Series Forecasting 🔗](https://arxiv.org/abs/2505.04278)

Why Your Time Series Model Fails at Uncertainty (and How Non-stationary Diffusion Fixes It)

In the world of time series forecasting—whether we are predicting stock prices, hospital admission rates, or electricity demand—knowing what will happen is only half the battle. The other, often more critical half, is knowing how sure we are about that prediction. Imagine an AI predicting traffic flow. A prediction of “50 cars per minute” is useful. But a prediction of “50 cars per minute, give or take 5 cars” leads to very different decision-making than “50 cars per minute, give or take 40 cars.” This is the domain of probabilistic time series forecasting. ...
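
To make the distinction concrete, here is a small NumPy sketch contrasting a point forecast with a probabilistic one; the samples are synthetic stand-ins for draws from a model's predictive distribution, not output from the method in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical predictive distributions with the same point forecast.
calm_forecast = rng.normal(loc=50, scale=5, size=1000)       # "50, give or take 5"
volatile_forecast = rng.normal(loc=50, scale=40, size=1000)  # "50, give or take 40"

for name, samples in [("calm", calm_forecast), ("volatile", volatile_forecast)]:
    lo, hi = np.quantile(samples, [0.05, 0.95])
    print(f"{name}: point forecast {samples.mean():.0f}, 90% interval [{lo:.0f}, {hi:.0f}]")
```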

2025-05 · 8 min · 1669 words
[SPARSE-PIVOT: Dynamic correlation clustering for node insertions 🔗](https://arxiv.org/abs/2507.01830)

Taming Dynamic Graphs: How SPARSE-PIVOT Efficiently Clusters Data on the Fly

Imagine you are running a massive online store. Every minute, new items are added to your inventory. Your goal is to group these items based on similarity—putting all the “vintage leather jackets” in one cluster and “wireless gaming mice” in another. In a static world, you would have all the items in front of you, run a clustering algorithm once, and be done. But in the real world, data is dynamic. Items arrive one by one. You cannot afford to re-run a massive clustering operation every time a single new product is listed. You need an algorithm that updates the clusters on the fly, accurately and quickly. ...
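
For context, here is a minimal sketch of the classic static PIVOT routine for correlation clustering that this line of work builds on: pick a random unclustered node as a pivot, cluster it with its still-unclustered similar neighbors, and repeat. The toy similarity graph is hypothetical, and the paper's contribution is maintaining such a clustering efficiently as new nodes arrive, which this sketch does not do.

```python
import random

# Hypothetical "similar item" graph for a small inventory.
similar = {
    "jacket_1": {"jacket_2"},
    "jacket_2": {"jacket_1"},
    "mouse_1": {"mouse_2", "mouse_3"},
    "mouse_2": {"mouse_1", "mouse_3"},
    "mouse_3": {"mouse_1", "mouse_2"},
}

def pivot_clustering(nodes, similar, seed=0):
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)                    # random pivot order
    unclustered, clusters = set(nodes), []
    for pivot in order:
        if pivot in unclustered:
            cluster = {pivot} | (similar.get(pivot, set()) & unclustered)
            clusters.append(cluster)
            unclustered -= cluster
    return clusters

print(pivot_clustering(similar.keys(), similar))
```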

2025-07 · 10 min · 1941 words
[Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models 🔗](https://arxiv.org/abs/2502.16849)

From Random Guessing to Linear Success: The Mathematics of Pre-training

In the modern era of deep learning, we often take the “pre-train then fine-tune” paradigm for granted. We train a massive model (like BERT or GPT) on a mountain of unlabeled text, and then fine-tune it on a specific task with a smaller set of labeled data. Empirically, we know this works wonders. It stabilizes training and drastically reduces the amount of labeled data required. But why does it work? From a mathematical perspective, how does seeing unlabeled data change the geometry of the optimization landscape to make an impossible problem solvable? ...

2025-02 · 9 min · 1775 words
[Leveraging Diffusion Model as Pseudo-Anomalous Graph Generator for Graph-Level Anomaly Detection 🔗](https://openreview.net/pdf?id=Zm2M92TZyO)

Creating the Villain: How AGDiff Uses Diffusion to Generate Anomalies for Better Detection

Imagine you are a security guard tasked with spotting shoplifters. However, you have never actually seen a shoplifter in your life. You’ve only ever watched honest customers. This is the fundamental problem in Graph-Level Anomaly Detection (GLAD). In domains like biochemistry (detecting toxic molecules) or social network analysis (identifying bot networks), anomalies are rare, expensive to label, or entirely unknown. Traditional Artificial Intelligence struggles here because standard supervised learning requires examples of both “good” and “bad” data. When you only have “good” data (normal graphs), how do you teach a machine to recognize the “bad”? ...

10 min · 2038 words
[Counterfactual Graphical Models: Constraints and Inference 🔗](https://openreview.net/pdf?id=Z1qZoHa6ql)

Mastering the Multiverse: A New Framework for Counterfactual Reasoning in Causal AI

Human reasoning is fundamentally built on our ability to imagine worlds that never existed. We look back at a decision and ask, “What if I had taken the job in London?” or “Would the patient have survived if we started treatment a week earlier?” These are counterfactuals. They represent the highest rung on the ladder of causation. Unlike standard statistics, which deal with correlations in observed data, or even basic causal inference, which deals with interventions (“What happens if I do X?”), counterfactuals require us to compare the observed reality against a hypothetical alternative. ...
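
As a refresher on the machinery involved, here is a minimal sketch of Pearl's abduction-action-prediction recipe in a toy structural causal model; the model and numbers are hypothetical and only illustrate the three-step logic, not the paper's framework.

```python
# Toy structural equation: the patient recovers only if treated AND the hidden
# noise term (their underlying condition) is favorable.
def recovery(treatment, noise):
    return int(treatment == 1 and noise == 1)

# Observed world: the patient WAS treated and DID recover.
observed_treatment, observed_recovery = 1, 1

# 1. Abduction: keep only the noise values consistent with the observation.
consistent_noise = [u for u in (0, 1)
                    if recovery(observed_treatment, u) == observed_recovery]

# 2. Action: intervene in the hypothetical world by withholding treatment.
# 3. Prediction: re-run the structural equation with the inferred noise.
counterfactual = [recovery(0, u) for u in consistent_noise]
print(consistent_noise, counterfactual)  # [1] [0]: without treatment, no recovery
```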

10 min · 1955 words
[PASS: Private Attributes Protection with Stochastic Data Substitution 🔗](https://arxiv.org/abs/2506.07308)

Hiding in Plain Sight: How Stochastic Data Substitution Protects Privacy in AI

Imagine you are using a voice assistant. To understand your command, the system needs to analyze the content of your speech. However, your voice recording carries much more than just the words you spoke; it contains your gender, your accent, your approximate age, and potentially even your identity. This is the fundamental tension in modern Machine Learning (ML) services: to provide utility, they need data. But that data often comes coupled with sensitive, private attributes that users have no desire (and often no need) to share. ...

2025-06 · 8 min · 1644 words