ICML 2025

[AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses 🔗](https://arxiv.org/abs/2503.01811)

The Reality Gap: Can LLMs Actually Break Real-World AI Defenses?

We are living in the era of the “AI Agent.” We have moved past simple chatbots that write poems; we now evaluate Large Language Models (LLMs) on their ability to reason, plan, and interact with software environments. Benchmarks like SWE-Bench test if an AI can fix GitHub issues, while others test if they can browse the web or solve capture-the-flag (CTF) security challenges. But there is a lingering question in the research community: Do these benchmarks reflect reality? ...

[In-context denoising with one-layer transformers: connections between attention and associative memory retrieval 🔗](https://arxiv.org/abs/2502.05164)

Transformers as Bayesian Denoisers: How Attention Mimics Associative Memory

The Transformer architecture has undeniably revolutionized deep learning. From LLMs like GPT-4 to vision models, the “Attention is All You Need” paradigm is ubiquitous. Yet, despite their massive success, we are still playing catch-up in understanding why they work so well. How does a mechanism designed for machine translation become a general-purpose in-context learner? One of the most compelling theories gaining traction links Transformers to associative memory—specifically, Dense Associative Memory (DAM) or modern Hopfield Networks. The idea is that the attention mechanism isn’t just “attending” to parts of a sequence; it is performing an energy minimization step to retrieve memories. ...
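To make the "attention as memory retrieval" idea concrete, here is a minimal sketch of a single softmax-attention step acting as a modern Hopfield-style update: a noisy query is replaced by a softmax-weighted average of the stored patterns. The function names, the toy patterns, and the inverse temperature `beta` are illustrative assumptions, not the construction used in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_retrieve(query, memories, beta=4.0):
    """One attention step: weight each stored pattern by softmax(beta * <query, pattern>)
    and return the weighted average of the patterns."""
    scores = [beta * sum(q * m for q, m in zip(query, mem)) for mem in memories]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * mem[i] for w, mem in zip(weights, memories)) for i in range(dim)]

# Hypothetical toy memory: three orthogonal patterns.
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.2, -0.1]  # a corrupted copy of the first pattern

retrieved = attention_retrieve(noisy_query, memories, beta=8.0)
print(retrieved)  # dominated by the first stored pattern
```

A larger `beta` makes the softmax sharper, so the output snaps closer to the single best-matching memory; a smaller `beta` blends several memories together.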

[Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise 🔗](https://arxiv.org/abs/2502.11413)

Why Multiclass Classification with Noisy Labels is Surprisingly Hard

In the world of machine learning theory, there is often a stark difference between what works for two classes and what works for three or more. We see this in various domains, but a recent paper titled “Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise” highlights a particularly dramatic gap in the complexity of learning from noisy data. The problem of Multiclass Linear Classification (MLC) is a textbook staple: given data points in a high-dimensional space, can we find linear boundaries that separate them into \(k\) distinct classes? When the labels are clean (perfectly accurate), this is solvable efficiently. When we have Random Classification Noise (RCN)—where labels are randomly flipped based on a noise matrix—the story gets complicated. ...
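As a rough illustration of what "labels randomly flipped based on a noise matrix" means, the sketch below samples noisy labels from a row-stochastic matrix `H`, where `H[i][j]` is the probability that true class `i` is observed as class `j`. The matrix values, the sample size, and the helper name `noisy_label` are assumptions chosen for the example; they do not come from the paper.

```python
import random

def noisy_label(true_label, noise_matrix, rng):
    """Sample an observed label from the row of the noise matrix for the true label."""
    row = noise_matrix[true_label]
    return rng.choices(range(len(row)), weights=row, k=1)[0]

# Hypothetical 3-class noise matrix; each row sums to 1.
H = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]

rng = random.Random(0)
clean = [rng.randrange(3) for _ in range(10000)]
noisy = [noisy_label(y, H, rng) for y in clean]

# Fraction of labels that were changed by the noise process.
flip_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
print(f"empirical flip rate: {flip_rate:.3f}")
```

With two classes the noise matrix has only two free parameters; with \(k\) classes it has \(k(k-1)\), which gives some intuition for why the multiclass setting leaves far more room for adversarially structured noise.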

[SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs 🔗](https://arxiv.org/abs/2406.19593)

Scaling Up Multimodal RAG: How Synthetic Data is Solving the Knowledge Gap in Vision-Language Models

Imagine showing an AI a photo of a rare, specific species of bird perched on a branch and asking, “What is the migration pattern of this bird?” A standard Multimodal Large Language Model (MLLM) like GPT-4V or LLaVA might recognize the bird correctly. However, if the specific migration details weren’t prevalent in its pre-training data, the model might “hallucinate,” confidently inventing a migration route that doesn’t exist. This is a persistent reliability issue in AI: models are great at looking, but they don’t always know everything about what they see. ...

[Hierarchical Refinement: Optimal Transport to Infinity and Beyond 🔗](https://arxiv.org/abs/2503.03025)

Scaling the Unscalable: How Hierarchical Refinement Solves Optimal Transport for Millions of Points

In the world of machine learning, alignment is everything. Whether you are training a generative model to map noise to images, aligning single-cell genomic data across different time points, or translating between distinct domains, you are essentially asking the same question: What is the best way to move mass from distribution A to distribution B? This is the core question of Optimal Transport (OT). For decades, OT has been the gold standard for comparing and aligning probability distributions because it respects the underlying geometry of the data. It seeks the “least effort” path to transport one dataset onto another. ...
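The "least effort" idea can be sketched for the simplest case: two equal-sized point sets with uniform mass, where an optimal transport plan can always be taken to be a one-to-one assignment. The brute-force search below is only an illustration of the objective, not the paper's Hierarchical Refinement algorithm; the point coordinates and function names are made up for the example.

```python
from itertools import permutations

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_transport(source, target):
    """Try every one-to-one assignment and keep the cheapest one.
    Cost grows factorially with the number of points, so this only works for tiny inputs."""
    n = len(source)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(squared_distance(source[i], target[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost

# Hypothetical toy data: the target is a shuffled, slightly shifted copy of the source.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(0.1, 1.1), (0.1, 0.1), (1.1, 0.1)]

assignment, total_cost = brute_force_transport(source, target)
print(assignment, total_cost)  # assignment[i] is the target index matched to source[i]
```

Because the search space is \(n!\), even a few dozen points are out of reach for this approach, which is the scaling problem that motivates faster OT solvers.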

[Generative Social Choice: The Next Generation 🔗](https://arxiv.org/abs/2505.22939)

Democracy by AI? How to Scale Social Choice with Large Language Models

In traditional democratic processes, the menu of options is usually fixed. You vote for Candidate A or Candidate B; you choose between Policy X and Policy Y. But what happens when the goal isn’t just to choose from a pre-defined list, but to synthesize the complex, unstructured opinions of thousands of people into a coherent set of representative statements? This is the challenge of Generative Social Choice. Imagine a town hall meeting with 10,000 residents. It is impossible to let everyone speak, and it is equally difficult to have a human moderator manually summarize every distinct viewpoint without bias. Recently, researchers have turned to Large Language Models (LLMs) to solve this. Systems like Polis have already been used in Taiwan and by the United Nations to cluster opinions. However, moving from “clustering” to a mathematically rigorous selection of representative statements is a hard problem. ...

[COLLABLLM: From Passive Responders to Active Collaborators 🔗](https://arxiv.org/abs/2502.00640)

Stop Being Passive: How COLLABLLM Teaches AI to Actually Collaborate

We have all been there. You ask a Large Language Model (LLM) a vague question, and it immediately spits out a generic, confident answer. It doesn’t ask for clarification. It doesn’t check if it understands your underlying goal. It just… responds. You then spend the next ten minutes prompting back and forth, correcting its assumptions, until you finally get what you wanted. This happens because modern LLMs are typically “passive responders.” They are trained to maximize the likelihood of the very next response, satisfying the immediate query without considering the long-term trajectory of the conversation. ...

[Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions 🔗](https://arxiv.org/abs/2502.06768)

The Hard Road to Smarter Models: Why Masked Diffusion Beats Autoregression on Logic Puzzles

If you have used ChatGPT or any modern Large Language Model (LLM), you have interacted with an Autoregressive Model (ARM). These models generate text in a very specific way: token by token, from left to right. They are incredibly successful, but they are also rigid. They must decide what comes next based entirely on what came before. But what if the “next” token isn’t the easiest one to predict? What if the end of the sentence is easier to guess than the middle? ...

[EMBODIEDBENCH: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents 🔗](https://arxiv.org/abs/2502.09560)

EmbodiedBench: Can Multimodal LLMs Actually Control Robots?

We are currently witnessing a golden age of Multi-modal Large Language Models (MLLMs). Models like GPT-4o, Gemini, and Claude can analyze complex images, write poetry, and code entire applications. Naturally, the next frontier is embodied AI: taking these “brains” and putting them inside a robot (or a simulation of one) to navigate the physical world and manipulate objects. The dream is a generalist robot that can understand a command like “clean up the kitchen,” see a mess, and figure out the thousands of tiny muscle movements required to tidy up. However, there is a significant gap between chatting about a task and physically doing it. ...

[Theoretical Limitations of Ensembles in the Age of Overparameterization 🔗](https://arxiv.org/abs/2410.16201)

The Ensemble Illusion: Why Deep Ensembles Might Just Be Large Models in Disguise

In the classical era of machine learning, “ensembling” was the closest thing to a free lunch. If you trained a single decision tree, it might overfit. But if you trained a hundred trees and averaged their predictions (a Random Forest), you got a robust, highly accurate model. The intuition was simple: different models make different mistakes, so averaging them cancels out the noise. ...

[Near-Optimal Decision Trees in a SPLIT Second 🔗](https://arxiv.org/abs/2502.15988)

The Best of Both Worlds: How SPLIT Achieves Optimal Decision Trees at Greedy Speeds

In the world of machine learning, there is often a painful trade-off between interpretability (understanding why a model made a prediction) and performance (how accurate that prediction is). Decision trees are the poster child for interpretability. They mimic human reasoning: “If X is true, check Y; if Y is false, predict Z.” However, building the perfect decision tree, one that is both highly accurate and sparse (few nodes), is incredibly difficult. This brings us to a second trade-off: optimality vs. scalability. ...

[Neural Discovery in Mathematics: Do Machines Dream of Colored Planes? 🔗](https://openreview.net/pdf?id=7Tp9zjP9At)

When Neural Networks Paint the Infinite: Solving Combinatorial Geometry Problems with AI

Mathematics is often viewed as a discipline of rigid logic and absolute proofs. A statement is either true or false; a theorem is proven or unproven. However, the process of reaching a proof is frequently messy, relying on intuition, visualization, and trial-and-error. In recent years, a fascinating question has emerged: Can Artificial Intelligence act as an intuition engine for pure mathematics? Can it “dream up” geometric constructions that human mathematicians have overlooked? ...

[Polynomial-Delay MAG Listing with Novel Locally Complete Orientation Rules 🔗](https://openreview.net/pdf?id=70voOlSPos)

Unlocking Causal Secrets - How to Efficiently List Hidden Variable Graphs

In the world of Artificial Intelligence and Causal Inference, we often deal with incomplete pictures. We observe data, such as symptoms in a patient or economic indicators in a country, but we rarely see the full machinery driving these observations. There are almost always “latent variables,” hidden factors that influence what we see but remain unrecorded. When we try to map out causal relationships in the presence of these hidden factors, we don’t get a single, clear map. Instead, we get a “class” of possible maps. Navigating this class to find every specific, valid causal explanation is a massive computational headache. ...

[rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking 🔗](https://arxiv.org/abs/2501.04519)

How Small AI Models Are Beating GPT-4 at Math: The rStar-Math Revolution

For a long time, the prevailing wisdom in Artificial Intelligence was simple: bigger is better. If you wanted a model to solve complex calculus or high-school Olympiad math problems, you needed hundreds of billions of parameters, massive computational resources, and a model like GPT-4 or Claude 3.5. Small Language Models (SLMs), typically in the 1 billion to 8 billion parameter range, were considered efficient assistants for basic tasks but incapable of deep, multi-step reasoning. ...

[Omnibench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities 🔗](https://openreview.net/pdf?id=4tFSKOY2mT)

Beyond Linear Tasks: How OmniBench Reveals the True Limits of Virtual Agents

We are currently witnessing a golden age of Multimodal Large Language Models (MLLMs). From GPT-4o to Claude 3.5, these models are no longer just text processors; they are evolving into “virtual agents” capable of seeing screens, clicking buttons, and navigating the web. The dream is to have a digital assistant that can handle complex workflows, like “Download the sales report from email, visualize the data in Excel, and Slack the chart to the manager.” ...

[Prices, Bids, Values: One ML-Powered Combinatorial Auction to Rule Them All 🔗](https://arxiv.org/abs/2411.09355)

The Best of Both Worlds: How Hybrid ML Auctions Solve the Efficiency-Cognitive Load Trade-off

Allocating resources efficiently is one of the fundamental problems in economics. When the resources are simple—like shares of a company—standard markets work well. But what happens when the items are distinct, but their values are interconnected? Consider a government auctioning spectrum licenses to telecom companies. A license for New York City is valuable, and a license for Philadelphia is valuable. But to a telecom provider, having both might be worth significantly more than the sum of the parts because it allows them to build a continuous network. Conversely, two different frequencies in the same city might be substitutes. ...

[SpeechSSM: Long-Form Speech Generation with State-Space Models 🔗](https://arxiv.org/abs/2412.18603)

Breaking the Silence: How SpeechSSM Masters Long-Form Audio Generation

Imagine asking an AI to tell you a bedtime story, not by reading text you provide, but by hallucinating a brand-new narrative, in a human voice, complete with pauses, sighs, and intonation. Now, imagine asking it to keep going for twenty minutes. For years, this has been the “final boss” of Generative Spoken Language Modeling. While models like AudioLM or GSLM can generate impressive snippets of speech lasting 10 or 20 seconds, they inevitably fall apart over longer durations. They begin to ramble incoherently, get stuck in repetitive loops, or simply dissolve into static. The computational cost of remembering the beginning of a conversation while generating the middle becomes astronomically high. ...

[Statistical Collusion by Collectives on Learning Platforms 🔗](https://arxiv.org/abs/2502.04879)

Strength in Numbers: How Collectives Can Statistically Guarantee Influence on AI Platforms

In the modern digital ecosystem, the relationship between users and platforms is often viewed as a one-way street: the platform extracts data, trains algorithms, and dictates outcomes. But what happens when users band together? Imagine a scenario where a group of gig economy workers wants to influence an algorithm to improve their wages, or a consumer advocacy group wants to prevent a recommendation system from promoting a specific, harmful product. This is the domain of Algorithmic Collective Action. ...

[Not All Explanations for Deep Learning Phenomena Are Equally Valuable 🔗](https://openreview.net/pdf?id=cw7MYyDL33)

Stop Solving Puzzles—Why Deep Learning Needs Pragmatism Over Ad Hoc Hypotheses

In the fast-paced world of artificial intelligence, researchers love a good mystery. Over the last few years, the community has been captivated by strange behaviors in neural networks that seem to defy the fundamental laws of statistics and learning theory. Models that get smarter after they massively overfit? Test error that goes down, then up, then down again? These phenomena—known as grokking, double descent, and the lottery ticket hypothesis—have spawned thousands of papers attempting to explain them. ...

[Algorithm Development in Neural Networks: Insights from the Streaming Parity Task 🔗](https://arxiv.org/abs/2507.09897)

From Memorization to Algorithms: How RNNs Learn to Generalize Infinitely

One of the most profound mysteries in deep learning is the phenomenon of generalization. We typically understand generalization through the lens of interpolation: if a neural network sees enough training examples (dots on a graph), it learns a smooth curve that connects them, allowing it to predict values for points situated between the training examples. However, in certain settings, neural networks exhibit a behavior that defies this interpolation-based explanation. They learn to extrapolate. They can handle inputs far beyond the bounds of their training data, sometimes infinitely so. When a network trained on short sequences suddenly solves a task for sequences thousands of times longer, it implies the network hasn’t just fitted a curve; it has discovered an underlying algorithm. ...