[FedSSI: Rehearsal-Free Continual Federated Learning with Synergistic Synaptic Intelligence 🔗](https://openreview.net/pdf?id=9hFQvmCl7P)

How to Learn Forever Without Forgetting: A Deep Dive into FedSSI

Imagine you are trying to learn a new language, say Spanish. You study hard for a month. Then, you switch gears to learn Python programming. A month later, you try to speak Spanish, but you find yourself struggling to recall basic vocabulary. Your brain has overwritten the old neural pathways to make room for the new syntax. In cognitive science and AI, this phenomenon is known as Catastrophic Forgetting. Now, imagine this problem at a massive scale involving thousands of smartphones or hospital servers. This is the challenge of Continual Federated Learning (CFL). Devices need to learn from new streams of data continuously without sharing that private data, all while remembering what they learned months ago. ...

8 min · 1644 words
[Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization 🔗](https://openreview.net/pdf?id=92oBV5HAGl)

Surgery on the Mind, Not the Mouth: How Mechanistic Unlearning Fixes LLM Knowledge Editing

Introduction Imagine you have trained a massive Large Language Model (LLM). It is brilliant, articulate, and knowledgeable. Unfortunately, it also memorized the home address of a celebrity, or perhaps it learned a dangerous recipe for a chemical weapon, or it simply believes that Michael Jordan plays baseball (which was only true for a brief, confusing stint in the 90s). You need to fix this. You need the model to “forget” the sensitive data or “edit” the wrong fact without destroying its ability to speak English or answer other questions. ...

9 min · 1815 words
[Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs 🔗](https://arxiv.org/abs/2506.09215)

Signal in the Noise: Why Standard Transformer Pooling Fails and How Adaptive Attention Fixes It

Introduction In biological systems, the ability to ignore irrelevant information is just as important as the ability to process relevant information. When you are at a loud cocktail party, your ears pick up sound waves from every direction—clinking glasses, background music, and a dozen overlapping conversations. Yet, your brain performs a remarkable feat of filtering: it attenuates the noise and amplifies the single conversation you are trying to hold. This selective attention is critical for survival in a complex, data-rich world. ...

2025-06 · 10 min · 2033 words
[Geometric Representation Condition Improves Equivariant Molecule Generation 🔗](https://arxiv.org/abs/2410.03655)

GeoRCG: Breaking the Complexity of 3D Molecule Generation with Geometric Representations

Introduction In the quest to accelerate drug discovery and material science, generative Artificial Intelligence has emerged as a formidable tool. The dream is simple: instead of screening billions of existing molecules to find one that works, we ask an AI to design the perfect molecule from scratch—one that binds to a specific protein, has low toxicity, and is easy to synthesize. However, the reality of 3D molecule generation is mathematically and computationally brutal. Molecules are not just strings of text (SMILES) or 2D drawings; they are dynamic 3D structures defined by quantum mechanics. To generate them, models must place atoms in precise 3D coordinates while adhering to strict physical laws and symmetries (such as rotation and translation). ...

2024-10 · 8 min · 1635 words
[RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding 🔗](https://arxiv.org/abs/2502.20330)

Breaking the Long-Context Bottleneck: How RAPID Merges RAG and Speculative Decoding

The capability of Large Language Models (LLMs) to process massive amounts of information has exploded. We have moved from context windows of a few thousand tokens to models capable of ingesting millions of words—entire books, codebases, or legal archives—in a single prompt. However, this “long-context” revolution comes with a steep price tag: latency. Processing a 128K token document is computationally expensive. As the context grows, the Key-Value (KV) cache operations become memory-bound, causing generation speeds to crawl. To mitigate this, developers often choose between two extremes: strictly using Retrieval-Augmented Generation (RAG), which is fast but risks missing the “big picture,” or Long-Context (LC) inference, which is comprehensive but agonizingly slow. ...

2025-02 · 8 min · 1666 words
[K²VAE: A Koopman-Kalman Enhanced Variational AutoEncoder for Probabilistic Time Series Forecasting 🔗](https://arxiv.org/abs/2505.23017)

Taming Chaos: How K²VAE Linearizes the Future for Accurate Time Series Forecasting

Predicting the future is one of the oldest human desires and one of the hardest mathematical challenges. In the realm of data science, this translates to Time Series Forecasting. While we have become quite good at predicting what happens in the next hour or day (short-term forecasting), predicting distant futures (long-term forecasting) remains a stumbling block. The difficulty compounds when we move from point forecasting (predicting a single value, like “25°C”) to Probabilistic Time Series Forecasting (PTSF), where we try to predict a distribution (e.g., “25°C with a standard deviation of 2°C”). We need these probability distributions to make high-stakes decisions in energy grids, financial markets, and supply chains. ...

2025-05 · 9 min · 1754 words
[Self-supervised Masked Graph Autoencoder via Structure-aware Curriculum 🔗](https://openreview.net/pdf?id=6gX4rP6QJW)

Learning Like a Student: How Curriculum Learning Boosts Graph Autoencoders

Introduction: The Problem with Randomness Imagine trying to learn a new language. You wouldn’t start by trying to write a complex dissertation on philosophy. You would start with the alphabet, then simple words, then sentences, and finally, complex paragraphs. This progression—from easy to hard—is fundamental to human learning. It builds confidence and ensures that foundational concepts are mastered before tackling difficult ones. In the world of Machine Learning, specifically Graph Neural Networks (GNNs), this concept is often ignored. ...

9 min · 1766 words
[SDP-CROWN: Efficient Bound Propagation for Neural Network Verification with Tightness of Semidefinite Programming 🔗](https://arxiv.org/abs/2506.06665)

Breaking the Cubic Barrier: How SDP-CROWN Brings Precise Verification to Large Neural Networks

Introduction In the world of safety-critical AI—think autonomous driving, medical diagnosis, or flight control systems—“pretty good” isn’t good enough. We need guarantees. We need to know for a fact that if a stop sign is slightly rotated or has a sticker on it, the car’s neural network won’t misclassify it as a speed limit sign. This is the domain of Neural Network Verification. The goal is to mathematically prove that for a given set of inputs (a “perturbation set”), the network will always produce the correct output. ...

2025-06 · 8 min · 1524 words
[TEDUO: Teaching the Environment Dynamics from Unlabeled Observations 🔗](https://arxiv.org/abs/2412.06877)

Bridging the Gap: How TEDUO Teaches LLMs to Act Using Unlabeled Data

Imagine you are trying to teach a robot how to “make a cup of coffee” by only letting it watch hours of video footage of people moving around a kitchen. The footage has no captions, no rewards, and no explanations. The robot sees a human pick up a mug, but it doesn’t know why. Was the goal to clean the mug? To move it? Or was it the first step in making coffee? ...

2024-12 · 11 min · 2317 words
[Everything Everywhere All at Once: LLMs Can In-Context Learn Multiple Tasks in Superposition 🔗](https://arxiv.org/abs/2410.05603)

Everything Everywhere All At Once: How LLMs Solve Multiple Tasks Simultaneously

There is a popular conceptualization of Large Language Models (LLMs) as “multiverse generators.” When you ask a model to complete a sentence, it isn’t just predicting the next word in one specific narrative; it is effectively weighing probabilities across countless potential continuations. But what happens when the prompt itself is ambiguous? What if the context provided to the model contains examples of two completely different tasks mixed together? Does the model get confused? Does it arbitrarily pick one path and stick to it? ...

2024-10 · 9 min · 1887 words
[Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It 🔗](https://arxiv.org/abs/2505.05409)

Beyond Flatness: How Riemannian Geometry Unlocks the Secrets of Transformer Generalization

The mystery of generalization—why a neural network trained on specific images or text performs well on data it has never seen before—is the “dark matter” problem of deep learning. For years, a leading hypothesis has been the concept of sharpness (or its inverse, flatness). The intuition is simple: if a neural network finds a solution in a “flat” valley of the loss landscape, the solution is robust. If the training data shifts slightly (simulating the difference between train and test distributions), the loss doesn’t skyrocket. Conversely, a “sharp” minimum means even a tiny shift results in high error. ...

2025-05 · 9 min · 1767 words
[Implicit Language Models are RNNs: Balancing Parallelization and Expressivity 🔗](https://arxiv.org/abs/2502.07827)

The Best of Both Worlds: How Implicit Models Bridge the Gap Between Transformers and RNNs

Introduction In the current landscape of Deep Learning, we are witnessing a massive tug-of-war between two fundamental properties: parallelization and expressivity. On one side, we have Transformers and State-Space Models (SSMs) like Mamba. These architectures dominate because they are highly parallelizable during training. You can feed them a sequence of text, and they process all tokens simultaneously using GPUs. However, there is a catch. Theoretically, these models belong to a complexity class (specifically \(TC^0\)) that cannot fully solve inherently sequential problems, such as tracking state in a finite state machine (FSM) or solving complex parity problems. They suffer from a “depth” limit. ...

2025-02 · 9 min · 1764 words
[Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models 🔗](https://arxiv.org/abs/2507.08254)

Raptor: How to Beat 3D Medical AI Models Without Training a Single Parameter

In the rapidly evolving world of artificial intelligence, a common assumption dictates the rules of the game: if you want a model to excel at a specific task, you must train it on a massive amount of specific data. If you want to diagnose diseases from 3D MRI scans, the conventional wisdom says you need to build a complex 3D neural network and feed it thousands of annotated medical volumes. ...

2025-07 · 9 min · 1817 words
[CoPINN: Cognitive Physics-Informed Neural Networks 🔗](https://openreview.net/pdf?id=4vAa0A98xI)

Teaching Physics to AI: How CoPINN Mimics Human Learning to Solve Complex Equations

Introduction Imagine you are trying to learn a complex new subject, like calculus or a new language. If you tried to learn the most difficult concepts immediately alongside the basics, you would likely get overwhelmed and fail. Instead, humans learn best via a “curriculum”: we master the easy concepts first, building confidence and understanding, before tackling the difficult problems. In the world of AI scientific computing, specifically Physics-Informed Neural Networks (PINNs), models often don’t have this luxury. They are typically forced to learn everything at once—simple smooth regions and complex chaotic boundaries simultaneously. This often leads to failure in critical areas. ...

7 min · 1398 words
[ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals 🔗](https://arxiv.org/abs/2412.14363)

ResQ: Breaking the 4-Bit Barrier in LLMs with Low-Rank Residuals

Introduction The capabilities of Large Language Models (LLMs) like Llama 3 and Qwen2.5 are growing at a staggering pace. However, as these models scale to hundreds of billions of parameters, the computational cost to run them—specifically during inference—is becoming prohibitive. Inference has two main bottlenecks: the compute-bound prefilling stage (processing your prompt) and the memory-bound generation stage (spitting out tokens one by one). To make these models run on standard hardware (or just run faster on data center GPUs), we rely on quantization—reducing the precision of the model’s numbers from 16-bit floating-point (FP16) to integers like 8-bit or 4-bit. While quantizing weights is a largely solved problem, quantizing activations (the temporary data flowing through the network) and the KV cache (the model’s memory of the conversation) to 4-bit without turning the model into gibberish remains a massive challenge. ...

2024-12 · 8 min · 1592 words
[Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration 🔗](https://arxiv.org/abs/2505.24688)

Beyond Temperature: Guiding LLM Thoughts with Soft Reasoning and Bayesian Optimization

If you have ever tried to get a Large Language Model (LLM) to solve a complex math problem or a tricky logic puzzle, you likely know the frustration of “hallucinations” or lazy reasoning. You ask a question, and the model confidently gives you the wrong answer. To fix this, we often rely on two main strategies. The first is Prompt Engineering—telling the model to “think step by step” (Chain-of-Thought). The second is Decoding Strategies—specifically, adjusting the “temperature.” If the model is stuck, we raise the temperature to increase randomness, hoping that in a batch of 10 or 20 generated answers, one will be correct. ...

2025-05 · 8 min · 1627 words
[Linearization Turns Neural Operators into Function-Valued Gaussian Processes 🔗](https://arxiv.org/abs/2406.05072)

From Weights to Functions: How LUNO Brings Uncertainty to Neural Operators

Introduction In the rapidly evolving world of Scientific Machine Learning (SciML), we are witnessing a paradigm shift. Researchers are no longer just training neural networks to recognize cats or generate text; they are training them to simulate the physical world. One of the most powerful tools in this domain is the Neural Operator. Unlike standard neural networks that map fixed-size vectors to vectors (like an image to a label), Neural Operators learn mappings between function spaces. They can take an initial condition of a physical system—say, the temperature distribution of a fluid—and predict how that function evolves over time, solving Partial Differential Equations (PDEs) orders of magnitude faster than traditional numerical solvers. ...

2024-06 · 10 min · 2019 words
[An Analysis for Reasoning Bias of Language Models with Small Initialization 🔗](https://arxiv.org/abs/2502.04375)

Nature vs. Nurture in LLMs: How Initialization Scale Dictates Reasoning over Memorization

1. Introduction One of the most spirited debates in the field of Artificial Intelligence today revolves around the true nature of Large Language Models (LLMs). When a model like GPT-4 solves a complex logic puzzle, is it genuinely reasoning—applying logical rules to derive an answer? Or is it merely acting as a “stochastic parrot,” retrieving memorized patterns from its vast training data? While much research focuses on data quality or model size to improve reasoning, a fascinating new study titled “An Analysis for Reasoning Bias of Language Models with Small Initialization” looks at a fundamental, often overlooked design choice: parameter initialization. ...

2025-02 · 9 min · 1840 words
[RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts 🔗](https://arxiv.org/abs/2411.15114)

Can AI Agents Replace AI Researchers? Inside the RE-Bench Evaluation

The prospect of Artificial Intelligence automating its own development is one of the most transformative—and potentially risky—concepts in modern computer science. If an AI system can conduct Research and Development (R&D) to improve itself, we could enter a feedback loop of accelerating capabilities. But how close are we to that reality? We know Large Language Models (LLMs) can write Python scripts and solve LeetCode problems. However, answering the question of self-improvement requires measuring something much harder: Research Engineering. ...

2024-11 · 9 min · 1774 words
[CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities 🔗](https://arxiv.org/abs/2503.17332)

Can AI Hack Real Websites? Unpacking CVE-Bench and the Future of Automated Red Teaming

The capabilities of Large Language Models (LLMs) have exploded in recent years. We have seen them write poetry, debug code, and even plan complex travel itineraries. But as these “agents” become more autonomous—capable of executing code, using tools, and reasoning through multi-step problems—a darker question arises: Can an AI agent autonomously hack a web application? This isn’t a hypothetical science fiction scenario. If an LLM can fix a bug in a GitHub repository, it can theoretically exploit a bug in a server. Understanding this risk is critical for cybersecurity professionals, developers, and policymakers. ...

2025-03 · 9 min · 1892 words