[From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models 🔗](https://openreview.net/pdf?id=zdOGBRQEbz)

Decoding the Black Box of Biology: How Sparse Autoencoders Reveal What Protein Models Actually Learn

The intersection of Artificial Intelligence and Biology has produced some of the most remarkable scientific breakthroughs of the last decade. Tools like AlphaFold have solved the protein structure prediction problem, and Protein Language Models (pLMs) can now generate novel proteins or predict function with uncanny accuracy. However, there is a catch. While these models are incredibly useful, they function largely as “black boxes.” We feed a sequence of amino acids into the model, and it outputs a structure or a function prediction. But how does it do it? Does the model actually “understand” biophysics, or is it simply memorizing statistical correlations? ...

9 min · 1898 words
[On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning 🔗](https://openreview.net/pdf?id=zU4VCPHYRC)

The Cost of Paranoia: Why Robust Distributed Learning Fails When Everyone is Honest

In the world of machine learning, bigger is often better. Training massive models requires massive amounts of data and compute power, leading to the widespread adoption of distributed learning. We split the work across hundreds or thousands of workers (GPUs, mobile devices, etc.), and a central server aggregates their results to update a global model. But there is a catch. In scenarios like Federated Learning, the central server has little control over the workers. Some workers might be faulty, crashing and sending garbage data. Worse, some might be malicious “Byzantine” attackers, intentionally sending mathematically crafted gradients to destroy the model’s performance. ...
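
The mechanics are easy to demonstrate. Below is a minimal sketch (a generic illustration, not the paper's analysis) of why plain gradient averaging is fragile and why a robust aggregator such as the coordinate-wise median survives a single attacker:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight honest workers send noisy estimates of the true gradient.
true_grad = np.array([1.0, -2.0, 0.5])
honest = true_grad + 0.1 * rng.standard_normal((8, 3))

# One Byzantine worker sends an arbitrarily large adversarial vector.
byzantine = np.array([[1e6, 1e6, 1e6]])
grads = np.vstack([honest, byzantine])

print(grads.mean(axis=0))        # hijacked: dominated by the 1e6 entries
print(np.median(grads, axis=0))  # coordinate-wise median stays near true_grad
```

The paper's tension lives in the reverse case: when every worker is honest, the median discards information the mean would use, so the robust aggregator pays an accuracy penalty for its paranoia.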

9 min · 1726 words
[Lightweight Protocols for Distributed Private Quantile Estimation 🔗](https://arxiv.org/abs/2502.02990)

Finding the Median in the Crowd: How Adaptive Algorithms Solve Private Quantile Estimation

In the era of big data, we are constantly faced with a paradox. On one hand, we need to aggregate information from millions of user devices—smartphones, wearables, and IoT sensors—to understand population trends. On the other hand, this data is often deeply personal. Whether it is salary information, health metrics, or screen-time usage, users (and laws) demand privacy. How do we calculate simple statistics, like the median salary or the 95th percentile of app usage, without ever collecting the raw data? ...
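
To make the task concrete, here is a toy sketch of one classical central-DP baseline (a noisy histogram whose cumulative counts locate the median), not the lightweight distributed protocols the paper proposes; the function name, bucket count, and privacy budget below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_median(values, lo, hi, n_bins=64, eps=1.0):
    # One user moves one histogram bin, so the L1 sensitivity is 1.
    counts, edges = np.histogram(values, bins=n_bins, range=(lo, hi))
    noisy = counts + rng.laplace(scale=1.0 / eps, size=n_bins)
    cdf = np.cumsum(noisy)
    # Walk the noisy CDF to the bucket containing the halfway point.
    k = min(int(np.searchsorted(cdf, cdf[-1] / 2)), n_bins - 1)
    return 0.5 * (edges[k] + edges[k + 1])

salaries = rng.lognormal(mean=10.5, sigma=0.4, size=100_000)
print(dp_median(salaries, lo=0.0, hi=200_000.0))  # lands near the true median
```

The setting the paper targets is harder still: the raw values never leave the users' devices, which is where adaptive, low-communication protocols come in.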

2025-02 · 8 min · 1651 words
[Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development 🔗](https://arxiv.org/abs/2407.11784)

Breaking Silos: How Data-Juicer Sandbox Revolutionizes Multimodal AI Training

In the rapidly evolving landscape of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—AI capable of processing and generating text, images, and video simultaneously—have taken center stage. From GPT-4 to Sora, these models are pushing the boundaries of creativity and functionality. However, beneath the surface of these impressive capabilities lies a persistent engineering bottleneck: the “chicken and egg” problem of data and model development. Historically, the paths to improving AI have been bifurcated. You have model-centric development, where researchers obsess over architecture tweaks and training algorithms, often assuming the data is a fixed variable. Then, you have data-centric development, where engineers clean and curate massive datasets, often relying on intuition or heuristics without knowing exactly how that data will interact with a specific model until the expensive training run is finished. ...

2024-07 · 9 min · 1873 words
[P(all-atom) Is Unlocking New Path For Protein Design 🔗](https://openreview.net/pdf?id=yXRixu0ONY)

Beyond the Backbone: How Pallatom Unlocks All-Atom Protein Design

Protein design has long been described as the “inverse protein folding problem.” If folding is nature’s way of turning a sequence of amino acids into a 3D structure, design is our attempt to find a sequence that will fold into a specific, desired shape. For years, this field has been dominated by a divide-and-conquer strategy. Researchers typically generate a backbone structure first (the ribbon), and then use a separate model to “paint” the sequence and side-chains onto that backbone. While effective, this approach ignores a fundamental biological reality: a protein’s backbone and its side-chains are intimately linked. The specific chemistry of the atoms determines the fold, and the fold determines which atoms fit. ...

8 min · 1618 words
[Novelty Detection in Reinforcement Learning with World Models 🔗](https://arxiv.org/abs/2310.08731)

When the World Changes: Detecting Novelty in RL Without Thresholds

Imagine you have trained a robot to navigate a maze. It has spent millions of simulation steps learning that “red tile means lava” and “green tile means goal.” It is perfect. Then, you deploy it into the real world. Suddenly, the lighting changes, or a door that was always open is now locked, or the floor becomes slippery. In Reinforcement Learning (RL), this is known as novelty—a sudden, permanent shift in the environment’s dynamics or visuals that the agent never anticipated during training. ...

2023-10 · 10 min · 2021 words
[Relational Invariant Learning for Robust Solvation Free Energy Prediction 🔗](https://openreview.net/pdf?id=xVBfdltHST)

Beyond the Lab Bench: Predicting Molecular Behavior in Unknown Solvents with AI

In the world of drug discovery and material science, the environment is everything. A molecule that behaves perfectly in water might act completely differently in ethanol or acetone. This phenomenon, known as solvation, is central to how chemical and pharmaceutical processes work. Predicting solvation free energy—the energy change when a solute (like a drug molecule) dissolves in a solvent—is a “holy grail” task for computational chemistry. If we can predict this accurately using AI, we can screen millions of drug candidates without stepping into a wet lab. ...

8 min · 1633 words
[FlashTP: Fused, Sparsity-Aware Tensor Product for Machine Learning Interatomic Potentials 🔗](https://openreview.net/pdf?id=wiQe95BPaB)

Breaking the Bottleneck: How FlashTP Accelerates Equivariant Molecular Dynamics

Simulating the atomic world is one of the holy grails of computational science. From discovering new battery materials to designing novel drugs, Molecular Dynamics (MD) simulations allow us to peek into the movement of atoms over time. Historically, scientists have had to choose between two extremes: Quantum Mechanical methods (highly accurate but painfully slow) or classical force fields (fast but often inaccurate). In recent years, a third contender has emerged: Machine Learning Interatomic Potentials (MLIPs). Specifically, Equivariant MLIPs have revolutionized the field by achieving quantum-level accuracy with significantly better efficiency. They respect the physical laws of symmetry—if you rotate a molecule, the forces acting on it should rotate with it. ...
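
Equivariance is a checkable property, not just a slogan. The sketch below (a toy pairwise potential, nothing to do with FlashTP's fused kernels) verifies that rotating a molecule's coordinates rotates the resulting forces in lockstep:

```python
import numpy as np

rng = np.random.default_rng(2)

def forces(pos):
    """Forces from a toy pairwise 1/r potential (analytic gradient)."""
    f = np.zeros_like(pos)
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                d = pos[i] - pos[j]
                f[i] += d / np.linalg.norm(d) ** 3  # -grad of 1/r
    return f

pos = rng.standard_normal((5, 3))                 # five atoms in 3-D
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal transform

# Equivariance: rotating the inputs equals rotating the outputs.
print(np.allclose(forces(pos @ Q.T), forces(pos) @ Q.T))  # True
```

Equivariant MLIPs bake this symmetry into every layer through tensor-product operations, which is exactly the kernel FlashTP fuses and accelerates.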

9 min · 1875 words
[Nonparametric Teaching for Graph Property Learners 🔗](https://arxiv.org/abs/2505.14170)

Faster GCN Training: How a 'Teacher' Can Speed Up Graph Learning by 40%

In the world of machine learning, data is often neat and tabular. But in the real world—especially in biology, chemistry, and social sciences—data is messy and interconnected. It comes in the form of graphs: molecules are atoms connected by bonds; social networks are people connected by friendships. To make sense of this data, we use Graph Convolutional Networks (GCNs). These powerful neural networks can predict properties like whether a molecule is soluble in water or which social group a user belongs to. We call this task Graph Property Learning. ...
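
For readers who have not met GCNs, the propagation rule is compact: each layer mixes a node's features with its neighbors' and applies a learned linear map. Here is a minimal NumPy sketch of the standard GCN layer of Kipf and Welling (the generic operation, not the paper's teaching algorithm):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))                       # add self-loops
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)  # symmetric normalization
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# A 4-node toy graph, e.g. atoms joined by bonds.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(3)
H = rng.standard_normal((4, 3))   # 3 input features per node
W = rng.standard_normal((3, 8))   # learnable weights
print(gcn_layer(A, H, W).shape)   # (4, 8): an 8-dim embedding per node
```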

2025-05 · 9 min · 1820 words
[Learning to (Learn at Test Time): RNNs with Expressive Hidden States 🔗](https://arxiv.org/abs/2407.04620)

The Model That Learns While It Reads: Explaining Test-Time Training (TTT) Layers

In the world of Large Language Models (LLMs), there is a constant tug-of-war between two architectural paradigms: Transformers and Recurrent Neural Networks (RNNs). Transformers, powered by Self-Attention, are the reigning champions. They are brilliant at handling long context because they explicitly remember every token they have seen (stored in the Key-Value cache). However, this memory comes at a quadratic cost (\(O(T^2)\)). As the sequence gets longer, the computation required grows explosively. ...
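
The blow-up is easy to see in code. This toy single-head decoder (illustrative only) caches one key and value per token, so step \(t\) touches \(t\) cached entries and the whole sequence costs \(O(T^2)\):

```python
import numpy as np

rng = np.random.default_rng(4)
d, K_cache, V_cache = 16, [], []   # the KV cache grows with every token

def attend(q):
    """One decoding step: attend over all cached tokens -> O(t) work."""
    K, V = np.stack(K_cache), np.stack(V_cache)
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V

total_ops = 0
for t in range(1, 1001):
    k, v, q = rng.standard_normal((3, d))
    K_cache.append(k); V_cache.append(v)
    _ = attend(q)
    total_ops += t                 # step t reads t cached tokens

print(total_ops)  # 500500 = T(T+1)/2, i.e. O(T^2) over the sequence
```

A TTT layer replaces the ever-growing cache with a fixed-size hidden state that is itself a small model, updated by a learning step on each incoming token, so context is compressed into weights instead of stored verbatim.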

2024-07 · 9 min · 1709 words
[Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion 🔗](https://arxiv.org/abs/2502.00264)

Smoothing the Landscape: How Rotation Symmetry Unlocks Better Transformer Fusion

Deep learning has a fascinating, somewhat counter-intuitive property: if you train two identical neural network architectures on the same data, they will learn to perform the task equally well, yet their internal weights will look completely different. This phenomenon poses a significant challenge for Model Fusion—the practice of merging multiple trained models into a single, superior model without accessing the original training data. If you simply average the weights of two distinct models (a technique often used in Federated Learning or model ensembling), the result is usually a “broken” model with poor performance. Why? Because the models, while functionally similar, are not aligned in parameter space. ...
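
The simplest version of this misalignment is permutation symmetry; the paper's contribution is extending the idea to rotation symmetry in Transformer blocks. A minimal sketch (toy MLP, illustrative only) of why naive averaging breaks and alignment fixes it:

```python
import numpy as np

rng = np.random.default_rng(5)
W1, W2 = rng.standard_normal((6, 4)), rng.standard_normal((2, 6))
x = rng.standard_normal(4)

def mlp(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # y = W2 relu(W1 x)

# Permuting the hidden units yields different weights, identical function.
P = np.eye(6)[rng.permutation(6)]
W1b, W2b = P @ W1, W2 @ P.T
print(np.allclose(mlp(W1, W2, x), mlp(W1b, W2b, x)))   # True

# Naive averaging of the two equivalent models breaks the function...
print(np.allclose(mlp((W1 + W1b) / 2, (W2 + W2b) / 2, x),
                  mlp(W1, W2, x)))                     # False

# ...but averaging after un-permuting model B back onto model A is exact.
print(np.allclose(mlp((W1 + P.T @ W1b) / 2, (W2 + W2b @ P) / 2, x),
                  mlp(W1, W2, x)))                     # True
```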

2025-02 · 9 min · 1785 words
[Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance 🔗](https://arxiv.org/abs/2402.08680)

MARINE: A Training-Free Framework to Stop Vision-Language Models from Hallucinating

The rapid rise of Large Vision-Language Models (LVLMs) like LLaVA, mPLUG-Owl, and GPT-4V has revolutionized how machines understand the world. By aligning visual encoders with powerful Large Language Models (LLMs), these systems can look at an image and describe it, answer complex questions about it, or even reason through visual problems. However, despite their impressive capabilities, these models suffer from a critical and often embarrassing flaw: Object Hallucination. Object hallucination occurs when an LVLM confidently describes objects in an image that simply aren’t there. For a casual user, this might result in a funny caption. But in safety-critical domains—such as medical imaging analysis or autonomous navigation—a model “seeing” a tumor that doesn’t exist or a stop sign that isn’t present poses severe risks. ...

2024-02 · 8 min · 1589 words
[Latent Diffusion Planning for Imitation Learning 🔗](https://arxiv.org/abs/2504.16925)

Escaping the Expert Data Bottleneck: A Guide to Latent Diffusion Planning

In the world of robotics, data is currency. Over the last few years, we have seen a massive surge in the capabilities of robot policies, largely driven by Imitation Learning (IL). The formula seems simple: collect a massive dataset of a human expert performing a task (like folding a towel or opening a door), and train a neural network to copy those movements. However, there is a catch. This “expert data” is incredibly expensive. It requires humans to teleoperate robots for hours, meticulously labeling every state with a precise action. Meanwhile, there exists a vast ocean of “cheap” data that we mostly ignore: videos of robots attempting tasks and failing (suboptimal data), or videos of humans or robots doing things without the specific motor commands recorded (action-free data). ...

2025-04 · 10 min · 2007 words
[Neural Encoding and Decoding at Scale 🔗](https://arxiv.org/abs/2504.08201)

Bridging the Gap: How NEDS Creates a Unified Language for Brain and Behavior

Understanding the brain is fundamentally a translation problem. On one side, we have the biological reality: neurons firing electrical spikes in complex, rhythmic patterns. On the other side, we have the observable output: movement, choices, and behavior. For decades, computational neuroscience has treated this translation as two separate, distinct tasks. If you use behavior to predict neural activity, you are performing neural encoding. If you use neural activity to predict behavior, you are performing neural decoding. Historically, models were designed to do one or the other. You built an encoder to understand how the brain represents the world, or a decoder to control a robotic arm with neural signals. ...
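
In their simplest linear form, the two directions are just a pair of regressions fit on the same recording; this toy sketch (synthetic data, not the paper's architecture) makes the duality concrete:

```python
import numpy as np

rng = np.random.default_rng(6)
behavior = rng.standard_normal((500, 3))          # e.g. wheel speed, choice, ...
spikes = (behavior @ rng.standard_normal((3, 50))
          + 0.1 * rng.standard_normal((500, 50)))  # 50 synthetic neurons

# Encoding: behavior -> neural activity.
W_enc, *_ = np.linalg.lstsq(behavior, spikes, rcond=None)
# Decoding: neural activity -> behavior.
W_dec, *_ = np.linalg.lstsq(spikes, behavior, rcond=None)

print(W_enc.shape, W_dec.shape)  # (3, 50) and (50, 3): one relationship, two directions
```

The paper's premise is that modeling both directions jointly, at scale, yields a shared representation of brain and behavior that neither task would learn alone.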

2025-04 · 9 min · 1746 words
[UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control 🔗](https://arxiv.org/abs/2502.05749)

Bridging the Gap: Unifying Diffusion Models with Stochastic Optimal Control

Diffusion models have fundamentally changed the landscape of generative AI. From DALL-E to Stable Diffusion, the ability to generate high-fidelity images from Gaussian noise is nothing short of magical. However, standard diffusion models have a specific limitation: they generally assume a transition from a standard Gaussian distribution (pure noise) to a data distribution (an image). But what if you don’t want to start from noise? What if you want to transition from one specific distribution to another? Consider image restoration: you want to move from a “Low-Quality” (LQ) distribution—blurry, rainy, or masked images—to a “High-Quality” (HQ) distribution. This requires a Diffusion Bridge. ...
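
The textbook example of such a pinned process is the Brownian bridge, whose drift steers the state toward a fixed endpoint \(y\) at time \(T\) (a standard construction, far simpler than UniDB's stochastic-optimal-control formulation):

\[
\mathrm{d}x_t = \frac{y - x_t}{T - t}\,\mathrm{d}t + \mathrm{d}w_t, \qquad x_0 \sim p_{\mathrm{LQ}}, \quad x_T = y .
\]

As \(t \to T\) the drift grows without bound, dragging every trajectory onto \(y\); diffusion bridges generalize this from a single pinned point to an entire high-quality target distribution.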

2025-02 · 9 min · 1731 words
[GMAIL: Generative Modality Alignment for generated Image Learning 🔗](https://openreview.net/pdf?id=u6xeKVHS6K)

Bridging the Reality Gap: How GMAIL Aligns Synthetic and Real Images for Better AI Training

We are currently living in the “Golden Age” of generative AI. Models like Stable Diffusion and DALL-E 3 can conjure photorealistic images from simple text descriptions in seconds. For machine learning researchers and students, this creates a tantalizing possibility: Infinite Training Data. Imagine you want to train a vision system to recognize rare objects or complex scenes. Instead of spending months collecting and labeling real-world photos, why not just generate millions of synthetic images? It sounds like the perfect solution to the data scarcity problem. ...

8 min · 1510 words
[Discovering a Zero (Zero-Vector Class of Machine Learning) 🔗](https://openreview.net/pdf?id=u3n5wuRGTa)

The Mathematics of Nothing: How Discovering the 'Zero-Vector' Class Could Revolutionize Neural Networks

In the way humans learn, there is a distinct difference between “knowing what a cat is” and “knowing what a cat is not.” When you visualize a cat, you are identifying a specific set of features—whiskers, pointed ears, a tail. You do not define a cat by looking at the entire universe and subtracting dogs, cars, and trees. However, traditional machine learning classification often works the latter way. A standard neural network classifier, when trained to distinguish between two classes (say, Class 1 and Class 2), typically slices the entire feature space into two regions. Every single point in the universe of data must belong to either Class 1 or Class 2. ...
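
You can see this exhaustive partitioning in a few lines of NumPy: a two-class linear head (weights drawn at random here purely for illustration) has no way to answer "neither":

```python
import numpy as np

rng = np.random.default_rng(7)
W, b = rng.standard_normal((2, 2)), rng.standard_normal(2)

def predict(x):
    return int(np.argmax(W @ x + b))   # always 0 or 1 -- no "neither" option

# Even absurd, far-away inputs are forced into Class 1 or Class 2.
for x in [np.zeros(2), np.array([1e6, -1e6]), 100 * rng.standard_normal(2)]:
    print(predict(x))
```

Every input, however implausible, lands in one of the two regions; there is no region reserved for "none of the above", which is the gap the zero-vector class is meant to fill.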

10 min · 1923 words
[Efficient Source-free Unlearning via Energy-Guided Data Synthesis and Discrimination-Aware Multitask Optimization 🔗](https://openreview.net/pdf?id=tqL8gJsuS5)

Unlearning Without Source Data: A Deep Dive into the DSDA Framework

In the modern era of Artificial Intelligence, data is the new oil. But unlike oil, data often comes with strings attached: privacy regulations. With the enforcement of laws like the European Union’s GDPR and the California Consumer Privacy Act (CCPA), individuals have gained the “right to be forgotten.” This means that if a user requests their data be deleted, any machine learning model trained on that data must theoretically “unlearn” it. ...

9 min · 1823 words
[Visual and Domain Knowledge for Professional-level Graph-of-Thought Medical Reasoning 🔗](https://openreview.net/pdf?id=tnyxtaSve5)

Beyond Pattern Recognition: Teaching AI to Think Like a Doctor with Clinical Graph of Thoughts

In the last few years, we have witnessed a massive leap in the capabilities of Large Vision-Language Models (LVLMs). Models like GPT-4o and Gemini can describe a photo of a busy street, write code from a whiteboard sketch, or explain a meme. However, when we shift our gaze from general internet images to the high-stakes world of medicine, these “foundation models” often hit a wall. Why? Because medical diagnosis isn’t just about identifying objects. It is about reasoning. ...

8 min · 1653 words
[Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems 🔗](https://openreview.net/pdf?id=teUg2pMrF0)

How LLMs are Revolutionizing Large-Scale Optimization: Introducing LLM-LNS

In the world of computer science and operations research, scaling is the ultimate boss fight. Solving a logistics problem with ten delivery trucks is a homework assignment; solving it with ten thousand trucks, changing traffic conditions, and time windows is a computational nightmare. These massive challenges are often framed as Mixed Integer Linear Programming (MILP) problems. While we have powerful commercial solvers like Gurobi and SCIP, they often hit a wall as problem sizes explode. The search space grows exponentially, making exact solutions impossible to find in a reasonable timeframe. ...
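
For reference, a MILP has the canonical form

\[
\min_{x}\; c^{\top} x \quad \text{s.t.} \quad A x \le b, \qquad x_j \in \mathbb{Z} \ \text{for } j \in \mathcal{I},
\]

and it is the integrality constraints on the index set \(\mathcal{I}\) that make the problem combinatorial: relax them and you get an easy linear program; keep them and branch-and-bound solvers like Gurobi and SCIP must explore an exponentially growing search tree.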

8 min · 1626 words