[Better to Teach than to Give: Domain Generalized Semantic Segmentation via Agent Queries with Diffusion Model Guidance 🔗](https://openreview.net/pdf?id=jvP1wbD0xh)

QueryDiff: Teaching Segmentation Models to Generalize with Diffusion Guidance

In the world of deep learning, there is an old proverb that fits surprisingly well: “Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.” In the context of computer vision, specifically Domain Generalized Semantic Segmentation (DGSS), “giving a fish” is analogous to data augmentation or generating synthetic data. If you want your self-driving car model (trained on a sunny simulator) to recognize a rainy street, the standard approach is to generate thousands of rainy images and feed them to the model. While this works to an extent, it is computationally expensive and limited by the diversity of the data you can generate. ...

8 min · 1663 words
[Procurement Auctions via Approximately Optimal Submodular Optimization 🔗](https://arxiv.org/abs/2411.13513)

Designing Truthful Auctions for Submodular Procurement

In the world of algorithmic game theory and large-scale logistics, a fundamental problem persists: how do you buy services from multiple people who might lie about their prices, while ensuring you get the best possible “bang for your buck”? Imagine a government agency trying to procure medical supplies from various vendors, or a crowdsourcing platform hiring freelancers to label data. The buyer (auctioneer) wants to select a set of sellers to maximize the total value of the service minus the cost paid. This is known as a procurement auction. ...

2024-11 · 9 min · 1728 words
[scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data 🔗](https://arxiv.org/abs/2506.10031)

Cracking the Cellular Code: A Deep Dive into Self-Supervised Learning for Single-Cell Genomics

Imagine trying to understand a complex city by looking at a satellite photo of the whole metropolitan area. You see the general layout, the highways, and the density, but you miss the individual people who make the city function. For a long time, this was the state of genomics. “Bulk sequencing” gave us an average view of millions of cells mashed together—a biological smoothie. ...

2025-06 · 8 min · 1608 words
[On Learning Parallel Pancakes with Mostly Uniform Weights 🔗](https://arxiv.org/abs/2504.15251)

Unstacking the Parallel Pancakes: The Complexity of Learning Mostly Uniform Gaussian Mixtures

In the world of high-dimensional statistics and machine learning, few problems are as classic or as stubborn as learning Gaussian Mixture Models (GMMs). We use them everywhere—from astrophysics to marketing—to model populations made up of different sub-groups. The theoretical landscape of GMMs is a tale of two extremes. On one hand, if the components are spherical (perfectly round blobs), we can learn them efficiently. On the other hand, if the components are arbitrary “pancakes” (highly flattened Gaussians) that are stacked parallel to each other, the problem becomes exponentially hard. In the worst case, learning a mixture of \(k\) Gaussians in \(d\) dimensions requires time \(d^{O(k)}\). ...

2025-04 · 9 min · 1890 words
[Geometric Hyena Network for Large-scale Equivariant Learning 🔗](https://arxiv.org/abs/2505.22560)

Beyond Self-Attention: Scaling Geometric Deep Learning with Geometric Hyena

In the world of deep learning for science, structure is everything. Whether it’s the folding of a protein, the twisting of an RNA strand, or the dynamics of a particle system, the geometric arrangement of atoms dictates function. To model these systems effectively, neural networks must understand two things: global context (how distant parts of a molecule interact) and equivariance (the laws of physics shouldn’t change just because you rotated the molecule). ...

2025-05 · 8 min · 1528 words
[Elucidating the Design Space of Multimodal Protein Language Models 🔗](https://arxiv.org/abs/2504.11454)

Building Better Protein Models: How to Fix the “Structure Gap” in Multimodal AI

Proteins are the molecular machinery of life. To understand biology—and to design new drugs—we need to understand two different “languages” of proteins: their sequence (the string of amino acids) and their structure (how they fold into 3D shapes). Historically, AI has treated these as separate problems. You had models like ESM for reading sequences and models like AlphaFold for predicting structures. But recently, researchers have been trying to merge these into Multimodal Protein Language Models (PLMs). Ideally, a single model should be able to read a sequence, understand its geometry, and generate new proteins that are both chemically valid and structurally sound. ...

2025-04 · 7 min · 1309 words
[BAXBENCH: Can LLMs Generate Correct and Secure Backends? 🔗](https://openreview.net/pdf?id=il3KRr4H9u)

Why AI Can't Build Your Backend Yet: A Deep Dive into BAXBENCH

The software development world is currently in the grip of a revolution driven by Large Language Models (LLMs). Tools like GitHub Copilot and ChatGPT have demonstrated an uncanny ability to auto-complete functions, write unit tests, and even solve complex algorithmic puzzles. It is tempting to believe that we are on the verge of fully autonomous software engineering, where an AI can take a high-level requirement and produce a deployment-ready application. ...

9 min · 1906 words
[Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios 🔗](https://arxiv.org/abs/2505.21387)

Cleaning the Noise: How AIRMVC Revolutionizes Multi-View Clustering

In the era of big data, we rarely rely on a single source of information to understand the world. Consider an autonomous vehicle: it doesn’t just look through a camera; it listens to sonar, measures distance with LiDAR, and checks GPS coordinates. This aggregation of diverse data sources is the foundation of Multi-View Clustering (MVC). By fusing information from different “views” (e.g., audio, video, text), machine learning models can achieve a level of understanding that a single view simply cannot match. ...

2025-05 · 10 min · 2013 words
[Primal-Dual Neural Algorithmic Reasoning 🔗](https://arxiv.org/abs/2505.24067)

Can Neural Networks Solve NP-Hard Problems? The Primal-Dual Approach

The intersection of classical algorithms and deep learning is one of the most fascinating frontiers in computer science. On one hand, we have classical algorithms—rigorous, interpretable, and guaranteed to work, but often rigid and unable to digest raw, messy real-world data. On the other hand, we have neural networks—flexible, adaptable, and capable of handling complex inputs, but often opaque and prone to hallucinating incorrect answers. Neural Algorithmic Reasoning (NAR) attempts to fuse these two worlds. The goal is to train neural networks to “reason” like algorithms. By teaching a network to mimic the steps of a classical algorithm (like Breadth-First Search or Bellman-Ford), we hope to create systems that generalize better than standard deep learning models. ...

2025-05 · 9 min · 1882 words
[Do Multiple Instance Learning Models Transfer? 🔗](https://openreview.net/pdf?id=hfLqdquVt3)

Why You Should Stop Training MIL Models from Scratch - The Power of Transfer Learning in Pathology

In the world of deep learning, particularly in computer vision and natural language processing (NLP), starting from scratch is almost a cardinal sin. You wouldn’t train a language model on a blank slate when you can fine-tune BERT or GPT; you wouldn’t train an image classifier on pixels when you can use weights from ImageNet. This concept, known as transfer learning, is the engine driving modern AI. However, in Computational Pathology (CPath)—the field dedicated to analyzing digitized tissue slides for cancer diagnosis—this standard practice hasn’t fully taken hold. When researchers build Multiple Instance Learning (MIL) models to analyze gigapixel whole slide images (WSIs), they almost exclusively initialize the aggregation networks with random weights. ...

7 min · 1438 words
[On the Benefits of Active Data Collection in Operator Learning 🔗](https://arxiv.org/abs/2410.19725)

Why Random Sampling Isn't Enough: The Power of Active Learning in Solving PDEs

If you have ever dabbled in scientific computing or machine learning for physics, you know the drill. You have a Partial Differential Equation (PDE), like the heat equation or the Navier-Stokes equation, that describes a physical system. Traditionally, solving these requires heavy numerical solvers that chew up computational resources. Enter Operator Learning. The goal here is to train a machine learning model to approximate the “solution operator” of the PDE. Instead of solving the equation from scratch every time, you feed the initial conditions or source terms into a neural network (or another estimator), and it spits out the solution almost instantly. ...

2024-10 · 8 min · 1658 words
[MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models 🔗](https://arxiv.org/abs/2501.00316)

Lost in Translation: Why Foundation Models Struggle with Real-World Maps

Imagine you are in a foreign city. You open a map app, looking for a coffee shop that is open right now, within a 10-minute walk, and has a rating above 4.0. You also need to spot it on a map filled with icons and street names. For a human, this is a standard navigational task involving visual scanning, spatial reasoning, and reading comprehension. Now, imagine asking an AI to do the same. While Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated incredible prowess in coding, creative writing, and general reasoning, their ability to navigate the physical world—represented through maps—remains a significant blind spot. ...

2025-01 · 8 min · 1579 words
[The Jailbreak Tax: How Useful are Your Jailbreak Outputs? 🔗](https://openreview.net/pdf?id=hRQyqtcjVv)

The Jailbreak Tax: Why Breaking AI Safety Rails Might Break the AI Too

The world of Large Language Model (LLM) security is often framed as a high-stakes game of cat and mouse. On one side, developers build guardrails to align models, preventing them from generating harmful content like bomb-making instructions or hate speech. On the other side, “red teamers” and adversaries develop “jailbreaks”—clever prompts designed to bypass these defenses. Until now, the primary metric for a successful jailbreak has been binary: Did the model refuse, or did it answer? ...

8 min · 1566 words
[ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation 🔗](https://arxiv.org/abs/2502.13581)

Beyond IDs: How ActionPiece Brings Context to Generative Recommendation

In the world of Recommender Systems, we are currently witnessing a paradigm shift. The field is moving away from traditional classification-based methods—which select the best item from a massive, fixed pool—toward Generative Recommendation (GR). Inspired by the success of Large Language Models (LLMs) like GPT, GR models treat user behavior as a language. They “tokenize” user actions and train models to autoregressively predict the next token in a sequence. ...

2025-02 · 10 min · 2002 words
[Functional Alignment Can Mislead: Examining Model Stitching 🔗](https://openreview.net/pdf?id=glLqTK9En3)

Just Because It Fits: Why Neural Network Alignment Doesn't Mean What You Think It Means

One of the great mysteries of modern artificial intelligence is the “black box” problem. We know that deep neural networks work—often surprisingly well—but we don’t always know how they represent the data they are processing. Does a model classify a bird because it sees wings, or because it hears a song, or because it detects a specific texture in the background? To answer these questions, researchers have developed various tools to compare the internal representations of different models. A popular and intuitive method is called Model Stitching. The logic goes like this: if you can take the first half of Model A, stitch it to the second half of Model B, and the Frankenstein-monster combination still works, then Model A and Model B must be “thinking” in similar ways. They must be functionally aligned. ...

9 min · 1795 words
[Stronger Neyman Regret Guarantees for Adaptive Experimental Design 🔗](https://arxiv.org/abs/2502.17427)

Beyond A/B Testing - How Adaptive Algorithms Are Revolutionizing Experimental Design

Imagine you are running a clinical trial for a new drug, or perhaps testing a new feature on a high-traffic e-commerce website. In the traditional “A/B testing” world, you might flip a coin: 50% of people get the treatment, 50% get the control. You run this for a month, collect data, and analyze the results. But what if, halfway through the month, the data starts hinting that the treatment group has much higher variance than the control group? Or what if certain subgroups of people react differently? A fixed 50/50 split is rarely the most efficient way to estimate the truth. It wastes samples and time. ...

2025-02 · 7 min · 1396 words
[Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry 🔗](https://arxiv.org/abs/2503.18114)

Seeing How They Think: Unlocking Neural Network Dynamics with Manifold Geometry

How does a neural network actually learn? If you look at the raw numbers—the billions of synaptic weights—you see a chaotic storm of floating-point adjustments. If you look at the loss curve, you see a line going down. But neither of these tells you how the network is structuring information. For a long time, researchers have categorized deep learning into two distinct regimes: the Lazy regime and the Rich regime. In the lazy regime, the network barely touches its internal features, acting like a glorified kernel machine. In the rich regime, the network actively sculpts complex, task-specific features. ...

2025-03 · 8 min · 1700 words
[Learning with Exact Invariances in Polynomial Time 🔗](https://arxiv.org/abs/2502.19758)

Breaking the Symmetry Barrier - How to Learn Exact Invariances in Polynomial Time

In the natural sciences and physics, symmetry is everything. Whether you are analyzing the energy of a molecule, the dynamics of a fluid, or the structure of a crystal, the fundamental laws of nature often remain unchanged under certain transformations—like rotation, reflection, or translation. In machine learning, we call these properties invariances. Ideally, if we train a model to predict the energy of a molecule, rotating that molecule in 3D space should not change the model’s prediction. However, teaching machines to respect these symmetries exactly has historically been a massive computational headache. ...

2025-02 · 7 min · 1396 words
[Masked Autoencoders Are Effective Tokenizers for Diffusion Models 🔗](https://arxiv.org/abs/2502.03444)

Beyond VAEs: How Masking Makes Autoencoders Effective Tokenizers for Diffusion

If you have been following the explosion of generative AI, you are likely familiar with Latent Diffusion Models (LDMs), the architecture behind heavyweights like Stable Diffusion. The secret sauce of LDMs is efficiency: instead of generating an image pixel-by-pixel, they operate in a compressed “latent space.” This compression is handled by a tokenizer (usually an autoencoder). For years, the standard advice has been to use Variational Autoencoders (VAEs). VAEs enforce a smooth, Gaussian distribution on the latent space, which theoretically makes it easier for the diffusion model to learn. But there is a trade-off: that smoothness constraint often results in blurry reconstructions and limits the fidelity of the final image. ...

2025-02 · 8 min · 1667 words
[Discovering Symbolic Cognitive Models from Human and Animal Behavior 🔗](https://openreview.net/pdf?id=dhRXGWJ027)

Can LLMs Act as Cognitive Scientists? Discovering How Brains Learn with CogFunSearch

In the world of neuroscience and psychology, there is a constant tension between prediction and understanding. If you want to simply predict what a human or animal will do next, you might train a massive Recurrent Neural Network (RNN) on their behavioral data. The RNN will likely achieve high accuracy, but it acts as a “black box.” It gives you the answer, but it doesn’t tell you how the brain solved the problem. It doesn’t offer a scientific theory. ...

8 min · 1572 words