[An Online Adaptive Sampling Algorithm for Stochastic Difference-of-convex Optimization with Time-varying Distributions 🔗](https://openreview.net/pdf?id=QmIzUuspWo)

Taming the Chaos: Adaptive Sampling for Optimization Under Distribution Shift

In the world of machine learning and operations research, textbook problems often assume that the data comes from a single, static distribution. You train your model, the data behaves politely, and you find an optimal solution. But the real world is rarely so cooperative. Financial markets fluctuate, user preferences drift, and sensor networks experience environmental changes. In these scenarios, the underlying probability distribution generating your data changes over time. This is the domain of time-varying distributions. ...

8 min · 1697 words
[A Generalization Result for Convergence in Learning-to-optimize 🔗](https://arxiv.org/abs/2410.07704)

Trusting the Black Box: Proving Convergence for Learned Optimizers

Imagine you have a race car. You can tune the engine yourself (manual optimization), or you can train an AI to tune it for you (Learning-to-Optimize). The AI version is often significantly faster, zooming past the finish line while you’re still tweaking the carburetor. But there is a catch: Can you trust the AI? In classical optimization (like Gradient Descent or Adam), we have mathematical proofs guaranteeing that, eventually, the car will stop at the finish line (a critical point). In Learning-to-Optimize (L2O), the algorithm is often a neural network—a “black box.” Historically, proving that this black box will actually converge has been a nightmare. To get a guarantee, researchers often had to “safeguard” the AI, essentially forcing it to behave like a slow, classical algorithm, which defeats the purpose of using AI in the first place. ...

2024-10 · 8 min · 1514 words
[Conformal Prediction as Bayesian Quadrature 🔗](https://arxiv.org/abs/2502.13228)

Bridging the Gap—How Bayesian Quadrature Improves Conformal Prediction

Machine learning models are increasingly deployed in high-stakes environments—from diagnosing diseases to steering autonomous vehicles. In these settings, “accuracy” isn’t enough; we need safety. We need to know that the model will not make catastrophic errors. To address this, the field has rallied around Conformal Prediction, a powerful framework that wraps around “black-box” models to provide statistical guarantees. For example, instead of just predicting “Cat,” a conformal predictor outputs a set {"Cat", "Dog"} and guarantees that the true label is in that set 95% of the time. ...
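
For intuition, here is a minimal split-conformal classification sketch in NumPy. It is an illustrative baseline, not the paper's Bayesian quadrature method, and the names (`cal_probs`, `cal_labels`, `test_probs` for calibration softmax outputs, calibration labels, and test softmax outputs) are hypothetical.

```python
import numpy as np

# Minimal split conformal prediction sketch (illustrative baseline).
def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(cal_scores, level, method="higher")
    # A label joins the prediction set if its score does not exceed the threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```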

2025-02 · 9 min · 1877 words
[Auditing f-Differential Privacy in One Run 🔗](https://arxiv.org/abs/2410.22235)

Closing the Gap in Privacy Auditing with f-Differential Privacy

In the rapidly evolving landscape of machine learning, Differential Privacy (DP) has become the gold standard for training models on sensitive data. Theoretically, DP guarantees that the contribution of any single individual to a dataset does not significantly affect the model’s output. However, a significant gap often exists between theory and practice. Implementation bugs, floating-point errors, or loose theoretical analysis can lead to models that are less private than claimed. ...

2024-10 · 8 min · 1650 words
[ADASPLASH: Adaptive Sparse Flash Attention 🔗](https://arxiv.org/abs/2502.12082)

Can We Make Attention Sparse *and* Fast? A Deep Dive into ADASPLASH

The Transformer architecture has revolutionized natural language processing, but it harbors a well-known secret: it is notoriously inefficient at scale. The culprit is the self-attention mechanism. In its standard form, every token in a sequence attends to every other token. If you double the length of your input document, the computational cost doesn’t just double—it quadruples. This is the infamous \(O(n^2)\) complexity. For years, researchers have known that this dense attention is often wasteful. When you read a book, you don’t focus on every single word on every previous page simultaneously to understand the current sentence. You focus on a few key context clues. In machine learning terms, attention probability distributions are often sparse—peaking around a few relevant tokens while the rest are near-zero noise. ...
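
To make the quadratic blow-up concrete, here is a bare-bones dense attention pass in NumPy. This is the baseline that ADASPLASH tries to avoid, not the method itself.

```python
import numpy as np

# Dense self-attention: the (n, n) score matrix is exactly the quadratic cost.
def dense_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # shape (n, n): every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # doubling n quadruples the work in `scores`
```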

2025-02 · 9 min · 1748 words
[The dark side of the forces: assessing non-conservative force models for atomistic machine learning 🔗](https://arxiv.org/abs/2412.11569)

The Dark Side of the Forces: Why Energy Conservation Matters in AI Chemistry

In the computational chemistry revolution, Machine Learning (ML) has become the new lightsaber. It cuts through the heavy computational cost of density functional theory (DFT) and quantum mechanics, allowing researchers to simulate larger systems for longer times than ever before. The premise is simple: train a neural network to predict how atoms interact, and you can model everything from drug discovery to battery materials at lightning speeds. However, as we push for faster and more scalable models, a fundamental debate has emerged regarding the “laws of physics” we impose on these networks. Traditionally, interatomic forces are calculated as the derivative of potential energy—a method that guarantees energy conservation. But a new wave of “non-conservative” models suggests we can skip the energy calculation and predict forces directly, trading physical rigor for computational speed. ...
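
As a toy illustration of the conservative recipe (not the paper's ML models), the force is the negative derivative of a single scalar energy. Here a Lennard-Jones pair energy stands in for the learned potential, with a finite-difference derivative.

```python
import numpy as np

# Toy conservative force: F(r) = -dE/dr for a Lennard-Jones pair energy.
def lj_energy(r, eps=1.0, sigma=1.0):
    return 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

def conservative_force(r, h=1e-6):
    # Central-difference approximation of the negative derivative.
    return -(lj_energy(r + h) - lj_energy(r - h)) / (2 * h)

# A "non-conservative" model would instead predict forces with a separate output
# head, with no guarantee that they are the gradient of any energy.
```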

2024-12 · 6 min · 1268 words
[MGD3: Mode-Guided Dataset Distillation using Diffusion Models 🔗](https://openreview.net/pdf?id=NIe74CY9lk)

Scaling Down to Scale Up: How MGD³ Distills Datasets Without Fine-Tuning

In the modern era of deep learning, the mantra has largely been “bigger is better.” We build massive models and feed them even more massive datasets. However, this trajectory hits a wall when it comes to computational resources and storage. Not every researcher or student has access to a cluster of H100 GPUs. This bottleneck has given rise to a fascinating field of study: Dataset Distillation. ...

9 min · 1815 words
[Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning 🔗](https://arxiv.org/abs/2502.07560)

Stopping the Drift: How to Fix Catastrophic Forgetting in Continual Learning

Imagine you are learning to play the piano. You spend months mastering classical music. Then, you decide to learn jazz. As you immerse yourself in jazz chords and improvisation, you suddenly realize you’re struggling to remember the classical pieces you once played perfectly. In the world of Artificial Intelligence, this phenomenon is known as Catastrophic Forgetting. When a neural network learns a new task, it tends to overwrite the parameters it optimized for previous tasks. ...

2025-02 · 8 min · 1554 words
[A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization 🔗](https://arxiv.org/abs/2501.18756)

Unifying Bayesian Optimization: How Expected Improvement is Actually Entropy Search in Disguise

In the world of machine learning, we are often tasked with optimizing “black-box” functions—functions that are expensive to evaluate, have no known gradients, and are essentially mysterious boxes where you put in an input \(x\) and get out a noisy output \(y\). This is the domain of Bayesian Optimization (BO). If you have studied BO, you know there is a bit of a divide in the community regarding Acquisition Functions (AFs)—the mathematical rules that decide where to sample next. ...
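
For reference, the textbook closed-form Expected Improvement under a Gaussian posterior looks like this (a standard sketch, not the paper's unified derivation); `mu` and `sigma` are the surrogate's posterior mean and standard deviation at candidate points, and `best` is the incumbent value.

```python
import numpy as np
from scipy.stats import norm

# Closed-form Expected Improvement acquisition under a Gaussian posterior.
def expected_improvement(mu, sigma, best, minimize=True):
    improve = (best - mu) if minimize else (mu - best)
    z = improve / np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    return improve * norm.cdf(z) + sigma * norm.pdf(z)
```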

2025-01 · 9 min · 1846 words
[Sundial: A Family of Highly Capable Time Series Foundation Models 🔗](https://arxiv.org/abs/2502.00816)

Reading the Sundial: How Generative Flow Matching is Revolutionizing Time Series Forecasting

Time series forecasting is one of the oldest mathematical problems humans have tried to solve. From ancient civilizations predicting crop cycles to modern algorithms trading stocks in microseconds, the goal remains the same: use the past to predict the future. However, time series data is intrinsically non-deterministic. No matter how much historical data you have, the future is never a single, fixed point—it is a distribution of possibilities. In recent years, the success of Large Language Models (LLMs) has prompted researchers to treat time series forecasting as a language problem. If we can predict the next word in a sentence, can’t we predict the next value in a sequence? While this approach has yielded results, it fundamentally forces continuous data (like temperature or stock prices) into discrete “tokens” (like words in a dictionary). This conversion often results in a loss of precision and context. ...

2025-02 · 9 min · 1775 words
[Expected Variational Inequalities 🔗](https://arxiv.org/abs/2502.18605)

Escaping the Intractability Trap: How Expected Variational Inequalities Revolutionize Equilibrium Computation

In the worlds of computer science, economics, and engineering, we are often obsessed with finding a state of balance. Whether it’s predicting traffic flow in a congested city, pricing options in finance, or finding a Nash equilibrium in a complex multiplayer game, the mathematical tool of choice is often the Variational Inequality (VI). VIs are incredibly expressive. They provide a unified framework to model almost any system where agents compete for resources or optimize objectives under constraints. But there is a catch: they are notoriously difficult to solve. In computational complexity terms, finding a solution to a general VI is PPAD-hard. This means that for many real-world problems, efficient algorithms simply do not exist—or at least, we haven’t found them yet. ...

2025-02 · 8 min · 1636 words
[Learning dynamics in linear recurrent neural networks 🔗](https://openreview.net/pdf?id=KGOcrIWYnx)

Unlocking Time - How Linear RNNs Actually Learn Temporal Tasks

Recurrent Neural Networks (RNNs) are the workhorses of temporal computing. From the resurgence of state-space models like Mamba in modern machine learning to modeling cognitive dynamics in neuroscience, RNNs are everywhere. We know that they work—they can capture dependencies over time, integrate information, and model dynamic systems. But there is a glaring gap in our understanding: we don’t really know how they learn. Most theoretical analysis of RNNs looks at the model after training is complete. This is akin to trying to understand how a skyscraper was built by only looking at the finished building. To truly understand the emergence of intelligence—artificial or biological—we need to look at the construction process itself: the learning dynamics. ...

9 min · 1854 words
[Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks 🔗](https://arxiv.org/abs/2507.02119)

The Hidden Universality of Neural Training and the Mystery of Supercollapse

If you have ever trained a large neural network, you know the process can feel a bit like alchemy. You mix a dataset, an architecture, and an optimizer, staring at the loss curve as it (hopefully) goes down. We have developed “Scaling Laws”—empirical power laws that predict the final performance of a model based on its size and compute budget. But the path the model takes to get there—the training dynamics—has largely remained a messy, unpredictable black box. ...

2025-07 · 8 min · 1500 words
[AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses 🔗](https://arxiv.org/abs/2503.01811)

The Reality Gap: Can LLMs Actually Break Real-World AI Defenses?

We are living in the era of the “AI Agent.” We have moved past simple chatbots that write poems; we now evaluate Large Language Models (LLMs) on their ability to reason, plan, and interact with software environments. Benchmarks like SWE-Bench test if an AI can fix GitHub issues, while others test if they can browse the web or solve capture-the-flag (CTF) security challenges. But there is a lingering question in the research community: Do these benchmarks reflect reality? ...

2025-03 · 8 min · 1565 words
[In-context denoising with one-layer transformers: connections between attention and associative memory retrieval 🔗](https://arxiv.org/abs/2502.05164)

Transformers as Bayesian Denoisers: How Attention Mimics Associative Memory

The Transformer architecture has undeniably revolutionized deep learning. From LLMs like GPT-4 to vision models, the “Attention is All You Need” paradigm is ubiquitous. Yet, despite their massive success, we are still playing catch-up in understanding why they work so well. How does a mechanism designed for machine translation become a general-purpose in-context learner? One of the most compelling theories gaining traction links Transformers to associative memory—specifically, Dense Associative Memory (DAM) or modern Hopfield Networks. The idea is that the attention mechanism isn’t just “attending” to parts of a sequence; it is performing an energy minimization step to retrieve memories. ...
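
A rough sketch of the correspondence (illustrative, with a hypothetical inverse-temperature `beta`): a single softmax-weighted average over stored patterns acts as one retrieval step of a modern Hopfield network, and it has the same shape as an attention update.

```python
import numpy as np

# One Hopfield-style retrieval step, written as softmax attention over memories.
def hopfield_retrieve(query, memories, beta=4.0):
    logits = beta * memories @ query          # similarity to each stored pattern
    weights = np.exp(logits - logits.max())   # softmax over patterns
    weights /= weights.sum()
    return weights @ memories                 # retrieved (denoised) pattern
```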

2025-02 · 10 min · 2042 words
[Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise 🔗](https://arxiv.org/abs/2502.11413)

Why Multiclass Classification with Noisy Labels is Surprisingly Hard

In the world of machine learning theory, there is often a stark difference between what works for two classes and what works for three or more. We see this in various domains, but a recent paper titled “Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise” highlights a particularly dramatic gap in the complexity of learning from noisy data. The problem of Multiclass Linear Classification (MLC) is a textbook staple: given data points in a high-dimensional space, can we find linear boundaries that separate them into \(k\) distinct classes? When the labels are clean (perfectly accurate), this is solvable efficiently. When we have Random Classification Noise (RCN)—where labels are randomly flipped based on a noise matrix—the story gets complicated. ...
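
To picture the RCN model, here is a tiny sketch (illustrative, not the paper's hardness construction): each clean label i is replaced by label j with probability H[i, j], where H is a row-stochastic noise matrix.

```python
import numpy as np

# Apply Random Classification Noise with a row-stochastic noise matrix H.
def apply_rcn(labels, H, seed=0):
    rng = np.random.default_rng(seed)
    k = H.shape[0]
    return np.array([rng.choice(k, p=H[y]) for y in labels])

# Example: 3 classes, each label kept with probability 0.8 and flipped to each
# of the other classes with probability 0.1.
H = np.full((3, 3), 0.1) + 0.7 * np.eye(3)
noisy = apply_rcn(np.array([0, 1, 2, 2, 1]), H)
```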

2025-02 · 10 min · 2024 words
[SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs 🔗](https://arxiv.org/abs/2406.19593)

Scaling Up Multimodal RAG: How Synthetic Data is Solving the Knowledge Gap in Vision-Language Models

Imagine showing an AI a photo of a rare, specific species of bird perched on a branch and asking, “What is the migration pattern of this bird?” A standard Multimodal Large Language Model (MLLM) like GPT-4V or LLaVA might recognize the bird correctly. However, if the specific migration details weren’t prevalent in its pre-training data, the model might “hallucinate”—confidently inventing a migration route that doesn’t exist. This is a persistent reliability issue in AI: models are great at looking, but they don’t always know everything about what they see. ...

2024-06 · 8 min · 1682 words
[Hierarchical Refinement: Optimal Transport to Infinity and Beyond 🔗](https://arxiv.org/abs/2503.03025)

Scaling the Unscalable: How Hierarchical Refinement Solves Optimal Transport for Millions of Points

In the world of machine learning, alignment is everything. Whether you are training a generative model to map noise to images, aligning single-cell genomic data across different time points, or translating between distinct domains, you are essentially asking the same question: What is the best way to move mass from distribution A to distribution B? This is the core question of Optimal Transport (OT). For decades, OT has been the gold standard for comparing and aligning probability distributions because it respects the underlying geometry of the data. It seeks the “least effort” path to transport one dataset onto another. ...
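
For scale intuition, here is a brute-force baseline (not the paper's hierarchical refinement): between two equal-size point clouds, exact OT reduces to a linear assignment problem, which is fine for thousands of points and hopeless for millions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Exact OT between equal-size point clouds via linear assignment (brute force).
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 2))                             # source points
B = rng.normal(loc=2.0, size=(500, 2))                    # target points
cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
rows, cols = linear_sum_assignment(cost)                  # one-to-one transport plan
print("average transport cost:", cost[rows, cols].mean())
```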

2025-03 · 11 min · 2257 words
[Generative Social Choice: The Next Generation 🔗](https://arxiv.org/abs/2505.22939)

Democracy by AI? How to Scale Social Choice with Large Language Models

In traditional democratic processes, the menu of options is usually fixed. You vote for Candidate A or Candidate B; you choose between Policy X or Policy Y. But what happens when the goal isn’t just to choose from a pre-defined list, but to synthesize the complex, unstructured opinions of thousands of people into a coherent set of representative statements? This is the challenge of Generative Social Choice. Imagine a town hall meeting with 10,000 residents. It is impossible to let everyone speak, and it is equally difficult to have a human moderator manually summarize every distinct viewpoint without bias. Recently, researchers have turned to Large Language Models (LLMs) to solve this. Systems like Polis have already been used in Taiwan and by the United Nations to cluster opinions. However, moving from “clustering” to a mathematically rigorous selection of representative statements is a hard problem. ...

2025-05 · 10 min · 1992 words
[COLLABLLM: From Passive Responders to Active Collaborators 🔗](https://arxiv.org/abs/2502.00640)

Stop Being Passive: How COLLABLLM Teaches AI to Actually Collaborate

We have all been there. You ask a Large Language Model (LLM) a vague question, and it immediately spits out a generic, confident answer. It doesn’t ask for clarification. It doesn’t check if it understands your underlying goal. It just… responds. You then spend the next ten minutes prompting back and forth, correcting its assumptions, until you finally get what you wanted. This happens because modern LLMs are typically “passive responders.” They are trained to maximize the likelihood of the very next response, satisfying the immediate query without considering the long-term trajectory of the conversation. ...

2025-02 · 9 min · 1781 words