![Cover image](https://deep-paper.org/en/paper/2502.06768/images/cover.png)
The Hard Road to Smarter Models: Why Masked Diffusion Beats Autoregression on Logic Puzzles
If you have used ChatGPT or any modern Large Language Model (LLM), you have interacted with an Autoregressive Model (ARM). These models generate text in a very specific way: token by token, from left to right. They are incredibly successful, but they are also rigid. They must decide what comes next based entirely on what came before. But what if the “next” token isn’t the easiest one to predict? What if the end of the sentence is easier to guess than the middle? ...
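The left-to-right constraint described above can be made concrete with a toy sketch. This is purely illustrative (the model is a random stand-in, and all names are invented here, not taken from the paper): an autoregressive generator must fill position t before position t+1, while a masked-style generator may fill positions in any order.

```python
# Toy contrast: autoregressive (left-to-right) vs. any-order generation.
# `fake_predict` is a stand-in for a real model's next-token prediction.
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "down"]

def fake_predict(partial_tokens):
    # A real model would condition on the visible tokens; here we just sample.
    return random.choice(VOCAB)

def autoregressive_generate(length):
    # ARMs commit to position 0, then 1, then 2, ... with no way back.
    tokens = []
    for _ in range(length):
        tokens.append(fake_predict(tokens))
    return tokens

def any_order_generate(length):
    # A masked model can unmask positions in any order it likes,
    # e.g. filling the "easy" end of a sentence before the middle.
    tokens = [None] * length
    for pos in random.sample(range(length), length):
        tokens[pos] = fake_predict(tokens)
    return tokens

print(autoregressive_generate(4))
print(any_order_generate(4))
```

The point of the contrast is the fill order, not the predictions themselves: both loops produce four tokens, but only the second is free to choose which position to resolve next.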