Machine learning models are increasingly deployed in high-stakes environments—from diagnosing diseases to steering autonomous vehicles. In these settings, “accuracy” isn’t enough; we need safety. We need to know that the model will not make catastrophic errors.

To address this, the field has rallied around Conformal Prediction, a powerful framework that wraps around “black-box” models to provide statistical guarantees. For example, instead of just predicting “Cat,” a conformal predictor outputs a set {"Cat", "Dog"} and guarantees that the true label is in that set 95% of the time.

However, standard conformal prediction has a limitation: it is based on frequentist statistics. Its guarantees are “marginal,” meaning they hold on average over many possible datasets, but not necessarily for the specific calibration data you have in hand. It treats the world as having a fixed, unknown state, preventing us from easily incorporating prior knowledge or assessing the distribution of potential risks.

In the paper “Conformal Prediction as Bayesian Quadrature,” researchers Jake C. Snell and Thomas L. Griffiths propose a fascinating shift in perspective. By reframing conformal prediction as a Bayesian Quadrature problem—essentially treating risk calculation as an integration problem—they develop a method that provides interpretable, data-conditional guarantees.

In this post, we will break down how this method works, how it connects two seemingly different fields of mathematics, and why it leads to safer machine learning deployments.


1. The Problem with “Average” Guarantees

Before diving into the solution, let’s establish the baseline. Standard approaches like Split Conformal Prediction or Conformal Risk Control rely on a calibration set—a small dataset where we know the true answers—to tune the model.

The goal is to choose a threshold parameter, \(\lambda\), such that the risk (expected loss) is below a target level \(\alpha\).

\[
P\big(Y \notin C(X)\big) \le \alpha
\]

The equation above is the standard guarantee: the probability that the true value \(Y\) falls outside our prediction set \(C(X)\) is at most \(\alpha\).

While mathematically sound, these methods focus on controlling the expected loss averaged over many unobserved datasets. They don’t give you a probability distribution for the risk given the actual data you observed.

Imagine a doctor using an AI diagnostic tool. A frequentist guarantee is like saying, “If I used this tool on 1,000 different groups of patients, the average error rate would be 5%.” A Bayesian approach asks a more immediate question: “Given the specific calibration data we just saw, what is the likely range of error rates for the next patient?”

The authors of this paper argue that by adopting a Bayesian view, we can characterize the full distribution of possible outcomes, not just a single point estimate.


2. Background: The Building Blocks

To understand the new method, we need to understand the two pillars it rests on: Conformal Risk Control and Bayesian Quadrature.

Conformal Risk Control (CRC)

CRC is a generalization of conformal prediction. Instead of just “correct” or “incorrect,” it handles continuous loss functions. We assume we have a loss function \(L(\lambda)\) that decreases as we relax our threshold \(\lambda\).

The goal of CRC is to find a \(\hat{\lambda}\) such that the expected loss is controlled:

\[
\mathbb{E}\big[L(\hat{\lambda})\big] \le \alpha
\]

To do this, standard CRC calculates the empirical risk (average loss) on the calibration set and adjusts it with a small buffer term:

\[
\hat{\lambda} = \inf\left\{ \lambda : \frac{n}{n+1}\,\hat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \right\}
\]

Here \(\hat{R}_n(\lambda)\) is the average loss over the \(n\) calibration points, \(B\) is an upper bound on the loss, and \(B/(n+1)\) is the buffer term.

This method works well on average, but as we will see in the experiments, it can be “risky” in specific instances because it targets the mean rather than bounding the tail of the distribution.
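
As a concrete sketch (not the authors' code), the rule can be implemented in a few lines. Here `empirical_risk` is a hypothetical callable returning the calibration-set average loss at a given \(\lambda\), the loss is assumed to be bounded by \(B\) and non-increasing in \(\lambda\), and `lambdas` is a finite grid of candidate thresholds.

```python
def crc_threshold(empirical_risk, lambdas, n, alpha, B=1.0):
    """Return the smallest lambda whose buffer-adjusted empirical risk
    on n calibration points is at most alpha."""
    for lam in sorted(lambdas):
        adjusted = (n / (n + 1)) * empirical_risk(lam) + B / (n + 1)
        if adjusted <= alpha:
            return lam
    return None  # alpha is unattainable on this grid
```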

Bayesian Quadrature

This is where the paper introduces a twist. Bayesian Quadrature is a technique from numerical analysis used to estimate the value of an integral when the function is expensive or difficult to evaluate.

The core idea is:

  1. Place a prior probability distribution over the function \(f\).
  2. Evaluate the function at a few points \(x_1, \dots, x_n\).
  3. Compute the posterior distribution of the function.
  4. Estimate the integral \(\int f(x) dx\) based on this posterior.

\[
p\left( \int f(x)\,dx \;\middle|\; f(x_1), \ldots, f(x_n) \right)
\]

Usually, we know the input points \(x_i\) where we evaluated the function. In the context of conformal prediction, however, the “evaluation points” are the quantile levels of our data, which we don’t know directly. This is the puzzle the authors set out to solve.
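
To make steps 1–4 concrete, here is a toy sketch of standard Bayesian quadrature with known inputs, the setting the paper then departs from. It is purely illustrative and not from the paper: it assumes a Gaussian-process prior with an RBF kernel, for which the kernel integrals over \([0, 1]\) have closed forms.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import trapezoid

LS = 0.2  # RBF kernel lengthscale (an assumed hyperparameter)

def rbf(a, b):
    """RBF kernel matrix between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / LS**2)

def kernel_mean(x):
    """z_i = integral of k(t, x_i) over [0, 1], in closed form via erf."""
    s = LS * np.sqrt(2)
    return LS * np.sqrt(np.pi / 2) * (erf((1 - x) / s) + erf(x / s))

def kernel_double_integral():
    """Integral of k(t, t') over the unit square, on a fine grid."""
    g = np.linspace(0, 1, 2001)
    return trapezoid(trapezoid(rbf(g, g), g, axis=1), g)

f = lambda x: np.sin(2 * np.pi * x) + 2 * x   # "unknown" integrand; true Z = 1
xs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])      # known evaluation points
ys = f(xs)

K = rbf(xs, xs) + 1e-10 * np.eye(len(xs))     # jitter for numerical stability
w = np.linalg.solve(K, kernel_mean(xs))       # Bayesian quadrature weights

post_mean = w @ ys                            # E[Z | f(x_1), ..., f(x_n)]
post_var = kernel_double_integral() - kernel_mean(xs) @ w
print(f"Z estimate: {post_mean:.3f} +/- {np.sqrt(max(post_var, 0)):.3f}")
```

Note that the posterior mean is a weighted sum of the observed function values, which is why Bayesian quadrature can be read as a probabilistic quadrature rule.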


3. The Core Method: Prediction as Integration

The researchers propose a new framework where finding the right risk threshold is treated as a decision-theoretic problem.

The Risk Definition

We define the risk \(R(\theta, \lambda)\) as the expected loss given a state of nature \(\theta\) and a decision rule \(\lambda\).

\[
R(\theta, \lambda) = \mathbb{E}_{\theta}\big[L(\lambda)\big]
\]

In standard conformal prediction, we try to bound the maximum risk \(\bar{R}(\lambda)\). In the Bayesian view, we look at the integrated risk—the risk averaged over our beliefs (prior) about the state of nature. Interestingly, the authors show that bounding the worst-case integrated risk is equivalent to bounding the maximum risk.

Transforming Risk into Quantiles

Here is the key insight: The expected loss (risk) can be calculated by integrating the quantile function of the losses.

Let \(K(t)\) be the quantile function of the loss distribution. The expected loss is simply the area under this curve: \(\int_0^1 K(t) dt\).
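
This identity is easy to check numerically. The sketch below uses an arbitrary Beta distribution as a stand-in for the loss distribution; nothing here is specific to the paper.

```python
import numpy as np
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
losses = rng.beta(2, 5, size=200_000)   # arbitrary stand-in loss distribution

ts = np.linspace(0, 1, 10_001)
K = np.quantile(losses, ts)             # empirical quantile function K(t)

print(np.mean(losses))                  # mean of the losses, ~2/7
print(trapezoid(K, ts))                 # area under K(t), ~the same value
```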

This looks exactly like a quadrature problem! We want to estimate the area under \(K(t)\). Our “observations” are the losses we saw in our calibration set, \(\ell_1, \dots, \ell_n\).

However, there is a catch. In standard quadrature, if you observe \(f(x) = y\), you know both \(x\) and \(y\). Here, we know the loss values \(\ell_i\) (the \(y\)-values), but we don’t know exactly which quantile \(t_i\) (the \(x\)-value) they correspond to. We only know that they are samples from the distribution.

The Bayesian Quadrature Framework

The figure below illustrates the difference between standard Bayesian Quadrature and this new approach.

[Figure: Overview of the approach]

  • Left: Standard Bayesian Quadrature. We know the inputs (x-axis) and outputs (y-axis) and try to infer the curve.
  • Middle: The proposed approach. We observe the losses (y-axis), but their positions on the x-axis (the quantile levels) are random.
  • Right: This results in a posterior distribution for the expected loss.

Solving the Unknown Quantiles

Since we don’t know the exact \(t\) values (quantile levels) for our observed losses, how do we proceed?

The authors utilize a classic result from statistical prediction analysis: the spacings between the (unknown) quantile levels of sorted i.i.d. samples follow a Dirichlet distribution.

If we sort our observed losses \(\ell_{(1)} \le \dots \le \ell_{(n)}\), the “gaps” between the cumulative probabilities associated with these losses are distributed according to:

\[
(U_1, \ldots, U_{n+1}) \sim \mathrm{Dirichlet}(1, \ldots, 1)
\]

Here, \(U_i\) represents the random spacing between consecutive quantile levels. From these spacings the authors build a random variable \(L^+\) that upper-bounds the expected loss: within each gap, the unknown quantile function is at most the next sorted loss, so

\[
L^+ = \sum_{i=1}^{n+1} U_i\, \ell_{(i)},
\]

where \(\ell_{(n+1)}\) is taken to be the maximum attainable loss.

Instead of a complex integral that requires a specific prior on functions, the authors derive a “distribution-free” bound: they prove that the posterior risk is stochastically dominated by this random variable \(L^+\), which is simply a weighted sum of the sorted losses whose weights are drawn from a Dirichlet distribution.
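
This spacings result is easy to verify numerically; the sketch below (mine, not the paper's) compares the gaps between sorted uniform samples with draws from a flat Dirichlet.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
u = np.sort(rng.uniform(size=(100_000, n)), axis=1)          # sorted uniforms
u = np.pad(u, ((0, 0), (1, 1)), constant_values=(0.0, 1.0))  # attach 0 and 1
gaps = np.diff(u, axis=1)                                    # n + 1 spacings

print(gaps.mean(axis=0))                                     # each ~ 1/(n+1)
print(rng.dirichlet(np.ones(n + 1), 100_000).mean(axis=0))   # matches
```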

Why This Matters: A Visual Comparison

To understand why this is better than the standard approach, look at the comparison below.

[Figure: Visual comparison of standard CRC and the Bayesian approach]

  • Left (Standard CRC): CRC effectively estimates the risk by plugging in the expectation (mean) of the unobserved quantile spacings. In the example shown, this yields an estimated expected loss of 0.45, while the true expected loss may be 0.50. Because it targets the average case, CRC can underestimate the risk.
  • Right (Bayesian Approach): This approach considers the full distribution of possible quantiles (the Dirichlet distribution). It creates a distribution over the expected loss. By looking at the tail of this distribution (e.g., the 95th percentile), we can be much more confident that we aren’t underestimating the risk.

The Algorithm

The practical algorithm is surprisingly simple, and a code sketch follows the steps below. We don’t need to perform complex Bayesian inference or specify a prior on the loss function \(K\).

  1. Collect losses \(\ell_1, \dots, \ell_n\) from the calibration set.
  2. Sort them to get order statistics \(\ell_{(i)}\).
  3. Simulate the random variable \(L^+\) by sampling weights \(U\) from a Dirichlet distribution (which is easy to do computationally).
  4. Calculate \(L^+ = \sum_{i=1}^{n+1} U_i\, \ell_{(i)}\), where \(\ell_{(n+1)}\) is set to the maximum attainable loss.
  5. This gives you a histogram (distribution) of the likely risk.
  6. Choose a threshold \(\lambda\) such that a high percentage (e.g., 95%) of this distribution is below your safety limit \(\alpha\).
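
Here is a sketch of that recipe, written as my own illustrative implementation rather than the authors' released code. The names `risk_posterior` and `losses_at` are hypothetical; following the construction of \(L^+\) above, the maximum attainable loss is appended as the top segment, and losses are assumed non-increasing in \(\lambda\).

```python
import numpy as np

def risk_posterior(losses, n_samples=10_000, max_loss=1.0, seed=0):
    """Draw samples of L+, the Dirichlet-weighted upper bound on the risk."""
    rng = np.random.default_rng(seed)
    sorted_losses = np.append(np.sort(losses), max_loss)   # l_(1..n), then B
    U = rng.dirichlet(np.ones(sorted_losses.size), size=n_samples)
    return U @ sorted_losses                               # samples of L+

def pick_lambda(losses_at, lambdas, alpha=0.05, beta=0.95):
    """Smallest lambda whose risk posterior puts >= beta mass below alpha."""
    for lam in sorted(lambdas):
        if np.mean(risk_posterior(losses_at(lam)) <= alpha) >= beta:
            return lam
    return None  # no lambda on the grid meets the constraint

# Example: a histogram of the likely risk from binary calibration losses
rng = np.random.default_rng(1)
cal_losses = rng.binomial(1, 0.03, size=200).astype(float)
print(np.quantile(risk_posterior(cal_losses), 0.95))  # 95th-percentile bound
```

The `pick_lambda` function corresponds to the decision rule formalized below.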

This decision rule is formally defined as:

\[
\hat{\lambda} = \inf\left\{ \lambda : P\big(L^+(\lambda) \le \alpha\big) \ge \beta \right\}
\]

This rule selects \(\lambda\) based on the Highest Posterior Density (HPD) interval. It ensures that with probability \(\beta\) (e.g., 0.95), the expected loss is controlled.


4. Experiments and Results

Does this actually work better than standard methods? The authors tested their approach against Conformal Risk Control (CRC) and Risk-controlling Prediction Sets (RCPS) on both synthetic data and the MS-COCO image dataset.

Synthetic Binomial Data

They first simulated a scenario where the loss follows a binomial distribution. They ran the experiment 10,000 times to see how often each method “failed” (i.e., the true risk exceeded the target \(\alpha = 0.4\)).

[Figure 3: Histograms of the realized risk for CRC (left) and the Bayesian HPD method (right)]

The histograms above tell the story clearly:

  • Left (CRC): The distribution of risk is centered around the target, but a significant portion (the pink region) spills over into the “failure” zone. In fact, CRC exceeded the risk threshold in 21.20% of trials.
  • Right (Bayesian HPD): The Bayesian method (using a 95% confidence level) shifts the distribution to the left. It only exceeded the risk threshold in 0.03% of trials.

The table below summarizes these results:

[Table 1: Frequency of risk-threshold violations on synthetic binomial data]

CRC is valid on average (marginally), but for any specific deployment, it has a high chance of violating the safety constraint. The Bayesian approach (Ours) effectively eliminates these violations.

MS-COCO Object Detection

The authors also applied this to a real-world task: multilabel classification on the MS-COCO dataset. They aimed to control the False Negative Rate.

[Table 3: Failure rates and average prediction-set sizes on MS-COCO]

Here, we see the trade-off.

  • CRC again has a high failure rate (45.05% of trials exceed the target risk).
  • RCPS (another conservative baseline) has a 0% failure rate but produces larger prediction sets (average size 3.57), making the model less useful.
  • Ours keeps the failure rate low (5.43%, close to the target of 5%) while producing smaller, more useful prediction sets (average size 3.04) than RCPS.

This shows that the Bayesian approach strikes a “Goldilocks” balance: it is safer than CRC but more efficient (smaller sets) than other conservative bounds.

Visualizing the Posterior Risk

One of the main benefits of this framework is interpretability. We can actually visualize the distribution of the upper bound on risk (\(L^+\)) for different thresholds.

[Figure: Posterior density of the risk bound \(L^+\) for thresholds \(\lambda\) from 0.7 to 0.9]

In the figure above, as the threshold \(\lambda\) increases (0.7 to 0.9), the distribution of risk shifts to the right. This visual feedback gives practitioners a much richer understanding of their model’s safety profile than a single “Yes/No” output.
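
A plot like this is straightforward to generate from samples of \(L^+\). The sketch below reuses the hypothetical `risk_posterior` from the algorithm section, with a toy Bernoulli loss model (rates chosen arbitrarily) standing in for real calibration losses.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
losses_at = lambda lam: rng.binomial(1, lam / 10, size=200).astype(float)

for lam in (0.7, 0.8, 0.9):
    plt.hist(risk_posterior(losses_at(lam)), bins=60, density=True,
             alpha=0.5, label=f"$\\lambda$ = {lam}")
plt.xlabel("upper bound on expected loss ($L^+$)")
plt.ylabel("posterior density")
plt.legend()
plt.show()
```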


5. Conclusion and Implications

The paper “Conformal Prediction as Bayesian Quadrature” bridges the gap between the rigid, frequentist guarantees of conformal prediction and the flexible, probabilistic reasoning of Bayesian statistics.

By viewing the calibration process as an integration problem over unknown quantiles, the authors derived a method that:

  1. Recovers existing methods: They showed that if you take the expectation of their Bayesian variable, you get back the standard Conformal Risk Control formulas (see the quick numerical check after this list).
  2. Provides safer guarantees: By using the full posterior distribution rather than just the mean, the method protects against “bad draws” of calibration data.
  3. Remains distribution-free: Remarkably, this Bayesian approach doesn’t require complex priors on the data distribution. It relies on the universal properties of order statistics (the Dirichlet distribution).
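
The first of these points is easy to verify numerically, since each weight of a flat Dirichlet has expectation \(1/(n+1)\); the synthetic losses below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
losses, B = rng.uniform(size=50), 1.0   # arbitrary losses, bounded by B
n = losses.size

weights = rng.dirichlet(np.ones(n + 1), size=100_000)
print((weights @ np.append(np.sort(losses), B)).mean())  # Monte Carlo E[L+]
print(n / (n + 1) * losses.mean() + B / (n + 1))         # CRC adjusted risk
```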

For students and practitioners, this work highlights the value of looking at old problems through new lenses. By bringing Bayesian Quadrature into the mix, we gain a tool that makes “black-box” AI models not just more accurate, but transparently safer.