Gaussian Processes (GPs) are the Swiss Army knife of machine learning when it comes to uncertainty quantification. They offer a powerful, principled way to not only make predictions but also to estimate how confident we should be in those predictions. This makes them invaluable for high-stakes applications like drug discovery, robotics, and automated scientific exploration, where knowing what the model doesn’t know is just as important as what it predicts.
However, GPs have a well-known Achilles’ heel: computational complexity. Exact GP inference scales cubically with the number of data points, \(O(N^3)\). This means that while a GP might work beautifully on a few thousand observations, it grinds to a halt when faced with the millions of data points common in modern datasets.
For decades, researchers have developed clever ways to tackle this scaling problem. Most solutions fall into one of two categories:
- Inducing point methods: summarizing the dataset with a smaller set of pseudo-inputs.
- Iterative solvers such as Conjugate Gradients (CG), which approximate the exact GP solution more efficiently.
A recent paper, Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent, introduces a fascinating new approach: Stochastic Gradient Descent (SGD). This might sound counter-intuitive: SGD is the workhorse behind deep learning—simple, often “naïve”—and is not the optimizer most would pick to solve the precise, structured linear algebra problems at the core of GPs.
Yet, the authors show that SGD can not only be used for GP inference, but can also achieve state-of-the-art results—especially for large or ill-conditioned datasets. The secret lies in a counterintuitive phenomenon they call benign non-convergence, where SGD’s tendency to not fully converge can actually become a feature. In this post, we’ll unpack their method, see how they adapt SGD for GP posterior sampling, and understand the beautiful theory behind its surprising effectiveness.
Gaussian Processes in a Nutshell
A Gaussian Process is a distribution over functions. Instead of fitting parameters like the weights in a polynomial, a GP defines a prior over functions, specified by a mean function \(\mu(\cdot)\) (often taken to be zero) and a covariance function (kernel), \(k(\cdot, \cdot')\), which measures similarity between inputs.
Given training inputs \(\mathbf{x}\) and noisy observations \(\mathbf{y}\), we use Bayes’ rule to update this prior into a posterior GP. The posterior mean gives us predictions; the posterior covariance quantifies uncertainty.
The GP posterior mean and covariance at test points \(\cdot\) and \(\cdot'\) are:
\[ \mu_{f|\boldsymbol{y}}(\cdot) = \mathbf{K}_{(\cdot)\boldsymbol{x}} (\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1} \boldsymbol{y} \]
\[ k_{f|\boldsymbol{y}}(\cdot, \cdot') = \mathbf{K}_{(\cdot)(\cdot')} - \mathbf{K}_{(\cdot)\boldsymbol{x}}(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1} \mathbf{K}_{\boldsymbol{x}(\cdot')} \]
The bottleneck lies in \((\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1}\): inverting (or factorizing) this dense \(N\times N\) matrix costs \(O(N^3)\).
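To make the bottleneck concrete, here is a minimal NumPy sketch of exact GP regression with an RBF kernel. The kernel choice, noise level, and toy data are illustrative assumptions rather than anything from the paper; the `np.linalg.solve` calls on the dense \(N\times N\) matrix are the \(O(N^3)\) step everything below tries to avoid.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def exact_gp_posterior(X, y, X_test, noise_var=0.1):
    """Exact GP posterior mean and covariance; the dense solves cost O(N^3)."""
    K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # K_xx + Sigma
    K_tx = rbf_kernel(X_test, X)                           # K_(.)x
    K_tt = rbf_kernel(X_test, X_test)
    mean = K_tx @ np.linalg.solve(K_xx, y)
    cov = K_tt - K_tx @ np.linalg.solve(K_xx, K_tx.T)
    return mean, cov

# Toy usage: 200 noisy observations of a sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(200)
X_test = np.linspace(-4, 4, 50)[:, None]
mean, cov = exact_gp_posterior(X, y, X_test)
```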
Pathwise Conditioning: A Different View on GP Posteriors
An alternative approach uses pathwise conditioning, which directly constructs samples of functions from the posterior:
\[ (f \mid \boldsymbol{y})(\cdot) = f(\cdot) + \mathbf{K}_{(\cdot)\boldsymbol{x}}(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1}(\boldsymbol{y} - f(\boldsymbol{x}) - \boldsymbol{\varepsilon}), \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \quad f \sim \mathrm{GP}(0, k) \]
This says: take a prior function \(f(\cdot)\) and “nudge” it toward the data points using a correction term. The catch: the correction still requires solving that large linear system.
To make this practical, we need:
- An efficient way to sample \(f(\cdot)\) from the prior.
- An efficient way to solve the linear system.
For (1), the authors use Random Fourier Features (RFFs), an approximation of kernels via low-dimensional feature maps:
\[ k(x, x') \approx \langle \Phi(x), \Phi(x') \rangle, \quad f(\cdot) \approx \tilde{f}(\cdot) = \boldsymbol{\theta}^\top \Phi(\cdot), \quad \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]
This makes sampling from the prior cheap. For (2), they turn to SGD.
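Putting the two pieces together at toy scale: the sketch below draws an RFF prior sample for an RBF kernel and applies the pathwise-conditioning formula, still using a dense solve for the correction term, which is exactly the piece SGD will replace. Feature count, lengthscale, and data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(1)
ls, noise_var, L = 1.0, 0.1, 512

# Random Fourier features for the RBF kernel: Phi(x) @ Phi(x')^T approximates k(x, x').
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
def features(Z):
    return np.sqrt(2.0 / L) * np.cos(Z @ omegas.T + phases)

theta = rng.standard_normal(L)                  # theta ~ N(0, I)
def prior_sample(Z):
    return features(Z) @ theta                  # f(.) ~= theta^T Phi(.)

# Pathwise conditioning: nudge the prior sample toward the data.
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(200)
X_test = np.linspace(-4, 4, 50)[:, None]

eps = np.sqrt(noise_var) * rng.standard_normal(len(X))     # eps ~ N(0, Sigma)
residual = y - prior_sample(X) - eps
K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))
v = np.linalg.solve(K_xx, residual)             # the expensive linear solve
posterior_sample = prior_sample(X_test) + rbf_kernel(X_test, X) @ v
```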
From Quadratic Optimization to SGD
Finding the GP posterior mean can be cast as minimizing a quadratic loss:
\[ \boldsymbol{v}^* = \arg\min_{\boldsymbol{v} \in \mathbb{R}^N} \sum_{i=1}^N \frac{\big(y_i - \mathbf{K}_{x_i\boldsymbol{x}} \boldsymbol{v}\big)^2}{\Sigma_{ii}} + \|\boldsymbol{v}\|_{\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}}^2 \]
\[ \mu_{f|\boldsymbol{y}}(\cdot) = \mathbf{K}_{(\cdot)\boldsymbol{x}} \boldsymbol{v}^* \]
Typically, Conjugate Gradients (CG) is used here. CG exploits the quadratic structure for faster convergence than standard gradient descent. However, in ill-conditioned problems, CG can still fail to converge quickly.
This raises the paper’s central question: can plain SGD—not exploiting quadratic structure—compete?
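As a concrete point of reference before getting to SGD, here is roughly what the CG route looks like on a toy problem; SciPy's generic `cg` stands in for the preconditioned, GPU-batched solvers the paper actually benchmarks against, and the tolerance is left at its default.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, noise_var = 2000, 0.1
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(N)

# CG only needs matrix-vector products with (K_xx + Sigma), never an explicit inverse,
# so each iteration is O(N^2) here (cheaper with structured or batched kernels).
K_xx = rbf_kernel(X, X)
A = LinearOperator((N, N), matvec=lambda w: K_xx @ w + noise_var * w)
v_star, info = cg(A, y, maxiter=1000)           # info == 0 means CG reported convergence

X_test = np.linspace(-4, 4, 50)[:, None]
mean_cg = rbf_kernel(X_test, X) @ v_star        # mu(.) = K_(.)x v*
```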
Posterior Sampling with SGD
The authors design stochastic objectives for both the posterior mean and posterior samples that can be optimized with minibatch SGD.
Figure 1: SGD excels in ill-conditioned regimes where CG fails to converge within a time budget. In well-conditioned regimes, both recover the exact solution.
1. SGD for the Posterior Mean
The quadratic mean-objective is made SGD-compatible by:
- Minibatching the loss term over \(D \ll N\) data points.
- Stochastic regularizer estimation using RFFs.
The resulting objective is:
\[ \mathcal{L}(\boldsymbol{v}) = \frac{N}{D} \sum_{i=1}^D \frac{(y_i - \mathbf{K}_{x_i\boldsymbol{x}}\boldsymbol{v})^2}{\Sigma_{ii}} + \sum_{\ell=1}^L \left(\boldsymbol{v}^\top \phi_\ell(\boldsymbol{x})\right)^2 \]
Here the \(D\) data points are a freshly drawn random minibatch and the \(\phi_\ell(\boldsymbol{x})\) are random Fourier features evaluated at the training inputs, so each step can be computed at \(O(N)\) cost.
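Below is a minimal sketch of this mean objective optimized with plain minibatch SGD. The data, the constant step size, and the iterate averaging over the second half of the run are illustrative simplifications; the paper's runs use carefully chosen step sizes and standard SGD stabilizers, so see the paper for the exact optimizer settings.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, D, L, ls, noise_var = 500, 64, 256, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)

# Random Fourier features give the stochastic estimate of the regularizer ||v||^2_{K_xx}.
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
Phi = np.sqrt(2.0 / L) * np.cos(X @ omegas.T + phases)     # N x L matrix of phi_l(x_j)

v = np.zeros(N)                                            # representer weights
v_avg = np.zeros(N)
lr, n_steps = 2e-6, 10_000
for step in range(1, n_steps + 1):
    batch = rng.choice(N, size=D, replace=False)
    K_bx = rbf_kernel(X[batch], X, ls)                     # D kernel rows: O(N) work per step
    resid = y[batch] - K_bx @ v
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bx.T @ resid)
    grad_reg = 2.0 * Phi @ (Phi.T @ v)                     # gradient of sum_l (v^T phi_l(x))^2
    v -= lr * (grad_fit + grad_reg)
    if step > n_steps // 2:                                # average the later iterates
        v_avg += (v - v_avg) / (step - n_steps // 2)

X_test = np.linspace(-4, 4, 50)[:, None]
mean_sgd = rbf_kernel(X_test, X, ls) @ v_avg               # mu(.) ~= K_(.)x v
```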
2. Low-Variance Objective for Posterior Samples
From the pathwise form, the “uncertainty reduction term” involves solving another quadratic problem, but with noisy targets \(f(x_i)+\varepsilon_i\). Naive minibatching yields noisy gradients.
Variance is reduced by moving the noise into the regularizer:
\[ \min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \sum_{i=1}^N \frac{(f(x_i) - \mathbf{K}_{x_i \boldsymbol{x}} \boldsymbol{\alpha})^2}{\Sigma_{ii}} + \|\boldsymbol{\alpha} - \boldsymbol{\delta}\|_{\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}}^2, \quad \boldsymbol{\delta} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}^{-1}) \]
This objective has the same minimizer in distribution as the noisy-target version, but with far lower gradient variance.
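A matching sketch for one posterior sample: relative to the mean objective above, the only changes are the noise-free targets \(f(x_i)\) coming from an RFF prior draw and the shifted regularizer involving \(\boldsymbol{\delta}\). Sizes and step sizes are the same illustrative choices as before.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(2)
N, D, L, ls, noise_var = 500, 64, 256, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))

# RFF prior sample f(.) = theta^T Phi(.), evaluated at the training inputs.
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
Phi = np.sqrt(2.0 / L) * np.cos(X @ omegas.T + phases)      # N x L
theta = rng.standard_normal(L)
f_x = Phi @ theta

delta = rng.standard_normal(N) / np.sqrt(noise_var)         # delta ~ N(0, Sigma^{-1}), Sigma = sigma^2 I

alpha = np.zeros(N)
lr = 2e-6
for _ in range(10_000):
    batch = rng.choice(N, size=D, replace=False)
    K_bx = rbf_kernel(X[batch], X, ls)
    resid = f_x[batch] - K_bx @ alpha                       # targets are the noise-free f(x_i)
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bx.T @ resid)
    grad_reg = 2.0 * Phi @ (Phi.T @ (alpha - delta))        # gradient of ||alpha - delta||^2_{K_xx} via RFFs
    alpha -= lr * (grad_fit + grad_reg)

# Combined with the mean weights v from the previous sketch, a pathwise posterior sample
# at test inputs X_test is: prior_sample(X_test) + rbf_kernel(X_test, X) @ (v - alpha).
```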
Figure 2: Low-variance reformulation reduces noise in SGD updates. Inducing point variants enable massive datasets to be handled efficiently.
Scaling Further: Inducing Points
Even at \(O(N)\) cost per step, datasets with millions of points remain expensive. Inducing point methods summarize the dataset with \(M\) pseudo-inputs (\(M \ll N\)), so only \(M\) parameters need to be learned.
In the authors’ inducing point SGD, both mean and sample objectives are expressed in terms of \(M\)-dimensional representer weights, cutting per-step cost to \(O(SM)\) (for \(S\) samples). This allows hundreds of thousands of inducing points, approximating the full GP closely while remaining fast.
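To show the shape of the idea (this is not the paper's exact objective, just a rough sketch under the same assumptions as the earlier toys), here is the mean objective with \(M\)-dimensional representer weights attached to randomly chosen inducing inputs; the per-step kernel block is now \(D \times M\) and no longer scales with \(N\).

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(3)
N, M, D, ls, noise_var = 20_000, 300, 128, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)
Z = X[rng.choice(N, size=M, replace=False)]         # inducing inputs: a random data subset

K_zz = rbf_kernel(Z, Z, ls)
v = np.zeros(M)                                     # only M representer weights to learn
lr = 1e-7
for _ in range(5_000):
    batch = rng.choice(N, size=D, replace=False)
    K_bz = rbf_kernel(X[batch], Z, ls)              # D x M block: per-step cost independent of N
    resid = y[batch] - K_bz @ v
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bz.T @ resid)
    grad_reg = 2.0 * K_zz @ v                       # gradient of v^T K_zz v (exact, since M is small)
    v -= lr * (grad_fit + grad_reg)

X_test = np.linspace(-4, 4, 50)[:, None]
mean_ip = rbf_kernel(X_test, Z, ls) @ v             # mu(.) ~= K_(.)z v
```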
The Magic of Benign Non-Convergence
SGD’s convergence pattern is strikingly different from CG’s.
Figure 3: Despite slow convergence in parameter space, SGD’s predictive accuracy rapidly reaches optimal levels.
Why? The authors split the input space into three regions:
1. Prior Region
Far from data, kernel similarity \(k(x_i, x_{\text{test}}) \approx 0\). Both exact and SGD-approximated posteriors revert to the prior: zero error.
2. Interpolation Region
Near data, prediction is crucial. A spectral decomposition of the kernel matrix yields basis functions:
\[ u^{(i)}(\cdot) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \cdot) \]
Large-eigenvalue basis functions are concentrated in data-rich regions. Theory (Proposition 1) shows SGD converges fast along directions with large \(\lambda_i\):
\[ \left\|\operatorname{proj}_{u^{(i)}} \mu_{f|\boldsymbol{y}} - \operatorname{proj}_{u^{(i)}} \mu_{\text{SGD}}\right\|_{H_k} \le \frac{1}{\sqrt{\lambda_i}} \, (\text{error terms shrinking as } t^{-1}) \]
Thus, SGD gets these important directions right quickly (see the numerical sketch after this list).
3. Extrapolation Region
Between prior and interpolation regions, small-\(\lambda_i\) basis functions dominate. SGD converges slowly here, creating large RKHS error in Figure 3. But since little data exists here, errors have negligible impact on predictions—benign non-convergence.
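This split can be checked numerically at small scale: run SGD on the mean objective for a while, then project the weight-space error onto the eigendirections of \(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}\). The setup below (exact regularizer, plain SGD, arbitrary step counts) is a simplified illustration of Proposition 1's message, not a reproduction of the paper's experiments.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(5)
N, D, noise_var = 500, 64, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)

K = rbf_kernel(X, X)
v_exact = np.linalg.solve(K + noise_var * np.eye(N), y)    # exact representer weights
lams, U = np.linalg.eigh(K)
lams, U = lams[::-1], U[:, ::-1]                           # eigenpairs of K_xx, descending

# A short run of minibatch SGD on the mean objective (the exact K is used for the
# regularizer here, for brevity; the earlier sketch used an RFF estimate instead).
v = np.zeros(N)
lr = 4e-6
for _ in range(10_000):
    b = rng.choice(N, size=D, replace=False)
    resid = y[b] - K[b] @ v
    grad = -(2.0 * N / (D * noise_var)) * (K[b].T @ resid) + 2.0 * K @ v
    v -= lr * grad

# Project the weight-space error onto the kernel eigendirections: it is small along the
# large-eigenvalue directions that dominate predictions in data-rich regions, and large
# along the small-eigenvalue directions that matter only far from the data.
err = U.T @ (v - v_exact)
print("error norm, top-10 eigendirections:", np.linalg.norm(err[:10]))
print("error norm, remaining directions  :", np.linalg.norm(err[10:]))
```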
Figure 4: SGD’s high accuracy aligns with interpolation regions; extrapolation errors are benign.
Experimental Evidence
Large-Scale Regression
The authors test on nine datasets (up to 2M points) with a fixed kernel and hyperparameters.
Table 1: On large or ill-conditioned datasets, SGD outperforms CG and SVGP in RMSE and NLL.
Figure 5: In large-scale regimes, SGD beats CG in wall time to good predictions.
High-Stakes Decision-Making: Parallel Thompson Sampling
The authors run large-scale Bayesian optimization with batch Thompson sampling (1000 acquisitions per step). This demands many posterior samples quickly.
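Here is a sketch of the acquisition loop, using exact small-scale posterior samples as a stand-in for the paper's SGD-based pathwise samples; the target function, pool size, and batch size are made up for illustration.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def target(x):                  # hypothetical black-box objective to maximize
    return np.sin(3 * x[:, 0]) - 0.1 * x[:, 0] ** 2

rng = np.random.default_rng(6)
noise_var, batch_size, n_rounds = 0.01, 20, 5
X = rng.uniform(-3, 3, (10, 1))                     # a few initial observations
y = target(X) + 0.1 * rng.standard_normal(len(X))

for _ in range(n_rounds):
    X_cand = rng.uniform(-3, 3, (1000, 1))          # candidate pool for this round
    # Joint posterior samples at the candidates; the paper draws these with SGD-based
    # pathwise conditioning so that thousands of samples stay cheap at large N.
    K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_cx = rbf_kernel(X_cand, X)
    mean = K_cx @ np.linalg.solve(K_xx, y)
    cov = rbf_kernel(X_cand, X_cand) - K_cx @ np.linalg.solve(K_xx, K_cx.T)
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(X_cand)))
    samples = mean + (chol @ rng.standard_normal((len(X_cand), batch_size))).T

    # Parallel Thompson sampling: acquire the maximizer of each posterior sample.
    X_new = X_cand[np.argmax(samples, axis=1)]
    y_new = target(X_new) + 0.1 * rng.standard_normal(batch_size)
    X, y = np.concatenate([X, X_new]), np.concatenate([y, y_new])

print("best observed value:", y.max())
```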
Figure 6: Under tight compute budgets, SGD achieves the best target function values in less time.
Key Takeaways
This work reframes GP posterior sampling as stochastic optimization, enabling linear-cost sampling with SGD and sublinear cost with inducing points.
Crucially:
- SGD’s early convergence along high-eigenvalue directions yields fast predictive accuracy.
- Ill-conditioning affects small-eigenvalue directions, which SGD naturally deprioritizes, making it robust.
- Benign non-convergence is a blessing in disguise.
By showing that an “imperfect” optimizer can align perfectly with the geometry of the GP posterior, the authors challenge traditional assumptions about convergence and optimality in scalable ML. This opens doors to applying GPs to truly massive datasets and complex sequential decision problems where uncertainty estimates must be fast, accurate, and trustworthy.