Gaussian Processes (GPs) are the Swiss Army knife of machine learning when it comes to uncertainty quantification. They offer a powerful, principled way to not only make predictions but also to estimate how confident we should be in those predictions. This makes them invaluable for high-stakes applications like drug discovery, robotics, and automated scientific exploration, where knowing what the model doesn’t know is just as important as what it predicts.
However, GPs have a well-known Achilles’ heel: computational complexity. Exact GP inference scales cubically with the number of data points, \(O(N^3)\). This means that while a GP might work beautifully on a few thousand observations, it grinds to a halt when faced with the millions of data points common in modern datasets.
For decades, researchers have developed clever ways to tackle this scaling problem. Most solutions fall into one of two categories:
- Inducing point methods: summarizing the dataset with a smaller set of pseudo-inputs.
- Iterative solvers such as Conjugate Gradients (CG), which approximate the exact GP solution more efficiently.
A recent paper, Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent, introduces a fascinating new approach: Stochastic Gradient Descent (SGD). This might sound counter-intuitive: SGD is the workhorse behind deep learning—simple, often “naïve”—and is not the optimizer most would pick to solve the precise, structured linear algebra problems at the core of GPs.
Yet, the authors show that SGD can not only be used for GP inference, but can also achieve state-of-the-art results—especially for large or ill-conditioned datasets. The secret lies in a counterintuitive phenomenon they call benign non-convergence, where SGD’s tendency to not fully converge can actually become a feature. In this post, we’ll unpack their method, see how they adapt SGD for GP posterior sampling, and understand the beautiful theory behind its surprising effectiveness.
Gaussian Processes in a Nutshell
A Gaussian Process is a distribution over functions. Instead of fitting parameters like the weights in a polynomial, a GP defines a prior over functions, specified by a mean function \(\mu(\cdot)\) (often taken to be zero) and a covariance function (kernel), \(k(\cdot, \cdot')\), which measures similarity between inputs.
Given training inputs \(\mathbf{x}\) and noisy observations \(\mathbf{y}\), we use Bayes’ rule to update this prior into a posterior GP. The posterior mean gives us predictions; the posterior covariance quantifies uncertainty.
The GP posterior mean and covariance at test points \(\cdot\) and \(\cdot'\) are:
\[ \mu_{f|\boldsymbol{y}}(\cdot) = \mathbf{K}_{(\cdot)\boldsymbol{x}} (\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1} \boldsymbol{y} \]
\[ k_{f|\boldsymbol{y}}(\cdot, \cdot') = \mathbf{K}_{(\cdot)(\cdot')} - \mathbf{K}_{(\cdot)\boldsymbol{x}}(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1} \mathbf{K}_{\boldsymbol{x}(\cdot')} \]
The bottleneck lies in \((\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1}\): inverting (or factorizing) this dense \(N\times N\) matrix costs \(O(N^3)\).
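To make the bottleneck concrete, here is a minimal NumPy sketch of exact GP regression with an RBF kernel. The kernel choice, noise level, and toy data are illustrative assumptions rather than anything from the paper; the `np.linalg.solve` calls on the dense \(N\times N\) matrix are the \(O(N^3)\) step everything below tries to avoid.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def exact_gp_posterior(X, y, X_test, noise_var=0.1):
    """Exact GP posterior mean and covariance; the dense solves cost O(N^3)."""
    K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))   # K_xx + Sigma
    K_tx = rbf_kernel(X_test, X)                           # K_(.)x
    K_tt = rbf_kernel(X_test, X_test)
    mean = K_tx @ np.linalg.solve(K_xx, y)
    cov = K_tt - K_tx @ np.linalg.solve(K_xx, K_tx.T)
    return mean, cov

# Toy usage: 200 noisy observations of a sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(200)
X_test = np.linspace(-4, 4, 50)[:, None]
mean, cov = exact_gp_posterior(X, y, X_test)
```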
Pathwise Conditioning: A Different View on GP Posteriors
An alternative approach uses pathwise conditioning, which directly constructs samples of functions from the posterior:
\[ (f \mid \boldsymbol{y})(\cdot) = f(\cdot) + \mathbf{K}_{(\cdot)\boldsymbol{x}}(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}} + \boldsymbol{\Sigma})^{-1}(\boldsymbol{y} - f(\boldsymbol{x}) - \boldsymbol{\varepsilon}), \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \quad f \sim \mathrm{GP}(0, k) \]
This says: take a prior function \(f(\cdot)\) and “nudge” it toward the data points using a correction term. The catch: the correction still requires solving that large linear system.
To make this practical, we need:
- An efficient way to sample \(f(\cdot)\) from the prior.
- An efficient way to solve the linear system.
For (1), the authors use Random Fourier Features (RFFs), an approximation of kernels via low-dimensional feature maps:
\[ k(x, x') \approx \langle \Phi(x), \Phi(x') \rangle, \quad f(\cdot) \approx \tilde{f}(\cdot) = \boldsymbol{\theta}^\top \Phi(\cdot), \quad \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \]
This makes sampling from the prior cheap. For (2), they turn to SGD.
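Putting the two pieces together at toy scale: the sketch below draws an RFF prior sample for an RBF kernel and applies the pathwise-conditioning formula, still using a dense solve for the correction term, which is exactly the piece SGD will replace. Feature count, lengthscale, and data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(1)
ls, noise_var, L = 1.0, 0.1, 512

# Random Fourier features for the RBF kernel: Phi(x) @ Phi(x')^T approximates k(x, x').
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
def features(Z):
    return np.sqrt(2.0 / L) * np.cos(Z @ omegas.T + phases)

theta = rng.standard_normal(L)                  # theta ~ N(0, I)
def prior_sample(Z):
    return features(Z) @ theta                  # f(.) ~= theta^T Phi(.)

# Pathwise conditioning: nudge the prior sample toward the data.
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(200)
X_test = np.linspace(-4, 4, 50)[:, None]

eps = np.sqrt(noise_var) * rng.standard_normal(len(X))     # eps ~ N(0, Sigma)
residual = y - prior_sample(X) - eps
K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))
v = np.linalg.solve(K_xx, residual)             # the expensive linear solve
posterior_sample = prior_sample(X_test) + rbf_kernel(X_test, X) @ v
```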
From Quadratic Optimization to SGD
Finding the GP posterior mean can be cast as minimizing a quadratic loss:
\[ \boldsymbol{v}^* = \arg\min_{\boldsymbol{v} \in \mathbb{R}^N} \sum_{i=1}^N \frac{\big(y_i - \mathbf{K}_{x_i\boldsymbol{x}} \boldsymbol{v}\big)^2}{\Sigma_{ii}} + \|\boldsymbol{v}\|_{\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}}^2 \]
\[ \mu_{f|\boldsymbol{y}}(\cdot) = \mathbf{K}_{(\cdot)\boldsymbol{x}} \boldsymbol{v}^* \]
Typically, Conjugate Gradients (CG) is used here. CG exploits the quadratic structure for faster convergence than standard gradient descent. However, in ill-conditioned problems, CG can still fail to converge quickly.
This raises the paper’s central question: can plain SGD—not exploiting quadratic structure—compete?
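As a concrete point of reference before getting to SGD, here is roughly what the CG route looks like on a toy problem; SciPy's generic `cg` stands in for the preconditioned, GPU-batched solvers the paper actually benchmarks against, and the tolerance is left at its default.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, noise_var = 2000, 0.1
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(N)

# CG only needs matrix-vector products with (K_xx + Sigma), never an explicit inverse,
# so each iteration is O(N^2) here (cheaper with structured or batched kernels).
K_xx = rbf_kernel(X, X)
A = LinearOperator((N, N), matvec=lambda w: K_xx @ w + noise_var * w)
v_star, info = cg(A, y, maxiter=1000)           # info == 0 means CG reported convergence

X_test = np.linspace(-4, 4, 50)[:, None]
mean_cg = rbf_kernel(X_test, X) @ v_star        # mu(.) = K_(.)x v*
```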
Posterior Sampling with SGD
The authors design stochastic objectives for both the posterior mean and posterior samples that can be optimized with minibatch SGD.
Figure 1: SGD excels in ill-conditioned regimes where CG fails to converge within a time budget. In well-conditioned regimes, both recover the exact solution.
1. SGD for the Posterior Mean
The quadratic mean-objective is made SGD-compatible by:
- Minibatching the loss term over \(D \ll N\) data points.
- Stochastic regularizer estimation using RFFs.
The resulting objective is:
\[ \mathcal{L}(\boldsymbol{v}) = \frac{N}{D} \sum_{i=1}^D \frac{(y_i - \mathbf{K}_{x_i\boldsymbol{x}}\boldsymbol{v})^2}{\Sigma_{ii}} + \sum_{\ell=1}^L \left(\boldsymbol{v}^\top \phi_\ell(\boldsymbol{x})\right)^2 \]
Here the \(D\) data points are a freshly drawn random minibatch and the \(\phi_\ell(\boldsymbol{x})\) are random Fourier features evaluated at the training inputs, so each step can be computed at \(O(N)\) cost.
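Below is a minimal sketch of this mean objective optimized with plain minibatch SGD. The data, the constant step size, and the iterate averaging over the second half of the run are illustrative simplifications; the paper's runs use carefully chosen step sizes and standard SGD stabilizers, so see the paper for the exact optimizer settings.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, D, L, ls, noise_var = 500, 64, 256, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)

# Random Fourier features give the stochastic estimate of the regularizer ||v||^2_{K_xx}.
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
Phi = np.sqrt(2.0 / L) * np.cos(X @ omegas.T + phases)     # N x L matrix of phi_l(x_j)

v = np.zeros(N)                                            # representer weights
v_avg = np.zeros(N)
lr, n_steps = 2e-6, 10_000
for step in range(1, n_steps + 1):
    batch = rng.choice(N, size=D, replace=False)
    K_bx = rbf_kernel(X[batch], X, ls)                     # D kernel rows: O(N) work per step
    resid = y[batch] - K_bx @ v
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bx.T @ resid)
    grad_reg = 2.0 * Phi @ (Phi.T @ v)                     # gradient of sum_l (v^T phi_l(x))^2
    v -= lr * (grad_fit + grad_reg)
    if step > n_steps // 2:                                # average the later iterates
        v_avg += (v - v_avg) / (step - n_steps // 2)

X_test = np.linspace(-4, 4, 50)[:, None]
mean_sgd = rbf_kernel(X_test, X, ls) @ v_avg               # mu(.) ~= K_(.)x v
```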
2. Low-Variance Objective for Posterior Samples
From the pathwise form, the “uncertainty reduction term” involves solving another quadratic problem, but with noisy targets \(f(x_i)+\varepsilon_i\). Naive minibatching yields noisy gradients.
Variance is reduced by moving the noise into the regularizer:
\[ \min_{\boldsymbol{\alpha} \in \mathbb{R}^N} \sum_{i=1}^N \frac{(f(x_i) - \mathbf{K}_{x_i \boldsymbol{x}} \boldsymbol{\alpha})^2}{\Sigma_{ii}} + \|\boldsymbol{\alpha} - \boldsymbol{\delta}\|_{\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}}^2, \quad \boldsymbol{\delta} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}^{-1}) \]
This objective has the same minimizer in distribution as the noisy-target version, but with far lower gradient variance.
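A matching sketch for one posterior sample: relative to the mean objective above, the only changes are the noise-free targets \(f(x_i)\) coming from an RFF prior draw and the shifted regularizer involving \(\boldsymbol{\delta}\). Sizes and step sizes are the same illustrative choices as before.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(2)
N, D, L, ls, noise_var = 500, 64, 256, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))

# RFF prior sample f(.) = theta^T Phi(.), evaluated at the training inputs.
omegas = rng.standard_normal((L, 1)) / ls
phases = rng.uniform(0, 2 * np.pi, L)
Phi = np.sqrt(2.0 / L) * np.cos(X @ omegas.T + phases)      # N x L
theta = rng.standard_normal(L)
f_x = Phi @ theta

delta = rng.standard_normal(N) / np.sqrt(noise_var)         # delta ~ N(0, Sigma^{-1}), Sigma = sigma^2 I

alpha = np.zeros(N)
lr = 2e-6
for _ in range(10_000):
    batch = rng.choice(N, size=D, replace=False)
    K_bx = rbf_kernel(X[batch], X, ls)
    resid = f_x[batch] - K_bx @ alpha                       # targets are the noise-free f(x_i)
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bx.T @ resid)
    grad_reg = 2.0 * Phi @ (Phi.T @ (alpha - delta))        # gradient of ||alpha - delta||^2_{K_xx} via RFFs
    alpha -= lr * (grad_fit + grad_reg)

# Combined with the mean weights v from the previous sketch, a pathwise posterior sample
# at test inputs X_test is: prior_sample(X_test) + rbf_kernel(X_test, X) @ (v - alpha).
```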
Figure 2: Low-variance reformulation reduces noise in SGD updates. Inducing point variants enable massive datasets to be handled efficiently.
Scaling Further: Inducing Points
Even at \(O(N)\) cost per step, datasets with millions of points remain expensive. Inducing point methods summarize the dataset with \(M\) pseudo-inputs (\(M \ll N\)), so only \(M\) parameters need to be learned.
In the authors’ inducing point SGD, both mean and sample objectives are expressed in terms of \(M\)-dimensional representer weights, cutting per-step cost to \(O(SM)\) (for \(S\) samples). This allows hundreds of thousands of inducing points, approximating the full GP closely while remaining fast.
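To show the shape of the idea (this is not the paper's exact objective, just a rough sketch under the same assumptions as the earlier toys), here is the mean objective with \(M\)-dimensional representer weights attached to randomly chosen inducing inputs; the per-step kernel block is now \(D \times M\) and no longer scales with \(N\).

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(3)
N, M, D, ls, noise_var = 20_000, 300, 128, 1.0, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)
Z = X[rng.choice(N, size=M, replace=False)]         # inducing inputs: a random data subset

K_zz = rbf_kernel(Z, Z, ls)
v = np.zeros(M)                                     # only M representer weights to learn
lr = 1e-7
for _ in range(5_000):
    batch = rng.choice(N, size=D, replace=False)
    K_bz = rbf_kernel(X[batch], Z, ls)              # D x M block: per-step cost independent of N
    resid = y[batch] - K_bz @ v
    grad_fit = -(2.0 * N / (D * noise_var)) * (K_bz.T @ resid)
    grad_reg = 2.0 * K_zz @ v                       # gradient of v^T K_zz v (exact, since M is small)
    v -= lr * (grad_fit + grad_reg)

X_test = np.linspace(-4, 4, 50)[:, None]
mean_ip = rbf_kernel(X_test, Z, ls) @ v             # mu(.) ~= K_(.)z v
```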
The Magic of Benign Non-Convergence
SGD’s convergence pattern is strikingly different from CG’s.
Figure 3: Despite slow convergence in parameter space, SGD’s predictive accuracy rapidly reaches optimal levels.
Why? The authors split the input space into three regions:
1. Prior Region
Far from data, kernel similarity \(k(x_i, x_{\text{test}}) \approx 0\). Both exact and SGD-approximated posteriors revert to the prior: zero error.
2. Interpolation Region
Near data, prediction is crucial. A spectral decomposition of the kernel matrix yields basis functions:
\[ u^{(i)}(\cdot) = \sum_{j=1}^N \frac{U_{ji}}{\sqrt{\lambda_i}} k(x_j, \cdot) \]
Large-eigenvalue basis functions are concentrated in data-rich regions. Theory (Proposition 1) shows SGD converges fast along directions with large \(\lambda_i\):
\[ \left\|\operatorname{proj}_{u^{(i)}} \mu_{f|\boldsymbol{y}} - \operatorname{proj}_{u^{(i)}} \mu_{\text{SGD}}\right\|_{H_k} \le \frac{1}{\sqrt{\lambda_i}} \, (\text{error terms shrinking as } t^{-1}) \]
Thus, SGD gets these important directions right quickly (see the numerical sketch after this list).
3. Extrapolation Region
Between prior and interpolation regions, small-\(\lambda_i\) basis functions dominate. SGD converges slowly here, creating large RKHS error in Figure 3. But since little data exists here, errors have negligible impact on predictions—benign non-convergence.
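This split can be checked numerically at small scale: run SGD on the mean objective for a while, then project the weight-space error onto the eigendirections of \(\mathbf{K}_{\boldsymbol{x}\boldsymbol{x}}\). The setup below (exact regularizer, plain SGD, arbitrary step counts) is a simplified illustration of Proposition 1's message, not a reproduction of the paper's experiments.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(5)
N, D, noise_var = 500, 64, 0.25
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.5 * rng.standard_normal(N)

K = rbf_kernel(X, X)
v_exact = np.linalg.solve(K + noise_var * np.eye(N), y)    # exact representer weights
lams, U = np.linalg.eigh(K)
lams, U = lams[::-1], U[:, ::-1]                           # eigenpairs of K_xx, descending

# A short run of minibatch SGD on the mean objective (the exact K is used for the
# regularizer here, for brevity; the earlier sketch used an RFF estimate instead).
v = np.zeros(N)
lr = 4e-6
for _ in range(10_000):
    b = rng.choice(N, size=D, replace=False)
    resid = y[b] - K[b] @ v
    grad = -(2.0 * N / (D * noise_var)) * (K[b].T @ resid) + 2.0 * K @ v
    v -= lr * grad

# Project the weight-space error onto the kernel eigendirections: it is small along the
# large-eigenvalue directions that dominate predictions in data-rich regions, and large
# along the small-eigenvalue directions that matter only far from the data.
err = U.T @ (v - v_exact)
print("error norm, top-10 eigendirections:", np.linalg.norm(err[:10]))
print("error norm, remaining directions  :", np.linalg.norm(err[10:]))
```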
Figure 4: SGD’s high accuracy aligns with interpolation regions; extrapolation errors are benign.
Experimental Evidence
Large-Scale Regression
The authors test on nine datasets (up to 2M points) with a fixed kernel and hyperparameters.
Table 1: On large or ill-conditioned datasets, SGD outperforms CG and SVGP in RMSE and NLL.
Figure 5: In large-scale regimes, SGD beats CG in wall time to good predictions.
High-Stakes Decision-Making: Parallel Thompson Sampling
The authors run large-scale Bayesian optimization with batch Thompson sampling (1000 acquisitions per step). This demands many posterior samples quickly.
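Here is a sketch of the acquisition loop, using exact small-scale posterior samples as a stand-in for the paper's SGD-based pathwise samples; the target function, pool size, and batch size are made up for illustration.

```python
import numpy as np

def rbf_kernel(A, B, ls=1.0):   # same RBF kernel helper as in the first sketch
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def target(x):                  # hypothetical black-box objective to maximize
    return np.sin(3 * x[:, 0]) - 0.1 * x[:, 0] ** 2

rng = np.random.default_rng(6)
noise_var, batch_size, n_rounds = 0.01, 20, 5
X = rng.uniform(-3, 3, (10, 1))                     # a few initial observations
y = target(X) + 0.1 * rng.standard_normal(len(X))

for _ in range(n_rounds):
    X_cand = rng.uniform(-3, 3, (1000, 1))          # candidate pool for this round
    # Joint posterior samples at the candidates; the paper draws these with SGD-based
    # pathwise conditioning so that thousands of samples stay cheap at large N.
    K_xx = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_cx = rbf_kernel(X_cand, X)
    mean = K_cx @ np.linalg.solve(K_xx, y)
    cov = rbf_kernel(X_cand, X_cand) - K_cx @ np.linalg.solve(K_xx, K_cx.T)
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(X_cand)))
    samples = mean + (chol @ rng.standard_normal((len(X_cand), batch_size))).T

    # Parallel Thompson sampling: acquire the maximizer of each posterior sample.
    X_new = X_cand[np.argmax(samples, axis=1)]
    y_new = target(X_new) + 0.1 * rng.standard_normal(batch_size)
    X, y = np.concatenate([X, X_new]), np.concatenate([y, y_new])

print("best observed value:", y.max())
```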
Figure 6: Under tight compute budgets, SGD achieves the best target function values in less time.
Key Takeaways
This work reframes GP posterior sampling as stochastic optimization, enabling linear-cost sampling with SGD and sublinear cost with inducing points.
Crucially:
- SGD’s early convergence along high-eigenvalue directions yields fast predictive accuracy.
- Ill-conditioning affects small-eigenvalue directions, which SGD naturally deprioritizes, making it robust.
- Benign non-convergence is a blessing in disguise.
By showing that an “imperfect” optimizer can align perfectly with the geometry of the GP posterior, the authors challenge traditional assumptions about convergence and optimality in scalable ML. This opens doors to applying GPs to truly massive datasets and complex sequential decision problems where uncertainty estimates must be fast, accurate, and trustworthy.