In the natural sciences and physics, symmetry is everything. Whether you are analyzing the energy of a molecule, the dynamics of a fluid, or the structure of a crystal, the fundamental laws of nature often remain unchanged under certain transformations—like rotation, reflection, or translation.

In machine learning, we call these properties invariances. Ideally, if we train a model to predict the energy of a molecule, rotating that molecule in 3D space should not change the model’s prediction. However, teaching machines to respect these symmetries exactly has historically been a massive computational headache.

Traditional methods force us to choose between two evils: either we approximate the symmetry (which isn’t precise enough for rigorous physics) or we expend an exponential amount of computational power to guarantee it.

A recent paper, “Learning with Exact Invariances in Polynomial Time,” changes this landscape. The authors propose a method that achieves exact invariance without the exponential cost, offering a polynomial-time solution that matches the statistical performance of standard methods.

In this post, we will break down the mathematics of this new approach, exploring how the authors leverage Spectral Theory and Differential Geometry to solve one of the stickiest problems in geometric deep learning.

The Problem: The High Cost of Symmetry

Let’s define the problem formally. We are trying to learn a regression function \(f^*\) based on a dataset of \(n\) samples \(\mathcal{S} = \{(x_i, y_i)\}\). Our data lives on a manifold \(\mathcal{M}\) (like a sphere or a torus), and we want our learned function \(\widehat{f}\) to be close to the true function \(f^*\).

We measure “closeness” using population risk (generalization error):

\[
R(\widehat{f}) = \mathbb{E}_{x}\Big[\big(\widehat{f}(x) - f^{*}(x)\big)^{2}\Big]
\]

Here is the catch: The ground truth function \(f^*\) is invariant to a group of symmetries \(G\). This means \(f^*(gx) = f^*(x)\) for any group element \(g\) and any point \(x\). We want our estimator \(\widehat{f}\) to respect this same rule.
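To make this condition concrete, here is a tiny toy check (my own illustration, not from the paper), where \(G\) is the permutation group acting on coordinates and the target is the permutation-invariant function \(f^*(x) = \sum_j x_j^2\):

```python
import numpy as np

# Toy illustration (not from the paper): a target that is invariant under
# the permutation group acting on coordinates.
def f_star(x):
    return np.sum(x ** 2)          # unchanged by any reordering of coordinates

rng = np.random.default_rng(0)
x = rng.normal(size=5)
gx = rng.permutation(x)            # apply a group element g (a random permutation)

print(np.isclose(f_star(gx), f_star(x)))   # True: f*(gx) == f*(x)
```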

Why Standard Methods Fail

You might wonder, “Why not just average the data?” This is the intuition behind Group Averaging. If you have a kernel function \(K\) (a similarity measure between points), you can force it to be invariant by summing over all possible group transformations:

\[
K_G(x, x') = \frac{1}{|G|} \sum_{g \in G} K(gx, x')
\]

While mathematically sound, this operation is a computational nightmare. If \(G\) is a large group (like the permutation group of \(d\) elements, where \(|G| = d!\)), the size of the group explodes super-exponentially. Summing over \(d!\) elements is impossible for high-dimensional data.
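To see the blow-up concretely, here is a minimal sketch of group averaging (my own illustration, assuming the permutation group acting on coordinates and an RBF base kernel, neither of which is fixed by the paper):

```python
import math
from itertools import permutations

import numpy as np

# Group-averaged kernel for the permutation group: feasible only for tiny d,
# because the sum runs over all |G| = d! permutations.
def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def group_averaged_kernel(x, y, gamma=1.0):
    perms = list(permutations(range(len(x))))
    return sum(rbf(x[list(p)], y, gamma) for p in perms) / len(perms)

x = np.array([0.1, 0.5, -0.3])
y = np.array([0.2, -0.1, 0.4])
print(group_averaged_kernel(x, y))   # fine for d = 3: only 6 terms
print(math.factorial(30))            # ~2.7e32 terms already at d = 30
```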

Other techniques like Data Augmentation face the same issue: you would need to augment your training set with every possible transformation of every data point. Canonicalization and Frame Averaging offer alternatives but often struggle with discontinuities or lack general-purpose algorithms.

The research question becomes: Can we find an estimator that is exactly invariant, statistically optimal, and computationally efficient (polynomial time)?

The Solution: Spectral Averaging

The authors propose a method called Spectral Averaging (Spec-Avg). Instead of trying to force invariance in the spatial domain (averaging points), they move to the frequency domain using the spectral theory of the Laplace-Beltrami operator.

1. The Geometry of the Problem

To understand the solution, we need to understand the stage where this plays out: a Riemannian Manifold \(\mathcal{M}\). On this manifold, we have the Laplace-Beltrami operator (\(\Delta_\mathcal{M}\)), which is the generalization of the standard Laplacian (the “second derivative” operator) to curved surfaces.

This operator has a set of eigenfunctions \(\phi_{\lambda, \ell}\) and corresponding eigenvalues \(\lambda\). Just like sines and cosines form a basis for functions on a line (Fourier series), these eigenfunctions form an orthonormal basis for functions on the manifold.

Any function \(f\) on the manifold can be written as a sum of these basis functions:

\[
f(x) = \sum_{\lambda} \sum_{\ell} f_{\lambda, \ell}\, \phi_{\lambda, \ell}(x), \qquad f_{\lambda, \ell} = \langle f, \phi_{\lambda, \ell} \rangle_{L^2(\mathcal{M})}
\]
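For intuition, here is a small numerical sketch (my own, on the circle \(S^1\) rather than a general manifold), where the Laplace-Beltrami eigenfunctions are simply the Fourier basis \(\{1, \cos(kt), \sin(kt)\}\) with eigenvalues \(k^2\):

```python
import numpy as np

# Decompose a smooth function on the circle into Laplacian eigenfunctions
# (the Fourier basis) and reconstruct it from the coefficients.
t = np.linspace(0.0, 2.0 * np.pi, 512, endpoint=False)
f = np.exp(np.cos(t))                         # an arbitrary smooth test function

K = 10                                        # truncation frequency
coeffs = {}
for k in range(K + 1):
    coeffs[("cos", k)] = (1 if k == 0 else 2) * np.mean(f * np.cos(k * t))
    coeffs[("sin", k)] = 2 * np.mean(f * np.sin(k * t))

f_rec = sum(c * (np.cos(k * t) if kind == "cos" else np.sin(k * t))
            for (kind, k), c in coeffs.items())
print(np.max(np.abs(f - f_rec)))              # tiny: the series converges fast
```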

2. The Magic of Commutativity

The core insight of the paper relies on a beautiful property of geometry: The Laplacian commutes with isometric group actions.

If you transform a function using a symmetry \(g\) (denoted as \(T_g\)) and then apply the Laplacian, it’s the same as applying the Laplacian first and then transforming it.

\[
\Delta_{\mathcal{M}}\big(T_g f\big) = T_g\big(\Delta_{\mathcal{M}} f\big) \quad \text{for every } g \in G
\]

Why does this matter? Linear Algebra. If two linear operators commute, each eigenspace of one is invariant under the other. This means that when the group \(G\) acts on the manifold, it doesn’t mix up all the frequencies: it only mixes eigenfunctions that share the same eigenvalue \(\lambda\).

The group action effectively becomes a set of orthogonal matrices \(D^\lambda(g)\) acting on the specific eigenspace \(V_\lambda\). This allows us to decompose the massive, complex problem of invariance into many tiny, independent problems—one for each frequency \(\lambda\).
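Here is what that block structure looks like in the simplest case (my own sketch, on the circle): a rotation \(g: t \mapsto t + \alpha\) acts on the two-dimensional eigenspace \(\mathrm{span}\{\cos(kt), \sin(kt)\}\) as a \(2 \times 2\) rotation matrix \(D^{\lambda}(g)\), and it never mixes different frequencies:

```python
import numpy as np

# The rotation t -> t + alpha acts on span{cos(k t), sin(k t)} as an orthogonal
# 2x2 matrix -- the matrix D^lambda(g) for this eigenspace.
alpha, k = 0.7, 3
t = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)      # a few test points

D = np.array([[np.cos(k * alpha), -np.sin(k * alpha)],
              [np.sin(k * alpha),  np.cos(k * alpha)]])

basis = np.stack([np.cos(k * t), np.sin(k * t)])           # phi_{lambda,1}, phi_{lambda,2}
rotated = np.stack([np.cos(k * (t + alpha)), np.sin(k * (t + alpha))])

print(np.allclose(D @ basis, rotated))                     # True
```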

3. Reformulating the Optimization

Ideally, we want to solve the Empirical Risk Minimization (ERM) problem—finding the function that best fits our data—with a constraint that the function must be invariant.

\[
\widehat{f} = \arg\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2 \quad \text{subject to} \quad f(gx) = f(x) \;\; \forall g \in G,\ \forall x \in \mathcal{M}
\]

The constraint \(\forall (g,x): f(gx) = f(x)\) is difficult because there are too many \(g\)’s and infinitely many \(x\)’s.

However, using the spectral decomposition we established above, we can translate this condition into the spectral domain. A function \(f\) is invariant if and only if its coefficients \(f_\lambda\) satisfy a linear equation for every group element:

\[
D^{\lambda}(g)\, f_{\lambda} = f_{\lambda} \quad \text{for all } g \in G \text{ and every eigenvalue } \lambda,
\]

where \(f_{\lambda} = (f_{\lambda, 1}, f_{\lambda, 2}, \dots)\) is the vector of coefficients of \(f\) in the eigenspace \(V_\lambda\).
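As a quick sanity check (my own toy example, again on the circle), take the cyclic symmetry \(t \mapsto t + 2\pi/m\): the constraint \(D^{\lambda}(g) f_{\lambda} = f_{\lambda}\) admits a nonzero solution only when the frequency \(k\) is a multiple of \(m\), so only those frequencies can appear in an invariant function:

```python
import numpy as np

# For the cyclic shift t -> t + 2*pi/m, the fixed subspace {v : D v = v} of the
# eigenspace span{cos(k t), sin(k t)} is nonzero only when k is a multiple of m.
m = 4
a = 2.0 * np.pi / m
for k in range(1, 9):
    D = np.array([[np.cos(k * a), -np.sin(k * a)],
                  [np.sin(k * a),  np.cos(k * a)]])
    sing = np.linalg.svd(D - np.eye(2), compute_uv=False)
    fixed_dim = int(np.sum(sing < 1e-9))      # nullity of (D - I)
    print(k, fixed_dim)                       # 2 when k % m == 0, otherwise 0
```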

4. Compressing the Constraints

We still have a problem: we technically have to check this constraint for every \(g \in G\). For large groups, this is still too slow.

The authors utilize a fundamental property of group theory: Generators. You don’t need to check every element of a group; you only need to check the generating set \(S\). If a function is invariant to the generators, it is invariant to the whole group.

\[
f(gx) = f(x) \;\; \forall g \in S \quad \Longrightarrow \quad f(gx) = f(x) \;\; \forall g \in G
\]

Crucially, every finite group admits a generating set of size at most \(\log_2(|G|)\). This provides an exponential reduction in complexity: we go from checking \(d!\) constraints to roughly \(\log(d!) \approx d \log d\) constraints (for the permutation group, in fact, two generators, a transposition and a \(d\)-cycle, already suffice).
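A minimal sketch of why this matters so much (my own illustration, using the permutation group \(S_d\)): two generators already reach all \(d!\) elements, while \(\log_2(d!)\) stays tiny:

```python
import math

# The permutation group S_d is generated by just two elements: a transposition
# and a d-cycle. We verify this by brute force for a small d.
def compose(p, q):                             # (p o q)(i) = p[q[i]]
    return tuple(p[i] for i in q)

d = 5
transposition = (1, 0) + tuple(range(2, d))    # swap the first two coordinates
cycle = tuple(range(1, d)) + (0,)              # cyclic shift of the coordinates

generated = {tuple(range(d))}                  # start from the identity
frontier = [tuple(range(d))]
while frontier:
    p = frontier.pop()
    for s in (transposition, cycle):
        q = compose(s, p)
        if q not in generated:
            generated.add(q)
            frontier.append(q)

print(len(generated) == math.factorial(d))     # True: 2 generators reach all 120
print(math.factorial(d), math.ceil(math.log2(math.factorial(d))))   # 120 vs 7
```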

The Algorithm: Spec-Avg

The resulting algorithm is elegant and surprisingly simple. It effectively “truncates” the infinite series of eigenfunctions at a certain frequency \(D\) (based on sample size \(n\)) and solves a Quadratic Program for each frequency.

Here is the step-by-step process:

  1. Unconstrained Estimation: First, compute a naive estimate of the coefficients \(\widetilde{f}_{\lambda, \ell}\) based on the data. This is just a weighted sum of the eigenfunctions evaluated at the data points: \(\widetilde{f}_{\lambda, \ell} = \frac{1}{n} \sum_{i=1}^{n} y_i\, \phi_{\lambda, \ell}(x_i)\).

  2. Projection (The Correction): We effectively take this naive estimate and “project” it onto the space of invariant functions. We find the closest coefficients \(\widehat{f}_{\lambda}\) to our naive estimate that satisfy the symmetry constraints: \(\widehat{f}_{\lambda} = \arg\min_{v} \|v - \widetilde{f}_{\lambda}\|_2^2\) subject to \(D^{\lambda}(g)\, v = v\) for all \(g \in S\).

Because the constraints are linear, this optimization problem has a closed-form solution. We don’t need to run a gradient descent loop; we can just compute it using matrix operations.

\[
\widehat{f}_{\lambda} = \big(I - A_{\lambda}^{+} A_{\lambda}\big)\, \widetilde{f}_{\lambda}, \qquad A_{\lambda} = \begin{bmatrix} D^{\lambda}(g_1) - I \\ \vdots \\ D^{\lambda}(g_{|S|}) - I \end{bmatrix},
\]

i.e., the orthogonal projection of the naive estimate onto the null space of the constraint matrix stacked over the generators \(g_1, \dots, g_{|S|}\) (here \(A_{\lambda}^{+}\) is the Moore–Penrose pseudoinverse).

  3. Reconstruction: Finally, we sum up the corrected coefficients (over all frequencies up to the truncation level \(D\)) to get our invariant estimator: \(\widehat{f}(x) = \sum_{\lambda \le D} \sum_{\ell} \widehat{f}_{\lambda, \ell}\, \phi_{\lambda, \ell}(x)\).
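Putting the three steps together, here is a minimal end-to-end sketch (my own toy version, not the authors’ code): the manifold is the circle, the symmetry group is the reflection \(g: t \mapsto -t\) (so invariant functions are exactly the even ones), and the projection step uses the pseudoinverse formula above:

```python
import numpy as np

# Toy Spec-Avg on the circle with the reflection symmetry t -> -t.
# The reflection acts on span{cos(k t), sin(k t)} as D^k(g) = diag(1, -1),
# so the invariant subspace keeps cosines and kills sines.
rng = np.random.default_rng(0)

n, D_max, sigma = 2000, 8, 0.1

def f_true(x):                                 # invariant (even) ground truth
    return np.cos(x) + 0.5 * np.cos(3 * x)

t = rng.uniform(0.0, 2.0 * np.pi, size=n)
y = f_true(t) + sigma * rng.normal(size=n)

f_hat_coeffs = {}
for k in range(1, D_max + 1):
    # Step 1: unconstrained estimate of the coefficients in eigenspace V_k.
    phi = np.stack([np.cos(k * t), np.sin(k * t)])
    f_tilde = 2.0 * phi @ y / n

    # Step 2: project onto the invariant subspace, using the closed form
    # (I - A^+ A) with A stacking D^k(g) - I over the generating set S.
    A = np.diag([1.0, -1.0]) - np.eye(2)
    f_hat_coeffs[k] = (np.eye(2) - np.linalg.pinv(A) @ A) @ f_tilde

# Step 3: reconstruct the invariant estimator on a test grid.
t_test = np.linspace(0.0, 2.0 * np.pi, 200)
f_hat = sum(c[0] * np.cos(k * t_test) + c[1] * np.sin(k * t_test)
            for k, c in f_hat_coeffs.items())

print(np.max(np.abs(f_hat - f_hat[::-1])))     # ~0: the estimator is exactly even
print(np.mean((f_hat - f_true(t_test)) ** 2))  # small test error
```

In the real algorithm the eigenfunctions come from the manifold’s Laplace-Beltrami operator and the truncation level \(D\) grows with the sample size \(n\); the toy above only mirrors the overall structure.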

Does It Work? The Experiments

Theory is good, but does it hold up in practice? The authors tested the Spec-Avg algorithm against standard Kernel Ridge Regression (KRR) on a high-dimensional Torus with sign-invariance symmetries.

Exact Invariance

First, they measured the Invariance Discrepancy (ID)—a metric of how much the model violates the symmetry.

\[
\mathrm{ID}(\widehat{f}) = \sup_{x \in \mathcal{M},\ g \in G} \big| \widehat{f}(gx) - \widehat{f}(x) \big|
\]
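In practice one can only estimate this quantity; here is a hedged sketch (my own, not the paper’s evaluation code) that measures the worst-case violation over sampled points and a list of group actions:

```python
import numpy as np

# Empirical invariance discrepancy: worst-case violation |f(gx) - f(x)| over a
# sample of points and a set of group actions (here just a reflection).
def invariance_discrepancy(f, points, group_actions):
    return max(np.max(np.abs(f(g(points)) - f(points))) for g in group_actions)

points = np.random.default_rng(1).uniform(0.0, 2.0 * np.pi, size=1000)
reflection = lambda t: -t

f_almost = lambda t: np.cos(t) + 0.05 * np.sin(2 * t)   # slightly non-invariant
f_exact = lambda t: np.cos(t) + 0.3 * np.cos(3 * t)     # exactly even

print(invariance_discrepancy(f_almost, points, [reflection]))   # roughly 0.1
print(invariance_discrepancy(f_exact, points, [reflection]))    # 0.0 (up to rounding)
```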

As shown in Figure 1 below, standard KRR (the colored lines) fails to achieve exact invariance. The discrepancy decreases as we add data, but it never hits zero. In contrast, Spec-Avg is mathematically guaranteed to have an ID of exactly 0 (not plotted because it would just be a flat line at the bottom).

Figure 1: Invariance Discrepancy vs Samples.

Generalization Performance

Crucially, enforcing this exact invariance does not hurt the model’s ability to learn. The authors proved theoretically that Spec-Avg achieves the same optimal convergence rate as standard methods: \(\mathcal{O}(n^{-\alpha/(1+\alpha)})\).

Figure 2 confirms this empirically. The test error for Spec-Avg (solid lines) decreases at the same rate as KRR (shaded regions). You get the benefit of exact physics compliance without paying a penalty in accuracy.

Figure 2: Test Error vs Samples.

Conclusion

The contribution of this paper is significant for the field of Geometric Deep Learning. It proves that we do not need to enumerate every possible symmetry transformation to learn invariant functions.

By shifting the perspective from the spatial domain to the spectral domain, the authors converted a computationally intractable problem (summing over a massive group) into a manageable set of linear constraints solvable in polynomial time.

Key Takeaways:

  1. Polynomial Time: This is the first algorithm to achieve exact invariance for kernels in time polynomial in \(n\), \(d\), and \(\log|G|\).
  2. No Statistical Trade-off: The method converges to the true function just as fast as non-invariant methods.
  3. Generators are Key: Using the generating set of a group is the secret to avoiding the curse of dimensionality in group size.

For students and practitioners, this work opens the door to more rigorous machine learning applications in physics and chemistry, where “mostly invariant” just isn’t good enough.