The mystery of generalization—why a neural network trained on specific images or text performs well on data it has never seen before—is the “dark matter” problem of deep learning.

For years, a leading hypothesis has been the concept of sharpness (or its inverse, flatness). The intuition is simple: if a neural network finds a solution in a “flat” valley of the loss landscape, the solution is robust. If the training data shifts slightly (simulating the difference between train and test distributions), the loss doesn’t skyrocket. Conversely, a “sharp” minimum means even a tiny shift results in high error.

This theory has worked reasonably well for Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). But recently, a problem emerged. When researchers applied these sharpness measures to Transformers—the architecture powering ChatGPT, Claude, and Gemini—the correlation broke. Flatness no longer seemed to predict generalization reliably.

In a fascinating paper titled “Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It”, researchers da Silva, Dangel, and Oore argue that the problem isn’t with the concept of sharpness, but with how we measure it. Transformers possess rich, complex symmetries that distort Euclidean space. To truly see the landscape, we must abandon standard geometry and adopt Riemannian geometry.

In this post, we will walk through their derivation, moving from the basic intuition of loss landscapes to the complex geometry of quotient manifolds, and finally seeing how this new “Geodesic Sharpness” successfully predicts generalization in large-scale Transformers.

The Symmetry Problem

To understand why standard sharpness fails, we first need to understand symmetries in neural networks. A symmetry in a parameter space means you can change the weights of the network without changing the function the network computes (and consequently, without changing the loss).

The Illusion of Euclidean Space

Consider two consecutive linear layers. If you double the weights in the first and halve them in the second, the composed output is unchanged (with a ReLU in between, the same holds for any positive scale factor, by positive homogeneity). In standard Euclidean space, these two sets of weights look like two distinct points far apart. But in terms of the function they represent, they are the same point.
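Here is a minimal numpy sketch of this rescaling symmetry for a pair of linear layers (the names and shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # a batch of 5 inputs
W1 = rng.normal(size=(4, 8))   # first-layer weights
W2 = rng.normal(size=(8, 3))   # second-layer weights

alpha = 2.0
W1_scaled = alpha * W1         # double the first layer...
W2_scaled = W2 / alpha         # ...and halve the second

# Far apart as points in Euclidean parameter space:
dist = np.sqrt(np.linalg.norm(W1 - W1_scaled) ** 2 +
               np.linalg.norm(W2 - W2_scaled) ** 2)
print(dist)                    # clearly nonzero

# ...but the function they compute is identical:
print(np.allclose(x @ W1 @ W2, x @ W1_scaled @ W2_scaled))  # True
```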

This creates a massive problem for measuring sharpness.

Figure 1: Quantities from the Riemannian quotient manifold respect the loss landscape’s symmetry; Euclidean quantities do not.

Look at Figure 1 above. This toy example shows a loss landscape with a scaling symmetry (specifically a GL(1) symmetry).

  • (a) The Loss: Notice the hyperbolic “valleys.” The loss is constant along these curves.
  • (b) Euclidean Gradient: If you calculate the gradient norm using standard Euclidean math, you get concentric circles. This creates a contradiction: the gradient norm changes as you move along a curve where the loss is constant! The Euclidean measure is “confused” by the parameter scaling.
  • (c) Riemannian Gradient: By adjusting our geometry to respect the symmetry, the gradient norm becomes constant along the constant-loss curves.

The authors argue that standard sharpness measures act like Figure 1(b)—they are sensitive to parameter scaling that shouldn’t matter.
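We can reproduce this contradiction numerically. The sketch below assumes a toy loss \(L(a, b) = (ab - 1)^2\), which has the same GL(1) symmetry: it is constant along the hyperbolas \(ab = \text{const}\):

```python
import numpy as np

# Toy loss with a GL(1) symmetry: L(a, b) depends only on the product a*b,
# so it is constant along the orbit (a, b) -> (alpha*a, b/alpha).
L    = lambda a, b: (a * b - 1.0) ** 2
grad = lambda a, b: np.array([2 * (a * b - 1.0) * b,   # dL/da
                              2 * (a * b - 1.0) * a])  # dL/db

a, b, alpha = 2.0, 3.0, 4.0
p1, p2 = (a, b), (alpha * a, b / alpha)   # two points on the same orbit

print(L(*p1), L(*p2))                     # same loss at both points
print(np.linalg.norm(grad(*p1)),          # ...but different Euclidean
      np.linalg.norm(grad(*p2)))          #    gradient norms

# Rescaling each gradient component by its weight (one way to build a
# symmetry-respecting norm in this toy case) removes the contradiction:
riem = lambda a, b: np.linalg.norm(np.array([a, b]) * grad(a, b))
print(riem(*p1), riem(*p2))               # identical along the orbit
```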

The Transformer Symmetry: GL(h)

While CNNs have simple scaling symmetries, Transformers have something much more complex: General Linear group symmetry, or \(\mathrm{GL}(h)\).

Definition of GL(h): the general linear group of degree \(h\) is the set of all invertible \(h \times h\) real matrices,

\[ \mathrm{GL}(h) = \{ A \in \mathbb{R}^{h \times h} : \det(A) \neq 0 \}. \]

In the attention mechanism of a Transformer, we have matrices for Queries (\(Q\)), Keys (\(K\)), and Values (\(V\)). The math of attention involves products like \(QK^T\). Because of this matrix multiplication, we can insert an invertible matrix \(A\) and its inverse \(A^{-1}\) between the weights without changing the result.

Action of the symmetry group:

\[ (G, H) \;\longmapsto\; \left(G A,\; H A^{-\top}\right), \qquad (G A)\left(H A^{-\top}\right)^{\top} = G A A^{-1} H^{\top} = G H^{\top}. \]

As shown above, if we transform the weights \(G\) and \(H\) (representing parts of the attention head) using matrix \(A\), the function remains identical. This isn’t just scaling a number up or down; it’s a high-dimensional transformation that twists the entire parameter space.
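Here is a minimal numpy sketch of this invariance (the names X, W_Q, W_K are illustrative stand-ins for token embeddings and one attention head’s query and key weights):

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 16, 4                        # embedding dim, head dim
X   = rng.normal(size=(10, d))      # token embeddings
W_Q = rng.normal(size=(d, h))       # query weights
W_K = rng.normal(size=(d, h))       # key weights

A = rng.normal(size=(h, h))         # a (generically invertible) h x h matrix
W_Q_new = W_Q @ A                   # transform queries by A
W_K_new = W_K @ np.linalg.inv(A).T  # transform keys by A^{-T}

scores     = (X @ W_Q) @ (X @ W_K).T          # Q K^T
scores_new = (X @ W_Q_new) @ (X @ W_K_new).T  # A and A^{-1} cancel
print(np.allclose(scores, scores_new))        # True
```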

Standard “adaptive sharpness” (the measure behind ASAM), which handles simple scaling symmetries by normalizing perturbations elementwise against the weights, is mathematically insufficient for this richer \(\mathrm{GL}(h)\) symmetry: per-coordinate rescaling cannot absorb a transformation that mixes coordinates. This explains why existing measures fail to predict generalization for Transformers.
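For reference, the adaptive worst-case sharpness that this family of measures builds on is usually written roughly as

\[ S^{\mathrm{adapt}}_{\rho}(\theta) \;=\; \max_{\|\delta \,\oslash\, |\theta|\| \le \rho} L(\theta + \delta) - L(\theta), \]

where \(\oslash\) denotes elementwise division. The constraint rescales each coordinate independently, so it is invariant under diagonal scaling but not under a general invertible mixing matrix \(A\).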

Enter Riemannian Geometry

To solve this, the authors propose a radical shift: stop measuring sharpness in the “Total Space” (the raw parameters \(\theta\)) and start measuring it on the Quotient Manifold.

The Quotient Manifold

Imagine the parameter space \(\overline{\mathcal{M}}\) as a piece of paper. Because of symmetries, many points on this paper represent the exact same neural network function. We call the set of all equivalent points an orbit.

If we collapse every orbit into a single point, we get a new, curved shape called the Quotient Manifold (\(\mathcal{M}\)). On this manifold, every point represents a unique function. There is no ambiguity.

The challenge is that we cannot directly run gradient descent on this abstract manifold. We have to run it on the raw parameters (the Total Space). We need a mathematical way to translate concepts from the curved Quotient Manifold back to our concrete parameter space.

Vertical and Horizontal Spaces

To bridge these worlds, the authors introduce a decomposition of the tangent space (the space of possible updates to our weights).

At any point in parameter space, we can break a change in weights (\(\delta\)) into two components:

  1. Vertical Space (\(\mathcal{V}\)): Moving in this direction moves you along the symmetry orbit. The weights change, but the network’s function (and loss) stays exactly the same.
  2. Horizontal Space (\(\mathcal{H}\)): Moving in this direction moves you perpendicular to the orbit. This is the change that actually modifies the network’s function (see the sketch below).
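To make the split concrete, here is a minimal numpy sketch for the toy GL(1) symmetry from earlier, where \(f(a, b) = ab\) and the orbit through \((a, b)\) is \((\alpha a, b/\alpha)\). The projection here is Euclidean for simplicity; the paper projects with respect to its Riemannian metric:

```python
import numpy as np

# The tangent to the orbit (alpha*a, b/alpha) at (a, b) is the
# "vertical" direction v = (a, -b).
a, b = 2.0, 0.5
w = np.array([a, b])
v = np.array([a, -b])                # vertical direction (along the orbit)

delta = np.array([0.3, -0.1])        # some proposed weight perturbation

# Split delta into vertical + horizontal parts.
delta_vert  = (delta @ v) / (v @ v) * v
delta_horiz = delta - delta_vert

# The vertical part leaves the function unchanged (to first order);
# the horizontal part actually changes it.
f = lambda w: w[0] * w[1]
eps = 1e-4
print(f(w + eps * delta_vert) - f(w))   # ~0 (first order)
print(f(w + eps * delta_horiz) - f(w))  # nonzero
```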

Figure 2: Illustrative sketch relating total and quotient space and their tangent spaces.

Figure 2 visualizes this beautifully.

  • The saddle shape represents the Total Space (\(\overline{\mathcal{M}}\)).
  • The curve connecting \([x]\) and \([x']\) is an orbit (Vertical space).
  • To measure true sharpness, we only care about the Horizontal component (\(\xi^H\)). This vector is the “Horizontal Lift”: the genuine change to the network’s function, stripped of symmetry noise.

Defining the Riemannian Metric

To make this rigorous, we need a Riemannian Metric—a way to measure distance and angles that respects the symmetry. The authors propose two specific metrics for the attention layers: the Invariant metric and the Mixed metric.

Definitions of Invariant and Mixed Metrics

These metrics (\(\langle \cdot, \cdot \rangle^{\mathrm{inv}}\) and \(\langle \cdot, \cdot \rangle^{\mathrm{mix}}\)) weight the gradients by the inverse or the transpose of the weight matrices (\(G\) and \(H\)). Notice how different this is from the standard Euclidean dot product. These metrics ensure that no matter how you scale or transform the weights using the symmetry group, the measured distance remains consistent.
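As a scalar caricature of why such reweighting helps (the paper’s actual metrics act on the full matrices \(G\) and \(H\)), measuring a perturbation \(\delta\) at weight \(w\) by its relative size \(|\delta / w|\) gives a norm the scaling symmetry cannot distort:

```python
import numpy as np

# Measure a perturbation delta at weight w by |delta / w| (its relative
# size) instead of the Euclidean |delta|.
w, delta = 2.0, 0.1
alpha = 5.0                                # apply the symmetry: w -> alpha*w
w_s, delta_s = alpha * w, alpha * delta    # the perturbation transforms too

print(abs(delta), abs(delta_s))            # Euclidean norms differ: 0.1 vs 0.5
print(abs(delta / w), abs(delta_s / w_s))  # invariant norms agree: 0.05, 0.05
```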

Geodesic Sharpness

Now we arrive at the core contribution: Geodesic Sharpness.

Standard sharpness measures (like SAM or Random Sharpness) perturb the weights by adding a vector \(\delta\):

\[ \theta_{\mathrm{new}} = \theta + \delta \]

Geometrically, this means moving in a straight line. But in a curved space (like our quotient manifold), the natural analogue of a straight line is a geodesic: the locally shortest path the geometry allows.

A straight-line step taken on a sphere (through the ambient space rather than along the surface) leaves the sphere entirely. Similarly, adding a linear Euclidean perturbation to Transformer weights distorts the measurement, because it ignores the curvature induced by the symmetry group.
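The unit sphere makes this concrete: a straight-line step leaves the surface, while a geodesic step (the sphere’s exponential map) stays on it. A minimal sketch:

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])        # a point on the unit sphere
v = np.array([0.0, 0.4, 0.3])        # a tangent vector at p (p @ v == 0)

# Euclidean ("straight line") step: leaves the sphere.
print(np.linalg.norm(p + v))         # 1.118... -> no longer on the sphere

# Geodesic step (the sphere's exponential map): stays on the sphere.
t = np.linalg.norm(v)
geo = np.cos(t) * p + np.sin(t) * (v / t)
print(np.linalg.norm(geo))           # 1.0
```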

The Formulation

The authors redefine worst-case sharpness using geodesics:

\[ S^{\mathrm{geo}}_{\rho}(\theta) \;=\; \max_{\bar{\xi} \in \mathcal{H},\; \|\bar{\xi}\|_{\theta} \le \rho} L\big(\bar{\gamma}_{\bar{\xi}}(1)\big) - L(\theta), \]

where the perturbation \(\bar{\xi}\) is restricted to the horizontal space and its length is measured with the Riemannian metric.

Here, \(\bar{\gamma}_{\bar{\xi}}(1)\) is the point you arrive at after following, for one unit of time, the geodesic that starts at \(\theta\) with initial velocity \(\bar{\xi}\).

However, calculating exact geodesics is computationally expensive: each one requires solving a differential equation. The authors sidestep this by approximating the geodesic path with a second-order Taylor expansion.

The geodesic is approximated, coordinate by coordinate, as

\[ \bar{\gamma}^{i}(t) \;\approx\; \theta^{i} + \bar{\xi}^{i} t \;-\; \tfrac{1}{2}\, \Gamma^{i}_{kl}\, \bar{\xi}^{k} \bar{\xi}^{l}\, t^{2}. \]

This equation is the “secret sauce.”

  • The first term \(\bar{\xi}^i t\) is the standard linear update.
  • The second term involving \(\Gamma_{kl}^i\) (the Christoffel symbols) is the correction for curvature.

This reveals a stunning insight: Standard adaptive sharpness is just a first-order approximation of Geodesic Sharpness. It assumes the space is flat (Christoffel symbols = 0). By adding the second-order term, the authors correct for the curvature induced by the Transformer’s complex symmetries.
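A one-dimensional example shows the correction at work. Take a single positive weight \(w\) with the scale-invariant metric \(g(w) = 1/w^2\) (distances are relative, not absolute); this scalar case is an illustration, not the paper’s attention-layer metric. The Christoffel symbol is \(\Gamma = -1/w\), the exact geodesic with initial velocity \(\xi\) is \(\bar{\gamma}(t) = w\, e^{\xi t / w}\), and its second-order expansion reads

\[ \bar{\gamma}(t) \;\approx\; w + \xi t + \frac{\xi^2 t^2}{2w}. \]

Truncating after \(\xi t\) recovers the flat, adaptive-style update; the \(\xi^2 t^2 / 2w\) term is precisely the curvature correction.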

The Experiment: Does It Work?

Theory is nice, but does it predict generalization? The authors tested this on three distinct architectures.

1. Diagonal Networks (The Synthetic Test)

They started with diagonal linear networks, a simple architecture where the math is analytically tractable.
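As a quick sketch of the architecture (using one common parameterization of diagonal linear networks, where the prediction is \(\langle u \odot v, x \rangle\)), note that each coordinate carries its own scaling symmetry:

```python
import numpy as np

# Diagonal linear network: prediction is <u * v, x>, so each coordinate i
# has its own GL(1) symmetry (u_i, v_i) -> (alpha_i * u_i, v_i / alpha_i).
rng = np.random.default_rng(2)
d = 6
x = rng.normal(size=d)
u, v = rng.normal(size=d), rng.normal(size=d)

alpha = rng.uniform(0.5, 2.0, size=d)   # one scale factor per coordinate
print(np.allclose((u * v) @ x, ((u * alpha) * (v / alpha)) @ x))  # True
```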

Figure 3: Generalization gap vs. sharpness for diagonal models.

In Figure 3, look at the correlation coefficients (\(\tau\)).

  • Adaptive Sharpness (Left): \(\tau = -0.68\). A decent correlation.
  • Geodesic Sharpness (Middle/Right): \(\tau = -0.83\) and \(-0.86\).

The Riemannian approach significantly tightens the correlation. (Note: negative \(\tau\) here means that, under this particular metric setup, higher measured sharpness correlated with a smaller generalization gap, an interesting inversion of the usual “flat is good” heuristic; what matters is the strength of the predictive relationship, not its sign.)

2. Vision Transformers (ImageNet)

This is the real test. They took 72 fine-tuned CLIP models (ViT-B/32) on ImageNet. Previous work showed that adaptive sharpness fails to predict generalization here.

Figure 4: Generalization gap vs. sharpness for ImageNet models.

Figure 4 shows the results:

  • Adaptive Sharpness (Left): \(\tau = -0.41\). The data points are a scattered cloud. It’s a weak signal.
  • Geodesic Sharpness (Middle/Right): \(\tau = -0.71\) and \(-0.70\).

Suddenly, a clear trend emerges. By accounting for the symmetry curvature, the “noise” in the sharpness measurement resolves into a clear signal. The Riemannian geometry successfully “found” the sharpness that the symmetries were obscuring.

3. Language Models (BERT on MNLI)

Finally, they looked at BERT models fine-tuned on the MNLI dataset.

Figure 5: Generalization gap vs. sharpness for BERT models.

Figure 5 is perhaps the most damning for the old methods:

  • Adaptive Sharpness (Left): \(\tau = 0.06\). Effectively zero correlation. It is random noise.
  • Geodesic Sharpness (Middle/Right): \(\tau = 0.28\) and \(0.38\).

While the correlation isn’t perfect, geodesic sharpness recovers a signal where the Euclidean measure found essentially nothing.

Why the “Curvature Correction” Matters

To prove that the complex math (the Christoffel symbols) was necessary, the authors performed an ablation study. They turned off the second-order correction term in their equation to see if just using the Riemannian metric norm was enough.

Figure 8: Ablation study results.

In Figure 8 (Right plot), with the full Geodesic Sharpness, they achieve \(\tau = 0.38\). In the Middle plot, where they turn off the second-order weight corrections (ignoring the curvature), the correlation drops to \(\tau = 0.24\). This confirms that following the curve (the geodesic) is crucial, not just normalizing the vector.

Conclusion and Implications

The paper “Hide & Seek” offers a profound correction to our understanding of deep learning optimization. It suggests that the “flatness” hypothesis wasn’t wrong, but our Euclidean ruler was broken.

Key Takeaways:

  1. Symmetries Matter: Transformers have high-dimensional symmetries (\(\mathrm{GL}(h)\)) that distort parameter space.
  2. Euclidean is Blind: Standard gradient descent and sharpness measures treat parameters as flat Euclidean vectors, ignoring these distortions.
  3. Riemannian is Real: By modeling the parameter space as a Quotient Manifold and moving along Geodesics, we recover strong correlations between the loss landscape geometry and model performance.

This work opens the door for new optimizers. Just as SAM (Sharpness-Aware Minimization) explicitly optimizes for flatness, a Geodesic SAM could optimize for Riemannian flatness, potentially leading to Transformers that generalize even better than current state-of-the-art models. The math is heavier, but as this paper shows, the geometry finds what the symmetries hide.