In the landscape of modern artificial intelligence, few phenomena are as puzzling as “grokking.”

Imagine training a neural network on a difficult math problem. For a long time—thousands of training steps—the model seems to memorize the training data perfectly, yet it fails miserably on any new, unseen test data. Its test accuracy sits stubbornly at 0%. Then, suddenly, often long after you might have given up and stopped the training, the test accuracy rockets upward, snapping from 0% to 100%. The model has suddenly “grokked” the underlying logic.

This delayed generalization, or “emergence,” challenges our traditional understanding of learning curves. Until recently, the prevailing wisdom was that grokking is a unique property of deep neural networks and gradient descent optimization. It was viewed as a mysterious phase transition specific to the complex alchemy of deep learning.

A fascinating research paper, “Emergence in Non-Neural Models: Grokking Modular Arithmetic via Average Gradient Outer Product,” challenges this assumption entirely. The researchers demonstrate that grokking is not exclusive to neural networks. In fact, they reproduce the exact same phenomenon using Kernel Machines—a much older, mathematically simpler class of models—by equipping them with a specific feature learning mechanism.

In this deep dive, we will explore how non-neural models can “grok,” the hidden structures they learn, and what this tells us about the fundamental nature of learning itself.


1. The Problem: Learning Modular Arithmetic

To understand grokking, we first need to understand the task used to observe it. The standard benchmark for these experiments is modular arithmetic.

Modular arithmetic is the math of clocks. In a modulo \(p\) system (where \(p\) is a prime number), integers wrap around. For example, modulo 5:

\[2 + 2 \equiv 4 \pmod{5}\]

\[3 + 3 \equiv 1 \pmod{5} \quad \text{(because } 6 \text{ wraps around to } 1\text{)}\]

The task seems simple to a human who knows the rule. But to a machine learning model fed raw data (like “Input: 3, 3; Output: 1”), it is extremely difficult. The inputs are discrete symbols, and there is no obvious “smoothness” to exploit: inputs that are numerically close do not produce outputs that are numerically close.
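To make the setup concrete, here is a minimal sketch of how such a benchmark is typically constructed. The modulus \(p = 31\) and the concatenated one-hot encoding are illustrative assumptions on my part, not details taken from the paper:

```python
import numpy as np

p = 31                                    # a small prime modulus (illustrative)
pairs = [(a, b) for a in range(p) for b in range(p)]
labels = [(a + b) % p for a, b in pairs]  # target: modular addition

# The model only sees discrete symbols, so each operand is one-hot encoded
# and the two encodings are concatenated into a single 2p-dimensional input.
def one_hot_pair(a, b, p):
    x = np.zeros(2 * p)
    x[a], x[p + b] = 1.0, 1.0
    return x

X = np.array([one_hot_pair(a, b, p) for a, b in pairs])  # shape (p*p, 2p)
Y = np.eye(p)[labels]                                     # one-hot targets
```

A random subset of these \(p^2\) pairs serves as the training set; the rest is held out to measure generalization.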

When neural networks learn this, they exhibit the grokking curve: perfect training accuracy early on, but flatlining test accuracy until a sudden “Aha!” moment much later.

2. Introducing the Recursive Feature Machine (RFM)

The authors of this paper set out to see if this phenomenon could be replicated without a neural network. They turned to Kernel Machines.

Classically, kernel methods (like Support Vector Machines) rely on a fixed “kernel function” that measures similarity between data points. They don’t change their internal representation of the data during training the way neural networks do. Because they don’t learn features, they usually perform poorly on tasks like modular arithmetic.

However, the authors use an algorithm called the Recursive Feature Machine (RFM).

How RFM Works

The RFM is a kernel machine that can learn features. It does this iteratively. Instead of training once, it trains, looks at what it learned, adjusts its internal view of the data (its features), and trains again.

The secret sauce powering RFM is the Average Gradient Outer Product (AGOP).

The Mechanism: AGOP

The AGOP is a matrix that captures which directions in the input space are most important for the function the model is trying to learn. Mathematically, it is defined as:

\[ G \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^{\top} \]

Here is the intuition:

  1. Train a predictor (\(f\)) on the current data.
  2. Calculate gradients: Look at how the output of the model changes as you wiggle the inputs (\(\frac{\partial f}{\partial x}\)). Directions where the gradient is high are “important” features.
  3. Compute AGOP: Average the outer products of these gradients over all data points to get a matrix (\(G\)) that summarizes the global feature importance.
  4. Transform Data: Use this matrix to stretch and rotate the input space (in practice, by plugging it into the kernel as a learned, Mahalanobis-style distance), emphasizing the important features.
  5. Repeat.

By looping through these steps, the RFM recursively refines its features.
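To make the loop concrete, here is a minimal sketch of an RFM-style iteration. It is not the paper’s implementation: it assumes a Gaussian kernel with a Mahalanobis-style metric, plain kernel ridge regression, and no normalization of the feature matrix between iterations, all of which are simplifications.

```python
import numpy as np

def mahalanobis_gaussian_kernel(X, Z, M, sigma=1.0):
    """k(x, z) = exp(-(x - z)^T M (x - z) / (2 sigma^2)) for symmetric PSD M."""
    d2 = ((X @ M) * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] - 2 * (X @ M) @ Z.T
    return np.exp(-np.clip(d2, 0.0, None) / (2 * sigma**2))

def rfm(X, Y, n_iters=5, reg=1e-3, sigma=1.0):
    """Sketch of the RFM loop: fit a kernel predictor, compute its AGOP,
    use the AGOP as the new metric, and repeat."""
    n, d = X.shape
    M = np.eye(d)                                    # start with an isotropic metric
    for _ in range(n_iters):
        K = mahalanobis_gaussian_kernel(X, X, M, sigma)
        alpha = np.linalg.solve(K + reg * np.eye(n), Y)      # kernel ridge fit
        G = np.zeros((d, d))                         # AGOP accumulator
        for i in range(n):
            diffs = X[i] - X                         # (n, d): x_i - x_j for all j
            # Jacobian of f(x) = sum_j alpha_j k(x, x_j) at x = x_i, shape (d, classes)
            J = -(M @ (diffs.T @ (K[i][:, None] * alpha))) / sigma**2
            G += J @ J.T / n                         # average outer product of gradients
        M = G                                        # the AGOP becomes the new metric
    return alpha, M
```

The details (choice of kernel, regularization, how \(M\) is normalized between rounds) all matter in practice; the point here is only the shape of the loop: train, measure the AGOP, re-weight the inputs, repeat.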

Grokking with RFM

When the researchers applied RFM to the modular arithmetic task, they observed something remarkable. As shown in the figure below, the kernel machine exhibits the classic grokking signature.

Figure 1. Recursive Feature Machines grok the modular arithmetic task. The top graph shows accuracy staying flat then spiking. The bottom graph shows loss decreasing.

Look at the green line in the top chart (Test Accuracy). For the first 10-12 iterations, the model is clueless on the test set (near 0% accuracy), even though it has already memorized the training set (dashed red line). Then, sharply, the test accuracy jumps to 100%.

This demonstrates the paper’s central claim: grokking is not unique to neural networks. It is a general phenomenon of feature learning.


3. Unveiling the Hidden Structure

If both neural networks and RFMs grok, they must be learning a specific type of feature that solves the problem. Because RFMs are mathematically more transparent than neural networks, we can actually “open the box” and look at the learned feature matrices.

The feature matrix (\(M\)), built from the AGOP, tells us how the model transforms inputs. The researchers visualized these matrices for different operations: Addition, Subtraction, Multiplication, and Division.

Figure 3. Visualizations of learned feature matrices (AGOP). Top row shows Addition/Subtraction. Bottom row shows Mul/Div and their reordered versions.

The Stripes of Addition

In the image above, look at (A) for Addition (“Add”). The matrix displays a distinct striped pattern. This is known as a Circulant Matrix.

In a circulant matrix, each row is a shifted version of the one before it. The values are constant along the diagonals. The researchers observed that the learned feature matrix \(M^*\) essentially takes this form:

Equation showing the structure of the learned feature matrix with circulant blocks.

This structure is significant because circulant matrices are deeply connected to the Fourier Transform. This suggests the model is discovering a Fourier-based strategy to solve addition.
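That connection is easy to verify numerically. The snippet below is a generic illustration (not code from the paper): a circulant matrix is diagonalized by the discrete Fourier transform, and its eigenvalues are the DFT of its defining vector.

```python
import numpy as np
from scipy.linalg import circulant

n = 5
c = np.arange(n, dtype=float)
C = circulant(c)                     # every row/column is a cyclic shift of c

F = np.fft.fft(np.eye(n))            # DFT matrix
D = F @ C @ np.linalg.inv(F)         # diagonalizes C (up to numerical noise)

print(np.allclose(D, np.diag(np.fft.fft(c))))   # True: eigenvalues = DFT of c
```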

The Puzzle of Multiplication

Now, look at the “Mul” (Multiplication) matrix in the bottom left of Figure 3. It looks like static noise. There is no obvious pattern. Does this mean the model uses a completely different, unstructured method for multiplication?

Not quite. The researchers applied a clever transformation. They reordered the rows and columns using a concept from number theory called the Discrete Logarithm.

Just as a standard logarithm turns multiplication into addition (\(\log(ab) = \log a + \log b\)), the discrete logarithm maps the multiplicative group of integers modulo \(p\) to an additive group.

When the researchers reordered the indices of the “noisy” multiplication matrix using the discrete log, the hidden structure revealed itself (Figure 3, bottom right, “Mul (reordered)”). It is the same striped, circulant pattern.

Key Insight: The model learned to treat multiplication exactly like addition, but in a “logarithmic” feature space. It discovered the mathematical relationship between the two operations on its own.
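The reordering trick is easy to reproduce at toy scale. The sketch below is an illustration rather than the paper’s code; it uses \(p = 7\) with generator \(g = 3\). Indexing the nonzero residues by powers of \(g\) turns the multiplication table into an addition table modulo \(p - 1\), which is exactly the circulant structure.

```python
import numpy as np

p, g = 7, 3                                   # small prime and a primitive root mod p
powers = [pow(g, k, p) for k in range(p - 1)] # g^0, g^1, ..., g^(p-2): all nonzero residues
dlog = {v: k for k, v in enumerate(powers)}   # discrete log: dlog[g^k mod p] = k

# Multiplication table over the nonzero residues, in natural order 1..p-1.
mul = np.array([[(a * b) % p for b in range(1, p)] for a in range(1, p)])

# Reorder rows/columns so that index k corresponds to g^k, then take discrete
# logs of the entries: dlog(g^i * g^j) = (i + j) mod (p - 1), an addition table.
reordered = np.array([[dlog[(powers[i] * powers[j]) % p] for j in range(p - 1)]
                      for i in range(p - 1)])

print(mul)        # looks unstructured
print(reordered)  # circulant: each row is a cyclic shift of the previous one
```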


4. Hidden Progress: Why the “Jump” Happens

One of the mysteries of grokking is the silence before the storm. Why does the model sit at 0% test accuracy for so long? Is it doing nothing?

The paper argues that standard metrics (accuracy and loss) are deceptive. Under the surface, the model is making steady, linear progress. To prove this, the authors introduced two new “progress measures”:

  1. Circulant Deviation: A measure of how far the feature matrix is from being perfectly circulant (lower means more “striped”).
  2. AGOP Alignment: A measure of how similar the current features are to the final, perfect features.
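For intuition, here is one plausible way to implement such measures. These are my own simplified definitions, and the paper’s exact formulas may differ: circulant deviation as the variance of entries along wrapped diagonals, and AGOP alignment as a cosine similarity between feature matrices.

```python
import numpy as np

def circulant_deviation(M):
    """Variance of the entries along each wrapped diagonal, averaged and
    scale-normalized. Zero for a perfectly circulant (striped) matrix."""
    n = M.shape[0]
    # Roll row i left by i so every wrapped diagonal lines up as a column.
    aligned = np.stack([np.roll(M[i], -i) for i in range(n)])
    return aligned.var(axis=0).mean() / (M ** 2).mean()

def agop_alignment(M, M_final):
    """Cosine similarity between the current and the final feature matrix."""
    return np.sum(M * M_final) / (np.linalg.norm(M) * np.linalg.norm(M_final))

C = np.array([[(j - i) % 5 for j in range(5)] for i in range(5)], dtype=float)
print(circulant_deviation(C))                         # 0.0: perfectly circulant
print(circulant_deviation(np.random.rand(5, 5)) > 0)  # True: no stripe structure
```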

Figure 2. Comparison of standard metrics vs. hidden progress measures. Section B shows the hidden measures improving steadily.

In Section A of the figure above, you see the “mirage”: accuracy (top left) is flat for a long time. But look at Section B. The blue lines tell a different story.

  • Circulant Deviation (top row, B): Steadily decreases from the very start. The matrix is slowly organizing itself into stripes.
  • AGOP Alignment (bottom row, B): Steadily increases.

The “sudden” jump in accuracy is simply the moment when the features finally become good enough to support the correct classification. The learning itself was continuous, not sudden.


5. Neural Networks Do It Too

A critic might ask: “Okay, RFMs do this, but do Neural Networks really do the same thing?”

The authors show that yes, they do. They trained standard fully-connected neural networks on the same tasks and analyzed the weights of the first layer. They calculated the Neural Feature Matrix (NFM), which is essentially the covariance of the weights (\(W^T W\)).

According to the Neural Feature Ansatz, this NFM should look like the AGOP.

\[ W^{\top} W \;\propto\; \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^{\top} \]
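As an illustration of how both quantities are computed, here is a generic sketch using PyTorch for automatic differentiation with an untrained toy network; this is not the paper’s experimental setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h, n = 20, 64, 128
net = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))
X = torch.randn(n, d)

# Neural Feature Matrix: the "covariance" of the first-layer weights.
W1 = net[0].weight                          # shape (h, d)
nfm = (W1.T @ W1).detach()

# AGOP of the network: average outer product of input gradients over the data.
agop = torch.zeros(d, d)
for x in X:
    x = x.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(net(x).squeeze(), x)
    agop += torch.outer(grad, grad) / n

# The ansatz predicts these two matrices become (roughly) proportional as the
# network trains; this snippet only shows how each one is measured.
cos = torch.sum(nfm * agop) / (nfm.norm() * agop.norm())
print(f"NFM vs. AGOP cosine similarity: {cos.item():.3f}")
```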

When they visualized these neural network features, the resemblance was uncanny.

Figure 6. Neural Feature Matrices (NFM) and NN AGOPs showing the same striped patterns as RFM.

Just like the RFM, the Neural Network learns block-circulant features for addition. And just like the RFM, when you reorder the Neural Network’s features for multiplication using the discrete log, the stripes appear.

This confirms that Neural Networks and Recursive Feature Machines are converging to the same algorithmic solution.


6. The Algorithm: Fourier Multiplication

Why do both models converge to these “striped” circulant matrices? What algorithm are they actually implementing?

The paper proposes that the models are learning the Fourier Multiplication Algorithm (FMA).

In signal processing, a fundamental theorem states that convolution in the time domain is equivalent to multiplication in the frequency domain. Circulant matrices are operators that perform circular convolution. By learning these matrices, the models are effectively performing a Discrete Fourier Transform (DFT) on the input data.

The paper proves mathematically that a kernel predictor using these specific circulant features is equivalent to computing the answer via Fourier transforms:

Equation for Fourier Multiplication Algorithm for modular addition.

In simple terms:

  1. The model uses learned features to convert the input numbers into “frequencies” (DFT).
  2. It combines these frequencies (element-wise product).
  3. It converts the result back to get the answer.

This is a highly efficient way to solve modular arithmetic, and both the Kernel Machine and the Neural Network discover it automatically.
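Here is the underlying arithmetic trick in isolation, as a generic illustration of Fourier-based modular addition rather than the paper’s predictor: circularly convolving the one-hot encodings of \(a\) and \(b\) yields the one-hot encoding of \((a + b) \bmod p\), and that convolution can be computed with DFTs.

```python
import numpy as np

def fourier_mod_add(a, b, p):
    """(a + b) mod p via DFT -> element-wise product -> inverse DFT."""
    ea, eb = np.zeros(p), np.zeros(p)
    ea[a], eb[b] = 1.0, 1.0
    conv = np.fft.ifft(np.fft.fft(ea) * np.fft.fft(eb)).real  # circular convolution
    return int(np.argmax(conv))          # the peak sits at (a + b) mod p

p = 11
assert all(fourier_mod_add(a, b, p) == (a + b) % p
           for a in range(p) for b in range(p))
```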


7. Validating the Discovery

If the “secret sauce” is simply these circulant features, we should be able to “inject” them into a model and see instant learning.

The researchers tested this hypothesis. They took a standard kernel machine (which usually cannot learn this task) and manually transformed the input data using random circulant matrices. They didn’t even train the features—they just forced the structure.

Figure 4. Random circulant features allow standard kernels to generalize instantly (orange line), beating learned features (green line).

The result (Figure 4, Orange Line) is striking. The model with fixed random circulant features generalizes almost immediately. It doesn’t need to grok. The grokking process we usually see is simply the time it takes for the model to “find” and build these circulant features from scratch. Once the structure is there, the problem is easy.
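A sketch of what such an “injection” could look like (my own reconstruction of the idea, with arbitrary choices for the modulus and encoding): fix random circulant matrices, apply them to the one-hot operands, and hand the transformed inputs to an ordinary kernel method.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
p = 11

# Fixed random circulant "feature maps", one per operand -- never trained.
C1 = circulant(rng.standard_normal(p))
C2 = circulant(rng.standard_normal(p))

def featurize(a, b):
    ea, eb = np.zeros(p), np.zeros(p)
    ea[a], eb[b] = 1.0, 1.0
    return np.concatenate([C1 @ ea, C2 @ eb])

# Transformed dataset for (a + b) mod p; a standard kernel machine can now be
# trained on (X, y) with the circulant structure already built in.
X = np.array([featurize(a, b) for a in range(p) for b in range(p)])
y = np.array([(a + b) % p for a in range(p) for b in range(p)])
```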


8. Conclusion and Key Takeaways

This paper offers a refreshing and demystifying look at AI “magic.” Here are the big takeaways for students and researchers:

  1. Grokking isn’t magic: It is a visible side-effect of a model slowly learning the correct features (representation learning) while the readout metric (accuracy) remains insensitive to those partial improvements.
  2. It’s not just Deep Learning: The fact that Kernel Machines (via RFM) exhibit grokking proves that this is a fundamental property of learning from data, not just a quirk of backpropagation or neural architectures.
  3. Structure matters: The models aren’t memorizing answers; they are discovering profound mathematical structures (Fourier transforms, Discrete Logs) to solve the tasks efficiently.
  4. We can measure the invisible: By using better metrics like AGOP alignment or Circulant Deviation, we can see “hidden progress” that standard loss and accuracy curves miss.

By stripping away the complexity of neural networks and using the more interpretable Recursive Feature Machine, these researchers have provided a clear window into how machines learn to reason about abstract mathematics. The “sudden” emergence of intelligence is actually the result of a long, steady, and structured journey.