In the world of deep learning for science, structure is everything. Whether it’s the folding of a protein, the twisting of an RNA strand, or the dynamics of a particle system, the geometric arrangement of atoms dictates function. To model these systems effectively, neural networks must understand two things: global context (how distant parts of a molecule interact) and equivariance (the laws of physics shouldn’t change just because you rotated the molecule).
For years, the gold standard for capturing global interactions has been the Transformer and its self-attention mechanism. But there is a catch. Self-attention scales quadratically (\(O(N^2)\)). If you double the size of your molecule, the computational cost quadruples. For massive biological systems with tens of thousands of atoms, standard Transformers simply run out of memory.
Enter the Geometric Hyena (G-Hyena).
In this post, we will explore a new architecture that challenges the dominance of attention mechanisms in geometric deep learning. By adapting long-convolutional models (specifically the Hyena operator) to 3D space, researchers have created a model that is arguably faster, leaner, and more capable of handling massive scales than its predecessors.
The Scaling Problem in Geometric Learning
To understand why Geometric Hyena is necessary, we first need to look at the bottlenecks in current methods.
When modeling a 3D structure, like a protein, we deal with two types of data:
- Invariant Features (Scalar): Properties that don’t change with rotation, such as atom type (Carbon, Nitrogen) or charge.
- Equivariant Features (Vector): Properties that rotate with the molecule, such as 3D coordinates, velocities, or forces (the short sketch below makes this distinction concrete).
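To make the distinction concrete, here is a minimal NumPy sketch (an illustration of the general concept, not code from the paper): rotating the molecule leaves scalar features untouched, while vector features rotate with it.

```python
import numpy as np

# A rotation of 90 degrees about the z-axis (an element of SO(3)).
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])

charge = np.array([-0.42, 0.18, 0.24])       # invariant (scalar) feature per atom
positions = np.array([[1.0, 0.0, 0.0],       # equivariant (vector) feature per atom
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

rotated_positions = positions @ R.T          # vectors rotate with the molecule
rotated_charge = charge                      # scalars do not change at all

print(rotated_positions)
print(np.allclose(rotated_charge, charge))   # True: invariance under rotation
```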
A robust model must process both. Standard methods usually fall into one of two camps:
- Message Passing Neural Networks (MPNNs): These look at local neighborhoods (e.g., an atom and its closest neighbors). They are efficient but struggle to “see” the whole picture. Information from one end of a protein takes a long time to propagate to the other.
- Equivariant Transformers: These use attention to let every atom talk to every other atom. This captures global context perfectly but is incredibly expensive.
The graph below illustrates this bottleneck vividly.

As the sequence length (number of atoms/tokens) approaches 30,000, standard Transformer-based models (like the VNT or Equiformer shown in red and purple) see their runtime and memory usage explode. In contrast, the Geometric Hyena (teal line) remains almost flat. In fact, G-Hyena can handle context lengths up to 2.7 million tokens on a single GPU—72 times longer than what an equivariant transformer can handle.
The Geometric Hyena Architecture
How does G-Hyena achieve this efficiency? The secret lies in replacing the expensive “all-to-all” attention matrix with Long Convolutions.
Inspired by the success of the Hyena operator in Large Language Models (LLMs), this architecture processes data in a way that allows for parallel training and sub-quadratic inference. But unlike standard text processing, G-Hyena must respect the 3D symmetries of the physical world.
Here is the high-level view of the Geometric Hyena block:

The architecture generally follows the familiar workflow of a Transformer:
- Projection: Input data is projected into Queries (\(Q\)), Keys (\(K\)), and Values (\(V\)).
- Mixing: A global operation mixes information across the sequence.
- Gating: A mechanism controls the flow of information.
However, the “Mixing” step (highlighted in green in Figure 2b) is where the innovation happens. Instead of attention, G-Hyena uses Geometric Long Convolution.
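Before unpacking each piece, here is a deliberately simplified, runnable sketch of the projection → mixing → gating shape of such a block. The random scalar projections and the particular gating choice are toy stand-ins, not the authors' implementation; the sections that follow describe the real components.

```python
import numpy as np

rng = np.random.default_rng(0)

def fft_conv(u, v):
    """Circular 1-D convolution computed in O(N log N) via the FFT."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def hyena_style_block(alpha, r):
    """Schematic forward pass: projection -> global mixing -> gating.

    alpha: (N,) invariant (scalar) track; r: (N, 3) equivariant (vector) track.
    """
    # 1) Projection: toy scalar weights standing in for the learned Q/K/V projections.
    wq, wk, wv = rng.normal(size=3)
    q_s, k_s, v_s = wq * alpha, wk * alpha, wv * alpha   # scalar queries/keys/values
    q_v, k_v = wq * r, wk * r                            # vector queries/keys

    # 2) Global mixing: FFT long convolution for scalars; a cross-product
    #    long convolution (built from scalar convolutions) for vectors.
    mixed_s = fft_conv(q_s, k_s)
    mixed_v = np.stack([
        fft_conv(q_v[:, 1], k_v[:, 2]) - fft_conv(q_v[:, 2], k_v[:, 1]),
        fft_conv(q_v[:, 2], k_v[:, 0]) - fft_conv(q_v[:, 0], k_v[:, 2]),
        fft_conv(q_v[:, 0], k_v[:, 1]) - fft_conv(q_v[:, 1], k_v[:, 0]),
    ], axis=-1)

    # 3) Gating: scalar values modulate both tracks element-wise
    #    (gating vectors by scalars keeps the vector track equivariant).
    return v_s * mixed_s, v_s[:, None] * mixed_v

out_s, out_v = hyena_style_block(rng.normal(size=64), rng.normal(size=(64, 3)))
print(out_s.shape, out_v.shape)  # (64,) (64, 3)
```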
The model maintains strict equivariance throughout this process. Mathematically, the model \(\Psi\) satisfies the following property, ensuring that if you rotate the input geometric tokens \(\mathbf{x}\), the output \(\hat{\mathbf{x}}\) rotates exactly the same way:
\[
\Psi(\mathbf{R}\,\mathbf{x}) \;=\; \mathbf{R}\,\Psi(\mathbf{x}) \;=\; \mathbf{R}\,\hat{\mathbf{x}} \qquad \text{for every rotation } \mathbf{R} \in SO(3).
\]
Let’s break down the core components of this new operator.
1. The Projection Layer
Before convolution, we need to embed our scalar (invariant) and vector (equivariant) inputs. The authors use a layer inspired by Equivariant Graph Neural Networks (EGNN). This layer processes local neighborhoods to create rich embeddings for the global convolution to consume. It acts as a bridge, converting raw atom data into the \(Q, K, V\) format required for the Hyena operator.
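As a rough illustration of what such a projection might compute, here is a minimal NumPy sketch of a generic EGNN-style update (the standard EGNN recipe with tiny random MLPs; the paper's actual layer and feature sizes will differ): invariant features are updated from messages built on pairwise distances, while coordinates are updated along relative position vectors, which keeps the update equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dim_in, dim_out):
    """A tiny random two-layer MLP (stand-in for a learned network)."""
    W1 = rng.normal(size=(dim_in, 16)) * 0.1
    W2 = rng.normal(size=(16, dim_out)) * 0.1
    return lambda z: np.tanh(z @ W1) @ W2

N, F = 8, 4                      # atoms and invariant feature size
h = rng.normal(size=(N, F))      # invariant features (e.g., atom-type embeddings)
x = rng.normal(size=(N, 3))      # equivariant features (3-D coordinates)

phi_e = mlp(2 * F + 1, F)        # edge/message network
phi_x = mlp(F, 1)                # coordinate-weight network
phi_h = mlp(2 * F, F)            # node-update network

# Pairwise messages built only from invariant inputs (so messages stay invariant).
diff = x[:, None, :] - x[None, :, :]                 # (N, N, 3) relative positions
dist2 = np.sum(diff ** 2, axis=-1, keepdims=True)    # (N, N, 1) squared distances
m = phi_e(np.concatenate(
    [np.repeat(h[:, None], N, 1), np.repeat(h[None, :], N, 0), dist2], axis=-1))

mask = 1.0 - np.eye(N)[..., None]                    # drop self-interactions
# Equivariant coordinate update: a weighted sum of relative position vectors.
x_new = x + np.sum(diff * phi_x(m) * mask, axis=1)
# Invariant feature update: aggregate messages, then mix with the old features.
h_new = phi_h(np.concatenate([h, np.sum(m * mask, axis=1)], axis=-1))

print(h_new.shape, x_new.shape)  # (8, 4) (8, 3)
```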
2. Scalar Long Convolution
For the scalar features (the “invariant” stream), the model uses the standard long convolution found in modern sequence models. Instead of computing the convolution naively (which is slow), it relies on the Fast Fourier Transform (FFT).
By the Convolution Theorem, convolution in the time/space domain is equivalent to multiplication in the frequency domain. This allows the model to compute global interactions in \(O(N \log N)\) time rather than \(O(N^2)\).
\[
\mathbf{q} * \mathbf{k} \;=\; \mathbf{F}^{-1}\!\big( \mathbf{F}\mathbf{q} \,\odot\, \mathbf{F}\mathbf{k} \big)
\]
Here, \(\mathbf{q}\) and \(\mathbf{k}\) are the query and key sequences. They are transformed into the frequency domain (\(\mathbf{F}\)), multiplied element-wise, and transformed back. This is highly efficient.
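A minimal NumPy sketch (assuming circular boundary conditions) shows the equivalence in action: the FFT route gives exactly the same result as the naive \(O(N^2)\) sum, just far cheaper.

```python
import numpy as np

def fft_long_conv(q, k):
    """Global (circular) convolution in O(N log N) via the convolution theorem."""
    return np.fft.irfft(np.fft.rfft(q) * np.fft.rfft(k), n=len(q))

def naive_long_conv(q, k):
    """The same circular convolution computed directly in O(N^2)."""
    N = len(q)
    return np.array([sum(q[s] * k[(t - s) % N] for s in range(N)) for t in range(N)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=512), rng.normal(size=512)
print(np.allclose(fft_long_conv(q, k), naive_long_conv(q, k)))  # True
```

Circular padding is assumed here only for brevity; a linear (non-circular) convolution works the same way after zero-padding the inputs before the FFT.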
3. Equivariant Vector Long Convolution
This is the paper’s primary contribution. Standard FFT convolution works for scalars, but how do you convolve 3D vectors while preserving rotational equivariance?
The researchers introduce the Vector Long Convolution based on the cross product. Unlike the dot product (which results in a scalar), the cross product of two vectors results in a new vector that is equivariant to the rotation of the inputs.
The operation is defined as a sum of cross products over the sequence:
\[
\big(\mathbf{q} \overset{\times}{*} \mathbf{k}\big)_t \;=\; \sum_{s} \mathbf{q}_s \times \mathbf{k}_{t-s}
\]
A naive calculation of this sum would return us to quadratic complexity. To solve this, the authors decompose the cross-product convolution into simpler components. Each component of a cross product is a signed combination of products of the inputs' components, so the vector convolution can be broken down into a sum of six scalar convolutions:

\[
\big(\mathbf{q} \overset{\times}{*} \mathbf{k}\big)^{(l)} \;=\; \sum_{h,\,p} \varepsilon_{lhp}\; \mathbf{q}^{(h)} * \mathbf{k}^{(p)},
\]

where \(\varepsilon_{lhp}\) is the Levi-Civita symbol, which is non-zero for exactly six combinations of \(h\) and \(p\).
By decomposing the vector interaction into scalar components (indexed by \(h\) and \(p\)), the model can essentially run the efficient FFT-based convolution on the components of the vectors and then recombine them. This preserves the \(O(N \log N)\) speed while handling the 3D vector geometry correctly.
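The sketch below (NumPy again, with circular convolutions and illustrative notation rather than the paper's) checks both claims: the six scalar FFT convolutions reproduce the naive sum of cross products, and the result is rotation-equivariant.

```python
import numpy as np

def fft_conv(u, v):
    """Circular scalar convolution in O(N log N)."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def cross_conv_fft(q, k):
    """Vector long convolution y_t = sum_s q_s x k_{t-s}, via six scalar convolutions."""
    # (a x b)_1 = a_2 b_3 - a_3 b_2, and cyclic permutations thereof.
    return np.stack([
        fft_conv(q[:, 1], k[:, 2]) - fft_conv(q[:, 2], k[:, 1]),
        fft_conv(q[:, 2], k[:, 0]) - fft_conv(q[:, 0], k[:, 2]),
        fft_conv(q[:, 0], k[:, 1]) - fft_conv(q[:, 1], k[:, 0]),
    ], axis=-1)

def cross_conv_naive(q, k):
    """Same operation as an explicit O(N^2) double loop over the sequence."""
    N = len(q)
    return np.array([sum(np.cross(q[s], k[(t - s) % N]) for s in range(N)) for t in range(N)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=(128, 3)), rng.normal(size=(128, 3))

# The fast decomposition agrees with the naive quadratic computation ...
print(np.allclose(cross_conv_fft(q, k), cross_conv_naive(q, k)))  # True

# ... and the output is rotation-equivariant: rotating q and k rotates the result.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R *= np.sign(np.linalg.det(R))               # make it a proper rotation (det = +1)
print(np.allclose(cross_conv_fft(q @ R.T, k @ R.T), cross_conv_fft(q, k) @ R.T))  # True
```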
4. Geometric Long Convolution: Mixing Scalars and Vectors
In physical systems, geometry (vectors) often dictates properties (scalars), and vice versa. A model treating them as two separate streams would be limited. G-Hyena introduces a “Geometric Long Convolution” that allows these subspaces to interact.
The interaction is visualized below:

The model computes a new scalar output (\(\alpha_3\)) and a new vector output (\(\mathbf{r}_3\)) from the two input streams \((\alpha_1, \mathbf{r}_1)\) and \((\alpha_2, \mathbf{r}_2)\) by mixing them in every rotation-safe way possible:
- Scalar \(\times\) Scalar
- Vector \(\cdot\) Vector (Dot product \(\rightarrow\) Scalar)
- Scalar \(\times\) Vector
- Vector \(\times\) Vector (Cross product \(\rightarrow\) Vector)
Mathematically, the output scalar is computed by combining the scalar convolution and the dot-product of the vector streams:
\[
\alpha_3 \;=\; \alpha_1 * \alpha_2 \;+\; \mathbf{r}_1 \overset{\cdot}{*} \mathbf{r}_2,
\qquad\text{where}\quad
\big(\mathbf{r}_1 \overset{\cdot}{*} \mathbf{r}_2\big)_t = \sum_{s} \mathbf{r}_{1,s} \cdot \mathbf{r}_{2,\,t-s}
\]
And the output vector combines scalar-vector modulation with the cross-product convolution we derived earlier:
\[
\mathbf{r}_3 \;=\; \alpha_1 * \mathbf{r}_2 \;+\; \mathbf{r}_1 \overset{\times}{*} \mathbf{r}_2,
\qquad\text{where}\quad
\big(\alpha_1 * \mathbf{r}_2\big)_t = \sum_{s} \alpha_{1,s}\, \mathbf{r}_{2,\,t-s}
\]
This comprehensive mixing allows G-Hyena to learn complex dependencies where the shape of a molecule influences its chemical properties and vice versa.
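Putting the four interaction types together, a compact NumPy sketch of this mixing pattern (circular convolutions, and without the learned weightings a real layer would apply) might look like the following.

```python
import numpy as np

def fft_conv(u, v):
    """Circular scalar convolution via FFT."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def geometric_long_conv(a1, r1, a2, r2):
    """Mix an invariant stream (a) and an equivariant stream (r) globally.

    Scalar output : scalar*scalar convolution  +  vector.vector (dot) convolution.
    Vector output : scalar*vector convolution  +  vector x vector (cross) convolution.
    """
    # scalar * scalar -> scalar
    ss = fft_conv(a1, a2)
    # vector . vector -> scalar (a sum of three scalar convolutions)
    dd = sum(fft_conv(r1[:, h], r2[:, h]) for h in range(3))
    # scalar * vector -> vector (one scalar convolution per component)
    sv = np.stack([fft_conv(a1, r2[:, h]) for h in range(3)], axis=-1)
    # vector x vector -> vector (six scalar convolutions, as derived above)
    xx = np.stack([
        fft_conv(r1[:, 1], r2[:, 2]) - fft_conv(r1[:, 2], r2[:, 1]),
        fft_conv(r1[:, 2], r2[:, 0]) - fft_conv(r1[:, 0], r2[:, 2]),
        fft_conv(r1[:, 0], r2[:, 1]) - fft_conv(r1[:, 1], r2[:, 0]),
    ], axis=-1)
    return ss + dd, sv + xx      # (alpha_3, r_3)

rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=256), rng.normal(size=256)
r1, r2 = rng.normal(size=(256, 3)), rng.normal(size=(256, 3))
alpha3, r3 = geometric_long_conv(a1, r1, a2, r2)
print(alpha3.shape, r3.shape)    # (256,) (256, 3)
```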
Experimental Results
The researchers validated Geometric Hyena across a variety of tasks, ranging from synthetic benchmarks to real-world molecular dynamics.
1. Geometric Associative Recall
To prove that G-Hyena can actually “learn” geometric sequences, the authors designed a new task called Geometric Associative Recall.

In this task, the model sees a sequence of vector pairs (Key, Value). At the end of the sequence, it is presented with a specific Key (rotated) and must predict the corresponding Value (also rotated). This tests the model’s ability to store and retrieve geometric information over long contexts.
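To make the setup tangible, here is a hypothetical data generator that follows the task description above; the exact protocol in the paper (sequence encoding, number of pairs, how the query is appended) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_geometric_recall_example(num_pairs=16):
    """Build one toy example: a sequence of (key, value) 3-D vector pairs,
    followed by a rotated query key; the target is the matching value,
    rotated by the same rotation."""
    keys = rng.normal(size=(num_pairs, 3))
    values = rng.normal(size=(num_pairs, 3))

    # A random proper rotation applied to the query (and hence to the target).
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R *= np.sign(np.linalg.det(R))

    idx = rng.integers(num_pairs)            # which key is being queried
    query = keys[idx] @ R.T
    target = values[idx] @ R.T

    # The model sees the interleaved (key, value) sequence plus the query,
    # and must output the target vector.
    sequence = np.stack([keys, values], axis=1).reshape(-1, 3)
    return sequence, query, target

seq, query, target = make_geometric_recall_example()
print(seq.shape, query.shape, target.shape)  # (32, 3) (3,) (3,)
```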
The results were decisive:

As shown in the top chart, standard Transformers (green) and pure Hyena (red) struggle because they lack proper geometric inductive biases. The G-Hyena (teal triangle) achieves near-zero error, matching the theoretical performance of equivariant attention models but doing so with much higher efficiency.
2. Large Molecule Property Prediction (RNA)
The real test is on biological data. The authors tested G-Hyena on the Open Vaccine and Ribonanza-2k datasets. These datasets require predicting stability and degradation profiles for large RNA molecules, which can contain thousands of atoms.

In the table above, methods relying only on local context (Red) generally perform worse than those with global context (Cyan). However, notice the Equiformer entry: it goes OOM (Out Of Memory) on the larger tasks.
G-Hyena achieves the lowest RMSE (error) across almost all categories, outperforming both local methods and heavy attention-based global methods. It effectively captures the long-range interactions that determine RNA stability without crashing the GPU.
3. Protein Molecular Dynamics
Finally, the model was tested on predicting the dynamics of proteins—forecasting where atoms will move next.

Again, we see the efficiency gap. FastEGNN fails with numerical instability (NAN), and Equiformer runs out of memory. G-Hyena completes the task with the lowest Mean Squared Error (1.80 on backbone, 2.49 on all-atom), proving it is robust enough for complex physical simulations.
Conclusion and Implications
The Geometric Hyena Network represents a significant shift in how we approach geometric deep learning. For years, the community has accepted the quadratic cost of Transformers as the price to pay for global context. This paper proves that this tradeoff is no longer necessary.
By cleverly adapting the Fast Fourier Transform and the vector cross-product, G-Hyena achieves:
- Sub-quadratic scaling: Enabling context lengths of millions of tokens.
- Strict Equivariance: Respecting the physics of 3D rotation and translation.
- Rich Interaction: Deeply mixing invariant and equivariant data.
This opens the door to modeling entire genomes, massive protein complexes, and macroscopic material properties at atomic resolution—tasks that were previously computationally impossible. As we move toward larger “Foundation Models” for biology and chemistry, efficient architectures like Geometric Hyena will likely form the backbone of the next generation of scientific AI.