In the world of deep learning for science, structure is everything. Whether it’s the folding of a protein, the twisting of an RNA strand, or the dynamics of a particle system, the geometric arrangement of atoms dictates function. To model these systems effectively, neural networks must understand two things: global context (how distant parts of a molecule interact) and equivariance (the laws of physics shouldn’t change just because you rotated the molecule).
For years, the gold standard for capturing global interactions has been the Transformer and its self-attention mechanism. But there is a catch. Self-attention scales quadratically (\(O(N^2)\)). If you double the size of your molecule, the computational cost quadruples. For massive biological systems with tens of thousands of atoms, standard Transformers simply run out of memory.
Enter the Geometric Hyena (G-Hyena).
In this post, we will explore a new architecture that challenges the dominance of attention mechanisms in geometric deep learning. By adapting long-convolutional models (specifically the Hyena operator) to 3D space, researchers have created a model that is arguably faster, leaner, and more capable of handling massive scales than its predecessors.
The Scaling Problem in Geometric Learning
To understand why Geometric Hyena is necessary, we first need to look at the bottlenecks in current methods.
When modeling a 3D structure, like a protein, we deal with two types of data:
- Invariant Features (Scalar): Properties that don’t change with rotation, such as atom type (Carbon, Nitrogen) or charge.
- Equivariant Features (Vector): Properties that rotate with the molecule, such as 3D coordinates, velocities, or forces (the short sketch below makes this distinction concrete).
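To make the distinction concrete, here is a minimal NumPy sketch (an illustration of the general concept, not code from the paper): rotating the molecule leaves scalar features untouched, while vector features rotate with it.

```python
import numpy as np

# A rotation of 90 degrees about the z-axis (an element of SO(3)).
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])

charge = np.array([-0.42, 0.18, 0.24])       # invariant (scalar) feature per atom
positions = np.array([[1.0, 0.0, 0.0],       # equivariant (vector) feature per atom
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

rotated_positions = positions @ R.T          # vectors rotate with the molecule
rotated_charge = charge                      # scalars do not change at all

print(rotated_positions)
print(np.allclose(rotated_charge, charge))   # True: invariance under rotation
```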
A robust model must process both. Standard methods usually fall into one of two camps:
- Message Passing Neural Networks (MPNNs): These look at local neighborhoods (e.g., an atom and its closest neighbors). They are efficient but struggle to “see” the whole picture. Information from one end of a protein takes a long time to propagate to the other.
- Equivariant Transformers: These use attention to let every atom talk to every other atom. This captures global context perfectly but is incredibly expensive.
The graph below illustrates this bottleneck vividly.

As the sequence length (number of atoms/tokens) approaches 30,000, standard Transformer-based models (like the VNT or Equiformer shown in red and purple) see their runtime and memory usage explode. In contrast, the Geometric Hyena (teal line) remains almost flat. In fact, G-Hyena can handle context lengths up to 2.7 million tokens on a single GPU—72 times longer than what an equivariant transformer can handle.
The Geometric Hyena Architecture
How does G-Hyena achieve this efficiency? The secret lies in replacing the expensive “all-to-all” attention matrix with Long Convolutions.
Inspired by the success of the Hyena operator in Large Language Models (LLMs), this architecture processes data in a way that allows for parallel training and sub-quadratic inference. But unlike standard text processing, G-Hyena must respect the 3D symmetries of the physical world.
Here is the high-level view of the Geometric Hyena block:

The architecture generally follows the familiar workflow of a Transformer:
- Projection: Input data is projected into Queries (\(Q\)), Keys (\(K\)), and Values (\(V\)).
- Mixing: A global operation mixes information across the sequence.
- Gating: A mechanism controls the flow of information.
However, the “Mixing” step (highlighted in green in Figure 2b) is where the innovation happens. Instead of attention, G-Hyena uses Geometric Long Convolution.
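Before unpacking each piece, here is a deliberately simplified, runnable sketch of the projection → mixing → gating shape of such a block. The random scalar projections and the particular gating choice are toy stand-ins, not the authors' implementation; the sections that follow describe the real components.

```python
import numpy as np

rng = np.random.default_rng(0)

def fft_conv(u, v):
    """Circular 1-D convolution computed in O(N log N) via the FFT."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def hyena_style_block(alpha, r):
    """Schematic forward pass: projection -> global mixing -> gating.

    alpha: (N,) invariant (scalar) track; r: (N, 3) equivariant (vector) track.
    """
    # 1) Projection: toy scalar weights standing in for the learned Q/K/V projections.
    wq, wk, wv = rng.normal(size=3)
    q_s, k_s, v_s = wq * alpha, wk * alpha, wv * alpha   # scalar queries/keys/values
    q_v, k_v = wq * r, wk * r                            # vector queries/keys

    # 2) Global mixing: FFT long convolution for scalars; a cross-product
    #    long convolution (built from scalar convolutions) for vectors.
    mixed_s = fft_conv(q_s, k_s)
    mixed_v = np.stack([
        fft_conv(q_v[:, 1], k_v[:, 2]) - fft_conv(q_v[:, 2], k_v[:, 1]),
        fft_conv(q_v[:, 2], k_v[:, 0]) - fft_conv(q_v[:, 0], k_v[:, 2]),
        fft_conv(q_v[:, 0], k_v[:, 1]) - fft_conv(q_v[:, 1], k_v[:, 0]),
    ], axis=-1)

    # 3) Gating: scalar values modulate both tracks element-wise
    #    (gating vectors by scalars keeps the vector track equivariant).
    return v_s * mixed_s, v_s[:, None] * mixed_v

out_s, out_v = hyena_style_block(rng.normal(size=64), rng.normal(size=(64, 3)))
print(out_s.shape, out_v.shape)  # (64,) (64, 3)
```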
The model maintains strict equivariance throughout this process. Mathematically, the model \(\Psi\) satisfies the following property, ensuring that if you rotate the input geometric tokens \(\mathbf{x}\), the output \(\hat{\mathbf{x}}\) rotates exactly the same way:
\[
\Psi(\mathbf{R}\,\mathbf{x}) \;=\; \mathbf{R}\,\Psi(\mathbf{x}) \;=\; \mathbf{R}\,\hat{\mathbf{x}} \qquad \text{for every rotation } \mathbf{R} \in SO(3).
\]
Let’s break down the core components of this new operator.
1. The Projection Layer
Before convolution, we need to embed our scalar (invariant) and vector (equivariant) inputs. The authors use a layer inspired by Equivariant Graph Neural Networks (EGNN). This layer processes local neighborhoods to create rich embeddings for the global convolution to consume. It acts as a bridge, converting raw atom data into the \(Q, K, V\) format required for the Hyena operator.
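As a rough illustration of what such a projection might compute, here is a minimal NumPy sketch of a generic EGNN-style update (the standard EGNN recipe with tiny random MLPs; the paper's actual layer and feature sizes will differ): invariant features are updated from messages built on pairwise distances, while coordinates are updated along relative position vectors, which keeps the update equivariant.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dim_in, dim_out):
    """A tiny random two-layer MLP (stand-in for a learned network)."""
    W1 = rng.normal(size=(dim_in, 16)) * 0.1
    W2 = rng.normal(size=(16, dim_out)) * 0.1
    return lambda z: np.tanh(z @ W1) @ W2

N, F = 8, 4                      # atoms and invariant feature size
h = rng.normal(size=(N, F))      # invariant features (e.g., atom-type embeddings)
x = rng.normal(size=(N, 3))      # equivariant features (3-D coordinates)

phi_e = mlp(2 * F + 1, F)        # edge/message network
phi_x = mlp(F, 1)                # coordinate-weight network
phi_h = mlp(2 * F, F)            # node-update network

# Pairwise messages built only from invariant inputs (so messages stay invariant).
diff = x[:, None, :] - x[None, :, :]                 # (N, N, 3) relative positions
dist2 = np.sum(diff ** 2, axis=-1, keepdims=True)    # (N, N, 1) squared distances
m = phi_e(np.concatenate(
    [np.repeat(h[:, None], N, 1), np.repeat(h[None, :], N, 0), dist2], axis=-1))

mask = 1.0 - np.eye(N)[..., None]                    # drop self-interactions
# Equivariant coordinate update: a weighted sum of relative position vectors.
x_new = x + np.sum(diff * phi_x(m) * mask, axis=1)
# Invariant feature update: aggregate messages, then mix with the old features.
h_new = phi_h(np.concatenate([h, np.sum(m * mask, axis=1)], axis=-1))

print(h_new.shape, x_new.shape)  # (8, 4) (8, 3)
```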
2. Scalar Long Convolution
For the scalar features (the “invariant” stream), the model uses the standard long convolution found in modern sequence models. Instead of computing the convolution naively (which is slow), it relies on the Fast Fourier Transform (FFT).
By the Convolution Theorem, convolution in the time/space domain is equivalent to multiplication in the frequency domain. This allows the model to compute global interactions in \(O(N \log N)\) time rather than \(O(N^2)\).
\[
\mathbf{q} * \mathbf{k} \;=\; \mathbf{F}^{-1}\!\big( \mathbf{F}\mathbf{q} \,\odot\, \mathbf{F}\mathbf{k} \big)
\]
Here, \(\mathbf{q}\) and \(\mathbf{k}\) are the query and key sequences. They are transformed into the frequency domain (\(\mathbf{F}\)), multiplied element-wise, and transformed back. This is highly efficient.
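A minimal NumPy sketch (assuming circular boundary conditions) shows the equivalence in action: the FFT route gives exactly the same result as the naive \(O(N^2)\) sum, just far cheaper.

```python
import numpy as np

def fft_long_conv(q, k):
    """Global (circular) convolution in O(N log N) via the convolution theorem."""
    return np.fft.irfft(np.fft.rfft(q) * np.fft.rfft(k), n=len(q))

def naive_long_conv(q, k):
    """The same circular convolution computed directly in O(N^2)."""
    N = len(q)
    return np.array([sum(q[s] * k[(t - s) % N] for s in range(N)) for t in range(N)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=512), rng.normal(size=512)
print(np.allclose(fft_long_conv(q, k), naive_long_conv(q, k)))  # True
```

Circular padding is assumed here only for brevity; a linear (non-circular) convolution works the same way after zero-padding the inputs before the FFT.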
3. Equivariant Vector Long Convolution
This is the paper’s primary contribution. Standard FFT convolution works for scalars, but how do you convolve 3D vectors while preserving rotational equivariance?
The researchers introduce the Vector Long Convolution based on the cross product. Unlike the dot product (which results in a scalar), the cross product of two vectors results in a new vector that is equivariant to the rotation of the inputs.
The operation is defined as a sum of cross products over the sequence:
\[
\big(\mathbf{q} \overset{\times}{*} \mathbf{k}\big)_t \;=\; \sum_{s} \mathbf{q}_s \times \mathbf{k}_{t-s}
\]
A naive calculation of this sum would return us to quadratic complexity. To solve this, the authors decompose the cross-product convolution into simpler components. Each component of a cross product is a signed combination of products of the inputs' components, so the vector convolution can be broken down into a sum of six scalar convolutions:

\[
\big(\mathbf{q} \overset{\times}{*} \mathbf{k}\big)^{(l)} \;=\; \sum_{h,\,p} \varepsilon_{lhp}\; \mathbf{q}^{(h)} * \mathbf{k}^{(p)},
\]

where \(\varepsilon_{lhp}\) is the Levi-Civita symbol, which is non-zero for exactly six combinations of \(h\) and \(p\).
By decomposing the vector interaction into scalar components (indexed by \(h\) and \(p\)), the model can essentially run the efficient FFT-based convolution on the components of the vectors and then recombine them. This preserves the \(O(N \log N)\) speed while handling the 3D vector geometry correctly.
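The sketch below (NumPy again, with circular convolutions and illustrative notation rather than the paper's) checks both claims: the six scalar FFT convolutions reproduce the naive sum of cross products, and the result is rotation-equivariant.

```python
import numpy as np

def fft_conv(u, v):
    """Circular scalar convolution in O(N log N)."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def cross_conv_fft(q, k):
    """Vector long convolution y_t = sum_s q_s x k_{t-s}, via six scalar convolutions."""
    # (a x b)_1 = a_2 b_3 - a_3 b_2, and cyclic permutations thereof.
    return np.stack([
        fft_conv(q[:, 1], k[:, 2]) - fft_conv(q[:, 2], k[:, 1]),
        fft_conv(q[:, 2], k[:, 0]) - fft_conv(q[:, 0], k[:, 2]),
        fft_conv(q[:, 0], k[:, 1]) - fft_conv(q[:, 1], k[:, 0]),
    ], axis=-1)

def cross_conv_naive(q, k):
    """Same operation as an explicit O(N^2) double loop over the sequence."""
    N = len(q)
    return np.array([sum(np.cross(q[s], k[(t - s) % N]) for s in range(N)) for t in range(N)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=(128, 3)), rng.normal(size=(128, 3))

# The fast decomposition agrees with the naive quadratic computation ...
print(np.allclose(cross_conv_fft(q, k), cross_conv_naive(q, k)))  # True

# ... and the output is rotation-equivariant: rotating q and k rotates the result.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R *= np.sign(np.linalg.det(R))               # make it a proper rotation (det = +1)
print(np.allclose(cross_conv_fft(q @ R.T, k @ R.T), cross_conv_fft(q, k) @ R.T))  # True
```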
4. Geometric Long Convolution: Mixing Scalars and Vectors
In physical systems, geometry (vectors) often dictates properties (scalars), and vice versa. A model treating them as two separate streams would be limited. G-Hyena introduces a “Geometric Long Convolution” that allows these subspaces to interact.
The interaction is visualized below:

The model computes a new scalar output (\(\alpha_3\)) and a new vector output (\(\mathbf{r}_3\)) from the two input streams \((\alpha_1, \mathbf{r}_1)\) and \((\alpha_2, \mathbf{r}_2)\) by mixing them in every rotation-safe way possible:
- Scalar \(\times\) Scalar
- Vector \(\cdot\) Vector (Dot product \(\rightarrow\) Scalar)
- Scalar \(\times\) Vector
- Vector \(\times\) Vector (Cross product \(\rightarrow\) Vector)
Mathematically, the output scalar is computed by combining the scalar convolution and the dot-product of the vector streams:
\[
\alpha_3 \;=\; \alpha_1 * \alpha_2 \;+\; \mathbf{r}_1 \overset{\cdot}{*} \mathbf{r}_2,
\qquad\text{where}\quad
\big(\mathbf{r}_1 \overset{\cdot}{*} \mathbf{r}_2\big)_t = \sum_{s} \mathbf{r}_{1,s} \cdot \mathbf{r}_{2,\,t-s}
\]
And the output vector combines scalar-vector modulation with the cross-product convolution we derived earlier:
\[
\mathbf{r}_3 \;=\; \alpha_1 * \mathbf{r}_2 \;+\; \mathbf{r}_1 \overset{\times}{*} \mathbf{r}_2,
\qquad\text{where}\quad
\big(\alpha_1 * \mathbf{r}_2\big)_t = \sum_{s} \alpha_{1,s}\, \mathbf{r}_{2,\,t-s}
\]
This comprehensive mixing allows G-Hyena to learn complex dependencies where the shape of a molecule influences its chemical properties and vice versa.
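Putting the four interaction types together, a compact NumPy sketch of this mixing pattern (circular convolutions, and without the learned weightings a real layer would apply) might look like the following.

```python
import numpy as np

def fft_conv(u, v):
    """Circular scalar convolution via FFT."""
    return np.fft.irfft(np.fft.rfft(u) * np.fft.rfft(v), n=len(u))

def geometric_long_conv(a1, r1, a2, r2):
    """Mix an invariant stream (a) and an equivariant stream (r) globally.

    Scalar output : scalar*scalar convolution  +  vector.vector (dot) convolution.
    Vector output : scalar*vector convolution  +  vector x vector (cross) convolution.
    """
    # scalar * scalar -> scalar
    ss = fft_conv(a1, a2)
    # vector . vector -> scalar (a sum of three scalar convolutions)
    dd = sum(fft_conv(r1[:, h], r2[:, h]) for h in range(3))
    # scalar * vector -> vector (one scalar convolution per component)
    sv = np.stack([fft_conv(a1, r2[:, h]) for h in range(3)], axis=-1)
    # vector x vector -> vector (six scalar convolutions, as derived above)
    xx = np.stack([
        fft_conv(r1[:, 1], r2[:, 2]) - fft_conv(r1[:, 2], r2[:, 1]),
        fft_conv(r1[:, 2], r2[:, 0]) - fft_conv(r1[:, 0], r2[:, 2]),
        fft_conv(r1[:, 0], r2[:, 1]) - fft_conv(r1[:, 1], r2[:, 0]),
    ], axis=-1)
    return ss + dd, sv + xx      # (alpha_3, r_3)

rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=256), rng.normal(size=256)
r1, r2 = rng.normal(size=(256, 3)), rng.normal(size=(256, 3))
alpha3, r3 = geometric_long_conv(a1, r1, a2, r2)
print(alpha3.shape, r3.shape)    # (256,) (256, 3)
```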
Experimental Results
The researchers validated Geometric Hyena across a variety of tasks, ranging from synthetic benchmarks to real-world molecular dynamics.
1. Geometric Associative Recall
To prove that G-Hyena can actually “learn” geometric sequences, the authors designed a new task called Geometric Associative Recall.

In this task, the model sees a sequence of vector pairs (Key, Value). At the end of the sequence, it is presented with a specific Key (rotated) and must predict the corresponding Value (also rotated). This tests the model’s ability to store and retrieve geometric information over long contexts.
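To make the setup tangible, here is a hypothetical data generator that follows the task description above; the exact protocol in the paper (sequence encoding, number of pairs, how the query is appended) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_geometric_recall_example(num_pairs=16):
    """Build one toy example: a sequence of (key, value) 3-D vector pairs,
    followed by a rotated query key; the target is the matching value,
    rotated by the same rotation."""
    keys = rng.normal(size=(num_pairs, 3))
    values = rng.normal(size=(num_pairs, 3))

    # A random proper rotation applied to the query (and hence to the target).
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    R *= np.sign(np.linalg.det(R))

    idx = rng.integers(num_pairs)            # which key is being queried
    query = keys[idx] @ R.T
    target = values[idx] @ R.T

    # The model sees the interleaved (key, value) sequence plus the query,
    # and must output the target vector.
    sequence = np.stack([keys, values], axis=1).reshape(-1, 3)
    return sequence, query, target

seq, query, target = make_geometric_recall_example()
print(seq.shape, query.shape, target.shape)  # (32, 3) (3,) (3,)
```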
The results were decisive:

As shown in the top chart, standard Transformers (green) and pure Hyena (red) struggle because they lack proper geometric inductive biases. The G-Hyena (teal triangle) achieves near-zero error, matching the theoretical performance of equivariant attention models but doing so with much higher efficiency.
2. Large Molecule Property Prediction (RNA)
The real test is on biological data. The authors tested G-Hyena on the Open Vaccine and Ribonanza-2k datasets. These datasets require predicting stability and degradation profiles for large RNA molecules, which can contain thousands of atoms.

In the table above, methods relying only on local context (Red) generally perform worse than those with global context (Cyan). However, notice the Equiformer entry: it goes OOM (Out Of Memory) on the larger tasks.
G-Hyena achieves the lowest RMSE (error) across almost all categories, outperforming both local methods and heavy attention-based global methods. It effectively captures the long-range interactions that determine RNA stability without crashing the GPU.
3. Protein Molecular Dynamics
Finally, the model was tested on predicting the dynamics of proteins—forecasting where atoms will move next.

Again, we see the efficiency gap. FastEGNN fails with numerical instability (NAN), and Equiformer runs out of memory. G-Hyena completes the task with the lowest Mean Squared Error (1.80 on backbone, 2.49 on all-atom), proving it is robust enough for complex physical simulations.
Conclusion and Implications
The Geometric Hyena Network represents a significant shift in how we approach geometric deep learning. For years, the community has accepted the quadratic cost of Transformers as the price to pay for global context. This paper proves that this tradeoff is no longer necessary.
By cleverly adapting the Fast Fourier Transform and the vector cross-product, G-Hyena achieves:
- Sub-quadratic scaling: Enabling context lengths of millions of tokens.
- Strict Equivariance: Respecting the physics of 3D rotation and translation.
- Rich Interaction: Deeply mixing invariant and equivariant data.
This opens the door to modeling entire genomes, massive protein complexes, and macroscopic material properties at atomic resolution—tasks that were previously computationally impossible. As we move toward larger “Foundation Models” for biology and chemistry, efficient architectures like Geometric Hyena will likely form the backbone of the next generation of scientific AI.