Introduction
Graph Neural Networks (GNNs) have revolutionized how we process structured data, from predicting molecular properties to analyzing social networks. However, standard GNNs—specifically Message Passing Neural Networks (MPNNs)—have a well-known Achilles’ heel: oversquashing.
In a standard MPNN, information is aggregated from neighbors. To learn about a node 10 steps away, you need 10 layers of message passing. As the receptive field grows, the amount of information that needs to be compressed into a single fixed-size vector grows exponentially. The signal gets “squashed,” and the model fails to capture long-range dependencies.
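For a rough sense of scale (a back-of-the-envelope illustration, not a figure from the paper): if each node contributes roughly 3 new neighbors per hop, then after 10 hops a node's receptive field contains on the order of

\[
3^{10} = 59{,}049
\]

nodes, all of whose information must be funneled through a single fixed-size vector.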
To fix this, researchers have looked toward State Space Models (SSMs) like Mamba and S4, which are currently taking the sequence modeling world by storm due to their linear complexity and ability to handle massive context windows. But there is a catch: SSMs are designed for sequences (ordered data). Graphs are not sequences; they are permutation-invariant structures. Forcing a graph into a sequence (e.g., by sorting nodes) often breaks the fundamental symmetries of the graph.
In this post, we dive deep into a new paper that proposes an elegant solution: GRAMA (Graph Adaptive Autoregressive Moving Average).
Instead of turning the graph into a sequence of nodes, GRAMA turns the features into a sequence of graphs. By marrying classical Autoregressive Moving Average (ARMA) system theory with modern Deep Learning, the authors create a model that is theoretically equivalent to an SSM, fully permutation-equivariant, and highly effective at long-range tasks.
Background: The Convergence of GNNs and Dynamical Systems
To understand GRAMA, we need to quickly revisit two concepts: the limitations of current GNNs and the basics of ARMA models.
The Problem with Propagation
Standard GNNs, like GCN or GIN, rely on local aggregation. While efficient, they struggle to propagate information across the graph without losing signal fidelity.
- Graph Transformers (GTs) solve this by using attention mechanisms that connect every node to every other node. While powerful, they typically carry a quadratic computational cost (\(O(N^2)\)), making them hard to scale to large graphs.
- Graph SSMs attempt to use efficient sequential models (like Mamba) on graphs. However, most existing approaches use heuristics to order the nodes (e.g., by degree). This means if you reorder the nodes in your input file, the model might give a different prediction—a violation of permutation equivariance.
What is ARMA?
The Autoregressive Moving Average (ARMA) model is a staple in time-series analysis (e.g., stock price prediction). It models a value at time \(t\) based on two things:
- AutoRegressive (AR): The previous values of the system (history).
- Moving Average (MA): The previous “errors” or residuals (shocks).
Mathematically, a classical ARMA(\(p, q\)) model looks like this:

\[
x_t = \sum_{i=1}^{p} \phi_i\, x_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t
\]
Here, \(\phi\) represents the AR coefficients (how much history matters) and \(\theta\) represents the MA coefficients (how much the past residuals matter).
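For intuition, here is a minimal NumPy sketch of an ARMA(2, 1) recursion with arbitrarily chosen coefficients (purely illustrative, not tied to the paper's implementation):

```python
import numpy as np

def arma_step(history, residuals, phi, theta, eps):
    """One ARMA(p, q) update:
    x_t = sum_i phi[i] * x_{t-i} + sum_j theta[j] * eps_{t-j} + eps_t"""
    ar_part = np.dot(phi, history[::-1])      # phi_1 * x_{t-1} + phi_2 * x_{t-2} + ...
    ma_part = np.dot(theta, residuals[::-1])  # theta_1 * eps_{t-1} + ...
    return ar_part + ma_part + eps

rng = np.random.default_rng(0)
phi, theta = np.array([0.6, 0.2]), np.array([0.4])   # hypothetical AR / MA coefficients
x, eps_hist = [0.0, 0.0], [0.0]
for _ in range(20):
    eps = rng.normal(scale=0.1)
    x.append(arma_step(np.array(x[-2:]), np.array(eps_hist[-1:]), phi, theta, eps))
    eps_hist.append(eps)
print(np.round(x[-5:], 3))
```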
The key insight of the GRAMA paper is that we can apply this dynamical process to the depth of a neural network rather than time, allowing node features to evolve in a stable, controllable way.
The GRAMA Method
The core innovation of GRAMA is how it frames the graph learning problem. Rather than viewing the graph as a single static snapshot, GRAMA embeds the input into a sequence of graphs and processes them using a neural ARMA block.
1. From Static Graph to Graph Sequence
How do we use a sequence model on a static graph without imposing an arbitrary order on the nodes?
The authors propose a “feature expansion.” Given an input feature matrix, they project it into a sequence of length \(L\). Imagine that instead of one feature vector per node, each node now possesses a short history or trajectory of features.

Simultaneously, they initialize a sequence of residuals (\(\Delta\)), which tracks the differences or “shocks” in the system.

This transformation is crucial. It creates a “time” dimension (the sequence length \(L\)) that the ARMA model can operate on, without touching the spatial dimension (the graph topology). This preserves permutation equivariance.
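A minimal PyTorch sketch of this idea (the paper's exact initialization may differ; here the expansion is a single learned per-node linear projection and the residual sequence starts at zero, both assumptions made for illustration):

```python
import torch
import torch.nn as nn

class GraphSequenceInit(nn.Module):
    """Expand static node features X (N x d) into a length-L sequence of
    feature matrices. The projection acts only on the channel dimension,
    so node order -- and hence permutation equivariance -- is untouched."""
    def __init__(self, d: int, L: int):
        super().__init__()
        self.L = L
        self.expand = nn.Linear(d, L * d)   # shared across all nodes

    def forward(self, x: torch.Tensor):
        N, d = x.shape
        seq = self.expand(x).view(N, self.L, d).permute(1, 0, 2)  # (L, N, d)
        residuals = torch.zeros_like(seq)   # initial "shock" sequence (Delta)
        return seq, residuals

x = torch.randn(100, 16)                    # 100 nodes, 16 input features
seq, delta = GraphSequenceInit(d=16, L=5)(x)
print(seq.shape, delta.shape)               # (5, 100, 16) each
```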
2. The Architecture: A Deep Dive
Once the graph is transformed into a sequence, it passes through the GRAMA Block. This block updates the node features using a learnable ARMA process.
Let’s visualize the full framework:

As shown in Figure 1 above, the model consists of initialization, followed by stacked GRAMA blocks. Inside each block, the update rule combines three components:
A. The AutoRegressive (AR) Update
The new state of a node depends on its own history within the sequence. This allows the model to maintain long-term memory of the feature’s evolution.

B. The Moving Average (MA) Update
The new state also depends on the history of residuals. In control theory, this allows the system to correct itself based on past “noise” or inputs.

C. The Graph Backbone (The “Spatial” Component)
So far, the AR and MA parts only look at a node’s own history sequence. We still need to exchange information with neighbors!
This is where the backbone GNN comes in. The “current” residual \(\delta\) is generated by running a standard GNN (like GCN, GAT, or GPS) on the previous state. This couples the spatial mixing (GNN) with the temporal evolution (ARMA).
The final update equation for a recurrence step combines all three:

Here, \(\delta^{(\ell+L)}\) represents the output of the GNN backbone. By structuring the layer this way, GRAMA allows the GNN to focus on local aggregation while the ARMA dynamics handle the retention of information over depth (time).
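Putting the three pieces together, a single recurrence step might look like the sketch below (a simplification: the backbone is a dense GCN-style layer, and the coefficients are passed in as plain vectors rather than predicted adaptively as in the paper):

```python
import torch
import torch.nn as nn

class DenseGCNLayer(nn.Module):
    """Stand-in backbone: a symmetric-normalized GCN-style layer on a dense
    adjacency matrix. GRAMA allows any GNN here (GCN, GAT, GPS, ...)."""
    def __init__(self, d: int):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x, adj):
        deg = adj.sum(-1).clamp(min=1.0)
        a_hat = deg.rsqrt()[:, None] * adj * deg.rsqrt()[None, :]
        return a_hat @ self.lin(x)

def grama_step(feat_seq, res_seq, phi, theta, backbone, adj):
    """One ARMA-style recurrence over the sequence/depth dimension.
    feat_seq, res_seq: lists of (N, d) tensors, oldest -> newest."""
    delta = backbone(feat_seq[-1], adj)                        # spatial mixing -> current residual
    ar = sum(p * f for p, f in zip(phi, reversed(feat_seq)))   # AR term: weighted feature history
    ma = sum(t * r for t, r in zip(theta, reversed(res_seq)))  # MA term: weighted residual history
    feat_seq.append(ar + ma + delta)
    res_seq.append(delta)
    return feat_seq, res_seq
```

Each call to `grama_step` advances the sequence by one step; the graph structure enters only through the backbone call, so the whole update remains permutation-equivariant.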
3. Deep GRAMA and Nonlinearity
A single GRAMA block performs \(R\) recurrence steps. To build a deep network, multiple blocks are stacked. Crucially, the ARMA recurrence itself is a linear system. To give the neural network expressive power, non-linear activation functions (\(\sigma\)) are applied between the blocks.

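In code, the stacking could look roughly like this (a sketch, assuming each block is a callable that internally runs \(R\) `grama_step` updates like the one above):

```python
import torch

def deep_grama(feat_seq, res_seq, blocks, adj, act=torch.tanh):
    """Stack GRAMA blocks. Inside a block the recurrence stays linear;
    the nonlinearity sigma is applied only between blocks."""
    for block in blocks:                        # block: callable running R recurrence steps
        feat_seq, res_seq = block(feat_seq, res_seq, adj)
        feat_seq = [act(x) for x in feat_seq]   # sigma between blocks
    return feat_seq, res_seq
```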
4. Adaptive & Selective Learning
A rigid ARMA model has fixed coefficients \(\phi\) and \(\theta\). However, different graphs (or even different parts of the feature sequence) might require different dynamics. Some might need long memory (high AR), while others rely on immediate inputs.
GRAMA introduces Selectivity. Instead of learning static coefficients, the model uses an Attention Mechanism to dynamically predict the coefficients \(\phi\) and \(\theta\) based on the input features.
The attention mechanism pools the sequence and computes normalized scores:

This results in coefficients that change adaptively. The authors visualize these learned coefficients in the paper:

In Figure 3(b) above, notice that after training the coefficients are no longer uniform. The model learns to attend to specific past steps, effectively “selecting” which part of the history is important for the current update. This adaptability is a major factor in GRAMA’s performance.
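As a rough sketch of the selectivity idea (the paper's attention mechanism is more involved; a pooled linear scorer plus softmax stands in for it here, purely to show the shape of the computation):

```python
import torch
import torch.nn as nn

class CoefficientSelector(nn.Module):
    """Predict normalized AR/MA coefficients from the feature sequence,
    instead of learning a single static phi and theta."""
    def __init__(self, d: int, L: int):
        super().__init__()
        self.score = nn.Linear(d, 2 * L)            # L AR scores + L MA scores

    def forward(self, feat_seq):                    # feat_seq: (L, N, d)
        pooled = feat_seq.mean(dim=(0, 1))          # pool over sequence and nodes
        scores = self.score(pooled)
        L = scores.numel() // 2
        phi = torch.softmax(scores[:L], dim=0)      # adaptive AR coefficients
        theta = torch.softmax(scores[L:], dim=0)    # adaptive MA coefficients
        return phi, theta

phi, theta = CoefficientSelector(d=16, L=5)(torch.randn(5, 100, 16))
print(phi, theta)   # which history steps get weight is now input-dependent
```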
Theoretical Foundations: The SSM Connection
One of the paper’s strongest contributions is theoretical. The authors prove that their specific Neural ARMA formulation is mathematically equivalent to a linear State Space Model (SSM).
An SSM is generally defined by the transition of a hidden state \(x_t\) and an output \(f_t\):

\[
x_t = \mathbf{A}\, x_{t-1} + \mathbf{B}\, u_t, \qquad f_t = \mathbf{C}\, x_t + \mathbf{D}\, u_t
\]
The authors derive an explicit mapping showing that GRAMA’s AR/MA coefficients form a specific structure of the State Matrix \(\mathbf{A}\) in an SSM.

Why does this matter?
- Stability: We can analyze the stability of the neural network by looking at the eigenvalues of this matrix. If they lie inside the unit circle, the gradient propagation is stable.
- Long-Range Interaction: The theory dictates that if the eigenvalues are close to the unit circle, the state matrix powers (\(A^k\)) decay slowly, allowing information to propagate over very long distances without vanishing.
This theoretical grounding provides a formal explanation for why GRAMA mitigates oversquashing.
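To make the eigenvalue criterion concrete, here is a small NumPy check for the AR-only case, written in standard companion (state-space) form. The paper derives the exact state matrix for the full ARMA recurrence, so treat this as an illustration of the criterion rather than GRAMA's actual matrix:

```python
import numpy as np

# Stability of a linear AR recurrence: write it in companion (state-space)
# form and inspect the eigenvalues of the state matrix A.
phi = np.array([0.6, 0.25, 0.1])  # hypothetical AR coefficients

p = len(phi)
A = np.zeros((p, p))
A[0, :] = phi                     # first row holds the AR coefficients
A[1:, :-1] = np.eye(p - 1)        # sub-diagonal shifts the state forward

eigvals = np.linalg.eigvals(A)
rho = np.abs(eigvals).max()
print("spectral radius:", round(float(rho), 3))
print("stable (inside unit circle):", rho < 1.0)
# A spectral radius just below 1 means the powers A^k decay slowly, so the
# signal persists over many steps -- the long-range regime discussed above.
```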
Experiments & Results
The researchers evaluated GRAMA on 26 different datasets, ranging from synthetic stress tests to real-world molecular benchmarks.
1. The “Graph Transfer” Stress Test
To test oversquashing explicitly, they used a “Graph Transfer” task. The goal is to pass a signal from a source node to a target node over a specific distance (hops).
As the distance increases, standard GNNs usually fail because the signal gets diluted.

In Figure 2, look at the purple line (GRAMA).
- GCN (Blue) and GAT (Orange) error rates skyrocket as the distance grows (x-axis).
- GRAMA maintains a near-zero error rate even at 50 hops, significantly outperforming standard MPNNs and even competitive Graph Transformers like GPS in stability.
2. Long Range Graph Benchmark (LRGB)
The LRGB is the standard for testing long-range capabilities in real-world scenarios (Peptides).

In Table 2, we see GRAMA’s performance on Peptide-func and Peptide-struct.
- GRAMA paired with a simple GCN backbone (GRAMA_GCN) achieves an AP of 70.93, beating the standard GCN (59.30) by over 11 points.
- It performs competitively with or better than specialized Graph Mamba models and Graph Transformers, often with fewer parameters or lower complexity.
3. Efficiency and Speed
One of the main promises of SSM-based approaches is efficiency. Graph Transformers with full attention are \(O(N^2)\). GRAMA, because it relies on the backbone GNN (typically linear \(O(E)\)) and channel-wise recurrences, scales much better.

Table 9 compares the runtime on the Roman-Empire dataset.
- GPS (Graph Transformer): Runs out of memory (OOM) at depth 32 and takes ~1139ms for training at depth 4.
- GRAMA: Does not OOM even at depth 32. At equivalent depths, it is significantly faster than the Transformer while achieving higher accuracy (86-88% vs 81-82%).
4. Standard Benchmarks
The authors also tested on standard heterophilic datasets (Roman-empire, Amazon-ratings, etc.).

The summary in Table 17 is compelling. Across almost every task—whether node classification, regression, or graph classification—wrapping a backbone (like GCN) in the GRAMA framework consistently improves performance, often by a large margin.
Conclusion
GRAMA represents a sophisticated step forward in the convergence of Dynamical Systems and Deep Graph Learning. By acknowledging that graphs require permutation equivariance, but also recognizing the power of sequential state-space modeling, the authors created a hybrid that offers the best of both worlds.
Key Takeaways:
- Sequence of Graphs: Transforming static inputs into temporal sequences enables the use of powerful ARMA/SSM dynamics without breaking graph symmetries.
- Drop-in Enhancement: GRAMA acts as a wrapper. You can put any GNN backbone (GCN, GAT, GatedGCN) inside it to supercharge its long-range capabilities.
- Adaptive Dynamics: The attention-based coefficient learning allows the model to decide per graph whether it needs long-term memory or local focus.
- Theoretical Stability: The link to SSMs guarantees that the model effectively combats the vanishing gradient and oversquashing problems.
For students and researchers dealing with large graphs where distant nodes affect each other (like biological networks or supply chains), GRAMA offers a theoretically grounded and practically efficient tool to capture those elusive long-range signals.