Introduction
Imagine you have trained a state-of-the-art computer vision model to detect pedestrians for a self-driving car. It works perfectly in sunny California where it was trained. But when you deploy it to a rainy street in London, accuracy plummets. The visual conditions—the “distribution”—have shifted.
To fix this, researchers use Test-Time Adaptation (TTA). Instead of freezing the model after training, TTA allows the model to continue learning from the new, incoming data (like the rainy London streets) in real-time. It effectively fine-tunes the model “online” to adapt to the current environment.
However, there is a massive catch: Memory.
Updating a neural network requires backpropagation. Backpropagation requires storing the intermediate outputs (activations) of the network during the forward pass so they can be used to calculate gradients in the backward pass. On a massive server GPU, this is fine. On an edge device like a mobile phone, a drone, or an IoT sensor, this memory requirement is often a dealbreaker.
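To get a feel for the scale of the problem, here is a rough back-of-the-envelope measurement. This is a sketch that assumes PyTorch and torchvision are installed; it sums the outputs of every leaf module during one forward pass, which is only a proxy for what autograd actually caches, but it makes the point.

```python
import torch
from torchvision.models import resnet18

model = resnet18()          # any backbone works here; the weights themselves don't matter
activation_bytes = []

def record(module, inputs, output):
    # Rough proxy: count the bytes of every intermediate output produced during the forward pass.
    if isinstance(output, torch.Tensor):
        activation_bytes.append(output.numel() * output.element_size())

# Hook only leaf modules (convs, norms, activations, linear layers).
hooks = [m.register_forward_hook(record)
         for m in model.modules() if len(list(m.children())) == 0]

x = torch.randn(16, 3, 224, 224)   # one adaptation batch
model(x)
for h in hooks:
    h.remove()

weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Intermediate activations (rough): ~{sum(activation_bytes) / 1e6:.0f} MB")
print(f"Model weights:                    ~{weight_bytes / 1e6:.0f} MB")
```

Even for a small backbone, the activation cache dwarfs the weights themselves, and it grows linearly with batch size and image resolution.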
Today, we are diving into a paper that proposes a clever solution: SURGEON. This method introduces “Dynamic Activation Sparsity.” It intelligently decides which parts of the network’s memory are worth saving and which can be discarded (pruned) to save space, all without needing to change how the model was originally trained.
The Bottleneck: Why TTA is Hard on the Edge
Before understanding SURGEON, we need to clarify the problem it solves.
TTA vs. FTTA
In standard Test-Time Adaptation (TTA), you might have access to the original training data or be allowed to modify the training procedure to prepare the model for future adaptation.
However, in many real-world scenarios, we face Fully Test-Time Adaptation (FTTA). In FTTA:
- You have a pre-trained model.
- You do not have the original training data (often due to privacy or storage).
- You cannot modify how the model was trained initially.
- You must adapt to new data instantly.
The Memory Wall
The biggest obstacle in FTTA is the memory cost of backpropagation. When a neural network processes an image (Forward Pass), it generates “activations” at every layer. To update the weights (Backward Pass), the chain rule of calculus requires those saved activations.
\[
\Delta W_i = \frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial A_{i+1}} \frac{\partial A_{i+1}}{\partial W_i} = A_i^{T} \frac{\partial L}{\partial A_{i+1}},
\]

As seen in the equation above (taken from the paper’s preliminaries), calculating the weight update \(\Delta W_i\) requires \(A_i\) (the activations). Storing these \(A_i\) tensors for every layer consumes a massive amount of memory, often exceeding the capacity of edge hardware.
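A quick way to convince yourself of this dependency is to check it numerically for a single linear layer. The sketch below uses plain PyTorch; shapes follow PyTorch’s \(A_{i+1} = A_i W^T\) convention rather than the paper’s notation, but the point is the same: the weight gradient cannot be formed without the cached \(A_i\).

```python
import torch

A_i = torch.randn(4, 8, requires_grad=True)    # cached activations from the forward pass
W = torch.randn(16, 8, requires_grad=True)     # layer weights, PyTorch's [out, in] layout

A_next = A_i @ W.t()                           # A_{i+1} = A_i W^T
loss = A_next.pow(2).mean()
loss.backward()

# Recompute the weight gradient by hand:
# dL/dW = (dL/dA_{i+1})^T @ A_i  -- without A_i this product cannot be formed.
grad_A_next = 2 * A_next.detach() / A_next.numel()
manual_grad_W = grad_A_next.t() @ A_i.detach()
print(torch.allclose(W.grad, manual_grad_W, atol=1e-6))   # True
```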
Existing Solutions and Their Flaws
Researchers have tried to solve this before, but previous attempts had limitations:

As shown in Figure 1:
- (a) EcoTTA: This method adds lightweight “meta networks” to the model. It freezes the main backbone and only updates these small networks. While memory-efficient, it fails the FTTA requirement because it demands a specific modification to the original training procedure to “warm up” these meta modules.
- (b) MECTA: This method tries to save memory by only updating certain channels or layers. However, it decides what to update based on Batch Normalization (BN) statistics. This makes it incompatible with architectures that don’t use BN layers heavily, such as modern Vision Transformers (ViTs).
- (c) SURGEON (Ours): This is the method we are analyzing. It works on any architecture (CNN or Transformer) and requires zero changes to the original training. It solves the problem by pruning activations—literally deleting parts of the memory cache that aren’t strictly necessary.
The SURGEON Method: Dynamic Activation Sparsity
The core insight of SURGEON is that not all activations are created equal. Some layers are critical for learning (accuracy), while others are just memory hogs. Furthermore, this changes depending on the data.
Instead of keeping 100% of the activations for the backward pass, SURGEON applies Dynamic Activation Sparsity. It saves a “sparse” version of the activations—perhaps keeping only 50% or 30% of the values—and zeroes out the rest.
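In code, the basic operation is just magnitude-based thresholding of a tensor. Here is a minimal sketch; the function name and the 30% keep ratio are illustrative choices, not values from the paper.

```python
import torch

def prune_activation(a: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Zero out all but the largest-magnitude `keep_ratio` fraction of values."""
    keep = max(1, int(round(a.numel() * keep_ratio)))
    # Threshold = the keep-th largest absolute value in the tensor.
    thresh = a.abs().flatten().kthvalue(a.numel() - keep + 1).values
    return a * (a.abs() >= thresh)

a = torch.randn(64, 128, 28, 28)               # a hypothetical activation map
a_sparse = prune_activation(a, keep_ratio=0.3)
kept = a_sparse.count_nonzero().item() / a.numel()
print(f"fraction of values kept: {kept:.1%}")
```

Only the surviving non-zero values (plus their indices) need to be cached, which is where the memory savings come from.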
The Workflow

Figure 2 illustrates the process elegantly:
- Forward Pass: The data flows through the network. At each layer \(i\), the model computes the activation \(A_i\).
- Pruning: Before storing \(A_i\) in the cache, SURGEON calculates a specific pruning ratio \(p_i\) for that layer. It removes the least important elements, resulting in a sparse tensor \(\dot{A}_i\).
- Caching: Only the non-zero values (and their indices) of \(\dot{A}_i\) are stored in memory.
- Backward Pass: When calculating gradients, the system reconstructs the sparse tensor and uses it to update the weights (a minimal sketch of this prune-and-cache mechanism follows below).
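To see where this hooks into autograd, here is a conceptual PyTorch sketch of a linear operation that prunes its cached input before saving it. This is not the paper’s implementation: SURGEON stores only the non-zero values and their indices and chooses the ratio \(p_i\) per layer, but the mechanics are the same in spirit.

```python
import torch

class SparseCachedLinear(torch.autograd.Function):
    """Linear op that prunes the activation it caches for the backward pass (conceptual sketch)."""

    @staticmethod
    def forward(ctx, x, weight, prune_ratio):
        out = x @ weight.t()                          # the forward computation itself stays dense
        keep = max(1, int(round(x.numel() * (1.0 - prune_ratio))))
        thresh = x.abs().flatten().kthvalue(x.numel() - keep + 1).values
        x_pruned = x * (x.abs() >= thresh)            # zero out the small-magnitude entries
        ctx.save_for_backward(x_pruned, weight)       # only the pruned activation is cached
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_pruned, weight = ctx.saved_tensors
        grad_x = grad_out @ weight                    # exact: does not touch the cached activation
        grad_w = grad_out.t() @ x_pruned              # approximate: uses the pruned A_i
        return grad_x, grad_w, None                   # no gradient for prune_ratio

x = torch.randn(8, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
out = SparseCachedLinear.apply(x, w, 0.7)             # drop ~70% of the cached activation
out.sum().backward()
```

Note that only the weight gradient is approximated: the gradient flowing back to earlier layers depends on the weights, not on the cached activation, so pruning the cache leaves it untouched.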
The magic lies in how SURGEON decides how much to prune from each layer. It doesn’t use a fixed number (like “prune 50% everywhere”). Instead, it optimizes for two competing goals: minimizing memory usage while maximizing accuracy.
\[
\min \quad \alpha \cdot \mathbf{Memory} - \beta \cdot \mathbf{Accuracy},
\]

To achieve this balance, the authors introduce two metrics: Gradient Importance and Layer Activation Memory Importance.
Metric 1: Gradient Importance (\(G\))
First, the system needs to know which layers are actually learning. If a layer has very small gradients, it means the weights aren’t changing much, so the activations for that layer aren’t contributing significantly to the adaptation.
The authors define Gradient Importance (\(G_i\)) for layer \(i\) based on the magnitude of its weight gradients:
\[
\Delta w_i = \frac{\partial L}{\partial w_i}, \qquad G_i = \sqrt{\frac{\sum_{j=1}^{N_i} \left(\Delta w_j\right)^2}{N_i}},
\]

Here, \(G_i\) is essentially the average strength of the gradients in that layer (the root-mean-square over its \(N_i\) weight gradients). A high \(G_i\) means the layer is actively adapting to the new data, so we should keep more of its activations (prune less) to ensure accuracy.
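As a sketch, \(G_i\) is a few lines of code once a TTA loss (for example the entropy loss used by TENT-style methods) has been backpropagated. The helper below is hypothetical, not from the paper’s codebase.

```python
import torch

def gradient_importance(model: torch.nn.Module) -> dict[str, float]:
    """Per-layer G_i: root-mean-square of the weight gradients. Call after loss.backward()."""
    G = {}
    for name, p in model.named_parameters():
        if p.grad is not None and name.endswith("weight"):
            G[name] = p.grad.pow(2).mean().sqrt().item()   # sqrt( sum(grad^2) / N_i )
    return G
```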
Metric 2: Layer Activation Memory (\(M\))
Second, the system looks at the cost. Some layers (usually the early, high-resolution layers in CNNs) produce massive activation maps that eat up RAM. Other layers (deeper in the network) are smaller.
The Memory Importance (\(M_i\)) metric quantifies how “efficient” a layer is regarding memory:
\[
m_i = \mathbf{size}(A_i), \qquad M_i = -\log\left(\frac{m_i}{\sum_{j=1}^{l} m_j}\right),
\]

The formula compares the activation size \(m_i\) of layer \(i\) against the total activation size across all \(l\) layers. The logarithmic scaling helps normalize the values. Essentially, layers that occupy a huge share of memory receive a low \(M_i\), a lower preference for keeping them: we want to prune these heavily to save space.
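This metric is trivial to compute once you know each layer’s activation footprint, for example from forward hooks like the ones shown earlier. The byte counts below are made-up illustrative values.

```python
import math

def memory_importance(act_sizes: list[int]) -> list[float]:
    """M_i = -log(m_i / sum_j m_j), where m_i is the activation size of layer i."""
    total = sum(act_sizes)
    return [-math.log(m / total) for m in act_sizes]

# Hypothetical per-layer activation sizes (bytes): early layers are the biggest.
sizes = [51_000_000, 12_000_000, 3_000_000, 800_000]
print([round(M, 2) for M in memory_importance(sizes)])   # the 51 MB layer gets the lowest score
```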
Combining Metrics into a Pruning Ratio
Finally, SURGEON combines these two metrics into a single Importance Indicator (\(I_i\)). It normalizes both metrics to a 0-1 scale so they can be compared fairly:
\[
I_i = \mathbf{Norm}(M_i) \times \mathbf{Norm}(G_i),
\]

This indicator \(I_i\) tells us how “valuable” the activations of layer \(i\) are.
- High \(I_i\): The layer is learning a lot (High \(G\)) AND/OR is efficient with memory (High memory score, meaning small size). Action: Prune Little.
- Low \(I_i\): The layer isn’t learning much OR it is taking up way too much RAM. Action: Prune A Lot.
The final pruning ratio \(p_i^t\) (percentage of activations to drop) is calculated dynamically for each batch \(t\):
\[
p_i^t = 1 - \frac{I_i^t}{\max_{i \in \{1, 2, \ldots, l\}} \left(I_i^t\right)},
\]

This formula ensures that the most important layer in the network is pruned the least, and every other layer is scaled relative to it.
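Putting the two metrics together is a few lines of tensor math. The sketch below assumes \(\mathbf{Norm}(\cdot)\) is min-max normalization to \([0, 1]\); the paper’s exact normalization may differ, but the shape of the computation is the same.

```python
import torch

def pruning_ratios(G: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Per-layer pruning ratios p_i from gradient importance G_i and memory importance M_i."""
    def minmax(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    I = minmax(M) * minmax(G)        # importance indicator I_i
    return 1.0 - I / I.max()         # the most important layer gets p = 0 (pruned least)

G = torch.tensor([0.02, 0.10, 0.35, 0.40])   # hypothetical per-layer gradient importance
M = torch.tensor([0.30, 0.90, 1.60, 2.10])   # hypothetical per-layer memory importance
print(pruning_ratios(G, M))
```

One quirk of plain min-max normalization is that the least important layer ends up with \(I_i = 0\) and a pruning ratio of 1; a practical implementation would likely cap the ratio so no layer’s cache is discarded entirely.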
Experimental Results
Does this actually work? The authors tested SURGEON against standard baselines (like TENT and CoTTA) and memory-efficient baselines (like EcoTTA and MECTA).
Why Dynamic is Better than Static
You might wonder: “Why calculate all these metrics? Why not just set a global rule to prune 50% of activations everywhere?”
The authors compared SURGEON against “Static Activation Sparsity” (fixed pruning ratios).

Figure 3 shows the comparison. The x-axis represents the pruning ratio (how much data is thrown away), and the y-axis is the error rate (lower is better).
- Look at the red star (SURGEON) versus the teal line (Static).
- SURGEON achieves significantly lower error rates while maintaining high sparsity (around 80-90%).
- This proves that what you prune matters just as much as how much you prune. By keeping the important data and discarding the useless data, SURGEON outperforms blind static pruning.
Analyzing the Metrics
To understand what the algorithm is actually doing inside the network, let’s look at how the importance metrics change across layers.

Figure 4 visualizes the importance scores across the layers of a ResNet.
- Block 1 (Layers 4-16): Notice the importance drops sharply here. These early layers have massive activation maps (high memory cost). SURGEON identifies this and lowers their importance score, leading to aggressive pruning to save RAM.
- Deep Layers: In the later blocks, the importance is higher. These layers are smaller (low memory cost) but carry semantic information critical for classification (high gradient importance). SURGEON preserves these.
The red line (“Ours”) represents the combined metric, effectively balancing the blue (Gradient only) and green (Memory only) lines.
Benchmarks: CIFAR and ImageNet
The paper provides extensive tables, but let’s focus on the key takeaways from the standard benchmarks.
CIFAR-10-C Results (Table 1):

In Table 1, look at the comparison between CoTTA and SURGEON:
- CoTTA: Good accuracy (16.2% error), but massive cache size (3697 MB). This simply wouldn’t fit on many edge chips.
- SURGEON: Comparable accuracy (18.1% error), but with a tiny cache size (325 MB).
- Reduction: That is a 10x reduction in memory usage.
- Also, note that SURGEON is marked with an “X” in the “Original” column, meaning it works without modifying the training procedure, unlike EcoTTA.
ImageNet-C Results (Table 4):

On the much harder ImageNet dataset (Table 4), the trend continues.
- TENT: 2714 MB Cache.
- SURGEON: 1125 MB Cache.
- SURGEON cuts memory usage by more than half compared to TENT while maintaining nearly identical accuracy (55.5% vs 55.2%).
Real-World Deployment: Jetson Xavier
The ultimate test for any “efficient” algorithm is running it on actual edge hardware. The authors deployed SURGEON on an NVIDIA Jetson Xavier NX (a popular embedded AI computer).

Table 5 reveals the real-world impact:
- CoTTA requires 522ms to process a batch.
- SURGEON takes only 117ms.
- Memory: SURGEON uses roughly 1/10th the cache of CoTTA and comparable memory to MECTA.
Crucially, because SURGEON uses less memory, it avoids the “swapping” slowdowns that occur when a device runs out of RAM, leading to faster overall inference speeds.
Conclusion
The transition of AI from research labs to the real world relies on robustness. Models must adapt to changing environments (rain, fog, sensor noise) without crashing the hardware they run on.
SURGEON represents a significant step forward for Fully Test-Time Adaptation (FTTA). By treating memory as a dynamic resource—spending it where it helps learning and saving it where it doesn’t—it allows sophisticated adaptation algorithms to run on constrained devices.
Key Takeaways:
- Plug-and-Play: Unlike EcoTTA, SURGEON works with standard pre-trained models.
- Architecture Agnostic: Unlike MECTA, it doesn’t rely on Batch Norm, making it future-proof for Transformers.
- Smart Caching: By using Gradient and Memory importance, it outperforms static pruning, saving up to 10x memory with minimal accuracy loss.
For students and engineers looking to deploy robust AI on the edge, SURGEON offers a practical blueprint for balancing the trade-off between adaptability and efficiency.