Introduction
In the world of medical imaging, alignment is everything. Whether a clinician is tracking the growth of a tumor over time or comparing a patient’s brain anatomy to a standard atlas, the images must be precisely aligned. This alignment process is called Deformable Image Registration (DIR).
In recent years, deep learning has revolutionized this field. Networks like VoxelMorph replaced computationally expensive iterative algorithms with fast, learning-based models. Most of these models rely on Convolutional Neural Networks (CNNs). However, standard CNNs have a fundamental trait that acts as a double-edged sword: spatially shared weights.
In a standard convolution, the same filter (kernel) slides across the entire image. This implies that a feature (like an edge or a texture) is processed identically whether it appears in the top-left corner or the center. While this is excellent for object detection (a cat is a cat, regardless of position), it is suboptimal for medical registration. Why? Because biological tissues are not uniform.
Consider a brain scan: the rigid skull, the soft gray matter, and the fluid-filled ventricles all possess different physical properties. They deform differently. A convolution filter optimized for the skull might struggle to capture the subtle warping of soft tissue.
In this post, we will dive deep into SACB-Net, a novel architecture proposed by researchers from the University of Birmingham and the University of Manchester. The paper introduces the Spatial-Awareness Convolution Block (SACB), a mechanism that allows a network to “sense” the type of tissue it is processing and generate adaptive kernels on the fly. We will explore how this method overcomes the limitation of shared weights and achieves state-of-the-art results in both brain and abdominal registration.
The Problem with “Vanilla” Convolutions
To understand the innovation of SACB-Net, we must first visualize the limitation of traditional methods.
In a standard (“vanilla”) convolution layer, a learned kernel \(W\) is applied to the input feature map \(F\). Every voxel in \(F\) interacts with \(W\) in exactly the same way. The network assumes that the rules of feature extraction are spatially universal.
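To make the weight sharing concrete, here is a tiny PyTorch sketch (my own illustration, not code from the paper): a standard `Conv3d` layer stores a single kernel tensor and reuses it at every voxel position.

```python
import torch
import torch.nn as nn

# A standard 3D convolution: one kernel, reused at every voxel position.
conv = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

volume = torch.randn(1, 1, 32, 32, 32)   # a toy single-channel 3D image
features = conv(volume)                  # shape: (1, 8, 32, 32, 32)

# The layer holds exactly one weight tensor of shape (8, 1, 3, 3, 3);
# a voxel at the skull and a voxel in soft tissue are filtered identically.
print(conv.weight.shape)                 # torch.Size([8, 1, 3, 3, 3])
```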
However, medical registration is inherently a problem of local variation. The deformation field—the map telling us how to move pixels from Image A to match Image B—is highly dependent on the underlying anatomical structure.

As illustrated in Figure 1, a vanilla convolution pays “equal attention” to all regions. In contrast, a Spatial-Awareness Convolution (SAC) recognizes that different clusters (representing different tissue types) require unique processing weights. By adjusting the kernel based on the spatial context, the network can model complex, non-uniform deformations more accurately.
The Architecture: SACB-Net
The researchers propose SACB-Net, a pyramid-based network designed to estimate deformation fields from coarse to fine scales.
High-Level Overview
The network operates on a pair of images: a Moving Image (\(I_m\)) and a Fixed Image (\(I_f\)). The goal is to find a deformation field \(\phi\) such that \(I_m\) warped by \(\phi\) aligns with \(I_f\).
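To make “warping by \(\phi\)” concrete, here is a minimal sketch of how a 3D volume can be resampled with a dense displacement field using PyTorch’s `grid_sample`. It is illustrative only: the `warp` helper and its channel convention (displacements in voxels, ordered depth/height/width) are my assumptions, not the authors’ code.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a moving volume (B, C, D, H, W) by a displacement field (B, 3, D, H, W).

    Displacements are assumed to be in voxels; grid_sample expects normalized
    coordinates in [-1, 1], so we build an identity grid and add the rescaled
    displacement to it.
    """
    B, _, D, H, W = moving.shape
    zs = torch.linspace(-1, 1, D)
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    grid_z, grid_y, grid_x = torch.meshgrid(zs, ys, xs, indexing="ij")
    identity = torch.stack((grid_x, grid_y, grid_z), dim=-1)       # (D, H, W, 3), x-y-z order
    identity = identity.unsqueeze(0).expand(B, -1, -1, -1, -1)

    # Convert voxel displacements to the normalized coordinate range.
    disp = torch.stack((flow[:, 2] * 2 / max(W - 1, 1),
                        flow[:, 1] * 2 / max(H - 1, 1),
                        flow[:, 0] * 2 / max(D - 1, 1)), dim=-1)

    return F.grid_sample(moving, identity + disp, align_corners=True)

moving = torch.randn(1, 1, 16, 16, 16)
flow = torch.zeros(1, 3, 16, 16, 16)     # zero displacement -> identity warp
assert torch.allclose(warp(moving, flow), moving, atol=1e-5)
```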
The architecture, shown below, consists of two main components:
- A Shared Encoder: Extracts feature pyramids from both images.
- Pyramid Flow Estimators: A series of blocks that estimate the deformation at different resolutions, progressively refining the alignment.

The Shared Encoder (detailed in Figure 5 below) uses standard convolutional blocks to downsample the images, creating a hierarchy of features (\(F^1\) through \(F^5\)).

However, the real innovation lies within the Pyramid Flow Estimators, specifically in how they process these features using the Spatial-Awareness Convolution Block (SACB).
Deep Dive: The Spatial-Awareness Convolution Block (SACB)
The SACB is designed to refine feature maps by applying adaptive kernels. Instead of using one fixed kernel, it clusters the image features into different “regions” (e.g., bone, tissue, background) and generates specific kernel weights for each region.
Let’s break down the mathematics and mechanics of the SACB step-by-step.
Step 1: Spatial Context Estimation via Clustering
The first challenge is to determine which parts of the feature map belong to similar regions without using explicit labels (since we are doing unsupervised registration). The authors use K-Means clustering on the feature map itself.
First, the input feature map \(\mathbf{F}\) is “unfolded” into local patches. For a voxel at position \((d, h, w)\), the system looks at a local window (size \(k \times k \times k\)).

To reduce computational complexity, these patches are averaged to obtain a representative vector for that neighborhood:

Next, K-Means clustering is applied to these spatial means. This groups voxels with similar local features into \(N\) clusters. For each cluster \(n\), the algorithm calculates a centroid (\(S_n^c\)), which represents the “average” feature of that tissue type.

This process results in a map where every voxel is assigned a cluster index, effectively segmenting the image into latent “tissue types” based on feature similarity.
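A rough sketch of this step is shown below, assuming the patch means are computed with stride-1 average pooling and that a few plain k-means iterations run directly in PyTorch; the function name `cluster_context` and all shapes are hypothetical simplifications of the paper’s procedure.

```python
import torch
import torch.nn.functional as F

def cluster_context(feat, k=3, n_clusters=7, iters=10):
    """Assign each voxel of a feature map (B=1, C, D, H, W) to one of
    `n_clusters` latent regions based on its local patch mean."""
    # Step 1a: average each k*k*k neighborhood (stride 1 keeps the resolution).
    patch_mean = F.avg_pool3d(feat, kernel_size=k, stride=1, padding=k // 2)

    # Flatten to (num_voxels, C) so every voxel becomes one sample.
    B, C, D, H, W = patch_mean.shape
    samples = patch_mean.permute(0, 2, 3, 4, 1).reshape(-1, C)

    # Step 1b: plain k-means on the patch means (random init from the data).
    centroids = samples[torch.randperm(samples.shape[0])[:n_clusters]]
    for _ in range(iters):
        dists = torch.cdist(samples, centroids)      # (num_voxels, n_clusters)
        assign = dists.argmin(dim=1)
        for n in range(n_clusters):
            mask = assign == n
            if mask.any():
                centroids[n] = samples[mask].mean(dim=0)

    # Each voxel now carries a cluster index; centroids play the role of S_n^c.
    return assign.view(B, D, H, W), centroids

feat = torch.randn(1, 16, 24, 24, 24)
assignment, centroids = cluster_context(feat)
print(assignment.shape, centroids.shape)  # (1, 24, 24, 24), (7, 16)
```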
Step 2: Adaptive Kernel Generation
Once the centroids (\(S_n^c\)) are known, the network generates a specific convolution kernel for each cluster.
The network learns a global base kernel \(\mathbf{W}\). Simultaneously, a Multi-Layer Perceptron (MLP), denoted as \(\mathcal{F}_w\), takes the cluster centroid as input and predicts a specific weight adjustment.
The adaptive kernel \(\mathbf{W}_n\) for cluster \(n\) is calculated as the element-wise multiplication of the base kernel and the MLP output:
\[ \mathbf{W}_n = \mathbf{W} \odot \mathcal{F}_w(S_n^c) \]
This is a powerful concept. The network is essentially saying, “I have a general idea of how to process features (Global \(\mathbf{W}\)), but for this specific tissue type (Centroid \(S_n^c\)), I need to adjust my filters by this amount.”
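A minimal sketch of the kernel-generation idea follows. The MLP depth, hidden width, and the flattened output layout are assumptions for illustration; only the structure \(\mathbf{W}_n = \mathbf{W} \odot \mathcal{F}_w(S_n^c)\) comes from the paper.

```python
import torch
import torch.nn as nn

C_in, C_out, k, n_clusters = 16, 16, 3, 7

# Global base kernel W, shared by all clusters.
base_kernel = nn.Parameter(torch.randn(C_out, C_in, k, k, k))

# MLP F_w: maps a C_in-dimensional centroid to one modulation value per kernel weight.
kernel_mlp = nn.Sequential(
    nn.Linear(C_in, 64),
    nn.ReLU(inplace=True),
    nn.Linear(64, C_out * C_in * k * k * k),
)

centroids = torch.randn(n_clusters, C_in)   # S_n^c from the clustering step

# Element-wise modulation: W_n = W * F_w(S_n^c), one kernel per cluster.
modulation = kernel_mlp(centroids).view(n_clusters, C_out, C_in, k, k, k)
adaptive_kernels = base_kernel.unsqueeze(0) * modulation   # (n_clusters, C_out, C_in, k, k, k)
print(adaptive_kernels.shape)
```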
Step 3: Performing the Convolution
With specific kernels generated for every region, the convolution operation becomes spatially adaptive. For a voxel belonging to cluster \(n\), the output is computed using \(\mathbf{W}_n\) and a similarly generated bias term:

Finally, to ensure gradient flow and stability, this adaptive convolution is applied as a residual block (added back to the original input):

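One straightforward (though not the most efficient) way to realize this is to convolve the whole feature map with each cluster’s kernel, keep each result only where the assignment map selects that cluster, and add the input back as a residual. The sketch below is an illustrative implementation under those assumptions, not the authors’ code.

```python
import torch
import torch.nn.functional as F

def spatially_adaptive_conv(feat, assignment, adaptive_kernels, biases):
    """feat: (1, C_in, D, H, W); assignment: (1, D, H, W) cluster indices;
    adaptive_kernels: (N, C_out, C_in, k, k, k); biases: (N, C_out)."""
    n_clusters, C_out = adaptive_kernels.shape[:2]
    k = adaptive_kernels.shape[-1]

    out = torch.zeros(feat.shape[0], C_out, *feat.shape[2:])
    for n in range(n_clusters):
        # Convolve everywhere with the n-th kernel...
        conv_n = F.conv3d(feat, adaptive_kernels[n], bias=biases[n], padding=k // 2)
        # ...but keep the result only where the voxel belongs to cluster n.
        mask = (assignment == n).unsqueeze(1).float()
        out = out + conv_n * mask

    # Residual connection for stable gradient flow (requires C_out == C_in).
    return feat + out

feat = torch.randn(1, 16, 24, 24, 24)
assignment = torch.randint(0, 7, (1, 24, 24, 24))
kernels = torch.randn(7, 16, 16, 3, 3, 3)
biases = torch.zeros(7, 16)
refined = spatially_adaptive_conv(feat, assignment, kernels, biases)
print(refined.shape)   # (1, 16, 24, 24, 24)
```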
The entire pipeline of the SACB is visualized in Figure 3 below. Note the two parallel paths: the top path determines the “context” (clusters), while the bottom path prepares the features. They merge when the generated kernels are applied to the features.

The Pyramid Flow Estimator
SACB-Net does not try to predict the complex deformation field in a single shot. Instead, it uses a coarse-to-fine pyramid approach.
- Coarsest Level (\(i=5\)): The network looks at the lowest resolution features. It refines them using SACB.
- Similarity Matching: It calculates the similarity between the fixed features and the moving features. The authors use a dot-product-based matching score (Softmax of the inner product) to find correspondences.

- Flow Estimation: Based on these matching scores, a local sub-deformation (flow) is calculated.

- Upsampling and Composition: This coarse flow is upsampled and used to warp the moving features of the next level (\(i=4\)). The process repeats, with the network estimating the residual motion (the difference) at each finer scale.
The recursive process is mathematically defined as:

This ensures that large deformations are handled at low resolutions (where structures appear closer together), while fine details are aligned at high resolutions.
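Schematically, the coarse-to-fine refinement can be written as a short loop. The sketch below treats each level’s flow estimator as a black box, reuses the hypothetical `warp` helper from earlier, and approximates composition by pre-warping the moving features and summing residual displacements; the paper composes the sub-deformation fields explicitly.

```python
import torch.nn.functional as F

def pyramid_registration(feats_fixed, feats_moving, estimate_flow, warp):
    """feats_*: lists of feature maps ordered from coarsest to finest.
    estimate_flow(f_fixed, f_moving) -> residual flow at that resolution."""
    flow = None
    for f_fixed, f_moving in zip(feats_fixed, feats_moving):
        if flow is not None:
            # Upsample the coarser flow to the current resolution
            # (displacement magnitudes are scaled by the same factor of 2).
            flow = 2.0 * F.interpolate(flow, scale_factor=2,
                                       mode="trilinear", align_corners=True)
            # Pre-warp the moving features with the flow estimated so far.
            f_moving = warp(f_moving, flow)

        # Estimate only the residual motion that remains at this scale.
        residual = estimate_flow(f_fixed, f_moving)
        flow = residual if flow is None else flow + residual

    return flow
```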
Loss Functions
SACB-Net is trained in an unsupervised manner. This means it doesn’t need “ground truth” deformation fields (which are rarely available in medicine). Instead, it optimizes two objectives:
- Similarity Loss (\(\mathcal{L}_{sim}\)): Ensures the warped moving image looks like the fixed image.
- Regularization Loss (\(\mathcal{L}_{reg}\)): Ensures the deformation field is smooth and physically plausible (no tearing or folding of space).

The specific similarity metric used is Normalized Cross-Correlation (NCC), which is robust to intensity variations.
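The shape of these two objectives is captured by the sketch below: a global NCC term (registration papers, including this one, typically use a local, windowed NCC) plus a first-order gradient penalty on the displacement field. It is a simplification for illustration, not the exact loss implementation.

```python
import torch

def ncc_loss(warped, fixed, eps=1e-8):
    """Negative normalized cross-correlation (global version for brevity)."""
    w = warped - warped.mean()
    f = fixed - fixed.mean()
    cc = (w * f).sum() / (w.norm() * f.norm() + eps)
    return -cc

def smoothness_loss(flow):
    """Penalize spatial gradients of the displacement field (B, 3, D, H, W)."""
    dz = (flow[:, :, 1:] - flow[:, :, :-1]).pow(2).mean()
    dy = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).pow(2).mean()
    dx = (flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]).pow(2).mean()
    return dz + dy + dx

# Total objective: similarity + weighted regularization (lambda is a hyperparameter).
# loss = ncc_loss(warped_moving, fixed) + lam * smoothness_loss(flow)
```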

Experimental Results
The authors evaluated SACB-Net on three distinct tasks: Atlas-based brain registration (IXI dataset), Inter-subject brain registration (LPBA dataset), and Inter-subject Abdomen CT registration.
Quantitative Performance
The results were measured using the Dice score (overlap of anatomical structures, higher is better) and the percentage of voxels with a negative Jacobian determinant (a measure of folding in the deformation field, lower is better).
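For reference, here is a generic sketch of how these two metrics are typically computed from a label mask and a displacement field; it is not the paper’s evaluation code.

```python
import torch

def dice(mask_a, mask_b, eps=1e-8):
    """Dice overlap between two binary label masks."""
    inter = (mask_a * mask_b).sum()
    return (2 * inter / (mask_a.sum() + mask_b.sum() + eps)).item()

def folding_ratio(flow):
    """Fraction of voxels whose local Jacobian determinant is negative,
    using finite differences on a displacement field of shape (3, D, H, W)."""
    grads = torch.gradient(flow, dim=(1, 2, 3))   # 3 tensors, each (3, D, H, W)
    jac = torch.stack(grads, dim=-1)              # [i, d, h, w, j] = du_i / dx_j
    jac = jac.permute(1, 2, 3, 0, 4)              # (D, H, W, 3, 3)
    jac = jac + torch.eye(3)                      # deformation = identity + displacement
    det = torch.det(jac)
    return (det < 0).float().mean().item()
```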
Brain Registration (IXI and LPBA): In the table below, SACB-Net (“Ours”) shows superior performance compared to U-Net based methods (like VoxelMorph/VM) and Transformer-based methods (TransMorph). Notably, on the LPBA dataset, it achieves the highest Dice score (0.731) with a very low number of parameters (1.11M) compared to TransMorph (46.77M).

Abdomen CT Registration: Abdominal registration is notoriously difficult due to the high variability of organ placement and large deformations (breathing, digestion). Here, the gap between SACB-Net and the competition widens. SACB-Net achieves a Dice score of 0.588, significantly outperforming the next best method.

Qualitative Visualization
Visual inspection confirms the numbers. In Figure 4 (below), comparing the columns, we can see that SACB-Net provides sharper alignment.
Look closely at the Abdomen CT rows (bottom two). Many methods struggle to preserve the boundaries of the organs. TransMorph and RDN fail to register parts of the kidneys (missing chunks in the warped mask). SACB-Net, however, maintains the structural integrity of the kidneys and liver.

We can also view the displacement fields (the colorful grids). SACB-Net produces smooth, coherent fields, whereas some competitors show erratic or noisy deformations.
Below are additional visual comparisons for the LPBA and IXI datasets, further highlighting the precision of the displacement fields.


Dealing with Failure
No method is perfect. The authors candidly present a failure case in Figure 7, where a small organ (gallbladder) resulted in a low Dice score (<0.5). Small, variable structures remain a challenge for unsupervised methods relying on global intensity matching.

Ablation Studies: Does Spatial Awareness Matter?
To prove that the performance boost comes specifically from the SACB and not just the pyramid structure, the authors conducted ablation studies.
They tested the network with and without SACB at different scales and with varying numbers of clusters (\(N\)).
- Impact of Scales: Applying SACB at multiple scales (from Scale 5 down to Scale 2) consistently improves results.
- Impact of Clusters: Increasing \(N\) (the number of tissue types identified) improves accuracy up to a point. The sweet spot appears to be around \(N=7\).
- Method of Context: Using “Spatial” means (averaging spatial patches) worked better than averaging across channels.

Conclusion
SACB-Net represents a significant step forward in medical image registration. By challenging the assumption that convolution kernels should be shared universally across an image, the authors have created a network that adapts to the anatomy it sees.
Key Takeaways:
- Spatial Context Matters: Medical images are heterogeneous. Treating skull, brain, and fluid identically leads to suboptimal registration.
- Clustering as Attention: Using unsupervised K-Means to cluster features provides a robust signal for generating adaptive kernels.
- Efficiency: Despite its complex adaptive mechanism, SACB-Net remains computationally efficient, outperforming massive Transformer models with a fraction of the parameters.
For students and researchers in this field, SACB-Net demonstrates that innovation often comes not from just stacking more layers, but from rethinking the fundamental operations—like convolution—to better suit the specific characteristics of the data.