Introduction
In the world of computational pathology, a picture is worth much more than a thousand words—it might be worth thousands of gene expression profiles.
For decades, pathologists have diagnosed diseases using Hematoxylin and Eosin (H&E) stained slides. These images reveal tissue morphology—the shape and structure of cells. However, morphology is only half the story. The molecular drivers of disease, specifically gene expression, are invisible to the naked eye. Spatial Transcriptomics (ST) is a revolutionary technology that bridges this gap, allowing scientists to map gene expression to specific physical locations on a tissue slide. It is akin to moving from a satellite map (visual features) to a street view with demographic data (molecular features).
The catch? ST is expensive, slow, and requires specialized equipment. This has created a massive demand for AI models that can “hallucinate” or predict spatial transcriptomics data directly from standard, cheap H&E images.
While recent deep learning attempts have been promising, they face two significant hurdles:
- Isolation: They often treat small patches of tissue in isolation, ignoring the biological reality that cells communicate and interact with their neighbors.
- Scalability: Methods that do try to look at the whole slide often exhaust GPU memory, because a single slide can contain tens of thousands of spots.
Enter STFlow, a new model presented at ICML 2025. This research proposes a generative approach using Flow Matching to predict gene expression. Unlike previous methods that guess values in a single shot, STFlow treats the problem as an iterative generation process, modeling the joint distribution of genes across the entire slide.
In this post, we will tear down the STFlow architecture, explain why “Flow Matching” fits this biological problem perfectly, and look at how it manages to be both more accurate and more efficient than its predecessors.
Background: The Challenge of Histology-to-ST
To understand STFlow, we first need to understand the data structure. An H&E whole-slide image (WSI) is usually segmented into a grid of small capture regions called spots.
- Input: An image patch for a specific spot and its \((x, y)\) coordinates.
- Output: The gene expression levels (counts of RNA transcripts) for thousands of genes at that spot.
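In code, one training example can be sketched as a simple record. The shapes and field names below are illustrative assumptions (a 224×224 patch and a 200-gene panel are common choices, but not specified by the paper):

```python
import numpy as np

# Hypothetical shapes: a 224x224 RGB patch, (x, y) spot coordinates, and
# counts for a panel of genes. NUM_GENES is an arbitrary illustrative size.
NUM_GENES = 200

def make_spot(patch: np.ndarray, x: float, y: float, counts: np.ndarray) -> dict:
    """Bundle one ST spot: input image patch + coordinates, target expression."""
    assert patch.shape == (224, 224, 3)
    assert counts.shape == (NUM_GENES,)
    return {"patch": patch, "coords": np.array([x, y]), "expression": counts}

spot = make_spot(np.zeros((224, 224, 3)), 10.5, 42.0, np.zeros(NUM_GENES))
```

A whole slide is then just a list of thousands of these records sharing one coordinate system, which is what makes joint modeling across spots possible.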
The Limitation of Regression
Most prior works frame this as a simple regression problem. They take an image patch, run it through a Convolutional Neural Network (CNN) or a Transformer, and ask the model to output the gene expression vector for that patch in a single forward pass.
The problem with this approach is independence. By predicting spot A and spot B separately, the model ignores cell-cell interaction. In biology, a tumor cell at spot A might be secreting signals that suppress immune cells at spot B. If the model doesn’t look at A and B together, it misses the context necessary to make an accurate prediction.
Some “slide-based” methods have tried to fix this using global attention mechanisms, but they hit a computational wall. Calculating attention between 10,000+ spots requires massive memory (\(O(N^2)\) complexity), making it impractical for clinical workflows.
The Core Method: STFlow
STFlow changes the paradigm from regression to generation. Instead of asking “What is the exact gene count here?”, it asks “What is the most likely distribution of gene counts across this whole slide, given the image?”

As shown in Figure 1, the process works in three stages:
- Visual Encoding (a): Each spot image is processed by a Pathology Foundation Model (like UNI or Gigapath) to extract high-level visual features (\(Z\)).
- Contextual Encoding (b): The model looks at neighbors to understand spatial dependencies.
- Flow Matching (c): The model starts with random noise and iteratively refines it into a clean gene expression map.
Let’s break down the two main technical innovations: the Generative Framework and the Denoiser Architecture.
1. Learning with Flow Matching
Flow Matching is a technique for training Continuous Normalizing Flows. Intuitively, it learns a "vector field" that pushes a probability distribution from a simple shape (noise) to a complex shape (data).
In STFlow, the researchers reformulate gene prediction as a conditional generation task.
\[ \min_{\theta} \; \mathrm{MSE}\left( Y, \; f_{\theta}\left( Y_t, I, C, t \right) \right) \]
Here, the model \(f_\theta\) (the denoiser) tries to predict the clean gene expression \(Y\) given a noisy version \(Y_t\), the images \(I\), coordinates \(C\), and the time step \(t\).
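A minimal sketch of one training step under this objective, assuming a generic `denoiser` callable and a linear interpolation path between noise and data (all names are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(y_clean, y_noise, t):
    """Point on the straight path from the prior sample to the data."""
    return (1 - t) * y_noise + t * y_clean

def training_step(denoiser, y_clean, y_noise, images, coords):
    t = rng.uniform()                        # random time step in [0, 1]
    y_t = interpolate(y_clean, y_noise, t)   # noisy version of the target
    y_pred = denoiser(y_t, images, coords, t)
    return np.mean((y_clean - y_pred) ** 2)  # MSE against clean expression

# Toy check with an "identity" denoiser that just returns its noisy input.
loss = training_step(lambda y_t, I, C, t: y_t,
                     np.ones((4, 8)), np.zeros((4, 8)), None, None)
```

The key point is that the network always regresses the *clean* slide-wide expression map from a partially noised one, which is what makes iterative refinement possible at inference time.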
The Choice of Prior: ZINB
In generative models like Diffusion, we usually start with Gaussian (Normal) noise. However, gene expression data is unique. It is sparse (many zeros) and overdispersed (variance is higher than the mean).
The authors analyzed standard datasets and found that a Gaussian prior doesn’t fit biological reality.

As seen in Figure 2, the data is heavily skewed toward zero. To account for this, STFlow uses a Zero-Inflated Negative Binomial (ZINB) distribution as its prior.
\[ p(y \mid \mu, \phi, \pi) = \begin{cases} \pi + (1 - \pi)\left(\dfrac{\Gamma(y+\phi)}{\Gamma(\phi)\, y!}\right)\left(\dfrac{\phi}{\phi+\mu}\right)^{\phi}\left(\dfrac{\mu}{\phi+\mu}\right)^{y} & \text{if } y = 0, \\[2ex] (1 - \pi)\left(\dfrac{\Gamma(y+\phi)}{\Gamma(\phi)\, y!}\right)\left(\dfrac{\phi}{\phi+\mu}\right)^{\phi}\left(\dfrac{\mu}{\phi+\mu}\right)^{y} & \text{if } y > 0. \end{cases} \]
This complex formula essentially says: “There is a probability \(\pi\) that the gene count is exactly zero (dropout/sparsity). If it’s not zero, it follows a Negative Binomial distribution defined by mean \(\mu\) and dispersion \(\phi\).” By sampling the starting noise from this distribution rather than a Gaussian one, the model starts much closer to the “truth,” making the generation process easier and more accurate.
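To see what sampling from such a prior looks like, here is a minimal sketch using the standard Gamma-Poisson construction of the Negative Binomial plus a Bernoulli dropout mask (parameter values are arbitrary; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_zinb(mu, phi, pi, size):
    """Sample ZINB noise: Negative Binomial via its Gamma-Poisson mixture,
    with an extra zero occurring with probability pi (dropout/sparsity)."""
    lam = rng.gamma(shape=phi, scale=mu / phi, size=size)  # NB rate
    counts = rng.poisson(lam)                              # NB(mu, phi) sample
    dropout = rng.random(size) < pi                        # inflate zeros
    return np.where(dropout, 0, counts)

# With heavy overdispersion (small phi), most samples are exactly zero,
# matching the sparsity seen in real ST count data.
noise = sample_zinb(mu=2.0, phi=0.5, pi=0.3, size=10_000)
```

Starting the flow from draws like these, rather than from a Gaussian, means the initial samples already share the zero-heavy, overdispersed shape of real counts.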
2. The Architecture: E(2)-Invariant Denoiser
The second major innovation is how the model handles spatial data. Tissue slides can be rotated 90 degrees, flipped, or shifted, and the biological meaning shouldn’t change. A tumor is a tumor, whether it’s on the left or right side of the image. This property is called E(2)-invariance (Euclidean group in 2D).
Standard Transformers are not E(2)-invariant; they are sensitive to specific coordinate values. STFlow solves this using Frame Averaging (FA).

How Frame Averaging Works
Instead of feeding raw coordinates into the network, the model:
- Calculates the direction vectors between a spot and its neighbors.
- Uses Principal Component Analysis (PCA) to find the “frames” (dominant axes) of the local point cloud.
- Projects the data into these frames.
- Averages the results.
This ensures that no matter how you rotate the input slide, the feature representation inside the neural network remains consistent.
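The steps above can be sketched in a few lines: enumerate the sign-ambiguous PCA axes of a 2D point cloud as frames, encode the projected coordinates in each frame, and average. This is a toy illustration of the idea, not the paper's implementation:

```python
import numpy as np

def local_frames(coords):
    """PCA axes of a centered 2D point cloud; eigenvector signs are
    ambiguous, so all four sign combinations are enumerated as frames."""
    centered = coords - coords.mean(axis=0)
    _, vecs = np.linalg.eigh(centered.T @ centered)  # principal axes (2x2)
    frames = [vecs * np.array([s1, s2]) for s1 in (1, -1) for s2 in (1, -1)]
    return centered, frames

def frame_average(coords, encoder):
    """Encode the coordinates projected into every frame, then average.
    The result is invariant to rotations/reflections of the input."""
    centered, frames = local_frames(coords)
    return np.mean([encoder(centered @ F) for F in frames], axis=0)

# Invariance check: rotating the cloud leaves the averaged output unchanged.
rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 2)) * np.array([3.0, 1.0])  # anisotropic cloud
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
enc = lambda x: np.abs(x).sum()   # toy encoder, itself NOT invariant
out_a = frame_average(pts, enc)
out_b = frame_average(pts @ R.T, enc)
```

Because the PCA axes rotate along with the data and the sign flips are averaged out, `out_a` and `out_b` agree even though the encoder itself has no built-in invariance.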
Spatial Attention with Interaction
The attention mechanism in STFlow is designed to explicitly model cell-cell interactions.
\[ A_{ij} = \operatorname{Softmax}_i\left( \mathrm{MLP}\left( Z_{Q,i} \,\|\, Z_{K,j} \,\|\, C'_{ij} \,\|\, (Y_{t,i} - Y_{t,j}) \right) \right) \]
Look closely at the inputs to the MLP in the equation above. It uses visual features (\(Z\)), spatial relations (\(C'\)), and, crucially, the difference in gene expression (\(Y_{t,i} - Y_{t,j}\)). Because this is an iterative process, the model can use the current estimate of gene expression to inform the attention weights. This allows the model to learn relationships like “If neighbor J has high expression of Gene X, spot I should pay more attention to it.”
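A toy version of this attention score, with a single random linear layer standing in for the MLP (all shapes, dimensions, and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_weights(z_i, z_nbrs, c_rel, y_i, y_nbrs, w):
    """Score each neighbor j of spot i from the concatenation
    [ Z_Qi || Z_Kj || C'_ij || (Y_ti - Y_tj) ], then softmax over j."""
    feats = [np.concatenate([z_i, z_nbrs[j], c_rel[j], y_i - y_nbrs[j]])
             for j in range(len(z_nbrs))]
    scores = np.stack(feats) @ w         # linear layer standing in for the MLP
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

d_z, d_y, k = 4, 3, 5                    # toy feature dims, k neighbors
w = rng.normal(size=2 * d_z + 2 + d_y)   # 2 visual vecs + rel. coords + expr. diff
a = attention_weights(rng.normal(size=d_z), rng.normal(size=(k, d_z)),
                      rng.normal(size=(k, 2)), rng.normal(size=d_y),
                      rng.normal(size=(k, d_y)), w)
```

The expression-difference term is what changes between refinement steps: as the estimate \(Y_t\) improves, the attention weights themselves are recomputed with better information.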
Experiments & Results
The researchers evaluated STFlow on two massive benchmarks: HEST-1k and STImage-1K4M, comprising 17 different datasets across various organs (Breast, Prostate, Lung, etc.).
Gene Expression Prediction
The primary metric is the Pearson Correlation Coefficient (PCC) between the predicted gene counts and the ground truth.

As shown in Table 1, STFlow (far right column) consistently outperforms both spot-based methods (like UNI and Ciga) and complex slide-based methods (like Gigapath-slide and TRIPLEX).
- Average Improvement: Over 18% relative improvement over pathology foundation models.
- Consistency: It achieves the best results in almost every organ type.
Biomarker Discovery
Clinical value lies in identifying specific biomarkers—genes that indicate disease prognosis or treatment response. The authors tested prediction accuracy for key cancer markers: GATA3, ERBB2, UBE2C, and VWF.

Visualizing the predictions (Figure 7) reveals that STFlow (rightmost column) generates heatmaps that are startlingly close to the Ground Truth (leftmost column). Competitors like BLEEP or STNet often produce noisy or blurry maps that miss the structural definition of the tumor regions.

Table 2 confirms this quantitatively. STFlow achieves the highest correlation for all four biomarkers. Notably, the table includes an ablation “STFlow w/o FM” (without Flow Matching). The drop in performance there proves that the generative, iterative approach provides a significant boost over simple regression.
The Refinement Process
One of the most fascinating aspects of Flow Matching is watching the model “think.” Because it generates data over time steps (\(t\)), we can visualize the transition from noise to signal.

In Figure 6, you can see the gene expression map sharpening. At Step 1, it is a vague blur. By Step 5, distinct morphological structures appear that match the underlying tissue architecture. This iterative refinement allows the model to correct errors and sharpen boundaries that one-step regression models essentially “blur out.”
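The refinement loop itself can be sketched as a few Euler steps along the learned flow, assuming an abstract `denoiser` callable (a simplified illustration, not the authors' sampler):

```python
import numpy as np

def sample(denoiser, y_noise, images, coords, num_steps=5):
    """Integrate from the prior sample toward clean expression. Each step,
    the denoiser predicts the clean target and we move part of the way
    toward it (an Euler discretization of the flow)."""
    y_t = y_noise.astype(float)
    for step in range(num_steps):
        t = step / num_steps
        y_pred = denoiser(y_t, images, coords, t)
        y_t = y_t + (y_pred - y_t) / (num_steps - step)  # step toward prediction
    return y_t

# With a denoiser that always predicts the same target, the trajectory
# lands exactly on that target after the final step.
target = np.full((4, 8), 2.0)
result = sample(lambda y_t, I, C, t: target, np.zeros((4, 8)), None, None)
```

Each intermediate `y_t` in this loop corresponds to one of the progressively sharper maps in Figure 6.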
Efficiency and Scalability
High accuracy usually comes at the cost of high compute. However, STFlow utilizes a Local Spatial Attention mechanism (restricting attention to \(k\)-nearest neighbors) rather than global attention.

Figure 5 shows the runtime (y-axis) vs. the number of spots (x-axis).
- Blue/Orange Lines (Competitors): As the number of spots grows, their runtime spikes or plateaus at a high level.
- Purple Line (STFlow): Remains incredibly fast and flat, even as the slide size scales up to 30,000 spots.
By using local attention combined with the efficient Frame Averaging technique, STFlow avoids the quadratic complexity trap, making it feasible to run on standard GPUs without running out of memory.
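The neighbor restriction is conceptually simple; here is a brute-force sketch (for large slides a KD-tree would replace the \(O(N^2)\) distance matrix used here for clarity):

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest neighbors of every spot, so attention cost
    scales with N*k pairs instead of N^2."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)          # a spot is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]  # k closest spots per row

nbrs = knn_indices(np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]), k=1)
```

With a fixed small \(k\), attention cost grows linearly in the number of spots, which is why the purple curve in Figure 5 stays flat out to 30,000 spots.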
Conclusion
STFlow represents a sophisticated step forward in digital pathology. By moving away from simple regression and embracing a generative flow matching framework, it captures the complex joint distributions of gene expression across a tissue slide.
Its success relies on three pillars:
- Biological Intuition: Using a ZINB prior to match the sparsity of gene data.
- Geometric Rigor: Using Frame Averaging to ensure the model understands tissue structure regardless of orientation.
- Contextual Modeling: Using iterative refinement to allow neighboring cells to inform each other’s predicted state.
For students and researchers in this field, STFlow demonstrates that “how” you predict (generative vs. regression) matters just as much as “what” you use to predict (foundation models). As spatial transcriptomics continues to mature, lightweight, scalable models like this will be essential for bringing these insights out of the lab and into clinical workflows.