In the world of deep learning, particularly in computer vision and natural language processing (NLP), starting from scratch is almost a cardinal sin. You wouldn’t train a language model from a blank slate when you can fine-tune BERT or GPT; you wouldn’t train an image classifier from random weights when you can start from ImageNet-pretrained ones. This concept, known as transfer learning, is the engine driving modern AI.
However, in Computational Pathology (CPath)—the field dedicated to analyzing digitized tissue slides for cancer diagnosis—this standard practice hasn’t fully taken hold. When researchers build Multiple Instance Learning (MIL) models to analyze gigapixel whole slide images (WSIs), they almost exclusively initialize the aggregation networks with random weights.
Why is this the status quo? And more importantly, are we leaving performance on the table by ignoring transfer learning?
A comprehensive new study, “Do Multiple Instance Learning Models Transfer?”, tackles these questions head-on. The researchers systematically evaluated 11 different MIL architectures, pretrained on 21 tasks and evaluated on 19 downstream target tasks, to determine whether pretraining MIL models can boost performance, improve data efficiency, and even outperform massive “foundation models.”
The Unique Challenge of Pathology
To understand why this paper is significant, we first need to understand the architecture of a pathology pipeline.
Digitized tissue slides (WSIs) are enormous—often exceeding 100,000 \(\times\) 100,000 pixels. You cannot feed an image this size into a standard neural network; the memory requirements would be impossible. Instead, the field uses a two-step framework called Multiple Instance Learning (MIL):
- Patching & Encoding: The slide is cut into thousands of small squares (patches). A pretrained image encoder (like ResNet or a pathology-specific encoder like UNI) converts each patch into a feature vector.
- Aggregation: This is the MIL part. A trainable “aggregator” network takes this bag of feature vectors and pools them together to make a single prediction for the whole slide (e.g., “Cancer” or “No Cancer”).
While the encoders (Step 1) are heavily pretrained, the aggregator (Step 2) is usually trained from scratch for every new specific task. The researchers argue that this is a missed opportunity. If an aggregator learns how to identify cancer in a lung biopsy, shouldn’t that knowledge help it analyze a breast biopsy?
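To make the trainable half of this pipeline concrete, here is a minimal PyTorch sketch of a gated-attention aggregator in the spirit of ABMIL. The dimensions, module names, and class structure are illustrative assumptions for this post, not the exact implementation evaluated in the paper.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Minimal ABMIL-style aggregator: a bag of patch features -> one slide-level prediction."""
    def __init__(self, in_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)          # one attention score per patch
        self.classifier = nn.Linear(in_dim, n_classes)  # slide-level prediction head

    def forward(self, bag: torch.Tensor):
        # bag: (n_patches, in_dim) precomputed patch embeddings from a frozen encoder
        scores = self.attn_w(self.attn_v(bag) * self.attn_u(bag))  # (n_patches, 1)
        weights = torch.softmax(scores, dim=0)                     # attention over the bag
        slide_embedding = (weights * bag).sum(dim=0)               # weighted pooling
        return self.classifier(slide_embedding), weights

# Usage: a "slide" is simply a variable-length bag of patch feature vectors.
model = GatedAttentionMIL()
logits, attn = model(torch.randn(5000, 1024))  # e.g. 5,000 patches with 1024-d features
```

It is this aggregator, not the patch encoder, whose weights the paper proposes to pretrain and transfer.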
The Experiment: A Massive Evaluation of Transfer
The authors designed a rigorous experimental setup to test the “transferability” of MIL models.
- 11 Architectures: They tested everything from simple attention mechanisms (ABMIL) to complex Transformer-based aggregators (TransMIL).
- 21 Pretraining Tasks: Models were pretrained on diverse datasets, including specific organs (like lung or breast) and massive pancancer datasets (covering multiple cancer types simultaneously).
- 19 Target Tasks: The pretrained models were then tested on downstream tasks they hadn’t seen before, ranging from tumor subtyping to genetic mutation prediction.
The “Pancancer” Hypothesis
A key component of this study was the creation of two specific pretraining benchmarks: PC-43 and PC-108. These are hierarchical classification tasks derived from over 3,900 slides across varied organs.
- PC-43: Classifies the slide into one of 43 coarse cancer types.
- PC-108: A fine-grained task classifying 108 specific subtypes.
The hypothesis was that a model forced to distinguish between 108 different cancer subtypes would learn a “universal grammar” of tissue architecture that could transfer anywhere.
Key Result 1: Pretraining Always Wins
The most immediate finding is visually striking. When the researchers compared models trained from random initialization against those initialized with weights from the PC-108 pancancer task, the difference was undeniable.

As shown in Figure 1 above, for every single architecture tested—from ABMIL to TransMIL—the pretrained version (red dots) outperformed the random initialization (black dots). On average, pretraining provided a 3.3% performance boost.
This debunks the idea that MIL aggregators are too task-specific to transfer. Whether the model is simple or complex, starting with knowledge is better than starting with nothing.
Key Result 2: You Can Transfer Across Organs
One might assume that a model trained on lung cancer would only be useful for other lung tasks. The study found otherwise.
The researchers used a “frozen feature” evaluation: the pretrained aggregator is kept fixed, its slide-level representations are extracted, and a simple K-Nearest Neighbors (KNN) classifier measures how well those raw representations separate the data in completely different target tasks.
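As a rough sketch of what this looks like in practice, assuming slide-level embeddings have already been extracted with a frozen aggregator (the arrays below are random placeholders):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

# Placeholder data standing in for slide-level embeddings from a frozen aggregator.
rng = np.random.default_rng(0)
train_emb, train_y = rng.normal(size=(200, 512)), rng.integers(0, 2, size=200)
test_emb, test_y = rng.normal(size=(50, 512)), rng.integers(0, 2, size=50)

# No fine-tuning: a simple KNN probe measures how well the frozen
# representations already separate the target task's classes.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(train_emb, train_y)
print("balanced accuracy:", balanced_accuracy_score(test_y, knn.predict(test_emb)))
```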

Figure 2 presents a heatmap of these results. Red indicates an improvement over random weights. The dominance of red across the board tells us two things:
- Pancancer is King: The columns for PC-43 and PC-108 (pancancer tasks) show the strongest consistent improvements (darkest red).
- Cross-Organ Transfer Works: A model trained on lung cancer (NSCLC) transferred effectively to breast cancer tasks (BCNB). This suggests that different cancers share fundamental morphological features—like tumor density or immune cell infiltration—that the MIL model learns to recognize.
Key Result 3: Pretraining Makes Models “Data Efficient”
In clinical settings, collecting thousands of labeled slides for a rare disease is impossible. This is where few-shot learning—the ability to learn from very few examples—becomes critical.
The authors tested how well these models performed when given only a handful of training samples (4, 16, or 32 slides per class).
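Building such a few-shot split comes down to sampling K slides per class before fine-tuning. A small sketch (slide IDs and labels are placeholders):

```python
import random
from collections import defaultdict

def k_shot_subset(slide_ids, labels, k, seed=0):
    """Sample k slides per class to simulate a few-shot training set."""
    by_class = defaultdict(list)
    for sid, y in zip(slide_ids, labels):
        by_class[y].append(sid)
    rng = random.Random(seed)
    subset = []
    for ids in by_class.values():
        subset += rng.sample(ids, min(k, len(ids)))
    return subset

# e.g. K = 4 slides per class, the scarcest setting in the study
few_shot_ids = k_shot_subset([f"slide_{i}" for i in range(100)],
                             [i % 2 for i in range(100)], k=4)
```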

Figure 3 illustrates the gap in data efficiency. The red and blue lines (pancancer pretraining) are significantly higher than the black line (random initialization), especially when data is scarcest (\(K=4\)). For the DTFD architecture, pretraining boosted performance by 171% in the 4-shot setting. This means institutions with small datasets can achieve high-performance AI by leveraging pretrained weights.
Key Result 4: Scaling Laws in Pathology
In general deep learning, bigger models are usually better—if you have enough data. If you train a huge model on a small dataset from scratch, it usually overfits and fails.
The researchers investigated if this holds true for MIL by scaling up the ABMIL architecture from 0.2 million to 9 million parameters.
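A quick way to picture such a sweep is to widen the aggregator's hidden layer and count parameters. The snippet below reuses the hypothetical GatedAttentionMIL sketch from earlier; the widths are arbitrary illustrative choices, not the paper's exact configurations.

```python
# Assumes the GatedAttentionMIL class from the earlier sketch is in scope.
for hidden in (128, 512, 2048, 4096):
    model = GatedAttentionMIL(in_dim=1024, hidden_dim=hidden)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"hidden_dim={hidden:>4}: {n_params / 1e6:.2f}M parameters")
```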

Figure 4 reveals an interesting trend:
- Random Init (Black): Performance fluctuates and actually gets worse as the model gets larger. The large models are likely overfitting.
- Pretrained (Red): Performance scales positively. As the model grows larger, it utilizes the pretraining better, peaking around 5 million parameters.
This suggests that pretraining “unlocks” the ability to use larger, more expressive MIL architectures that would otherwise be unstable.
Key Result 5: David vs. Goliath
Perhaps the most provocative result involves comparing this supervised pretraining approach against Slide Foundation Models.
Recently, tech giants and large labs have released massive models like GigaPath and CHIEF. These are trained using self-supervised learning on hundreds of thousands of slides. The authors compared their ABMIL model (pretrained on just ~4,000 slides via PC-108) against these giants.

Table 2 shows the results. Surprisingly, the PC-108 model (supervised) often outperformed the foundation models.
- KNN Evaluation: PC-108 representations beat CHIEF on 12/15 tasks and GigaPath on 13/15 tasks.
- Data Efficiency: PC-108 achieved this using <10% of the pretraining data required by the foundation models.
This challenges the prevailing “scale is all you need” narrative. It suggests that supervised pretraining on a diverse, high-quality hierarchical dataset (like PC-108) is incredibly dense with information, potentially more so than self-supervised learning on massive unlabeled datasets.
Mechanism: What exactly is being transferred?
Why does this work? Is the model just learning better feature processing, or is it learning how to look at a slide?
To answer this, the researchers inspected the attention weights. In an MIL model, attention weights determine which patches the model focuses on.
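Reading out those weights is straightforward once the aggregator exposes them, as in the earlier GatedAttentionMIL sketch (the tensors below are placeholders):

```python
import torch

# Assumes the GatedAttentionMIL class from the earlier sketch is in scope.
bag = torch.randn(5000, 1024)                  # placeholder patch embeddings for one slide
coords = torch.randint(0, 100_000, (5000, 2))  # placeholder (x, y) location of each patch

model = GatedAttentionMIL()
model.eval()
with torch.no_grad():
    _, attn = model(bag)                       # (n_patches, 1) attention weights

# Painting `attn` back onto `coords` produces an attention heatmap over the slide;
# here we simply list the 20 most-attended patch locations.
top = attn.squeeze(1).topk(20).indices
print(coords[top])
```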

Figure 6 visualizes this perfectly.
- Frozen (Random init): The model’s attention is scattered and diffuse. It doesn’t know what’s important.
- Frozen (PC-108 init): Even before fine-tuning on the specific task, the model already focuses on the tumor regions (red areas).
- Finetuned: After training, the focus sharpens further.
This confirms that pretraining transfers the aggregation strategy. The model arrives at the new task already knowing that tumor cells and tissue structures are more important than whitespace or background noise.
Furthermore, t-SNE visualizations of the slide-level embeddings show that pretrained models cluster distinct disease subtypes much better than random models, even without task-specific training.
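Such a visualization can be reproduced with off-the-shelf tools once slide embeddings are extracted; the arrays below are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder slide-level embeddings and subtype labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 512))
subtypes = rng.integers(0, 5, size=300)

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=subtypes, s=8, cmap="tab10")
plt.title("t-SNE of slide embeddings (placeholder data)")
plt.show()
```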

Conclusion: A New Standard for Pathology AI?
This paper makes a compelling case for retiring the practice of random initialization in Computational Pathology. The authors show that supervised MIL transfer is:
- Robust: Improves performance across architectures and tasks.
- Efficient: Works well with limited data (few-shot).
- Accessible: Achieves foundation-model-level performance with a fraction of the data and compute.
The implications are broad. By adopting transfer learning, researchers can develop diagnostic tools for rare diseases where data is scarce. It also suggests that the future of CPath might not just lie in bigger self-supervised models, but in smarter, hierarchically supervised datasets that teach models the fundamental “language” of histology.
For the community, the authors have released FEATHER, their pretrained PC-108 ABMIL model. It serves as a drop-in replacement for random initialization, offering an immediate performance upgrade for researchers working on everything from cancer subtyping to mutation prediction.
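In practice, adopting such a checkpoint amounts to loading its weights before fine-tuning instead of starting from random values. The sketch below is hypothetical: the file name, key layout, and dimensions are assumptions for illustration (reusing the GatedAttentionMIL class from earlier), not the actual format of the FEATHER release.

```python
import torch

# Hypothetical example: initialize the aggregator from a pretrained checkpoint.
model = GatedAttentionMIL(in_dim=1024, hidden_dim=512, n_classes=4)
state = torch.load("feather_pc108_abmil.pt", map_location="cpu")  # assumed file name

# Keep the pretrained attention/pooling layers; the task-specific classifier head
# is typically re-initialized for the new label set, so mismatched keys are skipped.
missing, unexpected = model.load_state_dict(state, strict=False)
print("re-initialized or ignored keys:", missing, unexpected)

# ...then fine-tune on the downstream task as usual.
```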