Artificial Intelligence has made massive strides in medical imaging, particularly in segmentation—the process of identifying and outlining boundaries of tumors or organs in scans. However, a persistent shadow hangs over these advancements: bias.

Deep learning models are data-hungry. In clinical practice, data is rarely balanced. We often have an abundance of data from specific demographics (e.g., White patients) or common disease stages (e.g., T2 tumors), but a scarcity of data from minority groups or varying disease severities. When a standard neural network trains on this skewed data, it becomes a “lazy learner.” It optimizes for the majority and fails to generalize to the underrepresented groups. In a medical context, this isn’t just an accuracy problem; it’s an ethical crisis. A model that works perfectly for one patient group but fails for another is unsafe for deployment.

In this post, we are diving into a fascinating research paper, “Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective.” The authors propose a novel framework called Distribution-aware Mixture of Experts (dMoE). What makes this paper unique is that it doesn’t just tweak the data; it reimagines the neural network as a dynamic control system, borrowing heavy-duty concepts from optimal control theory to engineer a fairer model.

The Problem: When Good “Average” Performance Isn’t Good Enough

In medical image segmentation, we usually look at metrics like the Dice score (overlap between the prediction and the ground truth). A model might boast a 90% average Dice score, which sounds fantastic. But if you peel back the layers, you might find it achieves 95% on the majority group and only 60% on a minority group.
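To see how an aggregate score can mask a subgroup failure, here is a toy calculation in Python (the cohort split and per-group scores are illustrative only, not numbers from the paper):

```python
# Hypothetical, imbalanced cohort: the overall Dice looks excellent even though
# the minority group is poorly served.
group_sizes = {"majority": 900, "minority": 100}
group_dice = {"majority": 0.95, "minority": 0.60}

total = sum(group_sizes.values())
overall_dice = sum(group_sizes[g] * group_dice[g] for g in group_sizes) / total
print(f"Overall Dice: {overall_dice:.3f}")  # ~0.915, despite only 0.60 on the minority group
```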

The authors identify two main types of attributes that cause this imbalance:

  1. Demographic Attributes: Race, sex, and age.
  2. Clinical Factors: Disease severity (e.g., tumor staging), which impacts the visual appearance of the anatomy.

Figure 1 illustrating the problem of data imbalance and how dMoE addresses equity. Top: Histograms show skewed distributions for Prostate Cancer (T-stage), Ophthalmology (Race), and Skin Lesions (Age). Middle: Violin plots show how dMoE flattens the performance variance across groups compared to poor equity models.

As shown in Figure 1 above, the top section highlights the stark imbalances in standard datasets. For example, the prostate cancer dataset is dominated by T2 and T3 stages, while T1 and T4 are rare. The middle section visualizes the goal: shifting from “Poor Equity” (where performance varies wildly) to “Improved Equity” (where the model performs consistently across all groups).

The Solution: Distribution-aware Mixture of Experts (dMoE)

To solve this, the researchers turned to the Mixture of Experts (MoE) architecture but added a critical twist: Distribution Awareness.

What is a Mixture of Experts?

In a standard deep learning layer, every input image is processed by the exact same neurons. A Mixture of Experts (MoE) layer is different. It consists of a set of different neural networks (called “experts”) and a “gating network” (or router). For any given input, the router decides which experts are best suited to handle it.

The output of a standard MoE layer is a weighted sum of the chosen experts:

\[
y = \sum_{i=1}^{N} G(x)_i \, E_i(x)
\]

Here, \(G(x)\) is the gating decision and \(E_i(x)\) is the output of the \(i\)-th expert. This allows the model to specialize parts of its network for different types of data.
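As a rough illustration (not the paper's implementation), a dense MoE layer can be sketched in a few lines of PyTorch; the expert design and feature size below are placeholders:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal dense Mixture of Experts: y = sum_i G(x)_i * E_i(x)."""
    def __init__(self, dim: int, num_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # the gating network G

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                     # G(x)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)   # E_i(x), stacked
        return (expert_outs * weights.unsqueeze(-2)).sum(dim=-1)          # weighted sum

layer = MoELayer(dim=64)
y = layer(torch.randn(4, 64))  # output keeps the input's shape, blended from all experts
```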

The “Distribution-aware” Twist

The standard MoE routes based solely on the input image features (\(x\)). The authors argue that to achieve fairness, the router needs to explicitly know the attribute (e.g., “This patient is Stage T4” or “This patient is from the minority demographic”).

They propose dMoE, where the gating network \(G\) is conditioned on the attribute flag (\(attr\)). The new formulation looks like this:

\[
h_{l+1} = \sum_{i=1}^{N} G^{attr}(\tilde{h}_l)_i \, E_i(\tilde{h}_l)
\]

In this equation:

  • \(\tilde{h}_l\) is the image feature input.
  • \(G^{attr}\) is the attribute-specific router.
  • The router selects the top-\(k\) experts (usually 2 out of 8) to process the feature.

This allows the network to dynamically switch “modes” based on who the patient is or how severe their disease is.
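Below is a minimal sketch of what attribute-conditioned routing could look like, assuming the attribute arrives as an integer group label that is embedded and concatenated with the feature before scoring the experts; this is one plausible reading of the mechanism, not the authors' exact code:

```python
import torch
import torch.nn as nn

class DistributionAwareRouter(nn.Module):
    """Sketch of an attribute-conditioned router G^attr (illustrative)."""
    def __init__(self, dim: int, num_experts: int = 8, num_attrs: int = 4):
        super().__init__()
        self.attr_embed = nn.Embedding(num_attrs, dim)  # e.g. T-stage, race group, or age bin
        self.score = nn.Linear(2 * dim, num_experts)

    def forward(self, h: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # The router sees both the image feature and the attribute flag, so the
        # routing decision can change with the sub-population, not just the image.
        z = torch.cat([h, self.attr_embed(attr)], dim=-1)
        return self.score(z)  # raw expert scores; top-k sparsification is applied next

router = DistributionAwareRouter(dim=64)
scores = router(torch.randn(4, 64), torch.tensor([0, 1, 3, 2]))  # (4, 8) expert scores
```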

The Architecture

This dMoE module isn’t a standalone model; it is a building block that can be inserted into existing powerful architectures like Transformers (TransUNet) or CNNs (3D ResUNet).

Figure 2 showing the dMoE schematic. (a) The segmentation network structure where dMoE blocks are inserted. The red box details the dMoE module: Inputs enter a router which uses the attribute flag to select top-k experts. (b) Comparison of control systems: Non-feedback vs. Feedback vs. Mode-switching control.

As seen in Figure 2(a), the attribute flag flows directly into the router networks. The router computes scores for all experts and applies a Noisy Top-K Gating mechanism to keep only the most relevant ones active. The gating function is defined as:

\[
G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(H(x), k)\big), \qquad H(x) = x W_g + \epsilon \odot \mathrm{Softplus}(x W_{noise}), \quad \epsilon \sim \mathcal{N}(0, I)
\]

This ensures that the model focuses its computational power on the experts that have learned to handle that specific distribution of data.
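For reference, the Noisy Top-K gating mentioned above (standard in the sparsely-gated MoE literature) can be sketched as follows; the shapes and the “2 out of 8” setting mirror the description above, but this is not the paper's code:

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2, training=True):
    """Perturb the expert scores with learned noise, keep the top-k, renormalize."""
    scores = x @ w_gate                                    # raw expert scores
    if training:
        noise_std = F.softplus(x @ w_noise)                # input-dependent noise scale
        scores = scores + torch.randn_like(scores) * noise_std
    topk_vals, topk_idx = scores.topk(k, dim=-1)
    sparse = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, topk_vals)
    return torch.softmax(sparse, dim=-1)                   # zero weight outside the top-k

x = torch.randn(4, 64)
w_gate, w_noise = torch.randn(64, 8), torch.randn(64, 8)
gates = noisy_top_k_gating(x, w_gate, w_noise, k=2)        # two nonzero weights per row
```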

The Theory: Neural Networks as Control Systems

One of the most intellectually satisfying parts of this paper is how it frames the fairness problem using Optimal Control Theory. If you aren’t an engineer, this might sound intimidating, but the analogy is quite intuitive.

1. Neural Networks are Dynamical Systems

We can view the layers of a neural network as time steps in a dynamical system. The “state” of the system is the feature map \(h\), which evolves as it passes through layers (time). A standard residual block update looks like this:

\[
h_{t+1} = h_t + f(h_t, u_t)
\]

This looks remarkably like the Euler method for solving Ordinary Differential Equations (ODEs). In continuous time, this is:

\[
\frac{d h_t}{d t} = f(h_t, u_t)
\]

Here, \(u_t\) represents the “control” (the weights/parameters) applied at time \(t\).
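A toy sketch (illustrative only) makes the correspondence explicit: stacking residual updates is the forward-Euler discretization of the dynamics above with a step size of one, where each layer's weights play the role of the control \(u_t\):

```python
import torch
import torch.nn as nn

# Each "layer" applies h_{t+1} = h_t + f(h_t, u_t), i.e. one forward-Euler step
# of dh/dt = f(h, u) with unit step size; u_t is the t-th layer's weights.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

h = torch.randn(1, 16)            # initial state h_0 (the input feature)
for f_t in layers:                # depth plays the role of time
    h = h + torch.tanh(f_t(h))    # Euler update driven by the layer's parameters
```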

2. Training is Optimal Control

Training a neural network is essentially trying to find the best sequence of controls (weights) to minimize the error (loss function) at the end of the process.

\[
\min_{\{u_t\}_{t=0}^{T-1}} \; \Phi(h_T) \quad \text{subject to} \quad h_{t+1} = h_t + f(h_t, u_t), \quad h_0 = x
\]

where \(\Phi\) is the loss evaluated on the final state (the network’s output).

3. From Open-Loop to Feedback Control

Standard neural networks are Non-feedback (Open-loop) systems (see Figure 2b, left). The weights are fixed after training. The network applies the same transformation regardless of the intermediate state.

The Mixture of Experts (MoE) changes this. Because the router looks at the current input \(h_t\) to decide which expert to use, the control \(u_t\) becomes a function of the state. This is Feedback (Closed-loop) Control:

\[
u_t = \kappa(h_t) \quad \Longrightarrow \quad \frac{d h_t}{d t} = f\big(h_t, \kappa(h_t)\big)
\]

This makes the system adaptive. It can adjust its trajectory based on where it currently is in the feature space.
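In code, the contrast between the two regimes looks roughly like this (a hypothetical sketch, not the paper's implementation): the open-loop block applies one fixed transformation, while the closed-loop block computes its routing, and hence its effective control, from the current state.

```python
import torch
import torch.nn as nn

dim, num_experts = 16, 4
h = torch.randn(1, dim)

# Open-loop: the control is fixed in advance and ignores the intermediate state.
fixed_layer = nn.Linear(dim, dim)
h_open = h + fixed_layer(h)

# Closed-loop (MoE): u_t = kappa(h_t), the routing is a function of the state itself.
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
router = nn.Linear(dim, num_experts)
weights = torch.softmax(router(h), dim=-1)
h_closed = h + sum(weights[0, i] * experts[i](h) for i in range(num_experts))
```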

4. dMoE is Mode-Switching Control

The authors take it one step further. In complex mechanical systems (like flight control), a single feedback law isn’t enough. You need different “modes” for takeoff, cruising, and landing.

The dMoE acts as a Mode-Switching Controller. It switches its control strategy based on external environmental variables—in this case, the demographic or clinical attributes (\(attr\)).

\[
u_t = \kappa_{s(attr)}(h_t) \quad \Longrightarrow \quad \frac{d h_t}{d t} = f\big(h_t, \kappa_{s(attr)}(h_t)\big)
\]

Here, \(s(attr)\) determines which policy \(\kappa\) to use. By mathematically proving that the MoE structure approximates this kernel-based control function, the authors provide a rigorous theoretical justification for why dMoE works better for fairness: it literally changes its “operating mode” to suit the sub-population it is looking at.
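One minimal way to picture mode-switching control in code (purely illustrative, with made-up names and sizes) is to keep one feedback policy per mode and let the attribute select which policy drives the state update:

```python
import torch
import torch.nn as nn

dim, num_modes = 16, 4  # e.g. modes could correspond to tumor stages T1-T4
policies = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modes)])  # kappa_1..kappa_K

def mode_switching_step(h: torch.Tensor, attr: int) -> torch.Tensor:
    """One update h <- h + f(h, kappa_{s(attr)}(h)); s(attr) picks the feedback law."""
    kappa = policies[attr]            # hard switch between control policies
    return h + torch.tanh(kappa(h))   # apply the selected policy to the current state

h = torch.randn(1, dim)
h = mode_switching_step(h, attr=3)    # e.g. an advanced-stage case uses the last policy
```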

Experimental Results

Theory is great, but does it work on actual medical scans? The researchers tested dMoE on three datasets:

  1. Harvard-FairSeg (2D): Eye fundus images (Race attribute).
  2. HAM10000 (2D): Skin lesion dermatology images (Age attribute).
  3. Prostate Cancer (3D): CT scans for radiotherapy (Tumor Stage attribute).

They compared dMoE against state-of-the-art fairness methods, including FEBS (Fair Error-Bound Scaling) and standard MoE.

To measure success, they used standard segmentation metrics (Dice Score, IoU) and a fairness-adjusted metric called Equity-Scaled Segmentation Performance (ESSP):

\[
\mathrm{ESSP} = \frac{M}{1 + \Delta}, \qquad \Delta = \sum_{g \in \mathcal{G}} \left| M - M_g \right|
\]

where \(M\) is the overall segmentation metric (Dice or IoU), \(M_g\) is its value on subgroup \(g\), and \(\mathcal{G}\) is the set of subgroups.

This metric penalizes the score if there is a high variance (\(\Delta\)) in performance between different subgroups.
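As a sketch, an equity-scaled score can be computed from per-group results as below, treating the overall score as the unweighted mean of the group scores and \(\Delta\) as the summed absolute deviation of each subgroup from it (my reading of the definition; the paper's exact formulation may weight samples differently):

```python
def equity_scaled_score(group_scores: dict[str, float]) -> float:
    """Overall score divided by (1 + dispersion across subgroups)."""
    overall = sum(group_scores.values()) / len(group_scores)
    delta = sum(abs(overall - s) for s in group_scores.values())  # subgroup dispersion
    return overall / (1 + delta)

# Illustrative numbers only: similar averages, very different equity-scaled scores.
print(equity_scaled_score({"T1": 0.78, "T2": 0.80, "T3": 0.79, "T4": 0.78}))  # ~0.76
print(equity_scaled_score({"T1": 0.70, "T2": 0.90, "T3": 0.85, "T4": 0.65}))  # ~0.55
```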

Result 1: Addressing Racial Bias in Eye Scans

In the Harvard-FairSeg dataset, the “Black” and “Asian” subgroups are often underrepresented or harder to segment.

Table 1: Comparison on 2D Harvard-FairSeg dataset.

Looking at Table 1, standard methods (TransUNet) struggle with the Black subgroup (Dice 0.731). The dMoE approach pushes this to 0.776, a significant leap. It also achieves the highest Equity-Scaled (ES) scores, meaning it improved the minority without sacrificing the performance on the majority (White) group.

Result 2: Addressing Age Bias in Dermatology

For skin lesions, age groups are highly imbalanced.

Table 2: Comparison on 2D HAM10000 dataset.

In Table 2, we see dMoE achieving the highest ES-Dice (0.801) and ES-IoU. It consistently outperforms FEBS, which actually dropped performance in some cases compared to the baseline.

Result 3: Clinical Fairness in 3D Cancer Treatment

This is perhaps the most critical experiment. In prostate cancer, T4 (advanced) tumors are rare but critical to segment correctly for radiation therapy.

Table 3: Comparison on 3D radiotherapy target segmentation.

Table 3 shows massive improvements. For the T4 subgroup, the baseline model scored 0.656. The dMoE model scored 0.778. That is a game-changing improvement for treatment planning in advanced cancer cases.

We can visualize this improvement clearly in Figure 3:

Figure 3: Violin plots and qualitative segmentation results. (a) Violin plots show dMoE maintains high performance across T-stages T1-T4 compared to Baseline and FEBS which drop at T1/T4. (b) Segmentation images show dMoE (far right) closely matching Ground Truth.

In the violin plots (a), notice how the dMoE (far right) keeps the blue “equity line” straight and high across all stages (T1 to T4). The Baseline and FEBS models dip significantly at the edges (T1 and T4). The qualitative images (b) confirm this: look at the T4 row. The dMoE segmentation (green overlay) is much closer to the Ground Truth than the other methods.

Efficiency Analysis

One might ask: “Why not just train separate models for each group?” The authors compared dMoE against training multiple separate networks.

Table 8: Comparison to multiple networks.

As shown in Table 8, dMoE is not only more accurate (ES-Dice 0.499 vs 0.457), but it is also drastically more efficient (1761 GFlops vs 5729 GFlops). By sharing knowledge in the “experts” while routing via the “gating network,” dMoE gets the best of both worlds: specialization and shared learning.

Conclusion & Implications

The Distribution-aware Mixture of Experts (dMoE) represents a significant step forward in ethical AI for healthcare. By stepping back and viewing the neural network through the lens of Control Theory, the authors identified that addressing fairness isn’t just about feeding the model more data—it’s about giving the model the mechanisms to adapt its strategy based on the context (demographics or disease state).

Key Takeaways:

  1. Context Matters: Feeding the attribute (race, age, stage) into the model’s routing layer allows for dynamic adaptation.
  2. Theory Drives Practice: The mode-switching control interpretation explains why MoE structures are effective for heterogeneous data.
  3. No Compromise: dMoE improves performance on minority groups without degrading the majority, easing the usual “equity-accuracy trade-off.”

While the study focused on single attributes, the future lies in handling intersectional biases (e.g., Age and Race combined). As AI continues to integrate into clinical workflows, frameworks like dMoE will be essential to ensure that these powerful tools serve every patient equitably, not just the “average” one.