Artificial Intelligence has made massive strides in healthcare, particularly in diagnosing sleep disorders. Automated Sleep Stage Classification (SSC) using EEG signals is becoming faster and more accurate than manual scoring by human experts. However, a lingering problem remains in high-stakes medical AI: Trust.
When a neural network diagnoses a patient, it typically acts as a “black box.” It spits out a probability (e.g., “90% chance of Stage N2 sleep”), but it rarely tells us why it thinks that, nor does it honestly admit when it is confused by conflicting data.
In this deep dive, we will explore a new framework proposed in the paper “Trusted Multi-View Classification with Expert Knowledge Constraints” (TM-CEK). This research introduces a fascinating approach that combines expert domain knowledge (using classic signal processing theory) with uncertainty estimation (knowing when to say “I don’t know”).
If you are a student of machine learning or biomedical engineering, this paper offers a masterclass in how to move beyond simple accuracy metrics and towards building AI systems that doctors can actually trust.
1. The Core Problem: Accuracy vs. Trustworthiness
Before dissecting the solution, we must understand the specific limitations of current Multi-View Learning (MVL) models in sleep staging.
The Context: Multi-View Learning
Sleep diagnosis uses Polysomnography (PSG), which involves multiple signals: EEG (brain waves), EOG (eye movements), and EMG (muscle activity). Even within a single EEG channel, data can be viewed from different “perspectives”:
- Time Domain: The raw wave signal over time.
- Frequency Domain: How much energy exists at different frequencies (spectrograms).
MVL combines these views to make a decision. While accurate, standard MVL suffers from two main issues:
- Feature-Level Opacity: Deep neural networks usually learn abstract, non-linear features that make no sense to a human. A doctor cannot look at a convolutional filter and say, “Ah, yes, that is detecting a K-complex.”
- Decision-Level Overconfidence: Standard models often use a Softmax output, which forces probabilities to sum to 1. Even if the model is fed pure noise, it can output a prediction with 99% confidence simply because one class scored slightly higher than the others.
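To see how easily Softmax manufactures confidence, here is a minimal sketch (my own illustration, not code from the paper) that pushes random logits through a softmax:

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    z = logits - logits.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Logits produced from pure noise -- no sleep stage is genuinely supported.
rng = np.random.default_rng(0)
noise_logits = rng.normal(scale=3.0, size=5)   # 5 stages: W, N1, N2, N3, REM

probs = softmax(noise_logits)
print(probs.round(3))
print("predicted stage:", int(probs.argmax()), "confidence:", float(probs.max().round(3)))
```

Because the probabilities must sum to 1, one class ends up dominating even though the input carried no real information about any of them.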
The “Evidence” Paradox
Recent advances in Evidential Deep Learning (EDL) attempt to solve the confidence issue by treating the output not as a single probability, but as a distribution of beliefs (Dirichlet distribution).
However, the authors of TM-CEK identify a critical flaw in current EDL methods: they calculate uncertainty based primarily on the magnitude (quantity) of evidence, assuming that if the network finds lots of features, it must be certain.
But what if the evidence is conflicting? Imagine a jury.
- Scenario A: 12 jurors all vote “Guilty.” (High Evidence, Low Uncertainty).
- Scenario B: 6 jurors vote “Guilty,” 6 vote “Innocent.” (High Evidence, but logically, this should be High Uncertainty).
Existing mathematical frameworks often treat Scenario B as having “high confidence” simply because 12 people voted. This paper proposes a fix: we must look at the distribution of the evidence, not just the amount.
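A quick numerical version of the jury analogy, using the standard evidential uncertainty \(u = K/S\) that the paper revisits in Innovation 2 (this is my own illustration, not the authors' code):

```python
import numpy as np

def magnitude_only_uncertainty(evidence):
    """Standard EDL-style uncertainty: u = K / S, with S the total Dirichlet strength."""
    K = len(evidence)
    S = np.sum(evidence + 1.0)     # evidence plus a uniform prior of 1 per class
    return K / S

scenario_a = np.array([12.0, 0.0])   # unanimous jury: 12 "Guilty", 0 "Innocent"
scenario_b = np.array([6.0, 6.0])    # split jury: 6 vs 6 -- conflicting evidence

print(magnitude_only_uncertainty(scenario_a))   # 2 / 14 ≈ 0.143
print(magnitude_only_uncertainty(scenario_b))   # 2 / 14 ≈ 0.143 -- identical!
```

Both juries supply the same total evidence, so a magnitude-only rule cannot tell a unanimous verdict from a hung one.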
2. The Proposed Framework: TM-CEK
The researchers propose TM-CEK (Trusted Multi-view Classification Constrained with Expert Knowledge). The architecture is designed to handle the two problems mentioned above: transparency and true uncertainty estimation.
Let’s look at the high-level architecture:

As shown in Figure 1, the model processes the input EEG signal through two parallel branches:
- Top Branch (Time Domain): Processes the raw sequence \(X_t\). Note the “Gabor Layer” at the start—this is key to interpretability.
- Bottom Branch (Frequency Domain): Processes the Short-Time Fourier Transform (STFT) \(X_f\).
These two branches produce “Evidence” (\(e^1\) and \(e^2\)), which are then fused to form a final opinion. Let’s break down the innovations in these steps.
Innovation 1: Embedding Expert Knowledge (The Gabor Layer)
In a standard Convolutional Neural Network (CNN), the filters (kernels) are initialized randomly. They eventually learn to detect edges or shapes, but we can’t control what they learn.
In sleep medicine, experts already know what features matter. They look for specific waveforms:
- Delta waves: 1–4 Hz (Deep sleep)
- Alpha waves: 8–13 Hz (Relaxed wakefulness)
- Spindles: 15–18 Hz bursts (Stage N2)
The authors replace the first standard convolutional layer with a Gabor Convolutional Layer. A Gabor function is well suited mathematically to capturing these frequency-localized events: it is essentially a cosine wave wrapped in a Gaussian (bell-curve) envelope.
The equation for the Gabor kernel \(K_G\) is:

Here, the network learns the parameters \(u\) (center), \(\sigma\) (width), and \(f\) (frequency). Instead of learning random weights, the network optimizes these specific parameters to fit the EEG data.
The output of this layer is the convolution of the kernel and the raw signal \(X_t\):

Why does this matter? Because the kernels are mathematically constrained to be Gabor functions, we can visualize them after training and see exactly what frequencies the model is focusing on. This bridges the gap between “Deep Learning Magic” and “Medical Reality.”
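To make the idea concrete, here is a minimal NumPy sketch of a 1-D Gabor kernel with the three named parameters; the paper's exact parameterization and normalization may differ, and the sampling rate and values below are assumptions for illustration:

```python
import numpy as np

def gabor_kernel(length, u, sigma, f, fs=100.0):
    """1-D Gabor kernel: a cosine at frequency f (Hz) inside a Gaussian envelope.

    length : kernel size in samples
    u      : center of the Gaussian envelope (seconds)
    sigma  : width of the envelope (seconds)
    f      : carrier frequency (Hz)
    fs     : sampling rate of the EEG (Hz); 100 Hz is typical for Sleep-EDF
    """
    t = np.arange(length) / fs
    envelope = np.exp(-((t - u) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * f * (t - u))
    return envelope * carrier

# A kernel centered on ~2 Hz acts like a delta/slow-wave detector.
kernel = gabor_kernel(length=101, u=0.5, sigma=0.15, f=2.0)

# Convolving it with a 30-second epoch (3000 samples at 100 Hz) gives the
# first feature map of the time-domain branch.
x_t = np.random.randn(3000)
feature_map = np.convolve(x_t, kernel, mode="same")
print(feature_map.shape)    # (3000,)
```

After training, each kernel's \((u, \sigma, f)\) triple can be read off directly, which is what makes the kernel visualizations discussed in Section 4 possible.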
Innovation 2: Distribution-Aware Uncertainty
This is the theoretical heavy lifting of the paper.
In standard Evidential Deep Learning, uncertainty (\(u\)) is calculated as \(u = K / S\), where \(K\) is the number of classes and \(S\) is the sum of all evidence plus the prior. As evidence (\(S\)) goes up, uncertainty (\(u\)) goes down.
The Problem: The authors show that adding noise to a signal can change the distribution of evidence without changing its sum significantly.

As seen in Figure 2, in standard approaches the uncertainty distributions for “Normal” (clean) data and “Noisy” data overlap significantly; the model does not recognize that the noisy data is unreliable.
The Solution: The authors introduce a Distribution-Aware Subjective Opinion. They incorporate the Gini coefficient (a measure of inequality often used in economics) to measure how “sharp” or “flat” the evidence distribution is.
They redefine the belief mass (\(b_k\)) and uncertainty (\(u\)) as follows:

- \(Gini(e)\): Calculates the spread of the evidence.
- \(d\): A new “concentration” parameter derived from the Gini coefficient.
- \(u\): Now depends on \(d\).
The Logic: If the evidence is spread flatly across all classes (conflicting evidence), the Gini coefficient is low. This lowers \(d\), which keeps the uncertainty \(u\) high—even if the total amount of evidence is large. This solves the “Jury Scenario B” problem we discussed earlier.
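The paper's exact formulas are not reproduced here, but the following sketch captures the mechanism under my own simplified assumptions: derive a concentration factor \(d\) from the Gini coefficient and use it to discount flat, conflicting evidence.

```python
import numpy as np

def gini(e, eps=1e-8):
    """Gini coefficient of an evidence vector: 0 = perfectly flat, (K-1)/K = all in one class."""
    e = np.asarray(e, dtype=float)
    pairwise = np.abs(e[:, None] - e[None, :]).sum()
    return pairwise / (2.0 * len(e) * e.sum() + eps)

def distribution_aware_opinion(e):
    """Toy distribution-aware belief/uncertainty (a simplification, not the paper's exact rule).

    Flat (conflicting) evidence -> low Gini -> small d -> high uncertainty,
    even when the total amount of evidence is large.
    """
    e = np.asarray(e, dtype=float)
    K = len(e)
    d = gini(e) * K / (K - 1)      # rescale so d lies in [0, 1]
    S = (d * e).sum() + K          # effective Dirichlet strength
    b = d * e / S                  # belief mass per class
    u = K / S                      # uncertainty; note b.sum() + u == 1
    return b, u

print(distribution_aware_opinion(np.array([12.0, 0.0])))  # concentrated: u ≈ 0.14
print(distribution_aware_opinion(np.array([6.0, 6.0])))   # conflicting:  u == 1.0
```

With the same total evidence as before, the hung-jury case now keeps its uncertainty at the maximum, which is exactly the behavior the authors are after.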
3. Trusted Fusion: Combining the Views
Once the Time branch and Frequency branch have generated their own opinions (Beliefs \(b\) and Uncertainty \(u\)), they must be combined.
The paper uses Dempster’s rule of combination, adapted for this new distribution-aware logic. The goal is to merge the opinions such that:
- If both views agree and are certain, the final certainty increases.
- If one view is uncertain, the system relies on the trusted view.
- If views conflict, the uncertainty should reflect that.
The fusion rule is defined as:

The specific calculations for the fused belief (\(b^{1\diamond2}\)) and fused uncertainty (\(u^{1\diamond2}\)) are:

Notice the term \(u^{1\diamond2} = \frac{2 u^{1} u^{2}}{u^{1} + u^{2}}\). This behaves like a harmonic mean. If one view has very low uncertainty (e.g., \(u^1 \approx 0\)), the combined uncertainty drops significantly.
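Here is a small sketch of how such a fusion step might look; the harmonic-mean uncertainty comes straight from the formula above, while the belief combination is my own Dempster-style simplification rather than the paper's exact rule:

```python
import numpy as np

def fuse_opinions(b1, u1, b2, u2):
    """Toy two-view fusion (a simplification of the paper's rule).

    Uncertainty uses the harmonic mean quoted in the text:
        u = 2 * u1 * u2 / (u1 + u2)
    Beliefs combine agreement and belief-times-uncertainty terms, then are
    rescaled so that beliefs and uncertainty again sum to 1.
    """
    b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
    u = 2.0 * u1 * u2 / (u1 + u2 + 1e-12)
    raw = b1 * b2 + b1 * u2 + b2 * u1              # unnormalized fused beliefs
    b = raw / (raw.sum() + 1e-12) * (1.0 - u)      # leave room for the uncertainty mass
    return b, u

# View 1 (time domain) is fairly sure the epoch is N2; view 2 (frequency) is unsure.
b_time, u_time = np.array([0.05, 0.05, 0.70, 0.05, 0.05]), 0.10
b_freq, u_freq = np.array([0.10, 0.10, 0.20, 0.10, 0.10]), 0.40

b_fused, u_fused = fuse_opinions(b_time, u_time, b_freq, u_freq)
print(b_fused.round(3), round(u_fused, 3))   # the fused opinion leans on the confident view
```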
This fusion can be extended to any number of views:

The Loss Function
To train this beast, the authors cannot use standard Cross-Entropy loss alone. They need a loss function that encourages the accumulation of evidence for the correct class while minimizing evidence for incorrect classes.
They use an Adjusted Cross-Entropy Loss (\(\mathcal{L}_{ace}\)) derived from the properties of the Dirichlet distribution:

To prevent the model from becoming overconfident too early (a common issue where the distribution collapses to a single point), they add a KL-Divergence regularization term. This forces the predicted distribution to remain close to a uniform distribution (high uncertainty) unless there is strong evidence to the contrary:

The final loss combines accuracy, view-consistency (making sure different views don’t wildly disagree without reason), and the KL-regularization:

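As a rough sketch of the evidential part of this objective (standard EDL forms under my own assumptions; the paper's exact variant, weights, and view-consistency term are not reproduced here):

```python
import numpy as np
from scipy.special import digamma, gammaln

def adjusted_ce_loss(evidence, y):
    """Evidential 'adjusted' cross-entropy for one sample (standard EDL form).

    evidence : non-negative evidence per class
    y        : one-hot label
    """
    alpha = evidence + 1.0              # Dirichlet parameters
    S = alpha.sum()
    return np.sum(y * (digamma(S) - digamma(alpha)))

def kl_to_uniform(evidence, y):
    """KL( Dir(alpha_tilde) || Dir(1) ): penalizes evidence placed on wrong classes.

    alpha_tilde keeps only the misleading evidence, a common EDL construction.
    """
    alpha = evidence + 1.0
    alpha_t = y + (1.0 - y) * alpha
    S_t, K = alpha_t.sum(), len(alpha_t)
    return (gammaln(S_t) - gammaln(K) - gammaln(alpha_t).sum()
            + np.sum((alpha_t - 1.0) * (digamma(alpha_t) - digamma(S_t))))

# One 5-class example: most evidence lands on N2 (index 2), which is the true label.
e = np.array([0.5, 0.2, 8.0, 0.3, 0.1])
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

lam = 0.1   # KL weight, typically annealed up from 0 during training (value here is illustrative)
print(round(adjusted_ce_loss(e, y) + lam * kl_to_uniform(e, y), 4))
```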
4. Experiments and Results
The team tested TM-CEK on three widely used public datasets: Sleep-EDF 20, Sleep-EDF 78, and SHHS (Sleep Heart Health Study).
Performance Comparison
First, does it work? Yes. The method outperforms state-of-the-art baselines, including DeepSleepNet and AttnSleep.

Looking at Table 1, TM-CEK achieves the highest accuracy (Acc) and Macro F1 scores (MF1) across all datasets. For example, on Sleep-EDF 20, it hits 85.0% accuracy, beating the closest competitor (DFSC) by roughly 0.6%. While that sounds small, in medical diagnostics, consistent marginal gains are difficult to achieve.
Where does it struggle? (Confusion Matrix)
Transparency means knowing your weaknesses. The confusion matrices below show where the model excels and fails.

- Strengths: The model is excellent at detecting Wake (W) and Deep Sleep (N3), with accuracies often exceeding 90%.
- Weaknesses: Like almost all sleep algorithms, it struggles with N1 (Stage 1). You can see in the matrix that N1 is often confused with Wake or N2. This is expected; N1 is a transitional stage that is difficult even for humans to score consistently.
Robustness to Noise
This is the true test of “Trusted” learning. The researchers took the test data and added random Gaussian noise to see if the model would fall apart.

Figure 7 is telling. The Blue Line with circles (Trusted Acc) remains much higher than the Light Blue dotted line (No-Trusted Acc) as noise (\(\delta\)) increases.
- No-Trusted Model: As noise hits \(\delta=50\), accuracy crashes.
- TM-CEK: Accuracy degrades much more gracefully.
More importantly, the uncertainty estimation works.

In Figure 6, look at plot (d) (\(\sigma=100\)). The Red curve (Noisy data) has shifted far to the right compared to the Blue curve (Normal data). This means the model knows it is looking at noise and is reporting high uncertainty. Standard models would often leave these two distributions overlapping.
Visualizing the Expert Knowledge
Finally, did the Gabor layer actually learn meaningful brain waves?

Figure 5 visualizes the learned kernels.
- Kernel 8 and 17: These learned low-frequency, high-amplitude shapes. These correspond perfectly to Slow Waves and Delta Waves found in deep sleep.
- Kernel 5 and 25: These learned higher frequencies, likely corresponding to Theta waves or sleep spindles.
This shows that, by constraining the first layer to Gabor functions, the model naturally “discovered” the same biomarkers that doctors have used for decades.
We can even quantify which kernels matter most for the final decision:

Figure 9 shows the “Efficiency” of different kernels. Notice how specific kernels light up (Orange/Red) for specific classes. This provides a “feature-level explanation” for every decision the model makes.
5. Conclusion and Key Takeaways
The TM-CEK paper represents a significant step forward in making medical AI safe and understandable. It tackles the “Black Box” problem from two angles:
- Input: By using Gabor filters, it forces the neural network to “speak the language” of sleep experts (frequencies and waveforms) rather than abstract vectors.
- Output: By using Distribution-Aware Uncertainty, it ensures the model isn’t fooled by conflicting evidence, providing a safety net when data is noisy or ambiguous.
For students and researchers, the takeaway is clear: Accuracy isn’t the only metric that matters. In safety-critical fields, how a model learns and how well it knows its own limits are just as important as getting the right answer.