In the world of machine learning, data rarely comes from a single source. Imagine a doctor diagnosing a patient: they don’t just look at a blood test. They look at X-rays, MRI scans, patient history, and genetic markers. This is Multi-View Data—different perspectives of the same underlying object.
To make sense of this data without human labels, we use Multi-View Clustering (MVC). The goal is to group similar data points together by synthesizing information from all these different views. It is a powerful tool used in everything from bioinformatics to computer vision.
However, there is a hidden danger in clustering: Bias.
Traditional clustering algorithms often latch onto dominant features to group data. Unfortunately, these “dominant” features are frequently sensitive attributes like gender, race, or age. If a bank uses clustering to determine creditworthiness, and the algorithm groups people primarily by gender rather than financial history, the result is discrimination.
Today, we are diving deep into a new research paper: “Deep Fair Multi-View Clustering with Attention KAN” (DFMVC-AKAN). This paper proposes a cutting-edge solution that doesn’t just improve clustering accuracy—it ensures fairness using a novel architecture based on Kolmogorov-Arnold Networks (KAN).
The Core Problem: Accuracy vs. Fairness
Existing Deep MVC methods are great at handling complex data, but they often struggle with a specific trade-off:
- The Fairness Gap: Most methods ignore sensitive attributes. If the data contains bias, the model amplifies it.
- The Complexity Trap: Existing solutions that do try to be fair often rely on standard Multi-Layer Perceptrons (MLPs) or CNNs. These architectures can struggle to capture highly complex, nonlinear relationships across different views without becoming massive and inefficient.
- The “Equal” Fallacy: Some fairness methods force clusters to have perfectly equal numbers of protected groups (e.g., 50% men and 50% women in every cluster). While well-intentioned, this rigid constraint often destroys the clustering accuracy because it ignores the natural distribution of the data.
DFMVC-AKAN solves these problems by combining three powerful concepts:
- Kolmogorov-Arnold Networks (KAN): A mathematically superior alternative to MLPs for function approximation.
- Hybrid Attention: To dynamically focus on the most important features.
- Distribution Alignment: A flexible way to enforce fairness without breaking the clustering structure.
Let’s break down how this architecture works.
The Architecture of DFMVC-AKAN
At a high level, the framework consists of three main modules working in harmony.

As shown in Figure 1, the process is split into parallel streams for each view.
- Attention KAN Learning Module: Extracts robust features from each view (View 1…View v).
- View-Contrastive Module: Ensures that different views of the same object agree on which cluster it belongs to.
- Fair Clustering Module: Fuses the views and applies a fairness constraint to ensure no sensitive attribute dominates a cluster.
Let’s dissect these modules one by one.
1. The Attention KAN Learning Module
The first challenge is extracting good features. The authors replace the traditional dense layers found in most deep learning models with a KAN-based encoder.
Why KAN? The Kolmogorov-Arnold representation theorem states that any multivariate continuous function can be represented as a superposition of continuous univariate functions. While MLPs approximate functions using fixed activation functions on neurons, KANs learn the activation functions on the edges (weights). This allows them to model complex nonlinear relationships more efficiently.
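For reference, the theorem states that any continuous function \(f: [0,1]^n \to \mathbb{R}\) can be written as a superposition of univariate functions:

```latex
f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \psi_{q,p}(x_p)\right)
```

KANs turn this representation into an architecture by making the inner univariate functions \(\psi\) learnable.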
Step 1: The Hybrid Attention Mechanism

Before the KAN layers process the data, the model needs to know what to look at. The authors introduce a hybrid attention mechanism combining Squeeze-and-Excitation (SE) and Multi-Head Attention.
First, the SE block recalibrates the features to emphasize informative channels:
\[
\mathbf{s} = \sigma\big(\mathbf{W}_2\,\delta(\mathbf{W}_1 \mathbf{z})\big), \qquad \tilde{\mathbf{z}} = \mathbf{s} \odot \mathbf{z}
\]
Here, \(\sigma\) is the sigmoid function and \(\delta\) is ReLU. This essentially helps the model decide which feature channels are “loudest” and most important.
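To make the recalibration concrete, here is a minimal NumPy sketch of an SE-style block. This is an illustration of the mechanism, not the authors' implementation; the shapes and the reduction ratio are assumptions:

```python
import numpy as np

def se_block(z, w1, w2):
    """Squeeze-and-Excitation style recalibration (illustrative sketch).

    z  : (d,) feature vector for one sample
    w1 : (r, d) reduction weights (r < d, the "squeeze")
    w2 : (d, r) expansion weights (the "excitation")
    """
    relu = lambda x: np.maximum(x, 0.0)            # delta in the text
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigma in the text
    s = sigmoid(w2 @ relu(w1 @ z))                 # per-channel gates in (0, 1)
    return s * z                                   # reweight each feature channel

rng = np.random.default_rng(0)
z = rng.normal(size=8)
w1 = rng.normal(size=(2, 8))
w2 = rng.normal(size=(8, 2))
out = se_block(z, w1, w2)
print(out.shape)  # (8,)
```

Because each gate lies strictly between 0 and 1, the block can only attenuate channels; "loud" channels are the ones the gates leave mostly intact.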
Next, Multi-Head Attention captures the relationships between features. It projects the SE output into Query, Key, and Value spaces (represented as A, B, and C matrices here):
\[
\mathbf{A} = \tilde{\mathbf{Z}}\mathbf{W}^{Q}, \qquad \mathbf{B} = \tilde{\mathbf{Z}}\mathbf{W}^{K}, \qquad \mathbf{C} = \tilde{\mathbf{Z}}\mathbf{W}^{V}
\]
The attention output for a specific head is computed by normalizing these projections:
\[
\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{\mathbf{A}_h \mathbf{B}_h^{\top}}{\sqrt{d_k}}\right)\mathbf{C}_h
\]
Finally, the outputs of all heads are concatenated and projected back:
\[
\mathrm{MHA}(\tilde{\mathbf{Z}}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,\mathbf{W}^{O}
\]
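A minimal NumPy sketch of the multi-head step (illustrative only; the head count, dimensions, and weight shapes are assumptions, and the paper's exact projections may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    """Minimal multi-head self-attention over features (sketch).

    x : (n, d) SE-recalibrated features; wq/wk/wv/wo : (d, d) projections.
    """
    n, d = x.shape
    dk = d // n_heads
    # Project into query (A), key (B), value (C) spaces, then split into heads.
    A = (x @ wq).reshape(n, n_heads, dk).transpose(1, 0, 2)  # (h, n, dk)
    B = (x @ wk).reshape(n, n_heads, dk).transpose(1, 0, 2)
    C = (x @ wv).reshape(n, n_heads, dk).transpose(1, 0, 2)
    scores = A @ B.transpose(0, 2, 1) / np.sqrt(dk)          # (h, n, n)
    heads = softmax(scores) @ C                               # (h, n, dk)
    concat = heads.transpose(1, 0, 2).reshape(n, d)           # concatenate heads
    return concat @ wo                                        # final projection

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w = [rng.normal(size=(8, 8)) for _ in range(4)]
out = multi_head_attention(x, *w, n_heads=2)
print(out.shape)  # (5, 8)
```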
The model then combines the SE output and the Multi-Head output using a learnable parameter \(\alpha\). This gives the model the flexibility to balance between channel-wise importance and feature-to-feature relationships:
\[
\mathbf{F} = \alpha\,\tilde{\mathbf{Z}} + (1 - \alpha)\,\mathrm{MHA}(\tilde{\mathbf{Z}})
\]
Step 2: The KAN Layer

Now that the features are "attended" to, they pass through the Kolmogorov-Arnold Network layers. Unlike a standard neuron that sums inputs and applies a fixed ReLU or Sigmoid, the KAN layer applies a learnable non-linear function \(\psi\) to each input dimension before summing them up.
\[
\mathrm{KAN}(\mathbf{x})_j = \sum_{i=1}^{n} \psi_{j,i}(x_i)
\]
This structure allows the encoder to approximate extremely complex, nonlinear inter-view relationships that standard networks might miss.
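To ground the idea, here is a toy KAN-style layer in NumPy. Real KANs parameterize each edge function with B-splines; this sketch uses a small polynomial basis purely for illustration and is not the paper's implementation:

```python
import numpy as np

class ToyKANLayer:
    """Toy KAN-style layer: a learnable univariate function psi on every
    edge (input i -> output j), here parameterized by a small polynomial
    basis instead of the B-splines used in practice."""

    def __init__(self, n_in, n_out, degree=3, seed=0):
        rng = np.random.default_rng(seed)
        # coef[j, i, k]: coefficient of x**k for the edge (i -> j)
        self.coef = rng.normal(scale=0.1, size=(n_out, n_in, degree + 1))

    def __call__(self, x):
        # basis[i, k] = x_i ** k, evaluated per input dimension
        powers = np.arange(self.coef.shape[-1])
        basis = x[:, None] ** powers                  # (n_in, degree + 1)
        # Apply psi_{j,i} to each input, then sum over inputs -- note there
        # is no fixed activation on the node, unlike an MLP neuron.
        return np.einsum('jik,ik->j', self.coef, basis)

layer = ToyKANLayer(n_in=4, n_out=3)
y = layer(np.array([0.5, -1.0, 0.2, 0.8]))
print(y.shape)  # (3,)
```

The key contrast with an MLP is visible in `__call__`: the non-linearity lives on the edges (one learnable function per input-output pair), and the node merely sums.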
To ensure the encoder learns meaningful features, the model includes a decoder to reconstruct the original input from the latent representation \(\mathbf{z}\):
\[
\hat{\mathbf{x}}^{(v)} = g^{(v)}\big(\mathbf{z}^{(v)}\big)
\]
The reconstruction loss ensures that we haven’t lost critical information during compression:
\[
L_r = \sum_{v=1}^{V} \big\lVert \mathbf{X}^{(v)} - \hat{\mathbf{X}}^{(v)} \big\rVert_F^2
\]
2. The View-Contrastive Module
In multi-view clustering, consistency is key. If View 1 (e.g., the image of a cat) thinks the object belongs to “Cluster A,” but View 2 (e.g., the caption “cute kitten”) thinks it belongs to “Cluster B,” the model is confused.
The View-Contrastive Module enforces Semantic Consistency.
First, the model predicts a cluster assignment probability \(\mathbf{H}\) for each view:
\[
\mathbf{h}_i^{(v)} = \mathrm{softmax}\big(\mathbf{W}\,\mathbf{z}_i^{(v)} + \mathbf{b}\big)
\]
We then calculate the similarity between the assignment vectors of the same sample across different views. A high dot product means both views agree on the cluster assignment.
\[
s\big(\mathbf{h}_i^{(u)}, \mathbf{h}_i^{(v)}\big) = \frac{\mathbf{h}_i^{(u)\top}\mathbf{h}_i^{(v)}}{\lVert \mathbf{h}_i^{(u)} \rVert\,\lVert \mathbf{h}_i^{(v)} \rVert}
\]
The model uses a contrastive loss function. It treats the same sample from different views as a “positive pair” (they should be similar) and different samples as “negative pairs” (they should be pushed apart).
The loss function maximizes the similarity of positive pairs relative to all other pairs:
\[
L_{c1} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{u \neq v}\log\frac{\exp\big(s(\mathbf{h}_i^{(u)}, \mathbf{h}_i^{(v)})/\tau\big)}{\sum_{j \neq i}\exp\big(s(\mathbf{h}_i^{(u)}, \mathbf{h}_j^{(v)})/\tau\big)}
\]
By minimizing this loss (\(L_{c1}\)), the model forces the different views to align semantically. To prevent the trivial solution where the model dumps everything into a single cluster, a regularization term (\(L_{c2}\)) is added to encourage a spread across clusters.
\[
L_{c2} = \sum_{v=1}^{V}\sum_{j=1}^{K} \bar{h}_j^{(v)} \log \bar{h}_j^{(v)}, \qquad \bar{h}_j^{(v)} = \frac{1}{N}\sum_{i=1}^{N} h_{ij}^{(v)}
\]
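The view-contrastive idea can be sketched in a few lines of NumPy. This is a simplified NT-Xent-style loss between two views; the paper's exact loss, temperature, and similarity measure may differ:

```python
import numpy as np

def view_contrastive_loss(h1, h2, tau=0.5):
    """NT-Xent-style contrastive loss between the cluster-assignment
    vectors of two views (a simplified sketch, not the paper's exact L_c1).

    h1, h2 : (n, k) soft cluster assignments for the same n samples.
    """
    # Cosine similarity between every pair of samples across the two views.
    a = h1 / np.linalg.norm(h1, axis=1, keepdims=True)
    b = h2 / np.linalg.norm(h2, axis=1, keepdims=True)
    sim = a @ b.T / tau                    # (n, n) similarity matrix
    # Positive pairs sit on the diagonal (same sample, different view);
    # every other entry in the row acts as a negative.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

h1 = np.eye(3)  # three samples with perfectly one-hot assignments
print(view_contrastive_loss(h1, h1))                      # views agree: low loss
print(view_contrastive_loss(h1, np.roll(h1, 1, axis=0)))  # views disagree: higher
```

When the two views assign each sample to the same cluster, the diagonal dominates its row and the loss shrinks, which is exactly the alignment pressure described above.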
3. The Fair Clustering Module
This is the crown jewel of the paper. We have robust features (KAN) and consistent views (Contrastive), but we still need to ensure the clustering is fair.
First, the view-specific features are fused into a unified representation \(\mathbf{Z}\) using learnable weights \(a_v\). This lets the model trust reliable views more than noisy ones.
\[
\mathbf{Z} = \sum_{v=1}^{V} a_v\,\mathbf{Z}^{(v)}, \qquad a_v \ge 0, \quad \sum_{v=1}^{V} a_v = 1
\]
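A minimal sketch of the weighted fusion, assuming the weights \(a_v\) come from a softmax over learnable logits (that parameterization is an assumption, not the paper's stated form):

```python
import numpy as np

def fuse_views(zs, logits):
    """Fuse view-specific embeddings with learnable weights a_v (sketch).

    zs     : list of (n, d) embeddings, one per view
    logits : (V,) unconstrained parameters; the softmax keeps the weights
             positive and summing to one, so reliable views can dominate.
    """
    a = np.exp(logits) / np.exp(logits).sum()
    return sum(w * z for w, z in zip(a, zs))

rng = np.random.default_rng(0)
zs = [rng.normal(size=(4, 2)) for _ in range(3)]
Z = fuse_views(zs, np.array([2.0, 0.0, 0.0]))  # first view weighted most
print(Z.shape)  # (4, 2)
```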
Soft Assignments & The Target Distribution

The model computes the probability of sample \(i\) belonging to cluster \(j\) using the Student’s t-distribution (a standard technique in deep clustering). Let’s call this distribution \(\mathbf{Q}\).
\[
q_{ij} = \frac{\big(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_j \rVert^2\big)^{-1}}{\sum_{j'} \big(1 + \lVert \mathbf{z}_i - \boldsymbol{\mu}_{j'} \rVert^2\big)^{-1}}
\]
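The soft-assignment step is straightforward to sketch in NumPy (the standard DEC-style computation):

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    """Student's t soft cluster assignment Q, as in DEC-style deep clustering.

    z  : (n, d) fused embeddings
    mu : (k, d) cluster centroids
    """
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)       # heavy-tailed kernel
    return q / q.sum(axis=1, keepdims=True)                # rows sum to 1

rng = np.random.default_rng(0)
Q = soft_assign(rng.normal(size=(5, 2)), rng.normal(size=(3, 2)))
print(Q.shape)  # (5, 3)
```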
In a standard clustering algorithm, we would just sharpen this distribution and use it as a target. But DFMVC-AKAN modifies the target distribution \(\mathbf{P}\) to enforce fairness.
The goal is to prevent any cluster from being dominated by a sensitive subgroup (e.g., a cluster composed entirely of men). The authors define a target distribution \(\mathbf{P}\) that normalizes frequencies based on the sensitive subgroups (\(X_g\)).
\[
p_{ij} = \frac{q_{ij}^{2} \Big/ \sum_{i' \in X_g} q_{i'j}}{\sum_{j'} \Big( q_{ij'}^{2} \Big/ \sum_{i' \in X_g} q_{i'j'} \Big)}, \qquad i \in X_g
\]
Look closely at the denominator inside the fraction: \(\sum_{i' \in X_g}\). This term normalizes the assignment probabilities by the size of the sensitive group. If a group is overrepresented in a cluster, this term grows, shrinking the target probability and discouraging the model from putting more of that group into that cluster.
The fairness loss is simply the KL-divergence between the model’s prediction \(\mathbf{Q}\) and this balanced target \(\mathbf{P}\).
\[
L_f = \mathrm{KL}(\mathbf{P} \,\Vert\, \mathbf{Q}) = \sum_{i=1}^{N}\sum_{j=1}^{K} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
By minimizing this loss, the model gently steers the clustering assignments toward a distribution that is balanced across sensitive attributes, without requiring hard, rigid constraints.
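The following sketch illustrates the general idea of a group-normalized target distribution followed by a KL fairness loss. The exact construction of \(\mathbf{P}\) here is an assumption made for illustration, not the paper's formula:

```python
import numpy as np

def fair_target(q, groups):
    """Group-normalized target distribution P (a sketch of the idea only).

    Within each sensitive group, sharpen Q as in DEC but normalize by that
    group's per-cluster mass, so a group that is overrepresented in a
    cluster gets its target probability shrunk there."""
    p = np.empty_like(q)
    for g in np.unique(groups):
        idx = groups == g
        w = q[idx] ** 2 / q[idx].sum(axis=0, keepdims=True)  # group-wise freq
        p[idx] = w / w.sum(axis=1, keepdims=True)            # renormalize rows
    return p

def fairness_loss(q, p, eps=1e-12):
    """KL(P || Q) -- the paper's L_f, up to the exact definition of P."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(3), size=8)        # soft assignments, rows sum to 1
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # binary sensitive attribute
p = fair_target(q, groups)
print(p.shape)  # (8, 3)
```

Minimizing `fairness_loss` pulls \(\mathbf{Q}\) toward the group-balanced target instead of imposing a hard quota per cluster.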
The Final Objective
The total loss function is a weighted sum of the three components we just discussed:
- Reconstruction (\(L_r\)): Keep the data real.
- Contrastive (\(L_c\)): Keep the views consistent.
- Fairness (\(L_f\)): Keep the results unbiased.
\[
L = L_r + \lambda_1 L_c + \lambda_2 L_f
\]
Experiments and Results
Does this complex architecture actually work? The researchers tested DFMVC-AKAN on four datasets containing sensitive attributes: Bank Marketing, Zafar, Credit Card, and Law School.

They measured performance using two metrics:
- NMI (Normalized Mutual Information): Measures clustering accuracy. Higher is better.
- BAL (Balance): Measures fairness. Higher is better.
The Results Table

The results in Table 2 are striking.
- Accuracy: DFMVC-AKAN (bottom row) achieves the highest NMI across almost all datasets. For example, on the Zafar dataset, it hits 99.98% NMI, compared to 93.93% for the next best method (DFMVC).
- Fairness: Crucially, it does this while maintaining or improving the Balance (BAL) score. On the Bank Marketing dataset, it achieves a balance of 42.52, beating the previous best of 42.16.
This proves that the “trade-off” between accuracy and fairness is not a hard rule—with the right architecture, you can improve both.
Visualizing the “Fairness”
To really see what’s happening, we can look at t-SNE visualizations. These plots show the high-dimensional data projected down to 2D dots.

- Left (Raw Features): Look at the Bank Marketing plot (top left). The blue and orange points (representing Marital Status) are distinctly separated. The data is naturally biased; a standard algorithm would easily split these into two clusters based solely on marriage.
- Right (Fairness Features): Look at the plot after DFMVC-AKAN processing (top right). The blue and orange points are thoroughly mixed. The model has learned a representation where the sensitive attribute (marriage) is no longer the defining feature, yet the data structure is preserved for the clustering task.
Does it Converge?
Complex models with multiple loss functions can sometimes be unstable. However, the convergence plots show that DFMVC-AKAN is well-behaved.

As seen in Figure 4, both the pre-training loss (a) and contrastive loss (b) drop rapidly and stabilize near zero, indicating efficient learning.
Ablation Study: Do we need all the parts?
You might wonder, “Do we really need the Fairness module? Or the Semantic module?” The authors tested this by removing parts of the model.

- Excl. Fairness (\(L_f\)): Removing the fairness module causes the BAL score to drop significantly (e.g., from 42.52 down to 41.59 on Bank Marketing). The model becomes biased.
- Excl. Semantic (\(L_c\)): Removing the contrastive module causes the accuracy (NMI) to crash (e.g., from 80.46 down to 59.73 on Bank Marketing). The model loses track of the object’s identity across views.
This confirms that every component of DFMVC-AKAN is essential.
Conclusion and Takeaways
The DFMVC-AKAN paper represents a significant step forward in ethical AI. It tackles the difficult problem of multi-view clustering by moving away from standard MLPs and embracing the mathematical power of Kolmogorov-Arnold Networks.
Key Takeaways:
- KANs are powerful: Replacing MLPs with KANs allows for better capture of nonlinear relationships in multi-view data.
- Attention matters: The hybrid attention mechanism ensures the model focuses on relevant features rather than noise.
- Fairness is an optimization problem: By treating fairness as a distribution alignment task rather than a hard constraint, we can remove bias without destroying clustering performance.
As AI systems become more integrated into society—screening loans, diagnosing patients, and filtering job applicants—methods like DFMVC-AKAN will be crucial in ensuring these systems are not just smart, but also fair.