Introduction: The Challenge of Seeing and Hearing

Imagine watching a basketball game on TV. You see players dribbling and shooting, but you hear the commentator’s voice, the roar of the crowd, and maybe a faint squeak of sneakers. For a machine, understanding this scene is incredibly complex. The visual cues (basketball, cheering fans) and the dominant audio cues (speech, cheering) don’t always align perfectly. How can an AI learn to focus on the right signals when vision and sound tell slightly different stories?

A diagram showing a 10-second video clip of a basketball game with separate video, audio, and combined audio-visual event labels. The labels highlight the mismatch between what is seen (basketball) and what is heard (speech, cheering).

Figure 1: A 10-second basketball clip illustrating divergence between video labels (“basketball,” “cheering,” “clapping”) and audio labels (“speech,” “cheering”).

This mismatch highlights a central challenge in audio-visual learning—building models that can perceive and understand real-world scenes by integrating multiple sensory inputs. Humans handle this naturally; AI models, less so. While powerful pre-trained models for vision (like Swin Transformer) and audio (like HTS-AT) exist, fine-tuning them for every new task is computationally expensive.

Recent advances have introduced adapters—small, trainable modules inserted into large, frozen models—to achieve parameter-efficient fine-tuning (PEFT). But most such designs use cross-modal adapters that always fuse information across modalities. This can backfire: forcing sound and vision to interact when they’re not truly correlated can introduce noise and confusion.

The paper “Mixture of Experts for Audio-Visual Learning” proposes a more adaptable approach called Audio-Visual Mixture of Experts (AVMoE). It leverages a “Mixture of Experts” (MoE) scheme in which different adapters act as specialized experts—some focusing on merging modalities, others refining each one independently. A dynamic router decides which expert’s judgment matters most, enabling the model to intelligently adapt to the scenario.

This article explores how AVMoE works, why it matters, and what its experimental results reveal about the future of multimodal learning. We’ll cover:

  • The concepts behind parameter-efficient adapters and the Mixture of Experts strategy.
  • AVMoE’s architecture: its router and dual adapters.
  • Key results across audio-visual tasks from event localization to question answering.
  • Insights from ablation studies and visualizations that explain its success.

Background: Adapters and Mixture of Experts

Adapters: Efficient Fine-Tuning for Large Models

Large pre-trained models such as the Swin Transformer (vision) and HTS-AT (audio) provide remarkably rich representations of visual or auditory information—but retraining all their parameters for every new task is wasteful. Adapters solve this by freezing the bulk of the backbone and inserting small bottleneck modules (a few new layers) that learn task-specific transformations. This yields efficiency gains, as only a few percent of the total parameters are trained, while the backbone’s knowledge remains intact.
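
To make the idea concrete, here is a minimal PyTorch sketch of a bottleneck adapter; the class name, hidden size `dim`, and bottleneck width are illustrative choices rather than any specific model's configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal bottleneck adapter: down-project, non-linearity, up-project,
    with a residual connection so the frozen backbone's features pass through."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # only these small layers are trained

# Typical usage: freeze the backbone, then train only the inserted adapters.
# for p in backbone.parameters():
#     p.requires_grad = False
```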

Mixture of Experts: Divide and Conquer

The Mixture of Experts (MoE) framework adds another layer of intelligence. Instead of relying on a single generalist network, multiple sub-networks—“experts”—handle specialized types of data or tasks. A gating or routing module determines which experts to “consult” for a given input and how to weight their outputs.

Think of it as a team of specialists:

  • Experts provide distinct perspectives (e.g., one excels at audio fusion, another at visual reasoning).
  • Router assigns weights dynamically, deciding who contributes most.

The result is a model that scales efficiently and adapts dynamically, activating only relevant experts at a time.
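
A dense-gating sketch of this idea, assuming every expert is evaluated and mixed by a softmax router (production MoE layers often route sparsely to only the top-k experts), might look like this; the layer sizes and expert count are illustrative.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy Mixture of Experts: a router scores every expert per token and the
    final output is the weighted sum of all expert outputs."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)               # (batch, tokens, experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, tokens, dim, experts)
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)          # weighted combination
```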

AVMoE merges these two concepts: adapters become the experts, and the router orchestrates their cooperation.


The Core Method: Inside the Audio-Visual Mixture of Experts (AVMoE)

AVMoE injects dynamic expert modules into frozen audio and visual transformers, enabling flexible modality handling. As illustrated below, it combines frozen backbones for audio and visual processing with trainable AVMoE modules that include routers and specialized adapters.

An overview of the AVMoE architecture, showing parallel visual and audio transformer encoders. Trainable AVMoE modules are inserted into the frozen backbones, and a final fusion module uses a router to weigh the outputs of a Cross-Modal Adapter and a Unimodal Adapter.

Figure 2: AVMoE architecture integrating trainable adapter experts into frozen pre-trained vision and audio backbones.

The Router: Allocating Expert Responsibility

The router is a lightweight Multi-Layer Perceptron that decides the weighting between adapters.

Given the concatenated audio-visual tokens \( i_t \), the router produces softmax-normalized weights for the two experts:

\[ w_{\text{CMA}} = \frac{\exp(r_{\text{CMA}}(i_t))}{\exp(r_{\text{CMA}}(i_t)) + \exp(r_{\text{UA}}(i_t))}, \quad w_{\text{UA}} = \frac{\exp(r_{\text{UA}}(i_t))}{\exp(r_{\text{CMA}}(i_t)) + \exp(r_{\text{UA}}(i_t))} \]

Here \( w_{\text{CMA}} \) and \( w_{\text{UA}} \) determine how much attention is paid to the Cross-Modal Adapter (CMA) and Unimodal Adapter (UA). If modalities align well, \( w_{\text{CMA}} \) increases; if noise or mismatch is detected, \( w_{\text{UA}} \) gains priority.

To encourage exploration during training, Gaussian noise is added to the router’s gating scores \( g \):

\[ g' = g + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2) \]

This prevents the router from collapsing into a single expert preference, enabling balanced utilization.
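
The snippet below sketches how such a two-expert router could be wired up in PyTorch; `cma_out` and `ua_out` stand in for the two adapters' outputs, and the pooling step and noise scale `sigma` are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoExpertRouter(nn.Module):
    """Weights the Cross-Modal Adapter (CMA) and Unimodal Adapter (UA) with a
    softmax over two gating scores; Gaussian noise is added during training."""
    def __init__(self, dim: int, sigma: float = 0.1):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 2))
        self.sigma = sigma

    def forward(self, tokens, cma_out, ua_out):
        g = self.gate(tokens.mean(dim=1))             # pool tokens -> one score pair (batch, 2)
        if self.training:
            g = g + self.sigma * torch.randn_like(g)  # exploration noise on the gating scores
        w = torch.softmax(g, dim=-1)                  # w[:, 0] = w_CMA, w[:, 1] = w_UA
        return w[:, 0, None, None] * cma_out + w[:, 1, None, None] * ua_out
```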

The Experts: Two Complementary Adapters

AVMoE employs two expert types with distinct roles—one to fuse across modalities, one to refine within them.

A comparison of the Cross-Modal Adapter (CMA) and Unimodal Adapter (UA) architectures. The CMA includes a feature fusion step, while the UA uses a self-attention step, highlighting their different purposes.

Figure 3: Adapter architectures—Cross-Modal Adapter (left) integrates features across modalities; Unimodal Adapter (right) focuses on single-modality self-attention.

1. Cross-Modal Adapter (CMA): Encouraging Collaboration

CMA facilitates inter-modal fusion through three stages (see the sketch after this list):

  1. Token Compression: Using cross-attention to reduce modality-specific tokens into compact latent summaries. \[ S_{a}^{l} = f_{c}(L_{a}^{l}, X_{a}^{l}, X_{a}^{l}), \quad S_{v}^{l} = f_{c}(L_{v}^{l}, X_{v}^{l}, X_{v}^{l}) \]
  2. Feature Fusion: Combining audio summaries with visual tokens and vice versa. \[ X^{l}_{av} = f_c(X^{l}_{a}, S^{l}_{v}, S^{l}_{v}), \quad X^{l}_{va} = f_c(X^{l}_{v}, S^{l}_{a}, S^{l}_{a}) \]
  3. Bottleneck Refinement: Applying lightweight projection and activation layers for discriminative final features. \[ Z_{av}^{l} = \theta^{up}(\sigma(\theta^{down}(X_{av}^{l}))), \quad Z_{va}^{l} = \theta^{up}(\sigma(\theta^{down}(X_{va}^{l}))) \]
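
A compact sketch of these three stages, built from PyTorch's `nn.MultiheadAttention`, might look as follows; the shared attention modules, number of latent tokens, and bottleneck width are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Sketch of the CMA: compress each modality with learnable latent tokens,
    fuse each modality with the other's summary, then refine with a bottleneck."""
    def __init__(self, dim: int, num_latents: int = 8, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.latents_a = nn.Parameter(torch.randn(num_latents, dim))  # L_a
        self.latents_v = nn.Parameter(torch.randn(num_latents, dim))  # L_v
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.down, self.act, self.up = nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)

    def forward(self, x_a, x_v):  # audio / visual tokens: (batch, tokens, dim)
        b = x_a.size(0)
        # 1. Token compression: latents attend to each modality's full token set.
        s_a, _ = self.compress(self.latents_a.expand(b, -1, -1), x_a, x_a)
        s_v, _ = self.compress(self.latents_v.expand(b, -1, -1), x_v, x_v)
        # 2. Feature fusion: each modality's tokens attend to the other's summary.
        x_av, _ = self.fuse(x_a, s_v, s_v)
        x_va, _ = self.fuse(x_v, s_a, s_a)
        # 3. Bottleneck refinement.
        return self.up(self.act(self.down(x_av))), self.up(self.act(self.down(x_va)))
```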

2. Unimodal Adapter (UA): Preserving Autonomy

UA is designed for intra-modal reasoning, i.e., refining each modality independently when cross-modal fusion is detrimental (e.g., silent visual scenes). It replaces the CMA’s cross-attention with within-modality self-attention, so each modality’s tokens attend only to that modality’s own compressed summary:

\[ X_{a}^{l} = f_{s}(X_{a}^{l}, S_{a}^{l}, S_{a}^{l}), \quad X_{v}^{l} = f_{s}(X_{v}^{l}, S_{v}^{l}, S_{v}^{l}) \]
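
Under the same assumptions as the CMA sketch, the Unimodal Adapter can be sketched analogously, with attention kept inside a single modality and the same bottleneck refinement; module names and sizes remain illustrative.

```python
import torch
import torch.nn as nn

class UnimodalAdapter(nn.Module):
    """Sketch of the UA: tokens attend only to their own modality's compressed
    summary (no cross-modal mixing), followed by a bottleneck refinement."""
    def __init__(self, dim: int, num_latents: int = 8, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.down, self.act, self.up = nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)

    def forward(self, x):  # tokens from a single modality: (batch, tokens, dim)
        s, _ = self.compress(self.latents.expand(x.size(0), -1, -1), x, x)
        x, _ = self.attend(x, s, s)  # within-modality attention only
        return self.up(self.act(self.down(x)))
```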

Together, CMA and UA serve complementary roles—fusion and independence. The router dynamically determines their mixture per scenario.


Experimental Results: How AVMoE Performs

The authors evaluated AVMoE on four demanding tasks—Audio-Visual Event Localization (AVE), Audio-Visual Video Parsing (AVVP), Audio-Visual Segmentation (AVS), and Audio-Visual Question Answering (AVQA)—against leading baselines (LAVisH, DG-SCT).

Audio-Visual Event Localization (AVE)

Goal: Identify and localize events audible and visible in a video.

Table comparing AVMoE with other methods on the AVE task. AVMoE achieves the highest accuracy (82.6%) with fewer total parameters than its closest competitor, DG-SCT.

Table 1: AVMoE achieves 82.6% accuracy, outperforming LAVisH and DG-SCT with fewer trainable parameters.

AVMoE consistently beats rivals across multiple backbones. Using distinct vision (Swin-V2-L) and audio (HTS-AT) encoders, it attains 82.6% accuracy—superior to DG-SCT’s 82.2%—despite fewer trainable parameters. This reflects its efficient expert allocation.

Audio-Visual Video Parsing (AVVP)

Goal: Parse videos into event segments labeled as audible, visible, or both.

Table showing AVMoE’s superior performance on the AVVP task for both segment-level and event-level metrics. It achieves a Type F-score of 58.8 and an Event F-score of 59.0 at the segment level.

Table 2: AVMoE leads with 59.0 event-level F-score and 58.8 type F-score, surpassing DG-SCT by nearly 2%.

Since AVVP data often includes mismatched modalities, AVMoE’s capacity to rely selectively on a single modality is crucial. Its higher F-scores indicate that dynamic routing mitigates cross-modal interference.

Audio-Visual Segmentation (AVS)

Goal: Segment the pixels of objects emitting sounds in a video frame.

Table comparing AVMoE on the AVS task. AVMoE achieves the highest scores, especially on the more complex multi-sound source (MS3) setting, with a Jaccard index of 54.5 and an F-score of 68.7.

Table 3: AVMoE surpasses DG-SCT in segmentation metrics, excelling in multi-source (MS3) scenarios.

In multi-source segmentation, AVMoE’s expert flexibility shines: it delivers a 68.7% F-score versus DG-SCT’s 64.2%, and the qualitative comparison below shows noticeably more precise masks.

Qualitative comparison of AVS results. AVMoE (“Ours”) produces more accurate and complete segmentation masks for sounding objects compared to DG-SCT, especially in multi-sound scenarios.

Figure 4: AVMoE correctly isolates sounding objects—excluding silent cars and producing cleaner contours—while DG-SCT over-segments.

Audio-Visual Question Answering (AVQA)

Goal: Answer multimodal questions requiring joint audio-visual reasoning.

Table showing AVMoE’s state-of-the-art performance on the AVQA task, outperforming all other methods across audio, visual, and combined audio-visual question types.

Table 4: AVMoE achieves state-of-the-art 75.7% average accuracy and outperforms DG-SCT even on the complex Audio-Visual category.

AVMoE’s adaptability enhances high-level reasoning. Its routing between CMA and UA empowers nuanced context understanding, improving accuracy across all question types.


Ablation and Insight: Why AVMoE Works

Expert Diversity Enhances Learning

Table from the ablation study showing that increasing the number of CMA and UA experts consistently improves performance across all tasks.

Table 5: Increasing the number of experts yields consistent performance gains on AVS, AVQA, and AVE tasks.

Adding more adapters—hence more “specialists”—leads to improved results. Even minimal configurations outperform prior adapter-only designs like LAVisH, validating MoE’s modular efficiency.

Resilience to Missing Modalities

Table comparing AVMoE and DG-SCT on visual-only data. AVMoE shows significantly less performance degradation, demonstrating its robustness to missing audio information.

Table 6: When tested on vision-only inputs, AVMoE maintains high performance while DG-SCT suffers major drops.

AVMoE remains robust even when an entire modality (e.g., audio) is removed at test time, thanks to its unimodal adapters. This property is vital for real-world deployments where sensor data may be missing or unreliable.

Visualizing Expert Activation

Heatmaps showing the activation probability of experts. When processing visual-only data, the router increases its activation of unimodal adapters (experts #3 and #4) in the visual branch.

Figure 7: Router activation shifts toward unimodal experts when processing visual-only inputs—evidence of true dynamic routing.

Heatmaps of expert activations reveal how the router intelligently transitions its reliance based on input type. Under visual-only data, unimodal adapters dominate—mirroring intuitive, human-like sensory prioritization.

Learning More Discriminative Features

t-SNE plots comparing features learned by a baseline model (“Original”) and AVMoE (“Ours”). The features from AVMoE form tighter, more distinct clusters for different event categories.

Figure 8: AVMoE’s learned embeddings show tighter clustering and clearer separation across classes on AVE and AVS tasks.

Feature visualizations using t-SNE show more compact intra-class and distinct inter-class clusters, indicating that AVMoE better organizes multimodal representations.


Conclusion: The Future of Flexible Multimodal Intelligence

The “Mixture of Experts for Audio-Visual Learning” study introduces AVMoE—an approach built around flexibility, efficiency, and robustness. By combining unimodal and cross-modal adapters under a dynamic router, AVMoE overcomes the limitations of static multimodal fusion.

Key takeaways:

  • Flexibility Matters: AVMoE adapts intelligently to modality mismatch, noise, or absence, selecting the right combination of experts for each case.
  • Efficiency Without Sacrifice: With modest trainable parameters, AVMoE achieves state-of-the-art performance across four major tasks.
  • Robustness for the Real World: Its graceful degradation under missing data makes it ideal for deployed multimodal systems.

In an era where AI systems increasingly interact with diverse sensory inputs, AVMoE demonstrates that specialization and dynamic coordination are fundamental to mastering the richness of real-world perception.