In the world of Natural Language Processing (NLP), extracting structured knowledge—like relations between entities—from unstructured text is a well-established field. We have sophisticated models that can read a sentence like “Steve Jobs co-founded Apple” and extract the triplet (Steve Jobs, Founder, Apple).

But what about speech? A massive amount of human knowledge is exchanged via podcasts, meetings, phone calls, and news broadcasts. Historically, extracting relations from speech (SpeechRE) has been treated as a two-step pipeline: transcribe the audio to text using Automatic Speech Recognition (ASR), and then run a text-based relation extraction model. While functional, this approach is prone to “error propagation”: if the ASR mishears an entity name, the downstream relation extraction has no way to recover it.

Recent research has pushed for End-to-End SpeechRE, where a model listens to audio and directly outputs structured knowledge. However, this comes with two massive challenges:

  1. Data Scarcity: Most datasets use synthetic (Text-to-Speech) audio, which lacks the nuance and noise of real human speech.
  2. The Modality Gap: Speech encoders (processing continuous audio waves) and text decoders (generating discrete tokens) operate in fundamentally different feature spaces.

In this post, we will deconstruct a recent paper, “Multi-Level Cross-Modal Alignment for Speech Relation Extraction,” which proposes a novel architecture called MCAM (Multi-level Cross-modal Alignment Model). The authors introduce a clever “teacher-student” training framework that aligns speech and text at the token, entity, and sentence levels, effectively teaching a speech model to “think” like a text model.

The Background: Why SpeechRE is Hard

To understand why MCAM is necessary, we first need to look at the limitations of previous approaches.

In a typical end-to-end SpeechRE model, you have a speech encoder (like wav2vec 2.0) connected to a text decoder (like BART). The problem is that these two components are pre-trained on different data types. A simple connection, such as a length adapter (which shortens the audio sequence to match text length), is often insufficient. The semantic information in the speech features doesn’t naturally map to the semantic space the text decoder expects.
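To make this concrete, here is a minimal sketch of what such a length-adapter connection might look like: a couple of strided 1-D convolutions that shrink the speech feature sequence, followed by a projection into the decoder's hidden space. The layer counts and dimensions here are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    """Illustrative length adapter: downsample speech features and project
    them into the text decoder's hidden space. Dimensions are assumptions."""

    def __init__(self, speech_dim=768, text_dim=768, kernel=3, stride=2):
        super().__init__()
        # Two strided convolutions shrink the sequence length by roughly 4x.
        self.conv1 = nn.Conv1d(speech_dim, speech_dim, kernel, stride, padding=1)
        self.conv2 = nn.Conv1d(speech_dim, speech_dim, kernel, stride, padding=1)
        self.proj = nn.Linear(speech_dim, text_dim)

    def forward(self, speech_feats):             # (batch, T_speech, speech_dim)
        x = speech_feats.transpose(1, 2)          # Conv1d expects (batch, dim, T)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = x.transpose(1, 2)                     # back to (batch, T', speech_dim)
        return self.proj(x)                       # (batch, T', text_dim)

# e.g. ~100 frames of wav2vec 2.0 features -> ~25 adapted frames
adapter = LengthAdapter()
adapted = adapter(torch.randn(1, 100, 768))
print(adapted.shape)                              # torch.Size([1, 25, 768])
```

Even with such an adapter in place, the adapted features still live in a space shaped by acoustic pre-training, which is exactly the mismatch the alignment techniques below try to address.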

The authors of this paper conducted a preliminary study to see how existing alignment techniques from other fields (like Speech Translation) fared in Relation Extraction.

Figure 1: Illustration of the baseline (Wu et al., 2022) and its variants.

As shown in Figure 1, there are a few ways to connect these components:

  • (a) Baseline: A simple length adapter (L-Adapter).
  • (b) Token-level Alignment: Using Connectionist Temporal Classification (CTC) to align audio frames to specific tokens.
  • (c) Sentence-level Alignment: Compressing the audio into a global vector and aligning it with a text summary vector.

The researchers found something interesting when they tested these existing methods on the CoNLL04 dataset:

Table 1: Performance of the baseline (LNA-ED) and its variations on the CoNLL04 test set.

As Table 1 shows:

  1. Token-level alignment improved Entity Recognition (ER) but didn’t help much with understanding the relations between them.
  2. Sentence-level alignment helped Relation Prediction (RP) but actually hurt Entity Recognition.

Why did this happen? The authors discovered that standard CTC-based alignment tends to overfit on high-frequency tokens (like “the”, “a”, “is”), leading to a collapse in feature quality for rare entity words.

Figure 2: Token frequencies in the CoNLL04 training set vs. CTC greedy decoding.

Figure 2 illustrates this collapse. The red line shows the token distribution generated by the CTC module. It drops off sharply compared to the actual training data (blue line), indicating that the model is ignoring the rich, low-frequency words that usually make up named entities.

Conversely, sentence-level compression loses the fine-grained details necessary to identify specific entities. It seems we can’t just choose one; we need a method that aligns speech and text at every level.

The Solution: Multi-Level Cross-Modal Alignment (MCAM)

The proposed model, MCAM, is designed to bridge the modality gap comprehensively. The architecture uses a Speech Encoder (wav2vec 2.0) to process input, an Alignment Adapter to translate those features into a text-compatible space, and a Text Decoder (BART) to generate the relation triplets.

The “secret sauce” lies in how the model is trained. During training, the authors introduce a Text Encoder (BART encoder) that acts as a teacher. The model learns to align the speech representation to the text representation provided by this teacher.

Figure 3: The overall architecture of the MCAM model.

As illustrated in Figure 3, the Alignment Adapter works at three distinct levels. Let’s break them down.

Level 1: Token-Level Alignment

The goal here is to align the speech feature sequence (\(\mathbf{H}_s\)) with the text feature sequence (\(\mathbf{H}_t\)).

Standard approaches align speech features to static word embeddings. However, the meaning of a token changes based on context. MCAM improves on this by calculating alignment scores based on the contextualized output of the Text Encoder.

The model uses a Convolutional Neural Network (CNN) to downsample the long speech sequence. It then computes a similarity matrix between the speech features and the text features from the current batch, and applies a CTC loss to force the speech features to align monotonically with the text tokens.

By using dynamic text features (from the Text Encoder) rather than static embeddings, the model avoids overfitting to high-frequency function words, preserving the semantic richness needed for entities.
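A rough sketch of how such a contextualized CTC objective could be implemented is shown below. Here the CTC “vocabulary” is the set of contextual text-token features of the current sentence plus a blank symbol, rather than a static embedding table. The function name, the single-sentence simplification (the paper describes batch-level similarities), and the learnable blank embedding are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contextual_ctc_loss(speech_feats, text_feats, blank_embed):
    """Hypothetical token-level alignment loss: CTC over a 'vocabulary' made of
    the contextual text features of the current sentence (plus a blank).

    speech_feats: (T_s, d)  downsampled speech features for one utterance
    text_feats:   (T_t, d)  contextual features from the text encoder (teacher)
    blank_embed:  (d,)      learnable embedding standing in for the CTC blank
    """
    # Class prototypes = [blank] + contextual token features of this sentence.
    prototypes = torch.cat([blank_embed.unsqueeze(0), text_feats], dim=0)  # (T_t + 1, d)

    # Similarity of every speech frame to every prototype -> CTC emission scores.
    logits = speech_feats @ prototypes.t()                 # (T_s, T_t + 1)
    log_probs = F.log_softmax(logits, dim=-1)              # per-frame distributions

    # The CTC target is simply the token positions 1..T_t (class 0 is the blank).
    targets = torch.arange(1, text_feats.size(0) + 1)      # (T_t,)

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    return ctc(
        log_probs.unsqueeze(1),                            # (T_s, batch=1, classes)
        targets.unsqueeze(0),                              # (batch=1, T_t)
        input_lengths=torch.tensor([speech_feats.size(0)]),
        target_lengths=torch.tensor([text_feats.size(0)]),
    )
```

Because the prototypes are recomputed from the text encoder for every sentence, the same word gets a different target feature in different contexts, which is the property the authors rely on to avoid the frequency collapse seen in Figure 2.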

Level 2: Entity-Level Alignment

Aligning every token is good, but Relation Extraction relies heavily on specific entities. The model needs to know exactly which segment of the audio corresponds to “Steve Jobs” or “Apple.”

The challenge is that we don’t naturally know the timestamps of entities in the audio. Existing methods use external aligners (which add errors) or complex transport algorithms. MCAM proposes a simpler mechanism: Window-based Attention.

Since alignment is generally monotonic (time flows forward), the model assumes the speech corresponding to an entity token exists within a local window relative to its position. The model calculates the speech feature for an entity token \(\mathbf{h}_{i}^{(s)}\) using an attention mechanism restricted to that window:

Equation for Entity-Level speech feature extraction.

Here, the model looks at the text feature \(\mathbf{H}_t[i]\) and finds the most relevant speech segments within a window \([s:e]\).
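The exact formulation is in the paper; a plausible reconstruction, using the text feature as the query and a softmax restricted to the window \([s:e]\), is:

\[
\mathbf{h}_{i}^{(s)} \;=\; \sum_{j=s}^{e} w_{ij}\, \mathbf{H}_s[j],
\qquad
w_{ij} \;=\; \frac{\exp\big(\mathbf{H}_t[i]^{\top} \mathbf{H}_s[j]\big)}{\sum_{k=s}^{e} \exp\big(\mathbf{H}_t[i]^{\top} \mathbf{H}_s[k]\big)}
\]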

Once the specific speech features for entities are extracted, the authors create a Mixed Feature Sequence (\(\mathbf{H}_m\)). They take the original text sequence and replace the text features of the entities with these extracted speech features.

The model is then trained to minimize the difference (KL Divergence) between the decoder’s output when given the pure text features versus the mixed features. This forces the speech representation of “Apple” to be functionally identical to the text representation of “Apple” in the eyes of the decoder.
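Conceptually, the procedure looks something like the sketch below, where `decoder` stands in for the BART decoder returning per-token logits; the helper signature and the teacher/student arrangement (text branch frozen as the teacher) are assumptions on my part.

```python
import torch
import torch.nn.functional as F

def entity_level_kl(decoder, text_feats, speech_entity_feats, entity_positions, labels):
    """Hypothetical sketch: build the mixed sequence by swapping the text
    features at entity positions for the speech features extracted via
    window-based attention, then match the decoder's output distributions."""
    mixed_feats = text_feats.clone()
    for pos, speech_vec in zip(entity_positions, speech_entity_feats):
        mixed_feats[pos] = speech_vec                      # inject speech entity features

    with torch.no_grad():                                  # text branch acts as the teacher
        text_logits = decoder(text_feats, labels)
    mixed_logits = decoder(mixed_feats, labels)

    # KL(teacher || student) over the output vocabulary.
    return F.kl_div(
        F.log_softmax(mixed_logits, dim=-1),
        F.softmax(text_logits, dim=-1),
        reduction="batchmean",
    )
```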

Level 3: Sentence-Level Alignment

Finally, the model needs to understand the global context to predict the relationship between entities. Does the sentence say entity A founded entity B, or acquired entity B?

To capture this, MCAM uses a Semantic Compression Layer. Instead of a single “sentence vector,” the model uses \(R\) learnable query vectors (where \(R\) is the number of relation types). These queries “scan” the input features (both speech and text) via attention mechanisms to create global representations.

Equation for Semantic Compression.

This results in global features \(\mathbf{G}_s\) (from speech) and \(\mathbf{G}_t\) (from text).
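With \(\mathbf{Q} \in \mathbb{R}^{R \times d}\) denoting the learnable relation queries, a plausible form of the compression step is simply cross-attention from the queries to the sequence features:

\[
\mathbf{G}_s = \mathrm{Attention}(\mathbf{Q}, \mathbf{H}_s, \mathbf{H}_s),
\qquad
\mathbf{G}_t = \mathrm{Attention}(\mathbf{Q}, \mathbf{H}_t, \mathbf{H}_t)
\]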

However, compressing everything into a vector loses detail. To solve this, the authors treat these global vectors as Soft Prompts. They prepend these global vectors to the original sequence features.

The alignment is enforced using a Contrastive Loss (\(\mathcal{L}_{CL}\)). This objective pulls the global speech representation closer to its corresponding text representation while pushing it away from unrelated representations.

Equation for Contrastive Loss.
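A standard InfoNCE-style formulation, which matches the description above (the paper's exact version may differ in details such as the similarity function), is:

\[
\mathcal{L}_{CL} \;=\; -\frac{1}{B}\sum_{i=1}^{B} \log
\frac{\exp\!\big(\mathrm{sim}(\mathbf{G}_s^{(i)}, \mathbf{G}_t^{(i)}) / \tau\big)}
{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(\mathbf{G}_s^{(i)}, \mathbf{G}_t^{(j)}) / \tau\big)}
\]

where \(B\) is the batch size, \(\mathrm{sim}\) is a similarity function such as cosine similarity, and \(\tau\) is a temperature.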

This ensures that the “gist” of the spoken sentence matches the “gist” of the written sentence.

Training the Model

The training process is a multi-task learning objective. The primary goal is, of course, to generate the correct text (the relation triplets). The model uses Cross-Entropy (\(\mathcal{L}_{CE}\)) loss for the generation task, applied to the text, mixed, and speech sequences.

Equation for Cross-Entropy Loss.
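In its usual sequence-to-sequence form, and assuming the three terms are simply summed, the loss for a target triplet sequence \(Y = (y_1, \dots, y_{|Y|})\) would look like:

\[
\mathcal{L}_{CE} \;=\; -\sum_{\mathbf{H} \in \{\mathbf{H}_t,\, \mathbf{H}_m,\, \mathbf{H}_s\}} \;\sum_{k=1}^{|Y|} \log P\big(y_k \mid y_{<k}, \mathbf{H}\big)
\]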

In addition to generation and the Contrastive Loss mentioned above, the model uses KL Divergence to distill knowledge from the text modality to the speech modality.

Equation for KL Divergence Loss.
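One plausible reconstruction, with the text-conditioned decoder distribution acting as the teacher for the mixed sequence (an analogous term may be applied to the speech sequence \(\mathbf{H}_s\)), is:

\[
\mathcal{L}_{KL} \;=\; \sum_{k=1}^{|Y|} \mathrm{KL}\Big(P\big(y_k \mid y_{<k}, \mathbf{H}_t\big) \,\Big\|\, P\big(y_k \mid y_{<k}, \mathbf{H}_m\big)\Big)
\]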

The final loss function combines all these objectives, balanced by hyperparameters \(\alpha\) and \(\beta\).

Equation for Total Loss.
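Taking \(\alpha\) as the weight of the contrastive term and \(\beta\) as the weight of the CTC term (as in the hyperparameter study below), a plausible composition consistent with the text is:

\[
\mathcal{L} \;=\; \mathcal{L}_{CE} + \mathcal{L}_{KL} + \alpha\,\mathcal{L}_{CL} + \beta\,\mathcal{L}_{CTC}
\]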

This composite loss ensures that the model isn’t just memorizing data, but is actively learning to align the modalities at every granular level (token, entity, and sentence).

Experiments and Results

To rigorously test MCAM, the researchers went beyond synthetic data. They constructed two real-world SpeechRE datasets based on CoNLL04 and ReTACRED. They hired native speakers to read the text instances, creating a dataset with natural pauses, intonations, and noise. They also used a “Mixed-CoNLL04” dataset (synthetic training data, real test data) to test robustness.

Tuning the Hyperparameters

Before looking at the main comparisons, it is important to see how the model is tuned. The researchers analyzed the impact of the loss weights \(\alpha\) (for Contrastive Learning) and \(\beta\) (for CTC).

Table 3: Performance of MCAM with different values of \(\alpha\) and \(\beta\).

As Table 3 shows, the model is relatively stable, but achieves peak performance with \(\alpha=0.8\) and \(\beta=0.2\). This highlights that while token alignment (CTC) is necessary, the sentence-level semantic alignment (Contrastive Learning) carries significant weight in determining relationships.

They also investigated the Semantic Projection Layer—the shared space where text and speech features meet.

Figure 4: Performance with different numbers of layers \(N\) in the semantic projection layer.

Figure 4 indicates that using the top 3 layers of the BART encoder as the projection layer yields the best results. Going deeper (\(N > 3\)) hurts performance, likely because the limited size of SpeechRE datasets makes it hard to train a larger number of parameters effectively.

Main Performance

The results were decisive. MCAM consistently outperformed existing baselines across three key metrics:

  1. ER: Entity Recognition
  2. RP: Relation Prediction
  3. RTE: Relation Triplet Extraction (the full task)

On the CoNLL04 dataset, MCAM achieved an F1 score of 40.13 on Entity Recognition and 22.07 on the full Triplet Extraction task. This is a massive leap over the baseline LNA-ED model, which scored only 18.87 and 10.41 respectively.

Key takeaways from the experimental analysis include:

  • End-to-End Superiority: MCAM outperformed pipeline models (ASR + TextRE). Pipeline models suffered heavily from error propagation—if the ASR missed an entity name, the relation extraction was impossible.
  • Real vs. Synthetic: All models performed worse on real speech compared to synthetic speech, validating the authors’ claim that synthetic datasets are insufficient benchmarks. However, MCAM showed much stronger robustness to the “messiness” of real speech than its competitors.
  • Alignment Matters: Comparisons with other alignment methods (like Chimera or SATE from speech translation) showed that MCAM’s specific multi-level approach is better suited for the granularity required in Relation Extraction.

Ablation Studies: Do we need all levels?

The authors removed components one by one to verify their contributions.

  • Removing Token-level alignment: Caused a significant drop in Entity Recognition. The model struggled to locate entities without the CTC guidance.
  • Removing Entity-level alignment: Also hurt Entity Recognition. The KL divergence loss on the “mixed” sequence is crucial for ensuring the decoder recognizes speech entities as effectively as text entities.
  • Removing Sentence-level alignment: Caused a specific drop in Relation Prediction. The model could find the entities but struggled to understand the relationship between them without the global context provided by the soft prompts.

Conclusion and Future Implications

The MCAM paper represents a significant step forward in extracting structured knowledge directly from speech. By acknowledging that speech and text are fundamentally different modalities, and by forcing alignment at the token, entity, and sentence levels, the authors created a model that is far more robust than previous attempts.

The introduction of real-world datasets (Human-read CoNLL04 and ReTACRED) is also a vital contribution, pushing the field away from the “crutch” of synthetic speech and towards handling the complexities of real human communication.

Why does this matter? Imagine a virtual assistant that doesn’t just transcribe your meeting, but automatically updates your company’s knowledge graph with “Project X deadline is Friday” or “Alice is the lead on Marketing.” MCAM brings us closer to that reality by removing the dependency on error-prone transcription steps and treating speech as a first-class citizen for knowledge extraction.

Future work in this space could look at “Zero-shot” SpeechRE—training models using only the abundant data available for ASR and TextRE, without requiring expensive, manually annotated speech relation datasets. But for now, MCAM sets a new standard for how we should bridge the gap between what is said and what is known.