Imagine watching a video of a person speaking, but the sound is muted. You can see their lips moving, their facial expressions shifting, and their jaw opening and closing. If you were asked to “dub” this video simply by watching it, could you do it? You might guess the words (lip-reading), but could you guess the sound of their voice? The pitch? The emotional inflection?
This is the challenge of Video-to-Speech (VTS) synthesis. It’s a fascinating problem in computer vision and audio processing with applications ranging from restoring silent archival films to assisting people with speech disabilities.
However, VTS is notoriously difficult. There is a massive “modality gap” between what we see (pixels of a face) and what we hear (audio waveforms). A silent video doesn’t explicitly contain the speaker’s vocal pitch or unique timbre. In a recent paper titled “From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech,” researchers from KAIST propose a sophisticated new method that bridges this gap better than ever before.
In this post, we’ll dive into their hierarchical approach, explaining how they break speech down into Content, Timbre, and Prosody, and how they use Flow Matching to generate hyper-realistic audio.
The Problem: The Modality Gap
The core issue in Video-to-Speech is that visual and acoustic data live in two different worlds.
- Visuals provide lip motion (which hints at words) and facial expressions (which hint at emotion).
- Audio contains timbre (the unique “texture” of a voice), prosody (rhythm and pitch), and content (phonemes).
Previous methods often tried to map video features directly to audio features using end-to-end deep learning models (like GANs or Diffusion models). While these approaches work to an extent, they often struggle to disentangle these complex attributes. The result? Speech that sounds robotic, lacks the correct emotional punch, or sounds like a generic “average” voice rather than the specific person on screen.
The Solution: A Hierarchical Approach
The researchers realized that speech isn’t just one big block of data; it’s a layered signal. To generate high-quality speech, the model shouldn’t try to learn everything at once. Instead, it should learn in stages, moving from the most stable features to the most dynamic ones.
They propose a Hierarchical Visual Encoder that breaks the generation process into three distinct stages:
- Content Modeling: Determining what is being said.
- Timbre Modeling: Determining who is saying it.
- Prosody Modeling: Determining how it is being said.

As shown in Figure 1 above, the system takes a silent video and extracts three specific visual cues: Lip movements, Face Identity, and Facial Expressions. These align perfectly with the three acoustic stages mentioned above.
Deep Dive: The Architecture
Let’s look under the hood. The system consists of two main parts: the Hierarchical Visual Encoder and the Flow Matching Decoder.

1. The Hierarchical Visual Encoder
This is the brain of the operation. As illustrated in Figure 2, the encoder processes the video input sequentially.
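Before walking through each stage, it may help to see the overall data flow as code. The sketch below is a toy approximation: the module names, feature dimensions, and the simple concatenate-and-project fusion are illustrative assumptions, not the paper’s actual architecture.

```python
import torch
import torch.nn as nn

class HierarchicalVisualEncoderSketch(nn.Module):
    """Toy overview of the three-stage encoder; names, dimensions, and fusion
    operators are illustrative, not the paper's."""

    def __init__(self, lip_dim=1024, id_dim=512, expr_dim=256, hidden=256):
        super().__init__()
        self.content_proj = nn.Linear(lip_dim, hidden)            # Stage 1: "what" (lip motion)
        self.timbre_fuse = nn.Linear(hidden + id_dim, hidden)     # Stage 2: "who" (face identity)
        self.prosody_fuse = nn.Linear(hidden + expr_dim, hidden)  # Stage 3: "how" (expressions)

    def forward(self, lip_feats, face_id, expr_feats):
        # lip_feats:  (B, T, lip_dim)  per-frame lip-motion features
        # face_id:    (B, id_dim)      one identity embedding per clip
        # expr_feats: (B, T, expr_dim) per-frame expression features
        T = lip_feats.size(1)
        x = self.content_proj(lip_feats)                           # content first
        face_id = face_id.unsqueeze(1).expand(-1, T, -1)           # broadcast identity over time
        x = self.timbre_fuse(torch.cat([x, face_id], dim=-1))      # then timbre
        x = self.prosody_fuse(torch.cat([x, expr_feats], dim=-1))  # then prosody
        return x                                                   # conditioning signal for the decoder
```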
Stage 1: Content (The “What”)
The most fundamental part of speech is the linguistic content. Without the right words, the tone doesn’t matter.
- Visual Input: The model focuses on the Lips.
- Mechanism: The researchers use a pre-trained AV-HuBERT model (a state-of-the-art lip-reading model) to extract lip motion features.
- Alignment: These visual features are aligned with “speech units” (quantized representations of speech content).
To ensure the model understands the context of the words (coarticulation), they use a specialized convolution block.

As seen in Figure 3, the predictor uses a “Masked ConvBlock.” This allows the model to predict the content of a current frame based on its neighbors, learning the temporal dependencies that are crucial for fluent speech.
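The paper’s exact block is not reproduced here, but one natural way to implement “predict the current frame only from its neighbors” is a temporal convolution whose center tap is zeroed out. A minimal PyTorch sketch under that assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMaskedConv1d(nn.Module):
    """1D convolution over time with its center weight zeroed, so each frame is
    reconstructed from neighboring frames only (an illustrative reading of the
    masked ConvBlock, not the paper's exact implementation)."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel so the center tap is well defined"
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        mask = torch.ones(1, 1, kernel_size)
        mask[..., kernel_size // 2] = 0.0  # block the current frame
        self.register_buffer("mask", mask)

    def forward(self, x):  # x: (B, C, T) frame-level content features
        masked_weight = self.conv.weight * self.mask
        return F.conv1d(x, masked_weight, self.conv.bias, padding=self.conv.padding[0])
```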
The loss function for this stage combines a standard Cross Entropy (CE) term with this masked prediction scheme.

Stage 2: Timbre (The “Who”)
Once the words are established, the model needs to decide what the voice sounds like.
- Visual Input: The Face ID.
- Insight: There is a biometric correlation between facial structure and voice timbre. (Think about how you can sometimes guess what someone sounds like just by looking at their face).
- Mechanism: They use a face recognition network (ArcFace) to extract a static identity embedding. This is fused with the content features from Stage 1.
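As a rough sketch of this fusion step, a per-clip identity vector (for example, a 512-dimensional ArcFace embedding) can be broadcast over time and combined with the frame-level content features. The concatenate-and-project fusion below is an assumption for illustration, not necessarily the authors’ exact design.

```python
import torch
import torch.nn as nn

class TimbreFusionSketch(nn.Module):
    """Fuse a static identity embedding with per-frame content features
    (illustrative; the paper may use a different fusion mechanism)."""

    def __init__(self, content_dim=256, id_dim=512, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(content_dim + id_dim, out_dim)

    def forward(self, content, face_id):
        # content: (B, T, content_dim) Stage 1 output
        # face_id: (B, id_dim)         e.g. an ArcFace embedding of one face crop
        face_id = face_id.unsqueeze(1).expand(-1, content.size(1), -1)  # (B, T, id_dim)
        return self.proj(torch.cat([content, face_id], dim=-1))         # (B, T, out_dim)
```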
Stage 3: Prosody (The “How”)
Finally, the speech needs life—pitch, energy, and emotion.
- Visual Input: Facial Expressions.
- Insight: When you yell, your face tenses. When you are sad, your expression drops. These visual cues are highly correlated with pitch (F0) and energy.
- Mechanism: An expression encoder captures these subtle nuances. The system then predicts the pitch and energy contours for the speech.
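A prosody predictor along these lines might look like the sketch below: expression features are combined with the running representation and mapped to per-frame pitch and energy values. The GRU backbone and layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ProsodyPredictorSketch(nn.Module):
    """Predict per-frame F0 and energy from expression features plus the fused
    content/timbre representation (illustrative design, not the paper's)."""

    def __init__(self, in_dim=256, expr_dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim + expr_dim, hidden, batch_first=True, bidirectional=True)
        self.f0_head = nn.Linear(2 * hidden, 1)      # pitch contour
        self.energy_head = nn.Linear(2 * hidden, 1)  # energy contour

    def forward(self, fused, expr):
        # fused: (B, T, in_dim)   content + timbre features
        # expr:  (B, T, expr_dim) expression-encoder output
        h, _ = self.rnn(torch.cat([fused, expr], dim=-1))                    # (B, T, 2*hidden)
        return self.f0_head(h).squeeze(-1), self.energy_head(h).squeeze(-1)  # (B, T), (B, T)
```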
2. The Flow Matching Decoder
Once the visual encoder has produced a rich, layered representation (\(\mu\)) containing content, timbre, and prosody, this representation needs to be turned into a Mel-spectrogram (a time-frequency representation of the audio signal).
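If Mel-spectrograms are new to you, they are straightforward to compute with torchaudio; the settings below are typical values for 16 kHz speech, not the paper’s exact configuration.

```python
import torch
import torchaudio

# Typical 16 kHz speech settings (illustrative; the paper's STFT/mel
# configuration may differ).
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)

waveform = torch.randn(1, 16000)           # one second of dummy audio
mel = mel_transform(waveform)              # (1, 80, num_frames)
log_mel = torch.log(mel.clamp(min=1e-5))   # log-compress, the usual modeling target
print(log_mel.shape)
```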
Instead of using a standard Diffusion model (which can be slow and computationally heavy), the authors use Flow Matching, specifically Optimal Transport Conditional Flow Matching (OT-CFM).
In simple terms, while diffusion models “denoise” data by taking a jagged, random walk from noise to a clean sample, Flow Matching learns a vector field that traces the straightest, most direct path from the noise distribution to the target data distribution.
The flow trajectory is defined by this Ordinary Differential Equation (ODE):

$$\frac{d}{dt}\,\phi_t(x) = v_t\big(\phi_t(x), t\big), \qquad \phi_0(x) = x$$

where \(\phi_t\) is the flow that transports a noise sample \(x\) toward the data distribution as \(t\) runs from 0 to 1.
To train this, they try to minimize the difference between the model’s predicted vector field (\(v_t\)) and the optimal target path (\(u_t\)):

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\left[\,\big\| v_t\big(\phi_t(x_0) \mid \mu\big) - u_t\big(\phi_t(x_0) \mid x_1\big) \big\|^2\,\right]$$

where \(x_0\) is drawn from the noise prior, \(x_1\) is the target Mel-spectrogram, and \(\mu\) is the conditioning produced by the hierarchical visual encoder.
This approach allows for high-quality generation with fewer sampling steps compared to traditional diffusion models. Finally, a neural vocoder (HiFi-GAN) turns the generated Mel-spectrogram into the final audio waveform.
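To make this concrete, here is a compact sketch of the standard OT-CFM objective and an Euler-style sampler: noise \(x_0\) is linearly interpolated toward a data sample \(x_1\), and the network regresses onto the straight-line target velocity. `velocity_net` is a placeholder for a conditional network, not the paper’s actual decoder.

```python
import torch

def cfm_loss(velocity_net, x1, mu, sigma_min=1e-4):
    """Standard OT-CFM loss: match the predicted velocity to the straight-line
    target between noise x0 and data x1 (a sketch, not the paper's exact code)."""
    x0 = torch.randn_like(x1)                           # sample from the noise prior
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # one timestep per example
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1        # point on the (nearly) straight path
    ut = x1 - (1 - sigma_min) * x0                      # optimal-transport target velocity
    vt = velocity_net(xt, t.view(-1), mu)               # predicted vector field, conditioned on mu
    return ((vt - ut) ** 2).mean()

@torch.no_grad()
def euler_sample(velocity_net, mu, shape, n_steps=10):
    """Integrate the learned ODE from noise to a Mel-spectrogram with a few
    Euler steps; flow matching typically needs far fewer steps than diffusion."""
    x = torch.randn(shape, device=mu.device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=mu.device)
        x = x + dt * velocity_net(x, t, mu)
    return x
```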
Experimental Results
So, does it work? The results are impressive, setting a new state of the art for the field.
The researchers tested their model on standard datasets like LRS3-TED (TED Talks) and LRS2-BBC. They compared their work against top competitors like SVTS, Intelligible, and DiffV2S.
Subjective Evaluation (Human Listening)
In a Mean Opinion Score (MOS) test, human listeners rated the generated speech on Naturalness and Intelligibility.

The results in the table above are striking. The proposed method achieves a Naturalness score of 4.49, which is almost identical to the Ground Truth (real human speech) score of 4.54. Previous methods like DiffV2S only reached 2.97. This suggests the synthetic speech is nearly indistinguishable from real recordings.
Visual Proof: Spectrograms
We can also “see” the improvement by looking at the generated Mel-spectrograms.

In Figure 4, look at the “Ours” column compared to the “GT” (Ground Truth) column.
- The yellow/green lines represent the harmonics and formants of speech.
- The proposed method preserves the detailed harmonic structure (the horizontal stripes) much better than competitors like SVTS or Intelligible, which look blurry or “smeared.”
- The red boxes highlight areas where the model captures dynamic changes in pitch that align perfectly with the lip movements.
Objective Metrics
The objective data backs up the subjective tests. The model scores lowest on Word Error Rate (WER), meaning the generated speech is clear and easy to understand, and highest on perceptual quality metrics like UTMOS and DNSMOS.
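As a side note, WER is easy to reproduce yourself once you have transcripts from an ASR system, for example with the jiwer package (the transcripts below are made up for illustration):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ASR transcript of ground-truth audio
hypothesis = "the quick brown fox jumped over a lazy dog"   # ASR transcript of generated audio

# WER = (substitutions + deletions + insertions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # lower is better
```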

Why Hierarchy Matters: Ablation Study
To prove that the “divide and conquer” strategy was necessary, the researchers performed an ablation study, removing different parts of the model to see what would happen.

- w/o Hier: Removing the hierarchical structure caused a massive drop in quality.
- w/o Timbre / w/o Prosody: Removing these specific stages also hurt the model, proving that you really do need to model “Who” and “How” separately from “What.”
Conclusion
The paper “From Faces to Voices” demonstrates a significant leap forward in AI’s ability to reconstruct speech from silent video. By acknowledging that speech is a complex interplay of content, identity, and emotion—and by designing a neural network that mimics this hierarchy—the researchers have achieved generation quality that rivals real human speech.
The integration of Flow Matching further enhances this by ensuring the generation process is efficient and accurate. This technology opens exciting doors for the future of media editing, accessibility tools for the speech-impaired, and robust audio-visual understanding systems.
For students interested in multimodal learning, this paper is a perfect example of how domain knowledge (knowing how speech is produced) can guide network architecture (hierarchical encoders) to solve complex problems that brute-force deep learning cannot solve alone.