Introduction

In the rapidly evolving world of Generative AI, Text-to-Speech (TTS) has moved far beyond the robotic voices of the past. We have entered the era of Zero-Shot TTS. This technology allows a system to clone a speaker’s voice using only a few seconds of reference audio, without ever having been trained on that specific person’s voice before. While models like VALL-E and XTTS have revolutionized this space for English, low-resource languages often get left behind.

The primary bottleneck for non-English languages is not necessarily the model architecture, but the data. High-quality, large-scale, and diverse speech datasets are scarce. For the Vietnamese language, existing datasets have historically been limited by short audio durations, background noise, or a lack of speaker diversity.

This is the problem addressed by a new research paper titled “Zero-Shot Text-to-Speech for Vietnamese.” The researchers introduce PhoAudiobook, a massive, high-quality dataset comprising 941 hours of audio. By curating this dataset and benchmarking state-of-the-art models against it, they demonstrate how better data engineering can significantly improve the naturalness and prosody of synthesized speech for Vietnamese.

In this post, we will dissect the creation of PhoAudiobook, explore the pipeline used to cleanse the data, and analyze how leading models like VALL-E, VoiceCraft, and XTTS-v2 perform when trained on this new resource.

Background: The Challenge of Zero-Shot TTS

Before diving into the solution, it is helpful to understand the specific challenges of Zero-Shot TTS in low-resource environments.

Traditional TTS systems required hours of studio-recorded speech from a single speaker to build a high-quality voice model. Zero-Shot TTS changes the game by using speaker adaptation and speaker encoding. The model learns a general representation of human speech from thousands of different speakers. During inference (generation), it takes a short “prompt” (a few seconds of audio) to extract the unique characteristics (timbre, pitch, accent) of a new speaker and applies them to the text being read.
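
To make the prompt-based workflow concrete, here is a minimal sketch using the open-source Coqui TTS package and its stock XTTS-v2 checkpoint. Note that the stock checkpoint does not support Vietnamese (which is exactly the gap this paper addresses; its Vietnamese-capable models are separate fine-tunes), so the example uses English, and the file paths are placeholders.

```python
# pip install TTS  (the Coqui TTS package)
from TTS.api import TTS

# Load a pretrained zero-shot, multi-speaker model from the Coqui model zoo.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in the reference clip (a few seconds is enough) and
# read out new text with that voice -- no fine-tuning on the speaker.
tts.tts_to_file(
    text="Zero-shot TTS clones a voice from a few seconds of audio.",
    speaker_wav="reference_speaker.wav",  # the short acoustic prompt
    language="en",
    file_path="cloned_output.wav",
)
```

The key point is the `speaker_wav` argument: the model never sees this speaker during training, yet conditions its output on the prompt's timbre, pitch, and accent.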

However, to learn these general representations effectively, these models require massive datasets. For languages like English, datasets containing tens of thousands of hours are available. For Vietnamese, available datasets are often fragmented, noisy, or lack the necessary metadata (like speaker IDs). Furthermore, linguistic nuances such as tones in Vietnamese add an extra layer of complexity that generic models often struggle to capture without high-quality training data.

The Core Method: Constructing PhoAudiobook

The heart of this research is the construction of the dataset itself. The authors did not simply scrape files and feed them into a model; they engineered a rigorous pipeline to ensure high fidelity.

The Pipeline

The researchers sourced their raw data from Vietnamese audiobooks. Audiobooks are an excellent source for TTS training because they are typically recorded in professional studios with clear articulation and minimal background noise.

The creation pipeline is visualized in the figure below:

Figure 1: PhoAudiobook creation pipeline.

Let’s break down the key stages of this pipeline shown in Figure 1:

  1. Collection & Extraction: The process begins with collecting 23,000 hours of raw audiobooks. However, audiobooks often contain background music or sound effects. To address this, the team used Demucs, a music source separation model, to isolate the vocal track and discard the background music and effects.
  2. Transcription: The isolated vocals were transcribed using the multilingual Whisper-large-v3 model. This provided both the text and the timestamps.
  3. Segmentation & Merging: This is a crucial innovation. Most existing datasets consist of very short clips (under 10 seconds). However, to teach a model proper prosody (the rhythm and flow of speech), longer context is needed. The researchers concatenated successive segments to create samples lasting between 10 and 20 seconds.
  4. Rigorous Filtering: To ensure the transcripts were accurate, they performed a “double-check” using a second model, PhoWhisper-large. If the PhoWhisper transcript did not match the initial Whisper transcript, the sample was discarded. They also filtered out segments where multiple people might be speaking at once. (A rough sketch of the merging and double-check logic follows this list.)
  5. Normalization: Finally, the audio volume was normalized, and the text was standardized (e.g., converting the numeral “43” to its spelled-out Vietnamese form “bốn mươi ba”, i.e., “forty-three”).
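
The segmentation, merging, and double-check steps are the most algorithmic parts of the pipeline. The paper does not publish its processing code, so the following is only an illustrative Python sketch: the segment format (Whisper-style dictionaries with `start`, `end`, and `text`), the greedy merging strategy, and the exact-match agreement criterion are assumptions rather than details taken from the paper.

```python
import re

def merge_segments(segments, min_len=10.0, max_len=20.0):
    """Greedily concatenate consecutive ASR segments into ~10-20 second samples."""
    samples, current = [], None
    for seg in segments:
        if current is None:
            current = dict(seg)
        elif seg["end"] - current["start"] <= max_len:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        else:
            if current["end"] - current["start"] >= min_len:
                samples.append(current)
            current = dict(seg)
    if current is not None and current["end"] - current["start"] >= min_len:
        samples.append(current)
    return samples

def normalize(text):
    """Lowercase and strip punctuation so the comparison focuses on the words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def transcripts_agree(whisper_text, phowhisper_text):
    """Keep a sample only if the two ASR systems produce matching word sequences."""
    return normalize(whisper_text) == normalize(phowhisper_text)
```

Requiring two independent ASR systems to agree trades recall for precision: some usable samples are discarded, but the transcripts that survive are very likely correct, which is exactly what a TTS training corpus needs.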

Dataset Analysis and Comparison

The result of this pipeline is a refined corpus of 941 hours. How does this compare to existing resources?

Table 1: Characteristics of PhoAudiobook and other speech datasets for Vietnamese.

As shown in Table 1, PhoAudiobook stands out in several metrics:

  • Domain: It is the only major dataset derived entirely from Audiobooks, ensuring professional recording quality.
  • SI-SNR (Scale-Invariant Signal-to-Noise Ratio): It achieves a score of 4.91 dB, higher than all competitors, indicating cleaner audio (the metric is defined below).
  • Duration: While viVoice is slightly larger in total hours (1,016 vs. 941), PhoAudiobook offers a notably longer mean clip duration.
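
For readers unfamiliar with the metric, SI-SNR is commonly defined as follows, where \(s\) is the clean reference signal and \(\hat{s}\) is the signal under evaluation (the paper reports it as an estimate of how noisy each dataset is; the exact estimation procedure is not reproduced here):

\[
s_{\text{target}} = \frac{\langle \hat{s},\, s\rangle}{\lVert s\rVert^{2}}\, s,
\qquad
e_{\text{noise}} = \hat{s} - s_{\text{target}},
\qquad
\text{SI-SNR} = 10\log_{10}\frac{\lVert s_{\text{target}}\rVert^{2}}{\lVert e_{\text{noise}}\rVert^{2}}\ \text{(dB)}.
\]

Because the target term projects \(\hat{s}\) onto the reference, the score is insensitive to overall volume, which makes it better suited than plain SNR for comparing recordings at different loudness levels.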

The difference in audio clip duration is visually striking:

Figure 2: Duration distributions of datasets. Audio samples are capped at 40 seconds for visualization purposes.

Figure 2 highlights a significant gap in previous research. Datasets like VinBigData and BUD500 are heavily skewed toward short clips (under 5-10 seconds). In contrast, PhoAudiobook (the purple violin plot at the bottom) has a dense concentration of clips between 10 and 20 seconds. This distribution is intentional, designed to help TTS models learn how to sustain a narrative flow over longer periods.

Experimental Setup

With the dataset created, the researchers sought to benchmark its effectiveness. They selected three state-of-the-art Zero-Shot TTS architectures:

  1. VALL-E: A language modeling approach that treats TTS as a conditional task, predicting audio codec tokens based on text and acoustic prompts (the rough token flow is sketched after this list).
  2. VoiceCraft: A token-infilling neural codec language model originally designed for speech editing but highly capable of zero-shot generation.
  3. XTTS-v2: A model based on the Tortoise architecture, known for its strong voice cloning and multilingual capabilities.
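
For readers curious how a codec language model like VALL-E turns a prompt into speech, the rough control flow looks like the sketch below. The names `ar_model`, `nar_model`, and `codec_decoder` are placeholders, not a real API; this only illustrates the two-stage token prediction described in the original VALL-E paper, not the training setup used here.

```python
def vall_e_style_inference(text_tokens, prompt_codec_tokens,
                           ar_model, nar_model, codec_decoder):
    """Illustrative flow of a VALL-E-style neural codec language model.

    ar_model, nar_model, and codec_decoder stand in for trained components;
    this is a conceptual sketch, not an implementation.
    """
    # Stage 1: autoregressively predict the first codebook's tokens,
    # conditioned on the text and the acoustic prompt.
    first_codebook = ar_model(text_tokens, prompt_codec_tokens)

    # Stage 2: non-autoregressively fill in the remaining codebooks.
    all_codebooks = nar_model(text_tokens, prompt_codec_tokens, first_codebook)

    # Decode the codec tokens back into a waveform with the neural codec
    # (EnCodec in the original VALL-E).
    return codec_decoder(all_codebooks)
```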

Training Strategy

To train these models effectively, the researchers augmented PhoAudiobook with an additional set of shorter clips, bringing the total training data to roughly 1,494 hours. This was done to ensure the models didn’t only learn to speak in long paragraphs but could also handle short, snappy sentences.

Evaluation Metrics

The models were evaluated using both objective and subjective metrics (two of the objective ones are sketched in code after the list):

  • WER (Word Error Rate): Did the model say the correct words? (Lower is better).
  • MCD (Mel-Cepstral Distortion): How close is the spectral quality to the reference? (Lower is better).
  • RMSE\(_{F0}\): How closely did the pitch/intonation match the reference? (Lower is better).
  • MOS (Mean Opinion Score): Human rating of naturalness (Higher is better).
  • SMOS (Similarity MOS): Human rating of how much the voice sounds like the target speaker (Higher is better).
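
As a rough illustration, WER and RMSE\(_{F0}\) can be approximated with common open-source tools. This is not the paper's evaluation code: jiwer and librosa are assumed stand-ins, and the F0 sketch skips the time-alignment (e.g., DTW) that a careful evaluation would apply before comparing contours.

```python
import numpy as np
import librosa
import jiwer

def word_error_rate(reference_text, hypothesis_text):
    """WER between the input text and an ASR transcript of the synthesized audio."""
    return jiwer.wer(reference_text, hypothesis_text)

def f0_rmse(reference_wav, synthesized_wav, sr=22050):
    """Rough RMSE between the F0 contours of two utterances (voiced frames only)."""
    ref, _ = librosa.load(reference_wav, sr=sr)
    syn, _ = librosa.load(synthesized_wav, sr=sr)
    f0_ref, voiced_ref, _ = librosa.pyin(ref, fmin=65, fmax=400, sr=sr)
    f0_syn, voiced_syn, _ = librosa.pyin(syn, fmin=65, fmax=400, sr=sr)
    n = min(len(f0_ref), len(f0_syn))
    mask = voiced_ref[:n] & voiced_syn[:n]  # compare only frames voiced in both
    return float(np.sqrt(np.mean((f0_ref[:n][mask] - f0_syn[:n][mask]) ** 2)))
```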

Results and Analysis

The experiments compared the new models (trained on PhoAudiobook) against a baseline model called viXTTS, which was fine-tuned on the older viVoice dataset.

Table 2: Test results of different TTS models.

Table 2 provides a comprehensive look at the performance. Here are the key takeaways:

1. The Dominance of XTTS-v2 + PhoAudiobook

The model labeled XTTS-v2\(_{PAB}\) (trained on PhoAudiobook) achieved the best results across almost all metrics for the in-domain test sets (PAB-S and PAB-U) and the external viVoice test set.

Notably, look at the WER (Word Error Rate) on the viVoice test set:

  • viXTTS (Baseline): 12.54%
  • XTTS-v2\(_{PAB}\) (Ours): 8.32%

This is a critical finding. The XTTS-v2\(_{PAB}\) model performed better on the viVoice test data than the model that was actually trained on the viVoice dataset. This strongly suggests that the quality of the PhoAudiobook data (cleaner audio, better transcripts) allows the model to generalize better than simply having a large quantity of noisier data.

2. Similarity and Naturalness

In terms of subjective human evaluation, XTTS-v2\(_{PAB}\) also led the pack. It achieved the highest SMOS scores, indicating that it was superior at cloning the unique identity of the speaker. The low RMSE\(_{F0}\) scores further confirm that it captured the pitch and prosody more accurately than the baseline.

3. The “Short Sentence” Anomaly

There is an interesting outlier in the results. Look at the VIVOS column in Table 2. VIVOS is a dataset consisting of very short sentences.

  • XTTS-v2\(_{PAB}\) WER: 37.81% (High error rate)
  • VALL-E\(_{PAB}\) WER: 12.63% (Low error rate)

Why did the top-performing XTTS model struggle here? The researchers observed that for very short text inputs, XTTS-v2 has a tendency to “ramble” or generate redundant speech at the end of the sentence. This appears to be an architectural limitation of XTTS-v2 rather than a dataset issue. Conversely, VALL-E and VoiceCraft proved much more robust when handling these short, concise inputs.

Conclusion and Implications

The release of PhoAudiobook marks a significant milestone for Vietnamese natural language processing. By curating 941 hours of high-quality, long-form audio, the researchers have provided a resource that surpasses existing datasets in both audio fidelity and metadata quality.

The experiments demonstrate a clear lesson for the AI community: Data engineering is as important as model architecture. A model trained on clean, well-structured data (PhoAudiobook) can outperform a model trained on noisy data, even when tested on that noisy data’s own test set.

While XTTS-v2 proved to be the superior model for general long-form narration, the robustness of VALL-E and VoiceCraft on short sentences suggests that different architectures may be required depending on the specific application (e.g., an audiobook reader vs. a conversational assistant).

Currently, the models are trained solely on Vietnamese. Future work aims to explore “code-switching,” enabling these voices to fluently switch between Vietnamese and English within a single sentence—a feature highly relevant in modern, multilingual Vietnam.