PhoAudiobook: Bridging the Gap in Vietnamese Zero-Shot Text-to-Speech
Introduction
In the rapidly evolving world of Generative AI, Text-to-Speech (TTS) has moved far beyond the robotic voices of the past. We have entered the era of Zero-Shot TTS. This technology allows a system to clone a speaker’s voice using only a few seconds of reference audio, without ever having been trained on that specific person’s voice. While models like VALL-E and XTTS have revolutionized this space for English, low-resource languages often get left behind. ...