Paper Title

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

Paper Authors

Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng

Paper Abstract

This paper aims to enhance low-resource TTS by using compact speech representations to reduce training data requirements. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, the MSMCR, and decode it to waveforms. Subsequently, we train a multi-stage predictor to predict MSMCRs from text for TTS synthesis. Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages: it selects audio from other languages using a speaker similarity metric to augment the training set, and applies transfer learning to improve training quality. In MOS tests, the proposed system significantly outperforms FastSpeech and VITS in both standard and low-resource scenarios, showing its lower data requirements. The proposed training strategy effectively enhances MSMCRs in waveform reconstruction, and further improves TTS performance, winning 77% of the votes in the preference test for low-resource TTS with only 15 minutes of paired data.
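
To make the core idea concrete, below is a minimal PyTorch sketch of multi-stage multi-codebook vector quantization, the mechanism behind the MSMCR: each stage quantizes the residual left over by the previous stage, and within a stage the feature vector is split across several codebooks. The class name, parameter names, and all sizes (MSMCQuantizer, dim, n_stages, n_codebooks, codebook_size) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MSMCQuantizer(nn.Module):
    """Sketch of multi-stage multi-codebook vector quantization (assumed
    structure): each stage quantizes the residual of the previous stage,
    and each stage splits the vector across several codebooks."""

    def __init__(self, dim=256, n_stages=2, n_codebooks=4, codebook_size=64):
        super().__init__()
        assert dim % n_codebooks == 0
        self.n_codebooks = n_codebooks
        head_dim = dim // n_codebooks
        # One embedding table per (stage, codebook) pair.
        self.stages = nn.ModuleList(
            nn.ModuleList(
                nn.Embedding(codebook_size, head_dim) for _ in range(n_codebooks)
            )
            for _ in range(n_stages)
        )

    @staticmethod
    def _nearest(x, codebook):
        # Nearest-neighbour lookup for one head; x: (batch, frames, head_dim).
        dists = torch.cdist(x, codebook.weight.unsqueeze(0))  # (B, T, K)
        idx = dists.argmin(dim=-1)                            # (B, T)
        return codebook(idx), idx

    def forward(self, x):
        residual, quantized, indices = x, torch.zeros_like(x), []
        for stage in self.stages:
            heads = residual.chunk(self.n_codebooks, dim=-1)
            q_heads, head_idx = zip(
                *(self._nearest(h, cb) for h, cb in zip(heads, stage))
            )
            q = torch.cat(q_heads, dim=-1)
            quantized = quantized + q   # accumulate coarse-to-fine stages
            residual = residual - q     # next stage sees what is left over
            indices.append(torch.stack(head_idx, dim=-1))  # (B, T, n_codebooks)
        # Straight-through estimator so gradients reach the encoder.
        quantized = x + (quantized - x).detach()
        return quantized, indices


# Illustrative usage: quantize 100 frames of 256-dim encoder output.
q, idx = MSMCQuantizer()(torch.randn(1, 100, 256))
```

Training the full MSMC-VQ-GAN additionally involves the waveform decoder and adversarial losses mentioned in the abstract, which this sketch omits.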
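The cross-lingual data selection step can likewise be sketched. Assuming each utterance is represented by a speaker embedding (e.g. an x-vector), one plausible reading of the "speaker similarity metric" is cosine similarity to the centroid of the target speaker's embeddings; the function below is a hypothetical illustration of that selection, not the paper's code.

```python
import numpy as np


def select_augmentation_set(target_embs, candidate_embs, top_k=1000):
    """Rank candidate utterances from other languages by cosine similarity
    of their speaker embeddings to the target speaker's centroid, and keep
    the top_k most similar ones to augment the training set. The centroid
    heuristic and top_k cutoff are assumptions for illustration."""
    centroid = target_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    normed = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = normed @ centroid            # cosine similarity of each candidate
    order = np.argsort(-sims)           # most similar first
    return order[:top_k], sims[order[:top_k]]
```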
