Paper title
Distribution augmentation for low-resource expressive text-to-speech
Paper authors
Paper abstract
This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows generating new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactic correctness. We take additional measures to ensure that the synthesized speech does not contain artifacts caused by combining inconsistent audio samples. Perceptual evaluations show that our method improves speech quality across a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.
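The core idea of the abstract — building new (text, audio) training pairs by substituting matching fragments between two existing examples — can be illustrated with a minimal sketch. This is not the paper's implementation: the data layout (word tokens paired one-to-one with word-aligned audio segments) and the assumption that the swapped spans have already been chosen to be syntactically compatible (e.g. two noun phrases) are hypothetical simplifications.

```python
def substitute_fragment(ex_a, ex_b, span_a, span_b):
    """Build a new (text, audio) training example by replacing the word
    span `span_a` of example A with the word span `span_b` of example B.

    Hypothetical data layout: each example is a dict with 'words' (a list
    of tokens) and 'audio' (a list of audio segments, one per word,
    obtained from a forced alignment). Callers are assumed to pick spans
    that are syntactically interchangeable, so the result stays
    grammatical, and to filter out swaps that would juxtapose acoustically
    inconsistent segments.
    """
    i, j = span_a  # half-open word range [i, j) in example A
    k, l = span_b  # half-open word range [k, l) in example B
    new_words = ex_a["words"][:i] + ex_b["words"][k:l] + ex_a["words"][j:]
    new_audio = ex_a["audio"][:i] + ex_b["audio"][k:l] + ex_a["audio"][j:]
    return {"words": new_words, "audio": new_audio}


# Example: swap the noun phrase "the cat" for "a big dog".
a = {"words": ["the", "cat", "sat"], "audio": ["a0", "a1", "a2"]}
b = {"words": ["a", "big", "dog", "ran"], "audio": ["b0", "b1", "b2", "b3"]}
new_ex = substitute_fragment(a, b, (0, 2), (0, 3))
# new_ex["words"] is ["a", "big", "dog", "sat"]
```

Because text and audio are cut at the same word boundaries, each synthetic pair remains internally consistent, which is what lets the model see novel text conditionings without any newly recorded data.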