Paper Title
End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions
Paper Authors
Paper Abstract
Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end zero-shot voice conversion model that uses location-variable convolutions (LVCs) to jointly model the conversion and speech synthesis processes. LVC-VC takes carefully designed input features with disentangled content and speaker information, and it employs a neural vocoder-like architecture that leverages LVCs to combine them efficiently and perform voice conversion while directly synthesizing time-domain audio. Experiments show that our model achieves an especially well-balanced trade-off between voice style transfer and speech intelligibility compared to several baselines.
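To make the role of location-variable convolutions more concrete, below is a minimal PyTorch sketch of a generic LVC layer, not the paper's exact implementation: the class name LocationVariableConv, the linear kernel predictor, and the hop and kernel sizes are illustrative assumptions, and real LVC-based models (e.g., LVCNet/UnivNet-style vocoders) typically add dilations, gated activations, and a convolutional kernel predictor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocationVariableConv(nn.Module):
    """Minimal sketch of a location-variable convolution (LVC) layer.

    A kernel predictor maps per-frame conditioning features to per-segment
    convolution kernels, which are then applied to the corresponding segments
    of the input signal. Sizes and module choices here are illustrative, not
    the configuration used in the paper.
    """

    def __init__(self, channels: int, cond_dim: int, kernel_size: int = 3, hop: int = 256):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel size keeps segment length unchanged"
        self.channels = channels
        self.kernel_size = kernel_size
        self.hop = hop  # number of input samples covered by one conditioning frame
        # Predicts one (channels x channels x kernel_size) kernel plus a bias per frame.
        self.kernel_predictor = nn.Linear(cond_dim, channels * channels * kernel_size + channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, channels, frames * hop)  intermediate waveform-rate features
        # cond: (batch, frames, cond_dim)        per-frame conditioning features
        b, c, _ = x.shape
        frames = cond.size(1)
        pred = self.kernel_predictor(cond)                         # (b, frames, c*c*k + c)
        kernels = pred[..., : c * c * self.kernel_size]
        bias = pred[..., c * c * self.kernel_size:]                # (b, frames, c)
        kernels = kernels.view(b, frames, c, c, self.kernel_size)

        pad = self.kernel_size // 2
        outputs = []
        for f in range(frames):
            seg = x[:, :, f * self.hop:(f + 1) * self.hop]         # (b, c, hop)
            seg = F.pad(seg, (pad, pad))
            # Filter each segment with the kernel predicted from its own frame.
            out = torch.stack([
                F.conv1d(seg[i:i + 1], kernels[i, f], bias[i, f])
                for i in range(b)
            ], dim=0).squeeze(1)                                   # (b, c, hop)
            outputs.append(out)
        return torch.cat(outputs, dim=-1)                          # (b, c, frames * hop)


# Illustrative usage with random tensors.
lvc = LocationVariableConv(channels=32, cond_dim=80)
feats = torch.randn(2, 32, 4 * 256)   # 4 conditioning frames worth of features
cond = torch.randn(2, 4, 80)
out = lvc(feats, cond)                # -> (2, 32, 1024)
```

In a voice conversion setting like the one described, the conditioning tensor would carry the disentangled content and speaker features, so each audio segment is filtered with kernels predicted from its own local conditioning rather than with a single shared set of weights.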