Controllable Neural Prosody Synthesis

Paper Title

Controllable Neural Prosody Synthesis

Authors

Max Morrison, Zeyu Jin, Justin Salamon, Nicholas J. Bryan, Gautham J. Mysore

Abstract

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.
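The constraint mechanism described in the abstract (the user fixes some time frames, the model generates the rest) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: prosody is represented as one pitch value in Hz per frame, `None` marks an unconstrained frame, and the hypothetical `interpolate_stand_in` function replaces the paper's neural generator with plain linear interpolation. It ignores text and context entirely, so it shows only the interface, not the authors' method.

```python
from typing import Callable, List, Optional

Contour = List[Optional[float]]  # one pitch value (Hz) per frame; None = unconstrained


def interpolate_stand_in(constraints: Contour) -> List[float]:
    """Hypothetical stand-in for a neural prosody generator.

    Linearly interpolates between constrained frames and holds the nearest
    constrained value at the edges. A real model would condition on text
    and surrounding prosody instead.
    """
    known = [(i, v) for i, v in enumerate(constraints) if v is not None]
    if not known:
        raise ValueError("at least one constrained frame is required")
    out = []
    for i in range(len(constraints)):
        before = [k for k in known if k[0] <= i]
        after = [k for k in known if k[0] >= i]
        if not before:
            out.append(after[0][1])       # left edge: hold first constraint
        elif not after:
            out.append(before[-1][1])     # right edge: hold last constraint
        else:
            (i0, v0), (i1, v1) = before[-1], after[0]
            if i0 == i1:
                out.append(v0)            # frame is itself constrained
            else:
                out.append(v0 + (v1 - v0) * (i - i0) / (i1 - i0))
    return out


def fill_pitch_contour(
    constraints: Contour,
    generate: Callable[[Contour], List[float]],
) -> List[float]:
    """Keep user-constrained frames and take all other frames from `generate`."""
    proposal = generate(constraints)
    if len(proposal) != len(constraints):
        raise ValueError("generator must return one value per frame")
    return [c if c is not None else p for c, p in zip(constraints, proposal)]


# The user pins frames 1 and 4; the remaining frames are generated.
user_constraints: Contour = [None, 200.0, None, None, 230.0, None]
contour = fill_pitch_contour(user_constraints, interpolate_stand_in)
print(contour)
```

Passing the generator as a callable keeps the constraint handling separate from the model, so a learned generator could be swapped in without changing how the user-specified frames are preserved.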
