Paper Title
Lip to Speech Synthesis with Visual Context Attentional GAN
Paper Authors
Paper Abstract
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes speech from local lip visual features by finding a viseme-to-phoneme mapping function, while global visual context is embedded into the intermediate layers of the generator to resolve the ambiguity in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed, which encodes global representations from the local visual features and, through audio-visual attention, provides the generator with the global visual context corresponding to the given coarse speech representation. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms existing state-of-the-art methods and is able to effectively synthesize multi-speaker speech, which has barely been handled in previous works.
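The abstract describes two key components: an audio-visual cross-attention that injects global visual context into the generator's coarse speech representation, and a contrastive synchronization objective. The following is a minimal PyTorch sketch of how such components are commonly realized; the module names, dimensions, and InfoNCE-style loss form are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualContextAttention(nn.Module):
    """Illustrative sketch: refine a coarse speech representation with
    global visual context via audio-visual cross-attention."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Queries come from the coarse speech features of an intermediate
        # generator layer; keys/values are global visual features encoded
        # from the local lip features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_feats, global_visual_feats):
        # speech_feats:        (B, T_audio, dim) coarse speech representation
        # global_visual_feats: (B, T_video, dim) global visual context
        attended, _ = self.cross_attn(query=speech_feats,
                                      key=global_visual_feats,
                                      value=global_visual_feats)
        # Residual connection keeps the local viseme-to-phoneme mapping intact.
        return self.norm(speech_feats + attended)


def sync_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Illustrative InfoNCE-style synchronization loss: each synthesized
    speech clip should be closest to its own time-aligned lip clip."""
    audio_emb = F.normalize(audio_emb, dim=-1)   # (B, dim)
    video_emb = F.normalize(video_emb, dim=-1)   # (B, dim)
    logits = audio_emb @ video_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    attn = VisualContextAttention(dim=256, num_heads=4)
    speech = torch.randn(2, 80, 256)   # coarse speech features
    visual = torch.randn(2, 20, 256)   # global visual context features
    refined = attn(speech, visual)     # (2, 80, 256)
    loss = sync_contrastive_loss(torch.randn(2, 256), torch.randn(2, 256))
    print(refined.shape, loss.item())
```

The residual-plus-attention pattern above reflects the abstract's idea that global context clarifies, rather than replaces, the local viseme-to-phoneme mapping; the contrastive loss sketch mirrors the stated role of synchronization learning as a contrastive objective between generated speech and input lip movements.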