Paper Title
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion
Paper Authors
Paper Abstract
We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
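The abstract describes a two-part design: a cross-attention transformer that fuses the speaker's facial motion with their speech audio, and a VQ-VAE that encodes listener motion into a discrete codebook from which an autoregressive model can sample multiple plausible responses. The sketch below illustrates that pipeline in PyTorch; all module names, feature dimensions (e.g. `motion_dim=56`, `audio_dim=128`), and the single-layer encoders/decoders are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pipeline outlined in the abstract (assumed shapes and modules).
import torch
import torch.nn as nn


class CrossModalSpeakerEncoder(nn.Module):
    """Fuses speaker facial motion and speech audio via cross-attention."""

    def __init__(self, motion_dim=56, audio_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                nn.Linear(d_model, d_model))

    def forward(self, speaker_motion, speaker_audio):
        # Motion features query the audio features (motion-audio cross-attention).
        q = self.motion_proj(speaker_motion)   # (B, T, d_model)
        kv = self.audio_proj(speaker_audio)    # (B, T, d_model)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.ff(fused)                  # per-frame speaker context


class ListenerMotionVQVAE(nn.Module):
    """Discrete latent codebook over listener facial-motion frames."""

    def __init__(self, motion_dim=56, d_model=256, codebook_size=256):
        super().__init__()
        self.encoder = nn.Linear(motion_dim, d_model)
        self.codebook = nn.Embedding(codebook_size, d_model)
        self.decoder = nn.Linear(d_model, motion_dim)

    def quantize(self, z):
        # Nearest-codebook-entry assignment (straight-through gradient omitted).
        flat = z.reshape(-1, z.size(-1))                        # (B*T, d_model)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.reshape(z.shape[:-1])                         # (B, T)
        return self.codebook(idx), idx

    def forward(self, listener_motion):
        z_q, idx = self.quantize(self.encoder(listener_motion))
        return self.decoder(z_q), idx


if __name__ == "__main__":
    # Toy chunk of 32 frames for one speaker/listener pair.
    speaker_motion = torch.randn(1, 32, 56)    # e.g. 3DMM expression + pose coefficients
    speaker_audio = torch.randn(1, 32, 128)    # e.g. per-frame audio features
    listener_motion = torch.randn(1, 32, 56)

    ctx = CrossModalSpeakerEncoder()(speaker_motion, speaker_audio)
    recon, codes = ListenerMotionVQVAE()(listener_motion)
    print(ctx.shape, recon.shape, codes.shape)
    # At inference time, an autoregressive transformer (not shown) would sample
    # the next listener code conditioned on `ctx` and previously generated codes,
    # yielding multiple plausible listener reactions for the same speaker input.
```

Sampling over discrete codebook indices, rather than regressing motion directly, is what makes the prediction non-deterministic: different draws from the predicted code distribution decode to different but realistic listener motions.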