Paper Title
Foley Music: Learning to Generate Music from Videos
Paper Authors
Paper Abstract
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip of people playing musical instruments. We first identify two key intermediate representations for a successful video-to-music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The predicted MIDI events can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, enabling flexible music editing. We encourage readers to watch the demo video with audio turned on to experience the results.
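The abstract's key intermediate representation is a sequence of discrete MIDI events predicted from body motion. As a rough illustration only (the exact event vocabulary is an assumption here, not taken from the paper), the sketch below encodes notes as NOTE_ON / NOTE_OFF events separated by TIME_SHIFT events, the style of symbolic sequence such a model would emit token by token:

```python
# A minimal sketch (assumed event vocabulary, not the paper's exact one) of
# encoding notes as a discrete MIDI-style event sequence: NOTE_ON / NOTE_OFF
# events for pitches, with TIME_SHIFT events advancing a running clock.

def encode_events(notes):
    """notes: list of (midi_pitch, start_sec, end_sec) tuples.
    Returns a flat list of event tokens ordered in time."""
    boundaries = []
    for pitch, start, end in notes:
        boundaries.append((start, "NOTE_ON", pitch))
        boundaries.append((end, "NOTE_OFF", pitch))
    boundaries.sort()

    events, clock = [], 0.0
    for t, kind, pitch in boundaries:
        if t > clock:  # advance time with an explicit TIME_SHIFT event
            events.append(("TIME_SHIFT", round(t - clock, 3)))
            clock = t
        events.append((kind, pitch))
    return events

# Two-note example: middle C (60) then E (64), each held for 0.5 s.
print(encode_events([(60, 0.0, 0.5), (64, 0.5, 1.0)]))
# → [('NOTE_ON', 60), ('TIME_SHIFT', 0.5), ('NOTE_OFF', 60),
#    ('NOTE_ON', 64), ('TIME_SHIFT', 0.5), ('NOTE_OFF', 64)]
```

Because the representation is symbolic and fully interpretable, a sequence like this can be edited (transposed, retimed) before being rendered to audio by an off-the-shelf synthesizer.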