Paper title
Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs
Paper authors
Paper abstract
Figure skating scoring is challenging because it requires judging both the skaters' technical moves and their coordination with the background music. Most learning-based methods handle it poorly for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling loses a lot of valuable information, especially in videos that are 3 to 5 minutes long; 2) prior methods rarely consider the critical audio-visual relationship in their models. For these reasons, we introduce a novel architecture named Skating-Mixer. It extends the MLP framework to a multimodal setting and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show that the proposed method achieves state-of-the-art results on all major metrics on the public Fis-V dataset and our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions at the Beijing 2022 Winter Olympic Games, demonstrating that our method has strong applicability.