Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs

Paper Title

Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs

Authors

Jingfei Xia, Mingchen Zhuge, Tiantian Geng, Shun Fan, Yuantai Wei, Zhenyu He, Feng Zheng

Abstract

Figure skating scoring is challenging because it requires judging both the skaters' technical moves and their coordination with the background music. Most learning-based methods handle it poorly for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling loses a lot of valuable information, especially in videos that run 3 to 5 minutes; 2) prior methods rarely model the critical audio-visual relationship. For these reasons, we introduce a novel architecture named Skating-Mixer. It extends the MLP framework to a multimodal setting and effectively learns long-term representations through our memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show that the proposed method achieves state-of-the-art results on all major metrics on the public Fis-V dataset and our FS1000 dataset. In addition, we include an analysis that applies our method to recent competitions at the Beijing 2022 Winter Olympic Games, demonstrating its strong applicability.
