Paper Title
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
Paper Authors
Paper Abstract
We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatio-temporal representations. VCP first generates "blanks" by withholding video clips and then creates "options" by applying spatio-temporal operations to the withheld clips. Finally, it fills the blanks with "options" and learns representations by predicting the categories of the operations applied to the clips. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatio-temporal representation models (3D-CNNs) and apply such models to action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform state-of-the-art self-supervised models by significant margins.
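To make the procedure concrete, below is a minimal sketch of the VCP sample-generation step, assuming a hypothetical operation set {identity, spatial rotation, temporal shuffle} and PyTorch clip tensors of shape (C, T, H, W); the paper's actual operation list, clip sampling scheme, and 3D-CNN backbone are not reproduced here.

```python
# A minimal sketch of generating one VCP training sample, under the
# assumptions stated above (not the paper's exact operation set).
import random
import torch

OPS = ["identity", "spatial_rotation", "temporal_shuffle"]  # assumed operation categories

def apply_operation(clip: torch.Tensor, op_id: int) -> torch.Tensor:
    """Apply one spatio-temporal operation to a withheld clip of shape (C, T, H, W)."""
    if OPS[op_id] == "identity":
        return clip
    if OPS[op_id] == "spatial_rotation":
        # Rotate every frame 90 degrees in the spatial (H, W) plane.
        return torch.rot90(clip, k=1, dims=(2, 3))
    # temporal_shuffle: permute frame order along the time axis.
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

def make_vcp_sample(video: torch.Tensor, blank_idx: int):
    """Withhold one clip (the 'blank'), fill it with an operated version
    (an 'option'), and return the filled video plus the operation label."""
    clips = list(video.unbind(0))            # video: (num_clips, C, T, H, W)
    op_id = random.randrange(len(OPS))       # sample an operation category
    clips[blank_idx] = apply_operation(clips[blank_idx], op_id)
    return torch.stack(clips), op_id         # model input, classification target

# Usage: a 3D-CNN would consume `filled` and be trained to predict `label`.
video = torch.randn(4, 3, 16, 112, 112)      # 4 clips of 16 RGB frames, 112x112
filled, label = make_vcp_sample(video, blank_idx=2)
print(filled.shape, OPS[label])
```

Framing the proxy task as classification over operation categories is what keeps the self-supervision signal simple: the label is generated for free by the sampling step, and no manual annotation is needed.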