Paper Title


XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

Paper Authors

Pritam Sarkar, Ali Etemad

Paper Abstract


We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled videos. XKD is trained with two pseudo objectives. First, masked data reconstruction is performed to learn modality-specific representations from audio and visual streams. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through a teacher-student setup to learn complementary information. We introduce a novel domain alignment strategy to tackle domain discrepancy between audio and visual modalities enabling effective cross-modal knowledge distillation. Additionally, to develop a general-purpose network capable of handling both audio and visual streams, modality-agnostic variants of XKD are introduced, which use the same pretrained backbone for different audio and visual tasks. Our proposed cross-modal knowledge distillation improves video action classification by $8\%$ to $14\%$ on UCF101, HMDB51, and Kinetics400. Additionally, XKD improves multimodal action classification by $5.5\%$ on Kinetics-Sound. XKD shows state-of-the-art performance in sound classification on ESC50, achieving top-1 accuracy of $96.5\%$.
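The two pseudo objectives described above — cross-modal teacher-student distillation and audio-visual domain alignment — can be sketched as follows. This is a minimal illustrative sketch only: the KL-based distillation loss and the CORAL-style feature-statistics matching used here are common formulations assumed for illustration, not XKD's exact losses.

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Temperature-scaled softmax over the last axis.
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_feats, student_feats, temperature=2.0):
    # Cross-modal distillation: the student (one modality) matches the
    # teacher's (other modality) softened feature distribution via KL
    # divergence. Illustrative; XKD's actual objective may differ.
    t = softmax(teacher_feats, temperature)
    s = softmax(student_feats, temperature)
    eps = 1e-8  # numerical stability for the logs
    return float(np.mean(np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)))

def domain_alignment_loss(audio_feats, visual_feats):
    # Reduce the audio/visual domain gap by matching first- and
    # second-order feature statistics (mean and covariance), in the
    # spirit of CORAL-style alignment. Illustrative assumption.
    mean_gap = np.sum((audio_feats.mean(axis=0) - visual_feats.mean(axis=0)) ** 2)
    cov_gap = np.sum((np.cov(audio_feats, rowvar=False)
                      - np.cov(visual_feats, rowvar=False)) ** 2)
    return float(mean_gap + cov_gap)
```

In a teacher-student setup, the distillation term would be applied in both directions (audio teacher → visual student and vice versa), with the alignment term bringing the two feature distributions close enough for the distillation to be effective.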
