Paper Title

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Authors

Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo Li, Yousong Zhu, Jinqiao Wang, Si Liu

Abstract

We present a simple yet effective end-to-end Video-Language Pre-training (VidLP) framework, Masked Contrastive Video-Language Pre-training (MAC), for video-text retrieval tasks. Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model via a mask sampling mechanism, improving pre-training efficiency. In contrast to conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a mask-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn robust and general multimodal representations from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: FLOPs are reduced by 60%, pre-training is accelerated by 3x, and performance improves. Our MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input modalities: with minimal modifications, we achieve competitive results on image-text retrieval tasks.
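The sparse spatial sampling described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `mask_sampling`, the 75% mask ratio, and the toy patch shapes are all assumptions for demonstration; the only idea taken from the abstract is that a high ratio of spatial patches is randomly dropped and only the visible ones are passed on to the encoder.

```python
import numpy as np

def mask_sampling(patches, mask_ratio=0.75, rng=None):
    """Randomly drop a high ratio of patch tokens, returning only the
    visible patches (in their original order) plus their indices.
    Hypothetical sketch of the sparse spatial sampling in the abstract."""
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))  # e.g. keep 25%
    keep_idx = np.sort(rng.permutation(n)[:n_keep])    # random visible subset
    return patches[keep_idx], keep_idx

# Toy example: 16 patch embeddings of dimension 8 for one video frame.
patches = np.random.default_rng(1).standard_normal((16, 8))
visible, idx = mask_sampling(patches, mask_ratio=0.75)
# Only the visible patches (4 of 16 here) would be fed to the encoder,
# which is where the claimed FLOPs reduction comes from.
```

Because the encoder only ever sees the visible subset, its cost scales with `1 - mask_ratio` rather than the full patch count, which matches the abstract's framing of masking as a sampling (efficiency) mechanism rather than a reconstruction target.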
