Paper Title

Scaling Language-Image Pre-training via Masking

Authors

Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

Abstract

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
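
To make the masking idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of randomly removing a large fraction of image patch tokens before the image encoder and then contrasting the resulting image and text embeddings with a CLIP-style symmetric loss. The function names, the toy mean-pooled "encoder", and the 50% mask ratio are illustrative assumptions; FLIP's actual encoders, mask ratios, and training recipe are specified in the paper.

```python
# Minimal sketch of FLIP-style random patch masking + CLIP-style contrastive loss.
# Hypothetical helper names; only standard PyTorch APIs are used.
import torch
import torch.nn.functional as F


def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Keep a random subset of patch tokens per image.

    patch_tokens: (batch, num_patches, dim)
    Returns kept tokens of shape (batch, num_kept, dim).
    """
    b, n, d = patch_tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    noise = torch.rand(b, n, device=patch_tokens.device)       # per-image random scores
    keep_idx = noise.argsort(dim=1)[:, :num_keep]               # (b, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)         # (b, num_keep, d)
    return torch.gather(patch_tokens, dim=1, index=keep_idx)


def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07):
    """Symmetric InfoNCE loss over matched (image, text) embeddings of shape (batch, dim)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy tensors standing in for ViT patch embeddings and text-encoder outputs.
    batch, num_patches, dim = 8, 196, 64
    patch_tokens = torch.randn(batch, num_patches, dim)
    text_emb = torch.randn(batch, dim)

    kept = random_mask_patches(patch_tokens, mask_ratio=0.5)    # half the patches are removed
    image_emb = kept.mean(dim=1)                                # stand-in for the encoder's pooled output
    loss = clip_style_loss(image_emb, text_emb)
    print(kept.shape, float(loss))
```

Because the image encoder only sees the kept tokens, each training step is cheaper, which is what allows more image-text pairs per unit of wall-clock time and larger contrastive batches at a similar memory footprint.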
