Paper Title
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Paper Authors
Paper Abstract
Existing pre-training methods focus either on single-modal tasks or on multi-modal tasks, and cannot effectively adapt to the other; they can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free-text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As non-paired single-modal data is abundant, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are publicly available at the UNIMO project page: https://unimo-ptm.github.io/
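To illustrate the kind of objective the abstract refers to, the sketch below implements a generic InfoNCE-style cross-modal contrastive loss over a batch of image-text pairs: matched pairs are pulled together and in-batch mismatches pushed apart in a shared embedding space. This is a minimal sketch under common assumptions, not UNIMO's exact CMCL formulation (the paper constructs positives and negatives in its own way); the function name, temperature value, and use of in-batch negatives are illustrative choices.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Generic InfoNCE-style cross-modal contrastive loss (illustrative only).

    text_emb, image_emb: (batch, dim) embeddings of paired texts and images.
    Row i of each tensor is assumed to be a positive pair; all other rows in
    the batch serve as in-batch negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise similarity matrix between every text and every image.
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: align text-to-image and image-to-text directions.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2
```

In practice, such an objective is computed on embeddings produced by the model's text and image encoders for a batch of image-text pairs, and added to the single-modal pre-training losses.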