Paper Title

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

Paper Authors

Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen Jiang, Tao Xiang, Ning Zhang

Paper Abstract

Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems - e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
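
The abstract's first contribution, weakly-supervised triplets built from fashion image-text pairs, can be illustrated with a small sketch. The paper's actual construction procedure is not given here, so everything below (the `ItemPair` type, the `build_weak_triplets` heuristic, the `min_overlap` threshold, and the file names) is a hypothetical illustration: it pairs catalog items whose captions overlap and treats the words unique to the target caption as weak text feedback, yielding (reference image, feedback text, target image) triplets of the kind used in image retrieval with text feedback.

```python
# Hypothetical sketch only: FaD-VLP's real triplet construction is not
# described in the abstract. This shows one plausible weak-supervision
# heuristic over fashion image-text pairs.
from dataclasses import dataclass

@dataclass
class ItemPair:
    image_path: str  # path to a product image
    caption: str     # its associated text description

def build_weak_triplets(pairs, min_overlap=2):
    """Pair items whose captions share at least `min_overlap` words;
    words unique to the target caption act as weak text feedback."""
    triplets = []
    for i, ref in enumerate(pairs):
        ref_words = set(ref.caption.lower().split())
        for tgt in pairs[i + 1:]:
            tgt_words = set(tgt.caption.lower().split())
            if len(ref_words & tgt_words) >= min_overlap:
                feedback = " ".join(sorted(tgt_words - ref_words))
                if feedback:
                    triplets.append((ref.image_path, feedback, tgt.image_path))
    return triplets

if __name__ == "__main__":
    catalog = [
        ItemPair("img_001.jpg", "red floral summer dress"),
        ItemPair("img_002.jpg", "blue floral summer dress"),
        ItemPair("img_003.jpg", "black leather jacket"),
    ]
    for ref, feedback, tgt in build_weak_triplets(catalog):
        print(f"{ref} + '{feedback}' -> {tgt}")
        # e.g., img_001.jpg + 'blue' -> img_002.jpg
```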
