Paper Title


Semantic-Conditional Diffusion Networks for Image Captioning

Paper Authors

Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei

Paper Abstract


Recent advances on text-to-image generation have witnessed the rise of diffusion models which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions in learning Transformer-based encoder-decoder, and propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search the semantically relevant sentences via cross-modal retrieval model to convey the comprehensive semantic information. The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visional-language alignment and linguistical coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet}.
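The pipeline the abstract describes (retrieve semantically relevant sentences, use them as a semantic prior, then let stacked Diffusion Transformers refine the sentence in a cascade) can be illustrated with a deliberately toy sketch. This is not the authors' implementation; every name here (`retrieve_semantics`, `DiffusionStage`, `scd_net`) is hypothetical, and the "denoising" step is a crude stand-in for the real learned model:

```python
# Toy sketch of SCD-Net's high-level flow (NOT the authors' code):
# 1) cross-modal retrieval yields a semantic prior (a bag of words),
# 2) a sentence starts as "noise" (random tokens),
# 3) stacked stages progressively replace tokens not supported by the
#    prior, mimicking cascaded diffusion refinement.
import random

VOCAB = ["a", "dog", "runs", "on", "grass", "<pad>"]

def retrieve_semantics(image_id, memory):
    # Stand-in for the cross-modal retrieval model: pool the words of
    # sentences associated with the image into a semantic prior.
    words = set()
    for sent in memory.get(image_id, []):
        words.update(sent.split())
    return words

class DiffusionStage:
    """One toy refinement stage: keep tokens endorsed by the semantic
    prior, resample the rest from the prior."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def denoise(self, tokens, prior):
        choices = sorted(prior)
        return [t if t in prior else self.rng.choice(choices)
                for t in tokens]

def scd_net(image_id, memory, length=5, num_stages=3):
    prior = retrieve_semantics(image_id, memory)
    rng = random.Random(0)
    tokens = [rng.choice(VOCAB) for _ in range(length)]   # noisy start
    for s in range(num_stages):                           # cascade
        tokens = DiffusionStage(seed=s).denoise(tokens, prior)
    return tokens

memory = {"img1": ["a dog runs on grass"]}
print(scd_net("img1", memory))
```

In the real SCD-Net each stage is a learned Transformer operating on continuous token representations conditioned on image features, and training is further stabilized by the self-critical strategy guided by an autoregressive teacher; this sketch only conveys the retrieve-then-cascade structure.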
