Paper Title

MaskGIT: Masked Generative Image Transformer

Paper Authors

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman

Paper Abstract

Generative transformers have experienced rapid popularity growth in the computer vision community for synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. In addition, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.
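The iterative decoding loop described in the abstract (generate all tokens at once, keep the confident ones, re-mask and re-predict the rest) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `predict_fn`, the `MASK` sentinel, and the toy sizes are assumptions introduced here, while the cosine mask schedule is the scheduling function the paper reports works best.

```python
import math
import numpy as np

def maskgit_decode(predict_fn, num_tokens, steps=8):
    """Minimal sketch of MaskGIT-style iterative parallel decoding.

    predict_fn(tokens, mask) -> (ids, confidences) stands in for the
    bidirectional transformer: it must return a predicted token id and a
    confidence score for every position (hypothetical interface).
    """
    MASK = -1                                  # sentinel for a masked token
    tokens = np.full(num_tokens, MASK)
    mask = np.ones(num_tokens, dtype=bool)     # start with everything masked
    for t in range(steps):
        ids, conf = predict_fn(tokens, mask)
        # cosine schedule: fraction of tokens still masked after this step
        ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        n_masked_next = int(math.floor(ratio * num_tokens))
        # commit predictions at the currently masked positions
        tokens = np.where(mask, ids, tokens)
        # already-fixed tokens get infinite confidence so they stay fixed
        conf = np.where(mask, conf, np.inf)
        # re-mask the least confident predictions for the next round
        mask = np.zeros(num_tokens, dtype=bool)
        if n_masked_next > 0:
            worst = np.argsort(conf)[:n_masked_next]
            mask[worst] = True
            tokens[worst] = MASK
    return tokens
```

Because the cosine schedule reaches zero at the final step, every position is committed after `steps` iterations, which is how a constant number of passes (e.g. 8) replaces the hundreds of sequential steps of raster-scan autoregressive decoding.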
