Paper Title
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Paper Authors
Paper Abstract
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves $91.04\%$ and $83.51\%$ recall on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
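The abstract frames phrase extraction as a $1$D text segmentation problem: a phrase inside the caption corresponds to a binary mask over the token sequence. Below is a toy Python illustration of that framing only; the caption, tokenization, and span indices are invented for the example and are not taken from the paper.

# Toy illustration: a phrase inside a caption as a 1D binary mask over tokens.
# The caption and span below are hypothetical, chosen only for demonstration.
caption = ["a", "man", "in", "a", "red", "shirt", "rides", "a", "bike"]
phrase_span = (3, 6)  # tokens "a red shirt" (end-exclusive)

phrase_mask = [1 if phrase_span[0] <= i < phrase_span[1] else 0
               for i in range(len(caption))]
print(phrase_mask)  # [0, 0, 0, 1, 1, 1, 0, 0, 0]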
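To make the dual-query design more concrete, the sketch below shows one way a query pair could share a positional part while keeping separate content parts, one probing image features for box prediction and one probing text features for phrase-mask prediction. This is a minimal PyTorch sketch under assumed names and dimensions (DualQueries, num_queries, d_model are all illustrative), not the authors' implementation.

# Minimal sketch of dual queries: one shared positional embedding per pair,
# two branch-specific content embeddings. All sizes are assumptions.
import torch.nn as nn


class DualQueries(nn.Module):
    def __init__(self, num_queries: int = 100, d_model: int = 256):
        super().__init__()
        # One positional part shared by both queries in a pair.
        self.pos_part = nn.Embedding(num_queries, d_model)
        # Separate content parts: one for the object (box) branch,
        # one for the 1D phrase-mask branch.
        self.obj_content = nn.Embedding(num_queries, d_model)
        self.mask_content = nn.Embedding(num_queries, d_model)

    def forward(self, batch_size: int):
        pos = self.pos_part.weight.unsqueeze(0).expand(batch_size, -1, -1)
        obj_q = self.obj_content.weight.unsqueeze(0).expand(batch_size, -1, -1)
        mask_q = self.mask_content.weight.unsqueeze(0).expand(batch_size, -1, -1)
        # Each decoder query pair = (shared position, branch-specific content).
        return pos, obj_q, mask_q


if __name__ == "__main__":
    pos, obj_q, mask_q = DualQueries()(batch_size=2)
    print(pos.shape, obj_q.shape, mask_q.shape)  # all (2, 100, 256)

Combining the shared positional part with each content part before decoding would let both predictions attend from the same location while reading different modalities, which matches the stated motivation of easing image-text alignment relative to a single-query design.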