Paper Title

PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

Paper Authors

Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Si Liu

Paper Abstract

Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment visual objects of things and stuff categories described by dense narrative captions of a still image. The previous two-stage approach first extracts segmentation region proposals with an off-the-shelf panoptic segmentation model, then conducts coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline usually suffers from the performance limitation of low-quality proposals in the first stage, the loss of spatial details caused by region feature pooling, and the complicated strategies designed separately for things and stuff categories. To alleviate these drawbacks, we propose a one-stage, end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Thus, our model can exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. In addition, we propose a Language-Compatible Pixel Aggregation (LCPA) module to further enhance the discriminative ability of phrase features through multi-round refinement, which selects the most compatible pixels for each phrase to adaptively aggregate the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark with a 4.0-point absolute gain in Average Recall.
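At its core, the pixel-phrase matching described in the abstract can be read as a dot product between one embedding per noun phrase and one embedding per pixel, with each response map thresholded into a segmentation mask, while LCPA picks the pixels most compatible with a phrase and folds their features back into the phrase embedding before the next matching round. The sketch below (plain PyTorch, not the authors' released code) illustrates this reading; the tensor shapes, the top-K count, the mean-pooling aggregation, and the residual update are all illustrative assumptions.

```python
import torch

def pixel_phrase_matching(pixel_feats, phrase_feats):
    """Score every (phrase, pixel) pair with a dot product.

    pixel_feats:  (B, C, H, W) per-pixel visual embeddings
    phrase_feats: (B, N, C)    per-noun-phrase language embeddings
    returns:      (B, N, H, W) response maps; sigmoid + threshold
                  turns each map into a per-phrase segmentation mask
    """
    return torch.einsum('bnc,bchw->bnhw', phrase_feats, pixel_feats)

def lcpa_round(pixel_feats, phrase_feats, k=64):
    """One round of Language-Compatible Pixel Aggregation (sketch).

    For each phrase, select the K pixels most compatible with it
    (highest matching score) and aggregate their averaged features
    back into the phrase embedding, sharpening the next round.
    """
    B, C, H, W = pixel_feats.shape
    N = phrase_feats.shape[1]
    scores = pixel_phrase_matching(pixel_feats, phrase_feats)    # (B, N, H, W)
    topk = scores.flatten(2).topk(k, dim=-1).indices             # (B, N, K)

    pixels = pixel_feats.flatten(2).transpose(1, 2)              # (B, H*W, C)
    idx = topk.reshape(B, N * k, 1).expand(-1, -1, C)            # (B, N*K, C)
    context = pixels.gather(1, idx).reshape(B, N, k, C).mean(2)  # (B, N, C)
    return phrase_feats + context  # residual update of phrase features

# Toy usage: two refinement rounds, then final per-phrase masks.
pixel_feats = torch.randn(1, 256, 32, 32)
phrase_feats = torch.randn(1, 5, 256)
for _ in range(2):
    phrase_feats = lcpa_round(pixel_feats, phrase_feats)
masks = pixel_phrase_matching(pixel_feats, phrase_feats).sigmoid() > 0.5
```

In the paper both the aggregation and the refinement schedule are learned; the mean pooling and residual add here merely stand in for whatever trained pooling the actual LCPA module applies over the selected pixels.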
