Paper Title


Are scene graphs good enough to improve Image Captioning?

Authors

Victor Milewski, Marie-Francine Moens, Iacer Calixto

Abstract


Many top-performing image captioning models rely solely on object features computed with an object detection model to generate image descriptions. However, recent studies propose to directly use scene graphs to introduce information about object relations into captioning, hoping to better describe interactions between objects. In this work, we thoroughly investigate the use of scene graphs in image captioning. We empirically study whether using additional scene graph encoders can lead to better image descriptions and propose a conditional graph attention network (C-GAT), where the image captioning decoder state is used to condition the graph updates. Finally, we determine to what extent noise in the predicted scene graphs influences caption quality. Overall, we find no significant difference between models that use scene graph features and models that only use object detection features across different captioning metrics, which suggests that existing scene graph generation models are still too noisy to be useful in image captioning. Moreover, although the quality of predicted scene graphs is very low in general, when using high-quality scene graphs we obtain gains of up to 3.3 CIDEr compared to a strong Bottom-Up Top-Down baseline. We open-source code to reproduce all our experiments at https://github.com/iacercalixto/butd-image-captioning.
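The abstract describes C-GAT as a graph attention network whose node updates are conditioned on the image captioning decoder state. The paper's exact formulation is not given here, so the sketch below is only an illustrative, minimal numpy version of that idea: standard graph-attention logits over scene-graph edges, with a projection of the decoder state concatenated into the attention input. All names (`c_gat_step`, `Wh`, `Ws`, `a`) and the LeakyReLU/tanh choices are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c_gat_step(h, adj, s, Wh, Ws, a):
    """One conditional graph-attention update (illustrative sketch).

    h:   (N, d)   scene-graph node features (e.g. object regions)
    adj: (N, N)   adjacency matrix, 1 where an edge/relation exists
    s:   (d_s,)   current decoder hidden state (the conditioning signal)
    Wh:  (d, d')  node projection; Ws: (d_s, d') decoder-state projection
    a:   (3*d',)  attention vector over [node_i ; node_j ; decoder_state]
    """
    z = h @ Wh                                  # projected nodes, (N, d')
    c = s @ Ws                                  # projected decoder state, (d',)
    N = h.shape[0]
    zi = np.repeat(z[:, None, :], N, axis=1)    # sender features,  (N, N, d')
    zj = np.repeat(z[None, :, :], N, axis=0)    # receiver features, (N, N, d')
    cc = np.broadcast_to(c, (N, N, c.shape[0])) # decoder state at every pair
    e = np.concatenate([zi, zj, cc], axis=-1) @ a
    e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU on logits
    e = np.where(adj > 0, e, -1e9)              # mask pairs with no edge
    alpha = softmax(e, axis=1)                  # attend over neighbours j
    return np.tanh(alpha @ z)                   # updated node features, (N, d')
```

Because the decoder state enters every attention logit, the same scene graph is re-weighted differently at each decoding step, which is the conditioning behaviour the abstract attributes to C-GAT.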
