Paper Title
VQA-LOL: Visual Question Answering under the Lens of Logic
Paper Authors
Abstract
Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this \textit{Lens of Logic}, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our \textit{Lens of Logic (LOL)} model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers to the component questions and to the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
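As a rough illustration of the consistency idea behind a Fréchet-style compatibility term (this is a sketch, not the paper's actual loss): the classical Fréchet inequalities bound the probability of a conjunction or disjunction given the probabilities of its components, so a penalty can measure how far a model's answer to a composed question falls outside those bounds. All function names here are hypothetical.

```python
def frechet_bounds(p1, p2, op):
    """Fréchet bounds on P(Q1 op Q2) given P(Q1)=p1 and P(Q2)=p2."""
    if op == "and":
        # max(0, p1 + p2 - 1) <= P(Q1 AND Q2) <= min(p1, p2)
        return max(0.0, p1 + p2 - 1.0), min(p1, p2)
    if op == "or":
        # max(p1, p2) <= P(Q1 OR Q2) <= min(1, p1 + p2)
        return max(p1, p2), min(1.0, p1 + p2)
    raise ValueError(f"unknown op: {op}")

def compatibility_penalty(p_composed, p1, p2, op):
    """Zero when p_composed lies inside the Fréchet bounds,
    otherwise the distance to the nearest bound."""
    lo, hi = frechet_bounds(p1, p2, op)
    return max(0.0, lo - p_composed) + max(0.0, p_composed - hi)
```

For example, if the model assigns probability 0.6 and 0.5 to two yes/no component questions, a probability of 0.9 for their conjunction violates the upper bound min(0.6, 0.5) = 0.5 and incurs a nonzero penalty, while 0.3 lies within the bounds and incurs none.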