Paper Title

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

Authors

Yang Liu, Guanbin Li, Liang Lin

Abstract

Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering. The datasets, code, and models are available at https://github.com/HCPLab-SYSU/CMCIR.
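
The CVLR module mentioned above builds on back-door and front-door causal interventions. As a rough illustration of the back-door adjustment idea only (not the authors' implementation; see the repository linked above for the real code), the minimal PyTorch sketch below approximates P(Y | do(X)) by attending over a learned confounder dictionary and marginalizing under a prior P(z). The class name, dimensions, uniform prior, and additive fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class BackdoorIntervention(nn.Module):
    """Illustrative back-door adjustment over a confounder dictionary.

    Approximates P(Y | do(X)) = sum_z P(z) P(Y | X, z) in feature space:
    each dictionary entry z contributes according to both its prior P(z)
    and its attention affinity to the input. This is a generic sketch,
    not the CMCIR authors' implementation.
    """

    def __init__(self, feat_dim: int, num_confounders: int) -> None:
        super().__init__()
        # Learnable confounder dictionary, e.g. clustered feature prototypes.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))
        # Uniform prior P(z); a real system might estimate it from data.
        self.register_buffer(
            "prior", torch.full((num_confounders,), 1.0 / num_confounders)
        )
        self.scale = feat_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) possibly confounded features.
        affinity = torch.softmax(x @ self.confounders.t() * self.scale, dim=-1)
        # Reweight affinities by the prior P(z) and renormalize, so the
        # marginalization is driven by P(z) rather than P(z | x) alone.
        weights = affinity * self.prior
        weights = weights / weights.sum(dim=-1, keepdim=True)
        deconfounded = weights @ self.confounders  # (batch, feat_dim)
        return x + deconfounded
```

With these hypothetical names, a call such as `BackdoorIntervention(512, 64)(features)` would return features augmented by the prior-weighted confounder summary, which is the general shape that dictionary-based back-door approximations take.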
