Paper title
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
Paper authors
Paper abstract
Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, the internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner workings, as unraveling these black boxes will lead to more capable and trustworthy models. To contribute to this quest, we propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers. VL-InterpreT is a task-agnostic, integrated tool that (1) tracks a variety of statistics in attention heads throughout all layers for both vision and language components, (2) visualizes cross-modal and intra-modal attentions through easily readable heatmaps, and (3) plots the hidden representations of vision and language tokens as they pass through the transformer layers. In this paper, we demonstrate the functionalities of VL-InterpreT through the analysis of KD-VLP, an end-to-end pretrained, transformer-based vision-language model, on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks. Furthermore, we present a few interesting findings about multimodal transformer behaviors that we learned through our tool.
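To make the notion of cross-modal and intra-modal attention statistics concrete, the following is a minimal sketch of one such statistic for a single attention head. It is not taken from VL-InterpreT's code: the function name `modality_attention_stats`, the input layout (text tokens first, image tokens after), and the choice of statistic (mean attention mass per source token) are all assumptions made for illustration.

```python
def modality_attention_stats(attn, num_text):
    """Mean attention mass within and across modalities for one head.

    Hypothetical helper, not part of VL-InterpreT.

    attn: square list of lists; attn[i][j] is the attention weight from
          token i to token j, with each row summing to 1 (softmax output).
    num_text: number of leading text tokens; the remaining tokens are
          assumed to be image tokens.
    """
    n = len(attn)
    if not 0 < num_text < n:
        raise ValueError("need at least one token in each modality")

    sums = {
        "text->text": 0.0,
        "text->image": 0.0,
        "image->text": 0.0,
        "image->image": 0.0,
    }
    for i, row in enumerate(attn):
        src = "text" if i < num_text else "image"
        # Split each row's attention mass by the modality it lands on.
        sums[src + "->text"] += sum(row[:num_text])
        sums[src + "->image"] += sum(row[num_text:])

    # Average over the number of source tokens in each modality.
    counts = {"text": num_text, "image": n - num_text}
    return {key: total / counts[key.split("->")[0]] for key, total in sums.items()}


# Toy 4-token example: 2 text tokens followed by 2 image tokens.
toy_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
stats = modality_attention_stats(toy_attn, num_text=2)
```

Computing these four numbers for every head in every layer gives a layer-by-head grid that could be rendered as a heatmap, which is one plausible way to summarize where cross-modal interaction is concentrated.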