Paper Title

Learnable Visual Words for Interpretable Image Recognition

Paper Authors

Wenxiao Xiao, Zhengming Ding, Hongfu Liu

Paper Abstract

To interpret deep models' predictions, attention-based visual cues are widely used to address \textit{why} deep models make such predictions. Beyond that, the research community has become increasingly interested in reasoning about \textit{how} deep models make predictions, where some prototype-based methods employ interpretable representations with corresponding visual cues to reveal the black-box mechanism behind deep model behaviors. However, these pioneering attempts either learn only category-specific prototypes, which deteriorates their generalization capacity, or demonstrate a few illustrative examples without any quantitative evaluation of visual-based interpretability, which further limits their practical use. In this paper, we revisit the concept of visual words and propose Learnable Visual Words (LVW) to interpret model prediction behaviors with two novel modules: semantic visual word learning and dual fidelity preservation. Semantic visual word learning relaxes the category-specific constraint, enabling general visual words to be shared across different categories. Beyond employing the visual words for prediction to align them with the base model, our dual fidelity preservation also includes attention-guided semantic alignment, which encourages the learned visual words to focus on the same conceptual regions for prediction. Experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW over state-of-the-art methods in both accuracy and model interpretation. Moreover, we provide various in-depth analyses to further explore the learned visual words and the generalizability of our method to unseen categories.
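
The abstract outlines two components: a dictionary of visual words shared across all categories, and a dual fidelity objective that aligns both the predictions and the attended regions with the base model. The page gives no implementation details, so the following is only a minimal, hypothetical PyTorch sketch of how such a head and objective might be structured; `LVWHead`, `dual_fidelity_loss`, and every shape and hyperparameter (`num_words`, `alpha`, `beta`, the form of `base_attn`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LVWHead(nn.Module):
    """Hypothetical learnable-visual-words head.

    A single dictionary of visual words is shared across all categories
    (no category-specific constraint) and matched against spatial
    features from a backbone.
    """

    def __init__(self, feat_dim: int, num_words: int, num_classes: int):
        super().__init__()
        # Shared dictionary of visual words: (num_words, feat_dim).
        self.words = nn.Parameter(torch.randn(num_words, feat_dim))
        # Linear classifier over per-word activation scores.
        self.classifier = nn.Linear(num_words, num_classes)

    def forward(self, feat_map: torch.Tensor):
        # feat_map: (B, C, H, W) backbone features.
        B, C, H, W = feat_map.shape
        feats = feat_map.flatten(2).transpose(1, 2)  # (B, HW, C)
        # Cosine similarity between each location and each visual word.
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.words, dim=-1).T
        # Per-word activation maps and max-pooled word scores.
        word_maps = sim.transpose(1, 2).reshape(B, -1, H, W)   # (B, K, H, W)
        word_scores = word_maps.flatten(2).max(dim=2).values   # (B, K)
        return self.classifier(word_scores), word_maps


def dual_fidelity_loss(lvw_logits, base_logits, word_maps, base_attn,
                       labels, alpha=1.0, beta=1.0):
    """Hypothetical dual-fidelity objective: classification loss plus
    (i) prediction alignment with the base model and
    (ii) attention-guided alignment of the word activations with the
    base model's attention map over conceptual regions."""
    cls = F.cross_entropy(lvw_logits, labels)
    # (i) Prediction fidelity: match the base model's output distribution.
    pred = F.kl_div(F.log_softmax(lvw_logits, dim=1),
                    F.softmax(base_logits, dim=1), reduction="batchmean")
    # (ii) Semantic fidelity: aggregate word maps and compare them to an
    # assumed base attention map of shape (B, 1, H, W) with values in [0, 1].
    agg = word_maps.mean(dim=1, keepdim=True)
    attn = F.mse_loss(torch.sigmoid(agg), base_attn)
    return cls + alpha * pred + beta * attn
```

Max-pooling the per-word activation maps is one plausible way to expose which image region each visual word fires on, which is what would let the word maps double as visual explanations; the paper itself may aggregate activations differently.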
