Paper Title
Explaining Away Attacks Against Neural Networks
Paper Authors
Paper Abstract
We investigate the problem of identifying adversarial attacks on image-based neural networks. We present intriguing experimental results showing significant discrepancies between the explanations generated for the predictions of a model on clean and adversarial data. Utilizing this intuition, we propose a framework which can identify whether a given input is adversarial based on the explanations given by the model. Code for our experiments can be found here: https://github.com/seansaito/Explaining-Away-Attacks-Against-Neural-Networks.
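The abstract describes detecting adversarial inputs from the model's explanations. Below is a minimal, hypothetical sketch of that idea (not the authors' implementation): it computes a simple gradient-based saliency explanation for the model's prediction and passes it to a separate binary detector. The network architectures, input shapes, and helper names are all assumptions for illustration.

```python
# Hypothetical sketch of "detect adversarial inputs from explanations".
# Not the paper's code; the classifier, detector, and shapes are placeholders.
import torch
import torch.nn as nn

def saliency_explanation(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Explanation = |gradient of the top predicted logit w.r.t. the input|."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    top_class = logits.argmax(dim=1)
    score = logits.gather(1, top_class.unsqueeze(1)).sum()
    score.backward()
    return x.grad.abs()  # saliency map with the same shape as the input

class ExplanationDetector(nn.Module):
    """Binary classifier over flattened explanations (clean vs. adversarial)."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, explanation: torch.Tensor) -> torch.Tensor:
        return self.net(explanation)

# Example usage with toy shapes (assumed, not from the paper):
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
detector = ExplanationDetector(input_dim=3 * 32 * 32)
x = torch.rand(1, 3, 32, 32)                          # candidate input image
explanation = saliency_explanation(classifier, x)
is_adversarial = detector(explanation).argmax(dim=1)  # 0 = clean, 1 = adversarial
```

In this sketch the detector would be trained on explanations collected from known clean and known adversarial examples; any explanation method (saliency, Grad-CAM, etc.) could stand in for the gradient map used here.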