Paper Title
Explaining Away Attacks Against Neural Networks
Paper Authors
Paper Abstract
We investigate the problem of identifying adversarial attacks on image-based neural networks. We present intriguing experimental results showing significant discrepancies between the explanations generated for the predictions of a model on clean and adversarial data. Utilizing this intuition, we propose a framework which can identify whether a given input is adversarial based on the explanations given by the model. Code for our experiments can be found here: https://github.com/seansaito/Explaining-Away-Attacks-Against-Neural-Networks.
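The abstract describes detecting adversarial inputs from the model's explanations. Below is a minimal, hypothetical sketch of that idea (not the authors' implementation): it computes a simple gradient-based saliency explanation for the model's prediction and passes it to a separate binary detector. The network architectures, input shapes, and helper names are all assumptions for illustration.

```python
# Hypothetical sketch of "detect adversarial inputs from explanations".
# Not the paper's code; the classifier, detector, and shapes are placeholders.
import torch
import torch.nn as nn

def saliency_explanation(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Explanation = |gradient of the top predicted logit w.r.t. the input|."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    top_class = logits.argmax(dim=1)
    score = logits.gather(1, top_class.unsqueeze(1)).sum()
    score.backward()
    return x.grad.abs()  # saliency map with the same shape as the input

class ExplanationDetector(nn.Module):
    """Binary classifier over flattened explanations (clean vs. adversarial)."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, explanation: torch.Tensor) -> torch.Tensor:
        return self.net(explanation)

# Example usage with toy shapes (assumed, not from the paper):
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
detector = ExplanationDetector(input_dim=3 * 32 * 32)
x = torch.rand(1, 3, 32, 32)                          # candidate input image
explanation = saliency_explanation(classifier, x)
is_adversarial = detector(explanation).argmax(dim=1)  # 0 = clean, 1 = adversarial
```

In this sketch the detector would be trained on explanations collected from known clean and known adversarial examples; any explanation method (saliency, Grad-CAM, etc.) could stand in for the gradient map used here.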