Paper Title

Generating End-to-End Adversarial Examples for Malware Classifiers Using Explainability

Authors

Ishai Rosenberg, Shai Meir, Jonathan Berrebi, Ilay Gordon, Guillaume Sicard, Eli David

Abstract

In recent years, the topic of explainable machine learning (ML) has been extensively researched. Until now, this research has focused on regular ML user use cases, such as debugging an ML model. This paper takes a different posture and shows that adversaries can leverage explainable ML to bypass malware classifiers that use multiple feature types. Previous adversarial attacks against such classifiers only add new features and do not modify existing ones, to avoid harming the modified malware executable's functionality. Current attacks use a single algorithm that both selects which features to modify and modifies them blindly, treating all features the same. In this paper, we present a different approach. We split the adversarial example generation task into two parts: first, we find the importance of all features for a specific sample using explainability algorithms, and then we conduct a feature-specific modification, feature by feature. In order to apply our attack in black-box scenarios, we introduce the concept of transferability of explainability: applying explainability algorithms to different classifiers, using different feature subsets and trained on different datasets, still results in a similar subset of important features. We conclude that explainability algorithms can be leveraged by adversaries, and thus the advocates of training more interpretable classifiers should consider the trade-off of those classifiers' higher vulnerability to adversarial attacks.
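To make the two-part structure of the attack concrete, below is a minimal sketch in Python, assuming a scikit-learn-style classifier and the SHAP library for the importance-ranking step. The names `feature_modifiers`, `budget`, and the 0.5 decision threshold are hypothetical illustrations, not the authors' implementation; in the black-box scenario, `model` would be a locally trained surrogate, relying on the transferability-of-explainability observation described in the abstract.

```python
import numpy as np
import shap

def generate_adversarial_example(model, x, background, feature_modifiers, budget=10):
    """Sketch of the two-part attack: (1) rank features by per-sample importance
    using an explainability algorithm, then (2) apply feature-specific,
    functionality-preserving modifications, feature by feature."""
    # Step 1: estimate how much each feature pushes this sample toward "malicious".
    explainer = shap.KernelExplainer(model.predict_proba, background)
    sv = explainer.shap_values(x)
    # Older SHAP versions return a list of per-class arrays; newer ones a single array.
    importance = np.abs(sv[1]) if isinstance(sv, list) else np.abs(np.asarray(sv)[..., 1]).ravel()

    # Step 2: walk the features from most to least important. The hypothetical
    # `feature_modifiers` dict maps a feature index to a transformation that is
    # safe for that specific feature type; features with no entry are left alone,
    # since changing them could break the malware executable.
    x_adv = x.copy()
    for idx in np.argsort(importance)[::-1][:budget]:
        modifier = feature_modifiers.get(idx)
        if modifier is None:
            continue
        x_adv[idx] = modifier(x_adv[idx])
        if model.predict_proba([x_adv])[0, 1] < 0.5:  # now classified as benign
            break
    return x_adv
```

The key design point this sketch illustrates is the separation of concerns: the explainability algorithm only decides *which* features matter for this sample, while a separate per-feature-type modifier decides *how* each one may be changed without harming functionality, in contrast to prior attacks that handled both decisions with one blind algorithm.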
