Paper Title

CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain

Paper Authors

Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, Christian Reuter

Paper Abstract

The field of cybersecurity is evolving fast. Experts need to be informed about past, current, and, in the best case, upcoming threats, because attacks are becoming more advanced, targets bigger, and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the textual domain, pre-trained language models such as BERT have proven helpful by providing a good baseline for further fine-tuning. However, due to the specialized knowledge and many technical terms in cybersecurity, general language models might miss the gist of textual information and thus do more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model performs best in specific application scenarios. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The dataset used and the trained model are made publicly available.
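Since the trained model is released publicly, it can be loaded like any other BERT checkpoint and fine-tuned for the extrinsic tasks the abstract mentions (sequence tagging and classification). A minimal sketch with Hugging Face Transformers follows; the Hub identifier markusbayer/CySecBERT and the two-label classification setup are illustrative assumptions, not details stated in the abstract.

# Minimal sketch: using a domain-adapted BERT checkpoint as the backbone
# of a cybersecurity text classifier with Hugging Face Transformers.
# The model ID below is an assumption; substitute the identifier from
# the authors' public release.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "markusbayer/CySecBERT"  # hypothetical Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Encode a cybersecurity-flavored sentence and run a forward pass.
inputs = tokenizer(
    "CVE-2021-44228 allows remote code execution in Log4j.",
    return_tensors="pt",
)
logits = model(**inputs).logits  # classification head is untrained; fine-tune before use
print(logits.shape)  # (1, 2)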
