Paper Title

HateCheck: Functional Tests for Hate Speech Detection Models

Authors

Röttger, Paul, Vidgen, Bertram, Nguyen, Dong, Waseem, Zeerak, Margetts, Helen, Pierrehumbert, Janet B.

Abstract

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck's utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
