A Framework for Constructing a Huge Name Disambiguation Dataset: Algorithms, Visualization and Human Collaboration

Paper Title

A framework for constructing a huge name disambiguation dataset: algorithms, visualization and human collaboration

Authors

Xiao, Zhuoyue; Zhang, Yutao; Chen, Bo; Liu, Xiaozhao; Tang, Jie

Abstract

We present a manually labeled Author Name Disambiguation (AND) dataset called WhoisWho, which consists of 399,255 documents and 45,187 distinct authors sharing 421 ambiguous author names. To label such a large amount of AND data with high accuracy, we propose a novel annotation framework in which humans and computers collaborate efficiently and precisely. Within the framework, we also propose an inductive disambiguation model that classifies whether two documents belong to the same author. We evaluate the proposed method and other state-of-the-art disambiguation methods on WhoisWho. The experimental results show that: (1) our model outperforms other disambiguation algorithms on this challenging benchmark; (2) the AND problem remains largely unsolved and requires more in-depth research. We believe that such a large-scale benchmark will bring great value to the author name disambiguation task. We also conduct several experiments showing that our annotation framework helps annotators produce accurate results efficiently and effectively eliminates labeling errors made by human annotators.
