Paper Title
Improving Broad-Coverage Medical Entity Linking with Semantic Type Prediction and Large-Scale Datasets
Authors
Abstract
Medical entity linking is the task of identifying and standardizing medical concepts referred to in an unstructured text. Most of the existing methods adopt a three-step approach of (1) detecting mentions, (2) generating a list of candidate concepts, and finally (3) picking the best concept among them. In this paper, we probe into alleviating the problem of overgeneration of candidate concepts in the candidate generation module, the most under-studied component of medical entity linking. For this, we present MedType, a fully modular system that prunes out irrelevant candidate concepts based on the predicted semantic type of an entity mention. We incorporate MedType into five off-the-shelf toolkits for medical entity linking and demonstrate that it consistently improves entity linking performance across several benchmark datasets. To address the dearth of annotated training data for medical entity linking, we present WikiMed and PubMedDS, two large-scale medical entity linking datasets, and demonstrate that pre-training MedType on these datasets further improves entity linking performance. We make our source code and datasets publicly available for medical entity linking research.
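The pruning idea described in the abstract can be illustrated with a small sketch. This is not the MedType implementation: the `Candidate` class, the `prune_by_type` and `link` functions, the concept identifiers, the type names, and the scores are all hypothetical and chosen only for illustration. The fallback to the unpruned list when no candidate matches is likewise an assumption of this sketch, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str            # placeholder identifier, not a real concept ID
    semantic_types: frozenset  # semantic types attached to the concept
    score: float               # score assigned by the base entity linker


def prune_by_type(candidates, predicted_types):
    """Keep candidates that share at least one type with the prediction."""
    kept = [c for c in candidates if c.semantic_types & predicted_types]
    # Assumption of this sketch: if nothing survives, keep the original list.
    return kept or list(candidates)


def link(candidates, predicted_types):
    """Step (3): pick the highest-scoring candidate after type-based pruning."""
    pruned = prune_by_type(candidates, predicted_types)
    return max(pruned, key=lambda c: c.score) if pruned else None


# Hypothetical candidates generated for the ambiguous mention "cold".
CANDIDATES = [
    Candidate("C_TEMPERATURE", frozenset({"Phenomenon"}), 0.6),
    Candidate("C_COMMON_COLD", frozenset({"Disease"}), 0.5),
]

# Without type information the base linker would prefer C_TEMPERATURE;
# with a predicted type of "Disease" that candidate is pruned first.
best = link(CANDIDATES, frozenset({"Disease"}))
print(best.concept_id)
```

Because pruning only filters the candidate list before the final selection step, a filter of this kind can sit between the candidate generator and the ranker of an existing linker without changing either one, which is consistent with the modular design the abstract describes.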