Paper title
Vocabulary Transfer for Biomedical Texts: Add Tokens if You Can Not Add Data
Authors
Abstract
Working within specific NLP subdomains presents significant challenges, primarily due to a persistent deficit of data. Stringent privacy concerns and limited data accessibility often drive this shortage. Additionally, the medical domain demands high accuracy, where even marginal improvements in model performance can have profound impacts. In this study, we investigate the potential of vocabulary transfer to enhance model performance in biomedical NLP tasks. Specifically, we focus on vocabulary extension, a technique that expands the target vocabulary to incorporate domain-specific biomedical terms. Our findings demonstrate that vocabulary extension leads to measurable improvements in both downstream model performance and inference time.
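To make the idea of vocabulary extension concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how new tokens are initialized, so the sketch assumes one common heuristic, in which each new term's embedding is the mean of the embeddings of the sub-tokens it was previously split into. The toy vocabulary, the embedding values, and the helper names `greedy_tokenize` and `extend_vocabulary` are all hypothetical.

```python
from typing import Dict, List, Tuple


def greedy_tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    """Segment `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return tokens


def extend_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    new_terms: List[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return extended copies of `vocab` and `embeddings`.

    Assumption (not taken from the paper): each new term is initialized
    with the mean of the vectors of its old sub-tokens.
    """
    new_vocab = dict(vocab)
    new_embeddings = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    for term in new_terms:
        if term in new_vocab:
            continue  # already a token, nothing to add
        pieces = greedy_tokenize(term, vocab)
        mean = [
            sum(embeddings[vocab[p]][d] for p in pieces) / len(pieces)
            for d in range(dim)
        ]
        new_vocab[term] = len(new_embeddings)
        new_embeddings.append(mean)
    return new_vocab, new_embeddings


# Toy general-domain vocabulary and 2-dimensional embeddings (hypothetical values).
vocab = {"cardio": 0, "myo": 1, "pathy": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

new_vocab, new_embeddings = extend_vocabulary(vocab, embeddings, ["cardiomyopathy"])

tokens_before = greedy_tokenize("cardiomyopathy", vocab)
tokens_after = greedy_tokenize("cardiomyopathy", new_vocab)
print(len(tokens_before), len(tokens_after))
```

In this toy setup the biomedical term is split into three pieces under the original vocabulary and becomes a single token after extension. Shorter token sequences are one plausible source of an inference-time gain of the kind the abstract reports, although the abstract itself does not describe the mechanism.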