Paper Title
Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling
Paper Authors
Paper Abstract
This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings -- words from one language that are introduced into another without orthographic adaptation -- and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.
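To make the sequence labeling framing concrete, here is a minimal sketch of how borrowing identification could be cast as token-level tagging and scored at the span level. The BIO scheme, the `ENG` label name, the example sentence, and the helper names `bio_to_spans` and `span_f1` are all assumptions made for illustration; the abstract does not specify the annotation format or the evaluation metric.

```python
from typing import List, Tuple

# A span is (start index, end index exclusive, label).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into labeled spans.

    Lenient decoding (an assumption): an I- tag that does not continue
    a span with the same label is treated as the start of a new span.
    """
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match span precision, recall, and F1 for one sentence."""
    gold_spans = set(bio_to_spans(gold))
    pred_spans = set(bio_to_spans(pred))
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two unassimilated borrowings in a Spanish sentence.
tokens = ["El", "podcast", "sobre", "machine", "learning", "fue", "un", "éxito"]
gold_tags = ["O", "B-ENG", "O", "B-ENG", "I-ENG", "O", "O", "O"]
# A hypothetical model output that truncates the multiword borrowing.
pred_tags = ["O", "B-ENG", "O", "B-ENG", "O", "O", "O", "O"]

print(bio_to_spans(gold_tags))
print(span_f1(gold_tags, pred_tags))
```

Under exact-match scoring, the truncated prediction for "machine learning" earns no credit, which is why multiword borrowings tend to be harder for a tagger than single-token ones.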