MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

Paper Title

MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

Authors

Lifeng Han, Gareth J. F. Jones, Alan F. Smeaton

Abstract

Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, including MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project, a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. After filtering, our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora for their own models or use them in a knowledge base as model features.

Paper Title