Paper Title
AdMix: A Mixed Sample Data Augmentation Method for Neural Machine Translation
Authors
Abstract
In Neural Machine Translation (NMT), data augmentation methods such as back-translation have proven effective at improving translation performance. In this paper, we propose a novel data augmentation approach for NMT that does not depend on any additional training data. Our approach, AdMix, consists of two parts: 1) introducing faint discrete noise (word replacement, word dropping, word swapping) into the original sentence pairs to form augmented samples; 2) generating new synthetic training data by softly mixing the augmented samples with their original samples in the training corpus. Experiments on three translation datasets of different scales show that AdMix achieves significant improvements (1.0 to 2.7 BLEU points) over a strong Transformer baseline. When combined with other data augmentation techniques (e.g., back-translation), our approach can obtain further improvements.
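The two steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify noise rates, how dropped words are handled, or the mixing formula, so everything below is an assumption: `add_faint_noise`, `soft_mix`, the `PAD` placeholder, the per-token noise probability `p`, and the mixing weight `lam` are hypothetical names, and the "soft mixing" is modeled here as a position-wise convex combination of embedding vectors that favors the original sample.

```python
import random

PAD = "<pad>"  # hypothetical placeholder: a dropped word keeps its position


def add_faint_noise(tokens, vocab, p=0.1, rng=None):
    """Perturb each position with probability p using one of three
    discrete operations: word replacement, word dropping, word swapping.
    The output always has the same length as the input (an assumption
    made so the two samples can be mixed position by position)."""
    rng = rng or random.Random()
    out = list(tokens)
    i = 0
    while i < len(out):
        if rng.random() < p:
            op = rng.choice(["replace", "drop", "swap"])
            if op == "replace":
                out[i] = rng.choice(vocab)
            elif op == "drop":
                out[i] = PAD
            elif i + 1 < len(out):  # swap with the right-hand neighbor
                out[i], out[i + 1] = out[i + 1], out[i]
                i += 1  # skip the neighbor that was just moved
        i += 1
    return out


def soft_mix(orig_vecs, aug_vecs, lam=0.9):
    """Convex combination of two equal-length sequences of embedding
    vectors; lam close to 1 keeps the result near the original sample."""
    return [
        [lam * o + (1.0 - lam) * a for o, a in zip(ov, av)]
        for ov, av in zip(orig_vecs, aug_vecs)
    ]
```

In a training pipeline, a sketch like this would presumably be applied to both the source and target side of each sentence pair, with the mixed vectors fed to the model in place of the plain embeddings; the actual AdMix procedure may differ in these details.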