Paper Title

Fast Cross-domain Data Augmentation through Neural Sentence Editing

Paper Authors

Raille, Guillaume, Djambazovska, Sandra, Musat, Claudiu

Paper Abstract

Data augmentation promises to alleviate data scarcity. This is most important in cases where the initial data is in short supply. This is, for existing methods, also where augmenting is the most difficult, as learning the full data distribution is impossible. For natural language, sentence editing offers a solution - relying on small but meaningful changes to the original ones. Learning which changes are meaningful also requires large amounts of training data. We thus aim to learn this in a source domain where data is abundant and apply it in a different, target domain, where data is scarce - cross-domain augmentation. We create the Edit-transformer, a Transformer-based sentence editor that is significantly faster than the state of the art and also works cross-domain. We argue that, due to its structure, the Edit-transformer is better suited for cross-domain environments than its edit-based predecessors. We show this performance gap on the Yelp-Wikipedia domain pairs. Finally, we show that due to this cross-domain performance advantage, the Edit-transformer leads to meaningful performance gains in several downstream tasks.
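The abstract's core idea is augmentation by small, local edits to an existing sentence rather than generating whole new sentences from a learned distribution. The sketch below is a hypothetical, rule-based illustration of that edit principle only (token deletions and adjacent swaps); it is not the paper's Edit-transformer, which learns which edits are meaningful from a data-rich source domain.

```python
import random

def edit_augment(sentence, delete_prob=0.1, swap_prob=0.1, seed=None):
    """Toy edit-based augmentation: applies small token-level edits
    (deletions and adjacent swaps) to produce a perturbed variant.
    A learned editor like the Edit-transformer would instead choose
    edits predicted to preserve meaning."""
    rng = random.Random(seed)
    tokens = sentence.split()

    # Randomly drop tokens, but keep very short sentences intact.
    kept = [t for t in tokens
            if len(tokens) <= 3 or rng.random() >= delete_prob]

    # Randomly swap some adjacent token pairs.
    for i in range(len(kept) - 1):
        if rng.random() < swap_prob:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]

    return " ".join(kept)
```

With both edit probabilities set to zero the input passes through unchanged, which makes the edit operations easy to verify in isolation.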
