TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding

Paper Title

TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding

Authors

Le Zhang, Zichao Yang, Diyi Yang

Abstract

Data augmentation is an effective approach to tackling over-fitting. Many previous works have proposed data augmentation strategies for NLP, such as noise injection, word replacement, and back-translation. Though effective, they miss one important characteristic of language, compositionality: the meaning of a complex expression is built from its sub-parts. Motivated by this, we propose TreeMix, a compositional data augmentation approach for natural language understanding. Specifically, TreeMix uses constituency parse trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them into new sentences. Compared with previous approaches, TreeMix introduces greater diversity into the generated samples and encourages models to learn the compositionality of NLP data. Extensive experiments on text classification and SCAN demonstrate that TreeMix outperforms current state-of-the-art data augmentation methods.
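
The decompose-and-recombine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes parse trees are already available as nested tuples (a hypothetical format chosen here to avoid depending on a parser), that the constituents to swap are given explicitly by path, and that the soft label is weighted by the share of tokens each source sentence contributes. That weighting is an assumption in the spirit of Mixup; the abstract does not specify it. All function names are illustrative.

```python
from typing import List, Sequence, Tuple, Union

# A tree is either a leaf token (str) or a tuple: (label, child, child, ...).
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Return the tokens under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    tokens: List[str] = []
    for child in tree[1:]:  # tree[0] is the constituent label
        tokens.extend(leaves(child))
    return tokens


def get_node(tree: Tree, path: Sequence[int]) -> Tree:
    """Follow 0-based child indices down from the root to a node."""
    for i in path:
        tree = tree[i + 1]  # +1 skips the label stored at index 0
    return tree


def replace_node(tree: Tree, path: Sequence[int], new: Tree) -> Tree:
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0] + 1
    return tree[:i] + (replace_node(tree[i], path[1:], new),) + tree[i + 1:]


def mix(
    tree_a: Tree, path_a: Sequence[int], label_a: int,
    tree_b: Tree, path_b: Sequence[int], label_b: int,
    num_classes: int,
) -> Tuple[List[str], List[float]]:
    """Swap a constituent of sentence B into sentence A and mix the labels.

    Assumption: the label weight for B equals the fraction of tokens in the
    new sentence that came from B.
    """
    donor = get_node(tree_b, path_b)
    mixed = replace_node(tree_a, path_a, donor)
    tokens = leaves(mixed)
    lam = len(leaves(donor)) / len(tokens)
    soft_label = [0.0] * num_classes
    soft_label[label_a] += 1.0 - lam
    soft_label[label_b] += lam
    return tokens, soft_label


# Toy sentiment example: label 1 = positive, label 0 = negative.
tree_a = ("S", ("NP", "the", "movie"), ("VP", "was", ("ADJP", "great")))
tree_b = ("S", ("NP", "the", "plot"), ("VP", "felt", ("ADJP", "painfully", "slow")))

# Replace the ADJP of sentence A with the ADJP of sentence B.
tokens, soft_label = mix(tree_a, [1, 0 + 1], 1, tree_b, [1, 1], 0, num_classes=2)
print(" ".join(tokens), soft_label)
```

In this toy case the new sentence takes two of its five tokens from sentence B, so the sketch assigns B's label a weight of 2/5. A practical pipeline would obtain the trees from a constituency parser and sample the constituents at random rather than fixing the paths by hand.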
