Paper Title

A New Data Normalization Method to Improve Dialogue Generation by Minimizing Long Tail Effect

Paper Authors

Zhan, Zhiqiang; Hou, Zifeng; Zhang, Yang

Paper Abstract

Recent neural models have shown significant progress in dialogue generation. Most generation models are based on language models. However, due to the Long Tail Phenomenon in linguistics, the trained models tend to generate words that appear frequently in training datasets, leading to a monotonous issue. To address this issue, we analyze a large corpus from Wikipedia and propose three frequency-based data normalization methods. We conduct extensive experiments based on transformers and three datasets collected from social media, subtitles, and an industrial application, respectively. Experimental results demonstrate significant improvements in diversity and informativeness (defined as the numbers of nouns and verbs) of generated responses. More specifically, the unigram and bigram diversity are increased by 2.6%-12.6% and 2.2%-18.9% on the three datasets, respectively. Moreover, the informativeness, i.e., the numbers of nouns and verbs, are increased by 4.0%-7.0% and 1.4%-12.1%, respectively. Additionally, the simplicity and effectiveness enable our methods to be adapted to different generation models without much extra computational cost.
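The "unigram and bigram diversity" reported in the abstract appears to correspond to the standard distinct-n measure (the ratio of unique n-grams to total n-grams across generated responses). Below is a minimal sketch of that metric, assuming tokenized responses and this standard definition; the function name and the toy data are illustrative, not from the paper.

```python
from collections import Counter


def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized responses.

    This follows the common "distinct-n" diversity metric; whether the paper
    uses exactly this formulation is an assumption.
    """
    total_ngrams = 0
    unique_ngrams = set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total_ngrams += len(ngrams)
        unique_ngrams.update(ngrams)
    return len(unique_ngrams) / total_ngrams if total_ngrams else 0.0


# Toy usage with two tokenized responses (illustrative only).
responses = [["i", "like", "tea"], ["i", "like", "coffee", "a", "lot"]]
print(distinct_n(responses, 1))  # unigram diversity (distinct-1)
print(distinct_n(responses, 2))  # bigram diversity (distinct-2)
```

The informativeness metric described in the abstract (counts of nouns and verbs) would additionally require part-of-speech tagging of the generated responses, which is omitted here.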
