Paper Title

RoMe: A Robust Metric for Evaluating Natural Language Generation

Paper Authors

Md Rashad Al Hasan Rony, Liubov Kovriguina, Debanjan Chaudhuri, Ricardo Usbeck, Jens Lehmann

Paper Abstract

Evaluating Natural Language Generation (NLG) systems is a challenging task. Firstly, the metric should ensure that the generated hypothesis reflects the reference's semantics. Secondly, it should consider the grammatical quality of the generated sentence. Thirdly, it should be robust enough to handle various surface forms of the generated sentence. Thus, an effective evaluation metric has to be multifaceted. In this paper, we propose an automatic evaluation metric incorporating several core aspects of natural language understanding (language competence, syntactic and semantic variation). Our proposed metric, RoMe, is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability, using a self-supervised neural network to assess the overall quality of the generated sentence. Moreover, we perform an extensive robustness analysis of the state-of-the-art methods and RoMe. Empirical results suggest that RoMe has a stronger correlation to human judgment over state-of-the-art metrics in evaluating system-generated sentences across several NLG tasks.
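To make the pipeline described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of how the three sentence-level features named there (semantic similarity, tree edit distance, grammatical acceptability) could be normalized and combined by a small feed-forward network into a single quality score. The network shape, the distance normalization, and the `FeatureCombiner` / `rome_style_score` names are illustrative assumptions; in the paper the combiner is trained in a self-supervised fashion, which is omitted here.

```python
# Illustrative sketch of a RoMe-style scorer: three precomputed features
# (semantic similarity, tree edit distance, grammatical acceptability) are
# fed to a small feed-forward network that outputs one quality score.
# NOT the authors' code; the architecture and normalization are assumptions.

import torch
import torch.nn as nn


class FeatureCombiner(nn.Module):
    """Maps a 3-dim feature vector [semantic_sim, ted_sim, grammaticality]
    to a scalar score in [0, 1]."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)


combiner = FeatureCombiner()


def rome_style_score(semantic_sim: float, tree_edit_dist: float,
                     grammaticality: float, max_dist: float = 10.0) -> float:
    """Normalize the raw features and combine them with the network.

    semantic_sim:    cosine similarity of hypothesis/reference embeddings, in [0, 1]
    tree_edit_dist:  edit distance between the two parse trees (unbounded)
    grammaticality:  acceptability probability from a CoLA-style classifier, in [0, 1]
    """
    # Turn the unbounded tree edit distance into a similarity in [0, 1];
    # max_dist is an assumed normalization constant.
    ted_sim = max(0.0, 1.0 - tree_edit_dist / max_dist)
    feats = torch.tensor([[semantic_sim, ted_sim, grammaticality]])
    with torch.no_grad():
        return combiner(feats).item()


# With a trained combiner, the first (semantically close, structurally
# similar, grammatical) hypothesis would score higher than the second;
# here the network is untrained, so the outputs are illustrative only.
print(rome_style_score(semantic_sim=0.92, tree_edit_dist=2.0, grammaticality=0.95))
print(rome_style_score(semantic_sim=0.30, tree_edit_dist=9.0, grammaticality=0.20))
```

In practice the feature inputs would come from, e.g., a sentence-embedding model (semantic similarity), a tree edit distance over dependency or semantic parses, and a grammatical-acceptability classifier; the sketch deliberately leaves those upstream components out to stay self-contained.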
