Paper Title
Learning an Unreferenced Metric for Online Dialogue Evaluation
Paper Authors
Paper Abstract
Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference, making them infeasible for online evaluation. Here, we propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances, and leverages the temporal transitions that exist between them. We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
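The abstract describes an unreferenced metric that embeds utterances with a large pre-trained language model and exploits the temporal transitions between them. Below is a minimal sketch of such a scorer, assuming PyTorch and Hugging Face transformers; the specific choices (distilbert-base-uncased, a GRU over the dialogue context, a bilinear scoring head with a sigmoid output) are illustrative assumptions, not the paper's exact architecture, and the head would still need to be trained (e.g., by discriminating real next utterances from sampled negatives) before its scores are meaningful.

```python
# Sketch of an unreferenced dialogue-response scorer (illustrative, untrained):
# a frozen pre-trained LM embeds each utterance, a GRU summarizes the context
# turns to capture temporal transitions, and a bilinear head scores how well a
# candidate response follows the context -- no reference response is needed.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class UnreferencedDialogueScorer(nn.Module):
    def __init__(self, lm_name: str = "distilbert-base-uncased", hidden: int = 768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.encoder = AutoModel.from_pretrained(lm_name)  # frozen feature extractor
        for p in self.encoder.parameters():
            p.requires_grad = False
        # GRU over per-utterance embeddings models temporal transitions in the context.
        self.context_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # Bilinear head compares the summarized context with the candidate response.
        self.scorer = nn.Bilinear(hidden, hidden, 1)

    def _embed(self, utterances):
        """Return one vector per utterance (first-token embedding of the LM)."""
        batch = self.tokenizer(utterances, return_tensors="pt",
                               padding=True, truncation=True)
        with torch.no_grad():
            out = self.encoder(**batch).last_hidden_state  # (n, seq_len, hidden)
        return out[:, 0, :]                                # (n, hidden)

    def forward(self, context_utterances, response):
        ctx = self._embed(context_utterances).unsqueeze(0)  # (1, turns, hidden)
        _, ctx_state = self.context_rnn(ctx)                # (1, 1, hidden)
        resp = self._embed([response])                      # (1, hidden)
        logit = self.scorer(ctx_state.squeeze(0), resp)     # (1, 1)
        return torch.sigmoid(logit)                         # score in (0, 1)


# Usage: a higher score means the response is judged more coherent with the
# context; with an untrained head the output is arbitrary until fine-tuned.
scorer = UnreferencedDialogueScorer()
score = scorer(["Hi, how are you?", "Pretty good, just got back from a run."],
               "Nice! How far did you go?")
print(float(score))
```

Freezing the pre-trained encoder keeps the trainable part small, and because scoring only conditions on the context and the candidate response, the metric can be applied online to unseen conversations without a human-generated reference.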