Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding

Paper title

Towards End-to-End Integration of Dialog History for Improved Spoken Language Understanding

Authors

Vishal Sunder, Samuel Thomas, Hong-Kwang J. Kuo, Jatin Ganhotra, Brian Kingsbury, Eric Fosler-Lussier

Abstract

Dialog history plays an important role in spoken language understanding (SLU) performance in a dialog system. For end-to-end (E2E) SLU, previous work has used dialog history in text form, which makes the model dependent on a cascaded automatic speech recognizer (ASR). This rescinds the benefits of an E2E system which is intended to be compact and robust to ASR errors. In this paper, we propose a hierarchical conversation model that is capable of directly using dialog history in speech form, making it fully E2E. We also distill semantic knowledge from the available gold conversation transcripts by jointly training a similar text-based conversation model with an explicit tying of acoustic and semantic embeddings. We also propose a novel technique that we call DropFrame to deal with the long training time incurred by adding dialog history in an E2E manner. On the HarperValleyBank dialog dataset, our E2E history integration outperforms a history independent baseline by 7.7% absolute F1 score on the task of dialog action recognition. Our model performs competitively with the state-of-the-art history based cascaded baseline, but uses 48% fewer parameters. In the absence of gold transcripts to fine-tune an ASR model, our model outperforms this baseline by a significant margin of 10% absolute F1 score.
