Paper Title


Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

Paper Authors

Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

Paper Abstract

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.7% and 7.8% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.
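To make "single headed attention" concrete: unlike multi-head attention, a single scoring function attends over the encoder frames at each decoder step and produces one context vector. Below is a minimal NumPy sketch of one additive (Bahdanau-style) attention step; the weight names, dimensions, and scoring form are illustrative assumptions, not the authors' exact LSTM-based implementation.

```python
import numpy as np

def single_head_additive_attention(decoder_state, encoder_states, W_q, W_k, v):
    """One single-headed attention step: score every encoder frame against the
    current decoder state, normalize with a softmax, and return the weighted
    context vector together with the attention weights."""
    query = W_q @ decoder_state          # project decoder state, shape (d_att,)
    keys = encoder_states @ W_k.T        # project encoder frames, shape (T, d_att)
    scores = np.tanh(keys + query) @ v   # additive score per frame, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the T encoder frames
    context = weights @ encoder_states   # convex combination of encoder frames
    return context, weights

# Toy example: 5 encoder frames of dim 8, decoder state of dim 6, attention dim 4
# (hypothetical sizes chosen only for illustration).
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
encoder_states = rng.normal(size=(T, d_enc))
decoder_state = rng.normal(size=d_dec)
W_q = rng.normal(size=(d_att, d_dec))
W_k = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

context, weights = single_head_additive_attention(decoder_state, encoder_states, W_q, W_k, v)
print(weights.round(3), context.shape)
```

In the full model this context vector would be fed, together with the decoder LSTM state, into the output layer that predicts the next token; the sketch only shows the attention computation itself.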
