Paper Title


Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard

Paper Authors

Zoltán Tüske, George Saon, Kartik Audhkhasi, Brian Kingsbury

Paper Abstract

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.7% and 7.8% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.
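To make "single headed attention" concrete: unlike multi-head attention, a single scoring function attends over the encoder frames at each decoder step and produces one context vector. Below is a minimal NumPy sketch of one additive (Bahdanau-style) attention step; the weight names, dimensions, and scoring form are illustrative assumptions, not the authors' exact LSTM-based implementation.

```python
import numpy as np

def single_head_additive_attention(decoder_state, encoder_states, W_q, W_k, v):
    """One single-headed attention step: score every encoder frame against the
    current decoder state, normalize with a softmax, and return the weighted
    context vector together with the attention weights."""
    query = W_q @ decoder_state          # project decoder state, shape (d_att,)
    keys = encoder_states @ W_k.T        # project encoder frames, shape (T, d_att)
    scores = np.tanh(keys + query) @ v   # additive score per frame, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the T encoder frames
    context = weights @ encoder_states   # convex combination of encoder frames
    return context, weights

# Toy example: 5 encoder frames of dim 8, decoder state of dim 6, attention dim 4
# (hypothetical sizes chosen only for illustration).
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
encoder_states = rng.normal(size=(T, d_enc))
decoder_state = rng.normal(size=d_dec)
W_q = rng.normal(size=(d_att, d_dec))
W_k = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

context, weights = single_head_additive_attention(decoder_state, encoder_states, W_q, W_k, v)
print(weights.round(3), context.shape)
```

In the full model this context vector would be fed, together with the decoder LSTM state, into the output layer that predicts the next token; the sketch only shows the attention computation itself.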
