Paper Title
Exploring Transformers for Large-Scale Speech Recognition
Paper Authors
Paper Abstract
While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has proven to be a competitive alternative, especially in the offline condition. Most studies of Transformers have been constrained to relatively small-scale settings, and some form of data augmentation is usually applied to combat the data sparsity issue. In this paper, we aim to understand the behavior of Transformers in a large-scale speech recognition setting, using around 65,000 hours of training data. We investigate various aspects of scaling up Transformers, including model initialization, warmup training, and different layer normalization strategies. In the streaming condition, we compare the widely used attention-mask-based future-context lookahead approach with the Transformer-XL network. Our experiments show that, in the offline condition, Transformers achieve around a 6% relative word error rate (WER) reduction compared to the BLSTM baseline, while in the streaming condition, Transformer-XL is comparable to LC-BLSTM with an 800-millisecond latency constraint.
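As a rough illustration of the attention-mask-based future-context lookahead mentioned in the abstract, the sketch below builds a boolean self-attention mask in which each frame may attend to the full past plus a fixed number of future frames. This is a minimal sketch assuming a PyTorch setup; the function name and the `right_context` parameter are illustrative and not taken from the paper.

```python
import torch

def lookahead_attention_mask(num_frames: int, right_context: int) -> torch.Tensor:
    """Boolean self-attention mask: each query frame may attend to all past
    frames and at most `right_context` future frames. True means blocked.
    (Hypothetical helper for illustration, not from the paper.)"""
    idx = torch.arange(num_frames)
    # Entry (i, j) is True (masked) when key frame j lies more than
    # `right_context` frames ahead of query frame i.
    return idx.unsqueeze(0) > idx.unsqueeze(1) + right_context

# Example: 6 frames with a 2-frame future-context lookahead per layer.
mask = lookahead_attention_mask(6, right_context=2)
print(mask.int())
# The mask can be passed as `attn_mask` to torch.nn.MultiheadAttention,
# where True entries are prevented from being attended to.
```

Note that with stacked self-attention layers, this per-layer lookahead accumulates with depth, so the overall algorithmic latency grows with the number of layers; this is one of the trade-offs when contrasting the mask-based approach with Transformer-XL-style streaming.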