Paper Title
Insertion-Based Modeling for End-to-End Automatic Speech Recognition
Authors
Abstract
End-to-end (E2E) models have gained attention in the research field of automatic speech recognition (ASR). Most E2E models proposed so far assume left-to-right autoregressive generation of the output token sequence, with the exception of connectionist temporal classification (CTC) and its variants. However, left-to-right decoding cannot take future output context into account and is not always optimal for ASR. One family of non-left-to-right models, known as the non-autoregressive Transformer (NAT), has been intensively investigated in neural machine translation (NMT) research. One NAT model, mask-predict, has been applied to ASR, but it requires heuristics or an additional component to estimate the length of the output token sequence. This paper proposes applying another type of NAT, insertion-based models originally proposed for NMT, to ASR tasks. Insertion-based models avoid the above length-estimation issue of mask-predict and can generate an output sequence in an arbitrary order. In addition, we introduce a new formulation for joint training of insertion-based models and CTC. This formulation reinforces CTC by making it dependent on insertion-based token generation in a non-autoregressive manner. We conducted experiments on three public benchmarks and achieved performance competitive with a strong autoregressive Transformer under similar decoding conditions.
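The "arbitrary generation order" of insertion-based models is commonly instantiated with a balanced binary tree order, in which each parallel decoding round fills the midpoint of every remaining gap, so a length-n sequence finishes in roughly log2(n) rounds instead of n left-to-right steps. The sketch below is only an illustration of that ordering on a fixed target sequence (function name and token example are ours, not from the paper); the actual model predicts the tokens and slots jointly.

```python
from collections import deque

def insertion_rounds(tokens):
    """Group target positions into the parallel decoding rounds of a
    balanced-binary-tree insertion order: round k emits the midpoint
    token of every span that is still empty, so a length-n sequence
    completes in about log2(n) rounds."""
    rounds = []
    spans = deque([(0, len(tokens))])  # half-open spans of unfilled positions
    while spans:
        next_spans = deque()
        emitted = []
        for lo, hi in spans:
            if lo >= hi:
                continue
            mid = (lo + hi) // 2
            emitted.append(tokens[mid])          # token inserted this round
            next_spans.append((lo, mid))         # left gap remains open
            next_spans.append((mid + 1, hi))     # right gap remains open
        if emitted:
            rounds.append(emitted)
        spans = next_spans
    return rounds

# A 6-token hypothesis is produced in 3 parallel rounds:
# round 1 inserts "on", round 2 inserts "cat" and "mat", round 3 the rest.
print(insertion_rounds(["the", "cat", "sat", "on", "the", "mat"]))
```

An autoregressive decoder would need 6 sequential steps for the same hypothesis; the tree order reduces this to 3 rounds, which is the decoding-speed motivation behind insertion-based NAT.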