Paper Title

Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition

Authors

Yongjun Jiang, Jian Yu, Wenwen Yang, Bihong Zhang, Yanfeng Wang

Abstract

Conformer models have achieved state-of-the-art (SOTA) results in end-to-end speech recognition. However, Conformer focuses mainly on temporal modeling and pays less attention to the time-frequency properties of speech features. In this paper, we augment Conformer with ConvNeXt and propose the Nextformer structure. We use stacks of ConvNeXt blocks to replace the subsampling module commonly used in Conformer, so as to exploit the information contained in time-frequency speech features. In addition, we insert an extra downsampling module in the middle of the Conformer layers to make the model more efficient and accurate. We conduct experiments on two open datasets, AISHELL-1 and WenetSpeech. On AISHELL-1, Nextformer obtains 7.3% and 6.3% relative CER reductions over the Conformer baselines in non-streaming and streaming modes, respectively; on the much larger WenetSpeech dataset, Nextformer gives 5.0%~6.5% and 7.5%~14.6% relative CER reductions in non-streaming and streaming modes, while keeping the computational cost (FLOPs) comparable to Conformer. To the best of our knowledge, the proposed Nextformer achieves SOTA results on AISHELL-1 (CER 4.06%) and WenetSpeech (CER 7.56%/11.29%).
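To make the frontend idea concrete, below is a minimal PyTorch sketch of a 2-D ConvNeXt block applied to time-frequency speech features, stacked with strided convolutions to form a subsampling frontend in place of Conformer's usual Conv2d stack. This is an illustrative sketch only: the block follows the published ConvNeXt design (depthwise 7x7 convolution, LayerNorm, inverted-bottleneck MLP; LayerScale and stochastic depth omitted for brevity), while the kernel sizes, channel widths, strides, and block counts are assumptions, since the abstract does not specify the paper's exact configuration.

```python
# Sketch of a 2-D ConvNeXt block on (batch, channels, time, freq) features.
# Hypothetical configuration; not the paper's exact frontend.
import torch
import torch.nn as nn

class ConvNeXtBlock2d(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # Depthwise 7x7 conv mixes time-frequency context per channel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)  # normalizes over the channel dim
        # Inverted bottleneck: pointwise expand -> GELU -> pointwise project.
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)           # (B, C, T, F) -> (B, T, F, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)           # back to (B, C, T, F)
        return residual + x

# Hypothetical frontend: strided convs interleaved with ConvNeXt blocks,
# preserving the 4x time reduction of a typical Conformer subsampler.
frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=2, stride=2),    # 2x downsample
    ConvNeXtBlock2d(64),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),  # 4x total
    ConvNeXtBlock2d(128),
)

feats = torch.randn(8, 1, 400, 80)  # (batch, 1, frames, mel bins)
print(frontend(feats).shape)        # torch.Size([8, 128, 100, 20])
```

A similar strided convolution could in principle serve as the additional downsampling module the paper inserts in the middle of the Conformer layers; the exact placement and reduction factor would follow the paper's design rather than this sketch.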
