Paper Title
Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models
Paper Authors
Abstract
Compared to hybrid automatic speech recognition (ASR) systems that use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show that a trained RNN Transducer (RNN-T) model's internal LM component can be effectively adapted with text-only data. An RNN-T model trained using both speech and text inputs improves over a baseline model trained on just speech, with close to 13% word error rate (WER) reduction on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation. The usefulness of the proposed approach is further demonstrated by customizing this general-purpose RNN-T model to three separate datasets. We observe 20-45% relative WER reduction in these settings with this novel LM-style customization technique, using only unpaired text data from the new domains.
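The core idea of the abstract — updating only the LM-like component of a trained RNN-T with unpaired text while leaving the acoustic encoder untouched — can be illustrated with a minimal sketch. This is not the paper's actual implementation: the prediction network is stood in for by a toy bigram language model, and the names `ToyRNNT` and `adapt_with_text` are illustrative assumptions.

```python
from collections import defaultdict

class ToyRNNT:
    """Toy stand-in for an RNN-T, illustrating LM-style text-only adaptation.

    The real model's prediction network is replaced here by bigram counts;
    the point is only that unpaired text updates the internal-LM component
    while the acoustic encoder stays frozen.
    """

    def __init__(self):
        self.encoder_frozen = True  # acoustic encoder: never updated by text
        # internal-LM stand-in: bigram[prev][cur] -> count
        self.bigram = defaultdict(lambda: defaultdict(int))

    def adapt_with_text(self, sentences):
        """Customize with unpaired text: only the LM stand-in is updated."""
        for sent in sentences:
            tokens = ["<s>"] + sent.split()
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigram[prev][cur] += 1

    def lm_prob(self, prev, cur):
        """Relative-frequency estimate of P(cur | prev) from adapted counts."""
        total = sum(self.bigram[prev].values())
        return self.bigram[prev][cur] / total if total else 0.0

# Adapting to a hypothetical new domain using text-only data:
model = ToyRNNT()
model.adapt_with_text(["the patient needs insulin", "the patient is stable"])
p = model.lm_prob("the", "patient")  # high after seeing in-domain text
```

In the actual paper, this role is played by gradient updates to the prediction network using a learned text representation, not bigram counting; the sketch only mirrors the division of labor between the frozen encoder and the adaptable internal LM.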