Paper Title

Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition

Paper Authors

Jakob Poncelet, Hugo Van hamme

Abstract

TV subtitles are a rich source of transcriptions of many types of speech, ranging from read speech in news reports to conversational and spontaneous speech in talk shows and soaps. However, subtitles are not verbatim (i.e. exact) transcriptions of speech, so they cannot be used directly to improve an Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder Transformer model that jointly performs ASR and automatic subtitling. The ASR decoder (possibly pre-trained) predicts the verbatim output and the subtitle decoder generates a subtitle, while sharing the encoder. The two decoders can be independent or connected. The model is trained to perform both tasks jointly, and is able to effectively use subtitle data. We show improvements on regular ASR and on spontaneous and conversational ASR by incorporating the additional subtitle decoder. The method does not require preprocessing (aligning, filtering, pseudo-labeling, ...) of the subtitles.
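The architecture described above — a shared speech encoder feeding two independent decoders, one producing the verbatim transcript and one producing the subtitle — can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the class name, dimensions, and vocabulary size are arbitrary assumptions, and the two decoders are shown in their independent (unconnected) variant.

```python
# Hypothetical sketch of the multitask dual-decoder idea:
# one shared encoder, an ASR decoder for verbatim output,
# and a subtitle decoder, trained jointly. All sizes are toy values.
import torch
import torch.nn as nn

class DualDecoderASR(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # Shared speech encoder
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Verbatim (ASR) decoder; in the paper this one may be pre-trained
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Subtitle decoder, independent of the ASR decoder here
        self.sub_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.asr_out = nn.Linear(d_model, vocab_size)
        self.sub_out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, asr_tokens, sub_tokens):
        memory = self.encoder(speech_feats)  # shared encoder output
        asr_h = self.asr_decoder(self.embed(asr_tokens), memory)
        sub_h = self.sub_decoder(self.embed(sub_tokens), memory)
        return self.asr_out(asr_h), self.sub_out(sub_h)

model = DualDecoderASR()
feats = torch.randn(2, 50, 64)           # (batch, frames, feature dim)
asr_y = torch.randint(0, 1000, (2, 20))  # verbatim target token ids
sub_y = torch.randint(0, 1000, (2, 15))  # subtitle target token ids
asr_logits, sub_logits = model(feats, asr_y, sub_y)
print(asr_logits.shape, sub_logits.shape)
```

Joint training would then simply sum a cross-entropy loss per decoder, so verbatim data and subtitle data each update the shared encoder without any alignment or filtering of the subtitles.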
