Paper Title


Audio-visual Multi-channel Integration and Recognition of Overlapped Speech

Authors

Jianwei Yu, Shi-Xiong Zhang, Bo Wu, Shansong Liu, Shoukang Hu, Mengzhe Geng, Xunying Liu, Helen Meng, Dong Yu

Abstract


Automatic speech recognition (ASR) technologies have advanced significantly in the past few decades. However, recognition of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in current ASR systems. Motivated by the invariance of the visual modality to acoustic signal corruption and the additional cues it provides to separate the target speaker from interfering sound sources, this paper presents an audio-visual multi-channel recognition system for overlapped speech. It benefits from a tight integration between a speech separation front-end and a recognition back-end, both of which incorporate additional video input. A series of audio-visual multi-channel speech separation front-end components based on TF masking, Filter&Sum and mask-based MVDR neural channel integration approaches are developed. To reduce the error cost mismatch between the separation and recognition components, the entire system is jointly fine-tuned using a multi-task criterion interpolating the scale-invariant signal-to-noise ratio (SI-SNR) with either the connectionist temporal classification (CTC) or lattice-free maximum mutual information (LF-MMI) loss function. Experiments suggest that the proposed audio-visual multi-channel recognition system outperforms the baseline audio-only multi-channel ASR system by up to 8.04% (31.68% relative) and 22.86% (58.51% relative) absolute WER reduction on overlapped speech constructed using either simulation or replaying of the LRS2 dataset, respectively. Consistent performance improvements are also obtained with the proposed audio-visual multi-channel recognition system when using occluded video input with the face region randomly covered up to 60%.
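The two training objectives named in the abstract can be made concrete with a short sketch. Below is a minimal NumPy illustration of the SI-SNR metric and of a multi-task interpolation of an ASR loss with the (negated) SI-SNR term; the function names and the interpolation weight `lam` are hypothetical choices for illustration, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between an estimated
    and a reference waveform. Both signals are zero-meaned first."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the "clean" component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

def joint_loss(asr_loss, estimate, target, lam=0.9):
    """Multi-task interpolation in the spirit of the abstract: a weighted
    ASR criterion (e.g. CTC or LF-MMI) combined with a separation term.
    SI-SNR is maximized, so it enters with a negative sign. `lam` is an
    assumed interpolation weight, not taken from the paper."""
    return lam * asr_loss - (1.0 - lam) * si_snr(estimate, target)
```

Because the projection step rescales the reference to best match the estimate, `si_snr` is invariant to the overall gain of the estimate, which is what makes it a robust separation objective.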
