Paper title
Speech enhancement aided end-to-end multi-task learning for voice activity detection
Paper authors
Paper abstract
Robust voice activity detection (VAD) is a challenging task in low signal-to-noise ratio (SNR) environments. Recent studies show that speech enhancement helps VAD, but the performance improvement is limited. To address this issue, we propose a speech enhancement aided end-to-end multi-task model for VAD. The model has two decoders, one for speech enhancement and the other for VAD. The two decoders share the same encoder and speech separation network. Unlike the straightforward approach of using two separate objectives, one for VAD and one for speech enhancement, we propose a new joint optimization objective: the VAD-masked scale-invariant source-to-distortion ratio (mSI-SDR). mSI-SDR uses VAD information to mask the output of the speech enhancement decoder during training. As a result, the VAD and speech enhancement tasks are jointly optimized not only through the shared encoder and separation network, but also at the objective level. In theory, the model also satisfies the requirements of real-time operation. Experimental results show that the multi-task method significantly outperforms its single-task VAD counterpart. Moreover, mSI-SDR outperforms SI-SDR in the same multi-task setting.
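The abstract does not give the exact form of mSI-SDR, so the following is only a minimal sketch of the general idea. `si_sdr` follows the standard SI-SDR definition. `masked_si_sdr` is a hypothetical illustration that assumes the VAD information is a per-sample 0/1 mask applied to both the enhanced output and the clean target before the ratio is computed. The function names, the mask convention, and the `eps` stabilizer are assumptions, not the authors' formulation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, target, eps=1e-8):
    """Standard scale-invariant SDR in dB between an estimate and a target."""
    # Project the estimate onto the target to find the optimal scaling.
    alpha = dot(estimate, target) / (dot(target, target) + eps)
    scaled_target = [alpha * t for t in target]
    # Everything in the estimate not explained by the scaled target is residual.
    residual = [e - s for e, s in zip(estimate, scaled_target)]
    ratio = (dot(scaled_target, scaled_target) + eps) / (dot(residual, residual) + eps)
    return 10.0 * math.log10(ratio)


def masked_si_sdr(estimate, target, vad_mask, eps=1e-8):
    """Hypothetical VAD-masked SI-SDR (assumption: mask both signals sample-wise)."""
    masked_estimate = [m * e for m, e in zip(vad_mask, estimate)]
    masked_target = [m * t for m, t in zip(vad_mask, target)]
    return si_sdr(masked_estimate, masked_target, eps)


if __name__ == "__main__":
    # Toy signals: speech is active only in the middle three samples.
    target = [0.0, 0.0, 1.0, -1.0, 0.5, 0.0]
    estimate = [0.3, -0.2, 0.9, -1.1, 0.6, 0.1]
    vad_mask = [0, 0, 1, 1, 1, 0]
    print("SI-SDR (dB):", si_sdr(estimate, target))
    print("masked SI-SDR (dB):", masked_si_sdr(estimate, target, vad_mask))
```

In this toy setting the mask removes residual noise in the non-speech samples, so the masked score is higher than the plain SI-SDR. In a trainable model the mask would come from the VAD decoder, which is how a single objective could couple the two tasks.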