Paper Title

Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass

Authors

Olabanji Shonibare, Xiaosu Tong, Venkatesh Ravichandran

Abstract

It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering. With recent advances in Automatic Speech Recognition (ASR), voice assistants are increasingly useful in our everyday lives. Many technologies in education, retail, telecommunication, and healthcare can now be operated through voice. Unfortunately, these benefits are not accessible to People Who Stutter (PWS). We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible to PWS in a limited-data setting. The algorithm uses a context-aware classifier, trained on a limited amount of data, to detect acoustic frames that contain stuttering. To improve robustness on stuttered speech, this extra information is passed on to the ASR model to be utilized during inference. Our experiments show a reduction of 12.18% to 71.24% in Word Error Rate (WER) across various state-of-the-art ASR systems. Upon varying the threshold on the posterior probability of stutter associated with each stacked frame used in determining low-frame-rate (LFR) acoustic features, we were able to determine an optimal setting that reduced the WER by 23.93% to 71.67% across different ASR systems.
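The abstract describes thresholding the frame-level stutter posterior when building stacked low-frame-rate (LFR) features. The following is a minimal sketch of one plausible reading of that step: frames whose stutter posterior exceeds a threshold are skipped before stacking. The function name, the skip-based interpretation, and all parameters are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def detect_and_pass(frames, stutter_posteriors, threshold=0.5, stack=3):
    """Illustrative sketch: drop frames whose stutter posterior exceeds
    `threshold`, then stack the surviving frames into LFR features.
    All names and defaults here are assumptions for illustration."""
    # Keep only frames the (hypothetical) classifier deems non-stuttered.
    kept = [f for f, p in zip(frames, stutter_posteriors) if p < threshold]
    # Stack consecutive kept frames (stride = stack) to form LFR features.
    lfr = [np.concatenate(kept[i:i + stack])
           for i in range(0, len(kept) - stack + 1, stack)]
    return np.array(lfr)
```

Sweeping `threshold` over a validation set would correspond to the tuning experiment the abstract reports (WER reductions of 23.93% to 71.67% at the optimal setting).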
