Paper Title
Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract
Paper Authors
Paper Abstract
Articulatory-to-acoustic (forward) mapping is a technique to predict speech from various articulatory acquisition modalities (e.g., ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used for this purpose before. The advantage of MRI is its high `relative' spatial resolution: it captures not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each of them uttering 460 sentences. We evaluate the results with objective measures (Normalized MSE and MCD) and a subjective measure (perceptual test), and show that CNN-LSTM networks that take multiple images as input are preferred, achieving MCD scores between 2.8 and 4.5 dB. In the experiments, we find that the predictions for speaker `m1' are significantly weaker than those for the other speakers. We show that this is because 74% of the recordings of speaker `m1' are out of sync.
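
To make the forward-mapping setup concrete, below is a minimal sketch of a CNN-LSTM regressor of the kind the abstract describes: a short window of rtMRI frames in, one frame of vocoder parameters out. This is not the paper's exact architecture; the window length, the 68x68 frame size, the layer sizes and the 25-dimensional target are illustrative assumptions.

```python
# Minimal CNN-LSTM forward-mapping sketch (NOT the paper's exact model).
# Window length, 68x68 frame size, layer sizes and the 25-dim vocoder
# target are illustrative assumptions.
from tensorflow.keras import layers, models

SEQ_LEN, IMG_H, IMG_W = 10, 68, 68  # assumed rtMRI window and frame size
N_TARGET = 25                       # assumed vocoder feature dimension

def build_cnn_lstm():
    inp = layers.Input(shape=(SEQ_LEN, IMG_H, IMG_W, 1))
    # Per-frame convolutional encoder, shared across the time axis
    x = layers.TimeDistributed(layers.Conv2D(16, 3, activation='relu'))(inp)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu'))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # The LSTM models articulatory dynamics across the frame window
    x = layers.LSTM(256)(x)
    out = layers.Dense(N_TARGET, activation='linear')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='mse')  # MSE regression loss
    return model

model = build_cnn_lstm()
model.summary()
```

Sharing one convolutional encoder across the window (TimeDistributed) and letting the LSTM integrate over time mirrors the abstract's finding that networks with multiple images as input are preferred over single-frame models.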
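For reference, the MCD scores quoted above follow the standard mel-cepstral distortion definition; a direct NumPy transcription is given below (this is the conventional formula, not code from the paper, and it assumes the 0th energy coefficient has already been excluded).

```python
import numpy as np

def mel_cepstral_distortion(ref, pred):
    """Mean MCD in dB between reference and predicted mel-cepstra.

    ref, pred: (n_frames, D) arrays of mel-cepstral coefficients,
    conventionally with the 0th (energy) coefficient excluded.
    """
    diff = ref - pred
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```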