Paper Title

PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Authors

Hexin Liu, Leibny Paola Garcia Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur

Abstract

We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of phonotactic embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices imply that our proposed method achieves higher performance on languages of the same cluster in NIST LRE 2017 data than the CNN-Trans model. A comparison between predicted phoneme boundaries and corresponding audio spectrograms illustrates the leveraging of phoneme information for LID.
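The abstract describes a shared CNN encoder whose frame-level phonotactic embeddings feed both a self-supervised phoneme-segmentation head and an utterance-level LID head. The following is a minimal toy sketch of that multi-task layout, not the authors' implementation: all layer sizes, weight shapes, and the use of simple attention pooling in place of the transformer encoder layers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1-D convolution over time: x is (T, d_in), w is (k, d_in, d_out)."""
    k, d_in, d_out = w.shape
    T = x.shape[0] - k + 1
    out = np.empty((T, d_out))
    for t in range(T):
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)  # ReLU

class PhoLidSketch:
    """Toy PHO-LID-style model (hypothetical sizes): one shared CNN encoder
    feeds two heads, a phoneme-boundary head (self-supervised segmentation
    task) and an utterance-level LID head; attention pooling stands in for
    the paper's transformer encoder layers."""
    def __init__(self, d_in=40, d_hid=64, n_langs=10, k=3):
        self.w_cnn = rng.normal(0, 0.1, (k, d_in, d_hid))
        self.b_cnn = np.zeros(d_hid)
        self.w_seg = rng.normal(0, 0.1, d_hid)        # frame-wise boundary scorer
        self.w_att = rng.normal(0, 0.1, d_hid)        # pooling attention scorer
        self.w_lid = rng.normal(0, 0.1, (d_hid, n_langs))

    def forward(self, feats):
        # Shared encoder: intermediate sequence of phonotactic embeddings.
        h = conv1d(feats, self.w_cnn, self.b_cnn)
        # Head 1: per-frame phoneme-boundary probabilities.
        boundary = 1.0 / (1.0 + np.exp(-(h @ self.w_seg)))
        # Head 2: pool over time, then classify the language of the utterance.
        attn = np.exp(h @ self.w_att)
        pooled = (attn[:, None] * h).sum(axis=0) / attn.sum()
        logits = pooled @ self.w_lid
        lang_post = np.exp(logits - logits.max())
        lang_post /= lang_post.sum()
        return boundary, lang_post

model = PhoLidSketch()
feats = rng.normal(size=(100, 40))          # 100 frames of 40-dim features
boundary, lang_post = model.forward(feats)
print(boundary.shape, lang_post.shape)      # (98,) (10,)
```

Because both heads backpropagate into the same `w_cnn` in a real training loop, multi-task optimization pushes the shared encoder to retain sequential phonemic cues alongside language identity, which is the mechanism the abstract credits for the over-40% relative cost improvement.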
