An empirical analysis of information encoded in disentangled neural speaker representations

Paper title

An empirical analysis of information encoded in disentangled neural speaker representations

Authors

Peri, Raghuveer; Li, Haoqi; Somandepalli, Krishna; Jati, Arindam; Narayanan, Shrikanth

Abstract

The primary characteristic of robust speaker representations is that they are invariant to factors of variability not related to speaker identity. Disentanglement of speaker representations is one of the techniques used to improve robustness of speaker representations to both intrinsic factors that are acquired during speech production (e.g., emotion, lexical content) and extrinsic factors that are acquired during signal capture (e.g., channel, noise). Disentanglement in neural speaker representations can be achieved either in a supervised fashion with annotations of the nuisance factors (factors not related to speaker identity) or in an unsupervised fashion without labels of the factors to be removed. In either case it is important to understand the extent to which the various factors of variability are entangled in the representations. In this work, we examine speaker representations with and without unsupervised disentanglement for the amount of information they capture related to a suite of factors. Using classification experiments we provide empirical evidence that disentanglement reduces the information with respect to nuisance factors from speaker representations, while retaining speaker information. This is further validated by speaker verification experiments on the VOiCES corpus in several challenging acoustic conditions. We also show improved robustness in speaker verification tasks using data augmentation during training of disentangled speaker embeddings. Finally, based on our findings, we provide insights into the factors that can be effectively separated using the unsupervised disentanglement technique and discuss potential future directions.
