Paper title
From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
Paper authors
Paper abstract
End-to-end neural diarization (EEND) is currently one of the most prominent research topics in speaker diarization. EEND is an attractive alternative to standard cascaded diarization systems because a single system is trained at once to handle the whole diarization problem. Several EEND variants and approaches have been proposed; however, all of these models require large amounts of annotated training data, and available annotated data are scarce. Thus, EEND works have mostly used simulated mixtures for training. However, simulated mixtures differ from real conversations in many respects. In this work, we present an alternative method for creating synthetic conversations that resemble real ones by using statistics about the distributions of pauses and overlaps estimated on genuine conversations. Furthermore, we analyze the effect of the source of the statistics, different augmentations, and the amount of data. We demonstrate that our approach performs substantially better than the original one while reducing the dependence on the fine-tuning stage. Experiments are carried out on 2-speaker telephone conversations from Callhome and DIHARD 3. Together with this publication, we release our implementations of EEND and of the method for creating simulated conversations.
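
To make the idea of building conversations from pause and overlap statistics concrete, the following is a minimal sketch and not the authors' released implementation. It assumes two speakers who strictly alternate turns, and it assumes the statistics are available as a plain list of gaps in seconds measured on real conversations, where a positive value is a pause and a negative value is an overlap. The function name `simulate_conversation` and all of its parameters are hypothetical.

```python
import random


def simulate_conversation(durations_a, durations_b, gap_samples, seed=0):
    """Lay out alternating turns of two speakers on a shared timeline.

    durations_a, durations_b: utterance lengths in seconds for each speaker.
    gap_samples: gaps in seconds (assumed to be measured on real
        conversations); positive = pause, negative = overlap.
    Returns a list of (speaker, start, end) tuples.
    """
    rng = random.Random(seed)
    queues = {"A": list(durations_a), "B": list(durations_b)}
    segments = []
    speaker, prev_start, prev_end = "A", 0.0, 0.0
    # Stop as soon as the speaker whose turn it is has no utterances left;
    # any remaining utterances of the other speaker are dropped.
    while queues[speaker]:
        dur = queues[speaker].pop(0)
        if segments:
            # Draw a gap from the empirical distribution.
            gap = rng.choice(gap_samples)
            # Assumption: a turn never starts before the previous turn started.
            start = max(prev_start, prev_end + gap)
        else:
            start = 0.0
        segments.append((speaker, start, start + dur))
        prev_start, prev_end = start, start + dur
        speaker = "B" if speaker == "A" else "A"
    return segments


if __name__ == "__main__":
    for seg in simulate_conversation([2.0, 1.5], [1.0, 3.0], [0.5, -0.3, 0.2]):
        print(seg)
```

The returned segments could then drive the mixing of the corresponding audio and the construction of frame-level speaker labels. Sampling gaps from an empirical list keeps the sketch independent of any particular parametric distribution; how the paper actually models these statistics is not specified in the abstract.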