Paper Title
The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts
Paper Authors
Paper Abstract
The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of ten emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baselines for each track are as follows: for ExVo-MultiTask, a combined score ($S_{MTL}$), computed as the harmonic mean of the Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE), reaches at best 0.335; for ExVo-Generate, we report Fréchet inception distance (FID) scores between the training set and generated samples ranging from 4.81 to 8.27, depending on the emotion, and combining the inverted FID with perceptual ratings of the generated samples yields an $S_{Gen}$ of 0.174; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
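To make the $S_{MTL}$ combined score concrete, the sketch below computes the harmonic mean of a CCC, a UAR, and an inverted MAE. The per-task values are hypothetical placeholders, and the particular way MAE is inverted into a bounded score is an assumption made here for illustration; the abstract only states that the three components are combined via their harmonic mean.

```python
import numpy as np

def concordance_cc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance Correlation Coefficient (CCC) between two 1-D arrays."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def harmonic_mean(values) -> float:
    """Harmonic mean of a sequence of positive scores."""
    values = np.asarray(values, dtype=float)
    return len(values) / np.sum(1.0 / values)

# Hypothetical per-task baseline results (not the paper's actual numbers).
ccc_emotion = 0.42          # emotion regression, scored with CCC
uar_country = 0.50          # demographic classification, scored with UAR
mae_age = 4.2               # age regression, scored with MAE (years)

# One possible inversion of MAE into a score in (0, 1]; the exact
# normalization used by the challenge is not specified in the abstract.
inverted_mae = 1.0 / (1.0 + mae_age)

s_mtl = harmonic_mean([ccc_emotion, uar_country, inverted_mae])
print(f"S_MTL = {s_mtl:.3f}")
```

A harmonic mean is a natural choice for such a combined score because it is dominated by the weakest component: a model cannot compensate for a near-zero score on one task with strong scores on the others.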