Paper title

Polyphonic audio event detection: multi-label or multi-class multi-task classification problem?

Authors

Huy Phan, Thi Ngoc Tho Nguyen, Philipp Koch, Alfred Mertins

Abstract

Polyphonic events are the main error source of audio event detection (AED) systems. In the deep-learning context, the most common approach to deal with event overlaps is to treat the AED task as a multi-label classification problem. By doing this, we inherently consider multiple one-vs.-rest classification problems, which are jointly solved by a single (i.e., shared) network. In this work, to better handle polyphonic mixtures, we propose to frame the task as a multi-class classification problem by considering each possible label combination as one class. To circumvent the large number of arising classes due to combinatorial explosion, we divide the event categories into multiple groups and construct a multi-task problem in a divide-and-conquer fashion, where each of the tasks is a multi-class classification problem. A network architecture is then devised for multi-class multi-task modelling. The network is composed of a backbone subnet and multiple task-specific subnets. The task-specific subnets are designed to learn time-frequency and channel attention masks to extract features for the task at hand from the common feature maps learned by the backbone. Experiments on the TUT-SED-Synthetic-2016 dataset with a high degree of event overlap show that the proposed approach results in more favorable performance than the common multi-label approach.
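
The label reformulation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the bit-pattern encoding of label combinations, and the example grouping of six events into two groups of three are all assumptions made for illustration. Each group of events becomes one task, and every combination of active events within a group maps to a single class index.

```python
from typing import List, Sequence


def multilabel_to_multitask(labels: Sequence[int],
                            groups: Sequence[Sequence[int]]) -> List[int]:
    """Turn one frame's binary event labels into one class index per group.

    Assumed encoding (not taken from the paper): within a group, the event
    at position `bit` contributes 2**bit, so a group of g events yields
    2**g possible classes, with class 0 meaning "no event active".
    """
    targets = []
    for group in groups:
        index = 0
        for bit, event in enumerate(group):
            index |= (labels[event] & 1) << bit
        targets.append(index)
    return targets


def multitask_to_multilabel(targets: Sequence[int],
                            groups: Sequence[Sequence[int]],
                            n_events: int) -> List[int]:
    """Invert the encoding: recover binary event labels from class indices."""
    labels = [0] * n_events
    for index, group in zip(targets, groups):
        for bit, event in enumerate(group):
            labels[event] = (index >> bit) & 1
    return labels


# Hypothetical setup: 6 event categories split into 2 groups of 3.
GROUPS = [[0, 1, 2], [3, 4, 5]]
N_EVENTS = 6

frame = [1, 0, 1, 0, 0, 0]          # events 0 and 2 overlap in this frame
targets = multilabel_to_multitask(frame, GROUPS)
restored = multitask_to_multilabel(targets, GROUPS, N_EVENTS)

# Class counts: one flat multi-class problem vs. the grouped multi-task one.
flat_classes = 2 ** N_EVENTS                         # every label combination
grouped_classes = sum(2 ** len(g) for g in GROUPS)   # summed over all tasks
```

Under this assumed grouping, a single multi-class problem over all six events would need 64 classes, whereas the two grouped tasks need 8 classes each, which is the reduction the divide-and-conquer construction is meant to achieve.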
