Paper Title
TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music
Paper Authors
Paper Abstract
Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perception by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, which explicitly groups harmonics via a rearrangement of frequency bins. Second, we introduce an encoder-decoder architecture designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are conducted to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve singing voice extraction performance across various datasets, with substantial gains in octave and tone accuracy.
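To make the frequency-bin rearrangement idea concrete, here is a minimal, hypothetical sketch of how a CFP-style spectrogram could be regrouped by pitch class. This is an illustrative assumption, not the paper's exact implementation: we assume the input has `n_octaves * bins_per_octave` linearly ordered frequency rows, and we reorder them so that all octave-shifted copies of each tone become adjacent rows (the function name and shapes are invented for this example).

```python
import numpy as np

def tone_cfp_rearrange(cfp, n_octaves, bins_per_octave):
    """Regroup frequency bins by position-within-octave (tone).

    Input rows are assumed to ascend in frequency; after rearrangement,
    harmonics an octave apart end up in adjacent rows, which is the
    grouping effect the abstract attributes to Tone-CFP.
    """
    n_bins, n_frames = cfp.shape
    assert n_bins == n_octaves * bins_per_octave
    # (octave, tone, frame) -> (tone, octave, frame): group by tone first
    grouped = cfp.reshape(n_octaves, bins_per_octave, n_frames)
    return grouped.transpose(1, 0, 2).reshape(n_bins, n_frames)

# Example: 4 octaves, 12 bins per octave, 10 time frames
cfp = np.random.rand(48, 10)
tcfp = tone_cfp_rearrange(cfp, 4, 12)
# Rows 0..3 of tcfp are the bins for tone 0 across the 4 octaves,
# i.e. original rows 0, 12, 24, 36.
assert np.allclose(tcfp[0:4], cfp[[0, 12, 24, 36]])
```

The same reordering can be applied per time frame to any log-frequency representation; the key property is only that same-tone, different-octave bins become neighbors before the network sees them.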