Paper Title
Self-supervised models of audio effectively explain human cortical responses to speech
Paper Authors
Paper Abstract
Self-supervised language models are very effective at predicting high-level cortical responses during language comprehension. However, the best current models of lower-level auditory processing in the human brain rely on either hand-constructed acoustic filters or representations from supervised audio neural networks. In this work, we capitalize on the progress of self-supervised speech representation learning (SSL) to create new state-of-the-art models of the human auditory system. Compared against acoustic baselines, phonemic features, and supervised models, representations from the middle layers of self-supervised models (APC, wav2vec, wav2vec 2.0, and HuBERT) consistently yield the best prediction performance for fMRI recordings within the auditory cortex (AC). Brain areas involved in low-level auditory processing exhibit a preference for earlier SSL model layers, whereas higher-level semantic areas prefer later layers. We show that these trends are due to the models' ability to encode information at multiple linguistic levels (acoustic, phonetic, and lexical) along their representation depth. Overall, these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
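The abstract does not spell out the encoding-model pipeline, but the analysis it describes (predicting fMRI responses from layer-wise SSL representations) is commonly implemented as a regularized linear regression from model features to voxel time courses. The sketch below is a minimal illustration of that idea using a HuggingFace wav2vec 2.0 checkpoint and scikit-learn ridge regression; the checkpoint name, the average-pooling downsampling, and the ridge penalty are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): extract layer-wise hidden states from a
# pretrained wav2vec 2.0 model and fit a ridge encoding model to fMRI voxel responses.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def layer_features(audio, sr=16000, checkpoint="facebook/wav2vec2-base"):
    """Return a list of (time, dim) feature matrices, one per model layer."""
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
    model = Wav2Vec2Model.from_pretrained(checkpoint, output_hidden_states=True).eval()
    inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (1, time, dim) tensors: embeddings + each transformer layer
    return [h.squeeze(0).numpy() for h in out.hidden_states]


def downsample_to_trs(feats, n_trs):
    """Crudely average-pool frame-level features into n_trs fMRI time bins (assumption:
    no hemodynamic-delay modeling, which a real analysis would include)."""
    bins = np.array_split(feats, n_trs, axis=0)
    return np.stack([b.mean(axis=0) for b in bins])


def fit_encoding_model(layer_feats, voxel_responses, alpha=1000.0):
    """Fit ridge regression from one layer's features to voxel responses and
    return the held-out prediction correlation per voxel."""
    X_train, X_test, y_train, y_test = train_test_split(
        layer_feats, voxel_responses, test_size=0.2, shuffle=False)
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    pred = ridge.predict(X_test)
    # Pearson correlation per voxel between predicted and measured responses
    pred_c = pred - pred.mean(axis=0)
    y_c = y_test - y_test.mean(axis=0)
    return (pred_c * y_c).sum(axis=0) / (
        np.linalg.norm(pred_c, axis=0) * np.linalg.norm(y_c, axis=0) + 1e-8)
```

Comparing the per-voxel correlations returned for each layer's features is one way to reproduce the qualitative finding in the abstract: early layers should predict low-level auditory regions best, while later layers should predict higher-level semantic regions best.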