Paper Title

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Paper Authors

Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

Paper Abstract

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360 degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision `teacher' method and a sound `student' method -- the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks namely, a) a novel task on Spatial Sound Super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; and 2) the three tasks are mutually beneficial -- training them together achieves the best performance and 3) the number and orientations of microphones are both important. The data and code will be released to facilitate the research in this new direction.
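The cross-modal distillation idea in the abstract is that the sound "student" is trained to reproduce the vision "teacher's" per-pixel predictions, so no human labels are needed. A minimal sketch of such a distillation loss is shown below; the function names and the plain-Python representation (one logit list per pixel) are illustrative assumptions, not the paper's implementation, which would operate on dense network outputs.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one pixel's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy between the teacher's soft labels and the student's
    predicted distribution, averaged over pixels (hypothetical sketch).
    Minimizing this pushes the student toward the teacher's outputs."""
    loss = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p = softmax(t)  # teacher's soft label for this pixel
        q = softmax(s)  # student's prediction for this pixel
        loss += -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return loss / len(teacher_logits)
```

When the student's logits match the teacher's exactly, the loss reduces to the teacher's entropy, its minimum; any disagreement increases it, which is what drives the supervision transfer.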
