Paper Title

SepTr: Separable Transformer for Audio Spectrogram Processing

Paper Authors

Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan

Paper Abstract

Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same time interval, and the second attending to tokens within the same frequency bin. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
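
To make the separable attention idea concrete, below is a minimal PyTorch sketch of one SepTr-style block. The module names (AxisAttentionBlock, SeparableTransformerBlock), the pre-norm block design, the head count, and the toy input shape are illustrative assumptions, not the authors' reference implementation; see https://github.com/ristea/septr for the latter.

```python
# Sketch of separable attention over a spectrogram, per the abstract:
# one transformer block attends to tokens within the same time interval,
# a second attends to tokens within the same frequency bin.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class AxisAttentionBlock(nn.Module):
    """A standard pre-norm transformer block over a (batch, seq, dim) input."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):  # x: (batch, seq_len, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class SeparableTransformerBlock(nn.Module):
    """Two sequential blocks: the first mixes frequency tokens that share a
    time interval, the second mixes time tokens that share a frequency bin."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.time_wise = AxisAttentionBlock(dim, num_heads)
        self.freq_wise = AxisAttentionBlock(dim, num_heads)

    def forward(self, x):  # x: (batch, freq, time, dim)
        b, f, t, d = x.shape
        # Attend within each time interval: sequences run over the frequency axis.
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.time_wise(x)
        x = x.reshape(b, t, f, d).permute(0, 2, 1, 3)
        # Attend within each frequency bin: sequences run over the time axis.
        x = x.reshape(b * f, t, d)
        x = self.freq_wise(x)
        return x.reshape(b, f, t, d)


if __name__ == "__main__":
    # Toy spectrogram: 2 examples, 64 frequency bins, 128 time frames,
    # each token already projected to a 32-dim embedding.
    tokens = torch.randn(2, 64, 128, 32)
    block = SeparableTransformerBlock(dim=32)
    print(block(tokens).shape)  # torch.Size([2, 64, 128, 32])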
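```

Note the design consequence: each attention pass sees a sequence of length F or T rather than the full F x T token grid, so the quadratic cost of self-attention applies only along one axis at a time, which is consistent with the abstract's claim of a lower memory footprint than a conventional vision transformer.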
