Paper Title
Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer
Paper Authors
Paper Abstract
Current few-shot learning models capture visual object relations in the so-called meta-learning setting under a fixed-resolution input. However, such models have limited generalization ability under scale and location mismatch between objects, as only a few samples from the target classes are provided. Therefore, the lack of a mechanism to match the scale and location between pairs of compared images leads to performance degradation. The importance of image content varies across coarse-to-fine scales depending on the object and its class label; e.g., generic objects and scenes rely on their global appearance, while fine-grained objects rely more on their localized visual patterns. In this paper, we study the impact of scale and location mismatch in the few-shot learning scenario and propose a novel Spatially-aware Matching (SM) scheme that performs matching across multiple scales and locations and learns image relations by assigning the highest weights to the best-matching pairs. SM is trained to activate the most related locations and scales between support and query data. We apply and evaluate SM on various few-shot learning models and backbones for a comprehensive evaluation. Furthermore, we leverage an auxiliary self-supervised discriminator to train/predict the spatial- and scale-level indexes of the feature vectors we use. Finally, we develop a novel transformer-based pipeline that exploits self- and cross-attention in the spatially-aware matching process. Our proposed design is orthogonal to the choice of backbone and/or comparator.
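The match-then-weight idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of cosine similarity, and the softmax temperature `tau` are all assumptions. Each image is represented by a list of feature vectors, one per (scale, location) crop; similarities between all support/query pairs are softmax-weighted so the best-matching pairs dominate the final relation score.

```python
import numpy as np

def spatially_aware_match(support_feats, query_feats, tau=0.1):
    """Hypothetical sketch of spatially-aware matching: compare every
    (scale, location) feature of a query image against every
    (scale, location) feature of a support image, then softmax-weight
    the similarities so the best-matching pairs receive the highest
    weights in the aggregated relation score."""
    # support_feats, query_feats: lists of (D,) feature vectors,
    # one per (scale, location) crop; D is the embedding dimension.
    S = np.stack(support_feats)                      # (Ns, D)
    Q = np.stack(query_feats)                        # (Nq, D)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    sims = Q @ S.T                                   # (Nq, Ns) cosine sims
    # Soft weighting across all pairs: the best-matching
    # (scale, location) pairs dominate the score.
    w = np.exp(sims / tau)
    w /= w.sum()
    return float((w * sims).sum())                   # scalar relation score
```

In the paper's pipeline this weighting is learned (and later realized with self- and cross-attention in a transformer), whereas the sketch above uses a fixed softmax over cosine similarities purely to convey the mechanism.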