通过利用地理和时间信息来进行细粒图像分类的动态MLP

论文标题

通过利用地理和时间信息来进行细粒图像分类的动态MLP

Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information

论文作者

Yang, Lingfeng, Li, Xiang, Song, Renjie, Zhao, Borui, Tao, Juntian, Zhou, Shihao, Liang, Jiajun, Yang, Jian

论文摘要

细粒度的图像分类是一项具有挑战性的计算机视觉任务，其中各种物种具有相似的视觉外观，如果仅基于视觉线索，则会导致错误分类。因此，利用其他信息（例如，数据拍摄的位置和日期）很有帮助，这些信息和日期很容易访问但很少被利用。在本文中，我们首先证明了现有的多模式方法仅在单个维度上融合多个特征，从本质上讲，这在特征歧视方面没有足够的帮助。为了充分探索多模式信息的潜力，我们在图像表示顶部提出了动态MLP，该动态MLP与更较高和更宽的维度的多模式特征相互作用。动态MLP是通过可变位置和日期的学习嵌入来参数化的有效结构。它可以被视为一种自适应非线性投影，用于在视觉任务中生成更具歧视性的图像表示。据我们所知，这是第一次尝试探索动态网络的概念，以利用细粒度的图像分类任务中利用多模式信息。广泛的实验证明了我们方法的有效性。 T-SNE算法在视觉上表明我们的技术改善了视觉上相似但具有不同类别的图像表示的可识别性。 Furthermore, among published works across multiple fine-grained datasets, dynamic MLP consistently achieves SOTA results https://paperswithcode.com/dataset/inaturalist and takes third place in the iNaturalist challenge at FGVC8 https://www.kaggle.com/c/inaturalist-2021/leaderboard.代码可从https://github.com/ylingfeng/dynamicmlp.git获得

Fine-grained image classification is a challenging computer vision task where various species share similar visual appearances, resulting in misclassification if merely based on visual clues. Therefore, it is helpful to leverage additional information, e.g., the locations and dates for data shooting, which can be easily accessible but rarely exploited. In this paper, we first demonstrate that existing multimodal methods fuse multiple features only on a single dimension, which essentially has insufficient help in feature discrimination. To fully explore the potential of multimodal information, we propose a dynamic MLP on top of the image representation, which interacts with multimodal features at a higher and broader dimension. The dynamic MLP is an efficient structure parameterized by the learned embeddings of variable locations and dates. It can be regarded as an adaptive nonlinear projection for generating more discriminative image representations in visual tasks. To our best knowledge, it is the first attempt to explore the idea of dynamic networks to exploit multimodal information in fine-grained image classification tasks. Extensive experiments demonstrate the effectiveness of our method. The t-SNE algorithm visually indicates that our technique improves the recognizability of image representations that are visually similar but with different categories. Furthermore, among published works across multiple fine-grained datasets, dynamic MLP consistently achieves SOTA results https://paperswithcode.com/dataset/inaturalist and takes third place in the iNaturalist challenge at FGVC8 https://www.kaggle.com/c/inaturalist-2021/leaderboard. Code is available at https://github.com/ylingfeng/DynamicMLP.git

下载PDF全文

下载文献需遵守相关版权规定

论文标题