Paper Title
A Deep Learning Approach to Object Affordance Segmentation
Paper Authors
Paper Abstract
Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat this as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model eliminates the need for object labels and bounding boxes by using a soft-attention mechanism that implicitly localizes the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences with pixel-wise annotations for 9 types of affordances, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while also being able to predict affordances for similar unseen objects on two image-only affordance datasets.
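To make the described pipeline concrete, below is a minimal PyTorch sketch of what such a soft-attention autoencoder could look like. This is an illustrative reconstruction, not the authors' published architecture: the per-frame CNN encoder, the GRU-based temporal aggregation, the feature-fusion scheme, and all layer sizes are assumptions; only the 9 affordance classes and the last-frame-only supervision are taken from the abstract.

```python
# Minimal sketch of a soft-attention autoencoder for affordance
# segmentation. Assumed design (NOT the paper's exact architecture):
# a per-frame CNN encoder, a GRU over spatially pooled frame codes
# for temporal aggregation, and a 1-channel soft-attention map that
# gates the decoder features.
import torch
import torch.nn as nn

NUM_AFFORDANCES = 9  # SOR3D-AFF annotates 9 affordance types (+1 background)

class AffordanceAutoencoder(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Per-frame convolutional encoder (downsamples by 4).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Temporal aggregation over spatially pooled frame codes
        # (a simplification assumed here for brevity).
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        # Soft-attention head: a spatial map in [0, 1] that implicitly
        # localizes the interaction hotspot, so no boxes are needed.
        self.attention = nn.Sequential(nn.Conv2d(hidden, 1, 1), nn.Sigmoid())
        # Decoder upsamples back to input resolution, predicting
        # per-pixel logits over background + affordance classes.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, NUM_AFFORDANCES + 1, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip; T == 1 for a static image.
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))       # (B*T, C, h', w')
        _, ch, fh, fw = feats.shape
        pooled = feats.reshape(b, t, ch, fh, fw).mean(dim=(3, 4))  # (B, T, C)
        temporal, _ = self.gru(pooled)                             # (B, T, C)
        # Fuse the final temporal state with the last frame's feature map.
        last_feat = feats.reshape(b, t, ch, fh, fw)[:, -1]
        fused = last_feat + temporal[:, -1].unsqueeze(-1).unsqueeze(-1)
        attn = self.attention(fused)                               # (B, 1, h', w')
        return self.decoder(fused * attn)                          # (B, 10, H, W)

# Supervision uses pixel-wise labels of only the sequence's last frame:
model = AffordanceAutoencoder()
clip = torch.randn(2, 8, 3, 64, 64)                # batch of 8-frame clips
last_frame_labels = torch.randint(0, 10, (2, 64, 64))
loss = nn.CrossEntropyLoss()(model(clip), last_frame_labels)
loss.backward()
```

The key idea this sketch tries to capture is that the attention map multiplicatively gates the decoder input, so the network must learn where the interaction happens purely from the last-frame segmentation loss, without object labels or bounding-box supervision.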