Paper Title

Location-free Human Pose Estimation

Paper Authors

Xixia Xu, Yingguo Gao, Ke Yan, Xue Lin, Qi Zou

Paper Abstract

Human pose estimation (HPE) usually requires large-scale training data to reach high performance. However, it is rather time-consuming to collect high-quality and fine-grained annotations for the human body. To alleviate this issue, we revisit HPE and propose a location-free framework without supervision of keypoint locations. We reformulate regression-based HPE from the perspective of classification. Inspired by CAM-based weakly-supervised object localization, we observe that coarse keypoint locations can be acquired through part-aware CAMs, but they remain unsatisfactory due to the gap between fine-grained HPE and object-level localization. To this end, we propose a customized transformer framework to mine a fine-grained representation of the human context, equipped with structural relations to capture subtle differences among keypoints. Concretely, we design a Multi-scale Spatial-guided Context Encoder to fully capture the global human context while focusing on part-aware regions, and a Relation-encoded Pose Prototype Generation module to encode structural relations. These components work together to strengthen the weak supervision that image-level category labels provide on locations. Our model achieves competitive performance on three datasets when supervised only at the category level and, importantly, it achieves results comparable to fully-supervised methods with only 25% of the location labels on MS-COCO and MPII.
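The abstract's starting observation, that coarse keypoint locations can be read off part-aware CAMs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, tensor shapes, and the use of PyTorch are illustrative assumptions about how a per-keypoint class activation map would be turned into a coarse location via a spatial argmax.

# Minimal sketch (assumed, not the paper's code): per-keypoint CAMs are formed by
# weighting backbone feature channels with each keypoint classifier, and the coarse
# location of each keypoint is taken as the spatial argmax of its CAM.
import torch

def coarse_keypoints_from_cams(features: torch.Tensor,
                               class_weights: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W) backbone feature map.
    class_weights: (K, C) classifier weights, one row per keypoint category.
    Returns coarse (x, y) locations of shape (B, K, 2) in feature-map coordinates.
    """
    B, C, H, W = features.shape
    # Part-aware CAMs: project features onto each keypoint classifier.
    cams = torch.einsum('kc,bchw->bkhw', class_weights, features)   # (B, K, H, W)
    # Coarse location = spatial argmax of each CAM.
    flat_idx = cams.flatten(2).argmax(dim=-1)                       # (B, K)
    ys = torch.div(flat_idx, W, rounding_mode='floor')
    xs = flat_idx % W
    return torch.stack([xs, ys], dim=-1)                            # (B, K, 2)

if __name__ == "__main__":
    feats = torch.randn(2, 256, 64, 48)   # dummy backbone features
    w = torch.randn(17, 256)              # 17 COCO keypoint categories (example)
    print(coarse_keypoints_from_cams(feats, w).shape)  # torch.Size([2, 17, 2])

As the abstract notes, such CAM-derived locations are only coarse; the proposed transformer encoder and pose prototype module are what refine them under image-level supervision.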
