Paper Title


Interaction Region Visual Transformer for Egocentric Action Anticipation

Authors

Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

Abstract


Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We name our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual-transformer-based methods, including object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
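The SCA step described above can be viewed as hand tokens querying object tokens through standard scaled dot-product cross-attention, producing refined hand tokens that mix in object appearance. Below is a minimal, dependency-free sketch of that mechanism with toy dimensions; the function names and shapes are hypothetical illustrations, not the authors' implementation (which operates on learned patch embeddings inside a ViT).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query token attends
    over all key/value tokens and returns a weighted mix of values."""
    d = len(keys[0])  # key dimension, used for the 1/sqrt(d) scaling
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Toy example: one hand token queries two object tokens (d = 2).
# The hand token aligns with the first object, so the refined token
# is weighted toward that object's features.
hand_tokens = [[1.0, 0.0]]
object_tokens = [[1.0, 0.0], [0.0, 1.0]]
refined = cross_attention(hand_tokens, object_tokens, object_tokens)
```

In the paper's setting, the same primitive would then be reused with context tokens as keys/values (the Trajectory Cross-Attention step) to inject environment information into the interaction tokens.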
