Paper Title
Group Contextualization for Video Recognition
Paper Authors
Paper Abstract
Learning discriminative representations from the complex spatio-temporal dynamic space is essential for video recognition. On top of stylized spatio-temporal computational units, further refining the learnt features with axial contexts has been shown to be promising for achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire feature channels and can hardly handle diverse video activities. The problem can be tackled with pair-wise spatio-temporal attention, which recomputes the feature response with cross-axis contexts, but at the expense of heavy computation. In this paper, we propose an efficient feature refinement method that decomposes the feature channels into several groups and refines them separately with different axial contexts in parallel. We refer to this lightweight feature calibration as group contextualization (GC). Specifically, we design a family of efficient element-wise calibrators, i.e., ECal-G/S/T/L, whose axial contexts are information dynamics aggregated from the other axes either globally or locally, to contextualize the feature channel groups. The GC module can be densely plugged into each residual layer of off-the-shelf video networks. With little computational overhead, consistent improvements are observed when GC is plugged into different networks. By utilizing the calibrators to embed features with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities. On videos with rich temporal variations, GC empirically boosts the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks. Code is available at https://github.com/haoyanbin918/Group-Contextualization.
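To make the group-wise calibration idea concrete, below is a minimal PyTorch sketch of how feature channels can be split into groups and refined by element-wise calibrators in parallel, as the abstract describes. It is an illustration, not the authors' implementation: the class names (`GlobalCalibrator`, `GroupContextualization`) are hypothetical, and for brevity every group here uses the same global-pooling calibrator, whereas the paper's ECal-G/S/T/L variants draw contexts from different axes (global, spatial, temporal, local); see the official repository for the actual module definitions.

```python
import torch
import torch.nn as nn


class GlobalCalibrator(nn.Module):
    """Hypothetical ECal-G-style calibrator: pool a global context over
    space and time, then gate the input channels element-wise."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4), keepdim=True)   # global average context
        return x * torch.sigmoid(self.fc(ctx))      # element-wise recalibration


class GroupContextualization(nn.Module):
    """Sketch of GC: split channels into groups and refine each group
    with its own calibrator in parallel, then concatenate."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        per_group = channels // groups
        # In the paper, the four groups would use ECal-G/S/T/L respectively;
        # here all groups share the same calibrator type for brevity.
        self.calibrators = nn.ModuleList(
            GlobalCalibrator(per_group) for _ in range(groups)
        )

    def forward(self, x):                              # x: (N, C, T, H, W)
        chunks = torch.chunk(x, self.groups, dim=1)    # channel groups
        refined = [cal(c) for cal, c in zip(self.calibrators, chunks)]
        return torch.cat(refined, dim=1)               # same shape as input


# Usage: calibrate a residual-layer feature map of a video backbone.
feat = torch.randn(2, 64, 8, 56, 56)                  # (batch, C, frames, H, W)
gc = GroupContextualization(channels=64, groups=4)
print(gc(feat).shape)                                  # torch.Size([2, 64, 8, 56, 56])
```

Because each calibrator only pools and gates its own channel group, the module preserves the input shape and adds little computation, which is what allows it to be inserted densely into every residual layer of an existing 2D-CNN such as TSN or TSM.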