Paper Title

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Paper Authors

Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

Paper Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach to this problem is to learn a joint embedding space in which cross-modal similarities are measured. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions, and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. Specifically, the model disentangles text into a hierarchical semantic graph comprising three levels (events, actions, and entities) together with the relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video representations. The HGR model aggregates matching scores from the different video-text levels to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences.
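To make the global-to-local matching concrete, here is a minimal sketch, not the authors' implementation: it assumes each semantic level (event, action, entity) has already been encoded into a single vector per modality, and it aggregates the per-level cosine similarities by a simple mean, whereas the actual HGR model builds multiple graph nodes per level and uses attention-based graph reasoning with a learned aggregation. All function and variable names below are hypothetical.

```python
import numpy as np

# The three semantic levels in the HGR decomposition.
LEVELS = ("event", "action", "entity")

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hierarchical_match(video_embs: dict, text_embs: dict) -> float:
    """Aggregate level-wise cross-modal similarities into one score.

    Each modality supplies one embedding per semantic level; the final
    retrieval score here is simply the mean of the per-level cosine
    similarities, a stand-in for the paper's learned aggregation.
    """
    return float(np.mean([cosine_sim(video_embs[l], text_embs[l])
                          for l in LEVELS]))

# Toy usage: random vectors standing in for learned level embeddings.
rng = np.random.default_rng(0)
video = {l: rng.standard_normal(256) for l in LEVELS}
text = {l: rng.standard_normal(256) for l in LEVELS}
print(hierarchical_match(video, text))
```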
