Paper Title
Streaming Video Temporal Action Segmentation In Real Time
Paper Authors
Paper Abstract
Temporal action segmentation (TAS) is a critical step toward long-term video understanding. Recent studies follow a pattern that builds models on pre-extracted features instead of raw video frames. However, we argue that these models are complicated to train and limit the application scenarios: they cannot segment the human actions in a video in real time, because they can only run after the features of the full video have been extracted. Since real-time action segmentation differs from the TAS task, we define it as the streaming video real-time temporal action segmentation (SVTAS) task. In this paper, we propose a real-time, end-to-end, multi-modality model for the SVTAS task. More specifically, under the constraint that no future information is available, we segment the human action in the current streaming video chunk in real time. Furthermore, the proposed model combines the feature of the last streaming video chunk, extracted by a language model, with the current image feature, extracted by an image model, to improve the quality of real-time temporal action segmentation. To the best of our knowledge, this is the first multi-modality real-time temporal action segmentation model. Under the same evaluation criteria as full-video temporal action segmentation, our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model and achieves 90% of the accuracy of the full-video state-of-the-art model.
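The abstract describes per-chunk inference that fuses the previous chunk's language-model feature with the current chunk's image-model features, without access to future frames. As a rough illustration of that data flow only (not the authors' architecture; the module names, feature dimensions, and the GRU-based fusion below are all assumptions), a minimal PyTorch sketch might look like:

```python
import torch
import torch.nn as nn

class SVTASChunkSegmenter(nn.Module):
    """Hypothetical sketch of chunk-wise multi-modal fusion for SVTAS.

    Mirrors the abstract's description: the feature of the last streaming
    chunk (from a language model) is fused with the current chunk's frame
    features (from an image model). Dimensions and module choices are
    illustrative assumptions, not the authors' implementation.
    """

    def __init__(self, img_dim=768, lang_dim=512, hidden=256, num_classes=19):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)    # project image-model features
        self.lang_proj = nn.Linear(lang_dim, hidden)  # project language-model feature
        # A unidirectional GRU keeps inference causal: no future frames are used.
        self.fuse = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, img_feats, prev_chunk_feat):
        # img_feats: (B, T, img_dim) per-frame features of the current chunk
        # prev_chunk_feat: (B, lang_dim) feature of the previous streamed chunk
        x = self.img_proj(img_feats)
        h0 = self.lang_proj(prev_chunk_feat).unsqueeze(0)  # seed state with past context
        x, _ = self.fuse(x, h0)
        return self.classifier(x)  # (B, T, num_classes) per-frame action logits

# Streaming usage: chunks arrive one at a time; the loop never looks ahead.
model = SVTASChunkSegmenter()
prev_feat = torch.zeros(1, 512)  # placeholder context for the first chunk
for _ in range(3):  # stand-in for an incoming video stream
    chunk_feats = torch.randn(1, 16, 768)  # e.g., 16 frames from the image model
    logits = model(chunk_feats, prev_feat)
    print(logits.argmax(-1))  # frame-level action predictions for this chunk
```

Seeding the recurrent state with the previous chunk's feature is one simple way to carry context across chunk boundaries while keeping per-chunk latency bounded; the paper's actual fusion mechanism may differ.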