Paper Title
Multi-Task Learning of Object State Changes from Uncurated Videos
Paper Authors
Paper Abstract
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long, uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions, such as pouring water and pouring coffee. Second, we design a multi-task self-supervised learning procedure that exploits different types of constraints between objects and state-modifying actions, enabling end-to-end training of a model for the temporal localization of object states and actions in videos from only noisy video-level supervision. Third, we report results on the large-scale ChangeIt and COIN datasets, which contain tens of thousands of long (un)curated web videos depicting various interactions such as hole drilling, cream whisking, or paper plane folding. We show that our multi-task model achieves a 40% relative improvement over prior single-task methods and significantly outperforms both image-based and video-based zero-shot models on this problem. We also test our method on long egocentric videos from the EPIC-KITCHENS and Ego4D datasets in a zero-shot setup, demonstrating the robustness of our learned model.
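To make the multi-task setup and the ordering constraint concrete, below is a minimal, hypothetical sketch, not the authors' released code. It assumes precomputed per-frame features, a shared backbone with per-task heads scoring each frame as initial state, end state, or state-modifying action, and a simple causal-ordering rule (initial state before action before end state) used to derive noisy frame-level pseudo-labels from a video-level task label. All class names, dimensions, and the number of tasks are illustrative assumptions.

```python
# Illustrative sketch only: shared backbone + per-task state/action heads,
# with a causal-ordering pseudo-labeling step. Names and sizes are assumptions.
import torch
import torch.nn as nn


class MultiTaskStateActionModel(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_tasks=44):
        super().__init__()
        # Shared backbone mapping per-frame features to a common embedding.
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Per task: [background, initial state, end state] and
        #           [background, state-modifying action].
        self.state_head = nn.Linear(hidden_dim, num_tasks * 3)
        self.action_head = nn.Linear(hidden_dim, num_tasks * 2)
        self.num_tasks = num_tasks

    def forward(self, frame_feats):                # (T, feat_dim)
        h = self.backbone(frame_feats)             # (T, hidden_dim)
        state = self.state_head(h).view(-1, self.num_tasks, 3)
        action = self.action_head(h).view(-1, self.num_tasks, 2)
        return state.softmax(-1), action.softmax(-1)


def causal_pseudo_labels(state_probs, action_probs, task_id):
    """Pick frame indices (i, a, e) with i < a < e that maximize the joint
    score of initial state, action, and end state for the given task."""
    s_init = state_probs[:, task_id, 1]    # initial-state probability per frame
    s_end = state_probs[:, task_id, 2]     # end-state probability per frame
    act = action_probs[:, task_id, 1]      # action probability per frame
    num_frames = s_init.shape[0]
    best_score, best_idx = -1.0, (0, 1, 2)
    for a in range(1, num_frames - 1):
        i = int(torch.argmax(s_init[:a]))              # best initial state before a
        e = int(torch.argmax(s_end[a + 1:])) + a + 1   # best end state after a
        score = float(s_init[i] * act[a] * s_end[e])
        if score > best_score:
            best_score, best_idx = score, (i, a, e)
    return best_idx  # frames usable as noisy targets in a self-training round


# Example: score frames of one video labeled with task 3 and derive pseudo-labels.
model = MultiTaskStateActionModel()
frames = torch.randn(200, 2048)                    # stand-in for extracted features
state_p, action_p = model(frames)
init_f, action_f, end_f = causal_pseudo_labels(state_p, action_p, task_id=3)
```

In such a scheme, the pseudo-labeled frames would supervise standard cross-entropy losses on the state and action heads, so that only the video-level task label, plus the ordering constraint, drives the temporal localization; the sketch leaves out the noise-handling and additional constraints the paper relies on.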