Paper title
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Paper authors
Abstract
Instructional videos are often watched to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models, with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work is complementary to existing tasks, but requires models to produce interpretable structured knowledge in the form of verb-argument tuples. Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations. Our analysis shows that the proposed task is challenging, and that standard modeling approaches such as unsupervised segmentation, semantic role labeling, and visual action detection perform poorly when forced to predict every action of a procedure in a structured form.