Authors: Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell
Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
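To make the masked step modeling objective concrete, the toy sketch below (not the authors' implementation; all sizes, the mean-pooled "context", and the linear classifier are illustrative assumptions) masks a random subset of step features, predicts a weakly supervised label id for each step using the whole sequence as context, and computes cross-entropy only at the masked positions:

```python
import numpy as np

rng = np.random.default_rng(0)

num_steps, dim, num_labels = 6, 8, 10  # toy sizes, chosen for illustration
mask_prob = 0.25

# Toy step-level video features and their weakly supervised text-label ids.
step_feats = rng.normal(size=(num_steps, dim))
step_labels = rng.integers(0, num_labels, size=num_steps)

# Randomly choose steps to mask, replacing their features with a [MASK] vector.
mask = rng.random(num_steps) < mask_prob
mask[0] = True  # guarantee at least one masked step for the demo
inputs = np.where(mask[:, None], np.zeros(dim), step_feats)

# Stand-in for the transformer: each step's prediction also sees a mean-pooled
# summary of every step input, a drastically simplified form of the paper's
# idea of conditioning on the entire surrounding task.
W = rng.normal(size=(dim, num_labels))
context = inputs.mean(axis=0)
logits = (inputs + context) @ W

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Masked-step loss: cross-entropy over label predictions at masked steps only.
probs = softmax(logits)
loss = -np.log(probs[mask, step_labels[mask]]).mean()
print(float(loss))
```

In the actual model the linear map would be a transformer over step clips, but the loss structure (predict the textual step label only where the step was masked, with unmasked steps supplying context) is the same.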
Paper link: http://arxiv.org/pdf/2303.13519v1