A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition

Authors: Andong Deng, Taojiannan Yang, Chen Chen


The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observation suggests that current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR


