AutoVideo: An Automated Video Action Recognition System
Daochen Zha†*, Zaid Pervaiz Bhat‡*, Yi-Wei Chen‡*, Yicheng Wang‡*, Sirui Ding‡*, Jiaben Chen†*, Kwei-Herng Lai†*, Anmoll Kumar Jain‡, Mohammad Qazim Bhat‡, Na Zou§, Xia Hu†

†Department of Computer Science, Rice University
‡Department of Computer Science and Engineering, Texas A&M University
§Department of Engineering Technology and Industrial Distribution, Texas A&M University

{daochen.zha,khlai,xiahu}@rice.edu, {zaid.bhat1234,yiwei_chen,wangyc,siruiding,jain.anmollkumar,nzou1}@tamu.edu, mqazimbhat@gmail.com, chenjb1@shanghaitech.edu.cn

*These authors contributed equally to this project.

Abstract

Action recognition is a crucial task for video understanding. In this paper, we present AutoVideo, a Python system for automated video action recognition. It currently supports seven action recognition algorithms and various pre-processing modules. Unlike the existing libraries that only provide model zoos, AutoVideo is built with the standard pipeline language. The basic building block is the primitive, which wraps a pre-processing module or an algorithm with some hyperparameters. AutoVideo is highly modular and extendable, and it can be easily combined with AutoML searchers. The pipeline language is general enough that we can easily enrich AutoVideo with algorithms for various other video-related tasks in the future. AutoVideo is released under the MIT license at https://github.com/datamllab/autovideo.

[Figure 1: System overview. A pipeline runs from a CSV reader through pre-processing (frame extraction and frame transformation) to an action recognition model (TSN, TSM, ECO, ...), with an automated tuner on top. Each module in AutoVideo is wrapped as a primitive with some hyperparameters. A pipeline consists of a series of primitives from pre-processing to action recognition. AutoVideo is equipped with tuners to search models and hyperparameters.]

Introduction

Video-based action recognition aims at identifying different actions from video clips. It is a crucial task for video understanding with broad applications in various areas, such as security (Meng, Pears, and Bailey 2007), healthcare (Gao et al. 2018), and behavior analysis (Poppe 2010).

Recently, deep learning models have achieved promising performance on action recognition tasks (Tran et al. 2015; Wang et al. 2016; Carreira and Zisserman 2017; Zolfaghari, Singh, and Brox 2018). However, the success of these models heavily relies on human expertise. Given a new task beyond the benchmark datasets, a practitioner often still needs extensive, laborious trials to test different models and tune the hyperparameters, which significantly impedes real-world applications.

To bridge this gap, we aim to design an easy-to-use toolkit that helps practitioners quickly develop prototypes for any new video action recognition task. In particular, we present a highly modular and extendable system that wraps data loading, data processing, and state-of-the-art action recognition models with the standard pipeline language. Following the pipeline language, we introduce data-driven searchers that automatically tune the hyperparameters to reduce human effort. While there are some other open-sourced video understanding libraries (Yue Zhao 2019; Team 2021), they often only focus on providing a model zoo for users. In contrast, we wrap the models with a highly modular and extendable pipeline language and inherently support hyperparameter tuning to save users' effort. Unlike some other research-oriented AutoML studies for videos (Piergiovanni et al. 2019; Ryoo et al. 2019), we target an open-sourced toolkit with unified interfaces that saves practitioners the engineering effort of developing models.

In this paper, we present the first release of our system for Automated Video Action Recognition, namely AutoVideo. It is designed under the D3M infrastructure (Milutinovic et al. 2020; Lai et al. 2021), i.e., data-driven model discovery via automated machine learning. In AutoVideo, the data loading, data processing, and learning models are all wrapped as primitives, and their combinations form various machine learning pipelines. Following this design, AutoVideo
supports data-driven searchers that automatically explore different pipelines and hyperparameters to identify the best model design. AutoVideo is highly modular and extensible since it follows the standard pipeline language. As such, practitioners can easily add new modules for action recognition or beyond, such as other video understanding tasks.

AutoVideo System

Figure 1 shows an overview of the AutoVideo system. It wraps the pre-processing and action recognition modules as primitives based on the standard pipeline language. A tuner can then be used to automatically discover the best pipelines. Our first release supports seven state-of-the-art action recognition models, including TSN (Wang et al. 2016), TSM (Lin, Gan, and Han 2019), I3D (Carreira and Zisserman 2017), ECO (Zolfaghari, Singh, and Brox 2018), C3D (Tran et al. 2015), R2P1D (Hou et al. 2019), and R3D (Hara, Kataoka, and Satoh 2017). This section first introduces the standard pipeline language and then elaborates on the interface.
Primitives and Pipelines

While there are many other AutoML systems (Jin, Song, and Hu 2019; Thornton et al. 2013; Li et al. 2020), the standard pipeline language used in the D3M program provides generic and extendable descriptions for modules and pipelines, and its interface can be easily adapted to various tasks. As such, we build AutoVideo upon the D3M infrastructure.

A primitive is the basic building block. It is an implementation of a function with some hyperparameters. A pipeline is a Directed Acyclic Graph (DAG) consisting of several primitive steps. Data types, such as NumPy ndarrays, can be passed between the steps of a pipeline. Each of these concepts is associated with metadata describing its parameters, hyperparameters, etc. More details of the pipeline language can be found in (Milutinovic et al. 2020). While our current scope focuses on action recognition, the pipeline language allows AutoVideo to be easily extended to support other video-related tasks, which will be our future work.
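The following minimal sketch illustrates how a pipeline can be viewed as an ordered sequence of primitive steps, each carrying its own hyperparameters. The class and step names are hypothetical and only convey the structure; they are not AutoVideo's actual D3M primitive classes.

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class PrimitiveStep:
    """One pipeline step: a wrapped module plus its hyperparameters."""
    name: str
    hyperparams: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Pipeline:
    """An ordered collection of primitive steps (a simple linear DAG)."""
    steps: List[PrimitiveStep]

# A pipeline mirroring Figure 1: pre-processing primitives followed by a model.
example_pipeline = Pipeline(steps=[
    PrimitiveStep("csv_reader"),
    PrimitiveStep("frame_extraction"),
    PrimitiveStep("frame_transformation", {"resize": 224}),      # 224 is a placeholder value
    PrimitiveStep("tsn", {"num_segments": 16, "dropout": 0.5}),  # values from the Appendix
])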

Interface Design

The interface is based upon Axolotl (https://gitlab.com/axolotl1/axolotl), our high-level abstraction of D3M. We further wrap and adapt the Axolotl interfaces into video-related functionalities. A minimal example of fitting an AutoVideo model is as follows.

from autovideo import fit, extract_frames

# Extract frames from the video
video_ext = train_dataset.iloc[0, 1].split('.')[-1]
extract_frames(train_media_dir, video_ext)

# Build pipeline based on config
pipeline = build_pipeline({
    "algorithm": args.alg,
    "load_pretrained": args.pretrained,
})

# Fit
fit(train_dataset, train_media_dir,
    target_index=target_index,
    pipeline=pipeline)

Example 1: Code snippet of fitting a model.

Here, train_dataset is a dataframe describing the video file names and labels, train_media_dir is the directory that contains the video files, and target_index specifies which column is the label. Users can customize the configuration dictionary passed to build_pipeline to train different algorithms with different hyperparameters. The fit function returns the fitted pipeline, which can be pickled and saved for future use. Similar function calls can be executed on a testing dataset to make predictions, as sketched below.
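The sketch assumes a produce-style call that mirrors fit; the function name and its argument names are assumptions for illustration and should be checked against the released package. Pickling the fitted pipeline follows the description above, where fitted_pipeline denotes the object returned by fit.

import pickle

# Assumed counterpart of fit; verify the exact name against the released API.
from autovideo import produce

# Save the fitted pipeline returned by fit for later use.
with open("fitted_pipeline.pkl", "wb") as f:
    pickle.dump(fitted_pipeline, f)

# Reload it and make predictions on a held-out test set (argument names are assumptions).
with open("fitted_pipeline.pkl", "rb") as f:
    fitted_pipeline = pickle.load(f)

predictions = produce(test_dataset, test_media_dir,
                      target_index=target_index,
                      fitted_pipeline=fitted_pipeline)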
In addition to fitting models with specified configurations, users can also use tuners to automatically search the pipelines (i.e., models and hyperparameters) in a data-driven manner. An example interface is shown below.

from autovideo.searcher import RaySearcher

searcher = RaySearcher(train_dataset,
                       train_media_dir,
                       valid_dataset=valid_dataset,
                       valid_media_dir=valid_media_dir)

best_config = searcher.search(
    search_space=search_space,
    config=config
)

Example 2: Code snippet of automated tuning.

Here, search_space is a dictionary specifying which models and which ranges of hyperparameters the tuner will cover, and config specifies some tuning configurations.
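For illustration, a search space matching the ranges listed in the Appendix could be expressed with Ray Tune's sampling utilities as below; the exact dictionary keys expected by RaySearcher and the config fields are assumptions rather than the documented interface.

from ray import tune

# Ranges mirror the search space described in the Appendix; key names are illustrative.
search_space = {
    "learning_rate": tune.uniform(0.0001, 0.001),
    "momentum": tune.uniform(0.9, 0.99),
    "weight_decay": tune.uniform(5e-4, 1e-3),
    "num_segments": tune.choice([8, 16, 32]),
}

# Tuning configuration (illustrative keys): which tuner to run and how many
# hyperparameter samples to draw (50 were used for HMDB-6 in the paper).
config = {
    "searching_algorithm": "hyperopt",
    "num_samples": 50,
}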
Evaluation of the Automated Tuners

Our current system supports two types of tuners: random search and Hyperopt (Bergstra et al. 2013). We compare the two tuners, as well as the results obtained by manually tuning the hyperparameters, on HMDB-51 and on a subset of the original dataset with six categories named HMDB-6. Both tuners use the same configurations, such as the search space; the detailed experimental settings are provided in the Appendix. As shown in Table 1, both tuners outperform the manually tuned pipeline by a fair margin, and Hyperopt achieves the best accuracy. The results verify the importance of hyperparameter tuning in action recognition tasks.

                   HMDB-6    HMDB-51
   Manually-Tuned  70.35%    46.8%
   Random Search   89.63%    53.92%
   Hyperopt        92.6%     56.14%

Table 1: Accuracy of tuners in AutoVideo.

Conclusions and Future Work

This paper presents AutoVideo, an automated video action recognition system. Unlike the existing libraries, AutoVideo is designed for AutoML: it is built on the highly modular and extendable pipeline language, and it explicitly supports model selection and hyperparameter tuning. Our evaluation verifies the necessity of tuning in practical use. In the future, we will include more video understanding tasks in AutoVideo.
References

Bergstra, J.; Yamins, D.; Cox, D. D.; et al. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, volume 13, 20. Citeseer.

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299-6308.

Gao, Y.; Xiang, X.; Xiong, N.; Huang, B.; Lee, H. J.; Alrifai, R.; Jiang, X.; and Fang, Z. 2018. Human action monitoring for healthcare based on deep learning. IEEE Access 6: 52277-52285.

Hara, K.; Kataoka, H.; and Satoh, Y. 2017. Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 3154-3160.

Hou, R.; Chen, C.; Sukthankar, R.; and Shah, M. 2019. An efficient 3D CNN for action/object segmentation in video. arXiv preprint arXiv:1907.08895.

Jin, H.; Song, Q.; and Hu, X. 2019. Auto-Keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1946-1956.

Lai, K.-H.; Zha, D.; Wang, G.; Xu, J.; Zhao, Y.; Kumar, D.; Chen, Y.; Zumkhawaka, P.; Wan, M.; Martinez, D.; et al. 2021. TODS: An Automated Time Series Outlier Detection System. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 16060-16062.

Li, Y.; Zha, D.; Venugopal, P.; Zou, N.; and Hu, X. 2020. PyODDS: An end-to-end outlier detection system with automated machine learning. In Companion Proceedings of the Web Conference 2020, 153-157.

Lin, J.; Gan, C.; and Han, S. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7083-7093.

Meng, H.; Pears, N.; and Bailey, C. 2007. A human action recognition system for embedded computer vision application. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1-6. IEEE.

Milutinovic, M.; Schoenfeld, B.; Martinez-Garcia, D.; Ray, S.; Shah, S.; and Yan, D. 2020. On Evaluation of AutoML Systems. In ICML Workshop.

Piergiovanni, A.; Angelova, A.; Toshev, A.; and Ryoo, M. S. 2019. Evolving space-time neural architectures for videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1793-1802.

Poppe, R. 2010. A survey on vision-based human action recognition. Image and Vision Computing 28(6): 976-990.

Ryoo, M. S.; Piergiovanni, A.; Tan, M.; and Angelova, A. 2019. AssembleNet: Searching for multi-stream neural connectivity in video architectures. arXiv preprint arXiv:1905.13209.

Team, P. 2021. PyTorchVideo. https://github.com/facebookresearch/pytorchvideo.

Thornton, C.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847-855.

Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489-4497.

Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, 20-36. Springer.

Yue Zhao, Yuanjun Xiong, D. L. 2019. MMAction. https://github.com/open-mmlab/mmaction.

Zolfaghari, M.; Singh, K.; and Brox, T. 2018. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 695-712.

Appendix

To accelerate the evaluation, we sub-sample six categories from the original HMDB-51 dataset to obtain HMDB-6. The sampled categories are brush hair, cartwheel, catch, chew, clap, and climb.

The common hyperparameters are as follows: 50 training epochs, an initial learning rate of 0.001 (decayed by a factor of 10 at epochs 20 and 40), weight decay 5e-4, batch size 4, momentum 0.9, 16 segments, RGB modality, and dropout 0.5. Since HMDB-51 is a small dataset that is highly prone to overfitting, we follow the widely accepted practice (Lin, Gan, and Han 2019) of fine-tuning weights pre-trained on the Kinetics dataset and freezing the Batch Normalisation layers.

For the tuners, the following search space was used: (continuous) learning rate [0.0001, 0.001], momentum [0.9, 0.99], and weight decay [5e-4, 1e-3]; (discrete) number of segments {8, 16, 32}. We report the results on the validation set. 50 samples were drawn for HMDB-6 and 15 samples were drawn for HMDB-51. On HMDB-51, each sample drawn takes roughly 10 GPU hours to train.
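The fixed training hyperparameters above correspond to a standard PyTorch setup. The sketch below shows one way to express the optimizer, the step-wise learning-rate decay, and the Batch Normalisation freezing described in this appendix; the stand-in model and the freezing loop are generic illustrations under those assumptions, not AutoVideo's internal implementation.

import torch
from torch import nn

# Stand-in module; in practice this would be a Kinetics-pretrained backbone such as TSN.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,           # initial learning rate
                            momentum=0.9,
                            weight_decay=5e-4)

# Decay the learning rate by a factor of 10 at epochs 20 and 40 (50 epochs in total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)

# Freeze the Batch Normalisation layers when fine-tuning, as described above.
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        module.eval()
        for param in module.parameters():
            param.requires_grad = False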