Video Summarization Using Deep Neural Networks: A Survey

Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, Senior Member, IEEE, and Ioannis Patras, Senior Member, IEEE

arXiv:2101.06072v1 [cs.CV] 15 Jan 2021

E. Apostolidis is with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, and with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: apostolid@iti.gr.
E. Adamantidou, A. I. Metsai and V. Mezaris are with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, e-mail: {adamelen, alexmetsai, bmezaris}@iti.gr.
I. Patras is with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: i.patras@qmul.ac.uk.
¹ https://www.tubefilter.com/2015/07/26/youtube-400-hours-content-every-minute/
² https://www.statista.com/

Abstract—Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the suitability of evaluation protocols, we indicate potential future research directions.

Index Terms—Video summarization, Deep neural networks, Supervised learning, Unsupervised learning, Summarization datasets, Evaluation protocols.

I. INTRODUCTION

In July 2015, YouTube revealed that it receives over 400 hours of video content every single minute, which translates to 65.7 years' worth of content uploaded every day¹. Since then, we are experiencing an even stronger engagement of consumers with both online video platforms and devices (e.g., smart-phones, wearables etc.) that carry powerful video recording sensors and allow instant uploading of the captured video on the Web. According to newer estimates², YouTube now receives 500 hours of video per minute; and YouTube is just one of the many video hosting platforms (e.g., DailyMotion, Vimeo), social networks (e.g., Facebook, Twitter, Instagram), and online repositories of media and news organizations that host large volumes of video content. So, how is it possible for someone to efficiently navigate within endless collections of videos, and find the video content that s/he is looking for? The answer to this question comes not only from video retrieval technologies but also from technologies for automatic video summarization. The latter allow generating a concise synopsis that conveys the important parts of the full-length video. Given the plethora of video content on the Web, effective video summarization facilitates viewers' browsing of and navigation in large video collections, thus increasing viewers' engagement and content consumption.

The application domain of automatic video summarization is wide and includes (but is not limited to) the use of such technologies by media organizations (after integrating such techniques into their content management systems), to allow effective indexing, browsing, retrieval and promotion of their media assets; and video sharing platforms, to improve viewing experience, enhance viewers' engagement and increase content consumption. In addition, video summarization that is tailored to the requirements of particular content presentation scenarios can be used for e.g., generating trailers or teasers of movies and episodes of a TV series; presenting the highlights of an event (e.g., a sports game, a music band performance, or a public debate); and creating a video synopsis with the main activities that took place over e.g., the last 24hrs of recordings of a surveillance camera, for time-efficient progress monitoring or security purposes.

A number of surveys on video summarization have already appeared in the literature. In one of the first works, Barbieri et al. [1] classify the relevant bibliography according to several aspects of the summarization process, namely the targeted scenario, the type of visual content, and the characteristics of the summarization approach. In another early study, Li et al. [2] divide the existing summarization approaches into utility-based methods that use attention models to identify the salient objects and scenes, and structure-based methods that build on the video shots and scenes. Truong et al. [3] discuss a variety of attributes that affect the outcome of a summarization process, such as the video domain, the granularity of the employed video fragments, the utilized summarization methodology, and the targeted type of summary. Money et al. [4] divide the bibliography into methods that rely on the analysis of the video stream, methods that process contextual video metadata, and hybrid approaches that rely on both types of the aforementioned data. Jiang et al. [5] discuss a few characteristic video summarization approaches, that include the extraction of low-level visual features for assessing frame similarity or
performing clustering-based key-frame selection; the detection of the main events of the video using motion descriptors; and the identification of the video structure using Eigen-features. Hu et al. [6] classify the summarization methods into those that target minimum visual redundancy, those that rely on object or event detection, and others that are based on multimodal integration. Ajmal et al. [7] similarly classify the relevant literature in clustering-based methods, approaches that rely on detecting the main events of the story, etc. Nevertheless, all the aforementioned works report on early approaches to video summarization; they do not present how the summarization landscape has evolved over the last years and especially after the introduction of deep learning algorithms.

The more recent study of Molino et al. [8] focuses on egocentric video summarization and discusses the specifications and the challenges of this task. In another recent work, Basavarajaiah et al. [9] provide a classification of various summarization approaches, including some recently-proposed deep-learning-based methods; however, it mainly focuses on summarization algorithms that are directly applicable on the compressed domain. Finally, another recent survey of Vivekraj et al. [10] presents the relevant bibliography based on a two-way categorization, that relates to the utilized data modalities during the analysis and the incorporation of human aspects. With respect to the latter, it further splits the relevant literature into methods that create summaries by modeling the human understanding and preferences (e.g., using attention models, the semantics of the visual content, or ground-truth annotations and machine-learning algorithms), and conventional approaches that rely on the statistical processing of low-level features of the video. Nevertheless, none of the above surveys presents in a comprehensive manner the current developments towards generic video summarization, that are tightly related to the growing use of advanced deep neural network architectures for learning the summarization task. As a matter of fact, the relevant research area is a very active one as more than 35 different deep-learning-based video summarization algorithms have been proposed over the last five years; a subset of these methods represents the current state of the art on automatic video summarization. Motivated by this observation, we aim to fill this gap in the literature by presenting the relevant bibliography on deep-learning-based video summarization and also discussing other aspects that are associated with it, such as the protocols used for evaluating video summarization.

This article provides a comprehensive survey of the existing deep-learning-based video summarization approaches. Section II begins by defining the problem of automatic video summarization and presenting the most prominent types of video summary. Then, it provides a high-level description of the analysis pipeline of deep-learning-based video summarization algorithms, and introduces a taxonomy of the relevant literature according to the utilized data modalities, the adopted training strategy, and the implemented learning approaches. Finally, it discusses aspects that relate to the generated summary, such as the desired properties of a static (frame-based) video summary and the length of a dynamic (fragment-based) video summary. Section III builds on the introduced taxonomy to systematically review the relevant bibliography. A primary categorization is made according to the use or not of human-generated ground-truth data for learning, and a secondary categorization is made based on the adopted learning objective or the utilized data modalities by each different class of methods. For each one of the defined classes, we illustrate the main processing pipeline and report on the specifications of the associated summarization algorithms. After presenting the relevant bibliography, we provide some general remarks that reflect how the field has evolved, especially over the last five years, highlighting the pros and cons of each class of methods. Section IV continues with an in-depth discussion on the utilized datasets and the different evaluation protocols of the literature. Following, Section V discusses the findings of extensive performance comparisons that are based on the results reported in the relevant papers, indicates the most competitive methods in the fields of (weakly-)supervised and unsupervised video summarization, and examines whether there is a performance gap between these two main types of approaches. Based on the surveyed bibliography, in Section VI we propose potential future directions to further advance the current state of the art in video summarization. Finally, Section VII concludes this work by briefly outlining the core findings of the conducted study.

II. PROBLEM STATEMENT

Video summarization aims to generate a short synopsis that summarizes the video content by selecting its most informative and important parts. The produced summary is usually composed of a set of representative video frames (a.k.a. video key-frames), or video fragments (a.k.a. video key-fragments) that have been stitched in chronological order to form a shorter video. The former type of a video summary is known as video storyboard, and the latter type is known as video skim. One advantage of video skims over static sets of frames is the ability to include audio and motion elements that offer a more natural story narration and potentially enhance the expressiveness and the amount of information conveyed by the video summary. Furthermore, it is often more entertaining and interesting for the viewer to watch a skim rather than a slide show of frames [11]. On the other hand, storyboards are not restricted by timing or synchronization issues and, therefore, they offer more flexibility in terms of data organization for browsing and navigation purposes [12], [13].

A high-level representation of the typical deep-learning-based video summarization pipeline is depicted in Fig. 1. The first step of the analysis involves the representation of the visual content of the video with the help of feature vectors. Most commonly, such vectors are extracted at the frame-level, for all frames or for a subset of them selected via a frame-sampling strategy, e.g., processing 2 frames per second. In this way, the extracted feature vectors store information at a very detailed level and capture the dynamics of the visual content that are of high significance when selecting the video parts that form the summary. Typically, in most deep-learning-based video summarization techniques the visual content of the video frames is represented by deep feature vectors extracted with the help of pre-trained neural networks.
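To make this first step of the pipeline concrete, the following is a minimal sketch of frame sampling and deep feature extraction, assuming PyTorch/torchvision and a pre-trained GoogLeNet backbone (reported below as the most commonly used one); the use of OpenCV for frame decoding, the 2-fps sampling rate and the function names are illustrative choices, not part of any specific cited method.

```python
import cv2                      # frame decoding (illustrative choice)
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained GoogLeNet used as a fixed feature extractor: we drop the final
# classification layer and keep the 1024-d pooled descriptor per frame.
cnn = models.googlenet(weights="DEFAULT")
cnn.fc = torch.nn.Identity()    # expose the pool5 (1024-d) output
cnn.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps=2):
    """Sample about `fps` frames per second and return an (N, 1024) feature matrix."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                feats.append(cnn(preprocess(rgb).unsqueeze(0)).squeeze(0))
            idx += 1
    cap.release()
    return torch.stack(feats)   # one deep feature vector per sampled frame
```

The resulting sequence of frame-level descriptors is what the deep summarizer networks discussed in the remainder of this survey operate on.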
Fig. 1. High-level representation of the analysis pipeline of the deep-learning-based video summarization methods for generating a video storyboard and a video skim.

For this, a variety of Convolutional Neural Networks (CNNs) and Deep Convolutional Neural Networks (DCNNs) have been used in the bibliography, that include GoogleNet (Inception V1) [14], Inception V3 [15], AlexNet [16], variations of ResNet [17] and variations of VGGnet [18]. Nevertheless, the GoogleNet appears to be the most commonly used one thus far. The extracted features are then utilized by a deep summarizer network, which is trained by trying to minimize an objective function or a set of objective functions.

The output of the trained Deep Summarizer Network can be either a set of selected video frames (key-frames) that form a static video storyboard, or a set of selected video fragments (key-fragments) that are concatenated in chronological order and form a short video skim. With respect to the generated video storyboard, this should be similar with the sets of key-frames that would be selected by humans and must exhibit minimal visual redundancy. With regards to the produced video skim, this typically should be equal or less than a predefined length L. For experimentation and comparison purposes, this is most often set as L = p · T, where T is the video duration and p is the ratio of the summary to video length; p = 0.15 is a typical value, in which case the summary should not exceed 15% of the original video's duration. As a side note, the production of a video skim (which is the ultimate goal of the majority of the proposed deep-learning-based summarization algorithms) requires the segmentation of the video into consecutive and non-overlapping fragments that exhibit visual and temporal coherence, thus offering a seamless presentation of a part of the story. Given this segmentation and the estimated frames' importance scores by the trained Deep Summarizer Network, video-segment-level importance scores are computed by averaging the importance scores of the frames that lie within each video segment. These segment-level scores are then used to select the key-fragments given the summary length L, and most methods tackle this step by solving the Knapsack problem.
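The last step described above (key-fragment selection under the length budget L = p · T) can be illustrated with a standard 0/1 knapsack dynamic program; the sketch below assumes per-fragment importance values obtained by averaging frame scores, and all names are illustrative rather than taken from a specific method.

```python
def knapsack_select(values, lengths, budget):
    """0/1 knapsack: pick fragments maximizing total importance within `budget` frames."""
    n = len(values)
    # dp[i][c] = best total value using the first i fragments with capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], values[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    # Backtrack to recover the indices of the selected fragments.
    selected, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

def summarize(frame_scores, segments, ratio=0.15):
    """`segments` is a list of (start, end) frame indices produced by shot segmentation."""
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in segments]  # segment-level scores
    lengths = [e - s for s, e in segments]
    budget = int(ratio * len(frame_scores))                           # L = p * T, with p = 0.15
    return knapsack_select(values, lengths, budget)
```

Concatenating the selected fragments in chronological order then yields the video skim.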
With regards to the utilized type of data, the current bibliography on deep-learning-based video summarization can be divided between:
• Unimodal approaches that utilize only the visual modality of the videos for feature extraction, and learn summarization in a (weakly-)supervised or unsupervised manner.
• Multimodal methods that exploit the available textual metadata and learn semantic/category-driven summarization in a supervised way by increasing the relevance between the semantics of the summary and the semantics of the associated metadata or video category.

Concerning the adopted training strategy, the existing deep-learning-based video summarization algorithms can be coarsely categorized in the following categories:
• Supervised approaches that rely on datasets with human-labeled ground-truth annotations (either in the form of video summaries, as in the case of the SumMe dataset [19], or in the form of frame-level importance scores, as in the case of the TVSum dataset [20]), based on which they try to discover the underlying criterion for video frame/fragment selection and video summarization.
• Unsupervised approaches that overcome the need for ground-truth data (whose production requires time-demanding and laborious manual annotation procedures), based on learning mechanisms that require only an adequately large collection of original videos for their training.
• Weakly-supervised approaches that, similarly to unsupervised approaches, aim to alleviate the need for large sets of hand-labeled data. Less-expensive weak labels are utilized with the understanding that they are imperfect compared to a full set of human annotations, but can nonetheless be used to create strong predictive models.

Building on the above described categorizations, a more detailed taxonomy of the relevant bibliography is depicted in Fig. 2. The penultimate layer of this arboreal illustration shows the different learning approaches that have been adopted. The leaves of each node of this layer show the utilized techniques for implementing each learning approach, and contain references to the most relevant works in the bibliography. This taxonomy will be the basis for presenting the relevant bibliography in the following section.

III. DEEP LEARNING APPROACHES

This section provides a systematic review of the bibliography. It starts by presenting the different classes of supervised (Section III-A), unsupervised (Section III-B), and weakly-supervised (Section III-C) video summarization methods, which rely solely on the analysis of the visual content. Following, it reports on multimodal approaches (Section III-D) that process also the available text-based metadata. Finally, it provides some general remarks (Section III-E) that outline how the field has evolved, especially over the last five years.

A. Supervised Video Summarization

1. Learn frame importance by modeling the temporal dependency among frames. Early deep-learning-based approaches cast summarization as a structured prediction problem and try to make estimates about the frames' importance by modeling their temporal dependency. As illustrated in Fig. 3, during the training phase the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. These data are then used to model the dependencies among the video frames in time (illustrated with solid arched lines) and estimate the frames' importance. The predicted importance scores are compared with the ground-truth data and the outcome of this comparison guides the training of the Summarizer. The first proposed approach to this direction [21] uses Long Short-Term Memory (LSTM) units [22] to model variable-range temporal dependency among video frames. Frames' importance is estimated using a multi-layer perceptron (MLP), and the diversity of the visual content of the generated summary is increased based on the Determinantal Point Process (DPP) [23]. [24] proposes a two-layer LSTM architecture. The first layer extracts and encodes data about the video structure. The second layer uses this information to estimate fragment-level importance, and select the key-fragments of the video. [25] extends the previous method by introducing a component that is trained to identify the shot-level temporal structure of the video. This knowledge is then utilized for estimating importance at the shot-level and producing a key-shot-based video summary. [26] builds on [21] and introduces an attention mechanism to model the temporal evolution of the users' interest. Following, it uses this information to estimate frames' importance and select the video key-frames to build a video storyboard. In the same direction, a few methods utilize sequence-to-sequence (a.k.a. seq2seq) architectures in combination with attention mechanisms. [27] proposes a seq2seq network for video summarization, that is composed of a soft, self-attention mechanism and a two-layer fully connected network for regression of the frames' importance scores. [28] formulates video summarization as a seq2seq learning problem and proposes an LSTM-based encoder-decoder network with an intermediate attention layer. The summarization model of this work is extended in [29] by integrating a semantic preserving embedding network that evaluates the output of the decoder with respect to the preservation of the video's semantics using a tailored semantic preserving loss, and replacing the previously used Mean Square Error (MSE) loss by the Huber loss to enhance its robustness to outliers. [30] proposes a hierarchical approach which uses a generator-discriminator architecture (similar to the one in [31]) as an internal mechanism to estimate the representativeness of each shot and define a set of candidate key-frames. Then, it employs a multi-head attention model to further assess candidates' importance and select the key-frames that form the summary.
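The common formulation shared by this family of methods (a recurrent network over the sequence of frame features, followed by a small regressor that outputs per-frame importance scores trained against the human annotations) can be sketched as follows; this is an illustrative skeleton with hypothetical dimensions and names, not the exact architecture of any cited work.

```python
import torch
import torch.nn as nn

class FrameImportanceLSTM(nn.Module):
    """Bi-directional LSTM over frame features -> per-frame importance score in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):                # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.head(h).squeeze(-1)      # (batch, T) importance scores

# One supervised training step against human-annotated frame scores (MSE loss).
model = FrameImportanceLSTM()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(1, 300, 1024)            # toy sequence of 300 frame descriptors
gt_scores = torch.rand(1, 300)               # stand-in for ground-truth importance scores
loss = nn.functional.mse_loss(model(feats), gt_scores)
optim.zero_grad(); loss.backward(); optim.step()
```

The methods reviewed above differ mainly in the recurrent/attention machinery placed between the frame features and the regression head, and in the loss used to compare the predicted and the ground-truth scores.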
Fig. 2. A taxonomy of the existing deep-learning-based video summarization algorithms.
[32] tackles video summarization as a semantic segmentation task where the input video is seen as a 1D image (of size equal to the number of video frames) with K channels that correspond to the K dimensions of the frames' representation vectors (either containing raw pixel values or being precomputed feature vectors). Then, it uses popular semantic segmentation models, such as Fully Convolutional Networks (FCN) [33] and an adaptation of DeepLab [34], and builds a network (called Fully Convolutional Sequence Network) for video summarization. The latter consists of a stack of convolutions with increasing effective context size as we go deeper in the network, that enable the network to effectively model long-range dependency among frames and learn frames' importance. Finally, to address issues related to the limited capacity of LSTMs, some techniques use additional memory. [35] stores information about the entire video in an external memory and predicts each shot's importance by learning an embedding space that enables matching of each shot with the entire memory information. [36] stacks multiple LSTM and memory layers hierarchically to derive long-term temporal context, and uses this information to estimate the frames' importance.

Fig. 3. High-level representation of the analysis pipeline of supervised algorithms that perform summarization by learning the frames' importance after modeling their temporal or spatiotemporal dependency. For the latter class of methods (i.e., modeling the spatiotemporal dependency among frames), object bounding boxes and object relations in time, shown with dashed rectangles and lines, are used to illustrate the extension that models both the temporal and spatial dependency among frames.

2. Learn frame importance by modeling the spatiotemporal structure of the video. Aiming to learn how to make better estimates about the importance of video frames/fragments, some techniques pay attention to both the spatial and temporal structure of the video. Again, the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. But, extending the analysis pipeline of the previously described group of methods, it then also models the spatiotemporal dependencies among frames (shown with dashed rectangles and lines in Fig. 3). Once again, the predicted importance scores are compared with the ground-truth data and the outcome of this comparison guides the training of the Summarizer. From this perspective, [37] presents an encoder-decoder architecture with convolutional LSTMs, that models the spatiotemporal relationship among parts of the video. In addition to the estimates about the frames' importance, the algorithm enhances the visual diversity of the summary via next frame prediction and shot detection mechanisms, based on the intuition that the first frames of a shot generally have high likelihood of being part of the summary. [38] extracts deep and shallow features from the video content using a trainable 3D-CNN and builds a new representation through a fusion strategy. Then, it uses this representation in combination with convolutional LSTMs to model the spatial and temporal structure of the video. Finally, summarization is learned with the help of a new loss function (called Sobolev loss) that aims to define a series of frame-level importance scores that is close to the series of ground-truth scores by minimizing the distance of the derivatives of these sequential data, and to exploit the temporal structure of the video. [39] extracts spatial and temporal information by processing the raw frames and their optical flow maps with CNNs, and learns how to estimate frames' importance based on human annotations and a label distribution learning process. [40] combines CNNs and Gated Recurrent Units [41] (a type of Recurrent Neural Network (RNN)) to form spatiotemporal feature vectors, that are then used to estimate the level of activity and importance of each frame. [42] trains a neural network for spatiotemporal data extraction and creates an inter-frames motion curve. The latter is then used as input to a transition effects detection method that segments the video into shots. Finally, a self-attention model exploits the human-generated ground-truth data to learn how to estimate the intra-shot importance and select the key-frames/fragments of the video to form a static/dynamic video summary.

3. Learn summarization by fooling a discriminator when trying to discriminate a machine-generated from a human-generated summary. Following a completely different approach to minimizing the distance between the machine-generated and the ground-truth summaries, a couple of methods use Generative Adversarial Networks (GANs). As presented in Fig. 4, the Summarizer (which acts as the Generator of the GAN) gets as input the sequence of the video frames and generates a summary by computing frame-level importance scores. The generated summary (i.e., the predicted frame-level importance scores) along with an optimal video summary for this video (i.e., frame-level importance scores according to the users' preferences) are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. The training of the entire summarization architecture is performed in an adversarial manner. The Summarizer tries to fool the Discriminator to not distinguish the predicted from the user-generated summary, and the Discriminator aims to learn how to make this distinction. When the Discriminator's confidence is very low (i.e., the classification error is approximately equal for both the machine- and the user-generated summary), then the Summarizer is able to generate a summary that is very close to the users' expectations.
In this context, [43] combines LSTMs and Dilated Temporal Relational (DTR) units to estimate temporal dependencies among frames at different temporal windows, and learns summarization by trying to fool a trainable discriminator when distinguishing the machine-based summary from the ground-truth and a randomly-created one. [44] suggests an adversarial learning approach for (semi-)supervised video summarization. The Generator/Summarizer is an attention-based Pointer Network [45] that defines the start and end point of each video fragment that is used to form the summary. The Discriminator is a 3D-CNN classifier that judges whether a fragment is from a ground-truth or a machine-generated summary. Instead of using the typical adversarial loss, in this algorithm the output of the Discriminator is used as a reward to train the Generator/Summarizer based on reinforcement learning. Thus far, the use of GANs for supervised video summarization is limited. Nevertheless, this machine learning framework has been widely used for unsupervised video summarization, as discussed in the following section.

Fig. 4. High-level representation of the analysis pipeline of supervised algorithms that learn summarization with the help of ground-truth data and adversarial learning.

B. Unsupervised Video Summarization

1. Learn summarization by fooling a discriminator when trying to discriminate the original video (or set of key-frames) from a summary-based reconstruction of it. Given the lack of any guidance (in the form of ground-truth data) for learning video summarization, most existing unsupervised approaches rely on the rule that a representative summary ought to assist the viewer to infer the original video content. In this context, these techniques utilize GANs to learn how to create a video summary that allows a good reconstruction of the original video. The main concept of this training approach is depicted in Fig. 5. The Summarizer is usually composed of a Key-frame Selector that estimates the frames' importance and generates a summary, and a Generator that reconstructs the video based on the generated summary. It gets as input the sequence of the video frames and, through the aforementioned internal processing steps, reconstructs the original video based on the generated summary (which is represented by the predicted frame-level importance scores). The reconstructed video along with the original one are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. Similarly to the supervised GAN-based methods, the training of the entire summarization architecture is performed in an adversarial manner. However, in this case the Summarizer tries to fool the Discriminator when distinguishing the summary-based reconstructed video from the original one, while the Discriminator aims to learn how to make the distinction. When this discrimination is not possible (i.e., the classification error is approximately equal for both the reconstructed and the original video), the Summarizer is considered to be able to build a video summary that is highly-representative of the overall video content. To this direction, the work of Mahasseni et al. [31] is the first that combines an LSTM-based key-frame selector with a Variational Auto-Encoder (VAE) and a trainable Discriminator, and learns video summarization through an adversarial learning process that aims to minimize the distance between the original video and the summary-based reconstructed version of it. [46] builds on the network architecture of [31] and suggests a stepwise, label-based approach for training the adversarial part of the network, that leads to improved summarization performance. [47] also relies on a VAE-GAN architecture but extends it with a chunk and stride network (CSNet) and a tailored difference attention mechanism for assessing the frames' dependency at different temporal granularities when selecting the video key-frames. [48] aims to maximize the mutual information between the summary and the video using a trainable couple of discriminators and a cycle-consistent adversarial learning objective. The frame selector (bi-directional LSTM) learns a video representation that captures the long-range dependency among frames. The evaluator is composed of two GANs where the forward GAN learns to reconstruct the original video from the video summary and the backward GAN learns to invert the processing, and evaluates the consistency between the output of such cycle learning as a measure that quantifies information preservation between the original video and the generated summary. Using this measure, the evaluator guides the frame selector to identify the most informative frames and form the video summary. [49] introduces a variation of [46] that replaces the Variational Auto-Encoder with a deterministic Attention Auto-Encoder for learning an attention-driven reconstruction of the original video, which subsequently improves the key-fragment selection process. [50] presents a self-attention-based conditional GAN. The Generator produces weighted frame features and predicts frame-level importance scores, while the Discriminator tries to distinguish between the weighted and the raw frame features. A conditional feature selector is used to guide the GAN model to focus on more important temporal regions of the whole video frames, while long-range temporal dependencies along the whole video sequence are modeled by a multi-head self-attention mechanism. Finally, [51] learns video summarization from unpaired data based on an adversarial process that relies on GANs and a Fully-Convolutional Sequence Network (FCSN) encoder-decoder. The model of [51] aims to learn a mapping function of a raw video to a human-like summary, such that the distribution of the generated summary is similar to the distribution of human-created summaries, while content diversity is forced by applying a relevant constraint on the learned mapping function.
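A much-simplified sketch of this summary-based reconstruction idea is given below: a frame-scoring selector, a sequence auto-encoder that reconstructs the frame features from their score-weighted version, and a discriminator that tries to tell the reconstruction from the original. It is only meant to illustrate the training logic discussed above (in the spirit of the SUM-GAN family), not to faithfully reimplement any cited architecture; all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

D_FEAT, HID = 1024, 256

class Scorer(nn.Module):                      # frame selector: per-frame importance in [0, 1]
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(D_FEAT, HID, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * HID, 1)
    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)          # (B, T)

class SeqAE(nn.Module):                       # reconstructs frame features from the weighted input
    def __init__(self):
        super().__init__()
        self.enc = nn.LSTM(D_FEAT, HID, batch_first=True)
        self.dec = nn.LSTM(HID, D_FEAT, batch_first=True)
    def forward(self, x):
        z, _ = self.enc(x)
        rec, _ = self.dec(z)
        return rec                                              # (B, T, D_FEAT)

class Critic(nn.Module):                      # discriminator: original vs reconstructed sequence
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(D_FEAT, HID, batch_first=True)
        self.out = nn.Linear(HID, 1)
    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))                # (B, 1) "looks original" probability

scorer, ae, critic = Scorer(), SeqAE(), Critic()
opt_sum = torch.optim.Adam(list(scorer.parameters()) + list(ae.parameters()), lr=1e-4)
opt_dis = torch.optim.Adam(critic.parameters(), lr=1e-4)
bce = nn.BCELoss()

feats = torch.randn(1, 240, D_FEAT)           # toy sequence of deep frame features
real, fake = torch.ones(1, 1), torch.zeros(1, 1)

# Summarizer step: reconstruct the video from the score-weighted features and fool the critic.
scores = scorer(feats)
recon = ae(scores.unsqueeze(-1) * feats)
loss_sum = nn.functional.mse_loss(recon, feats) + bce(critic(recon), real)
opt_sum.zero_grad(); loss_sum.backward(); opt_sum.step()

# Critic step: learn to distinguish the original video from its summary-based reconstruction.
loss_dis = bce(critic(feats), real) + bce(critic(recon.detach()), fake)
opt_dis.zero_grad(); loss_dis.backward(); opt_dis.step()
```

The cited works additionally employ sparsity constraints on the selector, variational or attention-based auto-encoders, and more elaborate training schedules, which are omitted here for brevity.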
Fig. 5. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization by increasing the similarity between the summary and the video.

2. Learn summarization by targeting specific desired properties for the summary. Aiming to deal with the unstable training [52] and the restricted evaluation criteria of GAN-based methods (that mainly focus on the summary's ability to lead to a good reconstruction of the original video), some unsupervised approaches perform summarization by targeting specific properties of an optimal video summary. To this direction, they utilize the principles of reinforcement learning in combination with hand-crafted reward functions that quantify the existence of desired characteristics in the generated summary. As presented in Fig. 6, the Summarizer gets as input the sequence of the video frames and creates a summary by predicting frame-level importance scores. The created (predicted) summary is then forwarded to an Evaluator which is responsible to quantify the existence of specific desired characteristics with the help of hand-crafted reward functions. The computed score(s) are then combined to form an overall reward value, that is finally used to guide the training of the Summarizer.

Fig. 6. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization based on hand-crafted rewards and reinforcement learning.

In this context, [52] formulates video summarization as a sequential decision-making process and trains a Summarizer to produce diverse and representative video summaries using a diversity-representativeness reward. The diversity reward measures the dissimilarity among the selected key-frames and the representativeness reward computes the distance (expressing the visual resemblance) of the selected key-frames from the remaining frames of the video. [53] utilizes Temporal Segment Networks (proposed in [54] for action recognition in videos) to extract spatial and temporal information about the video frames, and trains the Summarizer through a reward function that assesses the preservation of the video's main spatiotemporal patterns in the produced summary. [55] presents a mechanism for both video summarization and reconstruction. Video reconstruction aims to estimate the extent to which the summary allows the viewer to infer the original video (similarly to some of the above-presented GAN-based methods), and video summarization is learned based on the reconstructor's feedback and the output of trained models that assess the representativeness and diversity of the visual content of the generated summary.

3. Build object-oriented summaries by modeling the key-motion of important visual objects. Building on a different basis, [56] focuses on the preservation in the summary of the underlying fine-grained semantic and motion information of the video. For this, it performs a preprocessing step that aims to find important objects and their key-motions. Based on this step, it represents the whole video by creating super-segmented object motion clips. Each one of these clips is then given to the Summarizer, which uses an online motion auto-encoder model (Stacked Sparse LSTM Auto-Encoder) to memorize past states of object motions by continuously updating a tailored recurrent auto-encoder network. The latter is responsible for reconstructing object-level motion clips, and the reconstruction loss between the input and the output of this component is used to guide the training of the Summarizer. Based on this training process the Summarizer is able to generate summaries that show the representative objects in the video and the key-motions of each of these objects.
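The diversity-representativeness reward of [52] mentioned above can be sketched as follows, using the formulation commonly reported for that work (mean pairwise dissimilarity of the selected frames, and an exponential of the mean distance of every frame to its nearest selected frame); the NumPy code and all names are illustrative.

```python
import numpy as np

def diversity_reward(feats, selected):
    """Mean pairwise dissimilarity (1 - cosine similarity) among the selected frames."""
    if len(selected) < 2:
        return 0.0
    x = feats[selected]
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T                                             # cosine similarities
    n = len(selected)
    return float((1.0 - sim).sum() / (n * (n - 1)))           # diagonal contributes 0

def representativeness_reward(feats, selected):
    """exp(-mean distance of each frame to its nearest selected frame)."""
    sel = feats[selected]                                     # (k, d)
    dists = np.linalg.norm(feats[:, None, :] - sel[None, :, :], axis=-1)  # (T, k)
    return float(np.exp(-dists.min(axis=1).mean()))

# Toy usage: reward for a candidate key-frame set proposed by the policy (Summarizer).
feats = np.random.randn(300, 1024).astype(np.float32)        # one deep feature per frame
selected = [10, 55, 120, 200, 260]                            # hypothetical selected indices
reward = diversity_reward(feats, selected) + representativeness_reward(feats, selected)
```

In the reinforcement learning setup, such a scalar reward is back-propagated to the frame-scoring policy through a policy-gradient objective rather than a differentiable loss.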
C. Weakly-supervised Video Summarization

Weakly-supervised video summarization methods try to mitigate the need for extensive human-generated ground-truth data, similarly to unsupervised learning methods. Instead of not using any ground-truth data, they use less-expensive weak labels (such as video-level metadata for video categorization and category-driven summarization, or ground-truth annotations for a small subset of video frames for learning summarization through sparse reinforcement learning and tailored reward functions) under the main hypothesis that these labels are imperfect compared to a full set of human annotations, but can nonetheless allow the training of effective summarization models. We avoid the illustration of a typical analysis pipeline for this class of methods, as there is a limited overlap in the way that these methods conceptualize the learning of the summarization task. The first work that adopts an intermediate way between fully-supervised and fully-unsupervised learning for video summarization is [57]. It uses video-level metadata (e.g., the video title "A man is cooking.") to define a categorization of videos. Then, it leverages multiple videos of a category and extracts 3D-CNN features to automatically learn a parametric model for categorizing new (unseen during training) videos. Finally, it adopts the learned model to select the video segments that maximize the relevance between the summary and the video category. The authors of [57] investigate different ways to tackle issues related to the limited size of available datasets, that include cross-dataset training, the use of web-crawled videos, as well as data augmentation methods for obtaining sufficient training data. [58] proposes a deep learning framework for summarizing first-person videos; however, we report on this method here, as it is also evaluated on a dataset used to assess generic video summarization methods. Driven by the observation that the collection of a sufficiently large amount of fully-annotated first-person video data with ground-truth annotations is a difficult task, the algorithm in [58] exploits the principles of transfer learning and uses annotated third-person videos (which can be found more easily) to learn how to summarize first-person videos. The algorithm performs cross-domain feature embedding and transfer learning for domain adaptation (across third- and first-person videos) in a semi-supervised manner. In particular, training is performed based on a set of third-person videos with fully annotated highlight scores and a set of first-person videos where only a small portion of them comes with ground-truth scores. [59] suggests a weakly-supervised setting of learning summarization models from a large number of web videos. Its architecture combines a Variational Auto-Encoder that learns the latent semantics from web videos, and a sequence encoder-decoder with attention mechanism that performs summarization. The decoding part of the VAE aims to reconstruct the input videos using samples from the learned latent semantics; while, the most important video fragments are identified through the soft attention mechanism of the encoder-decoder network, where the attention vectors of raw videos are obtained by integrating the learned latent semantics from the collected web videos. The overall architecture is trained by a weakly-supervised semantic matching loss to learn the topic-associated summaries. Finally, [60] uses the principles of reinforcement learning to learn summarization based on a limited set of human annotations and a set of hand-crafted rewards. The latter relate to the similarity between the machine- and the human-selected fragments, as well as to specific characteristics of the created summary (e.g., its representativeness). More specifically, this method applies a hierarchical key-fragment selection process that is divided into sub-tasks. Each task is learned through sparse reinforcement learning (thus avoiding the need for exhaustive annotations about the entire set of frames, and using annotations only for a subset of frames) and the final summary is formed based on rewards about its diversity and representativeness.

D. Multimodal Approaches

A number of works investigated the potential of exploiting additional modalities (besides the visual stream) for learning summarization, such as the audio stream, the video captions or ASR transcripts, any available textual metadata (video title and/or abstract) or other contextual data (e.g., viewers' comments). Several of these multimodal approaches were proposed before the so-called "deep learning era", targeting either generic or domain/task-specific video summarization. Some indicative and recent examples can be found in [61], [62], [63], [64], [65], [66], [67], [68]. Addressing the video summarization task from a slightly different perspective, other multimodal algorithms generate a text-based summary of the video [69], [70], [71], and extend this output by extracting one representative key-frame [72]. A multimodal deep-learning-based algorithm for summarizing videos of soccer games was presented in [73], while another multimodal approach for key-frame extraction from first-person videos that exploits sensor data was described in [74]. Nevertheless, all these works are outside the scope of this survey, which focuses on deep-learning-based methods for generic video summarization; i.e., methods that learn summarization with the help of deep neural network architectures and/or the visual content is represented by deep features.

The majority of multimodal approaches that are within the focus of this study tackle video summarization by utilizing also the textual video metadata, such as the video title and description. As depicted in Fig. 7, during training the Summarizer gets as input: i) the sequence of the video frames that needs to be summarized, ii) the relevant ground-truth data (i.e., frame-level importance scores according to the users), and iii) the associated video metadata. Following, it estimates the frames' importance and generates (predicts) a summary. Then, the generated summary is compared with the ground-truth summary and the video metadata. The former comparison produces a similarity score. The latter comparison involves the semantic analysis of the summary and the video metadata. Given the output of this analysis, most algorithms try to minimize the distance of the generated representations in a learned latent space, or the classification loss when the aim is to define a summary that maintains the core characteristics of the video category. The combined outcome of this assessment is finally used to guide the training of the Summarizer. In this context, [75], [76] learn category-driven summarization by rewarding the preservation of the core parts found in video summaries from the same category (e.g., the main parts of a wedding ceremony when summarizing a wedding video). Similarly, [77] trains action classifiers with video-level labels for action-driven video fragmentation and labeling; then, it extracts a fixed number of key-frames and applies reinforcement learning to select the ones with the highest categorization accuracy, thus performing category-driven summarization. [78], [79] define a video summary by maximizing its relevance with the available video metadata, after projecting both the visual and the textual information in a common latent space. Finally, [80] applies a visual-to-text mapping and a semantic-based video fragment selection according to the relevance between the automatically-generated and the original video description, with the help of semantic attended networks. However, most of these methods examine only the visual cues without considering the sequential structure of the video and the temporal dependency among frames.
Fig. 7. High-level representation of the analysis pipeline of supervised algorithms that perform semantic/category-driven summarization.

E. General remarks on deep learning approaches

Based on the review of the relevant bibliography, we saw that early deep-learning-based approaches for video summarization utilize combinations of CNNs and RNNs. The former are used as pre-trained components (e.g., CNNs trained on ImageNet for visual concept detection) to represent the visual content, and the latter (mostly LSTMs) are used to model the temporal dependency among video frames. The majority of these approaches are supervised and try to learn how to make estimates about the importance of video frames/fragments based on human-generated ground-truth annotations. Architectures of RNNs can be used in the typical or in a hierarchical form ([21], [24], [25]) to also model the temporal structure and utilize this knowledge when selecting the video fragments of the summary. In some cases such architectures are combined with attention mechanisms to model the evolution of the users' interest ([26], [27], [28]). Alternatively, they are extended by memory networks to increase the memorization capacity of the network and capture long-range temporal dependencies among parts of the video ([35], [36]). Going one step further, a group of techniques try to learn importance by paying attention to both the spatial and temporal structure of the video, using convolutional LSTMs [37], [38], optical flow maps [39], combinations of CNNs and RNNs [40], or motion extraction mechanisms [42]. Following a different approach, a couple of supervised methods learn how to generate video summaries that are aligned with the human preferences with the help of GANs ([43], [44]). Finally, multimodal algorithms extract the high-level semantics of the visual content using pre-trained CNNs/DCNNs and learn summarization in a supervised manner by maximizing the semantic similarity among the visual summary and the contextual video metadata or the video category. However, the latter methods focus mostly on the visual content and disregard the sequential nature of the video, which is essential when summarizing the presented story.

To train a video summarization network in a fully unsupervised manner, the use of GANs seems to be the central direction thus far ([31], [46], [47], [48], [49], [50], [51]). The main intuition behind the use of GANs is that the produced summary should allow the viewer to infer the original video, and thus the unsupervised GAN-based methods are trying to build a summary that enables a good reconstruction of the original video. In most cases the generative part is composed of an LSTM that estimates the frames' importance according to their temporal dependency, thus indicating the most appropriate video parts for the summary. Then, the reconstruction of the video based on the predicted summary is performed with the help of Auto-Encoders ([31], [46], [48]) that in some cases are combined with tailored attention mechanisms ([47], [49]). Another, but less popular, approach for unsupervised video summarization is the use of reinforcement learning in combination with hand-crafted rewards about specific properties of the generated summary. These rewards usually aim to increase the representativeness and diversity of the summary [52], retain the spatiotemporal patterns of the video [53], or secure a good summary-based reconstruction of the video [55]. Finally, one unsupervised method learns how to build object-oriented summaries by modeling the key-motion of important visual objects using a stacked sparse LSTM Auto-Encoder [56].

Last but not least, a few weakly-supervised methods have also been proposed. These methods learn video summarization by exploiting the semantics of the video metadata [57] or the summary structure of semantically-similar web videos [59], exploiting annotations from a similar domain and transferring the gained knowledge via cross-domain feature embedding and transfer learning techniques [58], or using weakly/sparsely-labeled data under a reinforcement learning framework [60].

With respect to the potential of deep-learning-based video summarization algorithms, we argue that, despite the fact that currently the major research direction is towards the development of supervised algorithms, the exploration of the learning capability of unsupervised methods is highly recommended. The reasoning behind this claim is based on the fact that: i) the generation of ground-truth training data (summaries) can be a really expensive and laborious process; ii) video summarization is a subjective task, thus multiple different summaries can be proposed for a video from different human annotators; and iii) these ground-truth summaries can vary a lot, thus making it hard to train a method with the typical supervised training approaches. On the other hand, unsupervised video summarization algorithms overcome the need for ground-truth data as they can be trained using only an adequately large collection of original, full-length videos. Moreover, unsupervised learning allows to easily train different models of the same deep network architecture using different types of video content (TV shows, news), thus facilitating the domain-adaptation of video summarization. Given the above, we believe that unsupervised video summarization has great advantages, and thus its potential should be further investigated.
IV. EVALUATING VIDEO SUMMARIZATION

A. Datasets

Four datasets prevail in the video summarization bibliography: SumMe [19], TVSum [20], OVP [81] and Youtube [81]. SumMe consists of 25 videos of various genres such as holidays, events and sports, with duration ranging from 1 to 6 minutes. The annotations are obtained from 15 to 18 users per video, in the form of multiple sets of key-fragments. Each summary has a length between 5% and 15% of the initial video duration. TVSum contains 50 videos from 10 categories of the TRECVid MED dataset (5 videos per category), including news, how-to's, user-generated-content and documentaries. Videos' duration ranges from 2 to 11 minutes, and each video has been annotated by 20 users in the form of shot- and frame-level importance scores (ranging from 1 to 5). OVP and Youtube both contain 50 videos, whose annotations are sets of key-frames, produced by 5 users. The video duration ranges from 1 to 4 minutes for OVP, and from 1 to 10 minutes for Youtube. Both datasets are comprised of videos with diverse video content, such as documentaries, educational, ephemeral, historical, and lecture videos (OVP dataset) and cartoons, news, sports, commercials, TV-shows and home videos (Youtube dataset).

Some less-commonly used datasets for video summarization are CoSum [82], MED-summaries [83], Video Titles in the Wild (VTW) [84], League of Legends (LoL) [85] and FPVSum [58]. CoSum has been created to evaluate video co-summarization. It consists of 51 videos that were downloaded from YouTube using 10 query terms that relate to the video content of the SumMe dataset. Each video is approximately 4 minutes long and it is annotated by 3 different users, who have selected sets of key-fragments. The MED-summaries dataset contains 160 annotated videos from the TRECVID 2011 MED dataset. 60 videos form the validation set (from 15 event categories) and the remaining 100 videos form the test set (from 10 event categories), with most of them being 1 to 5 minutes long. The annotations come as one set of importance scores, averaged over 1 to 4 annotators. As far as the VTW dataset is concerned, it includes 18100 open domain videos, with 2000 of them being annotated in the form of sub-shot level highlight scores. The videos are user-generated, untrimmed videos that contain a highlight event and have an average duration of 1.5 minutes. Regarding LoL, it has 218 long videos (30 to 50 minutes), displaying game matches from the North American League of Legends Championship Series (NALCS). The annotations derive from a Youtube channel that provides community-generated highlights (videos with duration 5 to 7 minutes). Therefore, one set of key-fragments is available for each video. Finally, FPVSum is a first-person video summarization dataset. It contains 98 videos from 14 categories of GoPro viewer-friendly videos with varying lengths (over 7 hours in total). For each category, about 35% of the video sequences are annotated with ground-truth scores by at least 10 users, while the remaining are viewed as the unlabeled examples.

It is worth mentioning that in this work we focus only on datasets that are fit for evaluating video summarization methods, namely datasets that contain ground truth annotations regarding the summary or the frame/fragment-level importance of each video. Other datasets might be also used by some works for network pre-training purposes, but they do not concern the present work. Table II summarizes the datasets utilized by the deep-learning-based approaches for video summarization. From this table it is clear that the SumMe and TVSum datasets are, by far, the most commonly used ones. OVP and Youtube are also utilized in a few works, but mainly for data augmentation purposes.

B. Evaluation protocols and measures

1) Evaluating video storyboards: Early video summarization techniques created a static summary of the video content with the help of representative key-frames. First attempts towards the evaluation of the created key-frame-based summaries were based on user studies that made use of human judges for evaluating the resulting quality of the summaries. Judgment was based on specific criteria, such as the relevance of each key-frame with the video content, redundant or missing information [86], or the informativeness and enjoyability of the summary [87]. Although this can be a useful way to evaluate results, the procedure of evaluating each resulting summary by users can be time-consuming and the evaluation results can not be easily reproduced or used in future comparisons. To overcome the deficiencies of user studies, more recent works evaluate their key-frame-based summaries using objective measures and ground-truth summaries. In this context, [88] estimates the quality of the produced summaries using the Fidelity measure [89] and the Shot Reconstruction Degree criterion [90]. A different approach, termed "Comparison of User Summaries", is proposed in [81] and evaluates the generated summary according to its overlap with predefined key-frame-based user summaries. Comparison is performed at a key-frame-basis and the quality of the generated summary is quantified by computing the accuracy and error rates based on the number of matched and non-matched key-frames respectively. This approach was also used in [91], [92], [93], [94]. A similar methodology was followed in [95]. Instead of Accuracy and Error rates, in [95] the evaluation relies on the well-known Precision, Recall and F-Score measures. This protocol was also used in the supervised video summarization approach of [96], while a variation of it was used in [97], [98], [99]. In the latter case, additionally to their visual similarity, two frames are considered a match only if they are temporally no more than a specific number of frames apart.

2) Evaluating video skims: Most recent algorithms tackle video summarization by creating a dynamic video summary (video skim). For this, they select the most representative video fragments and join them in a sequence to form a shorter video. The evaluation methodologies of these works assess the quality of video skims according to their alignment with human preferences, based mostly on objective measures. A first attempt was made in [19], where an evaluation approach along with the SumMe dataset for video summarization were introduced. According to this approach, the videos are first segmented into consecutive and non-overlapping fragments in order to enable matching between key-fragment-based summaries (i.e., to compare the user-generated with the automatically-created summary). Then, based on the scores computed by a video summarization algorithm for the fragments of a given video, an optimal subset of them (key-fragments) is selected and forms the summary.
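At the frame level, the comparison of a machine-generated summary against the available user summaries reduces to computing precision, recall and F-score between binary frame-membership vectors. The sketch below illustrates this computation; the aggregation over multiple annotators (average or maximum per video) varies across evaluation protocols, so both options are shown, and all names and the toy data are illustrative.

```python
import numpy as np

def fscore(machine, user):
    """F-score between two binary frame-membership vectors (1 = frame in summary)."""
    overlap = np.logical_and(machine, user).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / machine.sum()
    recall = overlap / user.sum()
    return 2 * precision * recall / (precision + recall)

def evaluate(machine, user_summaries, mode="avg"):
    """Compare against every annotator and aggregate the per-annotator scores."""
    scores = [fscore(machine, u) for u in user_summaries]
    return max(scores) if mode == "max" else sum(scores) / len(scores)

# Toy usage: 3 annotators, a 1000-frame video, ~15% of frames selected per summary.
rng = np.random.default_rng(0)
machine = rng.random(1000) < 0.15
users = [rng.random(1000) < 0.15 for _ in range(3)]
print(evaluate(machine, users, mode="max"))
```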
TABLE I
DATASETS FOR VIDEO SUMMARIZATION AND THEIR MAIN CHARACTERISTICS

Dataset      | # of videos | duration (min) | content                                                                        | type of annotations             | # of annotators per video
SumMe [19]   | 25          | 1 - 6          | holidays, events, sports                                                       | multiple sets of key-fragments  | 15 - 18
TVSum [20]   | 50          | 2 - 10         | news, how-to's, user-generated, documentaries (10 categories - 5 videos each)  | multiple fragment-level scores  | 20
OVP [81]     | 50          | 1 - 4          | documentary, educational, ephemeral, historical, lecture                       | multiple sets of key-frames     | 5
Youtube [81] | 50          | 1 - 10         | cartoons, sports, tv-shows, commercial, home videos                            | multiple sets of key-frames     | 5
CoSum [82]   | 51          | ~4             | holidays, events, sports (10 categories)                                       | multiple sets of key-fragments  | 3
MED [83]     | 160         | 1 - 5          | 15 categories of various genres                                                | one set of imp. scores          | 1 - 4
VTW [84]     | 2000        | 1.5 (avg)      | user-generated videos that contain a highlight event                           | sub-shot level highlight scores | -
LoL [85]     | 218         | 30 - 50        | matches from a League of Legends tournament                                    | one set of key-fragments        | 1
FPVSum [58]  | 98          | 4.3 (avg)      | first-person videos (14 categories)                                            | multiple frame-level scores     | 10

TABLE II
DATASETS USED BY EACH DEEP-LEARNING-BASED METHOD FOR EVALUATING VIDEO SUMMARIZATION PERFORMANCE
(columns, left to right: SumMe, TVSum, OVP, Youtube, CoSum, MED, VTW, LoL, FPVSum)
VS-DSF [78] ✓
vsLSTM, dppLSTM [21] ✓ ✓ ✓ ✓
DeSumNet [57] ✓ ✓
H-RNN [24] ✓ ✓ ✓ ✓
SUM-GAN [31] ✓ ✓ ✓ ✓
SUM-FCN [32] ✓ ✓ ✓ ✓
FPVSF [58] ✓ ✓ ✓
VESD [59] ✓ ✓
VASNet [27] ✓ ✓ ✓ ✓
HSA-RNN [25] ✓ ✓ ✓ ✓
DTR-GAN [43] ✓ ✓
MAVS [35] ✓ ✓
DR-DSN [52] ✓ ✓ ✓ ✓
SA-SUM [80] ✓ ✓ ✓
DQSN [76] ✓ ✓
Online Motion-AE [56] ✓ ✓ ✓
CSNet [47] ✓ ✓ ✓ ✓
A-AVS, M-AVS [28] ✓ ✓ ✓ ✓
Ptr-Net [44] ✓ ✓ ✓ ✓
Cycle-SUM [48] ✓ ✓
DSSE [79] ✓
EDSN [53] ✓ ✓
SMLD [39] ✓ ✓
UnpairedVSN [51] ✓ ✓ ✓ ✓
ACGAN [50] ✓ ✓ ✓ ✓
SUM-GAN-sl [46] ✓ ✓
SMN [36] ✓ ✓ ✓ ✓
PCDL [55] ✓ ✓ ✓ ✓
ActionRanking [40] ✓ ✓ ✓ ✓
vsLSTM+Att [26] ✓ ✓ ✓ ✓
dppLSTM+Att [26] ✓ ✓ ✓ ✓
WS-HRL [60] ✓ ✓ ✓ ✓
SF-CVS [42] ✓ ✓ ✓
CRSum [38] ✓ ✓ ✓
H-MAN [30] ✓ ✓ ✓ ✓
SUM-GAN-AAE [49] ✓ ✓
DASP [29] ✓ ✓ ✓