Video Summarization Using Deep Neural Networks: A Survey

Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, Senior Member, IEEE, and Ioannis Patras, Senior Member, IEEE

arXiv:2101.06072v1 [cs.CV] 15 Jan 2021

E. Apostolidis is with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, and with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: apostolid@iti.gr.
E. Adamantidou, A. I. Metsai and V. Mezaris are with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, e-mail: {adamelen, alexmetsai, bmezaris}@iti.gr.
I. Patras is with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: i.patras@qmul.ac.uk.
¹ https://www.tubefilter.com/2015/07/26/youtube-400-hours-content-every-minute/
² https://www.statista.com/

Abstract—Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature that shows the evolution of the deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms and we compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the suitability of evaluation protocols, we indicate potential future research directions.

Index Terms—Video summarization, Deep neural networks, Supervised learning, Unsupervised learning, Summarization datasets, Evaluation protocols.

I. INTRODUCTION

In July 2015, YouTube revealed that it receives over 400 hours of video content every single minute, which translates to 65.7 years' worth of content uploaded every day¹. Since then, we are experiencing an even stronger engagement of consumers with both online video platforms and devices (e.g., smart-phones, wearables etc.) that carry powerful video recording sensors and allow instant uploading of the captured video on the Web. According to newer estimates², YouTube now receives 500 hours of video per minute; and YouTube is just one of the many video hosting platforms (e.g., DailyMotion, Vimeo), social networks (e.g., Facebook, Twitter, Instagram), and online repositories of media and news organizations that host large volumes of video content. So, how is it possible for someone to efficiently navigate within endless collections of videos, and find the video content that s/he is looking for? The answer to this question comes not only from video retrieval technologies but also from technologies for automatic video summarization. The latter allow generating a concise synopsis that conveys the important parts of the full-length video. Given the plethora of video content on the Web, effective video summarization facilitates viewers' browsing of and navigation in large video collections, thus increasing viewers' engagement and content consumption.

The application domain of automatic video summarization is wide and includes (but is not limited to) the use of such technologies by media organizations (after integrating such techniques into their content management systems), to allow effective indexing, browsing, retrieval and promotion of their media assets; and video sharing platforms, to improve viewing experience, enhance viewers' engagement and increase content consumption. In addition, video summarization that is tailored to the requirements of particular content presentation scenarios can be used for e.g., generating trailers or teasers of movies and episodes of a TV series; presenting the highlights of an event (e.g., a sports game, a music band performance, or a public debate); and creating a video synopsis with the main activities that took place over e.g., the last 24hrs of recordings of a surveillance camera, for time-efficient progress monitoring or security purposes.

A number of surveys on video summarization have already appeared in the literature. In one of the first works, Barbieri et al. [1] classify the relevant bibliography according to several aspects of the summarization process, namely the targeted scenario, the type of visual content, and the characteristics of the summarization approach. In another early study, Li et al. [2] divide the existing summarization approaches into utility-based methods that use attention models to identify the salient objects and scenes, and structure-based methods that build on the video shots and scenes. Truong et al. [3] discuss a variety of attributes that affect the outcome of a summarization process, such as the video domain, the granularity of the employed video fragments, the utilized summarization methodology, and the targeted type of summary. Money et al. [4] divide the bibliography into methods that rely on the analysis of the video stream, methods that process contextual video metadata, and hybrid approaches that rely on both types of the aforementioned data. Jiang et al. [5] discuss a few characteristic video summarization approaches, that include the extraction of low-level visual features for assessing frame similarity or
performing clustering-based key-frame selection; the detection of the main events of the video using motion descriptors; and the identification of the video structure using Eigen-features. Hu et al. [6] classify the summarization methods into those that target minimum visual redundancy, those that rely on object or event detection, and others that are based on multimodal integration. Ajmal et al. [7] similarly classify the relevant literature in clustering-based methods, approaches that rely on detecting the main events of the story, etc. Nevertheless, all the aforementioned works report on early approaches to video summarization; they do not present how the summarization landscape has evolved over the last years and especially after the introduction of deep learning algorithms.

The more recent study of Molino et al. [8] focuses on egocentric video summarization and discusses the specifications and the challenges of this task. In another recent work, Basavarajaiah et al. [9] provide a classification of various summarization approaches, including some recently-proposed deep-learning-based methods; however, it mainly focuses on summarization algorithms that are directly applicable on the compressed domain. Finally, another recent survey of Vivekraj et al. [10] presents the relevant bibliography based on a two-way categorization, that relates to the utilized data modalities during the analysis and the incorporation of human aspects. With respect to the latter, it further splits the relevant literature into methods that create summaries by modeling the human understanding and preferences (e.g., using attention models, the semantics of the visual content, or ground-truth annotations and machine-learning algorithms), and conventional approaches that rely on the statistical processing of low-level features of the video. Nevertheless, none of the above surveys presents in a comprehensive manner the current developments towards generic video summarization, that are tightly related to the growing use of advanced deep neural network architectures for learning the summarization task. As a matter of fact, the relevant research area is a very active one as more than 35 different deep-learning-based video summarization algorithms have been proposed over the last five years; a subset of these methods represents the current state of the art on automatic video summarization. Motivated by this observation, we aim to fill this gap in the literature by presenting the relevant bibliography on deep-learning-based video summarization and also discussing other aspects that are associated with it, such as the protocols used for evaluating video summarization.

This article provides a comprehensive survey of the existing deep-learning-based video summarization approaches. Section II begins by defining the problem of automatic video summarization and presenting the most prominent types of video summary. Then, it provides a high-level description of the analysis pipeline of deep-learning-based video summarization algorithms, and introduces a taxonomy of the relevant literature according to the utilized data modalities, the adopted training strategy, and the implemented learning approaches. Finally, it discusses aspects that relate to the generated summary, such as the desired properties of a static (frame-based) video summary and the length of a dynamic (fragment-based) video summary. Section III builds on the introduced taxonomy to systematically review the relevant bibliography. A primary categorization is made according to the use or not of human-generated ground-truth data for learning, and a secondary categorization is made based on the adopted learning objective or the utilized data modalities by each different class of methods. For each one of the defined classes, we illustrate the main processing pipeline and report on the specifications of the associated summarization algorithms. After presenting the relevant bibliography, we provide some general remarks that reflect how the field has evolved, especially over the last five years, highlighting the pros and cons of each class of methods. Section IV continues with an in-depth discussion on the utilized datasets and the different evaluation protocols of the literature. Following, Section V discusses the findings of extensive performance comparisons that are based on the results reported in the relevant papers, indicates the most competitive methods in the fields of (weakly-)supervised and unsupervised video summarization, and examines whether there is a performance gap between these two main types of approaches. Based on the surveyed bibliography, in Section VI we propose potential future directions to further advance the current state of the art in video summarization. Finally, Section VII concludes this work by briefly outlining the core findings of the conducted study.

II. PROBLEM STATEMENT

Video summarization aims to generate a short synopsis that summarizes the video content by selecting its most informative and important parts. The produced summary is usually composed of a set of representative video frames (a.k.a. video key-frames), or video fragments (a.k.a. video key-fragments) that have been stitched in chronological order to form a shorter video. The former type of a video summary is known as video storyboard, and the latter type is known as video skim. One advantage of video skims over static sets of frames is the ability to include audio and motion elements that offer a more natural story narration and potentially enhance the expressiveness and the amount of information conveyed by the video summary. Furthermore, it is often more entertaining and interesting for the viewer to watch a skim rather than a slide show of frames [11]. On the other hand, storyboards are not restricted by timing or synchronization issues and, therefore, they offer more flexibility in terms of data organization for browsing and navigation purposes [12], [13].

A high-level representation of the typical deep-learning-based video summarization pipeline is depicted in Fig. 1. The first step of the analysis involves the representation of the visual content of the video with the help of feature vectors. Most commonly, such vectors are extracted at the frame-level, for all frames or for a subset of them selected via a frame-sampling strategy, e.g., processing 2 frames per second. In this way, the extracted feature vectors store information at a very detailed level and capture the dynamics of the visual content that are of high significance when selecting the video parts that form the summary. Typically, in most deep-learning-based video summarization techniques the visual content of the video frames is represented by deep feature vectors extracted with the help of pre-trained neural networks.
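To make this first step of the pipeline concrete, the following is a minimal sketch of frame sampling and deep feature extraction, assuming PyTorch/torchvision and a pre-trained GoogLeNet backbone (reported below as the most commonly used one); the use of OpenCV for frame decoding, the 2-fps sampling rate and the function names are illustrative choices, not part of any specific cited method.

```python
import cv2                      # frame decoding (illustrative choice)
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained GoogLeNet used as a fixed feature extractor: we drop the final
# classification layer and keep the 1024-d pooled descriptor per frame.
cnn = models.googlenet(weights="DEFAULT")
cnn.fc = torch.nn.Identity()    # expose the pool5 (1024-d) output
cnn.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps=2):
    """Sample about `fps` frames per second and return an (N, 1024) feature matrix."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                feats.append(cnn(preprocess(rgb).unsqueeze(0)).squeeze(0))
            idx += 1
    cap.release()
    return torch.stack(feats)   # one deep feature vector per sampled frame
```

The resulting sequence of frame-level descriptors is what the deep summarizer networks discussed in the remainder of this survey operate on.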
Fig. 1. High-level representation of the analysis pipeline of the deep-learning-based video summarization methods for generating a video storyboard and a video skim.

For this, a variety of Convolutional Neural Networks (CNNs) and Deep Convolutional Neural Networks (DCNNs) have been used in the bibliography, that include GoogleNet (Inception V1) [14], Inception V3 [15], AlexNet [16], variations of ResNet [17] and variations of VGGnet [18]. Nevertheless, the GoogleNet appears to be the most commonly used one thus far. The extracted features are then utilized by a deep summarizer network, which is trained by trying to minimize an objective function or a set of objective functions.

The output of the trained Deep Summarizer Network can be either a set of selected video frames (key-frames) that form a static video storyboard, or a set of selected video fragments (key-fragments) that are concatenated in chronological order and form a short video skim. With respect to the generated video storyboard, this should be similar with the sets of key-frames that would be selected by humans and must exhibit minimal visual redundancy. With regards to the produced video skim, this typically should be equal or less than a predefined length L. For experimentation and comparison purposes, this is most often set as L = p · T, where T is the video duration and p is the ratio of the summary to video length; p = 0.15 is a typical value, in which case the summary should not exceed 15% of the original video's duration. As a side note, the production of a video skim (which is the ultimate goal of the majority of the proposed deep-learning-based summarization algorithms) requires the segmentation of the video into consecutive and non-overlapping fragments that exhibit visual and temporal coherence, thus offering a seamless presentation of a part of the story. Given this segmentation and the estimated frames' importance scores by the trained Deep Summarizer Network, video-segment-level importance scores are computed by averaging the importance scores of the frames that lie within each video segment. These segment-level scores are then used to select the key-fragments given the summary length L, and most methods tackle this step by solving the Knapsack problem.
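The last step described above (key-fragment selection under the length budget L = p · T) can be illustrated with a standard 0/1 knapsack dynamic program; the sketch below assumes per-fragment importance values obtained by averaging frame scores, and all names are illustrative rather than taken from a specific method.

```python
def knapsack_select(values, lengths, budget):
    """0/1 knapsack: pick fragments maximizing total importance within `budget` frames."""
    n = len(values)
    # dp[i][c] = best total value using the first i fragments with capacity c
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], values[i - 1]
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v
    # Backtrack to recover the indices of the selected fragments.
    selected, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

def summarize(frame_scores, segments, ratio=0.15):
    """`segments` is a list of (start, end) frame indices produced by shot segmentation."""
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in segments]  # segment-level scores
    lengths = [e - s for s, e in segments]
    budget = int(ratio * len(frame_scores))                           # L = p * T, with p = 0.15
    return knapsack_select(values, lengths, budget)
```

Concatenating the selected fragments in chronological order then yields the video skim.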
With regards to the utilized type of data, the current bibliography on deep-learning-based video summarization can be divided between:
• Unimodal approaches that utilize only the visual modality of the videos for feature extraction, and learn summarization in a (weakly-)supervised or unsupervised manner.
• Multimodal methods that exploit the available textual metadata and learn semantic/category-driven summarization in a supervised way by increasing the relevance between the semantics of the summary and the semantics of the associated metadata or video category.

Concerning the adopted training strategy, the existing deep-learning-based video summarization algorithms can be coarsely categorized in the following categories:
• Supervised approaches that rely on datasets with human-labeled ground-truth annotations (either in the form of video summaries, as in the case of the SumMe dataset [19], or in the form of frame-level importance scores, as in the case of the TVSum dataset [20]), based on which they try to discover the underlying criterion for video frame/fragment selection and video summarization.
• Unsupervised approaches that overcome the need for ground-truth data (whose production requires time-demanding and laborious manual annotation procedures), based on learning mechanisms that require only an adequately large collection of original videos for their training.
• Weakly-supervised approaches that, similarly to unsupervised approaches, aim to alleviate the need for large sets of hand-labeled data. Less-expensive weak labels are utilized with the understanding that they are imperfect compared to a full set of human annotations, but can nonetheless be used to create strong predictive models.

Building on the above described categorizations, a more detailed taxonomy of the relevant bibliography is depicted in Fig. 2. The penultimate layer of this arboreal illustration shows the different learning approaches that have been adopted. The leaves of each node of this layer show the utilized techniques for implementing each learning approach, and contain references to the most relevant works in the bibliography. This taxonomy will be the basis for presenting the relevant bibliography in the following section.

III. DEEP LEARNING APPROACHES

This section provides a systematic review of the bibliography. It starts by presenting the different classes of supervised (Section III-A), unsupervised (Section III-B), and weakly-supervised (Section III-C) video summarization methods, which rely solely on the analysis of the visual content. Following, it reports on multimodal approaches (Section III-D) that process also the available text-based metadata. Finally, it provides some general remarks (Section III-E) that outline how the field has evolved, especially over the last five years.

A. Supervised Video Summarization

1. Learn frame importance by modeling the temporal dependency among frames. Early deep-learning-based approaches cast summarization as a structured prediction problem and try to make estimates about the frames' importance by modeling their temporal dependency. As illustrated in Fig. 3, during the training phase the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. These data are then used to model the dependencies among the video frames in time (illustrated with solid arched lines) and estimate the frames' importance. The predicted importance scores are compared with the ground-truth data and the outcome of this comparison guides the training of the Summarizer. The first proposed approach to this direction [21] uses Long Short-Term Memory (LSTM) units [22] to model variable-range temporal dependency among video frames. Frames' importance is estimated using a multi-layer perceptron (MLP), and the diversity of the visual content of the generated summary is increased based on the Determinantal Point Process (DPP) [23]. [24] proposes a two-layer LSTM architecture. The first layer extracts and encodes data about the video structure. The second layer uses this information to estimate fragment-level importance, and select the key-fragments of the video. [25] extends the previous method by introducing a component that is trained to identify the shot-level temporal structure of the video. This knowledge is then utilized for estimating importance at the shot-level and producing a key-shot-based video summary. [26] builds on [21] and introduces an attention mechanism to model the temporal evolution of the users' interest. Following, it uses this information to estimate frames' importance and select the video key-frames to build a video storyboard. In the same direction, a few methods utilize sequence-to-sequence (a.k.a. seq2seq) architectures in combination with attention mechanisms. [27] proposes a seq2seq network for video summarization, that is composed of a soft, self-attention mechanism and a two-layer fully connected network for regression of the frames' importance scores. [28] formulates video summarization as a seq2seq learning problem and proposes an LSTM-based encoder-decoder network with an intermediate attention layer. The summarization model of this work is extended in [29] by integrating a semantic preserving embedding network that evaluates the output of the decoder with respect to the preservation of the video's semantics using a tailored semantic preserving loss, and replacing the previously used Mean Square Error (MSE) loss by the Huber loss to enhance its robustness to outliers. [30] proposes a hierarchical approach which uses a generator-discriminator architecture (similar to the one in [31]) as an internal mechanism to estimate the representativeness of each shot and define a set of candidate key-frames. Then, it employs a multi-head attention model to further assess candidates' importance and select the key-frames that form the summary.
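The common formulation shared by this family of methods (a recurrent network over the sequence of frame features, followed by a small regressor that outputs per-frame importance scores trained against the human annotations) can be sketched as follows; this is an illustrative skeleton with hypothetical dimensions and names, not the exact architecture of any cited work.

```python
import torch
import torch.nn as nn

class FrameImportanceLSTM(nn.Module):
    """Bi-directional LSTM over frame features -> per-frame importance score in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, feats):                # feats: (batch, T, feat_dim)
        h, _ = self.lstm(feats)
        return self.head(h).squeeze(-1)      # (batch, T) importance scores

# One supervised training step against human-annotated frame scores (MSE loss).
model = FrameImportanceLSTM()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(1, 300, 1024)            # toy sequence of 300 frame descriptors
gt_scores = torch.rand(1, 300)               # stand-in for ground-truth importance scores
loss = nn.functional.mse_loss(model(feats), gt_scores)
optim.zero_grad(); loss.backward(); optim.step()
```

The methods reviewed above differ mainly in the recurrent/attention machinery placed between the frame features and the regression head, and in the loss used to compare the predicted and the ground-truth scores.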
Fig. 2. A taxonomy of the existing deep-learning-based video summarization algorithms.
[32] tackles video summarization as a semantic segmentation task where the input video is seen as a 1D image (of size equal to the number of video frames) with K channels that correspond to the K dimensions of the frames' representation vectors (either containing raw pixel values or being precomputed feature vectors). Then, it uses popular semantic segmentation models, such as Fully Convolutional Networks (FCN) [33] and an adaptation of DeepLab [34], and builds a network (called Fully Convolutional Sequence Network) for video summarization. The latter consists of a stack of convolutions with increasing effective context size as we go deeper in the network, that enable the network to effectively model long-range dependency among frames and learn frames' importance. Finally, to address issues related to the limited capacity of LSTMs, some techniques use additional memory. [35] stores information about the entire video in an external memory and predicts each shot's importance by learning an embedding space that enables matching of each shot with the entire memory information. [36] stacks multiple LSTM and memory layers hierarchically to derive long-term temporal context, and uses this information to estimate the frames' importance.

Fig. 3. High-level representation of the analysis pipeline of supervised algorithms that perform summarization by learning the frames' importance after modeling their temporal or spatiotemporal dependency. For the latter class of methods (i.e., modeling the spatiotemporal dependency among frames), object bounding boxes and object relations in time, shown with dashed rectangles and lines, are used to illustrate the extension that models both the temporal and spatial dependency among frames.

2. Learn frame importance by modeling the spatiotemporal structure of the video. Aiming to learn how to make better estimates about the importance of video frames/fragments, some techniques pay attention to both the spatial and temporal structure of the video. Again, the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. But, extending the analysis pipeline of the previously described group of methods, it then also models the spatiotemporal dependencies among frames (shown with dashed rectangles and lines in Fig. 3). Once again, the predicted importance scores are compared with the ground-truth data and the outcome of this comparison guides the training of the Summarizer. From this perspective, [37] presents an encoder-decoder architecture with convolutional LSTMs, that models the spatiotemporal relationship among parts of the video. In addition to the estimates about the frames' importance, the algorithm enhances the visual diversity of the summary via next frame prediction and shot detection mechanisms, based on the intuition that the first frames of a shot generally have high likelihood of being part of the summary. [38] extracts deep and shallow features from the video content using a trainable 3D-CNN and builds a new representation through a fusion strategy. Then, it uses this representation in combination with convolutional LSTMs to model the spatial and temporal structure of the video. Finally, summarization is learned with the help of a new loss function (called Sobolev loss) that aims to define a series of frame-level importance scores that is close to the series of ground-truth scores by minimizing the distance of the derivatives of these sequential data, and to exploit the temporal structure of the video. [39] extracts spatial and temporal information by processing the raw frames and their optical flow maps with CNNs, and learns how to estimate frames' importance based on human annotations and a label distribution learning process. [40] combines CNNs and Gated Recurrent Units [41] (a type of Recurrent Neural Network (RNN)) to form spatiotemporal feature vectors, that are then used to estimate the level of activity and importance of each frame. [42] trains a neural network for spatiotemporal data extraction and creates an inter-frames motion curve. The latter is then used as input to a transition effects detection method that segments the video into shots. Finally, a self-attention model exploits the human-generated ground-truth data to learn how to estimate the intra-shot importance and select the key-frames/fragments of the video to form a static/dynamic video summary.

3. Learn summarization by fooling a discriminator when trying to discriminate a machine-generated from a human-generated summary. Following a completely different approach to minimizing the distance between the machine-generated and the ground-truth summaries, a couple of methods use Generative Adversarial Networks (GANs). As presented in Fig. 4, the Summarizer (which acts as the Generator of the GAN) gets as input the sequence of the video frames and generates a summary by computing frame-level importance scores. The generated summary (i.e., the predicted frame-level importance scores) along with an optimal video summary for this video (i.e., frame-level importance scores according to the users' preferences) are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. The training of the entire summarization architecture is performed in an adversarial manner. The Summarizer tries to fool the Discriminator to not distinguish the predicted from the user-generated summary, and the Discriminator aims to learn how to make this distinction. When the Discriminator's confidence is very low (i.e., the classification error is approximately equal for both the machine- and the user-generated summary), then the Summarizer is able to generate a summary that is very close to the users' expectations.
In this context, [43] combines LSTMs and Dilated Temporal Relational (DTR) units to estimate temporal dependencies among frames at different temporal windows, and learns summarization by trying to fool a trainable discriminator when distinguishing the machine-based summary from the ground-truth and a randomly-created one. [44] suggests an adversarial learning approach for (semi-)supervised video summarization. The Generator/Summarizer is an attention-based Pointer Network [45] that defines the start and end point of each video fragment that is used to form the summary. The Discriminator is a 3D-CNN classifier that judges whether a fragment is from a ground-truth or a machine-generated summary. Instead of using the typical adversarial loss, in this algorithm the output of the Discriminator is used as a reward to train the Generator/Summarizer based on reinforcement learning. Thus far, the use of GANs for supervised video summarization is limited. Nevertheless, this machine learning framework has been widely used for unsupervised video summarization, as discussed in the following section.

Fig. 4. High-level representation of the analysis pipeline of supervised algorithms that learn summarization with the help of ground-truth data and adversarial learning.

B. Unsupervised Video Summarization

1. Learn summarization by fooling a discriminator when trying to discriminate the original video (or set of key-frames) from a summary-based reconstruction of it. Given the lack of any guidance (in the form of ground-truth data) for learning video summarization, most existing unsupervised approaches rely on the rule that a representative summary ought to assist the viewer to infer the original video content. In this context, these techniques utilize GANs to learn how to create a video summary that allows a good reconstruction of the original video. The main concept of this training approach is depicted in Fig. 5. The Summarizer is usually composed of a Key-frame Selector that estimates the frames' importance and generates a summary, and a Generator that reconstructs the video based on the generated summary. It gets as input the sequence of the video frames and, through the aforementioned internal processing steps, reconstructs the original video based on the generated summary (which is represented by the predicted frame-level importance scores). The reconstructed video along with the original one are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. Similarly to the supervised GAN-based methods, the training of the entire summarization architecture is performed in an adversarial manner. However, in this case the Summarizer tries to fool the Discriminator when distinguishing the summary-based reconstructed video from the original one, while the Discriminator aims to learn how to make the distinction. When this discrimination is not possible (i.e., the classification error is approximately equal for both the reconstructed and the original video), the Summarizer is considered to be able to build a video summary that is highly-representative of the overall video content. To this direction, the work of Mahasseni et al. [31] is the first that combines an LSTM-based key-frame selector with a Variational Auto-Encoder (VAE) and a trainable Discriminator, and learns video summarization through an adversarial learning process that aims to minimize the distance between the original video and the summary-based reconstructed version of it. [46] builds on the network architecture of [31] and suggests a stepwise, label-based approach for training the adversarial part of the network, that leads to improved summarization performance. [47] also relies on a VAE-GAN architecture but extends it with a chunk and stride network (CSNet) and a tailored difference attention mechanism for assessing the frames' dependency at different temporal granularities when selecting the video key-frames. [48] aims to maximize the mutual information between the summary and the video using a trainable couple of discriminators and a cycle-consistent adversarial learning objective. The frame selector (bi-directional LSTM) learns a video representation that captures the long-range dependency among frames. The evaluator is composed of two GANs where the forward GAN learns to reconstruct the original video from the video summary and the backward GAN learns to invert the processing, and evaluates the consistency between the output of such cycle learning as a measure that quantifies information preservation between the original video and the generated summary. Using this measure, the evaluator guides the frame selector to identify the most informative frames and form the video summary. [49] introduces a variation of [46] that replaces the Variational Auto-Encoder with a deterministic Attention Auto-Encoder for learning an attention-driven reconstruction of the original video, which subsequently improves the key-fragment selection process. [50] presents a self-attention-based conditional GAN. The Generator produces weighted frame features and predicts frame-level importance scores, while the Discriminator tries to distinguish between the weighted and the raw frame features. A conditional feature selector is used to guide the GAN model to focus on more important temporal regions of the whole video frames, while long-range temporal dependencies along the whole video sequence are modeled by a multi-head self-attention mechanism. Finally, [51] learns video summarization from unpaired data based on an adversarial process that relies on GANs and a Fully-Convolutional Sequence Network (FCSN) encoder-decoder. The model of [51] aims to learn a mapping function of a raw video to a human-like summary, such that the distribution of the generated summary is similar to the distribution of human-created summaries, while content diversity is forced by applying a relevant constraint on the learned mapping function.
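A much-simplified sketch of this summary-based reconstruction idea is given below: a frame-scoring selector, a sequence auto-encoder that reconstructs the frame features from their score-weighted version, and a discriminator that tries to tell the reconstruction from the original. It is only meant to illustrate the training logic discussed above (in the spirit of the SUM-GAN family), not to faithfully reimplement any cited architecture; all dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

D_FEAT, HID = 1024, 256

class Scorer(nn.Module):                      # frame selector: per-frame importance in [0, 1]
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(D_FEAT, HID, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * HID, 1)
    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)          # (B, T)

class SeqAE(nn.Module):                       # reconstructs frame features from the weighted input
    def __init__(self):
        super().__init__()
        self.enc = nn.LSTM(D_FEAT, HID, batch_first=True)
        self.dec = nn.LSTM(HID, D_FEAT, batch_first=True)
    def forward(self, x):
        z, _ = self.enc(x)
        rec, _ = self.dec(z)
        return rec                                              # (B, T, D_FEAT)

class Critic(nn.Module):                      # discriminator: original vs reconstructed sequence
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(D_FEAT, HID, batch_first=True)
        self.out = nn.Linear(HID, 1)
    def forward(self, x):
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h[:, -1]))                # (B, 1) "looks original" probability

scorer, ae, critic = Scorer(), SeqAE(), Critic()
opt_sum = torch.optim.Adam(list(scorer.parameters()) + list(ae.parameters()), lr=1e-4)
opt_dis = torch.optim.Adam(critic.parameters(), lr=1e-4)
bce = nn.BCELoss()

feats = torch.randn(1, 240, D_FEAT)           # toy sequence of deep frame features
real, fake = torch.ones(1, 1), torch.zeros(1, 1)

# Summarizer step: reconstruct the video from the score-weighted features and fool the critic.
scores = scorer(feats)
recon = ae(scores.unsqueeze(-1) * feats)
loss_sum = nn.functional.mse_loss(recon, feats) + bce(critic(recon), real)
opt_sum.zero_grad(); loss_sum.backward(); opt_sum.step()

# Critic step: learn to distinguish the original video from its summary-based reconstruction.
loss_dis = bce(critic(feats), real) + bce(critic(recon.detach()), fake)
opt_dis.zero_grad(); loss_dis.backward(); opt_dis.step()
```

The cited works additionally employ sparsity constraints on the selector, variational or attention-based auto-encoders, and more elaborate training schedules, which are omitted here for brevity.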
Fig. 5. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization by increasing the similarity between the summary and the video.

2. Learn summarization by targeting specific desired properties for the summary. Aiming to deal with the unstable training [52] and the restricted evaluation criteria of GAN-based methods (that mainly focus on the summary's ability to lead to a good reconstruction of the original video), some unsupervised approaches perform summarization by targeting specific properties of an optimal video summary. To this direction, they utilize the principles of reinforcement learning in combination with hand-crafted reward functions that quantify the existence of desired characteristics in the generated summary. As presented in Fig. 6, the Summarizer gets as input the sequence of the video frames and creates a summary by predicting frame-level importance scores. The created (predicted) summary is then forwarded to an Evaluator which is responsible to quantify the existence of specific desired characteristics with the help of hand-crafted reward functions. The computed score(s) are then combined to form an overall reward value, that is finally used to guide the training of the Summarizer.

Fig. 6. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization based on hand-crafted rewards and reinforcement learning.

In this context, [52] formulates video summarization as a sequential decision-making process and trains a Summarizer to produce diverse and representative video summaries using a diversity-representativeness reward. The diversity reward measures the dissimilarity among the selected key-frames and the representativeness reward computes the distance (expressing the visual resemblance) of the selected key-frames from the remaining frames of the video. [53] utilizes Temporal Segment Networks (proposed in [54] for action recognition in videos) to extract spatial and temporal information about the video frames, and trains the Summarizer through a reward function that assesses the preservation of the video's main spatiotemporal patterns in the produced summary. [55] presents a mechanism for both video summarization and reconstruction. Video reconstruction aims to estimate the extent to which the summary allows the viewer to infer the original video (similarly to some of the above-presented GAN-based methods), and video summarization is learned based on the reconstructor's feedback and the output of trained models that assess the representativeness and diversity of the visual content of the generated summary.

3. Build object-oriented summaries by modeling the key-motion of important visual objects. Building on a different basis, [56] focuses on the preservation in the summary of the underlying fine-grained semantic and motion information of the video. For this, it performs a preprocessing step that aims to find important objects and their key-motions. Based on this step, it represents the whole video by creating super-segmented object motion clips. Each one of these clips is then given to the Summarizer, which uses an online motion auto-encoder model (Stacked Sparse LSTM Auto-Encoder) to memorize past states of object motions by continuously updating a tailored recurrent auto-encoder network. The latter is responsible for reconstructing object-level motion clips, and the reconstruction loss between the input and the output of this component is used to guide the training of the Summarizer. Based on this training process the Summarizer is able to generate summaries that show the representative objects in the video and the key-motions of each of these objects.
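The diversity-representativeness reward of [52] mentioned above can be sketched as follows, using the formulation commonly reported for that work (mean pairwise dissimilarity of the selected frames, and an exponential of the mean distance of every frame to its nearest selected frame); the NumPy code and all names are illustrative.

```python
import numpy as np

def diversity_reward(feats, selected):
    """Mean pairwise dissimilarity (1 - cosine similarity) among the selected frames."""
    if len(selected) < 2:
        return 0.0
    x = feats[selected]
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T                                             # cosine similarities
    n = len(selected)
    return float((1.0 - sim).sum() / (n * (n - 1)))           # diagonal contributes 0

def representativeness_reward(feats, selected):
    """exp(-mean distance of each frame to its nearest selected frame)."""
    sel = feats[selected]                                     # (k, d)
    dists = np.linalg.norm(feats[:, None, :] - sel[None, :, :], axis=-1)  # (T, k)
    return float(np.exp(-dists.min(axis=1).mean()))

# Toy usage: reward for a candidate key-frame set proposed by the policy (Summarizer).
feats = np.random.randn(300, 1024).astype(np.float32)        # one deep feature per frame
selected = [10, 55, 120, 200, 260]                            # hypothetical selected indices
reward = diversity_reward(feats, selected) + representativeness_reward(feats, selected)
```

In the reinforcement learning setup, such a scalar reward is back-propagated to the frame-scoring policy through a policy-gradient objective rather than a differentiable loss.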
C. Weakly-supervised Video Summarization

Weakly-supervised video summarization methods try to mitigate the need for extensive human-generated ground-truth data, similarly to unsupervised learning methods. Instead of not using any ground-truth data, they use less-expensive weak labels (such as video-level metadata for video categorization and category-driven summarization, or ground-truth annotations for a small subset of video frames for learning summarization through sparse reinforcement learning and tailored reward functions) under the main hypothesis that these labels are imperfect compared to a full set of human annotations, but can nonetheless allow the training of effective summarization models. We avoid the illustration of a typical analysis pipeline for this class of methods, as there is a limited overlap in the way that these methods conceptualize the learning of the summarization task. The first work that adopts an intermediate way between fully-supervised and fully-unsupervised learning for video summarization is [57]. It uses video-level metadata (e.g., the video title "A man is cooking.") to define a categorization of videos. Then, it leverages multiple videos of a category and extracts 3D-CNN features to automatically learn a parametric model for categorizing new (unseen during training) videos. Finally, it adopts the learned model to select the video segments that maximize the relevance between the summary and the video category. The authors of [57] investigate different ways to tackle issues related to the limited size of available datasets, that include cross-dataset training, the use of web-crawled videos, as well as data augmentation methods for obtaining sufficient training data. [58] proposes a deep learning framework for summarizing first-person videos; however, we report on this method here, as it is also evaluated on a dataset used to assess generic video summarization methods. Driven by the observation that the collection of a sufficiently large amount of fully-annotated first-person video data with ground-truth annotations is a difficult task, the algorithm in [58] exploits the principles of transfer learning and uses annotated third-person videos (which can be found more easily) to learn how to summarize first-person videos. The algorithm performs cross-domain feature embedding and transfer learning for domain adaptation (across third- and first-person videos) in a semi-supervised manner. In particular, training is performed based on a set of third-person videos with fully annotated highlight scores and a set of first-person videos where only a small portion of them comes with ground-truth scores. [59] suggests a weakly-supervised setting of learning summarization models from a large number of web videos. Its architecture combines a Variational Auto-Encoder that learns the latent semantics from web videos, and a sequence encoder-decoder with attention mechanism that performs summarization. The decoding part of the VAE aims to reconstruct the input videos using samples from the learned latent semantics; while, the most important video fragments are identified through the soft attention mechanism of the encoder-decoder network, where the attention vectors of raw videos are obtained by integrating the learned latent semantics from the collected web videos. The overall architecture is trained by a weakly-supervised semantic matching loss to learn the topic-associated summaries. Finally, [60] uses the principles of reinforcement learning to learn summarization based on a limited set of human annotations and a set of hand-crafted rewards. The latter relate to the similarity between the machine- and the human-selected fragments, as well as to specific characteristics of the created summary (e.g., its representativeness). More specifically, this method applies a hierarchical key-fragment selection process that is divided into sub-tasks. Each task is learned through sparse reinforcement learning (thus avoiding the need for exhaustive annotations about the entire set of frames, and using annotations only for a subset of frames) and the final summary is formed based on rewards about its diversity and representativeness.

D. Multimodal Approaches

A number of works investigated the potential of exploiting additional modalities (besides the visual stream) for learning summarization, such as the audio stream, the video captions or ASR transcripts, any available textual metadata (video title and/or abstract) or other contextual data (e.g., viewers' comments). Several of these multimodal approaches were proposed before the so-called "deep learning era", targeting either generic or domain/task-specific video summarization. Some indicative and recent examples can be found in [61], [62], [63], [64], [65], [66], [67], [68]. Addressing the video summarization task from a slightly different perspective, other multimodal algorithms generate a text-based summary of the video [69], [70], [71], and extend this output by extracting one representative key-frame [72]. A multimodal deep-learning-based algorithm for summarizing videos of soccer games was presented in [73], while another multimodal approach for key-frame extraction from first-person videos that exploits sensor data was described in [74]. Nevertheless, all these works are outside the scope of this survey, which focuses on deep-learning-based methods for generic video summarization; i.e., methods that learn summarization with the help of deep neural network architectures and/or the visual content is represented by deep features.

The majority of multimodal approaches that are within the focus of this study tackle video summarization by utilizing also the textual video metadata, such as the video title and description. As depicted in Fig. 7, during training the Summarizer gets as input: i) the sequence of the video frames that needs to be summarized, ii) the relevant ground-truth data (i.e., frame-level importance scores according to the users), and iii) the associated video metadata. Following, it estimates the frames' importance and generates (predicts) a summary. Then, the generated summary is compared with the ground-truth summary and the video metadata. The former comparison produces a similarity score. The latter comparison involves the semantic analysis of the summary and the video metadata. Given the output of this analysis, most algorithms try to minimize the distance of the generated representations in a learned latent space, or the classification loss when the aim is to define a summary that maintains the core characteristics of the video category. The combined outcome of this assessment is finally used to guide the training of the Summarizer. In this context, [75], [76] learn category-driven summarization by rewarding the preservation of the core parts found in video summaries from the same category (e.g., the main parts of a wedding ceremony when summarizing a wedding video). Similarly, [77] trains action classifiers with video-level labels for action-driven video fragmentation and labeling; then, it extracts a fixed number of key-frames and applies reinforcement learning to select the ones with the highest categorization accuracy, thus performing category-driven summarization. [78], [79] define a video summary by maximizing its relevance with the available video metadata, after projecting both the visual and the textual information in a common latent space. Finally, [80] applies a visual-to-text mapping and a semantic-based video fragment selection according to the relevance between the automatically-generated and the original video description, with the help of semantic attended networks. However, most of these methods examine only the visual cues without considering the sequential structure of the video and the temporal dependency among frames.
Fig. 7. High-level representation of the analysis pipeline of supervised algorithms that perform semantic/category-driven summarization.

E. General remarks on deep learning approaches

Based on the review of the relevant bibliography, we saw that early deep-learning-based approaches for video summarization utilize combinations of CNNs and RNNs. The former are used as pre-trained components (e.g., CNNs trained on ImageNet for visual concept detection) to represent the visual content, and the latter (mostly LSTMs) are used to model the temporal dependency among video frames. The majority of these approaches are supervised and try to learn how to make estimates about the importance of video frames/fragments based on human-generated ground-truth annotations. Architectures of RNNs can be used in the typical or in a hierarchical form ([21], [24], [25]) to also model the temporal structure and utilize this knowledge when selecting the video fragments of the summary. In some cases such architectures are combined with attention mechanisms to model the evolution of the users' interest ([26], [27], [28]). Alternatively, they are extended by memory networks to increase the memorization capacity of the network and capture long-range temporal dependencies among parts of the video ([35], [36]). Going one step further, a group of techniques try to learn importance by paying attention to both the spatial and temporal structure of the video, using convolutional LSTMs [37], [38], optical flow maps [39], combinations of CNNs and RNNs [40], or motion extraction mechanisms [42]. Following a different approach, a couple of supervised methods learn how to generate video summaries that are aligned with the human preferences with the help of GANs ([43], [44]). Finally, multimodal algorithms extract the high-level semantics of the visual content using pre-trained CNNs/DCNNs and learn summarization in a supervised manner by maximizing the semantic similarity among the visual summary and the contextual video metadata or the video category. However, the latter methods focus mostly on the visual content and disregard the sequential nature of the video, which is essential when summarizing the presented story.

To train a video summarization network in a fully unsupervised manner, the use of GANs seems to be the central direction thus far ([31], [46], [47], [48], [49], [50], [51]). The main intuition behind the use of GANs is that the produced summary should allow the viewer to infer the original video, and thus the unsupervised GAN-based methods are trying to build a summary that enables a good reconstruction of the original video. In most cases the generative part is composed of an LSTM that estimates the frames' importance according to their temporal dependency, thus indicating the most appropriate video parts for the summary. Then, the reconstruction of the video based on the predicted summary is performed with the help of Auto-Encoders ([31], [46], [48]) that in some cases are combined with tailored attention mechanisms ([47], [49]). Another, but less popular, approach for unsupervised video summarization is the use of reinforcement learning in combination with hand-crafted rewards about specific properties of the generated summary. These rewards usually aim to increase the representativeness and diversity of the summary [52], retain the spatiotemporal patterns of the video [53], or secure a good summary-based reconstruction of the video [55]. Finally, one unsupervised method learns how to build object-oriented summaries by modeling the key-motion of important visual objects using a stacked sparse LSTM Auto-Encoder [56].

Last but not least, a few weakly-supervised methods have also been proposed. These methods learn video summarization by exploiting the semantics of the video metadata [57] or the summary structure of semantically-similar web videos [59], exploiting annotations from a similar domain and transferring the gained knowledge via cross-domain feature embedding and transfer learning techniques [58], or using weakly/sparsely-labeled data under a reinforcement learning framework [60].

With respect to the potential of deep-learning-based video summarization algorithms, we argue that, despite the fact that currently the major research direction is towards the development of supervised algorithms, the exploration of the learning capability of unsupervised methods is highly recommended. The reasoning behind this claim is based on the fact that: i) the generation of ground-truth training data (summaries) can be a really expensive and laborious process; ii) video summarization is a subjective task, thus multiple different summaries can be proposed for a video from different human annotators; and iii) these ground-truth summaries can vary a lot, thus making it hard to train a method with the typical supervised training approaches. On the other hand, unsupervised video summarization algorithms overcome the need for ground-truth data as they can be trained using only an adequately large collection of original, full-length videos. Moreover, unsupervised learning allows to easily train different models of the same deep network architecture using different types of video content (TV shows, news), thus facilitating the domain-adaptation of video summarization. Given the above, we believe that unsupervised video summarization has great advantages, and thus its potential should be further investigated.
IV. EVALUATING VIDEO SUMMARIZATION

A. Datasets

Four datasets prevail in the video summarization bibliography: SumMe [19], TVSum [20], OVP [81] and Youtube [81]. SumMe consists of 25 videos of various genres such as holidays, events and sports, with duration ranging from 1 to 6 minutes. The annotations are obtained from 15 to 18 users per video, in the form of multiple sets of key-fragments. Each summary has a length between 5% and 15% of the initial video duration. TVSum contains 50 videos from 10 categories of the TRECVid MED dataset (5 videos per category), including news, how-to's, user-generated-content and documentaries. Videos' duration ranges from 2 to 11 minutes, and each video has been annotated by 20 users in the form of shot- and frame-level importance scores (ranging from 1 to 5). OVP and Youtube both contain 50 videos, whose annotations are sets of key-frames, produced by 5 users. The video duration ranges from 1 to 4 minutes for OVP, and from 1 to 10 minutes for Youtube. Both datasets are comprised of videos with diverse video content, such as documentaries, educational, ephemeral, historical, and lecture videos (OVP dataset) and cartoons, news, sports, commercials, TV-shows and home videos (Youtube dataset).

Some less-commonly used datasets for video summarization are CoSum [82], MED-summaries [83], Video Titles in the Wild (VTW) [84], League of Legends (LoL) [85] and FPVSum [58]. CoSum has been created to evaluate video co-summarization. It consists of 51 videos that were downloaded from YouTube using 10 query terms that relate to the video content of the SumMe dataset. Each video is approximately 4 minutes long and it is annotated by 3 different users, who have selected sets of key-fragments. The MED-summaries dataset contains 160 annotated videos from the TRECVID 2011 MED dataset. 60 videos form the validation set (from 15 event categories) and the remaining 100 videos form the test set (from 10 event categories), with most of them being 1 to 5 minutes long. The annotations come as one set of importance scores, averaged over 1 to 4 annotators. As far as the VTW dataset is concerned, it includes 18100 open domain videos, with 2000 of them being annotated in the form of sub-shot level highlight scores. The videos are user-generated, untrimmed videos that contain a highlight event and have an average duration of 1.5 minutes. Regarding LoL, it has 218 long videos (30 to 50 minutes), displaying game matches from the North American League of Legends Championship Series (NALCS). The annotations derive from a Youtube channel that provides community-generated highlights (videos with duration 5 to 7 minutes). Therefore, one set of key-fragments is available for each video. Finally, FPVSum is a first-person video summarization dataset. It contains 98 videos from 14 categories of GoPro viewer-friendly videos with varying lengths (over 7 hours in total). For each category, about 35% of the video sequences are annotated with ground-truth scores by at least 10 users, while the remaining are viewed as the unlabeled examples.

It is worth mentioning that in this work we focus only on datasets that are fit for evaluating video summarization methods, namely datasets that contain ground truth annotations regarding the summary or the frame/fragment-level importance of each video. Other datasets might be also used by some works for network pre-training purposes, but they do not concern the present work. Table II summarizes the datasets utilized by the deep-learning-based approaches for video summarization. From this table it is clear that the SumMe and TVSum datasets are, by far, the most commonly used ones. OVP and Youtube are also utilized in a few works, but mainly for data augmentation purposes.

B. Evaluation protocols and measures

1) Evaluating video storyboards: Early video summarization techniques created a static summary of the video content with the help of representative key-frames. First attempts towards the evaluation of the created key-frame-based summaries were based on user studies that made use of human judges for evaluating the resulting quality of the summaries. Judgment was based on specific criteria, such as the relevance of each key-frame with the video content, redundant or missing information [86], or the informativeness and enjoyability of the summary [87]. Although this can be a useful way to evaluate results, the procedure of evaluating each resulting summary by users can be time-consuming and the evaluation results can not be easily reproduced or used in future comparisons. To overcome the deficiencies of user studies, more recent works evaluate their key-frame-based summaries using objective measures and ground-truth summaries. In this context, [88] estimates the quality of the produced summaries using the Fidelity measure [89] and the Shot Reconstruction Degree criterion [90]. A different approach, termed "Comparison of User Summaries", is proposed in [81] and evaluates the generated summary according to its overlap with predefined key-frame-based user summaries. Comparison is performed at a key-frame-basis and the quality of the generated summary is quantified by computing the accuracy and error rates based on the number of matched and non-matched key-frames respectively. This approach was also used in [91], [92], [93], [94]. A similar methodology was followed in [95]. Instead of Accuracy and Error rates, in [95] the evaluation relies on the well-known Precision, Recall and F-Score measures. This protocol was also used in the supervised video summarization approach of [96], while a variation of it was used in [97], [98], [99]. In the latter case, additionally to their visual similarity, two frames are considered a match only if they are temporally no more than a specific number of frames apart.

2) Evaluating video skims: Most recent algorithms tackle video summarization by creating a dynamic video summary (video skim). For this, they select the most representative video fragments and join them in a sequence to form a shorter video. The evaluation methodologies of these works assess the quality of video skims according to their alignment with human preferences, based mostly on objective measures. A first attempt was made in [19], where an evaluation approach along with the SumMe dataset for video summarization were introduced. According to this approach, the videos are first segmented into consecutive and non-overlapping fragments in order to enable matching between key-fragment-based summaries (i.e., to compare the user-generated with the automatically-created summary). Then, based on the scores computed by a video summarization algorithm for the fragments of a given video, an optimal subset of them (key-fragments) is selected and forms the summary.
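At the frame level, the comparison of a machine-generated summary against the available user summaries reduces to computing precision, recall and F-score between binary frame-membership vectors. The sketch below illustrates this computation; the aggregation over multiple annotators (average or maximum per video) varies across evaluation protocols, so both options are shown, and all names and the toy data are illustrative.

```python
import numpy as np

def fscore(machine, user):
    """F-score between two binary frame-membership vectors (1 = frame in summary)."""
    overlap = np.logical_and(machine, user).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / machine.sum()
    recall = overlap / user.sum()
    return 2 * precision * recall / (precision + recall)

def evaluate(machine, user_summaries, mode="avg"):
    """Compare against every annotator and aggregate the per-annotator scores."""
    scores = [fscore(machine, u) for u in user_summaries]
    return max(scores) if mode == "max" else sum(scores) / len(scores)

# Toy usage: 3 annotators, a 1000-frame video, ~15% of frames selected per summary.
rng = np.random.default_rng(0)
machine = rng.random(1000) < 0.15
users = [rng.random(1000) < 0.15 for _ in range(3)]
print(evaluate(machine, users, mode="max"))
```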
TABLE I
DATASETS FOR VIDEO SUMMARIZATION AND THEIR MAIN CHARACTERISTICS

Dataset      | # of videos | duration (min) | content                                                                        | type of annotations             | # of annotators per video
SumMe [19]   | 25          | 1 - 6          | holidays, events, sports                                                       | multiple sets of key-fragments  | 15 - 18
TVSum [20]   | 50          | 2 - 10         | news, how-to's, user-generated, documentaries (10 categories - 5 videos each)  | multiple fragment-level scores  | 20
OVP [81]     | 50          | 1 - 4          | documentary, educational, ephemeral, historical, lecture                       | multiple sets of key-frames     | 5
Youtube [81] | 50          | 1 - 10         | cartoons, sports, tv-shows, commercial, home videos                            | multiple sets of key-frames     | 5
CoSum [82]   | 51          | ~4             | holidays, events, sports (10 categories)                                       | multiple sets of key-fragments  | 3
MED [83]     | 160         | 1 - 5          | 15 categories of various genres                                                | one set of imp. scores          | 1 - 4
VTW [84]     | 2000        | 1.5 (avg)      | user-generated videos that contain a highlight event                           | sub-shot level highlight scores | -
LoL [85]     | 218         | 30 - 50        | matches from a League of Legends tournament                                    | one set of key-fragments        | 1
FPVSum [58]  | 98          | 4.3 (avg)      | first-person videos (14 categories)                                            | multiple frame-level scores     | 10

TABLE II
DATASETS USED BY EACH DEEP-LEARNING-BASED METHOD FOR EVALUATING VIDEO SUMMARIZATION PERFORMANCE
(columns, left to right: SumMe, TVSum, OVP, Youtube, CoSum, MED, VTW, LoL, FPVSum)
VS-DSF [78] ✓
vsLSTM, dppLSTM [21] ✓ ✓ ✓ ✓
DeSumNet [57] ✓ ✓
H-RNN [24] ✓ ✓ ✓ ✓
SUM-GAN [31] ✓ ✓ ✓ ✓
SUM-FCN [32] ✓ ✓ ✓ ✓
FPVSF [58] ✓ ✓ ✓
VESD [59] ✓ ✓
VASNet [27] ✓ ✓ ✓ ✓
HSA-RNN [25] ✓ ✓ ✓ ✓
DTR-GAN [43] ✓ ✓
MAVS [35] ✓ ✓
DR-DSN [52] ✓ ✓ ✓ ✓
SA-SUM [80] ✓ ✓ ✓
DQSN [76] ✓ ✓
Online Motion-AE [56] ✓ ✓ ✓
CSNet [47] ✓ ✓ ✓ ✓
A-AVS, M-AVS [28] ✓ ✓ ✓ ✓
Ptr-Net [44] ✓ ✓ ✓ ✓
Cycle-SUM [48] ✓ ✓
DSSE [79] ✓
EDSN [53] ✓ ✓
SMLD [39] ✓ ✓
UnpairedVSN [51] ✓ ✓ ✓ ✓
ACGAN [50] ✓ ✓ ✓ ✓
SUM-GAN-sl [46] ✓ ✓
SMN [36] ✓ ✓ ✓ ✓
PCDL [55] ✓ ✓ ✓ ✓
ActionRanking [40] ✓ ✓ ✓ ✓
vsLSTM+Att [26] ✓ ✓ ✓ ✓
dppLSTM+Att [26] ✓ ✓ ✓ ✓
WS-HRL [60] ✓ ✓ ✓ ✓
SF-CVS [42] ✓ ✓ ✓
CRSum [38] ✓ ✓ ✓
H-MAN [30] ✓ ✓ ✓ ✓
SUM-GAN-AAE [49] ✓ ✓
DASP [29] ✓ ✓ ✓