UNDER REVIEW 1
Video Summarization Using Deep Neural
Networks: A Survey
Evlampios Apostolidis , Eleni Adamantidou , Alexandros I. Metsai ,
Vasileios Mezaris , Senior Member, IEEE, and Ioannis Patras , Senior Member, IEEE
arXiv:2101.06072v1 [cs.CV] 15 Jan 2021

E. Apostolidis is with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, and with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: apostolid@iti.gr.
E. Adamantidou, A. I. Metsai and V. Mezaris are with the Information Technologies Institute / Centre for Research and Technology Hellas, GR-57001, Thermi, Thessaloniki, Greece, e-mail: {adamelen, alexmetsai, bmezaris}@iti.gr.
I. Patras is with the Queen Mary University of London, Mile End Campus, E14NS, London, UK, e-mail: i.patras@qmul.ac.uk.
¹ https://www.tubefilter.com/2015/07/26/youtube-400-hours-content-every-minute/
² https://www.statista.com/

Abstract—Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades, and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a comprehensive survey of the existing deep-learning-based methods for generic video summarization. After presenting the motivation behind the development of technologies for video summarization, we formulate the video summarization task and discuss the main characteristics of a typical deep-learning-based analysis pipeline. Then, we suggest a taxonomy of the existing algorithms and provide a systematic review of the relevant literature, which shows the evolution of deep-learning-based video summarization technologies and leads to suggestions for future developments. We then report on protocols for the objective evaluation of video summarization algorithms and compare the performance of several deep-learning-based approaches. Based on the outcomes of these comparisons, as well as some documented considerations about the suitability of evaluation protocols, we indicate potential future research directions.

Index Terms—Video summarization, Deep neural networks, Supervised learning, Unsupervised learning, Summarization datasets, Evaluation protocols.

I. INTRODUCTION

In July 2015, YouTube revealed that it receives over 400 hours of video content every single minute, which translates to 65.7 years' worth of content uploaded every day¹. Since then, we have been experiencing an even stronger engagement of consumers with both online video platforms and devices (e.g., smart-phones, wearables, etc.) that carry powerful video recording sensors and allow instant uploading of the captured video on the Web. According to newer estimates², YouTube now receives 500 hours of video per minute; and YouTube is just one of the many video hosting platforms (e.g., DailyMotion, Vimeo), social networks (e.g., Facebook, Twitter, Instagram), and online repositories of media and news organizations that host large volumes of video content. So, how is it possible for someone to efficiently navigate within endless collections of videos and find the video content that s/he is looking for? The answer to this question comes not only from video retrieval technologies but also from technologies for automatic video summarization. The latter allow generating a concise synopsis that conveys the important parts of the full-length video. Given the plethora of video content on the Web, effective video summarization facilitates viewers' browsing of and navigation in large video collections, thus increasing viewers' engagement and content consumption.

The application domain of automatic video summarization is wide and includes (but is not limited to) the use of such technologies by media organizations (after integrating such techniques into their content management systems), to allow effective indexing, browsing, retrieval and promotion of their media assets; and video sharing platforms, to improve viewing experience, enhance viewers' engagement and increase content consumption. In addition, video summarization that is tailored to the requirements of particular content presentation scenarios can be used for, e.g., generating trailers or teasers of movies and episodes of a TV series; presenting the highlights of an event (e.g., a sports game, a music band performance, or a public debate); and creating a video synopsis with the main activities that took place over, e.g., the last 24 hours of recordings of a surveillance camera, for time-efficient progress monitoring or security purposes.

A number of surveys on video summarization have already appeared in the literature. In one of the first works, Barbieri et al. [1] classify the relevant bibliography according to several aspects of the summarization process, namely the targeted scenario, the type of visual content, and the characteristics of the summarization approach. In another early study, Li et al. [2] divide the existing summarization approaches into utility-based methods that use attention models to identify the salient objects and scenes, and structure-based methods that build on the video shots and scenes. Truong et al. [3] discuss a variety of attributes that affect the outcome of a summarization process, such as the video domain, the granularity of the employed video fragments, the utilized summarization methodology, and the targeted type of summary. Money et al. [4] divide the bibliography into methods that rely on the analysis of the video stream, methods that process contextual video metadata, and hybrid approaches that rely on both types of the aforementioned data. Jiang et al. [5] discuss a few characteristic video summarization approaches, which include the extraction of low-level visual features for assessing frame similarity or
performing clustering-based key-frame selection; the detection of the main events of the video using motion descriptors; and the identification of the video structure using Eigen-features. Hu et al. [6] classify the summarization methods into those that target minimum visual redundancy, those that rely on object or event detection, and others that are based on multimodal integration. Ajmal et al. [7] similarly classify the relevant literature into clustering-based methods, approaches that rely on detecting the main events of the story, etc. Nevertheless, all the aforementioned works report on early approaches to video summarization; they do not present how the summarization landscape has evolved over the last years, and especially after the introduction of deep learning algorithms.

The more recent study of Molino et al. [8] focuses on egocentric video summarization and discusses the specifications and the challenges of this task. In another recent work, Basavarajaiah et al. [9] provide a classification of various summarization approaches, including some recently-proposed deep-learning-based methods; however, it mainly focuses on summarization algorithms that are directly applicable in the compressed domain. Finally, another recent survey of Vivekraj et al. [10] presents the relevant bibliography based on a two-way categorization that relates to the utilized data modalities during the analysis and the incorporation of human aspects. With respect to the latter, it further splits the relevant literature into methods that create summaries by modeling human understanding and preferences (e.g., using attention models, the semantics of the visual content, or ground-truth annotations and machine-learning algorithms), and conventional approaches that rely on the statistical processing of low-level features of the video. Nevertheless, none of the above surveys presents in a comprehensive manner the current developments towards generic video summarization, which are tightly related to the growing use of advanced deep neural network architectures for learning the summarization task. As a matter of fact, the relevant research area is a very active one, as more than 35 different deep-learning-based video summarization algorithms have been proposed over the last five years; a subset of these methods represents the current state of the art on automatic video summarization. Motivated by this observation, we aim to fill this gap in the literature by presenting the relevant bibliography on deep-learning-based video summarization and also discussing other aspects that are associated with it, such as the protocols used for evaluating video summarization.

This article provides a comprehensive survey of the existing deep-learning-based video summarization approaches. Section II begins by defining the problem of automatic video summarization and presenting the most prominent types of video summary. Then, it provides a high-level description of the analysis pipeline of deep-learning-based video summarization algorithms, and introduces a taxonomy of the relevant literature according to the utilized data modalities, the adopted training strategy, and the implemented learning approaches. Finally, it discusses aspects that relate to the generated summary, such as the desired properties of a static (frame-based) video summary and the length of a dynamic (fragment-based) video summary. Section III builds on the introduced taxonomy to systematically review the relevant bibliography. A primary categorization is made according to the use or not of human-generated ground-truth data for learning, and a secondary categorization is made based on the adopted learning objective or the utilized data modalities of each different class of methods. For each one of the defined classes, we illustrate the main processing pipeline and report on the specifications of the associated summarization algorithms. After presenting the relevant bibliography, we provide some general remarks that reflect how the field has evolved, especially over the last five years, highlighting the pros and cons of each class of methods. Section IV continues with an in-depth discussion of the utilized datasets and the different evaluation protocols of the literature. Following, Section V discusses the findings of extensive performance comparisons that are based on the results reported in the relevant papers, indicates the most competitive methods in the fields of (weakly-)supervised and unsupervised video summarization, and examines whether there is a performance gap between these two main types of approaches. Based on the surveyed bibliography, in Section VI we propose potential future directions to further advance the current state of the art in video summarization. Finally, Section VII concludes this work by briefly outlining the core findings of the conducted study.

II. PROBLEM STATEMENT

Video summarization aims to generate a short synopsis that summarizes the video content by selecting its most informative and important parts. The produced summary is usually composed of a set of representative video frames (a.k.a. video key-frames), or video fragments (a.k.a. video key-fragments) that have been stitched in chronological order to form a shorter video. The former type of video summary is known as a video storyboard, and the latter type is known as a video skim. One advantage of video skims over static sets of frames is the ability to include audio and motion elements that offer a more natural story narration and potentially enhance the expressiveness and the amount of information conveyed by the video summary. Furthermore, it is often more entertaining and interesting for the viewer to watch a skim rather than a slide show of frames [11]. On the other hand, storyboards are not restricted by timing or synchronization issues and, therefore, they offer more flexibility in terms of data organization for browsing and navigation purposes [12], [13].

Fig. 1. High-level representation of the analysis pipeline of the deep-learning-based video summarization methods for generating a video storyboard and a video skim.

A high-level representation of the typical deep-learning-based video summarization pipeline is depicted in Fig. 1. The first step of the analysis involves the representation of the visual content of the video with the help of feature vectors. Most commonly, such vectors are extracted at the frame level, for all frames or for a subset of them selected via a frame-sampling strategy, e.g., processing 2 frames per second. In this way, the extracted feature vectors store information at a very detailed level and capture the dynamics of the visual content, which are of high significance when selecting the video parts that form the summary. Typically, in most deep-learning-based video summarization techniques the visual content of the video frames is represented by deep feature vectors extracted with the help of pre-trained neural networks. For this, a variety of Convolutional Neural Networks (CNNs) and Deep Convolutional Neural Networks (DCNNs) have been used in the bibliography, including GoogleNet (Inception V1) [14], Inception V3 [15], AlexNet [16], variations of ResNet [17] and variations of VGGnet [18]. Nevertheless, GoogleNet appears to be the most commonly used one thus far. The extracted features are then utilized by a deep summarizer network, which is trained by trying to minimize an objective function or a set of objective functions.

The output of the trained Deep Summarizer Network can be either a set of selected video frames (key-frames) that form a static video storyboard, or a set of selected video fragments (key-fragments) that are concatenated in chronological order and form a short video skim. With respect to the generated video storyboard, this should be similar to the sets of key-frames that would be selected by humans and must exhibit minimal visual redundancy. With regards to the produced video skim, this typically should be equal to or less than a predefined length L. For experimentation and comparison purposes, this is most often set as L = p · T, where T is the video duration and p is the ratio of the summary to video length; p = 0.15 is a typical value, in which case the summary should not exceed 15% of the original video's duration. As a side note, the production of a video skim (which is the ultimate goal of the majority of the proposed deep-learning-based summarization algorithms) requires the segmentation of the video into consecutive and non-overlapping fragments that exhibit visual and temporal coherence, thus offering a seamless presentation of a part of the story. Given this segmentation and the frames' importance scores estimated by the trained Deep Summarizer Network, video-segment-level importance scores are computed by averaging the importance scores of the frames that lie within each video segment. These segment-level scores are then used to select the key-fragments given the summary length L, and most methods tackle this step by solving the Knapsack problem.
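The fragment-selection step just described (averaging frame-level scores into segment-level scores, then solving a 0/1 Knapsack under the length budget L) can be sketched in a few lines of Python. The scores and segments below are made-up toy values; real pipelines take the scores from the trained Summarizer and the segments from a temporal segmentation step.

```python
def fragment_scores(frame_scores, segments):
    """Average the frame-level importance scores within each segment."""
    return [sum(frame_scores[s:e]) / (e - s) for (s, e) in segments]

def knapsack_select(values, lengths, budget):
    """0/1 Knapsack via dynamic programming: choose the subset of segments
    that maximizes total importance without exceeding the frame budget."""
    dp = [(0.0, [])] * (budget + 1)  # dp[c] = (best value, segment ids) at capacity c
    for i in range(len(values)):
        for c in range(budget, lengths[i] - 1, -1):  # descending: each segment used once
            cand = dp[c - lengths[i]][0] + values[i]
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - lengths[i]][1] + [i])
    return sorted(dp[budget][1])

# Toy example: a 10-frame "video" with 3 segments. A 15% budget would be
# too small at this scale, so a 50% budget (5 frames) is used instead.
frame_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.5, 0.4]
segments = [(0, 2), (2, 5), (5, 10)]        # [start, end) frame indices
lengths = [e - s for (s, e) in segments]
values = fragment_scores(frame_scores, segments)
print(knapsack_select(values, lengths, budget=5))  # -> [0, 1]
```

Here segments 0 and 1 fill the five-frame budget with the largest total importance; at real scale the budget would be on the order of 0.15 · T frames.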
With regards to the utilized type of data, the current bibliography on deep-learning-based video summarization can be divided between:
• Unimodal approaches that utilize only the visual modality of the videos for feature extraction, and learn summarization in a (weakly-)supervised or unsupervised manner.
• Multimodal methods that exploit the available textual metadata and learn semantic/category-driven summarization in a supervised way by increasing the relevance between the semantics of the summary and the semantics of the associated metadata or video category.

Concerning the adopted training strategy, the existing deep-learning-based video summarization algorithms can be coarsely grouped into the following categories:
• Supervised approaches that rely on datasets with human-labeled ground-truth annotations (either in the form of video summaries, as in the case of the SumMe dataset [19], or in the form of frame-level importance scores, as in the case of the TVSum dataset [20]), based on which they try to discover the underlying criterion for video frame/fragment selection and video summarization.
• Unsupervised approaches that overcome the need for ground-truth data (whose production requires time-demanding and laborious manual annotation procedures), based on learning mechanisms that require only an adequately large collection of original videos for their training.
• Weakly-supervised approaches that, similarly to unsupervised approaches, aim to alleviate the need for large sets of hand-labeled data. Less-expensive weak labels are utilized with the understanding that they are imperfect compared to a full set of human annotations, but can nonetheless be used to create strong predictive models.

Building on the above described categorizations, a more detailed taxonomy of the relevant bibliography is depicted in Fig. 2. The penultimate layer of this arboreal illustration shows the different learning approaches that have been adopted. The leaves of each node of this layer show the utilized techniques for implementing each learning approach, and contain references to the most relevant works in the bibliography. This taxonomy will be the basis for presenting the relevant bibliography in the following section.

Fig. 2. A taxonomy of the existing deep-learning-based video summarization algorithms.

III. DEEP LEARNING APPROACHES

This section provides a systematic review of the bibliography. It starts by presenting the different classes of supervised (Section III-A), unsupervised (Section III-B), and weakly-supervised (Section III-C) video summarization methods, which rely solely on the analysis of the visual content. Following, it reports on multimodal approaches (Section III-D) that also process the available text-based metadata. Finally, it provides some general remarks (Section III-E) that outline how the field has evolved, especially over the last five years.

A. Supervised Video Summarization

1. Learn frame importance by modeling the temporal dependency among frames. Early deep-learning-based approaches cast summarization as a structured prediction problem and try to make estimates about the frames' importance by modeling their temporal dependency. As illustrated in Fig. 3, during the training phase the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. These data are then used to model the dependencies among the video frames in time (illustrated with solid arched lines) and estimate the frames' importance. The predicted importance scores are compared with the ground-truth data, and the outcome of this comparison guides the training of the Summarizer. The first proposed approach in this direction [21] uses Long Short-Term Memory (LSTM) units [22] to model variable-range temporal dependency among video frames. Frames' importance is estimated using a multi-layer perceptron (MLP), and the diversity of the visual content of the generated summary is increased based on the Determinantal Point Process (DPP) [23]. [24] proposes a two-layer LSTM architecture. The first layer extracts and encodes data about the video structure. The second layer uses this information to estimate fragment-level importance and select the key-fragments of the video. [25] extends the previous method by introducing a component that is trained to identify the shot-level temporal structure of the video. This knowledge is then utilized for estimating importance at the shot level and producing a key-shot-based video summary. [26] builds on [21] and introduces an attention mechanism to model the temporal evolution of the users' interest. Following, it uses this information to estimate frames' importance and select the video key-frames to build a video storyboard. In the same direction, a few methods utilize sequence-to-sequence (a.k.a. seq2seq) architectures in combination with attention mechanisms. [27] proposes a seq2seq network for video summarization that is composed of a soft, self-attention mechanism and a two-layer fully connected network for regression of the frames' importance scores. [28] formulates video summarization as a seq2seq learning problem and proposes an LSTM-based encoder-decoder network with an intermediate attention layer. The summarization model of this work is extended in [29] by integrating a semantic preserving embedding network that evaluates the output of the decoder with respect to the preservation of the video's semantics using a tailored semantic preserving loss, and by replacing the previously used Mean Square Error (MSE) loss with the Huber loss to enhance its robustness to outliers. [30] proposes a hierarchical approach which uses a generator-discriminator architecture (similar to the one in [31]) as an internal mechanism to estimate the representativeness of each shot and define a set of candidate key-frames. Then, it employs a multi-head attention model to further assess candidates' importance and select the key-frames that form the summary. [32] tackles video summarization as a semantic segmentation task where the input video is seen as a 1D image (of size equal to the number of video frames) with K channels that correspond to the K dimensions of the frames' representation vectors (either containing raw pixel values or being precomputed feature vectors). Then, it uses popular semantic segmentation models, such as Fully Convolutional Networks (FCN) [33] and an adaptation of DeepLab [34],
and builds a network (called Fully Convolutional Sequence Network) for video summarization. The latter consists of a stack of convolutions with increasing effective context size as we go deeper in the network, which enables the network to effectively model long-range dependency among frames and learn frames' importance. Finally, to address issues related to the limited capacity of LSTMs, some techniques use additional memory. [35] stores information about the entire video in an external memory and predicts each shot's importance by learning an embedding space that enables matching of each shot with the entire memory information. [36] stacks multiple LSTM and memory layers hierarchically to derive long-term temporal context, and uses this information to estimate the frames' importance.

2. Learn frame importance by modeling the spatiotemporal structure of the video. Aiming to learn how to make better estimates about the importance of video frames/fragments, some techniques pay attention to both the spatial and temporal structure of the video. Again, the Summarizer gets as input the sequence of the video frames and the available ground-truth data that indicate the importance of each frame according to the users' preferences. But, extending the analysis pipeline of the previously described group of methods, it then also models the spatiotemporal dependencies among frames (shown with dashed rectangles and lines in Fig. 3). Once again, the predicted importance scores are compared with the ground-truth data, and the outcome of this comparison guides the training of the Summarizer.

Fig. 3. High-level representation of the analysis pipeline of supervised algorithms that perform summarization by learning the frames' importance after modeling their temporal or spatiotemporal dependency. For the latter class of methods (i.e., modeling the spatiotemporal dependency among frames), object bounding boxes and object relations in time, shown with dashed rectangles and lines, are used to illustrate the extension that models both the temporal and spatial dependency among frames.

From this perspective, [37] presents an encoder-decoder architecture with convolutional LSTMs that models the spatiotemporal relationship among parts of the video. In addition to the estimates about the frames' importance, the algorithm enhances the visual diversity of the summary via next frame prediction and shot detection mechanisms, based on the intuition that the first frames of a shot generally have a high likelihood of being part of the summary. [38] extracts deep and shallow features from the video content using a trainable 3D-CNN and builds a new representation through a fusion strategy. Then, it uses this representation in combination with convolutional LSTMs to model the spatial and temporal structure of the video. Finally, summarization is learned with the help of a new loss function (called Sobolev loss) that aims to define a series of frame-level importance scores that is close to the series of ground-truth scores, by minimizing the distance of the derivatives of these sequential data, and to exploit the temporal structure of the video. [39] extracts spatial and temporal information by processing the raw frames and their optical flow maps with CNNs, and learns how to estimate frames' importance based on human annotations and a label distribution learning process. [40] combines CNNs and Gated Recurrent Units [41] (a type of Recurrent Neural Network (RNN)) to form spatiotemporal feature vectors, which are then used to estimate the level of activity and importance of each frame. [42] trains a neural network for spatiotemporal data extraction and creates an inter-frame motion curve. The latter is then used as input to a transition effects detection method that segments the video into shots. Finally, a self-attention model exploits the human-generated ground-truth data to learn how to estimate the intra-shot importance and select the key-frames/fragments of the video to form a static/dynamic video summary.

3. Learn summarization by fooling a discriminator when trying to discriminate a machine-generated from a human-generated summary. Following a completely different approach to minimizing the distance between the machine-generated and the ground-truth summaries, a couple of methods use Generative Adversarial Networks (GANs). As presented in Fig. 4, the Summarizer (which acts as the Generator of the GAN) gets as input the sequence of the video frames and generates a summary by computing frame-level importance scores. The generated summary (i.e., the predicted frame-level importance scores) along with an optimal video summary for this video (i.e., frame-level importance scores according to the users' preferences) are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. The training of the entire summarization architecture is performed in an adversarial manner. The Summarizer tries to fool the Discriminator into not distinguishing the predicted from the user-generated summary, and the Discriminator aims to learn how to make this distinction. When the Discriminator's confidence is very low (i.e., the classification error is approximately equal for both the machine- and the user-generated summary), the Summarizer is able to generate a summary that is very close to the users' expectations. In this context, [43] combines LSTMs and Dilated Temporal Relational (DTR) units to estimate temporal dependencies among frames at different temporal windows, and learns summarization by trying to fool a trainable discriminator when distinguishing the machine-
distinguishing the summary-based reconstructed video from
the original one, while the Discriminator aims to learn how to
make the distinction. When this discrimination is not possible
(i.e., the classification error is approximately equal for both
the reconstructed and the original video), the Summarizer is
considered to be able to build a video summary that is highly-
representative of the overall video content. To this direction,
the work of Mahasseni et al. [31] is the first that combines
an LSTM-based key-frame selector with a Variational Auto-
Encoder (VAE) and a trainable Discriminator, and learns video
summarization through an adversarial learning process that
aims to minimize the distance between the original video and
Fig. 4. High-level representation of the analysis pipeline of supervised
algorithms that learn summarization with the help of ground-truth data and
the summary-based reconstructed version of it. [46] builds
adversarial learning. on the network architecture of [31] and suggests a stepwise,
label-based approach for training the adversarial part of the
network, that leads to improved summarization performance.
based summary from the ground-truth and a randomly-created [47] also relies on a VAE-GAN architecture but extends it with
one. [44] suggests an adversarial learning approach for (semi- a chunk and stride network (CSNet) and a tailored difference
)supervised video summarization. The Generator/Summarizer attention mechanism for assessing the frames’ dependency
is an attention-based Pointer Network [45] that defines the at different temporal granularities when selecting the video
start and end point of each video fragment that is used to key-frames. [48] aims to maximize the mutual information
form the summary. The Discriminator is a 3D-CNN classifier between the summary and the video using a trainable couple
that judges whether a fragment is from a ground-truth or a of discriminators and a cycle-consistent adversarial learning
machine-generated summary. Instead of using the typical ad- objective. The frame selector (bi-directional LSTM) learns a
versarial loss, in this algorithm the output of the Discriminator video representation that captures the long-range dependency
is used as a reward to train the Generator/Summarizer based among frames. The evaluator is composed of two GANs where
on reinforcement learning. Thus far, the use of GANs for supervised video summarization is limited. Nevertheless, this machine learning framework has been widely used for unsupervised video summarization, as discussed in the following section.

B. Unsupervised Video Summarization

1. Learn summarization by fooling a discriminator when trying to discriminate the original video (or set of key-frames) from a summary-based reconstruction of it. Given the lack of any guidance (in the form of ground-truth data) for learning video summarization, most existing unsupervised approaches rely on the rule that a representative summary ought to assist the viewer to infer the original video content. In this context, these techniques utilize GANs to learn how to create a video summary that allows a good reconstruction of the original video. The main concept of this training approach is depicted in Fig. 5. The Summarizer is usually composed of a Key-frame Selector that estimates the frames' importance and generates a summary, and a Generator that reconstructs the video based on the generated summary. It gets as input the sequence of the video frames and, through the aforementioned internal processing steps, reconstructs the original video based on the generated summary (which is represented by the predicted frame-level importance scores). The reconstructed video along with the original one are given as input to a trainable Discriminator which outputs a score that quantifies their similarity. Similarly to the supervised GAN-based methods, the training of the entire summarization architecture is performed in an adversarial manner. However, in this case the Summarizer tries to fool the Discriminator when [...] the forward GAN learns to reconstruct the original video from the video summary and the backward GAN learns to invert the processing, and evaluates the consistency between the output of such cycle learning as a measure that quantifies information preservation between the original video and the generated summary. Using this measure, the evaluator guides the frame selector to identify the most informative frames and form the video summary. [49] introduces a variation of [46] that replaces the Variational Auto-Encoder with a deterministic Attention Auto-Encoder for learning an attention-driven reconstruction of the original video, which subsequently improves the key-fragment selection process. [50] presents a self-attention-based conditional GAN. The Generator produces weighted frame features and predicts frame-level importance scores, while the Discriminator tries to distinguish between the weighted and the raw frame features. A conditional feature selector is used to guide the GAN model to focus on more important temporal regions of the whole video frames, while long-range temporal dependencies along the whole video sequence are modeled by a multi-head self-attention mechanism. Finally, [51] learns video summarization from unpaired data based on an adversarial process that relies on GANs and a Fully-Convolutional Sequence Network (FCSN) encoder-decoder. The model of [51] aims to learn a mapping function from a raw video to a human-like summary, such that the distribution of the generated summary is similar to the distribution of human-created summaries, while content diversity is forced by applying a relevant constraint on the learned mapping function.
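The adversarial training scheme described above can be sketched in a few lines. This is an illustrative simplification rather than the exact objective of any of the cited methods: we assume the Discriminator outputs a probability that its input is the original video, and we express both sides of the game with a plain binary cross-entropy; the function names are ours.

```python
import math

def bce(p, target):
    # binary cross-entropy between a predicted probability and a 0/1 target
    eps = 1e-8
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))

def discriminator_loss(d_original, d_reconstructed):
    # the Discriminator is trained to output 1 for the original video
    # and 0 for the summary-based reconstruction of it
    return bce(d_original, 1.0) + bce(d_reconstructed, 0.0)

def summarizer_loss(d_reconstructed, reconstruction_error, weight=1.0):
    # the Summarizer is trained to fool the Discriminator (push its output
    # on the reconstruction towards 1) while keeping the reconstruction of
    # the original video from the summary accurate
    return bce(d_reconstructed, 1.0) + weight * reconstruction_error
```

Alternating updates of the two losses implement the adversarial game: a reconstruction that fools the Discriminator (score close to 1) lowers the Summarizer's loss, while the Discriminator is simultaneously pushed to tell the two inputs apart.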
Fig. 5. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization by increasing the similarity between the summary and the video.

Fig. 6. High-level representation of the analysis pipeline of unsupervised algorithms that learn summarization based on hand-crafted rewards and reinforcement learning.

2. Learn summarization by targeting specific desired properties for the summary. Aiming to deal with the unstable training [52] and the restricted evaluation criteria of GAN-based methods (that mainly focus on the summary's ability to lead to a good reconstruction of the original video), some unsupervised approaches perform summarization by targeting specific properties of an optimal video summary. To this end, they utilize the principles of reinforcement learning in combination with hand-crafted reward functions that quantify the existence of desired characteristics in the generated summary. As presented in Fig. 6, the Summarizer gets as input the sequence of the video frames and creates a summary by predicting frame-level importance scores. The created (predicted) summary is then forwarded to an Evaluator which is responsible for quantifying the existence of specific desired characteristics with the help of hand-crafted reward functions. The computed score(s) are then combined to form an overall reward value, which is finally used to guide the training of the Summarizer. In this context, [52] formulates video summarization as a sequential decision-making process and trains a Summarizer to produce diverse and representative video summaries using a diversity-representativeness reward. The diversity reward measures the dissimilarity among the selected key-frames, and the representativeness reward computes the distance (expressing the visual resemblance) of the selected key-frames from the remaining frames of the video. [53] utilizes Temporal Segment Networks (proposed in [54] for action recognition in videos) to extract spatial and temporal information about the video frames, and trains the Summarizer through a reward function that assesses the preservation of the video's main spatiotemporal patterns in the produced summary. [55] presents a mechanism for both video summarization and reconstruction. Video reconstruction aims to estimate the extent to which the summary allows the viewer to infer the original video (similarly to some of the above-presented GAN-based methods), and video summarization is learned based on the reconstructor's feedback and the output of trained models that assess the representativeness and diversity of the visual content of the generated summary.

3. Build object-oriented summaries by modeling the key-motion of important visual objects. Building on a different basis, [56] focuses on the preservation in the summary of the underlying fine-grained semantic and motion information of the video. For this, it performs a preprocessing step that aims to find important objects and their key-motions. Based on this step, it represents the whole video by creating super-segmented object motion clips. Each one of these clips is then given to the Summarizer, which uses an online motion auto-encoder model (Stacked Sparse LSTM Auto-Encoder) to memorize past states of object motions by continuously updating a tailored recurrent auto-encoder network. The latter is responsible for reconstructing object-level motion clips, and the reconstruction loss between the input and the output of this component is used to guide the training of the Summarizer. Based on this training process, the Summarizer is able to generate summaries that show the representative objects in the video and the key-motions of each of these objects.

C. Weakly-supervised Video Summarization

Weakly-supervised video summarization methods try to mitigate the need for extensive human-generated ground-truth data, similarly to unsupervised learning methods. Instead of not using any ground-truth data, they use less-expensive weak labels (such as video-level metadata for video categorization and category-driven summarization, or ground-truth annotations for a small subset of video frames for learning summarization through sparse reinforcement learning and tailored reward functions) under the main hypothesis that these labels are imperfect compared to a full set of human annotations, but can nonetheless allow the training of effective summarization models. We avoid the illustration of a typical analysis pipeline for this class of methods, as there is limited overlap in the way that these methods conceptualize the learning of the summarization task. The first work that adopts an intermediate way between fully-supervised and fully-unsupervised learning for video summarization is [57]. It uses video-level metadata (e.g., the video title "A man is cooking.") to define a categorization of videos. Then, it leverages multiple videos of a category and extracts 3D-CNN features to automatically learn a parametric model for categorizing new (unseen during training) videos. Finally, it adopts the learned model to select the video segments that maximize the relevance between the summary and the video category. The authors of [57] investigate different ways to tackle issues related to the limited size of available datasets, which include cross-dataset training, the use of web-crawled videos, as well as data augmentation methods for obtaining sufficient training data.
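The category-driven selection step of [57] described above can be sketched as follows. This is a hypothetical simplification (function and variable names are ours, not from [57]): we assume the learned categorization model has already produced, for every candidate video segment, a probability per category, and the summary keeps the segments that best support the video's category.

```python
def category_driven_selection(segment_probs, category, k):
    # segment_probs: one list of per-category probabilities per video segment
    # rank segments by the model's confidence for the video's category
    ranked = sorted(range(len(segment_probs)),
                    key=lambda i: segment_probs[i][category],
                    reverse=True)
    # keep the k most category-relevant segments, in temporal order
    return sorted(ranked[:k])
```

For a two-category model over three segments, selecting the two segments most relevant to category 0 simply keeps the two with the highest category-0 probability.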
[58] proposes a deep learning framework for summarizing first-person videos; however, we report on this method here, as it is also evaluated on a dataset used to assess generic video summarization methods. Driven by the observation that the collection of a sufficiently large amount of first-person video data with ground-truth annotations is a difficult task, the algorithm in [58] exploits the principles of transfer learning and uses annotated third-person videos (which can be found more easily) to learn how to summarize first-person videos. The algorithm performs cross-domain feature embedding and transfer learning for domain adaptation (across third- and first-person videos) in a semi-supervised manner. In particular, training is performed based on a set of third-person videos with fully annotated highlight scores and a set of first-person videos where only a small portion of them comes with ground-truth scores. [59] suggests a weakly-supervised setting of learning summarization models from a large number of web videos. Its architecture combines a Variational Auto-Encoder that learns the latent semantics from web videos, and a sequence encoder-decoder with attention mechanism that performs summarization. The decoding part of the VAE aims to reconstruct the input videos using samples from the learned latent semantics, while the most important video fragments are identified through the soft attention mechanism of the encoder-decoder network, where the attention vectors of raw videos are obtained by integrating the learned latent semantics from the collected web videos. The overall architecture is trained by a weakly-supervised semantic matching loss to learn the topic-associated summaries. Finally, [60] uses the principles of reinforcement learning to learn summarization based on a limited set of human annotations and a set of hand-crafted rewards. The latter relate to the similarity between the machine- and the human-selected fragments, as well as to specific characteristics of the created summary (e.g., its representativeness). More specifically, this method applies a hierarchical key-fragment selection process that is divided into sub-tasks. Each task is learned through sparse reinforcement learning (thus avoiding the need for exhaustive annotations about the entire set of frames, and using annotations only for a subset of frames) and the final summary is formed based on rewards about its diversity and representativeness.

D. Multimodal Approaches

A number of works investigated the potential of exploiting additional modalities (besides the visual stream) for learning summarization, such as the audio stream, the video captions or ASR transcripts, any available textual metadata (video title and/or abstract) or other contextual data (e.g., viewers' comments). Several of these multimodal approaches were proposed before the so-called "deep learning era", targeting either generic or domain/task-specific video summarization. Some indicative and recent examples can be found in [61], [62], [63], [64], [65], [66], [67], [68]. Addressing the video summarization task from a slightly different perspective, other multimodal algorithms generate a text-based summary of the video [69], [70], [71], and extend this output by extracting one representative key-frame [72]. A multimodal deep-learning-based algorithm for summarizing videos of soccer games was presented in [73], while another multimodal approach for key-frame extraction from first-person videos that exploits sensor data was described in [74]. Nevertheless, all these works are outside the scope of this survey, which focuses on deep-learning-based methods for generic video summarization; i.e., methods that learn summarization with the help of deep neural network architectures and/or represent the visual content by deep features.

The majority of multimodal approaches that are within the focus of this study tackle video summarization by also utilizing the textual video metadata, such as the video title and description. As depicted in Fig. 7, during training the Summarizer gets as input: i) the sequence of the video frames that needs to be summarized, ii) the relevant ground-truth data (i.e., frame-level importance scores according to the users), and iii) the associated video metadata. It then estimates the frames' importance and generates (predicts) a summary. Then, the generated summary is compared with the ground-truth summary and the video metadata. The former comparison produces a similarity score. The latter comparison involves the semantic analysis of the summary and the video metadata. Given the output of this analysis, most algorithms try to minimize the distance of the generated representations in a learned latent space, or the classification loss when the aim is to define a summary that maintains the core characteristics of the video category. The combined outcome of this assessment is finally used to guide the training of the Summarizer. In this context, [75], [76] learn category-driven summarization by rewarding the preservation of the core parts found in video summaries from the same category (e.g., the main parts of a wedding ceremony when summarizing a wedding video). Similarly, [77] trains action classifiers with video-level labels for action-driven video fragmentation and labeling; then, it extracts a fixed number of key-frames and applies reinforcement learning to select the ones with the highest categorization accuracy, thus performing category-driven summarization. [78], [79] define a video summary by maximizing its relevance with the available video metadata, after projecting both the visual and the textual information in a common latent space. Finally, [80] applies a visual-to-text mapping and a semantic-based video fragment selection according to the relevance between the automatically-generated and the original video description, with the help of semantic attended networks. However, most of these methods examine only the visual cues without considering the sequential structure of the video and the temporal dependency among frames.

E. General remarks on deep learning approaches

Based on the review of the relevant bibliography, we saw that early deep-learning-based approaches for video summarization utilize combinations of CNNs and RNNs. The former are used as pre-trained components (e.g., CNNs trained on ImageNet for visual concept detection) to represent the visual content, and the latter (mostly LSTMs) are used to model the temporal dependency among video frames.
Fig. 7. High-level representation of the analysis pipeline of supervised algorithms that perform semantic/category-driven summarization.

The majority of these approaches are supervised and try to learn how to make estimates about the importance of video frames/fragments based on human-generated ground-truth annotations. Architectures of RNNs can be used in the typical or in a hierarchical form ([21], [24], [25]) to also model the temporal structure and utilize this knowledge when selecting the video fragments of the summary. In some cases such architectures are combined with attention mechanisms to model the evolution of the users' interest ([26], [27], [28]). Alternatively, they are extended by memory networks to increase the memorization capacity of the network and capture long-range temporal dependencies among parts of the video ([35], [36]). Going one step further, a group of techniques try to learn importance by paying attention to both the spatial and the temporal structure of the video, using convolutional LSTMs [37], [38], optical flow maps [39], combinations of CNNs and RNNs [40], or motion extraction mechanisms [42]. Following a different approach, a couple of supervised methods learn how to generate video summaries that are aligned with human preferences with the help of GANs ([43], [44]). Finally, multimodal algorithms extract the high-level semantics of the visual content using pre-trained CNNs/DCNNs and learn summarization in a supervised manner by maximizing the semantic similarity between the visual summary and the contextual video metadata or the video category. However, the latter methods focus mostly on the visual content and disregard the sequential nature of the video, which is essential when summarizing the presented story.

To train a video summarization network in a fully unsupervised manner, the use of GANs seems to be the central direction thus far ([31], [46], [47], [48], [49], [50], [51]). The main intuition behind the use of GANs is that the produced summary should allow the viewer to infer the original video, and thus the unsupervised GAN-based methods try to build a summary that enables a good reconstruction of the original video. In most cases the generative part is composed of an LSTM that estimates the frames' importance according to their temporal dependency, thus indicating the most appropriate video parts for the summary. Then, the reconstruction of the video based on the predicted summary is performed with the help of Auto-Encoders ([31], [46], [48]) that in some cases are combined with tailored attention mechanisms ([47], [49]). Another, but less popular, approach for unsupervised video summarization is the use of reinforcement learning in combination with hand-crafted rewards about specific properties of the generated summary. These rewards usually aim to increase the representativeness and diversity of the summary [52], retain the spatiotemporal patterns of the video [53], or secure a good summary-based reconstruction of the video [55]. Finally, one unsupervised method learns how to build object-oriented summaries by modeling the key-motion of important visual objects using a stacked sparse LSTM Auto-Encoder [56].

Last but not least, a few weakly-supervised methods have also been proposed. These methods learn video summarization by exploiting the semantics of the video metadata [57] or the summary structure of semantically-similar web videos [59], by exploiting annotations from a similar domain and transferring the gained knowledge via cross-domain feature embedding and transfer learning techniques [58], or by using weakly/sparsely-labeled data under a reinforcement learning framework [60].

With respect to the potential of deep-learning-based video summarization algorithms, we argue that, despite the fact that the major research direction is currently towards the development of supervised algorithms, the exploration of the learning capability of unsupervised methods is highly recommended. The reasoning behind this claim is based on the fact that: i) the generation of ground-truth training data (summaries) can be a really expensive and laborious process; ii) video summarization is a subjective task, thus multiple different summaries can be proposed for a video by different human annotators; and iii) these ground-truth summaries can vary a lot, thus making it hard to train a method with the typical supervised training approaches. On the other hand, unsupervised video summarization algorithms overcome the need for ground-truth data, as they can be trained using only an adequately large collection of original, full-length videos. Moreover, unsupervised learning makes it easy to train different models of the same deep network architecture using different types of video content (TV shows, news), thus facilitating the domain adaptation of video summarization. Given the above, we believe that unsupervised video summarization has great advantages, and thus its potential should be further investigated.

IV. EVALUATING VIDEO SUMMARIZATION

A. Datasets

Four datasets prevail in the video summarization bibliography: SumMe [19], TVSum [20], OVP [81] and Youtube [81]. SumMe consists of 25 videos of various genres such as holidays, events and sports, with duration ranging from 1 to 6 minutes. The annotations are obtained from 15 to 18 users per video, in the form of multiple sets of key-fragments. Each summary has a length between 5% and 15% of the initial video duration. TVSum contains 50 videos from 10 categories of the TRECVid MED dataset (5 videos per category), including news, how-to's, user-generated content and documentaries. Videos' duration ranges from 2 to 11 minutes, and each video has been annotated by 20 users in the form of shot- and frame-level importance scores (ranging from 1 to 5). OVP and Youtube both contain 50 videos, whose annotations are sets of key-frames produced by 5 users. The video duration ranges from 1 to 4 minutes for OVP, and from 1 to 10 minutes for Youtube. Both datasets are comprised of videos with diverse content, such as documentaries, educational, ephemeral, historical, and lecture videos (OVP dataset) and cartoons, news, sports, commercials, TV-shows and home videos (Youtube dataset).

Some less-commonly used datasets for video summarization are CoSum [82], MED-summaries [83], Video Titles in the Wild (VTW) [84], League of Legends (LoL) [85] and FPVSum [58]. CoSum has been created to evaluate video co-summarization. It consists of 51 videos that were downloaded from YouTube using 10 query terms that relate to the video content of the SumMe dataset. Each video is approximately 4 minutes long and is annotated by 3 different users, who have selected sets of key-fragments. The MED-summaries dataset contains 160 annotated videos from the TRECVID 2011 MED dataset. 60 videos form the validation set (from 15 event categories) and the remaining 100 videos form the test set (from 10 event categories), with most of them being 1 to 5 minutes long. The annotations come as one set of importance scores, averaged over 1 to 4 annotators. As far as the VTW dataset is concerned, it includes 18100 open-domain videos, with 2000 of them being annotated in the form of sub-shot-level highlight scores. The videos are user-generated, untrimmed videos that contain a highlight event and have an average duration of 1.5 minutes. Regarding LoL, it contains 218 long videos (30 to 50 minutes), displaying game matches from the North American League of Legends Championship Series (NALCS). The annotations derive from a Youtube channel that provides community-generated highlights (videos with duration 5 to 7 minutes); therefore, one set of key-fragments is available for each video. Finally, FPVSum is a first-person video summarization dataset. It contains 98 videos from 14 categories of GoPro viewer-friendly videos with varying lengths (over 7 hours in total). For each category, about 35% of the video sequences are annotated with ground-truth scores by at least 10 users, while the remaining ones are viewed as unlabeled examples.

It is worth mentioning that in this work we focus only on datasets that are fit for evaluating video summarization methods, namely datasets that contain ground-truth annotations regarding the summary or the frame/fragment-level importance of each video. Other datasets might also be used by some works for network pre-training purposes, but they do not concern the present work. Table II summarizes the datasets utilized by the deep-learning-based approaches for video summarization. From this table it is clear that the SumMe and TVSum datasets are, by far, the most commonly used ones. OVP and Youtube are also utilized in a few works, but mainly for data augmentation purposes.

B. Evaluation protocols and measures

1) Evaluating video storyboards: Early video summarization techniques created a static summary of the video content with the help of representative key-frames. First attempts towards the evaluation of the created key-frame-based summaries were based on user studies that made use of human judges for evaluating the resulting quality of the summaries. Judgment was based on specific criteria, such as the relevance of each key-frame with the video content, redundant or missing information [86], or the informativeness and enjoyability of the summary [87]. Although this can be a useful way to evaluate results, the procedure of evaluating each resulting summary by users can be time-consuming, and the evaluation results cannot be easily reproduced or used in future comparisons. To overcome the deficiencies of user studies, more recent works evaluate their key-frame-based summaries using objective measures and ground-truth summaries. In this context, [88] estimates the quality of the produced summaries using the Fidelity measure [89] and the Shot Reconstruction Degree criterion [90]. A different approach, termed "Comparison of User Summaries", is proposed in [81] and evaluates the generated summary according to its overlap with predefined key-frame-based user summaries. Comparison is performed on a key-frame basis, and the quality of the generated summary is quantified by computing the accuracy and error rates based on the number of matched and non-matched key-frames, respectively. This approach was also used in [91], [92], [93], [94]. A similar methodology was followed in [95]. Instead of Accuracy and Error rates, the evaluation in [95] relies on the well-known Precision, Recall and F-Score measures. This protocol was also used in the supervised video summarization approach of [96], while a variation of it was used in [97], [98], [99]. In the latter case, in addition to their visual similarity, two frames are considered a match only if they are temporally no more than a specific number of frames apart.

2) Evaluating video skims: Most recent algorithms tackle video summarization by creating a dynamic video summary (video skim). For this, they select the most representative video fragments and join them in a sequence to form a shorter video. The evaluation methodologies of these works assess the quality of video skims according to their alignment with human preferences, based mostly on objective measures. A first attempt was made in [19], where an evaluation approach along with the SumMe dataset for video summarization were introduced. According to this approach, the videos are first segmented into consecutive and non-overlapping fragments in order to enable matching between key-fragment-based summaries (i.e., to compare the user-generated with the automatically-created summary). Then, based on the scores computed by a video summarization algorithm for the fragments of a given video, an optimal subset of them (key-fragments) is selected and forms the created summary.
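At the frame level, the matching between a machine-generated and a user-generated key-fragment summary reduces to comparing two binary selection masks. The following is a sketch of the Precision/Recall/F-Score computation used by this family of protocols; the handling of multiple user summaries per video (e.g., averaging the scores, or keeping the maximum) varies per dataset and is omitted here.

```python
def summary_fscore(machine_mask, user_mask):
    # masks: 0/1 lists with one entry per frame (1 = frame is in the summary)
    overlap = sum(m and u for m, u in zip(machine_mask, user_mask))
    n_machine, n_user = sum(machine_mask), sum(user_mask)
    if n_machine == 0 or n_user == 0 or overlap == 0:
        return 0.0
    precision = overlap / n_machine   # fraction of the machine summary matched
    recall = overlap / n_user         # fraction of the user summary covered
    return 2 * precision * recall / (precision + recall)
```

A machine summary identical to the user summary yields an F-Score of 1, while a summary sharing half of its frames with an equally long user summary yields 0.5.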
Dataset | # of videos | duration (min) | content | type of annotations | # of annotators per video
SumMe [19] | 25 | 1 - 6 | holidays, events, sports | multiple sets of key-fragments | 15 - 18
TVSum [20] | 50 | 2 - 10 | news, how-to's, user-generated, documentaries (10 categories - 5 videos each) | multiple fragment-level scores | 20
OVP [81] | 50 | 1 - 4 | documentary, educational, ephemeral, historical, lecture | multiple sets of key-frames | 5
Youtube [81] | 50 | 1 - 10 | cartoons, sports, tv-shows, commercial, home videos | multiple sets of key-frames | 5
CoSum [82] | 51 | ∼4 | holidays, events, sports (10 categories) | multiple sets of key-fragments | 3
MED [83] | 160 | 1 - 5 | 15 categories of various genres | one set of imp. scores | 1 - 4
VTW [84] | 2000 | 1.5 (avg) | user-generated videos that contain a highlight event | sub-shot level highlight scores | -
LoL [85] | 218 | 30 - 50 | matches from a League of Legends tournament | one set of key-fragments | 1
FPVSum [58] | 98 | 4.3 (avg) | first-person videos (14 categories) | multiple frame-level scores | 10
TABLE I
DATASETS FOR VIDEO SUMMARIZATION AND THEIR MAIN CHARACTERISTICS.
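In the skim-evaluation protocol described before Table I, the "optimal subset" of fragments is chosen under a summary-length budget (commonly up to 15% of the video's duration). A common way to formulate this selection — an assumption here, since the exact step varies across papers — is a 0/1 knapsack over the fragments' importance scores and lengths:

```python
def select_fragments(scores, lengths, budget):
    # 0/1 knapsack: maximize total importance of the chosen fragments
    # subject to a total-length budget (e.g., 15% of the video duration)
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                cand = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # backtrack to recover the indices of the selected fragments
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)
```

For instance, with fragment scores [3, 4, 5], lengths [2, 3, 4] and a budget of 5, the first two fragments (total value 7) are preferred over the single highest-scoring one (value 5).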
Method SumMe TVSum OVP Youtube CoSum MED VTW LoL FPVSum
VS-DSF [78] ✓
vsLSTM, dppLSTM [21] ✓ ✓ ✓ ✓
DeSumNet [57] ✓ ✓
H-RNN [24] ✓ ✓ ✓ ✓
SUM-GAN [31] ✓ ✓ ✓ ✓
SUM-FCN [32] ✓ ✓ ✓ ✓
FPVSF [58] ✓ ✓ ✓
VESD [59] ✓ ✓
VASNet [27] ✓ ✓ ✓ ✓
HSA-RNN [25] ✓ ✓ ✓ ✓
DTR-GAN [43] ✓ ✓
MAVS [35] ✓ ✓
DR-DSN [52] ✓ ✓ ✓ ✓
SA-SUM [80] ✓ ✓ ✓
DQSN [76] ✓ ✓
Online Motion-AE [56] ✓ ✓ ✓
CSNet [47] ✓ ✓ ✓ ✓
A-AVS, M-AVS [28] ✓ ✓ ✓ ✓
Ptr-Net [44] ✓ ✓ ✓ ✓
Cycle-SUM [48] ✓ ✓
DSSE [79] ✓
EDSN [53] ✓ ✓
SMLD [39] ✓ ✓
UnpairedVSN [51] ✓ ✓ ✓ ✓
ACGAN [50] ✓ ✓ ✓ ✓
SUM-GAN-sl [46] ✓ ✓
SMN [36] ✓ ✓ ✓ ✓
PCDL [55] ✓ ✓ ✓ ✓
ActionRanking [40] ✓ ✓ ✓ ✓
vsLSTM+Att [26] ✓ ✓ ✓ ✓
dppLSTM+Att [26] ✓ ✓ ✓ ✓
WS-HRL [60] ✓ ✓ ✓ ✓
SF-CVS [42] ✓ ✓ ✓
CRSum [38] ✓ ✓ ✓
H-MAN [30] ✓ ✓ ✓ ✓
SUM-GAN-AAE [49] ✓ ✓
DASP [29] ✓ ✓ ✓
TABLE II
DATASETS USED BY EACH DEEP-LEARNING-BASED METHOD FOR EVALUATING VIDEO SUMMARIZATION PERFORMANCE.