Report on UG2+ Challenge Track 1: Assessing Algorithms to Improve Video Object Detection and Classification from Unconstrained Mobility Platforms

Page created by Jim Barker
 
CONTINUE READING
Report on UG2 + Challenge Track 1: Assessing Algorithms to Improve Video
                                            Object Detection and Classification from Unconstrained Mobility Platforms

                                                   Sreya Banerjee*1 , Rosaura G. VidalMata*1 , Zhangyang Wang2 ,and Walter J. Scheirer1
                                                                1
                                                                    Dept. of Computer Science & Engineering, University of Notre Dame, USA
                                                                                2
                                                                                  Texas A&M University, College Station, TX
                                                                           {sbanerj2, rvidalma, walter.scheirer}@nd.edu
arXiv:1907.11529v3 [cs.CV] 19 Feb 2020

                                                                                       atlaswang@tamu.edu

                                                                    Abstract                               To do this, one’s first inclination might be to turn to the
                                                                                                        state-of-the-art visual recognition systems which, trained
                                             How can we effectively engineer a computer vision sys-     with millions of images crawled from the web, would be
                                         tem that is able to interpret videos from unconstrained mo-    able to identify objects, events, and human identities from
                                         bility platforms like UAVs? One promising option is to make    a massive pool of irrelevant frames. However, such ap-
                                         use of image restoration and enhancement algorithms from       proaches do not take into account the artifacts unique to
                                         the area of computational photography to improve the qual-     the operation of the sensors used to capture outdoor data,
                                         ity of the underlying frames in a way that also improves au-   as well as the visual aberrations that are a product of
                                         tomatic visual recognition. Along these lines, exploratory     the environment. While there have been important ad-
                                         work is needed to find out which image pre-processing al-      vances in the area of computational photography [24, 32],
                                         gorithms, in combination with the strongest features and       their incorporation as a pre-processing step for higher-level
                                         supervised machine learning approaches, are good candi-        tasks has received only limited attention over the past few
                                         dates for difficult scenarios like motion blur, weather, and   years [41, 49]. The impact many transformations have on
                                         mis-focus — all common artifacts in UAV acquired im-           visual recognition algorithms remains unknown.
                                         ages. This paper summarizes the protocols and results of          Following the success of the UG2 challenge on this topic
                                         Track 1 of the UG2 + Challenge held in conjunction with        held at IEEE/CVF CVPR 2018 [49, 51], a new challenge
                                         IEEE/CVF CVPR 2019. The challenge looked at two sepa-          with an emphasis on video was organized at CVPR 2019.
                                         rate problems: (1) object detection improvement in video,      The UG2 + 2019 Challenge provided an integrated forum
                                         and (2) object classification improvement in video. The        for researchers to evaluate recent progress in handling var-
                                         challenge made use of new protocols for the UG2 (UAV,          ious adverse visual conditions in real-world scenes, in ro-
                                         Glider, Ground) dataset, which is an established benchmark     bust, effective and task-oriented ways. 16 novel algorithms
                                         for assessing the interplay between image restoration and      were submitted by academic and corporate teams from the
                                         enhancement and visual recognition. 16 algorithms were         University of the Chinese Academy of Sciences (UCAS),
                                         submitted by academic and corporate teams, and a detailed      Northeastern University (NEU), Institute of Microelectron-
                                         analysis of them is reported here.                             ics of the Chinese Academy of Sciences (IMECAS), Uni-
                                                                                                        versity of Macau (UMAC), Honeywell International, Tech-
                                         1. Introduction                                                nical University of Munich (TUM), Chinese Academy of
                                                                                                        Sciences (CAS), Sunway.AI, and Meitu’s MTlab.
                                            The use of mobile video capturing devices in uncon-
                                         strained scenarios offers clear advantages in a variety of         In this paper, we review the changes made to the origi-
                                         areas where autonomy is just beginning to be deployed.         nal dataset, evaluation protocols, algorithms submitted, and
                                         For instance, a camera installed on a platform like an un-     experimental analysis for Track 1 of the challenge, primar-
                                         manned aerial vehicle (UAV) could provide valuable infor-      ily concerned with video object detection and classification
                                         mation about a disaster zone without endangering human         from unconstrained mobility platforms. (A separate paper
                                         lives. And a flock of such devices can facilitate the prompt   has been prepared describing Track 2, which focused on im-
                                         identification of dangerous hazards as well as the location    proving poor visibility environments.) The novelty of this
                                         of survivors, the mapping of terrain, and much more. How-      work lies in the evaluation protocols we use to assess algo-
                                         ever, the abundance of frames captured in a single session     rithms, which quantify the mutual influence between low-
                                         makes the automation of their analysis a necessity.            level computational photography approaches and high-level
tasks such as detection and classification. Moreover, it is the   3. The UG2 + Challenge
first such work to rigorously evaluate video object detection
and classification after task-specific image pre-processing.         The main goal of this work is to provide insights re-
                                                                  lated to the impact image restoration and enhancement tech-
                                                                  niques have on visual recognition tasks performed on video
                                                                  captured in unconstrained scenarios. For this, we introduce
2. Related Work                                                   two visual recognition tasks: (1) object detection improve-
                                                                  ment in video, where algorithms produce enhanced images
    Datasets. There is an ample number of datasets de-            to improve the localization and identification of an object
signed for the qualitative evaluation of image enhancement        of interest within a frame, and (2) object classification im-
algorithms in the area of computational photography. Such         provement in video, where algorithms analyze a group of
datasets are often designed to fix a particular type of aber-     consecutive frames in order to create a better video se-
ration such as blur [25, 23, 45, 43, 32], noise [3, 5, 35], or    quence to improve classification of a given object of interest
low resolution [4]. Datasets containing more diverse sce-         within those frames.
narios [20, 42, 55, 44] have also been proposed. However,
these datasets were designed for image quality assessment         3.1. Object Detection Improvement in Video
purposes, rather than for a quantitative evaluation of the en-        For Track 1.1 of the challenge, the UG2 dataset [49] was
hancement algorithm on a higher-level task like recognition.      adapted to be used for localizing and identifying objects of
   Datasets with a similar type of data to the one em-            interest1 . This new dataset exceeds PASCAL VOC [8] in
ployed for this challenge include large-scale video surveil-      terms of the number of classes used, as well as in the diffi-
lance datasets such as [9, 12, 40, 64], which provide video       culty of recognizing some classes due to image artifacts.
captured from a single fixed overhead viewpoint. As for           93, 668 object-level annotations were extracted from 195
datasets collected by aerial vehicles, the VIRAT [34] and         videos coming from the three original UG2 collections [49]
VisDrone2018 [63] datasets have been designed for event           (UAV, Glider, Ground), spanning 46 classes inspired by Im-
recognition and object detection respectively. Other aerial       ageNet [39] (see Supp. Table 1 for dataset statistics and
datasets include the UCF Aerial Action Data Set [1], UCF-         Supp. Fig. 1 for the shared class distribution). There are
ARG [2], UAV123 [31], and the multi-purpose dataset in-           86, 484 video frames, each having a corresponding annota-
troduced by Yao et al. [56]. Similarly, none of these datasets    tion file in .xml format, similar to PASCAL VOC.
provide protocols that introduce image enhancement tech-              Each annotation file includes the dataset collection the
niques to improve the performance of visual recognition.          image frame belongs to, its relative path, width, height,
                                                                  depth, objects present in the image, the bounding box coor-
    Restoration and Enhancement to Improve Visual
                                                                  dinates indicating the location of each object in the image,
Recognition. Intuitively, improving the visual quality of
                                                                  and segmentation and difficulty indicators. (Note that dif-
a corrupted image should, in turn, improve the performance
                                                                  ferent videos have different resolutions.) Since we are pri-
of object recognition for a classifier analyzing the image.
                                                                  marily concerned with localizing and recognizing the ob-
As such, one could assume a correlation between the per-
                                                                  ject, the indicator for segmentation in the annotation file
ceptual quality of an image and its quality for object recog-
                                                                  is kept at 0 meaning “no segmentation data available.” Be-
nition purposes, as has been observed by Gondal et al. [52]
                                                                  cause the objects in our dataset are fairly recognizable, we
and Tahboub et al. [46].
                                                                  kept the indicator for difficulty set to 0 to indicate “easy.”
   Early attempts at unifying visual recognition and vi-              Similar to the original UG2 dataset, the UG2 + object de-
sual enhancement tasks included deblurring [60, 61], super-       tection dataset is divided into the following three categories:
resolution [13], denoising [58], and dehazing [26]. These         (1) 30 Creative Commons tagged videos taken by fixed-
approaches tend to overlook the qualitative appearance of         wing UAVs obtained from YouTube; (2) 29 glider videos
the images and instead focus on improving the performance         recorded by pilots of fixed-wing gliders; and (3) 136 con-
of object recognition. In contrast, the approach proposed by      trolled videos captured on the ground using handheld cam-
Sharma et al. [41] incorporates two loss functions for en-        eras. Unlike the original UG2 dataset, we do not crop out
hancement and classification into an end-to-end processing        the objects from the frames, and instead, use the whole
and classification pipeline.                                      frames for the detection task.
                                                                      Sequestered Test Dataset. The test set has a total of
   Visual enhancement techniques have been of interest for
                                                                  2, 858 images and annotations from all three collections.
unconstrained face recognition [57, 33, 62, 10, 53, 54, 27,
                                                                  Supp. Table 1 shows the details for individual collections.
28, 16, 59, 19, 48, 36, 21] through the incorporation of
deblurring, super-resolution, hallucination techniques, and          1 The object detection dataset (including the train-validation split) and

person re-identification for video surveillance data.             evaluation kit is available from: http://bit.ly/UG2Detection
The classes were selected based on the difficulty of detect-      this we adapted the evaluation method and metrics provided
ing them on the validation set (see Sec. 5.1, Supp. Fig. 2,       in [51] to take into account the temporal factor of the data
and the description in Supp. Sec. 1, 1.1 for details related      present in the UG2 dataset. Below we introduce the adapted
to this). The evaluation for the formal challenge was se-         training and testing datasets, as well as the evaluation met-
questered, meaning participants did not have access to the        rics and baseline classification results for this task2 .
test data prior to the submission of their algorithms.                UG2 + Classification Dataset. To leverage both the tem-
    Evaluation Protocol for Detection. The objective of           poral and visual features of a given scene, we divided each
this challenge is to detect objects from a number of visual       of the 196 videos of the original UG2 dataset into multiple
object classes in unconstrained environments. It is funda-        object sequences (for a total of 576 object sequences). We
mentally a supervised learning problem in that a training         define an object sequence as a collection of multiple frames
set of labeled images is provided. Participants are not ex-       in which a given object of interest is present in the camera
pected to develop novel object detection models. They are         view. For each of these sequences, we provide frame-level
encouraged to use a pre-processing step (for instance, super-     annotations detailing the location of the specified object (a
resolution, denoising, deblurring, or algorithms that jointly     bounding box with its coordinates) as well as the UG2 class.
optimize image quality and recognition performance) in                A UG2 class encompasses a number of ImageNet classes
the detection pipeline. To evaluate the algorithms submit-        belonging to a common hierarchy (e.g., the UG2 class “car”
ted by the participants, the raw frames of UG2 are first          includes the ImageNet classes “jeep”, “taxi”, and “limou-
pre-processed with the submitted algorithms and are then          sine”), and is used in place of such classes to account for
sent to the YOLOv3 detector [38], which was fine-tuned            instances in which it might be impossible to identify the
on the UG2 dataset. In this paper, we also consider an-           fine-grained ImageNet class that an object belongs to. For
other detector, YOLOv2 [37]. The details can be found in          example, it might be impossible to tell what the specific type
Supp. Sec. 1.2.                                                   of car on the ground is from an aerial video where that car
    The metric used for scoring is Mean Average                   is hundreds — if not thousands — of feet away from the
Precision (mAP) at Intersection over Union (IoU)                  sensor.
[0.15, 0.25, 0.5, 0.75, 0.9]. The mAP evaluation is kept the          Supp. Table 4 details the number of frames and object
same as PASCAL VOC [8], except for a single modification          sequences extracted from each of the UG2 collections for
introduced in IoU. Unlike PASCAL VOC, we are evaluating           the training and testing datasets. An important difference
mAP at different IoU values. This is to account for differ-       between the training and testing datasets is that while some
ent sizes and scales of objects in our dataset. We consider       of the collections in the training set have a larger number
predictions to be “a true match” when they share the same         of object sequences, that does not necessarily translate to a
label and an IoU ≥ 0.15, 0.25, 0.5, 0.75, 0.90. The average       larger number of frames (as is the case with the UAV collec-
precision (AP) for each class is calculated as the area under     tion). As such, the number of frames (and thus duration) of
the precision-recall curve. Then, the mean of all AP scores       each object sequence is not uniform across all three collec-
is calculated, resulting in a mAP value from 0 to 100%.           tions. The number of frames per object sequence can range
                                                                  anywhere from five frames to hundreds of frames. How-
3.2. Object Classification Improvement in Video                   ever, for the testing set, all of the object sequences have at
                                                                  least 40 frames. It is important to note that while the testing
    While interest in applying image enhancement tech-
                                                                  set contains imagery similar to that present in the training
niques for classification purposes has started to grow, there
                                                                  set, the quality of the videos might vary. This results in dif-
has not been a direct application of such methods on video
                                                                  ferences in the classification performance (more details on
data. Currently, image enhancement algorithms attempt to
                                                                  this are discussed in Sec. 5.1.
estimate the visual aberrations a of a given image O, in or-
der to establish an aberration-free version I of the scene            Evaluation Protocol for Video Classification. Given
captured (i.e., O = I ⊗ a + n, where n represents additional      the nature of this sub-challenge, each pre-processing algo-
noise that might of be a byproduct of the visual aberration       rithm is provided with a given set of object sequences rather
a). It is natural to assume that the availability of additional   than individual — and possibly unrelated — frames. The al-
information — like the information present in several con-        gorithm is then expected to make use of both temporal and
tiguous video frames — would enable a more accurate esti-         visual information pertaining to each object sequence in or-
mation of such aberrations, and as such a cleaner represen-       der to provide an enhanced version of each of the sequence’s
tation of the captured scene.                                     individual frames. The object of interest is then cropped out
                                                                  of the enhanced frames and used as input to an off-the-shelf
    Taking this into account, we created challenge Track 1.2.
                                                                  classification network. For the challenge evaluation, we
The main goal of this track is to correct visual aberrations
present in video in order to improve the classification results       2 The object classification dataset and evaluation kit are available from:

obtained with out-of-the-box classification algorithms. For       http://bit.ly/UG2Devkit
focused solely on VGG16 trained on ImageNet.However,            applied the VDSR [22] super-resolution algorithm and Fast
we do provide a comparative analysis of the effect of the       Artifact Reduction CNN [6]. For the Glider collection, they
enhancement algorithms on different classifiers in Sec. 5.      chose to do nothing.
There was no fine-tuning on UG2 + in the competition, as            UCAS-NEU: Smart HDR. Team UCAS-NEU concen-
we were interested in making low-quality images achieve         trated on enhancing the resolution and dynamic range of
better classification results on a network trained with gen-    the videos in UG2 via deep learning. Irrespective of the
erally good quality data (as opposed to having to retrain a     collection the images came from, they used linear blending
network to adapt to each possible artifact encountered in the   to add the image with its corresponding Gaussian-blurred
real world). But we do look at the impact of fine-tuning in     counterpart to perform a temporal cross-dissolve between
this paper.                                                     these two images, resulting in a sharpened output. Their
   The network provides us with a 1, 000 × n vector, where      other algorithm used the Fast Artifact Reduction CNN [6]
n corresponds to the number of frames in the object se-         to reduce compression artifacts present in the UG2 dataset,
quence, detailing the confidence score of each of the 1, 000    especially in the UAV collection caused by repeated uploads
ImageNet classes on each of the sampled frames. To eval-        and downloads from Youtube.
uate the classification accuracy of each object sequence we         Honeywell: Camera and Conditions-Relevant En-
use Label Rank Average Precision (LRAP) [47]:                   hancements (CCRE). Team Honeywell used their CCRE
                          n                                     algorithm [50] to closely target image enhancements to
                 ˆ     1 X 1 X |Lij |
         LRAP(y, f ) =                                          avoid the counter-productive results that the UG2 dataset
                       n i=0 |yi | j:y =1 rankij
                                        k                       has highlighted [49]. Their algorithm relies on the fact
                   n                         o                  that not all types of enhancement techniques may be use-
              Lij = k : yik = 1, fˆik > fˆij                    ful for a particular image coming from any of the three
                           n                o                   different collections of the UG2 dataset. To find the use-
               rankij =     k : fˆik ≥ fˆij                     ful subset of image enhancement techniques required for
                                                                a particular image, the CCRE pipeline considers the inter-
    LRAP measures the fraction of highly ranked ImageNet        section of camera-relevant enhancements with conditions-
labels (i.e., labels with the highest confidence score fˆ as-   relevant enhancements. Examples of camera-relevant en-
signed by a given classification network, such as VGG16)        hancements include de-interlacing, rolling shutter removal
Lij that belong to the true label UG2 class yi of a given se-   (both depending on the sensor hardware), and de-vignetting
quence i containing n frames. A perfect score (LRAP = 1)        (for fisheye lenses). Example conditions-relevant enhance-
would then mean that all of the highly ranked labels belong     ments include de-hazing (when imaging distant objects out-
to the ground-truth UG2 class. For example, if the class        doors) and raindrop removal. While interlacing was the
“shore” has two sub-classes lake-shore and sea-shore, then      largest problem with the Glider images, the Ground and
the top 2 predictions of the network for all of the cropped     UAV collections were degraded by compression artifacts.
frames in the object sequence are in fact lake-shore and sea-   De-interlacing was performed on detected images from the
shore. LRAP is generally used for multi-class classification    Glider dataset with the expectation that the edge-type fea-
tasks where a single object might belong to multiple classes.   tures learned by the VGG network will be impacted by
Given that our object annotations are not as fine-grained as    jagged edges from interlacing artifacts. Detected video
the ImageNet classes (each of the UG2 classes encompasses       frames from the UAV and Ground collections were pro-
several ImageNet classes), we found this metric to be a good    cessed with the Fast Artifact Reduction CNN [6]. For
fit for our classification task.                                their other algorithms, they used an autoencoder trained on
                                                                the UG2 dataset to enhance images, and a combination of
4. Challenge Workshop Entries                                   autoencoder and de-interlacing algorithm to enhance de-
    Here we describe the approaches for one or both of the      interlaced images. The encoder part of the autoencoder fol-
evaluation tasks from each of the challenge participants.       lows the architecture of SRCNN [7]
    IMECAS-UMAC: Intelligent Resolution Restoration.                TUM-CAS: Adaptive Enhancement. Team TUM-
The main objective of the algorithms submitted by team          CAS employed a method similar to IMECAS-UCAS. They
IMECAS-UMAC was to restore resolution based on scene            used a scene classifier to find out which collection the image
content with deep learning. As UG2 contains videos with         came from, or reverted to a “None” failure case. Based on
varying degrees of imaging artifacts coming from three dif-     the collection, they used a de-interlacing technique similar
ferent collections, their method incorporated a scene classi-   to Honeywell’s for the Glider collection, an image sharp-
fier to detect which collection the image came from in order    ening method and histogram equalization for the Ground
to apply targeted enhancement and restoration to that im-       collection, and histogram equalization followed by super-
age. For the UAV and Ground collection respectively, they       resolution [22] for the UAV collection. If the image was
found to not belong to any of the collections in UG2 , they                    UAV              Glider           Ground
chose to do nothing.                                              mAP      Val.   Test       Val.    Test      Val.   Test
    Meitu’s MTLab: Data Driven Enhancement. Meitu’s               @15     96.4% 1.3%        95.1% 5.19%       100% 31.6%
MTLab proposed an end-to-end neural network incorporat-           @25     95.5% 1.3%        94.8% 5.19%       100% 31.6%
ing direct supervision through the use of traditional cross-      @50     88.6% 0.61%       91.1% 0.01%       100% 21.5%
entropy loss in addition to the YOLOv3 detection loss in or-      @75     39.5% 0%          40.3% 0%          96.7% 15.8%
der to improve the detection performance. The motivation          @90     1.9% 0%           2.9% 0%           54.7% 0.04%
behind doing this is to make the YOLO detection perfor-
mance as high as possible, i.e., enhance the features of the     Table 1: mAP scores for the UG2 Object Detection dataset
image required for detection. The proposed network con-          with YOLOv3 fine-tuned on the UG2 dataset. For the mAP
sists of two sub-networks: Base Net and Combine Net. For         scores for YOLOv2, see Table. 2 in the Supp. Mat.
Base Net, they used the convolutional layers of ResNet [15]      tion classes and measured its performance on the reserved
as their backbone network to extract the features at different   validation and test data per collection to establish baseline
levels from the input image. Features from each convolution      performance. Table 1 shows the baseline mAP scores ob-
branch are then passed to the individual layers of Combine       tained using YOLOv3 on raw video frames (i.e., without
Net, which are fused to get an intermediate image. The           any pre-processing). Overall, we observe distinct differ-
final output is created as an addition of the intermediate im-   ences between the results for all three collections, particu-
age from Combine Net and the original image. While they          larly between the airborne collections (UAV and Glider) and
use ResNet as the Base Net for its strong feature extraction     the Ground collection. Since the network was fine-tuned
capability, the Combine Net captures the context at differ-      with UG2 +, we expected the mAP score at 0.5 IoU for vali-
ent levels (low-level features like edges to high-level image    dation to be fairly high for all three collections. The Ground
statistics like object definitions). For training this network   collection receives a perfect score of 100% for mAP at 0.5.
end-to-end, they used cross-entropy and YOLOv3 detection         This is due to the fact that images within the Ground col-
loss and UG2 as the dataset.                                     lection have minimal imaging artifacts and variety, as well
    Sunway.AI: Sharpen and Histogram Equalization                as many pixels on target, compared to the other collections.
with Super-resolution (SHESR). Team Sunway.AI em-                The UAV collection, on the other hand, has the worst per-
ployed a method similar to IMECAS-UCAS and TUM-                  formance due to relatively small object scales and sizes, as
CAS. They also used a scene classifier to find out which         well as compression artifacts resulting from the processing
collection the image came from, or reverted to a “None”          applied by YouTube. It achieves a very low score of 1.93%
failure case. They used histogram equalization followed          for mAP at 0.9.
by super-resolution [22] for the UAV collection, and image
                                                                     For the test dataset, we concentrated more on the classes
sharpening followed by Fast Artifact Reduction CNN [6]
                                                                 of UG2 that were underrepresented in the training dataset
for the Ground collection to remove blocking artifacts due
                                                                 to make it more challenging based on Supp. Fig. 2 (See
to JPEG compression. If the image was from the Glider col-
                                                                 Supp. Sec. 1.1 for details). For example, for the Ground
lection or was found to not belong to any of the collections
                                                                 dataset, we concentrated on objects whose distance from the
in UG2 , they chose to do nothing.
                                                                 camera was maximum (200 ft) or had the highest induced
5. Results & Analysis                                            motion blur (180 rpm) or toughest weather condition (e.g.,
                                                                 rainy day, snowfall). Correspondingly, the mAP scores on
5.1. Baseline Results
                                                                 the test dataset are very low. At operating points of 0.75
   Object Detection Improvement on Video. In order to            and 0.90 IoU, most of the object classes in the Glider and
establish scores for detection performance before and after      UAV collections are unrecognizable. This however, varies
the application of image enhancement and restoration al-         for the Ground collection, which receives a score of 15.75%
gorithms submitted by participants, we use the YOLOv3            and 0.04% respectively. The classes that were readily iden-
object detection model [38] to localize and identify ob-         tified in the ground collection were large objects: “Analog
jects in a frame and then consider the mAP scores at IoU         Clock”, “Arch”, “Street Sign”, and “Car 2” for 0.75 IoU
[0.15, 0.25, 0.5, 0.75, 0.9]. Since the primary goal of our      and only “Arch” for 0.90 IoU. We also fine-tuned a separate
challenge does not involve developing a novel detection          detector, YOLOv2 [37], on UG2 + to assess the impact of
method or comparing the performance among popular ob-            a different detector architecture on our dataset. The details
ject detectors, ideally, any detector could be used for mea-     can be found in Supp. Sec. 1.2.
suring the performance. We chose YOLOv3 because it is                Object Classification Improvement on Video. Table 2
easy to train and is the fastest among the popular off-the-      shows the average LRAP of each of the collections on both
shelf detectors [11, 30, 29].                                    the training and testing datasets without any restoration or
   We fine-tuned YOLOv3 to reflect the UG2 + object detec-       enhancement algorithm applied to them. These scores were
UAV                   Glider         Ground           common classes of the two datasets, we observe a signifi-
          Train Test            Train Test      Train Test          cant improvement in the average LRAP of the training set
  V16     12.2% 12.7%           10.7% 33.7%     46.3% 29.4%         (with an average LRAP of 28.46% for the VGG16 classi-
  R50     13.3% 15.1%           11.7% 28.5%     51.7% 38.7%         fier). More details on this analysis can be found in the Supp.
 D201     3.9% 2.1%             3.9% 1.0%       7.5% 3.8%           Mat. Table 10.
 MV2      1.8% 1.8%             1.5% 1.2%       6.9% 5.4%               To evaluate the impact of the domain transfer between
 NNM      1.2% 2.0%             1.2% 0.5%       1.0% 0.5%           ImageNet features and our dataset, specifically looking at
                                                                    the disparity between the training and testing performance
Table 2: UG2 Object Classification Baseline Statistics for          on the Glider collection, we fine-tuned the VGG16 network
ImageNet pre-trained networks: VGG16 (V16), ResNet50                on the Glider collection training set for 200 epochs with
(R50), DenseNet201 (D201), MobileNetV2 (MV2), NAS-                  a training/validation split of 80/20%, obtaining a training,
NetMobile (NNM)                                                     validation, and testing accuracy of 91.67%, 27.55%, and
                                                                    20.25% respectively. Once the network was able to gather
calculated by averaging the LRAP score of each of the ob-
                                                                    more information about the dataset, the gap between vali-
ject sequences yc of a given UG2 class Ci , for all the k
                                                                    dation and testing was diminished. Nevertheless, the broad
classes in that particular collection D:
                                                                    difference between the training and testing scores indicates
                                      K
                                   1 X                              that the network has problems generalizing to the UG2 +
         AverageLRAP(D) =                LRAP(Ci )                  data, perhaps due to the large number of image aberrations
                                   K i=0
                                                                    present in each image.
                        |Ci |
                    1 X                                             5.2. Challenge Participant Results
  LRAP(Ci ) =                LRAP(yc , fˆ) | Ci ∈ Dclasses
                   |Ci | c=0
                                                                       Object Detection Improvement on Video. For the de-
    As can be observed from the training set, the average           tection task, each participant’s algorithms were evaluated
LRAP scores for each collection tend to be quite low, which         on the mAP score at IoU [0.15, 0.25, 0.5, 0.75, 0.9]. If an
is not surprising given the challenging nature of the dataset.      algorithm had the highest score or the second-highest score
While the Ground dataset presents a higher average LRAP,            (in situations where the baseline had the best performance),
the scores from the two aerial collections are very low. This       in any of these metrics, it was given a score of 1. The best
can be attributed to both aerial collections containing more        performing team was selected based on the scores obtained
severe artifacts as well as a vastly different capture view-        by their algorithms. As in the 2018 challenge, each team
point than the one in the Ground collection (whose images           was allowed to submit three algorithms. Thus, the upper
would have a higher resemblance to the classification net-          bound for the best performing team is 45: 3 (algorithms) ×
work training data). It is important to note the sharp dif-         5 (mAP at IoU intervals) × 3 (collections).
ference in the performance of different classifiers on our             Figs. 1a, 1b, 1c and 2 show the results from the detection
dataset. While the ResNet50 [14] classifier obtained a              challenge for the best performing algorithms submitted by
slightly better but similar performance to the VGG16 classi-        the participants for the different collections, as compared
fier — which was the one used to evaluate the performance           to the baseline. For brevity, only the results of the top-
of the participants in the challenge — other networks such          performing algorithm for each participant are shown. Full
as DenseNet [18], MobileNet [17], and NASNetMobile [65]             results can be found in Supp. Fig. 3. We also provide the
perform poorly when classifying our data. It is likely that         participants’ results on YOLOv2 in Supp. Table 3.
these models are highly oriented to ImageNet-like images,              We found the mAP scores at [0.15, 0.25], [0.25, 0.5], and
and have more trouble generalizing to our data without fur-         [0.75, 0.9] to be most discriminative in determining the win-
ther fine-tuning.                                                   ners. This is primarily due to the fact that objects in the air-
    For the testing set, the UAV collection maintains a low         borne data collections (UAV, Glider) have negligible sizes
score. However, the Ground collection’s score drops signifi-        and different degrees of views compared to the objects in
cantly. This is mainly due to a higher amount of frames with        the Ground data collection. In most cases, none of the al-
problematic conditions (such as rain, snow, motion blur or          gorithms could exceed the performance of the baseline by
just an increased distance to the target objects), compared to      a considerable margin, re-emphasizing that detection in un-
the frames in the training set. A similar effect is shown on        constrained environments is still an unsolved problem.
the Glider collection, for which the majority of the videos            The UCAS-NEU team with their strategy of sharpen-
in the testing set tended to portray either larger objects (e.g.,   ing images won the challenge by having the highest mAP
mountains) or objects closer to the camera view (e.g., other        scores for UAV (0.02% improvement over baseline) and
aircraft flying close to the video-recording glider). When          Ground dataset (0.32% improvement). This shows that im-
exclusively comparing the classification performance of the         age sharpening can re-define object edges and boundaries,
mAP@0.15                  mAP@0.25                                                                            mAP@0.25                         mAP@0.5                                                                          mAP@0.75

                                                                                                             12                                                                                                                      20

             2                                                                                               10
                                                                                                                                                                                                                                     18
                                                                                                               8
  mAP (%)

                                                                                                   mAP (%)

                                                                                                                                                                                                                           mAP (%)

                                                                                                                                                                                                                                          16.07
                  1.32
                  1.32

                                                                                                                        5.19

                                                                                                                                       5.19

                                                                                                                                                     5.19

                                                                                                                                                                   5.19

                                                                                                                                                                                 5.19

                                                                                                                                                                                                             5.19
            1.5

                                                                                                                                                                                                                                                  15.81

                                                                                                                                                                                                                                                                    15.81

                                                                                                                                                                                                                                                                            15.81
                                                                                                                                                                                                                                                          15.75

                                                                                                                                                                                                                                                                                            15.75
                                                 1.26
                                                 1.26                                                          6
                                      1.3
                                      1.3

                                                                                       1.3
                                                                                       1.3
                                                             1.24
                                                             1.24
                               1.12
                               1.12

                                                                                                                                                                                                                                     16
                                                                                                               4

                                                                                                                                                                                               2.57
             1

                                                                                                                           2 · 10−2

                                                                                                                                          1 · 10−2

                                                                                                                                                        1 · 10−2

                                                                                                                                                                      1 · 10−2

                                                                                                                                                                                    1 · 10−2

                                                                                                                                                                                                                1 · 10−2
                                                                                                                                                                                                      1.02
                                                                                                               2

                                                                                                                                                                                                                                                                                    13.08
                                                                                                                                                                                                                                     14
                                                                           0.59
                                                                           0.59

            U    C     Intl -CAS ay.AI Tlab seline                                                             U    C     Intl -CAS ay.AI Tlab seline                                                                                U    C     Intl -CAS ay.AI Tlab seline
         -NE -UMA well     M Sunw     M    Ba                                                               -NE -UMA well     M Sunw     M    Ba                                                                                  -NE -UMA well     M Sunw     M    Ba
      AS    S    y      TU                                                                               AS    S    y      TU                                                                                                  AS    S    y      TU
    UC ECA Hone                                                                                        UC ECA Hone                                                                                                           UC ECA Hone
      IM                                                                                                 IM                                                                                                                    IM

                              (a) UAV collection                                                                                      (b) Glider collection                                                                          (c) Ground collection @ mAP 0.75.

                                Figure 1: Object Detection: Best performing submissions per team across three mAP intervals.

                                                                        mAP@0.90                                                                                      cult benchmarks (mAP @ 0.5 for Glider, mAP @ 0.9 for
                                                                                                                                                                      Ground) by almost 0.9% and 0.15% respectively. As can be
                                                                                                                                                                      seen in Fig. 3, this algorithm creates many visible artifacts.
                                                                                                0.19

                               0.2
                                                                                                                                                                          With the YOLOv2 architecture, the results of the sub-
                              0.15                                                                                                                                    mitted algorithms on the detection challenge varied greatly.
                   mAP (%)

                                                                                                                                                                      Full results can be found in Supp. Table 3. While none of
                               0.1                                                                                                                                    the algorithms could beat the performance of the baseline
                                      4 · 10−2

                                                 4 · 10−2

                                                                          4 · 10−2

                                                                                     4 · 10−2

                                                                                                                                                                      for the Ground collection at mAP@0.15 and mAP@0.25,
                                                                                                                                                                      MTLab (0.41% improvement over baseline) and UCAS-
                                                             1 · 10−2

                                                                                                             1 · 10−2

                     5 · 10−2
                                                                                                                                                                      NEU (2.64% and 0.22% improvement over baseline) had
                                     U    C
                                  -NE -UMA well
                                                Intl -CAS ay.AI Tlab seline                                                                                           the highest performance at mAP@0.50, mAP@0.75 and
                               AS    S    y         M Sunw     M    Ba
                             UC ECA Hone         TU                                                                                                                   mAP@0.90 respectively, reiterating the fact that simple tra-
                               IM
                                                                                                                                                                      ditional methods like image sharpening can improve detec-
Figure 2: Object Detection: Best performing submissions                                                                                                               tion performance in relatively clean images. For Glider,
per team for the Ground collection @ mAP 0.9.                                                                                                                         Honeywell’s SRCNN-based autoencoder had the best per-
                                                                                                                                                                      formance at mAP@0.15 and mAP@0.25 with over 0.66%
thereby helping in object detection. For the Glider collec-                                                                                                           and 0.13% improvement over the baseline respectively, su-
tion, most participants chose to do nothing. Thus the results                                                                                                         perseding their performance of 3.22% with YOLOv3 (see
of several algorithms are an exact replica of the baseline re-                                                                                                        Supp. Fig. 3). Unlike, YOLOv3, YOLOv2 seems to be less
sult. Also, most participants tended to use a scene classifier                                                                                                        affected by JPEG blocking artifacts that could have been
trained on UG2 to identify which collection the input image                                                                                                           enhanced due to their algorithm. For UAV however, UCAS-
came from. The successful execution of such a classifier                                                                                                              NEU and IMECAS-CAS had the highest performance at
determines the processing to be applied to the images be-                                                                                                             mAP@0.15 and mAP@0.25 with marginal improvements
fore sending them to a detector. If the classifier failed to                                                                                                          of 0.02% and 0.01% over the baselines.
detect the collection, no pre-processing would be applied to                                                                                                              Object Classification Improvement in Video. Fig. 3
the image, sending the raw unchanged frame to the detec-                                                                                                              shows a comparison of the visual effects of each of the top-
tor. Large numbers of failed detections likely contributed to                                                                                                         performing algorithms. For most algorithms the enhance-
some results being almost the same as the baseline.                                                                                                                   ment seems to be quite subtle (as is the case of the UCAS-
    An interesting observation here concerns the perfor-                                                                                                              NEU and IMECAS-UMAC examples), when analyzing the
mance of the algorithm from MTLab. As discussed previ-                                                                                                                differences with the raw image it is easier to notice the ef-
ously, MTlab’s algorithm jointly optimized image restora-                                                                                                             fects of their sharpening algorithms. We observe a different
tion with object detection with the aim to maximize detec-                                                                                                            scenario in the images enhanced by the MTLab algorithm,
tion performance over image quality. Although it performs                                                                                                             such images presenting visually noticeable artifacts, and a
poorly for comparatively easier benchmarks (mAP@0.15                                                                                                                  larger area of the image being modified. While this behav-
for UAV, mAP@0.25 for Glider, mAP@0.75 for Ground),                                                                                                                   ior seemed to provide good improvement in the object de-
it exceeds the performance of other algorithms on diffi-                                                                                                              tection task, it proved to be adverse for the object classifi-
cation scenario in which the exact location of the object of                                                                       UAV                 Glider                      Ground

interest is already known.

                                                                                                                                                    33.71
                                                                                                                 34.4

                                                                                                                                                                                 33.23
                                                                                                                                    32.23

                                                                                                                                                                                                    32.11
                                                                                                             31.63

                                                                                                                                29.36

                                                                                                                                                            29.36

                                                                                                                                                                            29.36

                                                                                                                                                                                                29.36
                                                                           Average LRAP (%)
                                                                                                30

                                                                                                                                                                                                                      19.04
                                                                                                                                                                                                                   16.43
                                                                                                20
                        (a) UCAS-NEU

                                                                                                                        12.86
                                                                                                     12.85

                                                                                                                                                                    12.86
                                                                                                                                            12.71

                                                                                                                                                                                         12.5

                                                                                                                                                                                                            8.43
                                                                                                10
                                                                                                        U      C   Intl   CA
                                                                                                                             S
                                                                                                                               ay.A
                                                                                                                                   I                                                                        lab
                                                                                                    -NE -UMA well       M- Sunw                                                                       MT
                                                                                                 AS      AS    ney   TU
                                                                                              UC     E C    Ho
                           (b) MTLab                                                              IM

                                                                 Figure 4: Best performing submissions per team for the ob-
                                                                 ject classification task using VGG16 classifier. The dashed
                                                                 lines represent the baseline for each data collection
                      (c) IMECAS-UMAC                            over different classifiers we observe interesting results.
Figure 3: Visual comparison of enhancement and restora-          Supp. Fig. 6 showcases the performance of the same algo-
tion algorithms submitted by participants. The first image       rithms as in Fig. 4, when evaluated using a ResNet50 classi-
is the original input, second the algorithm’s output and the     fier. It is interesting to note that while some of the enhance-
third is the difference between the input and output.            ment algorithms that presented some improvement in the
    It is important to note that even though the UAV collec-     VGG16 classification, such as the UCAS-NEU algorithm
tion is quite challenging (with a baseline performance of        for the UAV and Ground collections, still had an improve-
12.71% for the VGG16 network; more detailed informa-             ment when using Resnet50, some other algorithms that hurt
tion on the performance of all the submitted algorithms can      the classification accuracy with VGG16 obtained improve-
be found in Supp. Tables 5-9), it tended to be the collec-       ments when being evaluated using a different classifier (as
tion for which most of the evaluated algorithms presented        it was the case for the IMECAS-UMAC, TUM-CAS, and
some kind of improvement over the baseline. The high-            Sunway.AI algorithms on the Glider dataset). Nevertheless,
est improvement, however, was ultimately low (0.15% im-          the enhancement algorithms had difficulties in improving
provement). The intuition behind this is that even though        the performance for the UAV collection when the evalua-
the enhancement techniques employed by all the algorithms        tion classifier was ResNet50, as none of them were able
were quite varied, given the high diversity of optical aberra-   to provide improvement for this type of data. However, a
tions in the UAV collection, a good portion of the enhance-      good number of them were able to improve the classifica-
ment methods do not correct for all of the degradation types     tion results for the VGG16 network, hinting that while the
present in these videos.                                         performance of these networks is similar, the enhancement
    Even though the Glider collection test set appears to be     of features for one does not necessarily translate to an im-
easier than the provided training set (having an average         provement in the features utilized by the other.
LRAP more than 20% higher than that of the training set),
it turned out to be quite challenging for most of the evalu-     6. Discussion
ated methods. Only one of the methods was able to beat the           The results of the challenge led to some surprises. While
baseline classification performance by 0.689% in that case.      the restoration and enhancement algorithms submitted by
We observe similar results for the Ground collection, where      the participants tended to improve the detection and clas-
only two algorithms were able to improve upon the baseline       sification results for the diverse imagery included in our
classification performance for the VGG16 network. Inter-         dataset, no approach was able to improve the results by a
estingly, a majority of the algorithms skipped the process-      significant margin. Moreover, some of the enhancement
ing for the videos from this collection, considering them to     algorithms that improved performance (e.g., MT-Lab’s ap-
be of sufficient quality by default. The highest performance     proach) degraded the image quality, making it almost un-
improvement was present for the Ground collection, with          realistic. This provides evidence that most CNN-based de-
the top-performing algorithm for this set providing a 4.25%      tection methods rely on contextual features for prediction
improvement over the baseline. Fig. 4 shows the top algo-        rather than focusing on the structure of the object itself.
rithm for each team in all three collections, as well as how     So what might seem like a perfect image to the detector
they compare with their respective baselines.                    may not seem realistic to a human observer. Add to this
    When comparing the performance of the algorithms             the complexity of varying scales, sizes, weather conditions
and imaging artifacts like blur due to motion, atmospheric          [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
turbulence, mis-focus, distance, camera characteristics, etc.            for image recognition. CoRR, abs/1512.03385, 2015. 6
Simultaneously correcting the artifacts with the dual goal          [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-
of improving recognition and perceptual quality of these                 ing for image recognition. In Proceedings of the IEEE con-
videos is an enormous task — and we have only begun to                   ference on computer vision and pattern recognition, pages
                                                                         770–778, 2016. 5
scratch the surface.
                                                                    [16] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simul-
   Acknowledgement Funding was provided under IARPA                      taneous super-resolution and feature extraction for recogni-
contract #2016-16070500002. This research is based upon                  tion of low-resolution faces. In IEEE CVPR, 2008. 2
work supported in part by the Office of the Director of             [17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
National Intelligence (ODNI), Intelligence Advanced Re-                  T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Effi-
search Projects Activity (IARPA). The views and conclu-                  cient convolutional neural networks for mobile vision appli-
sions contained herein are those of the authors and should               cations. CoRR, abs/1704.04861, 2017. 6
not be interpreted as necessarily representing the official         [18] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected
policies, either expressed or implied, of ODNI, IARPA, or                convolutional networks. CoRR, abs/1608.06993, 2016. 6
the U.S. Government. The U.S. Government is authorized              [19] H. Huang and H. He. Super-resolution method for face
to reproduce and distribute reprints for governmental pur-               recognition using nonlinear mappings on coherent features.
poses notwithstanding any copyright annotation therein.                  IEEE T-NN, 22(1):121–130, 2011. 2
                                                                    [20] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-
                                                                         resolution from transformed self-exemplars. In IEEE CVPR,
References                                                               2015. 2
                                                                    [21] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and
 [1] UCF Aerial Action data set. http://crcv.ucf.edu/
                                                                         B. Xu. Super-resolution person re-identification with semi-
     data/UCF_Aerial_Action.php. 2
                                                                         coupled low-rank discriminant dictionary learning. In IEEE
 [2] UCF-ARG data set. http://crcv.ucf.edu/data/
                                                                         CVPR, 2015. 2
     UCF-ARG.php. 2
                                                                    [22] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-
 [3] A. Abdelhamed, S. Lin, and M. S. Brown. A high-quality              resolution using very deep convolutional networks. In IEEE
     denoising dataset for smartphone cameras. In IEEE CVPR,             CVPR, 2016. 4, 5
     2018. 2                                                        [23] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and
 [4] E. Agustsson and R. Timofte. Ntire 2017 challenge on single         S. Harmeling.        Recording and playback of camera
     image super-resolution: Dataset and study. In IEEE CVPR             shake: Benchmarking blind deconvolution with a real-world
     Workshops, 2017. 2                                                  database. In ECCV, 2012. 2
 [5] C. Chen, Q. Chen, J. Xu, and V. Koltun. Learning to see in     [24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
     the dark. In IEEE CVPR, 2018. 2                                     A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al.
 [6] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression               Photo-realistic single image super-resolution using a genera-
     artifacts reduction by a deep convolutional network. In IEEE        tive adversarial network. In IEEE CVPR, 2017. 1
     ICCV, 2015. 4, 5                                               [25] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Under-
 [7] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a                  standing and evaluating blind deconvolution algorithms. In
     deep convolutional network for image super-resolution. In           IEEE CVPR, 2009. 2
     European conference on computer vision, pages 184–199.         [26] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. AOD-Net:
     Springer, 2014. 4                                                   All-in-one dehazing network. In IEEE ICCV, 2017. 2
 [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and       [27] F. Lin, J. Cook, V. Chandran, and S. Sridharan. Face recog-
     A. Zisserman. The Pascal visual object classes (VOC) chal-          nition from super-resolved images. In Proceedings of the
     lenge. IJCV, 88(2):303–338, 2010. 2, 3                              Eighth International Symposium on Signal Processing and
                                                                         Its Applications, volume 2, 2005. 2
 [9] R. B. Fisher. The PETS04 surveillance ground-truth data
                                                                    [28] F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Super-
     sets. In IEEE PETS Workshop, 2004. 2
                                                                         resolved faces for improved face recognition from surveil-
[10] C. Fookes, F. Lin, V. Chandran, and S. Sridharan. Evaluation        lance video. Advances in Biometrics, 2007. 2
     of image resolution and super-resolution on face recognition   [29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal
     performance. Journal of Visual Communication and Image              loss for dense object detection. In IEEE ICCV, 2017. 5
     Representation, 23(1):75 – 93, 2012. 2
                                                                    [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y.
[11] R. Girshick. Fast r-cnn. In IEEE ICCV, 2015. 5                      Fu, and A. C. Berg. SSD: Single shot multibox detector. In
[12] M. Grgic, K. Delac, and S. Grgic. SCface–surveillance               ECCV, 2016. 5
     cameras face database. Multimedia Tools and Applications,      [31] M. Mueller, N. Smith, and B. Ghanem. A benchmark and
     51(3):863–879, 2011. 2                                              simulator for UAV tracking. In ECCV, 2016. 2
[13] M. Haris, G. Shakhnarovich, and N. Ukita. Task-driven su-      [32] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale con-
     per resolution: Object detection in low-resolution images.          volutional neural network for dynamic scene deblurring. In
     CoRR, abs/1803.11316, 2018. 2                                       IEEE CVPR, 2017. 1, 2
[33] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and               S. Ghosh, S. Nagesh, et al. Bridging the gap between compu-
     O. Yamaguchi. Facial deblur inference to improve recogni-              tational photography and visual recognition. arXiv preprint
     tion of blurred faces. In IEEE CVPR, 2009. 2                           arXiv:1901.09482, 2019. 4
[34] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T.       [51]   R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Al-
     Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis,                   bright, P. Davalos, S. McCloskey, B. Miller, A. Tambo,
     E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick,             S. Ghosh, S. Nagesh, Y. Yuan, Y. Hu, J. Wu, W. Yang,
     H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song,              X. Zhang, J. Liu, Z. Wang, H. Chen, T. Huang, W. Chin,
     A. Fong, A. Roy-Chowdhury, and M. Desai. A large-                      Y. Li, M. Lababidi, C. Otto, and W. J. Scheirer. Bridging the
     scale benchmark dataset for event recognition in surveillance          gap between computational photography and visual recogni-
     video. In IEEE CVPR, 2011. 2                                           tion. CoRR, abs/1901.09482, 2019. 1, 3
[35] T. Plotz and S. Roth. Benchmarking denoising algorithms         [52]   M. Waleed Gondal, B. Schölkopf, and M. Hirsch. The un-
     with real photographs. In IEEE CVPR, 2017. 2                           reasonable effectiveness of texture transfer for single image
[36] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari. Con-           super-resolution. In ECCV Workshops, 2018. 2
     volutional neural network super resolution for face recogni-    [53]   F. W. Wheeler, X. Liu, and P. H. Tu. Multi-frame super-
     tion in surveillance monitoring. In International Conference           resolution for face recognition. In IEEE BTAS, 2007. 2
     on Articulated Motion and Deformable Objects, 2016. 2           [54]   J. Wu, S. Ding, W. Xu, and H. Chao. Deep joint face hallu-
[37] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger.          cination and recognition. CoRR, abs/1611.08091, 2016. 2
     In Proceedings of the IEEE conference on computer vision        [55]   C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-
     and pattern recognition, pages 7263–7271, 2017. 3, 5                   resolution: A benchmark. In ECCV, 2014. 2
[38] J. Redmon and A. Farhadi. Yolov3: An incremental improve-       [56]   B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale
     ment. arXiv preprint arXiv:1804.02767, 2018. 3, 5                      general purpose ground truth database: Methodology, anno-
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,                tation tool and benchmarks. In 6th International Conference
     S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,                 on Energy Minimization Methods in Computer Vision and
     et al. ImageNet large scale visual recognition challenge.              Pattern Recognition, 2007. 2
     IJCV, 115(3):211–252, 2015. 2                                   [57]   Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A.
[40] J. Shao, C. C. Loy, and X. Wang. Scene-independent group               Abidi. Improving long range and high magnification face
     profiling in crowd. In IEEE CVPR, 2014. 2                              recognition: Database acquisition, evaluation, and enhance-
                                                                            ment. CVIU, 111(2):111–125, 2008. 2
[41] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. V. Gool,
     and R. Stiefelhagen. Classification driven dynamic image        [58]   J. Yim and K. Sohn. Enhancing the performance of convolu-
     enhancement. CoRR, abs/1710.07558, 2017. 1, 2                          tional neural networks on quality degraded datasets. CoRR,
                                                                            abs/1710.06805, 2017. 2
[42] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical
                                                                     [59]   J. Yu, B. Bhanu, and N. Thakoor. Face recognition in video
     evaluation of recent full reference image quality assessment
                                                                            with closed-loop super-resolution. In IEEE CVPR Work-
     algorithms. IEEE T-IP, 15(11):3440–3451, 2006. 2
                                                                            shops, 2011. 2
[43] J. Shi, L. Xu, and J. Jia. Discriminative blur detection fea-
                                                                     [60]   M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. De-
     tures. In IEEE CVPR, 2014. 2
                                                                            convolutional networks. In IEEE CVPR, 2010. 2
[44] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and
                                                                     [61]   M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive decon-
     O. Wang. Deep video deblurring. CoRR, abs/1611.08387,
                                                                            volutional networks for mid and high level feature learning.
     2016. 2
                                                                            In IEEE ICCV, 2011. 2
[45] L. Sun, S. Cho, J. Wang, and J. Hays. Edge-based blur kernel    [62]   H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S.
     estimation using patch priors. In IEEE ICCP, 2013. 2                   Huang. Close the loop: Joint blind image restoration and
[46] K. Tahboub, D. Gera, A. R. Reibman, and E. J. Delp.                    recognition with sparse representation prior. In IEEE ICCV,
     Quality-adaptive deep learning for pedestrian detection. In            2011. 2
     IEEE ICIP, 2017. 2                                              [63]   P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu. Vision meets
[47] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-               drones: A challenge. arXiv preprint arXiv:1804.07437,
     label data. In Data mining and knowledge discovery hand-               2018. 2
     book, pages 667–685. Springer, 2009. 4                          [64]   X. Zhu, C. C. Loy, and S. Gong. Video synopsis by het-
[48] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel. Facial          erogeneous multi-source correlation. In IEEE ICCV, 2013.
     image super resolution using sparse representation for im-             2
     proving face recognition in surveillance monitoring. In SIU,    [65]   B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learn-
     2016. 2                                                                ing transferable architectures for scalable image recognition.
[49] R. G. Vidal, S. Banerjee, K. Grm, V. Struc, and W. J.                  CoRR, abs/1707.07012, 2017. 6
     Scheirer. UG2 : A video benchmark for assessing the impact
     of image restoration and enhancement on automatic visual
     recognition. In IEEE WACV, 2018. 1, 2, 4
[50] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Al-
     bright, P. Davalos, S. McCloskey, B. Miller, A. Tambo,
You can also read