Seminar SS2020 - Visual Feature Learning in Autonomous Driving
                                Erçelik, Emeç
                                  April 2020

1     Introduction
In this seminar, we will focus on the object detection and tracking problems.
Object detection aims to localize objects, together with their classes, on a
2D plane (front view or bird's eye view) or in 3D space. This provides the
distance of the ego-vehicle to the objects in the environment, as well as
their shapes, which is useful for planning algorithms. Tracking is another
aspect of localization, in which the positions of specific objects are
determined across subsequent frames by assigning a tracking ID to each object.
These two tasks are therefore closely related, and both are fundamental
problems for an autonomous driving function. A minimal sketch of the tracking
formulation is given below.
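
    To make the tracking formulation concrete, here is a minimal,
self-contained Python sketch of IoU-based track ID assignment. It is my own
illustration, not the method of any referenced paper; the function names, the
greedy matching, and the threshold value are assumptions made for clarity.

    def iou(a, b):
        # Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def assign_track_ids(tracks, detections, next_id, iou_threshold=0.5):
        # Greedy frame-to-frame matching: pair each existing track with the
        # detection it overlaps most, then open new tracks for the rest.
        # tracks: {track_id: box}; detections: list of boxes in the new frame.
        pairs = sorted(((iou(box, det), tid, d)
                        for tid, box in tracks.items()
                        for d, det in enumerate(detections)), reverse=True)
        updated, matched = {}, set()
        for score, tid, d in pairs:
            if score >= iou_threshold and tid not in updated and d not in matched:
                updated[tid] = detections[d]   # same object keeps its tracking ID
                matched.add(d)
        for d, det in enumerate(detections):
            if d not in matched:               # unmatched detection: new object
                updated[next_id] = det
                next_id += 1
        return updated, next_id

In practice, the assignment is usually solved with the Hungarian algorithm
together with appearance features and motion models, but the sketch shows what
assigning a tracking ID across frames means.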
    You will provide a literature review on one of the topics in Section 2
as a group. Groups will be formed with a minimum of 2 and a maximum of 3
people. Therefore, only the topics that receive the most attention will be
assigned to groups. In the introduction session, I will ask you to vote for
the three topics that you or your group would most like to review. You can
send your choices via email on the same day. If you can't form a group, the
people assigned to the same topic will form a group and will work on the
same report. The report is a 6-8 page document of your findings in IEEE
format.
    Under each topic, I provide a few references that I found useful as
starting points for the review. You don't have to include all the provided
references in your research report. Before deciding on a topic, please read
my explanation of the topic and the abstract, introduction, and conclusion
sections of the provided references to get a broad understanding of it. It is
better to start reading a paper with the abstract and introduction than to go
directly into its details. After the topic assignment, you can continue your
research in different ways. Here are some suggestions:
    • Checking the papers cited in the references sections of the provided
      papers (to find similar studies)
    • Checking the studies that cite the provided papers (to find more recent
      studies, for example using Google Scholar)
    • Checking the leaderboards of the public datasets
    • Checking the conferences related to the topic, e.g. ICCV, CVPR, IV, ITSC

2     Topics
    • 3D Object detection using different data modalities: (max 3
      students) For an autonomous driving function in the real world, one of
      the key capabilities is 3D object detection, which provides precise
      locations and sizes of objects in the environment in order to plan the
      future positions of the ego-vehicle. Because each sensor type has its
      own weaknesses, it is argued that fusing data from several sensors, so
      that they compensate for each other, leads to better detections. Sensor
      fusion is particularly useful when some of the sensors fail to provide
      proper information. In this seminar topic, you are asked to review the
      challenges brought about by the weaknesses of the sensors and the sensor
      fusion methods for 3D object detection using different modalities.
      Additionally, you will provide your own ideas as solutions to these
      challenges, with evidence from the literature. Some of the sensor types
      you can include are listed below, but you are not limited to these.
         – LiDAR
         – Radar
         – RGB or gray-scale camera images
         – Multi-camera images
         – GPS / maps
         – Review on sensor fusion types
      References: [5, 11, 21, 12]
    • Comparison of LiDAR-only and fusion-based 3D object detection
      methods: (max 2 students) 3D object detection mostly relies on
      LiDAR sensors, which yield highly precise depth information about
      objects in 3D space. Despite its advantages, LiDAR information is
      usually sparse compared to camera images, and LiDAR does not cope with
      adverse weather conditions as well as radar sensors do. Even with these
      disadvantages, the state-of-the-art methods evaluated on public datasets
      currently utilize only LiDAR points. In this seminar topic, you are
      asked to compare methods that use only LiDAR information with sensor
      fusion methods for 3D object detection from different perspectives by
      reviewing the literature. In the end, you will also provide your own
      ideas about which approach to utilize for better detections, with
      evidence from the literature. Some possible points of view include:
          – Advantages of one approach over the other
          – Safety concerns
          – Accuracy of the methods
          – Possible saturation of results with future improvements in the
            architectures
          – Extensibility of the architectures with respect to the sensors
            used
          – Upcoming improvements in the sensors
      References: [12, 11]
    • Challenges in video object detection and tracking for 2D and
      3D bounding boxes: (max 2 students) Detecting objects can be
      challenging due to the low quality of sensor outputs at the time of
      sampling (reflections etc.) and due to objects occluded behind others
      (e.g. parked cars that are not fully visible behind the first few cars).
      These problems can be alleviated by using time-series data, such as
      videos or sequences of LiDAR point clouds. In this topic, you are asked
      to review recent studies on object tracking for 2D and 3D bounding
      boxes and to present the current challenges of these problems together
      with the solutions suggested in the literature. Additionally, you are
      asked to provide your own solutions for these challenges, considering
      the experiments in the literature.
      References: [1, 20, 15]
    • Convolutional recurrent neural networks (RNNs) for object
      detection and tracking: (max 2 students) RNNs are the most
      commonly used architecture type for keeping track of important temporal
      features to improve accuracy on tasks that use time-series data. In the
      object detection and tracking domains, using RNNs can get tricky because
      a scene contains multiple objects. In this case, a matching problem
      arises, which aims at aligning the features of the same objects over
      time. Convolutional recurrent layers can be considered a solution to
      this problem, since they process feature maps of the entire scene
      instead of processing the features of each object separately in time (a
      minimal sketch of such a layer is given at the end of this section). In
      this topic, you are asked to review the usage of convolutional recurrent
      layers in the domain of object detection and tracking. In the end, you
      will also provide your own ideas on the usage of these layers and their
      usefulness depending on the data type and the feature map type. Some
      aspects might be, but are not limited to, the ones below.
          – Advantages over linear (fully-connected) RNNs
          – Comparison of cell types (e.g. LSTMs vs. GRUs)
          – Accuracies on different data types (camera images, radar data,
            LiDAR data)
          – Comparison of the efficacy for object detection and for tracking
      References: [25, 14, 19, 15]
    • Semi-supervised, self-supervised, and unsupervised learning
      methods for object detection: (max 3 students) Supervised learning
      is the most commonly used learning method for training object detection
      networks. However, the more complex the data types are, the more
      difficult and expensive the labeling efforts become. Therefore, there
      has been an increasing interest in training networks with small labeled
      datasets, aiming to reach accuracies similar to those of supervised
      learning. In this topic, you are asked to review such methods for object
      detection, especially in 3D space. Some aspects to consider while
      reviewing might be the following:
          – Applicability to different data types
          – Comparison of the results of these methods with the results of
            methods based on supervised learning
          – Differences in the loss functions used for these training schemes
      References: [22, 24, 17, 9, 18, 10]
    • Transfer learning for 3D object detection: (max 3 students) Deep
      neural networks are known to be data hungry, which also applies to
      3D object detection, where most studies rely on large networks. Even
      though collecting data from real-world environments is relatively easy,
      synchronizing and calibrating multiple sensors, labeling the collected
      data for the required tasks, and ensuring diversity in object classes,
      weather conditions, illumination, and road types are difficult,
      time-consuming, and expensive. Therefore, there have been attempts to
      transfer the information learned from available datasets to new
      datasets in order to decrease the effort of preparing new datasets.
      Additionally, virtual datasets and simulation environments have been
      proposed as a solution to this problem. In this topic, you are asked to
      review the performance of studies that use the available real-world and
      virtual datasets for transfer learning in 3D object detection, as well
      as the available simulators. You will also describe what kinds of
      sensors and diversity these datasets and simulators have to offer, in
      addition to your ideas for improving such knowledge transfer, with
      evidence from the literature.
      References: [17, 23, 3]
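
    As a concrete illustration for the convolutional RNN topic above, below
is a minimal sketch of a ConvLSTM cell. It is my own illustration assuming a
PyTorch implementation; the class and parameter names are made up for clarity,
and it is not the architecture of any of the referenced papers. The key point
is that the LSTM gates are computed with convolutions over the whole feature
map, so the hidden state keeps the spatial layout of the scene instead of
requiring per-object feature matching.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        # One convolutional LSTM step: the gates are convolutions over the
        # concatenated input and hidden feature maps, so spatial structure
        # is preserved across time.
        def __init__(self, in_channels, hidden_channels, kernel_size=3):
            super().__init__()
            # A single convolution produces all four gates at once.
            self.gates = nn.Conv2d(in_channels + hidden_channels,
                                   4 * hidden_channels, kernel_size,
                                   padding=kernel_size // 2)

        def forward(self, x, state):
            h, c = state                      # hidden and cell feature maps
            i, f, g, o = torch.chunk(
                self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            c = f * c + i * torch.tanh(g)     # per-pixel cell-state update
            h = o * torch.tanh(c)             # new hidden feature map
            return h, c

A detector would feed the backbone feature map of each frame through such a
cell and run its detection head on the returned hidden map, which aggregates
the temporal context of the entire scene.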

3     Other Useful Resources
    • Datasets: KITTI [8], nuScenes [4]
    • Simulation environments: CARLA [6], GTA V [16]
    • Metrics: Detection [13], Tracking [2]. Keywords: Average Precision,
      Intersection over Union, MOTA, MOTP (the core formulas are summarized
      after this list)
    • Survey: [7]
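
    For quick reference, the standard definitions behind these keywords are
summarized below (my own summary of the usual formulas, following [2] for the
tracking metrics):

    IoU(A, B) = area(A ∩ B) / area(A ∪ B)

    MOTA = 1 - (Σ_t (FN_t + FP_t + IDSW_t)) / (Σ_t GT_t)

    MOTP = (Σ_{i,t} d_{i,t}) / (Σ_t c_t)

Here FN_t, FP_t, and IDSW_t are the misses, false positives, and identity
switches in frame t, GT_t is the number of ground-truth objects in frame t,
d_{i,t} is the distance (or overlap error) of matched pair i in frame t, and
c_t is the number of matches in frame t. Average Precision is the area under
a detector's precision-recall curve, where a detection counts as correct if
its IoU with a ground-truth box exceeds a threshold (e.g. 0.7 for cars in
KITTI [8]).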

References
 [1] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking
     without bells and whistles. Proceedings of the IEEE International
     Conference on Computer Vision, pages 941–951, 2019.
 [2] Keni Bernardin, Alexander Elbs, and Rainer Stiefelhagen. Multiple object
     tracking performance metrics and evaluation in a smart room environment.
     Sixth IEEE International Workshop on Visual Surveillance, in conjunction
     with ECCV, 2006.
 [3] Åsmund Brekke, Fredrik Vatsendvik, and Frank Lindseth. Multimodal 3D
     object detection from simulated pretraining. Communications in Computer
     and Information Science, vol. 1056, pages 102–113, 2019.

 [4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin
     Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar
     Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv
     preprint, March 2019.
 [5] Andreas Danzer, Thomas Griebel, Martin Bach, and Klaus Dietmayer.
     2D Car Detection in Radar Data with PointNets. 2019 IEEE Intelligent
     Transportation Systems Conference, ITSC 2019, pages 61–66, 2019.
 [6] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and
     Vladlen Koltun. CARLA: An Open Urban Driving Simulator. Conference on
     Robot Learning (CoRL), pages 1–16, 2017.

 [7] Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein,
     Claudius Gläser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer.
     Deep Multi-Modal Object Detection and Semantic Segmentation for
     Autonomous Driving: Datasets, Methods, and Challenges. IEEE Transactions
     on Intelligent Transportation Systems, pages 1–20, 2020.

 [8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for au-
     tonomous driving? the KITTI vision benchmark suite. Proceedings of
     the IEEE Computer Society Conference on Computer Vision and Pattern
     Recognition, pages 3354–3361, 2012.
 [9] Tengda Han, Weidi Xie, and Andrew Zisserman. Video Representation
     Learning by Dense Predictive Coding. Proceedings of the IEEE
     International Conference on Computer Vision Workshops, pages 1483–1492,
     2019.
[10] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick.
     Momentum Contrast for Unsupervised Visual Representation Learning.
     arXiv preprint, 2019.
[11] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L.
     Waslander. Joint 3D Proposal Generation and Object Detection from View
     Aggregation. IEEE International Conference on Intelligent Robots and
     Systems, pages 5750–5757, 2018.

[12] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang,
     and Oscar Beijbom. PointPillars: Fast encoders for object detection from
     point clouds. Proceedings of the IEEE Computer Society Conference on
     Computer Vision and Pattern Recognition, pages 12689–12697, 2019.
[13] Lei Liu, Zongxu Pan, and Bin Lei. Learning a Rotation Invariant Detector
     with Rotatable Bounding Box. arXiv preprint, 2017.
[14] Yongyi Lu, Cewu Lu, and Chi Keung Tang. Online Video Object Detection
     Using Association LSTM. Proceedings of the IEEE International Conference
     on Computer Vision, pages 2363–2371, 2017.

[15] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and Furious: Real Time
     End-to-End 3D Detection, Tracking and Motion Forecasting with a Single
     Convolutional Net. Proceedings of the IEEE Computer Society Conference
     on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
[16] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun.
     Playing for data: Ground truth from computer games. Lecture Notes in
     Computer Science, vol. 9906, pages 102–118, 2016.
[17] Yew Siang Tang and Gim Hee Lee. Transferable semi-supervised 3D object
     detection from RGB-D data. Proceedings of the IEEE International
     Conference on Computer Vision, pages 1931–1940, 2019.

[18] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Multiview
     Coding. arXiv preprint, June 2019.
[19] Subarna Tripathi, Zachary C. Lipton, Serge Belongie, and Truong Nguyen.
     Context matters: Refining object detection in video with recurrent
     neural networks. British Machine Vision Conference (BMVC), pages 1–12,
     2016.
[20] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H.S.
     Torr. Fast online object tracking and segmentation: A unifying approach.
     Proceedings of the IEEE Computer Society Conference on Computer Vision
     and Pattern Recognition, pages 1328–1338, 2019.
[21] Bin Yang, Ming Liang, and Raquel Urtasun. HDNET: Exploiting HD Maps
     for 3D Object Detection. Conference on Robot Learning (CoRL), pages
     1–10, 2018.
[22] Zhaohui Yang, Miaojing Shi, Yannis Avrithis, Chao Xu, and Vittorio
     Ferrari. Training Object Detectors from Few Weakly-Labeled and Many
     Unlabeled Images. arXiv preprint, 2019.
[23] Fuxun Yu, Di Wang, Yinpeng Chen, Nikolaos Karianakis, Pei Yu, Dimitrios
     Lymberopoulos, and Xiang Chen. Unsupervised Domain Adaptation for
     Object Detection via Cross-Domain Semi-Supervised Learning. arXiv
     preprint, 2019.

[24] Na Zhao, Tat-Seng Chua, and Gim Hee Lee. SESS: Self-Ensembling Semi-
     Supervised 3D Object Detection. arXiv preprint, 2019.
[25] Menglong Zhu and Mason Liu. Mobile Video Object Detection with
     Temporally-Aware Feature Maps. Proceedings of the IEEE Computer Soci-
     ety Conference on Computer Vision and Pattern Recognition, pages 5686–
     5695, 2018.
